Fixing Missing `stop_times.txt` Records in Python

Fixing missing stop_times.txt records in Python requires a deterministic validation-and-reconciliation pipeline that cross-references trips.txt and stops.txt, flags referential gaps, and applies spec-compliant patching or reporting rules. The most reliable approach uses vectorized joins to detect orphaned trip_id/stop_sequence combinations, enforces strict GTFS column typing, and outputs either a patched feed or a structured anomaly report. Always validate against the GTFS Static Reference before deploying automated fixes, as silent interpolation can break downstream routing engines.

Why Records Go Missing

Missing records rarely occur randomly. Transit data pipelines typically drop rows due to three structural failures:

Schedule truncation: Agencies export peak-only trips but leave trips.txt entries for off-peak or seasonal services.
ETL misalignment: Column shifts, delimiter mismatches, or encoding errors truncate CSV exports during ingestion.
Referential drift: stop_id or trip_id values in stop_times.txt reference deprecated or renamed entities in parent tables.

Before patching, establish a baseline integrity check. Understanding the Mastering stops.txt and stop_times.txt Relationships ensures you don’t accidentally create phantom stops or violate sequence continuity. A robust pipeline must verify required columns exist, detect duplicate (trip_id, stop_sequence) pairs, identify trips present in trips.txt but absent from stop_times.txt, and flag stop_id references that lack parent records.

Validation Pipeline Architecture

A production-grade reconciliation workflow follows a strict order of operations:

Schema enforcement: Cast IDs to strings and sequences to nullable integers to prevent silent type coercion.
Referential anti-join: Isolate trip_id values that exist in trips.txt but have zero matching rows in stop_times.txt.
Sequence gap detection: Group by trip_id, sort by stop_sequence, and identify jumps greater than 1.
Foreign key validation: Cross-check every stop_id against stops.txt to catch orphaned references.
Diagnostic aggregation: Compile counts, affected IDs, and severity levels into a machine-readable report.

Applying consistent Python Parsing & Data Normalization practices at ingestion prevents downstream join failures and reduces memory overhead when processing multi-gigabyte feeds.

Production-Ready Python Implementation

The following script uses pandas for fast, memory-aware joins and explicit dtype enforcement. It outputs a diagnostic CSV and optionally generates placeholder records for missing sequences.

import pandas as pd
import numpy as np
from pathlib import Path
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

def audit_and_patch_stop_times(feed_dir: str, patch: bool = False) -> pd.DataFrame:
    feed_path = Path(feed_dir)
    
    # Load core GTFS tables with explicit dtypes
    trips = pd.read_csv(feed_path / "trips.txt", dtype={"trip_id": str, "route_id": str})
    stops = pd.read_csv(feed_path / "stops.txt", dtype={"stop_id": str})
    stop_times = pd.read_csv(
        feed_path / "stop_times.txt",
        dtype={"trip_id": str, "stop_id": str, "stop_sequence": "Int64"}
    )
    
    # Validate required columns
    required_st = {"trip_id", "stop_id", "arrival_time", "departure_time", "stop_sequence"}
    missing_cols = required_st - set(stop_times.columns)
    if missing_cols:
        raise ValueError(f"stop_times.txt missing required columns: {missing_cols}")
        
    # 1. Detect orphaned trips (in trips.txt but missing from stop_times.txt)
    trip_coverage = stop_times[["trip_id"]].drop_duplicates()
    orphaned = trips.merge(trip_coverage, on="trip_id", how="left", indicator=True)
    missing_trip_ids = orphaned.loc[orphaned["_merge"] == "left_only", "trip_id"].tolist()
    
    # 2. Detect invalid stop_id references
    valid_stops = set(stops["stop_id"])
    invalid_refs = stop_times[~stop_times["stop_id"].isin(valid_stops)]
    
    # 3. Detect sequence gaps per trip
    stop_times_sorted = stop_times.sort_values(["trip_id", "stop_sequence"])
    gaps = stop_times_sorted.groupby("trip_id")["stop_sequence"].diff().gt(1)
    gap_rows = stop_times_sorted[gaps]
    
    # Compile diagnostics
    diagnostics = pd.DataFrame({
        "issue_type": ["missing_trip_ids", "invalid_stop_refs", "sequence_gaps"],
        "count": [len(missing_trip_ids), len(invalid_refs), len(gap_rows)],
        "sample_ids": [
            missing_trip_ids[:5],
            invalid_refs[["trip_id", "stop_id"]].head(5).to_dict("records"),
            gap_rows[["trip_id", "stop_sequence"]].head(5).to_dict("records")
        ]
    })
    
    if patch:
        logging.info("Generating patched stop_times.txt with placeholder sequences...")
        # Safe patching: only fill missing sequences for trips that already have partial data
        patched_rows = []
        for trip_id, group in stop_times_sorted.groupby("trip_id"):
            seqs = group["stop_sequence"].tolist()
            if not seqs:
                continue
            expected = list(range(min(seqs), max(seqs) + 1))
            missing = [s for s in expected if s not in seqs]
            for m in missing:
                patched_rows.append({
                    "trip_id": trip_id,
                    "stop_id": "PLACEHOLDER",
                    "stop_sequence": m,
                    "arrival_time": "00:00:00",
                    "departure_time": "00:00:00"
                })
                
        if patched_rows:
            patch_df = pd.DataFrame(patched_rows)
            patch_df.to_csv(feed_path / "stop_times_patched.txt", index=False)
            logging.info(f"Appended {len(patch_df)} placeholder rows to stop_times_patched.txt")
            
    diagnostics.to_csv(feed_path / "stop_times_diagnostics.csv", index=False)
    logging.info("Diagnostics exported to stop_times_diagnostics.csv")
    return diagnostics

# Usage:
# audit_and_patch_stop_times("/path/to/gtfs_feed", patch=True)

Patching Strategy & Spec Compliance

Automated patching should never be the default. GTFS routing engines rely on strict temporal and spatial continuity. When generating placeholder records:

Never invent coordinates: Use PLACEHOLDER for stop_id and 00:00:00 for times to force manual review.
Preserve sequence order: Missing rows must fill exact integer gaps. Never shift existing stop_sequence values.
Flag for QA: Append a _patched suffix to the output filename and require agency sign-off before merging.

Refer to the official GTFS Static Reference for mandatory field constraints and time format rules (HH:MM:SS). If your pipeline requires advanced merge logic, consult the pandas merge documentation to optimize join strategies for large transit datasets.

Deployment Checklist

Run diagnostics on a staging copy; never mutate the original feed.
Verify trip_id and stop_id casting matches your agency’s schema.
Confirm stop_sequence uses nullable integers (Int64) to handle sparse data.
Validate patched output with a GTFS validator (e.g., MobilityData Validator) before ingestion.
Log all patch operations with timestamps and row counts for audit trails.

Missing records degrade schedule accuracy and break isochrone generation. A deterministic Python pipeline that prioritizes detection over silent correction keeps routing engines stable and maintains data lineage for transit operators.

Fixing Missing stop_times.txt Records in Python