Automating Feed Updates with gtfs-kit

Transit agencies update their GTFS static feeds on irregular cadences — sometimes weekly, sometimes mid-week with no notice. Manual ingestion introduces latency, silent schema drift, and operational fragility the moment an agency changes a column type or drops a required file. A deterministic, automated pipeline that fetches, validates, normalizes, and archives schedule data on each invocation is the engineering baseline for any production mobility platform. This page walks through a complete Python implementation built around gtfs-kit, covering the full sequence from content-based deduplication through columnar export and retention management.

Prerequisites

Confirm these requirements before starting:

GTFS files needed: agency.txt, routes.txt, trips.txt, stop_times.txt, stops.txt, calendar.txt or calendar_dates.txt
Python 3.9+ inside an isolated virtual environment
gtfs-kit — primary parsing and validation engine
requests — HTTP client with streaming support
pandas and pyarrow — tabular manipulation and columnar export
Standard library: hashlib, logging, datetime, zipfile, pathlib

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install gtfs-kit requests pandas pyarrow

Assumed feed conditions: the agency publishes a publicly accessible ZIP at a stable URL. Feeds behind authentication require adding token headers to the requests.get() call, but the rest of the pipeline is identical.

Concept and Spec Background

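The GTFS Schedule specification defines a ZIP archive containing UTF-8 CSV files with rigid primary-key and foreign-key relationships. In the core dependency graph, routes.txt references agency.txt through agency_id; trips.txt references routes.txt through route_id and calendar.txt or calendar_dates.txt through service_id; and stop_times.txt references both trips.txt (through trip_id) and stops.txt (through stop_id).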

Any automation pipeline must respect these foreign-key constraints during transformation. Dropping rows from stops.txt without cascading to stop_times.txt produces orphaned references that fail downstream routing engines.
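As a minimal sketch of that cascade, assuming the tables are already loaded as pandas DataFrames, the helper below removes a set of stops together with every stop_times row that references them. The name drop_stops_cascading and the sample data are hypothetical and not part of gtfs-kit.

```python
import pandas as pd


def drop_stops_cascading(
    stops: pd.DataFrame,
    stop_times: pd.DataFrame,
    bad_stop_ids: set,
) -> tuple:
    """Remove the given stops and every stop_times row that references them."""
    kept_stops = stops[~stops["stop_id"].isin(bad_stop_ids)].reset_index(drop=True)
    # Keep only stop_times rows whose stop_id still exists in the stops table
    kept_times = stop_times[
        stop_times["stop_id"].isin(kept_stops["stop_id"])
    ].reset_index(drop=True)
    return kept_stops, kept_times


# Illustrative data only
stops = pd.DataFrame({"stop_id": ["A", "B", "C"]})
stop_times = pd.DataFrame(
    {
        "trip_id": ["t1", "t1", "t1"],
        "stop_id": ["A", "B", "C"],
        "stop_sequence": [1, 2, 3],
    }
)
kept_stops, kept_times = drop_stops_cascading(stops, stop_times, {"B"})
```

Removing a stop from the middle of a trip changes that trip's stop pattern, so it is worth logging how many stop_times rows the cascade removed.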

The key spec rules that drive this implementation:

Rule	Implication for automation
Times in `stop_times.txt` may exceed `23:59:59`	Cannot use `datetime.time`; use `pd.to_timedelta()`
`agency_id` is optional when only one agency is present	Joins on `agency_id` must handle nullable columns
`calendar_dates.txt` can override or replace `calendar.txt`	Service-date resolution requires checking both files
Shape coordinates are optional (`shapes.txt` may be absent)	Do not hard-fail if `shapes.txt` is missing
`stop_sequence` values need not be consecutive	Sort by `stop_sequence` rather than assuming dense integers
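
The calendar rule is the one most often implemented incompletely, so here is a rough sketch of service-date resolution that consults both files. It assumes dates are stored as YYYYMMDD strings and that either table may be absent (passed as None). The function name active_service_ids and the sample rows are illustrative, not a gtfs-kit API.

```python
from datetime import date
from typing import Optional

import pandas as pd

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def active_service_ids(
    on: date,
    calendar: Optional[pd.DataFrame],
    calendar_dates: Optional[pd.DataFrame],
) -> set:
    """Return the service_ids running on a date, using both calendar files."""
    ymd = on.strftime("%Y%m%d")
    active: set = set()

    if calendar is not None and not calendar.empty:
        # YYYYMMDD strings compare correctly as plain text
        in_range = (calendar["start_date"].astype(str) <= ymd) & (
            calendar["end_date"].astype(str) >= ymd
        )
        runs_today = calendar[WEEKDAYS[on.weekday()]].astype(int) == 1
        active |= set(calendar.loc[in_range & runs_today, "service_id"])

    if calendar_dates is not None and not calendar_dates.empty:
        today = calendar_dates[calendar_dates["date"].astype(str) == ymd]
        exception = today["exception_type"].astype(int)
        active |= set(today.loc[exception == 1, "service_id"])  # 1 = service added
        active -= set(today.loc[exception == 2, "service_id"])  # 2 = service removed

    return active


# Illustrative data: weekday service, replaced by a holiday service on 2024-01-15
calendar = pd.DataFrame(
    {
        "service_id": ["WEEK"],
        "monday": [1], "tuesday": [1], "wednesday": [1], "thursday": [1],
        "friday": [1], "saturday": [0], "sunday": [0],
        "start_date": ["20240101"], "end_date": ["20241231"],
    }
)
calendar_dates = pd.DataFrame(
    {
        "service_id": ["WEEK", "HOL"],
        "date": ["20240115", "20240115"],
        "exception_type": [2, 1],
    }
)
holiday_services = active_service_ids(date(2024, 1, 15), calendar, calendar_dates)
```

Passing calendar=None covers feeds that encode all service in calendar_dates.txt alone.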

Step-by-Step Implementation

The pipeline follows a strict idempotent sequence: fetch → validate → normalize → export → archive. Each stage is isolated to allow independent retry logic and safe rollback on partial failure.

Step 1: Secure Fetch & Content-Based Versioning

Transit agencies frequently update feeds without changing the base URL. Content-based versioning using SHA-256 hashing ensures only materially changed datasets trigger downstream processing. This avoids redundant validation and I/O overhead on unchanged content.

import requests
import hashlib
from pathlib import Path
import logging

logger = logging.getLogger("gtfs_pipeline")

def fetch_feed(url: str, cache_dir: Path) -> tuple[Path, str]:
    """Download GTFS ZIP and return (local_path, sha256_hex).
    
    Returns the existing cached path without re-writing if the
    content hash matches a previously downloaded file.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)

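    hasher = hashlib.sha256()
    tmp_path: Path = cache_dir / "feed_download.tmp"

    try:
        # stream=True defers the body so it can be hashed and written chunk by
        # chunk instead of being buffered in memory by response.content
        with requests.get(url, timeout=30, stream=True) as response:
            response.raise_for_status()
            with tmp_path.open("wb") as fh:
                for chunk in response.iter_content(chunk_size=1024 * 1024):
                    hasher.update(chunk)
                    fh.write(chunk)
    except requests.RequestException as exc:
        tmp_path.unlink(missing_ok=True)
        logger.error("Failed to fetch feed from %s: %s", url, exc)
        raise

    content_hash: str = hasher.hexdigest()
    feed_path: Path = cache_dir / f"feed_{content_hash[:12]}.zip"

    if not feed_path.exists():
        tmp_path.replace(feed_path)
        logger.info("New feed downloaded: %s", feed_path.name)
    else:
        tmp_path.unlink()
        # The caller decides whether to skip downstream steps by comparing hashes
        logger.info(
            "Feed content already cached (hash prefix: %s).",
            content_hash[:12],
        )

    return feed_path, content_hash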

The short hash prefix in the filename acts as a human-readable fingerprint during incident review. Store the full hash in a side-car JSON manifest alongside each archive to support cryptographic verification during audits.
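
One possible shape for that side-car manifest is sketched below. The helper names write_sidecar and verify_sidecar, and the JSON field names, are assumptions made for illustration. The only fixed idea is that the full SHA-256 digest is stored next to the archive and can be recomputed later.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_sidecar(feed_path: Path, source_url: str) -> Path:
    """Write <archive>.json next to the ZIP, recording its full SHA-256 digest."""
    digest = hashlib.sha256(feed_path.read_bytes()).hexdigest()
    sidecar = feed_path.with_suffix(".json")
    record = {
        "file": feed_path.name,
        "sha256": digest,
        "source_url": source_url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar


def verify_sidecar(feed_path: Path) -> bool:
    """Recompute the archive digest and compare it with the recorded one."""
    record = json.loads(feed_path.with_suffix(".json").read_text())
    return hashlib.sha256(feed_path.read_bytes()).hexdigest() == record["sha256"]
```

During an audit, verify_sidecar(feed_path) returning False indicates that the archive on disk no longer matches what was originally downloaded.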

Step 2: Validation & Schema Alignment

Raw GTFS archives from real agencies contain malformed dates, orphaned foreign keys, coordinates outside valid WGS-84 bounds, and deprecated fields. gtfs-kit’s .validate() method checks referential integrity, coordinate bounds, and required column presence before any transformation occurs. Blocking on critical errors here prevents silently corrupted data from reaching downstream routing graphs.

For an in-depth comparison of validation strategies, GTFS Validation Rules and Common Schema Errors covers the full spec-level rule set and how tools differ in their coverage.

import gtfs_kit
import logging
from pathlib import Path

logger = logging.getLogger("gtfs_pipeline")

def validate_feed(feed_path: Path) -> gtfs_kit.Feed:
    """Load and validate a GTFS feed; raise ValueError on critical schema failures.
    
    Returns a fully parsed gtfs_kit.Feed ready for normalization.
    """
    try:
        feed: gtfs_kit.Feed = gtfs_kit.read_feed(str(feed_path), dist_units="km")
    except Exception as exc:
        logger.critical("Failed to parse GTFS archive at %s: %s", feed_path, exc)
        raise

    # .validate() returns a DataFrame with columns:
    #   type (str: 'error' | 'warning'), message (str), table (str), rows (list[int])
    issues = feed.validate()

    if not issues.empty:
        warnings = issues[issues["type"] == "warning"]
        critical = issues[issues["type"] == "error"]

        if not warnings.empty:
            logger.warning(
                "Validation warnings (%d): %s",
                len(warnings),
                warnings["message"].tolist(),
            )
        if not critical.empty:
            logger.error(
                "Critical validation errors (%d): %s",
                len(critical),
                critical["message"].tolist(),
            )
            raise ValueError(
                f"Feed at {feed_path} contains {len(critical)} blocking schema violation(s)."
            )

    logger.info("Validation passed for %s", feed_path.name)
    return feed

Step 3: Normalization & Schedule Alignment

Normalization standardizes identifiers, enforces explicit dtypes, and resolves representation differences before the data is written to storage. Two issues dominate real feeds: coordinates stored as object dtype that silently accept strings, and times stored as HH:MM:SS strings that cannot participate in arithmetic without conversion.

Mastering stops.txt and stop_times.txt relationships covers the referential constraints in detail. For feeds that publish frequencies.txt instead of explicit stop times, Handling Frequency-Based vs Timetable Schedules explains how to expand headway rows into per-departure stop_times.txt records before this normalization step.

Timezone handling and schedule normalization is a related concern: GTFS times are local-to-agency and must be anchored to an agency_timezone before converting to UTC for cross-agency comparisons.
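
A minimal sketch of that anchoring step follows, assuming the service date is given as an ISO date string and the offset has already been parsed with pd.to_timedelta(). The helper name to_utc is hypothetical. GTFS measures times from "noon minus 12 hours" on the service day, which equals local midnight except on days with a daylight-saving change, so the sketch anchors on local noon.

```python
import pandas as pd


def to_utc(service_date: str, offset: pd.Timedelta, agency_timezone: str) -> pd.Timestamp:
    """Convert a GTFS time offset on a service date to an absolute UTC timestamp."""
    # Local noon is unambiguous even on daylight-saving transition days
    local_noon = pd.Timestamp(f"{service_date} 12:00:00").tz_localize(agency_timezone)
    # Timedelta arithmetic on tz-aware timestamps uses elapsed (absolute) time
    anchored = local_noon - pd.Timedelta(hours=12) + offset
    return anchored.tz_convert("UTC")


# A 25:30:00 arrival on the 2024-01-15 service day in New York
arrival_utc = to_utc("2024-01-15", pd.to_timedelta("25:30:00"), "America/New_York")
```

On a spring-forward day such as 2024-03-10 in New York, an offset of 08:00:00 still lands on 08:00 local wall-clock time, which is the behavior the spec's noon-based definition is designed to give.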

import pandas as pd
import gtfs_kit
import logging

logger = logging.getLogger("gtfs_pipeline")

def normalize_feed(feed: gtfs_kit.Feed) -> dict[str, pd.DataFrame]:
    """Extract, type-cast, and clean core GTFS tables.
    
    Returns a dict mapping table name → normalized DataFrame.
    All numeric columns use nullable pandas dtypes to preserve
    NaN distinction from 0.
    """
    normalized: dict[str, pd.DataFrame] = {}

    # ── stops ──────────────────────────────────────────────────────────────
    stops: pd.DataFrame = feed.stops.copy()
    stops["stop_lat"] = pd.to_numeric(stops["stop_lat"], errors="coerce").astype("Float64")
    stops["stop_lon"] = pd.to_numeric(stops["stop_lon"], errors="coerce").astype("Float64")
    # Drop rows where coordinates failed to parse — these cannot be mapped
    before = len(stops)
    stops.dropna(subset=["stop_lat", "stop_lon"], inplace=True)
    if len(stops) < before:
        logger.warning("Dropped %d stop(s) with unparseable coordinates.", before - len(stops))
    stops["stop_id"] = stops["stop_id"].astype(str).str.strip()
    # The nullable "string" dtype keeps a missing stop_name as <NA> rather than the text "nan"
    stops["stop_name"] = stops["stop_name"].astype("string").str.strip()
    normalized["stops"] = stops[["stop_id", "stop_name", "stop_lat", "stop_lon"]]

    # ── routes ─────────────────────────────────────────────────────────────
    routes: pd.DataFrame = feed.routes.copy()
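    # Either name column may be missing or blank as long as the other is present.
    # The nullable "string" dtype keeps blanks as <NA> instead of the text "nan".
    for col in ("route_short_name", "route_long_name"):
        if col not in routes.columns:
            routes[col] = pd.NA
        routes[col] = routes[col].astype("string").str.strip()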
    # route_type is an integer enum per GTFS spec (0=tram, 1=metro, 2=rail, 3=bus…)
    routes["route_type"] = (
        pd.to_numeric(routes["route_type"], errors="coerce").astype("Int64")
    )
    normalized["routes"] = routes[
        ["route_id", "route_short_name", "route_long_name", "route_type"]
    ]

    # ── stop_times ─────────────────────────────────────────────────────────
    stop_times: pd.DataFrame = feed.stop_times.copy()
    # Times can legitimately exceed 23:59:59 for trips crossing midnight;
    # pd.to_timedelta handles "25:30:00" correctly as 1 day + 1.5 hours.
    stop_times["arrival_time"] = pd.to_timedelta(
        stop_times["arrival_time"], errors="coerce"
    )
    stop_times["departure_time"] = pd.to_timedelta(
        stop_times["departure_time"], errors="coerce"
    )
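    stop_times["stop_sequence"] = (
        pd.to_numeric(stop_times["stop_sequence"], errors="coerce").astype("Int64")
    )
    # Strip keys the same way as in stops and trips, then cascade the stop removals
    # above so that no stop_times row references a dropped stop
    for col in ["trip_id", "stop_id"]:
        stop_times[col] = stop_times[col].astype(str).str.strip()
    before = len(stop_times)
    stop_times = stop_times[stop_times["stop_id"].isin(stops["stop_id"])]
    if len(stop_times) < before:
        logger.warning("Dropped %d stop_times row(s) for removed stops.", before - len(stop_times))
    normalized["stop_times"] = stop_times[
        ["trip_id", "stop_id", "arrival_time", "departure_time", "stop_sequence"]
    ]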

    # ── trips ──────────────────────────────────────────────────────────────
    trips: pd.DataFrame = feed.trips.copy()
    for col in ["route_id", "service_id", "trip_id"]:
        if col in trips.columns:
            trips[col] = trips[col].astype(str).str.strip()
    normalized["trips"] = trips

    total_rows = sum(len(df) for df in normalized.values())
    logger.info(
        "Normalized %d tables, %d total rows.",
        len(normalized),
        total_rows,
    )
    return normalized

Step 4: Parquet Export & Archival

Columnar Parquet storage reduces I/O overhead for downstream routing and analytics engines. Snappy compression typically cuts file size substantially relative to raw CSV while remaining fast to decompress; roughly 60% is a reasonable rule of thumb, but the ratio depends on the feed, so measure it on your own data. The archival routine enforces a rolling retention window to prevent unbounded disk growth.

For feeds exceeding 500 MB uncompressed, see the Memory-Efficient Processing for Large Feeds guide, which covers chunked reading and out-of-core Parquet writing to avoid RAM exhaustion during this export step.

import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
from datetime import datetime, timezone
import logging

logger = logging.getLogger("gtfs_pipeline")

def export_and_archive(
    normalized: dict[str, "pd.DataFrame"],
    archive_dir: Path,
    retention_days: int = 30,
) -> dict[str, Path]:
    """Write normalized tables to Snappy-compressed Parquet files.
    
    Prunes files older than retention_days from archive_dir.
    Returns a mapping of table name → output path for the current run.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    timestamp: str = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    outputs: dict[str, Path] = {}

    for table_name, df in normalized.items():
        arrow_table: pa.Table = pa.Table.from_pandas(df, preserve_index=False)
        output_path: Path = archive_dir / f"{table_name}_{timestamp}.parquet"
        pq.write_table(arrow_table, str(output_path), compression="snappy")
        logger.info("Exported %s → %s (%d rows)", table_name, output_path.name, len(df))
        outputs[table_name] = output_path

    # Rolling retention: remove files whose mtime is beyond the cutoff
    cutoff_ts: float = datetime.now(timezone.utc).timestamp() - (retention_days * 86_400)
    for stale_file in archive_dir.glob("*.parquet"):
        if stale_file.stat().st_mtime < cutoff_ts:
            stale_file.unlink()
            logger.debug("Retention cleanup: removed %s", stale_file.name)

    return outputs

Step 5: Assembling the Pipeline

Wire the four stages into a single callable that can be invoked by any scheduler:

from pathlib import Path
import logging
import json

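logging.basicConfig(
    level=logging.INFO,
    # Quote the message so each record is a JSON string field; messages that contain
    # double quotes still need a real JSON formatter to be escaped correctly.
    format='{"time": "%(asctime)s", "level": "%(levelname)s", "msg": "%(message)s"}',
)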
logger = logging.getLogger("gtfs_pipeline")

FEED_URL = "https://example-transit.agency/gtfs/static.zip"
CACHE_DIR = Path("/var/cache/gtfs/raw")
ARCHIVE_DIR = Path("/var/cache/gtfs/parquet")
MANIFEST_PATH = Path("/var/cache/gtfs/manifest.json")

def run_pipeline(url: str = FEED_URL) -> None:
    """Full idempotent pipeline: fetch → validate → normalize → export → archive."""
    # Load previous hash to detect unchanged feeds early
    manifest: dict = {}
    if MANIFEST_PATH.exists():
        manifest = json.loads(MANIFEST_PATH.read_text())

    feed_path, content_hash = fetch_feed(url, CACHE_DIR)

    if manifest.get("last_hash") == content_hash:
        logger.info("Feed unchanged since last run. Pipeline exiting early.")
        return

    feed = validate_feed(feed_path)
    normalized = normalize_feed(feed)
    outputs = export_and_archive(normalized, ARCHIVE_DIR)

    manifest["last_hash"] = content_hash
    manifest["last_run_tables"] = {k: str(v) for k, v in outputs.items()}
    MANIFEST_PATH.write_text(json.dumps(manifest, indent=2))
    logger.info("Pipeline complete. Manifest updated at %s", MANIFEST_PATH)

if __name__ == "__main__":
    run_pipeline()

Validation and Verification

After each pipeline run, assert structural invariants before the manifest is written:

import pandas as pd
import pyarrow.parquet as pq
from pathlib import Path

def verify_outputs(outputs: dict[str, Path]) -> None:
    """Assert minimum row counts and referential integrity across exported tables."""
    frames: dict[str, pd.DataFrame] = {
        name: pq.read_table(str(path)).to_pandas()
        for name, path in outputs.items()
    }

    # Stops and routes must be non-empty
    assert len(frames["stops"]) > 0, "stops table is empty after normalization"
    assert len(frames["routes"]) > 0, "routes table is empty after normalization"

    # All stop_times must reference a known stop_id
    known_stop_ids: set = set(frames["stops"]["stop_id"])
    orphaned_stop_refs = ~frames["stop_times"]["stop_id"].isin(known_stop_ids)
    orphan_count = orphaned_stop_refs.sum()
    assert orphan_count == 0, (
        f"{orphan_count} stop_times rows reference unknown stop_ids after normalization"
    )

    # Coordinates must be within WGS-84 valid ranges
    assert frames["stops"]["stop_lat"].between(-90, 90).all(), "stop_lat out of WGS-84 range"
    assert frames["stops"]["stop_lon"].between(-180, 180).all(), "stop_lon out of WGS-84 range"

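    # arrival_time may be blank at intermediate non-timepoint stops, but the spec
    # requires it at the first and last stop of every trip
    ordered = frames["stop_times"].sort_values(["trip_id", "stop_sequence"])
    endpoints = pd.concat(
        [ordered.groupby("trip_id").head(1), ordered.groupby("trip_id").tail(1)]
    )
    null_arrivals = endpoints["arrival_time"].isna().sum()
    assert null_arrivals == 0, f"{null_arrivals} trip endpoint rows have null arrival_time"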

    print(
        f"Verification passed: {len(frames['stops'])} stops, "
        f"{len(frames['routes'])} routes, "
        f"{len(frames['stop_times'])} stop_times."
    )

Run verify_outputs(outputs) immediately after export_and_archive() returns. Tie assertion failures to alerting so the routing engine does not ingest a corrupt snapshot.

Failure Modes and Edge Cases

Real agency feeds introduce several patterns that break naive pipelines:

Times exceeding 23:59:59: GTFS permits 25:30:00 for trips running past midnight. Casting these to datetime.time raises ValueError. Always use pd.to_timedelta() and anchor to a calendar date only when building absolute timestamps.
Duplicate stop_sequence values within a trip_id: Some agency exports contain duplicate sequence numbers when a bus loops. Sort by stop_sequence and deduplicate with a groupby().cumcount() tiebreaker before joining with shapes.
Missing agency_id with multi-agency feeds: The spec makes agency_id optional when only one agency appears in agency.txt. When a feed unexpectedly contains two agencies without populating agency_id, joins on that column silently expand (Cartesian product). Validate agency.txt row count before joining.
Empty calendar.txt with only calendar_dates.txt: Some agencies encode all service solely as explicit date exceptions. If your normalization reads only calendar.txt for service dates, no trips will pass the service filter. Check both files.
shape_id referenced in trips.txt but absent from shapes.txt: The spec does not require shapes. Enforce a soft-fail — log the missing shapes but continue normalization rather than raising.
Column name collisions after strip: Some feeds ship column headers with trailing spaces ("stop_id "). gtfs-kit normalizes these on load, but if you ever bypass gtfs_kit.read_feed() and use raw pandas CSV loading, apply .str.strip() to df.columns immediately after reading.
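Feed ZIP encoding: Agencies occasionally publish ZIPs with non-UTF-8 encoded filenames or BOM-prefixed CSVs. gtfs-kit handles common cases. gtfs_kit.read_feed() takes only the path and dist_units, so there is no encoding argument to pass through; if parsing fails with a UnicodeDecodeError, re-encode the offending file first (for example, read it with pd.read_csv(..., encoding="utf-8-sig") or the agency's actual encoding and write it back as UTF-8), then load the repaired archive.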
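
For the duplicate stop_sequence case in the list above, one way to build the cumcount() tiebreaker is sketched here. It assumes that the original row order within a duplicated sequence number reflects the intended visit order, which should be checked against the agency's data. The helper name add_sequence_tiebreaker and the column name seq_tiebreak are illustrative.

```python
import pandas as pd


def add_sequence_tiebreaker(stop_times: pd.DataFrame) -> pd.DataFrame:
    """Add a per-(trip_id, stop_sequence) counter so every row has a unique order key."""
    out = stop_times.copy()
    # cumcount() numbers rows 0, 1, 2... within each group, in original row order
    out["seq_tiebreak"] = out.groupby(["trip_id", "stop_sequence"]).cumcount()
    return out.sort_values(
        ["trip_id", "stop_sequence", "seq_tiebreak"]
    ).reset_index(drop=True)


# Illustrative data: sequence number 2 appears twice within trip t1
stop_times = pd.DataFrame(
    {
        "trip_id": ["t1", "t1", "t1", "t1"],
        "stop_id": ["A", "B", "C", "D"],
        "stop_sequence": [1, 2, 2, 3],
    }
)
ordered = add_sequence_tiebreaker(stop_times)
```

After this step, (trip_id, stop_sequence, seq_tiebreak) is unique per row, so a later join against shapes does not multiply rows.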

Performance and Scale Notes

For feeds under approximately 200 MB compressed, the in-memory approach above is appropriate. Metropolitan feeds — particularly those from large multi-modal networks — require additional strategies:

Profile before loading. Use Python’s zipfile.ZipFile to read member sizes without extracting. If stop_times.txt exceeds 500 MB uncompressed, switch to chunked reading as covered in Memory-Efficient Processing for Large Feeds.
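Parquet row groups. When writing large stop_times tables, pass row_group_size=500_000 to pq.write_table(). Downstream query engines (DuckDB, Spark, Athena) can then skip row groups whose min/max statistics exclude a trip_id or stop_id predicate, which reduces scan cost substantially. The skipping is only effective if the table is sorted by the filtered column before writing, so that each row group covers a narrow key range.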
Multi-agency batching. Running this pipeline for dozens of regional providers in a single process creates I/O contention. For multi-agency automation patterns, see Batch Processing Strategies for Multi-Agency Feeds, which covers parallel worker pools with bounded concurrency.
Schema drift detection. Between runs, compare column sets and row-count ratios for each table. A sudden drop in stop_times volume (more than 20%) frequently indicates an agency publishing a partial feed or a ZIP corruption, not an actual service reduction. Emit a warning metric and hold the previous snapshot until the anomaly is resolved.
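
The schema drift check described above can be sketched with plain dictionaries, assuming each run records the column list and row count per table (for example, in the manifest). The function name detect_drift and the summary structure are hypothetical; the 20% default mirrors the threshold mentioned in the text.

```python
def detect_drift(previous: dict, current: dict, max_drop: float = 0.20) -> list:
    """Compare per-table summaries of the form {"columns": [...], "rows": int}.

    Returns a list of human-readable findings; an empty list means no drift.
    """
    findings = []
    for table, prev in previous.items():
        cur = current.get(table)
        if cur is None:
            findings.append(f"{table}: table missing from new feed")
            continue

        removed = sorted(set(prev["columns"]) - set(cur["columns"]))
        added = sorted(set(cur["columns"]) - set(prev["columns"]))
        if removed:
            findings.append(f"{table}: columns removed {removed}")
        if added:
            findings.append(f"{table}: columns added {added}")

        # Flag a row-count drop larger than the allowed fraction
        if prev["rows"] > 0 and cur["rows"] < prev["rows"] * (1 - max_drop):
            drop = 1 - cur["rows"] / prev["rows"]
            findings.append(f"{table}: row count dropped {drop:.0%}")
    return findings


# Illustrative summaries from two consecutive runs
previous = {"stop_times": {"columns": ["trip_id", "stop_id", "arrival_time"], "rows": 1000}}
current = {"stop_times": {"columns": ["trip_id", "stop_id"], "rows": 700}}
findings = detect_drift(previous, current)
```

A non-empty result is a reasonable trigger for emitting the warning metric and holding the previous snapshot.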

Production Orchestration

A standalone script is the starting point, not the destination. For production deployments:

Scheduler: Wrap run_pipeline() in Apache Airflow, Prefect, or a hardened cron job. Use the manifest hash to make the pipeline idempotent — re-running the same URL multiple times must produce no side effects if the feed content has not changed.
Structured logging: The JSON format string in the pipeline assembly step above emits machine-readable records compatible with Datadog, Loki, and CloudWatch. Always include the feed URL and content hash in log context.
Circuit breaking: If validate_feed() raises on three consecutive runs for the same agency, suppress downstream routing engine updates and trigger a human alert. Propagating a corrupted feed to a routing engine is worse than serving stale data.
Secret management: Store feed URLs that include API keys in environment variables or a secrets manager; do not hardcode them in the script or commit them to version control.
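
The circuit-breaking rule above could be tracked with a small state file, as in the rough sketch below. It assumes one JSON file holding a consecutive-failure counter per agency. The names record_result and FAILURE_THRESHOLD are hypothetical, and a production scheduler may offer its own mechanism for this.

```python
import json
from pathlib import Path

FAILURE_THRESHOLD = 3  # consecutive validation failures before the circuit opens


def record_result(state_path: Path, agency: str, ok: bool) -> bool:
    """Update the agency's failure counter; return True when the circuit is open."""
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    # A success resets the counter; a failure increments it
    state[agency] = 0 if ok else state.get(agency, 0) + 1
    state_path.write_text(json.dumps(state, indent=2))
    return state[agency] >= FAILURE_THRESHOLD
```

A caller would wrap validate_feed() in a try/except ValueError block, call record_result(..., ok=False) in the handler, and suppress routing engine updates and raise an alert when it returns True.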

Frequently Asked Questions

Does gtfs-kit handle feeds with times past 24:00:00?

Yes. gtfs-kit preserves the GTFS convention of times exceeding 24:00:00 for trips that run past midnight. When you convert these to pandas timedelta using pd.to_timedelta(), values like 25:30:00 parse correctly as 1 day, 1 hour, 30 minutes — which is the right behavior for overnight trip arithmetic. Do not cast to datetime64 without first normalizing to a calendar date anchor.
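
A short illustration of that parsing behavior, using made-up values:

```python
import pandas as pd

# Illustrative values: a normal time, a post-midnight time, and a malformed entry
raw = pd.Series(["08:15:00", "25:30:00", "not-a-time"])
parsed = pd.to_timedelta(raw, errors="coerce")

# 25:30:00 becomes 1 day, 1 hour, 30 minutes; the malformed entry becomes NaT
overnight_seconds = parsed[1].total_seconds()
```

With errors="coerce", unparseable values become NaT instead of raising, so count the nulls after conversion to detect malformed rows.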

How do I skip a feed that has not changed since the last run?

Compute a SHA-256 hash of the downloaded ZIP bytes and compare it against the hash stored from the previous run. If the hashes match, exit early without calling gtfs_kit.read_feed(). This avoids redundant validation, normalization, and I/O overhead even when the agency republishes the same content at the same URL.

What is the difference between gtfs-kit and partridge for feed automation?

gtfs-kit bundles schema validation, distance calculations, and a built-in .validate() method that returns its findings as a DataFrame, all alongside its parsing layer. That makes it well-suited for single-feed pipelines that need validation out of the box. partridge is lighter-weight and enforces strict foreign-key filtering at load time, which is preferable when you need to extract a subset of service_ids without loading the entire feed into memory.

Handling Frequency-Based vs Timetable Schedules — expand frequencies.txt headways into per-departure stop times before normalization
Memory-Efficient Processing for Large Feeds — chunked reading and out-of-core Parquet strategies for feeds over 500 MB
Batch Processing Strategies for Multi-Agency Feeds — parallel worker pools for running this pipeline across dozens of agencies
GTFS Validation Rules and Common Schema Errors — full spec-level validation rule reference and tool comparison
Parsing GTFS with Pandas and Partridge — alternative parsing approach with strict foreign-key enforcement at load time
