Parsing GTFS with Pandas and Partridge

Transit data pipelines require deterministic parsing, relational integrity, and memory-aware transformations. The General Transit Feed Specification (GTFS) distributes schedule and geographic data as a collection of CSV files compressed into a single ZIP archive. While straightforward in concept, real-world feeds contain inconsistent time formats, orphaned references, and calendar edge cases that routinely break naive CSV readers. Parsing GTFS with Pandas and Partridge is a practical approach that combines Partridge’s lazy-loading, relationship-aware feed reader with Pandas’ vectorized transformation capabilities. This workflow is intended for transit analysts, urban tech developers, Python GIS engineers, and mobility platform teams who need reliable, repeatable data normalization before downstream routing, analytics, or API publishing. For teams building foundational data infrastructure, this methodology aligns closely with broader Python Parsing & Data Normalization best practices, so that raw agency exports become structured, query-ready datasets.

Prerequisites & Environment Setup

Before implementing the parsing pipeline, ensure your environment meets the following baseline requirements:

  • Python 3.9+: Needed for the built-in generic type hints used below (e.g. list[str]) and for the zoneinfo standard library module.
  • Core Libraries: pandas>=2.0.0, partridge>=1.1.0, pyarrow (strongly recommended for Parquet serialization), requests (for feed retrieval).
  • GTFS Reference Familiarity: Understanding of mandatory vs. optional files, relational keys (stop_id, trip_id, service_id), and calendar logic.
  • Memory Considerations: Metropolitan feeds can exceed 500MB uncompressed. Partridge mitigates this through lazy loading, but downstream Pandas operations still require adequate RAM or explicit chunking strategies.

Install dependencies via pip:

bash
pip install pandas partridge pyarrow requests

Core Workflow Architecture

A robust GTFS parsing pipeline follows a deterministic sequence: acquisition, validation, transformation, and export. Each stage isolates failure modes and ensures that malformed data never propagates into analytical models or routing engines.

1. Feed Acquisition & Local Caching

Feeds are typically hosted at static URLs or retrieved via agency APIs. Download the ZIP to a local cache directory to avoid repeated network calls during development and to guarantee reproducible runs. Partridge reads a feed from a local filesystem path, either the ZIP archive or an unpacked directory, so a local cache also simplifies debugging and enables version-controlled snapshots.

For agencies that publish feeds on predictable schedules, consider implementing automated retrieval and checksum validation. The Automating Feed Updates with GTFS-Kit workflow demonstrates how to wrap feed ingestion in idempotent scripts that track version hashes and trigger downstream processing only when substantive changes occur.

python
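import os
import hashlib
import requests

FEED_URL = "https://example-transit-agency.gov/gtfs.zip"
CACHE_DIR = "./data/feeds"
FEED_PATH = os.path.join(CACHE_DIR, "latest.gtfs.zip")

def fetch_feed(url, path, expected_sha256=None, timeout=30):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    temp_path = path + ".tmp"
    digest = hashlib.sha256()

    # Stream to a temp file first so readers never see a partial download
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        with open(temp_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
                digest.update(chunk)

    # Verify SHA-256 only if the agency publishes one
    if expected_sha256 and digest.hexdigest() != expected_sha256.lower():
        os.remove(temp_path)
        raise ValueError("Checksum mismatch for downloaded feed")

    # Atomic rename when temp file and target share a filesystem
    os.replace(temp_path, path)
    return path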
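
The version-hash tracking mentioned above can be sketched with the standard library alone. The helper names (file_sha256, feed_has_changed) and the sidecar .sha256 file are assumptions made for illustration; they are not part of Partridge or GTFS-Kit. Note that a byte-level digest flags any change to the archive, including ones that do not affect service, so a stricter notion of "substantive change" would need a table-level comparison on top of this.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def feed_has_changed(feed_path, state_path=None):
    """Compare the feed's digest with the one recorded on the previous run.

    Hypothetical helper: the sidecar file location is an assumption.
    Returns True (and records the new digest) when the feed is new or differs.
    """
    state = Path(state_path or f"{feed_path}.sha256")
    current = file_sha256(feed_path)
    previous = state.read_text().strip() if state.exists() else None
    if current == previous:
        return False
    state.write_text(current)
    return True
```

A scheduler could call feed_has_changed(FEED_PATH) after each download and skip the remaining stages when it returns False.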

2. Schema Validation & Selective Loading

GTFS files are highly interdependent. stop_times.txt references trips.txt, which references calendar.txt or calendar_dates.txt. Partridge models these relationships as a dependency graph and uses it to keep the loaded tables consistent: when a view filters one table, rows in related tables that no longer have a match are pruned. It is not a full feed validator, so run a dedicated validation step if you need explicit errors for missing foreign keys or malformed CSV headers. By passing a view dictionary, you can restrict the feed to the services active on a given date, a route subset, or a service pattern, reducing memory footprint and speeding up downstream transformations.

Refer to the Step-by-Step Guide to Parsing GTFS with Partridge for detailed configuration of feed views, date filtering, and relational integrity checks. The official GTFS Schedule Reference outlines the exact column requirements and data types expected for each file.

python
import partridge as ptg
from datetime import date

# Resolve which service_ids run on the target date (this accounts for both
# calendar.txt and calendar_dates.txt). Raises KeyError if nothing runs that day.
target_date = date(2024, 6, 15)
service_ids = ptg.read_service_ids_by_date(FEED_PATH)[target_date]

# Filter trips by those services; Partridge prunes dependent tables
# such as stop_times, stops, and routes to match
view = {"trips.txt": {"service_id": service_ids}}

feed = ptg.load_feed(FEED_PATH, view=view)
print(f"Loaded {len(feed.trips)} trips and {len(feed.stops)} stops.")

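Partridge loads each table lazily: a DataFrame is read and cached the first time it is accessed. Combined with a restrictive view, this lowers the risk of out-of-memory failures on multi-million-row metropolitan feeds, although every table you access is still held fully in memory.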
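
As an independent check on referential integrity, orphaned references between any two tables can be reported with plain pandas. This is a minimal sketch: the function name find_orphans and the toy DataFrames are illustrative assumptions, not part of Partridge's API.

```python
import pandas as pd


def find_orphans(child, parent, key):
    """Return rows of `child` whose `key` value has no match in `parent`."""
    return child[~child[key].isin(parent[key])]


# Toy tables standing in for feed.trips and feed.stop_times
trips = pd.DataFrame({"trip_id": ["t1", "t2"]})
stop_times = pd.DataFrame(
    {"trip_id": ["t1", "t2", "t9"], "stop_sequence": [1, 1, 1]}
)

orphans = find_orphans(stop_times, trips, "trip_id")
print(orphans["trip_id"].tolist())  # ['t9']
```

The same call works for other key pairs, for example stop_times against stops on stop_id, or trips against routes on route_id.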

3. Temporal Normalization & Schedule Resolution

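Raw GTFS data requires structural adjustments before analysis. The most common pain point is time representation. GTFS uses HH:MM:SS format, but times frequently exceed 24:00:00 to represent service that continues past midnight of the service day (e.g., 25:30:00 for 1:30 AM the next day). pd.to_datetime rejects such values, so they are better handled as durations than as clock times. Note that Partridge's default converters already parse arrival_time and departure_time into numeric seconds since midnight, so string parsing is only needed when the files are read as raw CSV.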

Additionally, agencies publish schedules using two distinct paradigms: fixed timetables and frequency-based headways. Understanding how to resolve both is critical for accurate service-level calculations. The Handling Frequency-Based vs Timetable Schedules guide details how to expand frequencies.txt into discrete trip instances and merge them with timetable data.

python
import pandas as pd

def normalize_gtfs_times(df: pd.DataFrame, time_cols: list[str]) -> pd.DataFrame:
    """Convert GTFS times (including >24h) to timedelta for arithmetic.

    Handles raw HH:MM:SS strings as well as numeric seconds since midnight,
    which is what Partridge's default converters produce.
    """
    df = df.copy()  # leave the caller's (possibly cached) frame untouched
    for col in time_cols:
        if col not in df.columns or pd.api.types.is_timedelta64_dtype(df[col]):
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            # Already seconds since midnight; NaN becomes NaT
            df[col] = pd.to_timedelta(df[col], unit="s")
            continue
        # Extract hours, minutes, seconds; malformed or empty values yield NaN
        parts = df[col].astype(str).str.strip().str.extract(r"^(\d+):(\d{2}):(\d{2})$")
        h, m, s = (pd.to_numeric(parts[i], errors="coerce") for i in range(3))
        # Keep missing times as NaT rather than coercing them to 00:00:00
        df[col] = pd.to_timedelta(h * 3600 + m * 60 + s, unit="s")
    return df

# Apply to stop_times
feed.stop_times = normalize_gtfs_times(feed.stop_times, ["arrival_time", "departure_time"])

For comprehensive temporal operations, consult the official Pandas Time Series Documentation, which covers timedelta arithmetic, timezone localization, and resampling techniques applicable to transit headway analysis.
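
The frequency-based schedules mentioned earlier can be expanded into discrete departures with a short loop. The sketch below makes two assumptions that should be checked against your feed: start_time and end_time are already seconds since midnight, and end_time is treated as exclusive. The function name expand_frequencies and the first_departure column are hypothetical names chosen for illustration.

```python
import pandas as pd


def expand_frequencies(frequencies):
    """Expand frequencies.txt rows into one row per trip departure.

    Assumes start_time/end_time are seconds since midnight and that
    end_time is exclusive (an assumption; verify against agency practice).
    """
    rows = []
    for freq in frequencies.itertuples(index=False):
        start = int(freq.start_time)
        end = int(freq.end_time)
        headway = int(freq.headway_secs)
        for departure in range(start, end, headway):
            rows.append({"trip_id": freq.trip_id, "first_departure": departure})
    out = pd.DataFrame(rows, columns=["trip_id", "first_departure"])
    out["first_departure"] = pd.to_timedelta(out["first_departure"], unit="s")
    return out


# One template trip running every 10 minutes from 06:00 until 06:30
frequencies = pd.DataFrame(
    {"trip_id": ["t1"], "start_time": [21600], "end_time": [23400], "headway_secs": [600]}
)
instances = expand_frequencies(frequencies)
print(len(instances))  # 3 departures: 06:00, 06:10, 06:20
```

Each generated departure can then be combined with the template trip's stop_times by adding the offset between first_departure and the template's first departure_time.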

4. Memory-Aware Transformation & Export

Once validated and normalized, the dataset should be exported to a columnar format optimized for analytical workloads. Parquet, backed by Apache Arrow, preserves schema types, supports compression, and enables predicate pushdown for downstream queries.

python
import os
import partridge as ptg
import pyarrow as pa
import pyarrow.parquet as pq

def export_feed_to_parquet(feed: ptg.Feed, output_dir: str):
    os.makedirs(output_dir, exist_ok=True)
    
    # Export core tables
    tables = {
        "stops": feed.stops,
        "routes": feed.routes,
        "trips": feed.trips,
        "stop_times": feed.stop_times,
        "calendar": feed.calendar,
        "calendar_dates": feed.calendar_dates
    }
    
    for name, df in tables.items():
        if df is None or df.empty:
            continue
        df = df.copy()  # avoid mutating the feed's cached DataFrames
        # Dictionary-encode low-cardinality string columns as categoricals
        for col in df.select_dtypes(include=["object"]).columns:
            if df[col].nunique() / len(df) < 0.5:  # Low cardinality threshold
                df[col] = df[col].astype("category")
                
        table = pa.Table.from_pandas(df, preserve_index=False)
        pq.write_table(table, os.path.join(output_dir, f"{name}.parquet"), compression="zstd")
        
export_feed_to_parquet(feed, "./data/normalized")

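Compared to raw CSV, compressed Parquet typically reduces the storage footprint substantially and speeds up reads in DuckDB, Polars, or cloud data warehouses. The exact savings depend on the feed's size and column cardinality, so measure them on your own data.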

Production Hardening & Pipeline Integration

A parsing script that works locally is likely to fail under production conditions unless it has explicit error boundaries and observability. Implement structured logging to capture validation failures, orphaned references, and type coercion warnings. Route these logs to a centralized monitoring system rather than printing to stdout.
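
One way to get structured logs with only the standard library is a JSON formatter attached to the pipeline's logger. This is a sketch under stated assumptions: the logger name gtfs_pipeline and the convention of passing context through extra={"ctx": {...}} are hypothetical choices, and the StreamHandler is a stand-in for whatever handler your monitoring system ingests.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: pipeline context arrives via extra={"ctx": {...}}
        payload.update(getattr(record, "ctx", {}))
        return json.dumps(payload, sort_keys=True)


def get_pipeline_logger(name="gtfs_pipeline"):
    """Return a logger that emits JSON lines, configured only once."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()  # placeholder for a real log shipper
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


log = get_pipeline_logger()
log.warning("orphaned references", extra={"ctx": {"table": "stop_times", "rows": 12}})
```

Keeping the context in a single ctx dictionary avoids collisions with the attribute names that logging reserves on each record.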

When orchestrating at scale, wrap the pipeline in a task runner like Apache Airflow or Prefect. Schedule daily feed ingestion, validate schema drift, and trigger downstream routing engine updates only when service changes exceed a defined threshold. For multi-agency deployments, batch processing strategies should isolate each feed into independent execution contexts to prevent cross-contamination of stop_id namespaces or agency_id collisions.
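
For multi-agency deployments, one simple way to avoid identifier collisions is to prefix every key column with a per-feed label before tables are combined. The sketch below is illustrative: the function name namespace_ids, the ":" separator, and the agency_a/agency_b labels are assumptions, not an established convention.

```python
import pandas as pd

DEFAULT_ID_COLS = ("agency_id", "route_id", "service_id", "trip_id", "stop_id")


def namespace_ids(df, feed_key, id_cols=DEFAULT_ID_COLS):
    """Prefix ID columns with a per-feed key so IDs stay unique after merging."""
    out = df.copy()
    for col in id_cols:
        if col in out.columns:
            prefixed = feed_key + ":" + out[col].astype(str)
            # Keep missing IDs missing instead of producing "feed:nan"
            out[col] = prefixed.where(out[col].notna())
    return out


# Two agencies that both use stop_id "100"
stops_a = pd.DataFrame({"stop_id": ["100"], "stop_name": ["Main St"]})
stops_b = pd.DataFrame({"stop_id": ["100"], "stop_name": ["Harbor"]})

merged = pd.concat(
    [namespace_ids(stops_a, "agency_a"), namespace_ids(stops_b, "agency_b")],
    ignore_index=True,
)
print(merged["stop_id"].tolist())  # ['agency_a:100', 'agency_b:100']
```

Any column that references one of these keys, such as parent_station in stops.txt, needs the same prefix for joins to keep working.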

Finally, maintain a data quality categorization framework. Tag records as VALID, WARNING (e.g., missing wheelchair_boarding), or ERROR (e.g., circular route geometry). This enables analysts to filter noise without discarding entire feeds. By adhering to deterministic parsing, explicit type casting, and columnar export patterns, Parsing GTFS with Pandas and Partridge becomes a repeatable foundation for transit analytics, mobility APIs, and urban planning models.
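
The tagging scheme can be sketched for stops.txt as follows. The rules here are illustrative assumptions rather than a standard: a missing wheelchair_boarding value yields WARNING, and missing or out-of-range coordinates stand in as a simple ERROR condition (detecting circular route geometry would need shape data and is out of scope for this sketch). The function name tag_stop_quality and the quality column are hypothetical.

```python
import pandas as pd


def tag_stop_quality(stops):
    """Add a `quality` column: ERROR, WARNING, or VALID (illustrative rules)."""
    out = stops.copy()
    lat = pd.to_numeric(out["stop_lat"], errors="coerce")
    lon = pd.to_numeric(out["stop_lon"], errors="coerce")
    bad_coords = lat.isna() | lon.isna() | (lat.abs() > 90) | (lon.abs() > 180)

    if "wheelchair_boarding" in out.columns:
        missing_access = out["wheelchair_boarding"].isna()
    else:
        missing_access = pd.Series(True, index=out.index)

    out["quality"] = "VALID"
    out.loc[missing_access, "quality"] = "WARNING"
    out.loc[bad_coords, "quality"] = "ERROR"  # ERROR overrides WARNING
    return out


stops = pd.DataFrame(
    {
        "stop_id": ["s1", "s2", "s3"],
        "stop_lat": [47.6, 47.7, None],
        "stop_lon": [-122.3, -122.4, -122.5],
        "wheelchair_boarding": [1, None, 1],
    }
)
tagged = tag_stop_quality(stops)
print(tagged["quality"].tolist())  # ['VALID', 'WARNING', 'ERROR']
```

Analysts can then filter on the quality column, for example keeping everything except ERROR rows, without discarding the feed as a whole.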