Parsing GTFS with Pandas and Partridge

Real-world GTFS feeds routinely contain malformed time strings, orphaned foreign keys, and calendar edge cases that crash naive CSV readers before a single row of analysis is possible. This page covers a production-tested pipeline that pairs Partridge’s lazy-loading validation engine with Pandas’s vectorized transformation capabilities, giving transit analysts, Python GIS engineers, and mobility platform teams a deterministic path from raw agency ZIP to query-ready columnar data.

Prerequisites

Before starting, verify your environment meets these baseline requirements:

Python 3.9+ — required for zoneinfo (standard library timezone support) and modern pandas type hint APIs.
Required files inside the GTFS ZIP — stops.txt, routes.txt, trips.txt, stop_times.txt, and at least one of calendar.txt / calendar_dates.txt.
Core libraries — install before proceeding:

pip install "pandas>=2.0.0" "partridge>=1.1.0" "pyarrow>=12.0" "requests>=2.28"

Memory headroom — metropolitan feeds decompress to 500 MB or more. Partridge defers materialization through lazy loading, but Pandas operations on the full stop_times table still require available RAM. For feeds that exceed available memory, see the memory-efficient processing guidance.

Dependency	Minimum version	Purpose
`pandas`	2.0.0	DataFrame operations, type inference, timedelta arithmetic
`partridge`	1.1.0	Lazy-loading GTFS parser with FK validation
`pyarrow`	12.0+	Parquet serialization with column-level compression
`requests`	2.28+	Streaming HTTP download of feed ZIPs
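The required-file and memory-headroom checks above can be run before any parsing starts. The following is a minimal sketch using only the standard library; the helper name inspect_feed_zip is hypothetical, and it assumes the GTFS files sit at the root of the archive rather than in a subdirectory.

```python
import io
import zipfile

REQUIRED_FILES = {"stops.txt", "routes.txt", "trips.txt", "stop_times.txt"}
CALENDAR_FILES = {"calendar.txt", "calendar_dates.txt"}


def inspect_feed_zip(source) -> dict:
    """Report missing required files and the total uncompressed size in MB."""
    with zipfile.ZipFile(source) as archive:
        infos = archive.infolist()
    names = {info.filename for info in infos}

    missing = sorted(REQUIRED_FILES - names)
    if not (CALENDAR_FILES & names):
        missing.append("calendar.txt or calendar_dates.txt")

    # file_size is the uncompressed size recorded in the ZIP directory
    total_mb = sum(info.file_size for info in infos) / 1_048_576
    return {"missing": missing, "uncompressed_mb": total_mb}


# Demo on a tiny in-memory archive that deliberately lacks routes.txt
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name in ("stops.txt", "trips.txt", "stop_times.txt", "calendar.txt"):
        archive.writestr(name, "placeholder\n")
buffer.seek(0)

report = inspect_feed_zip(buffer)
print(report)
```

The uncompressed total is only a lower bound on the RAM Pandas will need, since in-memory string columns usually take more space than their CSV form.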

Concept and Spec Background

GTFS Static distributes schedule data as a ZIP containing multiple CSV files linked by shared identifiers. The key relational constraints are:

Every trip_id in stop_times.txt must exist in trips.txt.
Every service_id in trips.txt must exist in calendar.txt or calendar_dates.txt.
Every stop_id in stop_times.txt must exist in stops.txt.
Every route_id in trips.txt must exist in routes.txt.

Violating any of these constraints produces orphaned records that silently corrupt route-level aggregations. Partridge encodes these foreign-key relationships so that a filter applied to one table cascades automatically to all dependent tables.
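To make the constraints concrete, here is a small sketch that finds orphaned rows with plain Pandas. The helper find_orphans and the toy tables are illustrative assumptions, not part of Partridge.

```python
import pandas as pd


def find_orphans(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return rows of `child` whose `key` value has no match in `parent`."""
    return child.loc[~child[key].isin(parent[key])]


# Toy tables: trip t2 points at a route that does not exist,
# and one stop_times row points at a trip that does not exist.
routes = pd.DataFrame({"route_id": ["r1", "r2"]})
trips = pd.DataFrame({"trip_id": ["t1", "t2"], "route_id": ["r1", "r9"]})
stop_times = pd.DataFrame(
    {
        "trip_id": ["t1", "t1", "t3"],
        "stop_id": ["s1", "s2", "s1"],
        "stop_sequence": [1, 2, 1],
    }
)

orphan_trips = find_orphans(trips, routes, "route_id")
orphan_stop_times = find_orphans(stop_times, trips, "trip_id")

print(orphan_trips)
print(orphan_stop_times)
```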

The second major spec complexity is the time format. GTFS stop times use HH:MM:SS measured from the start of the service day, so values exceed 24:00:00 for trips that run past midnight: 25:30:00 means 1:30 AM on the calendar day after the service date. Pandas datetime parsers reject these values; timedelta is the correct type.
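A short sketch of why timedelta fits: the offset keeps its full magnitude, and arithmetic across midnight needs no special casing. The helper name gtfs_time_to_timedelta and the sample service date are assumptions for illustration.

```python
import pandas as pd


def gtfs_time_to_timedelta(value: str) -> pd.Timedelta:
    """Parse one GTFS HH:MM:SS string, allowing hours of 24 or more."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return pd.Timedelta(hours=hours, minutes=minutes, seconds=seconds)


departure = gtfs_time_to_timedelta("23:50:00")
arrival = gtfs_time_to_timedelta("25:30:00")

# Travel time across midnight is a plain subtraction
travel_seconds = (arrival - departure).total_seconds()

# Adding the offset to a (naive) service date gives the wall-clock moment
service_day = pd.Timestamp("2024-09-17")
wall_clock_arrival = service_day + arrival

print(travel_seconds, wall_clock_arrival)
```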

GTFS file relationships — spec reference:

File	Primary key	Foreign keys	Role in pipeline
`stops.txt`	`stop_id`	—	Geographic stop definitions
`routes.txt`	`route_id`	`agency_id`	Route metadata
`trips.txt`	`trip_id`	`route_id`, `service_id`, `shape_id`	Trip-to-route binding
`stop_times.txt`	(`trip_id`, `stop_sequence`)	`trip_id`, `stop_id`	Largest table; drives all schedule queries
`calendar.txt`	`service_id`	—	Weekday service patterns
`calendar_dates.txt`	(`service_id`, `date`)	`service_id`	Exception dates (holiday service)

Step-by-Step Implementation

Step 1 — Acquire and Cache the Feed ZIP

Download the GTFS ZIP to a local directory before loading. Local caching avoids repeated network calls during development and enables checksum-based change detection for automated feed update workflows.

import os
import hashlib
import requests

FEED_URL = "https://example-transit-agency.gov/gtfs.zip"
CACHE_DIR = "./data/feeds"
FEED_PATH = os.path.join(CACHE_DIR, "latest.gtfs.zip")

def fetch_feed(url: str, dest_path: str) -> str:
    """Stream a GTFS ZIP to disk. Returns the local path on success."""
    os.makedirs(os.path.dirname(dest_path), exist_ok=True)
    temp_path = dest_path + ".download"

    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(temp_path, "wb") as fh:
            for chunk in response.iter_content(chunk_size=65_536):
                fh.write(chunk)

    # Atomic replace — never leave a partial file at the live path
    os.replace(temp_path, dest_path)
    return dest_path


def sha256_of_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(65_536), b""):
            h.update(block)
    return h.hexdigest()


feed_path = fetch_feed(FEED_URL, FEED_PATH)
print(f"Feed cached: {feed_path}  sha256={sha256_of_file(feed_path)[:16]}…")
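One way to turn the checksum into change detection is to keep the last digest in a sidecar file next to the ZIP. This is a sketch under that assumption; the function name feed_has_changed and the .sha256 sidecar convention are hypothetical.

```python
import hashlib
import os
import tempfile


def feed_has_changed(zip_path: str, digest_path: str) -> bool:
    """Compare the ZIP's SHA-256 with the digest stored by the previous run."""
    digest = hashlib.sha256()
    with open(zip_path, "rb") as fh:
        for block in iter(lambda: fh.read(65_536), b""):
            digest.update(block)
    current = digest.hexdigest()

    previous = None
    if os.path.exists(digest_path):
        with open(digest_path, "r", encoding="utf-8") as fh:
            previous = fh.read().strip()

    if current == previous:
        return False

    # Record the new digest so the next run can skip an unchanged feed
    with open(digest_path, "w", encoding="utf-8") as fh:
        fh.write(current)
    return True


# Demo with a throwaway file standing in for a downloaded feed
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "latest.gtfs.zip")
    digest_path = zip_path + ".sha256"

    with open(zip_path, "wb") as fh:
        fh.write(b"first version")
    first_run = feed_has_changed(zip_path, digest_path)
    second_run = feed_has_changed(zip_path, digest_path)

    with open(zip_path, "wb") as fh:
        fh.write(b"second version")
    third_run = feed_has_changed(zip_path, digest_path)

print(first_run, second_run, third_run)
```

Downstream steps can then be skipped whenever the function returns False.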

Step 2 — Load with Schema Validation

Partridge’s load_feed() accepts a view dictionary that acts as a cascading row filter. The code in this step resolves the service IDs active on one date with read_service_ids_by_date() and restricts trips.txt to them; Partridge then prunes stop_times.txt and the other dependent tables to the rows that reference the surviving trips. Tables are not read from disk until you access the corresponding attribute.

For detailed configuration of service-day filtering and the full range of Partridge view options, see the step-by-step Partridge parsing guide.

import partridge as ptg
from datetime import date

TARGET_DATE = date(2024, 9, 17)  # Tuesday — representative weekday

# Resolve the service IDs active on TARGET_DATE, then restrict trips.txt to them.
# Partridge prunes dependent tables (stop_times and so on) to the surviving trips.
inferred_service_ids = ptg.read_service_ids_by_date(feed_path)
active_service_ids = inferred_service_ids.get(TARGET_DATE, set())

if not active_service_ids:
    raise ValueError(f"No active service IDs found for {TARGET_DATE}")

view = {
    "trips.txt": {"service_id": active_service_ids},
}

feed = ptg.load_feed(feed_path, view=view)

print(f"Stops:      {len(feed.stops):>8,}")
print(f"Routes:     {len(feed.routes):>8,}")
print(f"Trips:      {len(feed.trips):>8,}")
print(f"Stop times: {len(feed.stop_times):>8,}")

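Partridge behaves as a loader rather than a strict validator: a file missing from the ZIP typically comes back as an empty DataFrame, and rows in dependent tables that fall outside the view are pruned rather than reported. Do not rely on load_feed() alone to surface broken references; the explicit checks in Step 4 are the integrity gate.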

Step 3 — Normalize Time Columns

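Convert arrival_time and departure_time to pd.Timedelta. Raw GTFS stores them as HH:MM:SS strings, but some loader configurations hand them over already parsed (Partridge's default converters, for instance, may deliver numeric seconds), so check the column dtype first. Timedelta handles values above 24:00:00 correctly and enables direct arithmetic (computing headways, dwell times, and travel times between stops).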

import pandas as pd


def parse_gtfs_time(series: pd.Series) -> pd.Series:
    """
    Convert GTFS times (including values >24:00) to timedelta.
    Accepts HH:MM:SS strings or numeric seconds; null/empty values become NaT.
    """
    # Some loaders (Partridge's default converters among them) may already
    # have parsed the column to seconds; .str would fail on numeric data.
    if pd.api.types.is_numeric_dtype(series):
        return pd.to_timedelta(series, unit="s")

    result = pd.Series(pd.NaT, index=series.index, dtype="timedelta64[ns]")

    # Mask truly missing values before splitting
    mask_valid = series.notna() & (series.str.strip() != "")
    if not mask_valid.any():
        return result

    parts = series[mask_valid].str.split(":", expand=True).astype(int)
    seconds_total = (
        parts.iloc[:, 0] * 3600  # hours
        + parts.iloc[:, 1] * 60  # minutes
        + parts.iloc[:, 2]       # seconds
    )

    result[mask_valid] = pd.to_timedelta(seconds_total, unit="s")
    return result


stop_times = feed.stop_times.copy()
stop_times["arrival_time"]   = parse_gtfs_time(stop_times["arrival_time"])
stop_times["departure_time"] = parse_gtfs_time(stop_times["departure_time"])

# Derive dwell time at each stop (seconds)
stop_times["dwell_s"] = (
    (stop_times["departure_time"] - stop_times["arrival_time"])
    .dt.total_seconds()
    .clip(lower=0)
)

print(stop_times[["trip_id", "stop_sequence", "arrival_time", "departure_time", "dwell_s"]].head())

For feeds that mix timetable and frequency-based schedules, expand frequencies.txt into discrete departure instances before applying this normalization — frequency-based trips carry synthetic times derived from the headway_secs column.
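The expansion can be sketched as follows, assuming start_time, end_time, and headway_secs are already numeric seconds and that end_time is exclusive (the last departure starts before it). The function name expand_frequencies and the output column trip_start_s are hypothetical.

```python
import numpy as np
import pandas as pd


def expand_frequencies(frequencies: pd.DataFrame) -> pd.DataFrame:
    """One row per generated departure, in seconds after the service-day start."""
    frames = []
    for freq in frequencies.itertuples(index=False):
        # Assumption: end_time is exclusive, so arange's half-open range fits
        starts = np.arange(freq.start_time, freq.end_time, freq.headway_secs)
        frames.append(pd.DataFrame({"trip_id": freq.trip_id, "trip_start_s": starts}))
    if not frames:
        return pd.DataFrame(columns=["trip_id", "trip_start_s"])
    return pd.concat(frames, ignore_index=True)


frequencies = pd.DataFrame(
    {
        "trip_id": ["A", "B"],
        "start_time": [6 * 3600, 25 * 3600],   # 06:00:00 and 25:00:00
        "end_time": [7 * 3600, 26 * 3600],     # 07:00:00 and 26:00:00
        "headway_secs": [900, 1800],
    }
)

departures = expand_frequencies(frequencies)
print(departures)
```

Each generated start can then shift the template trip's stop_times by the difference between that start and the template's first departure.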

For downstream operations involving cross-timezone comparisons or UTC storage, apply timezone normalization using the agency timezone declared in agency.txt after converting to timedelta.
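A minimal sketch of that normalization, assuming the agency timezone has already been read from agency.txt (the value "America/New_York" here is only an example). It anchors the offset at local noon minus twelve hours, which is how the GTFS reference defines the start of the service day, so the result stays correct on daylight-saving transition days. The helper name to_utc is hypothetical.

```python
import pandas as pd


def to_utc(service_date: str, offset: pd.Timedelta, agency_tz: str) -> pd.Timestamp:
    """Combine a service date and a GTFS time offset into a UTC timestamp."""
    noon_local = pd.Timestamp(f"{service_date} 12:00:00", tz=agency_tz)
    # Arithmetic on tz-aware timestamps adds absolute elapsed time
    service_day_start = noon_local - pd.Timedelta(hours=12)
    return (service_day_start + offset).tz_convert("UTC")


departure_utc = to_utc(
    "2024-09-17",
    pd.Timedelta(hours=25, minutes=30),
    "America/New_York",
)
print(departure_utc)
```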

Step 4 — Assert Referential Integrity

After normalization, verify that no orphaned records exist before writing to a persistent store. Silent orphans produce wrong trip counts, missing stop coverage, and corrupted headway calculations.

def assert_referential_integrity(feed: ptg.Feed, stop_times: pd.DataFrame) -> None:
    """Raise AssertionError on any broken foreign-key reference."""
    valid_trip_ids = set(feed.trips["trip_id"])
    valid_stop_ids = set(feed.stops["stop_id"])
    valid_route_ids = set(feed.routes["route_id"])

    orphan_trips = set(stop_times["trip_id"]) - valid_trip_ids
    orphan_stops = set(stop_times["stop_id"]) - valid_stop_ids
    orphan_routes = set(feed.trips["route_id"]) - valid_route_ids

    assert not orphan_trips,  f"Orphaned trip_ids in stop_times: {orphan_trips}"
    assert not orphan_stops,  f"Orphaned stop_ids in stop_times: {orphan_stops}"
    assert not orphan_routes, f"Orphaned route_ids in trips: {orphan_routes}"

    # Verify stop_sequence is strictly increasing per trip. Sorting first makes a
    # non-strict monotonic check vacuous, so test for repeated sequence numbers.
    dup_mask = stop_times.duplicated(["trip_id", "stop_sequence"], keep=False)
    bad_trips = stop_times.loc[dup_mask, "trip_id"].unique().tolist()
    assert not bad_trips, f"Non-monotonic (duplicate) stop_sequence in {len(bad_trips)} trip(s): {bad_trips[:5]}"

    print("Referential integrity: OK")


assert_referential_integrity(feed, stop_times)

Step 5 — Export to Parquet

Export the validated and normalized DataFrames to columnar Parquet files. Parquet preserves schema types (including timedelta64), supports column-level compression, and enables predicate pushdown in DuckDB, Polars, and cloud data warehouses. For strategies covering very large feeds, see memory-efficient processing for large feeds.

import pyarrow as pa
import pyarrow.parquet as pq


def write_parquet(df: pd.DataFrame, output_path: str) -> None:
    """Write a DataFrame to Parquet with zstd compression and categorical optimization."""
    # Convert low-cardinality string columns to categorical before Arrow conversion
    for col in df.select_dtypes(include=["object"]).columns:
        if df[col].nunique() / max(len(df), 1) < 0.5:
            df[col] = df[col].astype("category")

    table = pa.Table.from_pandas(df, preserve_index=False)
    pq.write_table(table, output_path, compression="zstd", compression_level=9)
    size_mb = os.path.getsize(output_path) / 1_048_576
    print(f"  {os.path.basename(output_path)}: {len(df):,} rows → {size_mb:.1f} MB")


OUTPUT_DIR = "./data/normalized"
os.makedirs(OUTPUT_DIR, exist_ok=True)

tables_to_export = {
    "stops":          feed.stops,
    "routes":         feed.routes,
    "trips":          feed.trips,
    "stop_times":     stop_times,            # use the normalized copy
    "calendar":       feed.calendar,
    "calendar_dates": feed.calendar_dates,
}

print("Exporting Parquet files:")
for name, df in tables_to_export.items():
    if df is None or df.empty:
        print(f"  {name}.parquet: empty — skipped")
        continue
    write_parquet(df.copy(), os.path.join(OUTPUT_DIR, f"{name}.parquet"))

zstd at level 9 typically achieves 65–80 % size reduction compared to raw CSV for GTFS data, which contains highly repetitive string identifiers across millions of stop-time rows.

Validation and Verification

After export, verify the pipeline produced correct results with a lightweight sanity-check suite:

import os

import pandas as pd
import pyarrow.parquet as pq

def verify_parquet_output(output_dir: str, expected_min_trips: int = 1) -> None:
    stop_times_path = os.path.join(output_dir, "stop_times.parquet")
    trips_path      = os.path.join(output_dir, "trips.parquet")
    stops_path      = os.path.join(output_dir, "stops.parquet")

    st  = pq.read_table(stop_times_path).to_pandas()
    tr  = pq.read_table(trips_path).to_pandas()
    stp = pq.read_table(stops_path).to_pandas()

    # Row count sanity
    assert len(tr) >= expected_min_trips, f"Expected ≥{expected_min_trips} trips, got {len(tr)}"
    assert len(st) > len(tr),            "stop_times should have more rows than trips"

    # Time columns preserved as timedelta
    assert pd.api.types.is_timedelta64_dtype(st["arrival_time"]),   "arrival_time must be timedelta"
    assert pd.api.types.is_timedelta64_dtype(st["departure_time"]), "departure_time must be timedelta"

    # No nulls in mandatory FK columns
    for col in ["trip_id", "stop_id", "stop_sequence"]:
        null_count = st[col].isna().sum()
        assert null_count == 0, f"stop_times.{col} has {null_count} nulls"

    # All stop_times trip_ids resolve
    orphans = set(st["trip_id"]) - set(tr["trip_id"])
    assert not orphans, f"{len(orphans)} orphaned trip_ids in stop_times"

    print(f"Verification passed — {len(tr):,} trips, {len(st):,} stop times, {len(stp):,} stops.")


verify_parquet_output(OUTPUT_DIR, expected_min_trips=50)

Failure Modes and Edge Cases

Times exceeding 24:00: Any agency running overnight service will produce times like 25:10:00 or 26:00:00. Feeding these to pd.to_datetime() fails with a ValueError because the hour is out of range (or yields NaT with errors="coerce"). Always use timedelta as shown in Step 3.
Missing calendar.txt (calendar-dates-only feeds): Some agencies publish only calendar_dates.txt with no calendar.txt. Accessing feed.calendar returns an empty DataFrame. Guard every access with a null/empty check.
stop_times.txt with no arrival_time: The spec requires times at the first and last stop of each trip and at timepoints, but allows arrival_time and departure_time to be empty at other intermediate stops. Validate that at least the first and last stop of each trip have non-null times, and expect NaT elsewhere.
Duplicate stop_sequence values: A minority of agency feeds accidentally publish duplicate sequence numbers within a single trip. Test for duplicated (trip_id, stop_sequence) pairs explicitly, because a non-strict monotonicity check on sorted rows passes when values repeat. Resolution requires per-trip re-sequencing or manual feed correction.
Non-UTF-8 encoding: GTFS files are meant to be UTF-8, but Windows-1252 and Latin-1 feeds exist. Depending on how the loader detects encodings, these surface as UnicodeDecodeError at load time or as garbled names in stops and routes. When that happens, detect the encoding (for example with chardet) and re-encode to UTF-8 before loading.
shape_id present in trips.txt but shapes.txt missing: Partridge does not treat the missing file as an error; feed.shapes simply comes back empty. Downstream shape-snapping operations will silently skip affected trips unless you explicitly check feed.shapes for emptiness.
Overlapping calendar_dates.txt exception types: Some feeds have both an exception_type=1 (added) and exception_type=2 (removed) record for the same (service_id, date) pair. The GTFS spec forbids this; Partridge does not reject it. Filter to the most restrictive interpretation (removal wins).
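For the calendar-dates-only and overlapping-exception cases, service-ID resolution can be sketched as below. It assumes the date column holds datetime.date objects and exception_type is numeric; the function name service_ids_from_calendar_dates is hypothetical, and feeds that also ship calendar.txt need the weekly pattern merged in first.

```python
import datetime as dt

import pandas as pd


def service_ids_from_calendar_dates(calendar_dates: pd.DataFrame, target: dt.date) -> set:
    """Active service IDs on `target`, with removals overriding additions."""
    on_date = calendar_dates[calendar_dates["date"] == target]
    added = set(on_date.loc[on_date["exception_type"] == 1, "service_id"])
    removed = set(on_date.loc[on_date["exception_type"] == 2, "service_id"])
    # "Removal wins": a conflicting pair is treated as not running
    return added - removed


target = dt.date(2024, 9, 17)
calendar_dates = pd.DataFrame(
    {
        "service_id": ["WKD", "WKD", "HOL", "SAT"],
        "date": [target, target, target, dt.date(2024, 9, 21)],
        "exception_type": [1, 2, 1, 1],
    }
)

active = service_ids_from_calendar_dates(calendar_dates, target)
print(active)
```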

Performance and Scale Notes

For feeds above roughly 500 MB uncompressed, loading every table in full can exhaust RAM on machines with less than 16 GB available. Apply these strategies in order of impact:

1. Date-scope the view aggressively. A single-date Partridge view can reduce stop_times rows by 70–85 % for agencies with complex calendar patterns. Load only the service day(s) you need.

2. Drop unused columns before materialization. After calling feed.stop_times, immediately select only required columns before any copy or merge operation:

required_cols = ["trip_id", "stop_id", "stop_sequence", "arrival_time", "departure_time"]
stop_times = feed.stop_times[required_cols].copy()

3. Use categorical dtypes for ID columns. String columns like trip_id, stop_id, and route_id repeat millions of times. Converting to category after load typically halves stop_times memory use:

id_cols = ["trip_id", "stop_id", "route_id"]
for col in id_cols:
    if col in stop_times.columns:
        stop_times[col] = stop_times[col].astype("category")
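The saving depends on how often each identifier repeats, so measure it rather than assume it. A small sketch on synthetic data (the 500 distinct IDs over 100,000 rows are an arbitrary assumption):

```python
import pandas as pd

# Synthetic ID column: 500 distinct values repeated across 100,000 rows
trip_ids = pd.Series([f"trip_{i % 500:04d}" for i in range(100_000)], name="trip_id")

# deep=True counts the Python string payloads, not just the pointers
as_object_bytes = trip_ids.memory_usage(deep=True)
as_category_bytes = trip_ids.astype("category").memory_usage(deep=True)

print(f"object:   {as_object_bytes:,} bytes")
print(f"category: {as_category_bytes:,} bytes")
```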

4. For multi-agency batches, wrap each feed in its own Python subprocess or Prefect/Airflow task. This isolates memory and prevents stop_id namespace collisions across agencies. See batch processing strategies for multi-agency feeds for namespace-prefixing patterns.
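One possible namespace-prefixing pattern is sketched below; the ID_COLUMNS list, the ":" separator, and the agency keys are illustrative assumptions rather than a fixed convention.

```python
import pandas as pd

ID_COLUMNS = ("stop_id", "trip_id", "route_id", "service_id", "shape_id")


def prefix_ids(df: pd.DataFrame, agency_key: str) -> pd.DataFrame:
    """Return a copy with every known ID column prefixed by the agency key."""
    out = df.copy()
    for col in ID_COLUMNS:
        if col in out.columns:
            values = out[col].astype(object)
            prefixed = agency_key + ":" + values.astype(str)
            # Keep missing IDs missing instead of producing "agency:nan"
            out[col] = values.where(values.isna(), prefixed)
    return out


agency_a = pd.DataFrame({"stop_id": ["100", "101"]})
agency_b = pd.DataFrame({"stop_id": ["100", "200"]})

combined = pd.concat(
    [prefix_ids(agency_a, "metro"), prefix_ids(agency_b, "county")],
    ignore_index=True,
)
print(combined)
```

Apply the same prefix to every table of a feed so that foreign keys keep matching after the merge.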

5. Query Parquet directly without loading into Pandas. Once exported, use DuckDB to run SQL directly on the Parquet files with predicate pushdown — no Python process holds the full dataset in memory:

import duckdb

result = duckdb.query("""
    SELECT route_id, COUNT(DISTINCT trip_id) AS trips_today
    FROM './data/normalized/trips.parquet'
    GROUP BY route_id
    ORDER BY trips_today DESC
    LIMIT 20
""").to_df()
print(result)

For feeds approaching 1 GB uncompressed, the optimizing Pandas memory usage for transit feeds guide covers chunk-based stop_times iteration and Arrow-backed DataFrames.
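As a taste of the chunk-based approach, this sketch counts stop events per stop_id without ever holding the whole file in memory. It reads stop_times.txt directly with pandas.read_csv rather than through Partridge; the function name count_stop_events and the tiny inline CSV are assumptions for the demo.

```python
import io
from collections import Counter

import pandas as pd


def count_stop_events(csv_source, chunksize: int = 100_000) -> pd.Series:
    """Aggregate stop_id counts chunk by chunk."""
    totals = Counter()
    reader = pd.read_csv(
        csv_source,
        usecols=["stop_id"],          # read only the column we aggregate
        dtype={"stop_id": str},       # IDs are strings, even when they look numeric
        chunksize=chunksize,
    )
    for chunk in reader:
        totals.update(chunk["stop_id"].value_counts().to_dict())
    return pd.Series(dict(totals), dtype="int64").sort_index()


csv_text = (
    "trip_id,stop_id,stop_sequence\n"
    "t1,s1,1\n"
    "t1,s2,2\n"
    "t2,s1,1\n"
    "t2,s2,2\n"
    "t3,s1,1\n"
)

# A chunksize of 2 forces several chunks even on this tiny input
events = count_stop_events(io.StringIO(csv_text), chunksize=2)
print(events)
```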

Frequently Asked Questions

Why does Partridge load a subset of trips instead of the full feed?

Partridge applies a view dictionary that filters rows using GTFS foreign-key relationships. When you restrict trips.txt to the service IDs active on a single date, Partridge cascades that filter through stop_times.txt and the other dependent tables automatically, so only trips active on that date are materialized. This is the primary mechanism for keeping memory use manageable on large metropolitan feeds.

How do I handle GTFS departure times greater than 24:00:00?

Convert time strings to pd.Timedelta rather than datetime. Split on : to extract hours, minutes, and seconds, then sum the components using pd.to_timedelta(). Values like 25:30:00 become timedelta(hours=25, minutes=30), which correctly represents 1:30 AM the next service day and supports all arithmetic operations.

What Parquet compression codec should I use for GTFS data?

Use zstd. It achieves better compression ratios than snappy on GTFS data (which contains many repeated string identifiers) while maintaining fast decompression speeds suitable for interactive analytics tools like DuckDB. Level 9 is a good default; level 3 gives faster writes at slightly larger file sizes for pipeline stages where write speed matters.

What should I do if a feed is missing calendar.txt entirely?

Access feed.calendar and check df.empty before using it. Feeds that rely solely on calendar_dates.txt — common for agencies with many holiday exceptions — produce an empty calendar DataFrame. Build your service-ID resolution exclusively from calendar_dates.txt in these cases, querying for rows where exception_type == 1 on the target date.

Step-by-Step Guide to Parsing GTFS with Partridge — detailed Partridge configuration, view options, and service-day filtering
Handling Frequency-Based vs Timetable Schedules — expand frequencies.txt into discrete departure instances
Memory-Efficient Processing for Large Feeds — chunked reading, Arrow-backed DataFrames, multi-agency batching
Automating Feed Updates with gtfs-kit — checksum-based change detection and idempotent ingestion scripts
Mastering stops.txt and stop_times.txt Relationships — stop-level data model constraints and referential integrity patterns
