Memory-Efficient Processing for Large GTFS Feeds

Public transit agencies in major metropolitan regions routinely publish GTFS archives that exceed several gigabytes when uncompressed. The stop_times.txt and shapes.txt tables alone can contain tens of millions of rows, making naive in-memory loading a reliable cause of pipeline failures and OOM crashes. Solving this requires a deliberate architectural shift: moving away from monolithic DataFrame instantiation toward streaming ingestion, strict type enforcement, incremental joins, and columnar persistence. This guide provides production-tested patterns for transit analysts, urban tech developers, and Python GIS engineers who must normalize, validate, and transform massive transit datasets without overprovisioning compute resources.

Pipeline Overview

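The pipeline is a five-stage sequence that keeps peak RAM bounded regardless of feed size: profile the archive, stream the heavy tables in chunks, enforce dtypes at read time, join each chunk against in-memory lookups, and persist the result to Parquet. Each stage hands off only the minimum data needed by the next.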

Prerequisites

Before implementing chunked or streaming workflows, confirm the following:

Python 3.10+ with pandas>=2.0, pyarrow>=12.0, and optionally polars>=0.19 for heavy joins
System RAM: 8 GB minimum; 16 GB+ recommended for multi-agency metro feeds
Storage: SSD-backed temporary directory for intermediate Parquet files
GTFS familiarity: Understanding of the relational model — primary and foreign keys across stops.txt, trips.txt, routes.txt, and stop_times.txt

Install the required libraries:

pip install "pandas>=2.0" "pyarrow>=12.0" "polars>=0.19"

Concept and Spec Background

The GTFS specification does not prescribe any particular file size or row count limit. In practice, stop_times.txt is the heaviest file by a significant margin because every trip contributes one row for every stop it serves. A single metro bus network can generate 25 million rows in stop_times.txt while routes.txt contains only a few hundred rows. This asymmetry is the central constraint: the lookup tables (routes.txt, trips.txt, calendar.txt) are small enough to hold in RAM; the fact tables (stop_times.txt, shapes.txt) are not.

GTFS file	Typical rows (metro)	Key columns	Memory strategy
`stop_times.txt`	5 M – 80 M	`trip_id`, `stop_id`, `stop_sequence`, `arrival_time`, `departure_time`	Stream in chunks; category dtypes
`shapes.txt`	500 K – 10 M	`shape_id`, `shape_pt_lat`, `shape_pt_lon`, `shape_pt_sequence`	Stream in chunks; float32 coords
`trips.txt`	50 K – 500 K	`route_id`, `trip_id`, `service_id`, `shape_id`	Load fully; use as join lookup
`routes.txt`	50 – 5 000	`route_id`, `agency_id`, `route_type`	Load fully; join lookup
`stops.txt`	1 K – 100 K	`stop_id`, `stop_lat`, `stop_lon`	Load fully; join lookup
`calendar.txt`	1 – 5 000	`service_id`, weekday flags, `start_date`, `end_date`	Load fully; join lookup

Defer timezone normalization until you materialize absolute timestamps. GTFS allows times beyond 23:59:59 for overnight service, so time columns must remain strings or timedeltas when persisted and be converted to absolute UTC timestamps only at that later stage.

Step-by-Step Implementation

Step 1: Profile Feed Structure Before Loading

Never assume uniform row counts across tables. Extract ZIP archive metadata to identify heavy files before opening anything. stop_times.txt typically accounts for 60–80 percent of total uncompressed size.

import zipfile
from pathlib import Path

def profile_feed(archive_path: str) -> dict[str, int]:
    """Return {filename: uncompressed_bytes} sorted largest first."""
    sizes: dict[str, int] = {}
    with zipfile.ZipFile(archive_path, "r") as zf:
        for info in zf.infolist():
            if info.filename.endswith(".txt"):
                sizes[info.filename] = info.file_size
    return dict(sorted(sizes.items(), key=lambda kv: kv[1], reverse=True))

# Usage
feed_profile = profile_feed("metro_gtfs.zip")
for fname, nbytes in feed_profile.items():
    print(f"{fname:<35} {nbytes / 1_048_576:.1f} MB")

Use the profile to set CHUNK_SIZE dynamically. A useful heuristic: each chunk should fit in one-quarter of your available RAM after accounting for pandas overhead (roughly 3–5× the raw CSV bytes per chunk).
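The heuristic above can be turned into a quick calculation. The sketch below is illustrative only: `estimate_chunk_rows` and its parameters are hypothetical names, the default of 90 CSV bytes per row is an assumed midpoint of the 60–120 byte range quoted in the FAQ, and the 4× factor is the middle of the 3–5× overhead range.

```python
def estimate_chunk_rows(
    available_ram_bytes: int,
    csv_bytes_per_row: int = 90,    # assumption: rough average for stop_times.txt
    overhead_factor: float = 4.0,   # assumption: midpoint of the 3-5x pandas overhead
    ram_fraction: float = 0.25,     # each chunk may use one-quarter of available RAM
) -> int:
    """Return a row count whose in-memory chunk should fit the RAM budget."""
    if available_ram_bytes <= 0 or csv_bytes_per_row <= 0 or overhead_factor <= 0:
        raise ValueError("all sizing inputs must be positive")
    budget_bytes = available_ram_bytes * ram_fraction
    in_memory_bytes_per_row = csv_bytes_per_row * overhead_factor
    # Never return zero rows, even for a tiny budget
    return max(1, int(budget_bytes // in_memory_bytes_per_row))


# Example: size chunks for a host with 4 GiB of free RAM
chunk_rows = estimate_chunk_rows(4 * 1024**3)
print(f"Suggested chunksize: {chunk_rows:,} rows")
```

Treat the result as a starting point and confirm it against a measured sample chunk, since real bytes per row vary with identifier length and the optional columns an agency populates.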

Step 2: Stream Extraction and Chunked Ingestion

Avoid shutil.unpack_archive() or ZipFile.extractall(). Open files directly from the archive using zipfile.ZipFile.open() and pass them to pd.read_csv() with a chunksize parameter. The member is then decompressed incrementally as pandas reads it, so the full uncompressed file is never written to disk. For teams evaluating library-level abstractions, parsing GTFS with pandas and partridge shows when partridge’s lazy-loading model outperforms manual streaming.

import pandas as pd
import zipfile

FEED_PATH = "metro_gtfs.zip"
CHUNK_SIZE = 500_000  # rows per iteration; tune based on profile_feed output

def iter_gtfs_table(
    archive_path: str,
    table_name: str,
    dtype: dict,
    chunksize: int = CHUNK_SIZE,
):
    """Yield DataFrame chunks directly from a ZIP without full extraction."""
    with zipfile.ZipFile(archive_path, "r") as zf:
        with zf.open(table_name) as fh:
            for chunk in pd.read_csv(
                fh,
                dtype=dtype,
                chunksize=chunksize,
                low_memory=False,
                encoding="utf-8-sig",  # handles BOM from some agency exports
            ):
                yield chunk

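The low_memory=False flag disables pandas' internal piecewise type inference, so any column not covered by the dtype dictionary is typed from the whole chunk rather than from fragments of it. Always pair it with an explicit dtype dictionary: columns with a declared dtype skip inference entirely, which avoids silent coercion and mixed-type warnings.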

Step 3: Enforce Strict Data Types at Read Time

Left to infer types, pandas assigns int64 to integer columns, float64 to decimal columns (and to integer columns containing missing values), and object to strings. Transit identifiers (route_id, trip_id, stop_id) are categorical, not continuous. Defining explicit dtype dictionaries typically reduces memory footprint by 40–70 percent. Downcasting integers (int32 or int16 for stop_sequence) and using "category" for repeated string identifiers yields compounding savings across large feeds. For a deep dive into dtype mapping strategies specific to transit data, see optimizing pandas memory usage for transit feeds.

STOP_TIMES_DTYPES: dict[str, str] = {
    "trip_id":              "category",
    "arrival_time":         "string",   # keep as string; > 24:00 is valid GTFS
    "departure_time":       "string",
    "stop_id":              "category",
    "stop_sequence":        "int32",
    "pickup_type":          "Int8",     # nullable integer; 0–3 per spec
    "drop_off_type":        "Int8",
    "shape_dist_traveled":  "float32",
    "timepoint":            "Int8",
}

TRIPS_DTYPES: dict[str, str] = {
    "route_id":             "category",
    "service_id":           "category",
    "trip_id":              "category",
    "trip_headsign":        "string",
    "direction_id":         "Int8",
    "shape_id":             "category",
    "wheelchair_accessible":"Int8",
    "bikes_allowed":        "Int8",
}

After ingestion, verify actual RAM consumption:

import pandas as pd

sample_chunk = next(iter_gtfs_table(FEED_PATH, "stop_times.txt", STOP_TIMES_DTYPES, chunksize=200_000))
mem_mb = sample_chunk.memory_usage(deep=True).sum() / 1_048_576
print(f"Sample chunk ({len(sample_chunk):,} rows): {mem_mb:.1f} MB")
# → "Sample chunk (200,000 rows): 18.4 MB"  (vs ~47 MB with default dtypes)

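String columns often hide allocation overhead: with object dtype, every cell is a separate Python string object. Converting them to pd.CategoricalDtype stores each distinct value once per chunk, and pyarrow-backed string types pack values into contiguous buffers, so both avoid the per-cell object allocations.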

Step 4: Load Small Lookup Tables Fully, Then Stream the Heavy Table

The incremental join pattern keeps peak RAM bounded at size(lookup) + size(one_chunk) + overhead:

Load trips.txt and routes.txt completely into memory (they are small).
Stream stop_times.txt in chunks.
Merge each chunk against the in-memory lookup frames.
Write merged output to disk immediately before processing the next chunk.

import pandas as pd
import zipfile

def load_lookup(archive_path: str, table: str, dtype: dict) -> pd.DataFrame:
    """Load a small GTFS lookup table fully into memory."""
    with zipfile.ZipFile(archive_path, "r") as zf:
        with zf.open(table) as fh:
            return pd.read_csv(fh, dtype=dtype, encoding="utf-8-sig")

# Load lookups once
trips_df = load_lookup(FEED_PATH, "trips.txt", TRIPS_DTYPES)

# Cast the lookup's join key to a plain string once, outside the loop.
# Each chunk carries its own category set, so the key is normalised to a
# common dtype on both sides before merging.
trips_lookup = trips_df[["trip_id", "route_id", "direction_id", "shape_id"]].copy()
trips_lookup["trip_id"] = trips_lookup["trip_id"].astype(str)

processed_chunks: list[str] = []

for chunk_idx, chunk in enumerate(
    iter_gtfs_table(FEED_PATH, "stop_times.txt", STOP_TIMES_DTYPES)
):
    chunk["trip_id"] = chunk["trip_id"].astype(str)

    enriched = pd.merge(
        chunk,
        trips_lookup,
        on="trip_id",
        how="inner",
        validate="many_to_one",  # fail loudly if trips.txt repeats a trip_id
    )

    out_path = f"/tmp/stop_times_chunk_{chunk_idx:04d}.parquet"
    enriched.to_parquet(out_path, engine="pyarrow", index=False)
    processed_chunks.append(out_path)
    del enriched  # release immediately; do not accumulate in a list

For complex aggregations such as route-level headways or dwell times, push the operation to DuckDB or Polars after writing to Parquet. Both engines support out-of-core execution and lazy evaluation that avoids materializing the full result set in RAM. When building automated refresh cycles, schedule incremental diffs against a SHA-256 content hash so unchanged feed versions skip the entire join pipeline.
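The hash-based skip mentioned above can be sketched with the standard library alone. The function names and the idea of keeping the last processed hash in a small state file are assumptions made for illustration, not part of any GTFS tooling.

```python
import hashlib
from pathlib import Path


def feed_sha256(archive_path: str, block_size: int = 1 << 20) -> str:
    """Hash the archive in 1 MiB blocks so large feeds never load fully into RAM."""
    digest = hashlib.sha256()
    with open(archive_path, "rb") as fh:
        while block := fh.read(block_size):
            digest.update(block)
    return digest.hexdigest()


def feed_changed(archive_path: str, state_file: str) -> bool:
    """Return True when the archive differs from the last recorded hash."""
    state_path = Path(state_file)
    previous = state_path.read_text().strip() if state_path.exists() else None
    return feed_sha256(archive_path) != previous


def record_feed_hash(archive_path: str, state_file: str) -> None:
    """Store the hash only after the pipeline succeeds, so failed runs are retried."""
    Path(state_file).write_text(feed_sha256(archive_path))
```

A refresh job would call `feed_changed()` first, run the join pipeline only when it returns True, and call `record_feed_hash()` at the end. Note that this hashes the ZIP bytes: an agency that re-zips identical tables with new timestamps produces a different hash, so hashing individual members may be preferable in that case.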

Step 5: Persist to Columnar Storage

CSV is inefficient for repeated reads. Convert processed chunks to Apache Parquet immediately using pyarrow as the engine. Parquet’s columnar layout, built-in ZSTD compression, and schema preservation enable sub-second scans of specific columns in downstream queries.

import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path

def write_enriched_chunk(df: pd.DataFrame, output_dir: Path, chunk_idx: int) -> Path:
    """Convert a pandas chunk to a typed Arrow table and write as Parquet."""
    output_dir.mkdir(parents=True, exist_ok=True)
    out_path = output_dir / f"stop_times_{chunk_idx:04d}.parquet"

    table = pa.Table.from_pandas(df, preserve_index=False)
    pq.write_table(
        table,
        out_path,
        compression="zstd",
        compression_level=3,   # fast write, good ratio for transit data
    )
    return out_path


def consolidate_to_partitioned_dataset(chunk_paths: list[str], output_root: Path) -> None:
    """Merge chunk Parquet files into a route_id-partitioned dataset."""
    import pyarrow.dataset as ds

    combined = ds.dataset(chunk_paths, format="parquet")
    ds.write_dataset(
        combined,
        output_root,
        format="parquet",
        partitioning=ds.partitioning(
            pa.schema([pa.field("route_id", pa.string())]),
            flavor="hive",
        ),
        existing_data_behavior="overwrite_or_ignore",
    )

Aim for 128 MB to 256 MB per output file. Over-partitioning (e.g. one file per trip_id) inflates metadata overhead and slows directory scans without improving query performance.

Validation and Verification

After processing all chunks, validate the consolidated output against the original raw data before decommissioning the source archive.

import pandas as pd
import pyarrow.dataset as ds

def verify_output_integrity(
    archive_path: str,
    parquet_root: str,
    trips_df: pd.DataFrame,
) -> bool:
    """Assert row count parity and referential integrity of merged output."""
    # Single pass over the raw table: count every row, and count the rows an
    # inner join keeps (those whose trip_id exists in trips.txt)
    known_trip_ids = set(trips_df["trip_id"].astype(str))
    raw_row_count = 0
    raw_inner_estimate = 0
    for chunk in iter_gtfs_table(archive_path, "stop_times.txt", STOP_TIMES_DTYPES):
        raw_row_count += len(chunk)
        raw_inner_estimate += int(
            chunk["trip_id"].astype(str).isin(known_trip_ids).sum()
        )

    # Parquet row count (answered from file metadata when no filter is applied)
    output_dataset = ds.dataset(parquet_root, format="parquet")
    parquet_row_count = output_dataset.count_rows()

    print(f"Raw stop_times rows:           {raw_row_count:>12,}")
    print(f"Rows matched to trips (inner): {raw_inner_estimate:>12,}")
    print(f"Parquet output rows:           {parquet_row_count:>12,}")

    assert parquet_row_count == raw_inner_estimate, (
        f"Row count mismatch: expected {raw_inner_estimate}, got {parquet_row_count}"
    )
    print("Integrity check PASSED")
    return True

Additional verification checkpoints:

Duplicate stop sequences: df.duplicated(subset=["trip_id", "stop_sequence"]).sum() must be zero. Agencies occasionally publish trips with repeated stop_sequence values due to data entry errors.
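Referential completeness: Any trip_id in the raw stop_times.txt that is absent from trips.txt indicates feed corruption. The inner join removes these rows from the output, so compare the raw and matched counts from the integrity check, then log and quarantine the orphaned rows rather than silently dropping them.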
Coordinate bounds for shapes.txt: After processing shape data, assert all shape_pt_lat values fall within [-90, 90] and shape_pt_lon within [-180, 180]. Out-of-range coordinates are a common sign of degree/radian confusion in agency GIS exports — see coordinate reference systems for transit data for remediation.
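Two of the checkpoints above, duplicate stop sequences and coordinate bounds, can be expressed as small helpers. This is a minimal sketch: the function names are hypothetical, and it assumes the frames use the standard GTFS column names.

```python
import pandas as pd


def count_duplicate_stop_sequences(stop_times: pd.DataFrame) -> int:
    """Count rows that repeat an earlier (trip_id, stop_sequence) pair."""
    return int(stop_times.duplicated(subset=["trip_id", "stop_sequence"]).sum())


def out_of_bounds_shape_points(shapes: pd.DataFrame) -> pd.DataFrame:
    """Return shape points whose coordinates fall outside valid WGS84 ranges.

    Missing coordinates are returned too, because between() is False for NaN.
    """
    lat_ok = shapes["shape_pt_lat"].between(-90, 90)
    lon_ok = shapes["shape_pt_lon"].between(-180, 180)
    return shapes[~(lat_ok & lon_ok)]


# Example with tiny in-memory frames
stop_times = pd.DataFrame(
    {"trip_id": ["t1", "t1", "t1", "t2"], "stop_sequence": [1, 1, 2, 1]}
)
shapes = pd.DataFrame(
    {"shape_pt_lat": [40.7, 95.0], "shape_pt_lon": [-74.0, -74.0]}
)
print(count_duplicate_stop_sequences(stop_times))
print(len(out_of_bounds_shape_points(shapes)))
```

A per-chunk duplicate check misses pairs that straddle a chunk boundary, so it is safer to run it against the consolidated output, reading only the trip_id and stop_sequence columns.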

Failure Modes and Edge Cases

Times beyond 23:59:59: GTFS explicitly allows values like 25:30:00 for service that crosses midnight. Attempting to parse these with pd.to_datetime raises ValueError. Always ingest arrival_time and departure_time as "string" dtype and convert with pd.to_timedelta when arithmetic is needed. See handling frequency-based vs timetable schedules for the full time normalization workflow.
BOM-prefixed CSV exports: Some agency toolchains prepend a UTF-8 byte-order mark to .txt files. Pass encoding="utf-8-sig" to pd.read_csv to strip the BOM silently; otherwise the first column name acquires an invisible \ufeff prefix that breaks dtype lookups.
Mixed-type stop_sequence columns: A small number of agencies publish stop_sequence as "1", "2a", "2b" for interpolated stops. Specifying dtype={"stop_sequence": "int32"} will raise an error. Detect this in the profile step and fall back to "string" when non-numeric values are present.
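Category dtype misalignment during merge: When the join key is "category" on both sides but the two columns do not share identical categories, pandas does not raise; it falls back to object dtype for the merged key, silently discarding the memory savings. Because every chunk carries its own category set, cast both sides to a common dtype (for example str) before merging so the result dtype is predictable.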
ZIP archives with directory entries: Some agency exports nest .txt files inside subdirectories within the ZIP (e.g. feed/stop_times.txt). Use info.filename from zf.infolist() rather than hardcoding bare filenames.
shape_dist_traveled as empty field: Agencies that do not populate this optional column often emit it as an empty field. float32 accepts this and stores NaN, but a non-nullable integer dtype such as "int32" raises on empty fields. Use nullable types ("Int8", "Int32") for optional integer columns, and "Float32" (nullable float) only if you want pd.NA instead of NaN.
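Two of the edge cases above lend themselves to small helpers: converting over-24-hour times with pd.to_timedelta, and locating a table that sits inside a subdirectory of the ZIP. Both function names are hypothetical, and the choice of errors="coerce" (turning malformed values into NaT rather than raising) is an assumption you may want to reverse in a strict validator.

```python
import zipfile

import pandas as pd


def gtfs_time_to_timedelta(times: pd.Series) -> pd.Series:
    """Convert HH:MM:SS strings, including hours >= 24, to timedeltas.

    Values that cannot be parsed become NaT instead of raising.
    """
    return pd.to_timedelta(times, errors="coerce")


def resolve_member(zf: zipfile.ZipFile, table_name: str) -> str:
    """Return the archive path of a table, whether top-level or nested."""
    matches = [
        name
        for name in zf.namelist()
        if name == table_name or name.endswith("/" + table_name)
    ]
    if len(matches) != 1:
        raise KeyError(f"expected exactly one {table_name!r}, found {len(matches)}")
    return matches[0]


# Example: 25:30:00 is 01:30 on the day after the service date
offsets = gtfs_time_to_timedelta(pd.Series(["08:15:00", "25:30:00"]))
print(offsets.dt.total_seconds().tolist())
```

The value returned by `resolve_member()` can be passed as the table name when opening a member with ZipFile.open(), so nested and flat archives go through the same code path.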

Performance and Scale Notes

For feeds above 500 MB uncompressed, the following tuning decisions have the largest impact on throughput:

Technique	Typical RAM saving	Wall-clock impact
`"category"` dtype for `trip_id`, `stop_id`, `route_id`	50–70% reduction	Slight overhead at chunk boundaries
`float32` instead of `float64` for coordinates and distances	50% reduction	Negligible
`Int8` for pickup/drop-off type and `int32` for stop_sequence	50% (`int32`) to 75% (`Int8`, including its mask byte) reduction vs int64	Negligible
ZSTD compression for Parquet output	60–75% smaller files	+5–10% write time
Partitioning Parquet by `route_id`	No RAM saving	5–10× faster downstream filtered reads

For multi-agency pipelines, wrap each feed’s processing in a dedicated worker process with an enforced memory ceiling using resource.setrlimit. On Linux, set RLIMIT_AS (address space) rather than RLIMIT_RSS, which current kernels do not enforce. This prevents one oversized agency feed from crowding out others in a shared environment. Use Airflow or Prefect to orchestrate chunked jobs with per-task memory quotas and retry-with-backoff logic for transient network failures during archive fetch. Cache intermediate Parquet layers per feed version hash to avoid recomputing expensive joins on pipeline restarts.
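A per-worker ceiling might look like the sketch below. It assumes a Unix host and uses RLIMIT_AS, which caps virtual address space rather than resident memory, so the value needs headroom above the expected working set. `apply_memory_ceiling` and `process_feed_with_ceiling` are hypothetical names.

```python
import resource


def apply_memory_ceiling(max_bytes: int) -> int:
    """Cap this process's address space; return the soft limit actually applied.

    An existing, tighter limit is never loosened.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        max_bytes = min(max_bytes, hard)  # cannot exceed the hard limit
    if soft != resource.RLIM_INFINITY:
        max_bytes = min(max_bytes, soft)  # keep any stricter existing limit
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))
    return max_bytes


def process_feed_with_ceiling(archive_path: str, max_bytes: int) -> None:
    """Hypothetical worker entry point: set the limit, then run the pipeline."""
    apply_memory_ceiling(max_bytes)
    # Run the chunked pipeline for archive_path here. Allocations beyond the
    # ceiling raise MemoryError in this worker instead of exhausting the host.
```

Launching `process_feed_with_ceiling` as the target of a separate process per feed keeps the limit scoped to that worker, because resource limits apply to the calling process and are inherited only by its children.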

Monitor long-running workers with psutil.Process().memory_info().rss or tracemalloc to detect leaks. Chunked iteration itself should not accumulate memory, but references held across iterations will: chunks appended to a list, or a categorical column whose category set keeps growing as chunks are combined and new values appear late in a large file. Between files, explicitly delete DataFrames and call gc.collect() so memory is released before the next feed starts.
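For a quick leak check without extra dependencies, tracemalloc can report the peak traced allocation of a single call. The helper name below is hypothetical, and the measurement covers only memory that tracemalloc can see; native buffers allocated outside Python's allocators may not appear, so compare against process RSS before drawing conclusions.

```python
import tracemalloc
from typing import Any, Callable


def run_with_peak_tracking(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call fn and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / 1_048_576


def touch_blocks(n_blocks: int) -> int:
    """Stand-in workload: allocate 1 MB blocks one at a time, keeping none."""
    total = 0
    for _ in range(n_blocks):
        total += len(bytes(1_000_000))
    return total


total_bytes, peak_mib = run_with_peak_tracking(touch_blocks, 3)
print(f"Processed {total_bytes:,} bytes, peak traced memory {peak_mib:.1f} MiB")
```

If the peak grows with the number of chunks processed rather than staying near the size of one chunk, something is retaining references between iterations.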

FAQ: Common questions about large-feed processing

How large can stop_times.txt get? Metropolitan agency feeds routinely produce stop_times.txt files with 20–80 million rows, reaching 2–6 GB uncompressed. It typically accounts for 60–80 percent of total feed size.

What chunk size should I use with pd.read_csv? 250,000 to 500,000 rows is a practical starting point. Multiply by the estimated bytes per row (roughly 60–120 bytes for stop_times.txt) to verify each chunk stays under your RAM budget, then adjust based on profile_feed() output.

When should I use Polars or DuckDB instead of pandas? Reach for Polars when you need lazy evaluation and a smaller peak-RAM profile for wide joins across multiple GTFS files. Use DuckDB when your output is already in Parquet and you want SQL semantics with out-of-core execution. pandas remains the default for teams that already have pandas-based validation logic.

Can I use partridge for large feeds? Partridge’s lazy-loading model handles referential integrity filtering well but still materializes the requested subset into memory. For feeds under 300 MB uncompressed, partridge is the easier path. Above that, the manual chunked approach here gives more predictable memory behaviour. The two approaches are compared in parsing GTFS with pandas and partridge.

Related pages

Optimizing Pandas Memory Usage for Transit Feeds — detailed dtype mapping and memory profiling techniques
Automating Feed Updates with GTFS-Kit — scheduling incremental feed fetches and version-aware processing
Handling Frequency-Based vs Timetable Schedules — time normalization for overnight and headway-based service
Mastering stops.txt and stop_times.txt Relationships — GTFS relational model and foreign-key constraints
Coordinate Reference Systems for Transit Data — WGS84 bounds checking and CRS transformation
