Understanding GTFS Static Feed Structure

The GTFS static format is a ZIP archive of UTF-8 CSV files describing a transit network’s routes, stops, trips, and timetables. For Python engineers building routing engines, schedule normalizers, or mobility APIs, the challenge is not reading the CSVs—it is loading them with correct types, validating the relational graph, handling agency quirks like times beyond 24:00:00, and converting the whole thing into an efficient columnar store before any downstream computation begins. This page covers every step of that workflow in the context of the broader GTFS Feed Architecture & Fundamentals area.

Prerequisites

Python 3.10+ with pandas>=2.0, pyarrow>=14, and pathlib (stdlib)
Install in one line:
```
pip install pandas pyarrow
```
A valid GTFS static feed as a ZIP archive (test with any MobilityData sample feed)
Familiarity with relational concepts: primary keys, foreign keys, left joins, and cardinality
An understanding of why agency metadata and feed versioning matter before the feed is even opened, specifically that agency_timezone in agency.txt governs all time interpretation

Concept and Spec Background

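A compliant GTFS static feed is a ZIP archive containing UTF-8 encoded, comma-separated text files. A conventional fixed-route feed needs agency.txt, stops.txt, routes.txt, trips.txt, and stop_times.txt, plus at least one of calendar.txt and calendar_dates.txt. Every other file is conditionally required or optional.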

| File | Primary key(s) | Key foreign key(s) | Role |
| --- | --- | --- | --- |
| `agency.txt` | `agency_id` | none | Operator identity and canonical timezone |
| `stops.txt` | `stop_id` | `parent_station` (self-ref) | Spatial nodes: stops, stations, entrances |
| `routes.txt` | `route_id` | `agency_id` | Logical service lines grouped by mode |
| `trips.txt` | `trip_id` | `route_id`, `service_id`, `shape_id` | One scheduled vehicle run per row |
| `stop_times.txt` | (`trip_id`, `stop_sequence`) composite | `trip_id`, `stop_id` | Chronological arrival/departure sequence per trip |
| `calendar.txt` | `service_id` | none | Recurring weekly service pattern |
| `calendar_dates.txt` | (`service_id`, `date`) | `service_id` | Holiday exceptions and added/removed service |
| `shapes.txt` | (`shape_id`, `shape_pt_sequence`) | none | Optional: ordered polyline geometry for each trip path |
| `frequencies.txt` | (`trip_id`, `start_time`) | `trip_id` | Optional: headway-based scheduling instead of exact times |

The critical join path that every routing engine traverses is: routes.txt → trips.txt → stop_times.txt ↔ stops.txt. Any broken foreign key along this chain silently corrupts trip graphs. For example, a trip_id in stop_times.txt with no matching row in trips.txt produces a phantom trip that is unreachable from any route. Validation must verify the three joins on this path, plus routes.agency_id against agency.txt, before writing to any downstream store.

Time values in stop_times.txt are HH:MM:SS offset strings, not wall-clock times. They measure seconds from the service day’s logical noon-minus-12 origin. Values above 24:00:00 (e.g. 25:30:00) represent the early hours of the next calendar day, allowing a single overnight service run to be expressed as one continuous trip record. Timezone handling and schedule normalization explains the full conversion to UTC-aware datetimes.

Coordinate reference systems for transit data covers another constraint that affects stops.txt and shapes.txt: all latitude/longitude values must be WGS84 (EPSG:4326). Feeds from agencies in non-WGS84 projections silently place stops in the wrong country until reprojected.
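
As a rough illustration of the WGS84 constraint, the sketch below labels a coordinate pair using plain range checks. The function name `classify_stop_coordinate` and the category labels are hypothetical, and the heuristics (values outside ±90/±180 suggest projected units; exactly 0,0 suggests an unset value) are assumptions rather than rules taken from the GTFS specification.

```python
import math
from typing import Optional


def classify_stop_coordinate(lat: Optional[float], lon: Optional[float]) -> str:
    """Return a coarse label for a stop coordinate pair (hypothetical helper)."""
    if lat is None or lon is None or math.isnan(lat) or math.isnan(lon):
        return "missing"
    if abs(lat) > 90 or abs(lon) > 180:
        # Values this large cannot be degrees; they are probably metres or
        # feet in a projected CRS and need reprojection to EPSG:4326.
        return "out_of_range"
    if lat == 0 and lon == 0:
        # "Null Island": usually an unset value rather than a real stop.
        return "null_island"
    return "plausible"


samples = {
    "wgs84": (40.7128, -74.0060),
    "projected": (4507523.0, 583960.0),
    "unset": (0.0, 0.0),
    "blank": (None, None),
}
labels = {name: classify_stop_coordinate(*pair) for name, pair in samples.items()}
print(labels)
```

A range check like this only catches gross errors. Swapped axes or a different datum still fall inside the valid degree range, so they need a comparison against the agency's known service area.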

Step-by-Step Implementation

1. Extract and load all mandatory files with strict dtype enforcement

Auto-detected dtype from pandas will silently coerce stop_id values like "0042" to the integer 42, breaking every downstream join. Always supply an explicit dtype map.

import zipfile
import pandas as pd
from pathlib import Path
from typing import Dict

MANDATORY_DTYPES: Dict[str, Dict[str, str]] = {
    "agency.txt": {
        "agency_id": "string",
        "agency_name": "string",
        "agency_timezone": "string",
        "agency_url": "string",
    },
    "stops.txt": {
        "stop_id": "string",
        "stop_name": "string",
        "stop_lat": "float32",
        "stop_lon": "float32",
        "location_type": "Int8",
        "parent_station": "string",
    },
    "routes.txt": {
        "route_id": "string",
        "agency_id": "string",
        "route_short_name": "string",
        "route_long_name": "string",
        # Int16, not Int8: extended route types (e.g. 700) exceed 127.
        "route_type": "Int16",
    },
    "trips.txt": {
        "trip_id": "string",
        "route_id": "string",
        "service_id": "string",
        "direction_id": "Int8",
        "shape_id": "string",
        "trip_headsign": "string",
    },
    "stop_times.txt": {
        "trip_id": "string",
        "stop_id": "string",
        "stop_sequence": "Int32",
        "arrival_time": "string",
        "departure_time": "string",
        "pickup_type": "Int8",
        "drop_off_type": "Int8",
    },
    "calendar.txt": {
        "service_id": "string",
        "monday": "Int8", "tuesday": "Int8", "wednesday": "Int8",
        "thursday": "Int8", "friday": "Int8",
        "saturday": "Int8", "sunday": "Int8",
        "start_date": "string",
        "end_date": "string",
    },
}


def load_gtfs_feed(zip_path: str) -> Dict[str, pd.DataFrame]:
    """
    Open a GTFS static ZIP, load all mandatory files with explicit dtypes,
    and return a dict keyed by filename.

    Raises FileNotFoundError if any mandatory file is absent.
    Raises ValueError if any mandatory column is missing in a loaded file.
    """
    feed: Dict[str, pd.DataFrame] = {}

    with zipfile.ZipFile(zip_path, "r") as archive:
        available = set(archive.namelist())

        # calendar.txt and calendar_dates.txt: one or both must be present
        if "calendar.txt" not in available and "calendar_dates.txt" not in available:
            raise FileNotFoundError(
                "Feed must contain calendar.txt, calendar_dates.txt, or both."
            )

        for filename, dtype_map in MANDATORY_DTYPES.items():
            if filename == "calendar.txt" and filename not in available:
                continue  # calendar_dates.txt alone is sufficient

            if filename not in available:
                raise FileNotFoundError(f"Missing mandatory GTFS file: {filename}")

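            with archive.open(filename) as fh:
                # The default C engine applies the dtype map while parsing,
                # so IDs such as "0042" keep their leading zeros. Some pandas
                # releases cast after type inference under engine="pyarrow",
                # which can strip them; verify before switching engines.
                df = pd.read_csv(
                    fh,
                    dtype=dtype_map,
                    keep_default_na=False,
                    na_values=[""],
                )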

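            missing_cols = set(dtype_map) - set(df.columns)
            # Columns the spec marks optional or conditionally required may
            # be absent (e.g. agency_id in a single-agency feed).
            optional = {"agency_id", "location_type", "parent_station",
                        "route_short_name", "route_long_name",
                        "direction_id", "shape_id", "trip_headsign",
                        "pickup_type", "drop_off_type"}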
            missing_required = missing_cols - optional
            if missing_required:
                raise ValueError(
                    f"{filename} is missing required columns: {missing_required}"
                )

            feed[filename] = df

    return feed
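
To make the leading-zero problem concrete, here is a minimal sketch that parses the same two-row stops.txt fragment twice: once with inferred dtypes and once with an explicit string dtype. The sample rows and the variable names `inferred` and `typed` are invented for illustration.

```python
import io

import pandas as pd

# Invented sample data: two stops whose IDs carry leading zeros.
STOPS_CSV = (
    "stop_id,stop_name,stop_lat,stop_lon\n"
    "0042,Main St,40.7128,-74.0060\n"
    "0107,Elm Ave,40.7306,-73.9866\n"
)

# 1) Let pandas infer the dtype: the IDs are parsed as integers.
inferred = pd.read_csv(io.StringIO(STOPS_CSV))

# 2) Force stop_id to string: the IDs survive exactly as published.
typed = pd.read_csv(
    io.StringIO(STOPS_CSV),
    dtype={"stop_id": "string"},
    keep_default_na=False,
    na_values=[""],
)

print(inferred["stop_id"].tolist())  # [42, 107]
print(typed["stop_id"].tolist())     # ['0042', '0107']
```

Running a check like this against a sample of your own feed is a cheap way to confirm that a given pandas version and parser engine preserve identifiers before trusting the loader on production data.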

2. Validate the relational dependency graph

Check every foreign key join before touching any data. Orphaned records are among the most common causes of silent routing failures in production pipelines.

from typing import List


def validate_referential_integrity(
    feed: Dict[str, pd.DataFrame]
) -> Dict[str, List[str]]:
    """
    Returns a dict of {table: [error_message, ...]} for all FK violations.
    An empty dict means the feed is referentially clean.
    """
    errors: Dict[str, List[str]] = {}

    def _check_fk(
        child_table: str,
        child_col: str,
        parent_table: str,
        parent_col: str,
    ) -> None:
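        child_df, parent_df = feed[child_table], feed[parent_table]
        # Optional FK columns (e.g. routes.agency_id in a single-agency
        # feed) may be absent; there is nothing to check in that case.
        if child_col not in child_df.columns or parent_col not in parent_df.columns:
            return
        child_vals = set(child_df[child_col].dropna())
        parent_vals = set(parent_df[parent_col].dropna())
        orphans = child_vals - parent_vals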
        if orphans:
            msg = (
                f"{child_table}.{child_col} has {len(orphans)} value(s) "
                f"not present in {parent_table}.{parent_col}: "
                f"{sorted(orphans)[:5]}{'...' if len(orphans) > 5 else ''}"
            )
            errors.setdefault(child_table, []).append(msg)

    _check_fk("routes.txt", "agency_id", "agency.txt", "agency_id")
    _check_fk("trips.txt", "route_id", "routes.txt", "route_id")
    _check_fk("stop_times.txt", "trip_id", "trips.txt", "trip_id")
    _check_fk("stop_times.txt", "stop_id", "stops.txt", "stop_id")

    # Check shape_id integrity when shapes.txt is present
    if "shapes.txt" in feed and "shape_id" in feed["trips.txt"].columns:
        _check_fk("trips.txt", "shape_id", "shapes.txt", "shape_id")

    return errors
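
The foreign-key checks above do not cover `trips.service_id`, which should resolve against calendar.txt, calendar_dates.txt, or both. The following is a minimal sketch of that check, written against plain iterables so it does not depend on how the calendar files were loaded. The function name `find_orphan_service_ids` and the sample IDs are hypothetical.

```python
from typing import Iterable, List


def find_orphan_service_ids(
    trip_service_ids: Iterable[str],
    calendar_service_ids: Iterable[str] = (),
    calendar_dates_service_ids: Iterable[str] = (),
) -> List[str]:
    """Return service_ids used by trips but defined in neither calendar file."""
    # A service_id is valid if either file defines it.
    defined = set(calendar_service_ids) | set(calendar_dates_service_ids)
    return sorted(set(trip_service_ids) - defined)


orphans = find_orphan_service_ids(
    trip_service_ids=["WKDY", "WKDY", "SAT", "XMAS"],
    calendar_service_ids=["WKDY", "SAT"],
    calendar_dates_service_ids=["HOLIDAY"],
)
print(orphans)  # ['XMAS']
```

With the feed dict from step 1, the first argument would be `feed["trips.txt"]["service_id"].dropna()`. Note that the loader shown earlier does not read calendar_dates.txt, so that file would have to be loaded separately before its IDs can be passed in.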

3. Check for duplicate composite keys in stop_times.txt

stop_times.txt has no single-column primary key. The composite (trip_id, stop_sequence) must be unique. Many agency export scripts produce duplicates when the same trip is modified and re-exported without deduplication.

def check_stop_times_composite_key(
    stop_times: pd.DataFrame,
) -> pd.DataFrame:
    """
    Returns duplicate rows keyed on (trip_id, stop_sequence).
    An empty DataFrame means no duplicates exist.
    """
    dupe_mask = stop_times.duplicated(
        subset=["trip_id", "stop_sequence"], keep=False
    )
    duplicates = stop_times[dupe_mask].sort_values(
        ["trip_id", "stop_sequence"]
    )
    print(
        f"Duplicate (trip_id, stop_sequence) pairs: "
        f"{dupe_mask.sum()} rows across "
        f"{duplicates[['trip_id', 'stop_sequence']].drop_duplicates().shape[0]} key pairs."
    )
    return duplicates

4. Parse and normalize time strings

stop_times.txt arrival and departure values are strings, not datetimes. Parse them defensively, converting over-24-hour values into timedeltas anchored to the service date.

import re
from datetime import date, timedelta, datetime
import zoneinfo


_TIME_RE = re.compile(r"^(\d+):([0-5]\d):([0-5]\d)$")


def gtfs_time_to_timedelta(time_str: str) -> timedelta:
    """
    Convert a GTFS HH:MM:SS string (possibly HH >= 24) to a timedelta
    offset from midnight of the service date.
    """
    m = _TIME_RE.match(time_str.strip())
    if not m:
        raise ValueError(f"Malformed GTFS time string: {time_str!r}")
    hours, minutes, seconds = int(m.group(1)), int(m.group(2)), int(m.group(3))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def resolve_gtfs_datetime(
    time_str: str,
    service_date: date,
    agency_tz: str,
) -> datetime:
    """
    Convert a GTFS time offset string and a service date to a
    timezone-aware UTC datetime.

    Args:
        time_str:     GTFS arrival_time or departure_time value.
        service_date: The calendar date of the service run (from calendar.txt).
        agency_tz:    IANA timezone name from agency.txt (e.g. 'America/New_York').

    Returns:
        A UTC-normalised, timezone-aware datetime.
    """
    tz = zoneinfo.ZoneInfo(agency_tz)
    offset = gtfs_time_to_timedelta(time_str)
    # Anchor to midnight of the service date in the agency's local timezone
    local_midnight = datetime(
        service_date.year, service_date.month, service_date.day,
        tzinfo=tz,
    )
    local_dt = local_midnight + offset
    return local_dt.astimezone(zoneinfo.ZoneInfo("UTC"))
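
The function above anchors at local midnight, which matches the noon-minus-12-hours definition on every day except those with a daylight-saving transition. As a sketch of a stricter reading of that definition, the variant below anchors at local noon, converts to UTC, and only then subtracts 12 hours and adds the offset. The function names `service_day_origin_utc` and `gtfs_time_to_utc` are hypothetical, and the parser assumes a well-formed H:MM:SS string.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def service_day_origin_utc(service_date: date, agency_tz: str) -> datetime:
    """Return 'noon minus 12 hours' for the service date as a UTC datetime."""
    local_noon = datetime.combine(
        service_date, time(12, 0), tzinfo=ZoneInfo(agency_tz)
    )
    # Subtract in UTC so the 12 hours are elapsed time, not wall-clock time.
    return local_noon.astimezone(timezone.utc) - timedelta(hours=12)


def gtfs_time_to_utc(time_str: str, service_date: date, agency_tz: str) -> datetime:
    """Resolve a GTFS offset string against the noon-minus-12 origin."""
    hours, minutes, seconds = (int(part) for part in time_str.strip().split(":"))
    offset = timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return service_day_origin_utc(service_date, agency_tz) + offset


# 2024-03-10 is the US spring-forward date: local midnight is 05:00 UTC,
# but the service-day origin is one hour earlier.
spring_forward = date(2024, 3, 10)
origin = service_day_origin_utc(spring_forward, "America/New_York")
morning = gtfs_time_to_utc("08:00:00", spring_forward, "America/New_York")
overnight = gtfs_time_to_utc("25:30:00", date(2024, 1, 15), "America/New_York")

print(origin.isoformat())     # 2024-03-10T04:00:00+00:00
print(morning.isoformat())    # 2024-03-10T12:00:00+00:00
print(overnight.isoformat())  # 2024-01-16T06:30:00+00:00
```

On the transition date the two approaches agree for times after the clock change (08:00:00 still lands on 08:00 local) and differ by an hour for times before it. Which behaviour a given agency intends is worth confirming against its published timetable.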

Validation and Verification

After loading and running referential checks, assert the following invariants before writing to any store:

def assert_feed_valid(
    feed: Dict[str, pd.DataFrame],
    integrity_errors: Dict[str, List[str]],
) -> None:
    """
    Raise AssertionError if any integrity violations are present,
    or if key tables are empty.
    """
    assert not integrity_errors, (
        f"Referential integrity failures detected:\n"
        + "\n".join(
            f"  {tbl}: {err}"
            for tbl, errs in integrity_errors.items()
            for err in errs
        )
    )

    assert len(feed["stops.txt"]) > 0, "stops.txt contains no rows"
    assert len(feed["trips.txt"]) > 0, "trips.txt contains no rows"
    assert len(feed["stop_times.txt"]) > 0, "stop_times.txt contains no rows"

    # Every stop that appears in stop_times.txt must have valid coordinates
    used_stops = feed["stop_times.txt"]["stop_id"].unique()
    coord_check = feed["stops.txt"][
        feed["stops.txt"]["stop_id"].isin(used_stops)
    ]
    bad_lat = coord_check["stop_lat"].isna() | (
        coord_check["stop_lat"].abs() > 90
    )
    bad_lon = coord_check["stop_lon"].isna() | (
        coord_check["stop_lon"].abs() > 180
    )
    assert not bad_lat.any(), (
        f"{bad_lat.sum()} stops with invalid stop_lat used in stop_times.txt"
    )
    assert not bad_lon.any(), (
        f"{bad_lon.sum()} stops with invalid stop_lon used in stop_times.txt"
    )

    print("Feed validation passed: referential integrity clean, coordinates valid.")

For schema-level rule enforcement beyond these custom checks — detecting invalid route_type integers, missing stop_name values, and malformed agency_url fields — integrate the GTFS validation rules and common schema errors workflow, which also covers running the canonical gtfs-validator CLI.

Failure Modes and Edge Cases

Leading-zero stop_id truncation. Some agencies issue numeric stop identifiers like "0042". Without dtype={"stop_id": "string"}, pandas silently reads 0042 as integer 42, and the join against stop_times.txt fails completely with zero error messages.
Times above 24:00:00. Overnight services on the same operational day use values like 26:15:00. Python’s datetime.strptime raises ValueError on these. Always parse with a regex-based offset approach (as shown above) rather than the standard library time parser.
Missing calendar.txt with only calendar_dates.txt. Many North American agencies express their service entirely through date exceptions in calendar_dates.txt and omit calendar.txt. Each file is optional on its own, but at least one must be present; the validator must accept either or both.
parent_station self-referential rows. A station complex in stops.txt appears as a parent row (location_type=1) plus child rows whose parent_station references the parent’s stop_id. If stop_id and parent_station are loaded with different dtypes (for example, one inferred as numeric and the other as string), the self-join can fail or miss matches; load both as strings.
Duplicate agency_id in multi-agency feeds. Some publishers merge feeds from several agencies into one ZIP. If two agencies share an agency_id, every route that references it becomes ambiguous. Validate uniqueness with assert feed["agency.txt"]["agency_id"].nunique() == len(feed["agency.txt"]).
shape_id references without shapes.txt. trips.txt may populate the shape_id column even when no shapes.txt file is present. Routing engines expecting polylines will silently fall back to straight-line geometry. Always check "shapes.txt" in feed before accessing shape data.
route_type out of range. The GTFS spec defines the integers 0–7, 11, and 12, and the widely used extended route types add codes from 100 into the 1700s. Real feeds contain values like -1, 999, or empty strings for unknown modes. Mapping GTFS route types to standard categories provides a complete lookup table and fallback heuristics.
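
For the route_type case, a first-pass triage might look like the sketch below. It only separates the basic spec values from everything else; the function name `classify_route_type` and the labels are hypothetical, and resolving the "needs_lookup" bucket against an extended-type table is deliberately left out.

```python
from typing import List, Optional

# Basic route_type values defined by the GTFS specification.
BASIC_ROUTE_TYPES = frozenset({0, 1, 2, 3, 4, 5, 6, 7, 11, 12})


def classify_route_type(value: Optional[int]) -> str:
    """Coarse triage of a route_type value (hypothetical helper)."""
    if value is None:
        return "missing"
    if value in BASIC_ROUTE_TYPES:
        return "basic"
    if value < 0:
        return "invalid"
    # Could be a legitimate extended code or garbage; a lookup table decides.
    return "needs_lookup"


observed: List[Optional[int]] = [3, 700, -1, 999, None, 12]
route_type_labels = [classify_route_type(v) for v in observed]
print(route_type_labels)
```

When the values come from a pandas nullable integer column, missing entries are `pd.NA` rather than `None`, so they would need converting (or an extra `pd.isna` check) before calling a helper like this.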

Performance and Scale Notes

For feeds smaller than ~100 MB uncompressed, the code above runs in a few seconds. Feeds from large metropolitan agencies frequently exceed 500 MB uncompressed, and stop_times.txt alone may hold 20 million rows. At that scale, three changes are necessary.

Chunked reading for stop_times.txt. Pandas can iterate over CSV chunks without loading the full file into RAM:

def stream_stop_times(
    archive: zipfile.ZipFile,
    chunksize: int = 500_000,
) -> pd.DataFrame:
    """
    Read stop_times.txt in chunks and return one concatenated DataFrame.

    Chunking caps the parser's peak memory, but the result still holds every
    row (suitable for feeds up to ~5M rows). For larger feeds, write each
    chunk directly to Parquet instead of concatenating.
    """
    chunks = []
    dtype_map = MANDATORY_DTYPES["stop_times.txt"]

    with archive.open("stop_times.txt") as fh:
        # chunksize is not supported by engine="pyarrow", so this uses the
        # default C engine.
        for chunk in pd.read_csv(
            fh,
            dtype=dtype_map,
            keep_default_na=False,
            na_values=[""],
            chunksize=chunksize,
        ):
            chunks.append(chunk)

    return pd.concat(chunks, ignore_index=True)

Parquet conversion for analytical queries. Once validated, convert each DataFrame to Parquet through pyarrow with Snappy compression. Columnar storage can cut stop_times.txt query time substantially for trip-based filters (on the order of 60–80%, depending on the query and hardware) because only the columns a query touches, such as trip_id and arrival_time, need to be read. The memory-efficient processing for large feeds page details multi-agency batching and Parquet partitioning strategies.

import pyarrow as pa
import pyarrow.parquet as pq


def save_feed_to_parquet(
    feed: Dict[str, pd.DataFrame],
    output_dir: str,
) -> None:
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for filename, df in feed.items():
        stem = filename.replace(".txt", "")
        pq.write_table(
            pa.Table.from_pandas(df),
            out / f"{stem}.parquet",
            compression="snappy",
        )
        print(f"Written {len(df):,} rows → {stem}.parquet")

Multi-agency batching. When processing dozens of feeds, open each ZIP in a temporary directory context, validate, convert to Parquet, and discard the raw DataFrames before opening the next feed. Keeping all feeds in memory simultaneously can exhaust RAM on a standard machine. If you are automating feed updates with gtfs-kit, the library wraps much of this loading logic, but you still need explicit version tracking to avoid overwriting clean data with a malformed update.

Frequently Asked Questions

What files are mandatory in a GTFS static feed?

The GTFS specification requires agency.txt, stops.txt, routes.txt, trips.txt, stop_times.txt, and at least one of calendar.txt or calendar_dates.txt. A feed missing any of these will fail most validator tools and cannot be loaded reliably.

Why do GTFS time values exceed 24:00:00?

GTFS time strings are offsets from the logical service day start (noon minus 12 hours), not wall-clock times. A value of 25:30:00 means 1:30 AM of the next calendar day, allowing overnight trips to be modelled on a single service date without splitting the trip record at midnight.

What causes orphaned stop_times records?

Orphans appear when a stop_id referenced in stop_times.txt has been deleted from stops.txt, or when a feed publisher exports tables from separate database snapshots taken at different times. The fix is a left-join diagnostic followed by dropping or remapping rows where the join returns null.

Mastering stops.txt and stop_times.txt Relationships — deep dive into the spatial-temporal join and sequence integrity checks
Timezone Handling and Schedule Normalization — converting GTFS time offsets to UTC-aware datetimes across DST boundaries
GTFS Validation Rules and Common Schema Errors — full rule catalogue with programmatic fixes for spec violations
Mapping GTFS Route Types to Standard Categories — integer enumeration lookup table and fallback heuristics for unknown mode codes
Memory-Efficient Processing for Large Feeds — chunked reading, Parquet output, and multi-agency batching strategies
