Converting GTFS frequencies.txt to Exact Departure Times in Python

Use pd.to_timedelta() to parse start_time and end_time and pd.to_numeric() to parse headway_secs from frequencies.txt, then generate departure_time = start_time + n × headway_secs for each integer n ≥ 0 where departure_time < end_time. Tag every generated row with its exact_times flag before passing it downstream. The result is a departure table that follows the GTFS boundary rule and is ready for routing engines; the core expansion logic is only a few lines of pandas code, and most of the production function below is validation and logging.

Root cause: why frequencies.txt needs explicit expansion

GTFS frequencies.txt is a compression format, not a departure schedule. A single row like (trip_id=101, start_time=06:00:00, end_time=22:00:00, headway_secs=600, exact_times=1) represents 96 distinct departure events (06:00:00 through 21:50:00, because the last departure must fall strictly before end_time) stored as one record. Whether you load the table with raw pandas read_csv or with partridge for strict foreign-key enforcement, the interval stays compressed until you expand it. Routing engines (OpenTripPlanner, Valhalla, R5) and passenger-facing APIs all need those events enumerated as rows, not compressed into a single interval record.
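Under the strictly-before-end_time rule, the number of departures in a window is a ceiling division of the window length by the headway. The sketch below works that arithmetic through with the standard library only; the helper names gtfs_seconds and count_departures are hypothetical and exist purely for illustration.

```python
def gtfs_seconds(hhmmss: str) -> int:
    """Convert a GTFS HH:MM:SS string (hours may exceed 23) to seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def count_departures(start_time: str, end_time: str, headway_secs: int) -> int:
    """Count integers n >= 0 with start + n * headway strictly before end."""
    window = gtfs_seconds(end_time) - gtfs_seconds(start_time)
    if headway_secs <= 0 or window <= 0:
        return 0  # invalid rows contribute no departures
    # Integer ceiling division: ceil(window / headway_secs)
    return (window + headway_secs - 1) // headway_secs


print(count_departures("06:00:00", "22:00:00", 600))  # 96
print(count_departures("23:00:00", "25:30:00", 900))  # 10
```

A count like this is a cheap way to estimate the size of the expanded table before running the expansion itself.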

Three ETL anti-patterns cause the most production bugs with frequency expansion:

Using datetime.strptime instead of pd.to_timedelta: GTFS time strings can exceed 23:59:59 (for example 25:30:00 represents 1:30 AM the next service day). strptime raises ValueError on these; to_timedelta converts them to Timedelta('1 days 01:30:00'), preserving correct chronological order across midnight service boundaries.
Misreading exact_times semantics: when exact_times=0 (or the column is absent, which defaults to 0 per spec), the expanded window is a headway guideline; routing engines may shift individual departures using real-time AVL data. When exact_times=1, every generated timestamp is a hard schedule constraint. Treating exact_times=0 output as fixed-schedule data causes incorrect connection windows in multi-modal routing graphs and silent mismatches when GTFS-RT TripUpdate messages arrive with different timestamps.
Skipping pre-loop validation: a headway_secs=0 row raises ZeroDivisionError in the interval-count division (or loops forever in a while-based expansion), and an inverted window where end_time <= start_time silently produces zero departures. Both appear in real agency feeds as placeholders for cancelled service windows. The GTFS validation tooling catches these at feed-ingest time, but the expansion function must guard against them independently.
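The first anti-pattern is easy to reproduce. The following minimal sketch, assuming pandas is installed, parses the same over-24-hour string with both approaches; the variable names are illustrative only.

```python
from datetime import datetime

import pandas as pd

gtfs_time = "25:30:00"  # 01:30 on the calendar day after the service day starts

# strptime validates the hour field against 0-23 and rejects the value.
try:
    datetime.strptime(gtfs_time, "%H:%M:%S")
    strptime_ok = True
except ValueError:
    strptime_ok = False

# to_timedelta treats the string as a duration since service-day midnight.
parsed = pd.to_timedelta(gtfs_time)

print(strptime_ok)  # False
print(parsed)       # 1 days 01:30:00
```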

Expansion pipeline

The pipeline diagram traces the data flow from raw frequencies.txt through time parsing, exact_times backfill, validation, and arithmetic expansion to the final departure table.

Production-ready Python implementation

The function below parses GTFS time strings (including values past 23:59:59), validates rows before expansion, respects exact_times, and reports dropped rows and output counts through the standard logging module so failures surface in pipeline dashboards.

import logging
import pandas as pd
from datetime import timedelta
from typing import Optional

logger = logging.getLogger(__name__)


def expand_frequencies_to_departures(
    frequencies_df: pd.DataFrame,
    exact_times_filter: Optional[int] = None,
) -> pd.DataFrame:
    """
    Expands a GTFS frequencies.txt DataFrame into individual departure events.

    Parameters
    ----------
    frequencies_df : pd.DataFrame
        Raw frequencies.txt loaded from a GTFS feed.
        Required columns: trip_id, start_time, end_time, headway_secs.
        Optional column: exact_times (defaults to 0 per GTFS spec).
    exact_times_filter : int or None
        Pass 0 or 1 to restrict expansion to one scheduling mode;
        None (default) expands all rows regardless of exact_times value.

    Returns
    -------
    pd.DataFrame
        Columns: trip_id (str), departure_time (timedelta), exact_times (int8)
    """
    if frequencies_df.empty:
        logger.warning("expand_frequencies_to_departures: input DataFrame is empty")
        return pd.DataFrame(columns=["trip_id", "departure_time", "exact_times"])

    freq = frequencies_df.copy()

    # GTFS times are HH:MM:SS relative to service-day midnight.
    # Values > 23:59:59 are valid (e.g. 25:30:00 = 01:30 next day).
    # pd.to_timedelta handles these correctly; datetime.strptime does not.
    freq["start_time"] = pd.to_timedelta(freq["start_time"], errors="coerce")
    freq["end_time"] = pd.to_timedelta(freq["end_time"], errors="coerce")
    freq["headway_secs"] = pd.to_numeric(freq["headway_secs"], errors="coerce")

    # Backfill missing exact_times with 0 per GTFS default
    if "exact_times" not in freq.columns:
        freq["exact_times"] = 0
    freq["exact_times"] = (
        pd.to_numeric(freq["exact_times"], errors="coerce").fillna(0).astype("int8")
    )

    # Validate: drop rows missing any required field or with invalid headway
    before_count = len(freq)
    freq = freq.dropna(subset=["trip_id", "start_time", "end_time", "headway_secs"])
    freq = freq[freq["headway_secs"] > 0]
    freq = freq[freq["end_time"] > freq["start_time"]]
    dropped = before_count - len(freq)
    if dropped:
        logger.warning(
            "Dropped %d invalid frequencies.txt rows (missing fields, "
            "non-positive headway, or end_time <= start_time)",
            dropped,
        )

    if exact_times_filter is not None:
        freq = freq[freq["exact_times"] == exact_times_filter]

    if freq.empty:
        logger.info("No rows remain after filtering; returning empty departure table")
        return pd.DataFrame(columns=["trip_id", "departure_time", "exact_times"])

    expanded: list[dict] = []
    for _, row in freq.iterrows():
        trip_id: str = str(row["trip_id"])
        start: timedelta = row["start_time"]
        end: timedelta = row["end_time"]
        headway: int = int(row["headway_secs"])
        exact_flag: int = int(row["exact_times"])

        # Integer division gives the maximum possible interval count;
        # the boundary check below ensures we never emit dep_time >= end_time.
        total_seconds = int((end - start).total_seconds())
        n_intervals = total_seconds // headway

        for i in range(n_intervals + 1):
            dep_time = start + timedelta(seconds=i * headway)
            if dep_time >= end:
                break
            expanded.append(
                {
                    "trip_id": trip_id,
                    "departure_time": dep_time,
                    "exact_times": exact_flag,
                }
            )

    result = pd.DataFrame(expanded)
    if not result.empty:
        result["exact_times"] = result["exact_times"].astype("int8")

    logger.info(
        "Expanded %d frequencies.txt rows into %d departure events",
        len(freq),
        len(result),
    )
    return result

Step-by-step walkthrough

1 — Time string parsing with `pd.to_timedelta()`

freq["start_time"] = pd.to_timedelta(freq["start_time"], errors="coerce")

pandas converts "06:30:00" to Timedelta('0 days 06:30:00') and "25:30:00" to Timedelta('1 days 01:30:00'). Using errors="coerce" turns unparseable strings into NaT, which the later dropna() removes. The alternative, datetime.strptime with a %H:%M:%S format, raises ValueError on any hour value above 23, which makes it unsuitable for GTFS feeds that carry overnight service or post-midnight trips.
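A small illustration of the coercion behaviour, assuming pandas is installed; the sample strings are made up for the example.

```python
import pandas as pd

raw = pd.Series(["06:30:00", "25:30:00", "not-a-time"])

# Unparseable entries become NaT instead of raising.
parsed = pd.to_timedelta(raw, errors="coerce")

print(parsed.isna().tolist())                       # [False, False, True]
print(parsed.dropna().dt.total_seconds().tolist())  # [23400.0, 91800.0]
```

Because NaT rows survive parsing, the row count only changes at the explicit dropna() step, which is where the dropped-row count gets logged.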

2 — Validation before expansion

freq = freq[freq["headway_secs"] > 0]
freq = freq[freq["end_time"] > freq["start_time"]]

Filtering before the loop prevents the ZeroDivisionError that a zero headway would raise in the interval-count division, and it removes inverted time windows explicitly so they are counted in the warning log instead of silently yielding no departures. Some real-world agency feeds contain headway_secs=0 as a placeholder for cancelled service windows; logging the count lets QA teams catch upstream data quality regressions early.
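If QA needs to know why rows were dropped, not just how many, the two conditions can be kept as separate boolean masks. This is one possible sketch, assuming pandas is installed; the sample rows and the names bad_headway and bad_window are hypothetical.

```python
import pandas as pd

freq = pd.DataFrame(
    {
        "trip_id": ["A", "B", "C"],
        "start_time": pd.to_timedelta(["06:00:00", "06:00:00", "22:00:00"]),
        "end_time": pd.to_timedelta(["09:00:00", "09:00:00", "21:00:00"]),
        "headway_secs": [600, 0, 600],
    }
)

# One mask per failure reason, so each can be counted and logged separately.
bad_headway = freq["headway_secs"] <= 0
bad_window = freq["end_time"] <= freq["start_time"]

print(int(bad_headway.sum()), int(bad_window.sum()))  # 1 1

valid = freq[~(bad_headway | bad_window)]
print(valid["trip_id"].tolist())  # ['A']
```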

3 — Backfilling `exact_times`

if "exact_times" not in freq.columns:
    freq["exact_times"] = 0

The GTFS spec makes exact_times optional and defines its absence as equivalent to 0. Many agencies omit the column entirely. Making the default explicit prevents KeyError and documents the assumption in code rather than leaving it to tribal knowledge.
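The backfill has two cases: the column is missing entirely, or it exists with blank cells. The sketch below covers both, assuming pandas is installed; both sample DataFrames are hypothetical.

```python
import pandas as pd

# Case 1: the feed has no exact_times column at all.
no_column = pd.DataFrame({"trip_id": ["A"], "headway_secs": [600]})
if "exact_times" not in no_column.columns:
    no_column["exact_times"] = 0

# Case 2: the column exists but some cells are empty.
blank_cells = pd.DataFrame({"trip_id": ["B", "C"], "exact_times": [None, "1"]})
blank_cells["exact_times"] = (
    pd.to_numeric(blank_cells["exact_times"], errors="coerce")
    .fillna(0)          # empty cell means 0 per the GTFS default
    .astype("int8")
)

print(no_column["exact_times"].tolist())    # [0]
print(blank_cells["exact_times"].tolist())  # [0, 1]
```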

4 — The expansion loop

total_seconds = int((end - start).total_seconds())
n_intervals = total_seconds // headway
for i in range(n_intervals + 1):
    dep_time = start + timedelta(seconds=i * headway)
    if dep_time >= end:
        break

range(n_intervals + 1) is needed when the window is not an exact multiple of headway_secs: floor division then leaves a remainder, and index n_intervals still falls before end_time. When the window is an exact multiple, index n_intervals lands exactly on end_time, and the break on dep_time >= end drops it, enforcing the GTFS boundary rule that the last departure must be strictly before end_time. The double guard (loop ceiling plus boundary check) is intentional: the ceiling never undercounts, and the >= end check never lets a departure reach end_time.
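The two boundary cases can be checked directly. The sketch below wraps the same loop in a hypothetical helper, expand_window, using only the standard library, and runs it on one window that is an exact multiple of the headway and one that is not.

```python
from datetime import timedelta


def expand_window(start: timedelta, end: timedelta, headway: int) -> list[timedelta]:
    """Mirror the walkthrough loop for a single window (illustrative helper)."""
    total_seconds = int((end - start).total_seconds())
    n_intervals = total_seconds // headway
    departures = []
    for i in range(n_intervals + 1):
        dep_time = start + timedelta(seconds=i * headway)
        if dep_time >= end:
            break
        departures.append(dep_time)
    return departures


six = timedelta(hours=6)

# Exact multiple: 3600 // 600 = 6, and i = 6 lands on end_time, so it is dropped.
exact = expand_window(six, timedelta(hours=7), 600)
# Non-multiple: 3900 // 600 = 6, and i = 6 gives 07:00:00, still before 07:05:00.
partial = expand_window(six, timedelta(hours=7, minutes=5), 600)

print(len(exact), exact[-1])      # 6 6:50:00
print(len(partial), partial[-1])  # 7 7:00:00
```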

5 — `exact_times` tag on output rows

Each output row carries the exact_times flag from its source frequencies.txt row. Downstream consumers — routing engines, GTFS-RT merge layers, passenger apps — must branch on this value. Omitting it forces consumers to re-join against the original frequencies.txt, which is fragile once the feed is updated.

Verification and output

After calling expand_frequencies_to_departures(), run these assertions before writing to storage:

departures = expand_frequencies_to_departures(freq_df)

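# 1. No departure meets or exceeds the latest end_time of its trip_id.
# A trip_id may have several windows, so compare against the per-trip maximum;
# a row-by-row merge on trip_id would pair departures with the wrong window.
latest_end = (
    freq_df.assign(
        trip_id=freq_df["trip_id"].astype(str),
        end_time=pd.to_timedelta(freq_df["end_time"], errors="coerce"),
    )
    .groupby("trip_id")["end_time"]
    .max()
)
window_end = departures["trip_id"].map(latest_end)
assert (departures["departure_time"] < window_end).all(), \
    "One or more departures meet or exceed end_time"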

# 2. No duplicate trip_id + departure_time combinations
dupes = departures.duplicated(subset=["trip_id", "departure_time"])
assert not dupes.any(), f"{dupes.sum()} duplicate trip_id+departure_time rows found"

# 3. exact_times is always 0 or 1
assert departures["exact_times"].isin([0, 1]).all(), \
    "Unexpected exact_times values outside {0, 1}"

# 4. Overnight times are preserved as timedelta, not coerced to 00:00:00
overnight = departures[departures["departure_time"] >= pd.Timedelta(hours=24)]
print(f"Departures after midnight: {len(overnight)} rows")
print(departures.dtypes)
# departure_time    timedelta64[ns]  ← correct
# exact_times                 int8   ← correct

Expected diagnostic output for a feed with mixed day and overnight service (the row count is illustrative and depends on the feed):

Departures after midnight: 142 rows
trip_id            object
departure_time    timedelta64[ns]
exact_times             int8
dtype: object

Gotchas and edge cases

Multiple frequencies.txt rows for one trip_id: The GTFS spec allows a single trip_id to have several non-overlapping service windows (peak and off-peak headways). The function handles these correctly by processing each row independently, but overlapping windows produce duplicate departure events. Add a drop_duplicates(subset=["trip_id", "departure_time"]) step if your agency data is not pre-validated at ingest time with the GTFS validation tooling.
headway_secs varies within a trip window: Some feeds change headway mid-service day using sequential rows (06:00–09:00 at 10 min, 09:00–15:00 at 15 min). The function processes each row independently, which is correct — but ensure your end_time in row N exactly matches start_time in row N+1; otherwise a gap in departures appears. Add a continuity assertion if your pipeline depends on gapless service coverage.
Memory scaling for high-frequency metro lines: A single route running every 90 seconds for 20 hours produces 800 departure events per trip. For feeds with hundreds of high-frequency trips, the expanded list can consume several hundred megabytes. Switch to a generator that yields directly to a Parquet writer or database cursor when the full table does not fit in RAM — the memory-efficient processing guide covers chunked and generator-based strategies in detail.
Feeding expanded departures into GTFS-RT merge pipelines: When overlaying GTFS-RT TripUpdate stop-time updates onto expanded departures, trip_id must match the original frequencies.txt value exactly — not a synthetic ID generated during expansion. Routing engines identify frequency trips by their source trip_id plus a service-day offset, not by the expanded departure timestamp. Generating synthetic IDs at expansion time silently breaks real-time tracking.
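The continuity and overlap checks mentioned above can be sketched by comparing each window's start_time with the previous window's end_time for the same trip_id. This assumes pandas is installed and that times are already parsed to timedelta; the sample windows are hypothetical, with a deliberate 15-minute gap.

```python
import pandas as pd

# Hypothetical windows for one trip_id; the 09:00 to 09:15 gap is deliberate.
windows = pd.DataFrame(
    {
        "trip_id": ["101", "101", "101"],
        "start_time": pd.to_timedelta(["06:00:00", "09:15:00", "15:00:00"]),
        "end_time": pd.to_timedelta(["09:00:00", "15:00:00", "22:00:00"]),
    }
).sort_values(["trip_id", "start_time"])

# Previous window's end_time within the same trip (NaT for the first window).
prev_end = windows.groupby("trip_id")["end_time"].shift()
has_prev = prev_end.notna()

gaps = windows[has_prev & (windows["start_time"] > prev_end)]
overlaps = windows[has_prev & (windows["start_time"] < prev_end)]

print(len(gaps), len(overlaps))  # 1 0
```

Whether a gap is an error depends on the agency: some feeds legitimately have no service between windows, so a pipeline might log gaps and fail only on overlaps.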

Frequently asked questions

What does exact_times=0 vs exact_times=1 mean for expanded departures?

When exact_times=1, generated departure times are fixed schedule points that routing engines must honour. When exact_times=0, they are approximate headway intervals; routing engines may apply stochastic delay models or override them with real-time AVL data. Never store exact_times=0 expansions as authoritative timetable rows — flag them explicitly so downstream consumers can apply the correct scheduling model.

Why does pd.to_timedelta handle GTFS times better than datetime.strptime?

GTFS allows time strings beyond 23:59:59 (e.g. 25:30:00 for 1:30 AM the next service day). datetime.strptime raises ValueError on these; pd.to_timedelta converts them to 1 day 01:30:00, preserving chronological order across midnight boundaries. For background on how GTFS represents time across service-day boundaries, see converting local transit times to UTC.

How do I avoid memory issues when expanding a large frequencies.txt?

Filter by exact_times and service_id before expansion, use a generator instead of building a full list, and process in chunks by route_id or service_id. A single high-frequency metro route can produce hundreds of departures per trip_id, so early filtering is the most effective lever. See optimizing pandas memory usage for transit feeds for dtype downcast patterns that also apply here.
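As a rough sketch of the generator approach, the function below yields one departure at a time instead of accumulating a list. It uses only the standard library; iter_departures and its tuple layout are hypothetical, and the input windows are assumed to be validated already (positive headway, end after start).

```python
from datetime import timedelta
from typing import Iterable, Iterator


def iter_departures(
    windows: Iterable[tuple[str, timedelta, timedelta, int, int]],
) -> Iterator[tuple[str, timedelta, int]]:
    """Yield (trip_id, departure_time, exact_times) one event at a time.

    Each input tuple is (trip_id, start, end, headway_secs, exact_times).
    """
    for trip_id, start, end, headway_secs, exact_times in windows:
        step = timedelta(seconds=headway_secs)
        dep_time = start
        while dep_time < end:  # strictly before end_time
            yield trip_id, dep_time, exact_times
            dep_time += step


# One trip running every 90 seconds from 05:00 to 25:00 (20 hours).
windows = [("metro-1", timedelta(hours=5), timedelta(hours=25), 90, 1)]
events = iter_departures(windows)

first = next(events)
remaining = sum(1 for _ in events)
print(first[1], 1 + remaining)  # 5:00:00 800
```

Only one event is held in memory at a time, so the consumer decides the batch size, for example by pulling fixed-size slices with itertools.islice before each write.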

Optimizing pandas Memory Usage for Transit Feeds — dtype downcast patterns (including int8 for flag columns) that apply directly to the expanded departure table
Fixing Missing stop_times.txt Records in Python — the timetable-side counterpart, for trips that carry explicit stop_times.txt rows rather than a frequencies.txt window
Step-by-Step Guide to Parsing GTFS with Partridge — loading frequencies.txt and its related tables with strict foreign-key enforcement before you expand them