Handling Frequency-Based vs Timetable Schedules

Transit agencies publish schedules using two fundamentally different operational paradigms. Some routes run on fixed, published departure times stored in stop_times.txt. Others operate on headway-based intervals defined in frequencies.txt, where only a service window and a repeat interval are published rather than individual departure timestamps. Many metropolitan feeds use both paradigms simultaneously — a single route may run on exact times during peak hours and switch to frequency-based operation during midday or overnight.

For transit analysts, routing engineers, and mobility platform teams, resolving this split is not optional. Routing engines, real-time prediction models, and accessibility calculators all require a unified, time-normalized departure table. Feeding them raw frequencies.txt records without expansion causes silent gaps: entire service windows appear as a single synthetic trip rather than dozens of discrete vehicle departures.

This guide provides a production-oriented workflow for ingesting, differentiating, expanding, and merging both schedule types into a single analytical dataset. It pairs directly with Python Parsing & Data Normalization architectural patterns and covers the edge cases that real agency feeds most often introduce.

Prerequisites

Before implementing the normalization pipeline, ensure your environment meets the following requirements:

Required GTFS files: stop_times.txt, frequencies.txt, trips.txt, agency.txt
Python 3.9+ with the following packages installed:

pip install "pandas>=1.5.0" pyarrow numpy

The agency timezone string from agency.txt (agency_timezone field, e.g. America/New_York)
At least 8 GB RAM for feeds exceeding 500 MB uncompressed
Familiarity with vectorized time arithmetic in pandas: this workflow favors vectorized operations over row-by-row iteration wherever it can, to keep processing times practical on large metropolitan feeds

For feed ingestion patterns that enforce strict foreign-key integrity before you reach this stage, see the guide on parsing GTFS with pandas and partridge.

GTFS Specification Background

The GTFS specification explicitly separates exact-time and headway-based operations. Understanding the data model constraints is essential before writing any transformation code.

stop_times.txt — Timetable paradigm

Each row in stop_times.txt defines a vehicle’s arrival and departure at a single stop within a specific trip. The table’s effective composite key is (trip_id, stop_sequence). Fields that matter for schedule normalization:

Field	Type	Purpose
`trip_id`	String (FK → trips.txt)	Links the stop event to a trip
`stop_sequence`	Integer	Ordered stop position within the trip
`arrival_time`	HH:MM:SS string	May exceed 23:59:59 for overnight trips
`departure_time`	HH:MM:SS string	May exceed 23:59:59 for overnight trips

These are deterministic and ideal for commuter rail, intercity buses, and scheduled urban routes where precise timing is mandated.

frequencies.txt — Headway paradigm

frequencies.txt links to a trip_id and specifies a service window plus a repeat interval. Fields:

Field	Type	Purpose
`trip_id`	String (FK → trips.txt)	Must exist in trips.txt
`start_time`	HH:MM:SS string	Window open time
`end_time`	HH:MM:SS string	Window close time (exclusive)
`headway_secs`	Integer	Seconds between departures
`exact_times`	0 or 1	1 = exact departure guarantee; 0 = approximate headway

A critical spec constraint: when a trip_id appears in frequencies.txt, the stop_times.txt rows for that trip define only the stop sequence and relative offsets from trip start, not absolute clock times. The absolute wall-clock departures come from expanding frequencies.txt. Treating frequency trips as timetable trips without this distinction produces phantom departure timestamps derived from whatever times the agency happened to enter as offsets.

For the authoritative field definitions, consult the GTFS Schedule Reference.

The exact_times flag is the most consequential field for downstream consumers. When exact_times is 1, departures occur at start_time, start_time + headway_secs, start_time + 2×headway_secs, etc. — the agency guarantees exact timing. When exact_times is 0, the same arithmetic generates approximate service windows. Routing engines that need deterministic timestamps should flag exact_times=0 records for uncertainty modeling. The deep-dive on converting frequencies.txt to exact departure times covers the expansion arithmetic in full.
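
As a rough illustration of that arithmetic, the sketch below expands one hypothetical window (06:00:00 to 07:00:00 with a 900-second headway) into trip-start offsets. The helper name window_trip_starts is made up for this example, and it treats end_time as exclusive, matching the field table above.

```python
import numpy as np
import pandas as pd

def window_trip_starts(start_secs: int, end_secs: int, headway_secs: int) -> pd.TimedeltaIndex:
    """Trip-start offsets for one frequency window; end_secs is exclusive."""
    # np.arange is half-open, so no departure is generated at end_secs itself
    return pd.to_timedelta(np.arange(start_secs, end_secs, headway_secs), unit="s")

# Hypothetical window: 06:00:00 to 07:00:00, one departure every 15 minutes
starts = window_trip_starts(6 * 3600, 7 * 3600, 900)
# Four trip starts: 06:00, 06:15, 06:30, 06:45 (07:00 is excluded)
```

The same offsets apply whether exact_times is 0 or 1; only the interpretation of the resulting timestamps changes.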

Step-by-Step Implementation

Step 1: Ingest and isolate schedule types

Load stop_times.txt and frequencies.txt with explicit dtypes, then tag each trip by its schedule paradigm. Because frequencies.txt can override stop_times.txt, trips appearing in both files must be handled as frequency trips — not timetable trips — during expansion.

import pandas as pd
import numpy as np

# Explicit dtypes keep numeric-looking trip_id and stop_id values as strings
stop_times = pd.read_csv(
    "stop_times.txt",
    dtype={
        "trip_id": str,
        "stop_id": str,
        "stop_sequence": "int32",
        "arrival_time": str,
        "departure_time": str,
    },
    usecols=["trip_id", "stop_id", "stop_sequence", "arrival_time", "departure_time"],
)

frequencies = pd.read_csv(
    "frequencies.txt",
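    dtype={
        "trip_id": str,
        "start_time": str,
        "end_time": str,
        "headway_secs": "int32",
        "exact_times": "Int8",  # nullable integer; values may be blank
    },
)
# exact_times is an optional column; default to 0 (approximate) when it is absent
if "exact_times" not in frequencies.columns:
    frequencies["exact_times"] = 0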

# Identify frequency-governed trip_ids
freq_trip_ids = set(frequencies["trip_id"].unique())

# Tag every stop_times row by schedule paradigm
stop_times["schedule_type"] = np.where(
    stop_times["trip_id"].isin(freq_trip_ids), "frequency", "timetable"
)

# Split for separate processing paths
timetable_stop_times = stop_times[stop_times["schedule_type"] == "timetable"].copy()

# For frequency trips: keep only one stop_times row per (trip_id, stop_sequence)
# to preserve the relative offset template; duplicates arise from multi-agency merges
freq_stop_times = (
    stop_times[stop_times["schedule_type"] == "frequency"]
    .drop_duplicates(subset=["trip_id", "stop_sequence"])
    .copy()
)

Step 2: Parse GTFS time strings safely

GTFS time strings frequently exceed 23:59:59 for overnight trips (e.g., 25:30:00 for 1:30 AM the following day). Standard datetime.strptime rejects these values. Convert them to total seconds first, then build a timedelta.

def gtfs_time_to_timedelta(time_str: str) -> pd.Timedelta:
    """Convert a GTFS HH:MM:SS string (which may exceed 23:59) to a pd.Timedelta."""
    parts = time_str.strip().split(":")
    total_seconds = int(parts[0]) * 3600 + int(parts[1]) * 60 + int(parts[2])
    return pd.Timedelta(seconds=total_seconds)

def parse_time_column(series: pd.Series) -> pd.Series:
    """Vectorized GTFS time parsing. Returns a Series of pd.Timedelta."""
    # Split once and compute without a Python loop
    split = series.str.strip().str.split(":", expand=True).astype("int32")
    return pd.to_timedelta(split[0] * 3600 + split[1] * 60 + split[2], unit="s")
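
One caveat worth sketching: stop_times.txt may leave arrival_time and departure_time blank for stops that are not timepoints, and the astype("int32") cast in parse_time_column would raise on such rows. The variant below is only an illustration (the name parse_time_column_nullable is hypothetical) that maps blank or malformed values to NaT instead of failing.

```python
import pandas as pd

def parse_time_column_nullable(series: pd.Series) -> pd.Series:
    """Parse GTFS HH:MM:SS strings to Timedelta; blanks and bad values become NaT."""
    # Rows that do not match the pattern (including NaN) come back as all-NaN
    parts = series.str.strip().str.extract(r"^(\d+):(\d{2}):(\d{2})$")
    secs = (
        pd.to_numeric(parts[0], errors="coerce") * 3600
        + pd.to_numeric(parts[1], errors="coerce") * 60
        + pd.to_numeric(parts[2], errors="coerce")
    )
    return pd.to_timedelta(secs, unit="s")

times = pd.Series(["08:15:00", None, "25:30:00", " 7:05:30"])
parsed = parse_time_column_nullable(times)
```

Whether NaT rows should then be interpolated or dropped depends on the downstream consumer; this sketch leaves that decision open.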

Step 3: Expand frequency windows to exact departures

For each row in frequencies.txt, generate the complete sequence of departure offsets and broadcast them across the stop sequence defined by freq_stop_times. This is the most memory-intensive step for feeds with many frequency rows.

def expand_frequencies(
    freq_df: pd.DataFrame,
    freq_stop_times_df: pd.DataFrame,
) -> pd.DataFrame:
    """
    Expand each frequencies.txt row into discrete per-stop departure times.
    Returns a DataFrame with trip_id, stop_id, stop_sequence, departure_time,
    schedule_type, and exact_times columns.
    """
    freq_df = freq_df.copy()
    freq_df["start_td"] = parse_time_column(freq_df["start_time"])
    freq_df["end_td"] = parse_time_column(freq_df["end_time"])
    freq_df["headway_td"] = pd.to_timedelta(freq_df["headway_secs"], unit="s")
    # Fill missing exact_times with 0 (approximate) per spec default
    freq_df["exact_times"] = freq_df["exact_times"].fillna(0).astype("int8")

    # Merge frequency windows onto the stop sequence template
    merged = freq_stop_times_df.merge(
        freq_df[["trip_id", "start_td", "end_td", "headway_td", "exact_times"]],
        on="trip_id",
        how="inner",
    )

    # Each stop's offset from the start of the trip. For frequency-governed trips
    # the stop_times values only matter relative to the first stop, so subtract
    # the trip's earliest departure to get a zero-based offset per stop.
    merged["stop_offset_td"] = parse_time_column(merged["departure_time"])
    first_stop_offset = (
        merged.groupby("trip_id")["stop_offset_td"].transform("min")
    )
    merged["relative_offset_td"] = merged["stop_offset_td"] - first_stop_offset

    # Expand one frequency window at a time. Grouping on the full window key keeps
    # trips with several windows (peak, midday, ...) from collapsing into the first.
    # This still loops in Python; the scale notes describe a join-based alternative.
    expanded_chunks = []
    window_keys = ["trip_id", "start_td", "end_td", "headway_td", "exact_times"]
    for (trip_id, start_td, end_td, headway_td, exact_t), group in merged.groupby(window_keys):
        # Trip-start times within the window. end_time is exclusive, which matches
        # np.arange's half-open range.
        trip_starts = pd.to_timedelta(
            np.arange(
                int(start_td.total_seconds()),
                int(end_td.total_seconds()),
                int(headway_td.total_seconds()),
            ),
            unit="s",
        )

        # Cross-join trip_starts with the stop sequence
        for stop_row in group.itertuples(index=False):
            departure_offsets = trip_starts + stop_row.relative_offset_td
            chunk = pd.DataFrame({
                "trip_id": trip_id,
                "stop_id": stop_row.stop_id,
                "stop_sequence": stop_row.stop_sequence,
                "departure_time": departure_offsets,
                "schedule_type": "frequency",
                "exact_times": exact_t,
            })
            expanded_chunks.append(chunk)

    if not expanded_chunks:
        return pd.DataFrame(columns=[
            "trip_id", "stop_id", "stop_sequence",
            "departure_time", "schedule_type", "exact_times",
        ])

    return pd.concat(expanded_chunks, ignore_index=True)

Step 4: Prepare timetable trips

Convert timetable departure strings to timedelta using the same parser so both DataFrames share a common departure_time dtype before merging.

# Add exact_times column for schema consistency (timetable trips are always exact)
timetable_stop_times["exact_times"] = 1
timetable_stop_times["departure_time"] = parse_time_column(
    timetable_stop_times["departure_time"]
)

Step 5: Merge, deduplicate, and resolve overlaps

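Once frequencies are expanded, merge with timetable trips and remove duplicates. Agencies do sometimes publish both timetable rows and a frequency block for the same trip_id, a known real-world inconsistency in several North American metro feeds. Step 1 already routes those trips to the frequency path, so the duplicates removed here come mainly from overlapping frequency windows or repeated rows.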

def merge_and_deduplicate(
    timetable_df: pd.DataFrame,
    expanded_freq_df: pd.DataFrame,
) -> pd.DataFrame:
    """
    Combine timetable and expanded frequency departures.
    Timetable rows take precedence over frequency-expanded rows for the same
    (trip_id, stop_id, departure_time) combination.
    """
    # Timetable rows first so drop_duplicates keeps them over frequency rows
    combined = pd.concat(
        [timetable_df, expanded_freq_df],
        ignore_index=True,
        sort=False,
    )

    # Round to 1-second precision so sub-second differences cannot hide duplicates
    combined["departure_time"] = combined["departure_time"].dt.round("s")

    # Deduplicate before sorting, while concat order still puts timetable rows first
    before_count = len(combined)
    combined = combined.drop_duplicates(
        subset=["trip_id", "stop_id", "departure_time"],
        keep="first",  # first = timetable row, since we concatenated it first
    )
    combined = combined.sort_values(
        ["trip_id", "stop_sequence", "departure_time"]
    ).reset_index(drop=True)
    duplicates_removed = before_count - len(combined)
    if duplicates_removed > 0:
        print(f"[INFO] Removed {duplicates_removed} duplicate departure records.")

    return combined

Step 6: Timezone normalization and UTC alignment

Routing engines require continuous POSIX timestamps. Apply timezone normalization after merging both schedule types. The agency timezone is a mandatory field in agency.txt; fetch it before running this step. For feeds with multiple agencies, join on agency_id through routes.txt → trips.txt to resolve the correct timezone per trip.

def normalize_to_utc(
    df: pd.DataFrame,
    service_date_str: str,
    agency_tz: str,
) -> pd.DataFrame:
    """
    Convert GTFS timedelta departure_time values to UTC wall-clock timestamps.

    Parameters
    ----------
    df : DataFrame with a 'departure_time' column of pd.Timedelta
    service_date_str : ISO date string for the service day, e.g. '2024-01-15'
    agency_tz : IANA timezone name from agency.txt agency_timezone field
    """
    # GTFS measures times from "noon minus 12h" on the service day. That equals
    # local midnight except on DST-change days, and it avoids localizing a
    # midnight that may not exist in zones that switch clocks at 00:00.
    noon = pd.Timestamp(f"{service_date_str} 12:00:00").tz_localize(agency_tz)
    base_date = noon - pd.Timedelta(hours=12)

    # Timedelta addition is absolute elapsed time, so day-wrap is automatic:
    # a departure_time of 25:30:00 maps to 01:30:00 local on the next calendar day.
    df = df.copy()
    df["departure_utc"] = (base_date + df["departure_time"]).dt.tz_convert("UTC")

    df = df.drop(columns=["departure_time"])
    return df

For the full treatment of DST edge cases and multi-timezone feeds, see converting local transit times to UTC in Python.
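
The trips → routes → agency join mentioned above could look roughly like the sketch below. The function name trip_timezones is hypothetical, and the column names follow the GTFS reference. Note that the GTFS reference expects all agencies in one feed to share a timezone, so per-trip resolution matters mainly for merged or non-conforming feeds.

```python
import pandas as pd

def trip_timezones(trips: pd.DataFrame, routes: pd.DataFrame, agency: pd.DataFrame) -> pd.Series:
    """Map each trip_id to an agency_timezone via trips -> routes -> agency."""
    if "agency_id" not in routes.columns or "agency_id" not in agency.columns:
        # agency_id may be omitted when the feed describes a single agency
        tz = agency["agency_timezone"].iloc[0]
        index = pd.Index(trips["trip_id"], name="trip_id")
        return pd.Series(tz, index=index, name="agency_timezone")
    joined = (
        trips[["trip_id", "route_id"]]
        .merge(routes[["route_id", "agency_id"]], on="route_id", how="left")
        .merge(agency[["agency_id", "agency_timezone"]], on="agency_id", how="left")
    )
    return joined.set_index("trip_id")["agency_timezone"]

# Tiny made-up tables standing in for trips.txt, routes.txt, and agency.txt
trips = pd.DataFrame({"trip_id": ["t1", "t2"], "route_id": ["r1", "r2"]})
routes = pd.DataFrame({"route_id": ["r1", "r2"], "agency_id": ["a1", "a2"]})
agency = pd.DataFrame({
    "agency_id": ["a1", "a2"],
    "agency_timezone": ["America/New_York", "America/Chicago"],
})
tz_by_trip = trip_timezones(trips, routes, agency)
```

With that mapping in hand, one option is to group the merged departures by timezone and call normalize_to_utc once per group.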

Validation and Verification

Normalized schedules must pass structural validation before reaching downstream consumers. The assertions below catch the most common normalization failures:

def validate_normalized_schedule(df: pd.DataFrame) -> bool:
    """
    Run structural checks on a normalized departure DataFrame.
    Returns True if all checks pass; prints warnings for any violations found.
    """
    checks = {
        "missing_departure_utc": df["departure_utc"].isna().sum(),
        "missing_stop_id": df["stop_id"].isna().sum(),
        "missing_trip_id": df["trip_id"].isna().sum(),
        "duplicate_trip_stop_departure": df.duplicated(
            subset=["trip_id", "stop_id", "departure_utc"]
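        ).sum(),
        # Departures must not move backward in time along a trip's stop sequence.
        # Compares the earliest departure at each stop, because expanded frequency
        # runs share a trip_id and cannot be told apart individually.
        "backward_time_along_stop_sequence": (
            df.groupby(["trip_id", "stop_sequence"])["departure_utc"]
            .min()
            .groupby(level="trip_id")
            .diff()
            .lt(pd.Timedelta(0))
            .sum()
        ),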
    }

    all_pass = True
    for metric, count in checks.items():
        if count > 0:
            print(f"[WARN] {metric}: {count} records flagged")
            all_pass = False

    # Basic sanity: an empty table means ingestion or expansion failed outright
    assert not df.empty, (
        "Normalized table is empty; ingestion or expansion likely failed."
    )

    if all_pass:
        print(f"[OK] Validation passed: {len(df):,} departure records, "
              f"{df['trip_id'].nunique():,} unique trips.")
    return all_pass

For teams running recurring feed drops, embedding these checks in an automated update pipeline is essential. See automating feed updates with GTFS-Kit for a production deployment pattern that gates ingestion on validation results.

Failure Modes and Edge Cases

Real agency feeds introduce problems that the GTFS specification does not prevent. The following edge cases appear regularly in production pipelines:

Frequency trips whose stop_times template uses literal clock times. Some agencies populate the stop_times rows for frequency trips with real clock times (for example, the first run of the window) rather than offsets starting at 00:00:00. Adding those values directly to each generated trip start shifts every expanded departure by the template's first-stop time. Subtracting the trip's first-stop time from every stop, as Step 3 does, normalizes both conventions to zero-based offsets.
Headway windows that do not tile cleanly. If (end_time - start_time) % headway_secs != 0, the final departure falls less than one headway before end_time. If the window tiles exactly, no departure belongs at end_time itself because the window is exclusive; note that pd.timedelta_range includes its end point by default. Always verify the generated count matches ceil((end_time - start_time) / headway_secs).
Trips in frequencies.txt with no matching stop_times rows. Referential integrity is not enforced by all agency feed generators. A frequency record without corresponding stop_times rows produces zero expanded departures. Log these orphaned trip_id values and route them to a dead-letter store for manual review. The broader topic of referential integrity checks is covered in GTFS validation rules and common schema errors.
Mixed exact_times values within the same trip_id. Technically non-conforming, but real feeds ship this. A single trip_id may have one frequency row with exact_times=1 (morning peak) and another with exact_times=0 (midday). Group by (trip_id, exact_times) when expanding rather than by trip_id alone.
Times far beyond 24:00:00. Overnight routes and 24-hour bus networks legitimately produce departure strings like 28:45:00, and pd.to_timedelta handles them correctly. Values far beyond a plausible service day are more often data-entry errors. Flag records with departure offsets above a chosen ceiling (30 hours is a reasonable starting point) and log them for review rather than passing them downstream silently.
Duplicate frequency rows for the same trip_id and window. Occasionally a feed publishes two frequencies.txt rows for the same trip_id, start_time, end_time with different headway_secs values. This doubles the departure count. Deduplicate frequencies.txt on (trip_id, start_time, end_time) before expansion, keeping the smaller headway (more conservative, avoids under-serving).
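
Two of the checks above (duplicate windows and orphaned frequency trips) can be run before expansion. The following is a minimal sketch rather than a complete cleaning stage; the function name clean_frequencies and the sample rows are invented for illustration.

```python
import pandas as pd

def clean_frequencies(frequencies: pd.DataFrame, stop_times: pd.DataFrame):
    """Return (usable frequency rows, sorted list of orphaned trip_ids)."""
    # When a (trip_id, start_time, end_time) window repeats, keep the smaller headway
    deduped = (
        frequencies.sort_values("headway_secs")
        .drop_duplicates(subset=["trip_id", "start_time", "end_time"], keep="first")
        .sort_index()
    )
    # A frequency row is orphaned when its trip has no stop_times template
    has_template = deduped["trip_id"].isin(stop_times["trip_id"])
    orphans = sorted(deduped.loc[~has_template, "trip_id"].unique())
    return deduped[has_template].reset_index(drop=True), orphans

frequencies = pd.DataFrame({
    "trip_id": ["A", "A", "B"],
    "start_time": ["06:00:00", "06:00:00", "10:00:00"],
    "end_time": ["09:00:00", "09:00:00", "12:00:00"],
    "headway_secs": [600, 300, 900],
})
stop_times = pd.DataFrame({"trip_id": ["A", "A"], "stop_sequence": [1, 2]})
usable, orphans = clean_frequencies(frequencies, stop_times)
```

The orphan list is what would be logged or sent to the dead-letter store; the usable frame is what goes on to expansion.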

Performance and Scale Notes

For feeds exceeding 500 MB uncompressed, naive in-memory processing of stop_times.txt exhausts RAM before expansion begins. Apply these patterns:

Column pruning on ingestion. stop_times.txt often contains optional columns (shape_dist_traveled, pickup_type, drop_off_type) that are irrelevant to departure normalization. Specifying usecols at read time reduces peak memory by 30–40% on large feeds.

Chunked processing by trip_id batch. Partition the trips.txt table by route or service pattern, then process one batch of trips at a time. Write each batch to Parquet using pyarrow with partitioning on service_date and agency_id. This limits peak working-set memory to a controllable ceiling regardless of total feed size.

import pyarrow as pa
import pyarrow.parquet as pq

def write_normalized_partition(
    df: pd.DataFrame,
    output_dir: str,
    service_date: str,
    agency_id: str,
) -> None:
    """Write a normalized departure batch to a Parquet partition."""
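    # partition_cols must exist as columns, so stamp the batch with its keys first
    df = df.assign(agency_id=agency_id, service_date=service_date)
    table = pa.Table.from_pandas(df, preserve_index=False)
    pq.write_to_dataset(
        table,
        root_path=output_dir,
        partition_cols=["agency_id", "service_date"],
    )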
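
The batching side of that pattern might be sketched as follows. The generator name iter_route_batches and the routes_per_batch parameter are assumptions for illustration, and this version still expects stop_times to be in memory already; for files too large for that, reading stop_times.txt with the chunksize option of pd.read_csv is one alternative.

```python
from typing import Iterator

import pandas as pd

def iter_route_batches(
    trips: pd.DataFrame,
    stop_times: pd.DataFrame,
    routes_per_batch: int = 50,
) -> Iterator[pd.DataFrame]:
    """Yield the stop_times rows for a limited number of routes at a time."""
    route_ids = trips["route_id"].drop_duplicates().tolist()
    for i in range(0, len(route_ids), routes_per_batch):
        batch_routes = route_ids[i : i + routes_per_batch]
        batch_trip_ids = trips.loc[trips["route_id"].isin(batch_routes), "trip_id"]
        yield stop_times[stop_times["trip_id"].isin(batch_trip_ids)]

# Made-up data: three routes, processed two routes per batch
trips = pd.DataFrame({"trip_id": ["t1", "t2", "t3"], "route_id": ["r1", "r2", "r3"]})
stop_times = pd.DataFrame({
    "trip_id": ["t1", "t1", "t2", "t3"],
    "stop_sequence": [1, 2, 1, 1],
})
batches = list(iter_route_batches(trips, stop_times, routes_per_batch=2))
```

Each yielded batch can then run through expansion, merging, and UTC normalization before being written to its own partition.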

dtype optimization before merge. Convert trip_id and stop_id columns to pd.Categorical dtype before the merge step. On a 10-million-row stop_times.txt, this reduces memory by 50–70% compared to object dtype strings. See memory-efficient processing for large feeds for detailed dtype profiling patterns and Parquet output strategies.
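
A quick way to gauge the effect on a given feed is to compare memory_usage before and after the conversion. The snippet below uses synthetic identifiers, so the ratio it produces says nothing about any real feed.

```python
import pandas as pd

# Synthetic stand-in for stop_times: many rows, few distinct identifiers
stop_times = pd.DataFrame({
    "trip_id": [f"trip_{i % 100}" for i in range(10_000)],
    "stop_id": [f"stop_{i % 250}" for i in range(10_000)],
})

before = stop_times.memory_usage(deep=True).sum()
compact = stop_times.astype({"trip_id": "category", "stop_id": "category"})
after = compact.memory_usage(deep=True).sum()

print(f"string columns: {before:,} bytes; categorical: {after:,} bytes")
```

One thing to watch: when the two sides of a merge carry categorical keys with different category sets, pandas falls back to object dtype for the key, so the saving is only preserved if both frames share the same categories.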

Avoid row-wise loops in expansion. The expand_frequencies implementation above still uses itertuples for per-group expansion. For feeds with thousands of frequency rows, build the trip starts for every window once, then let a single pd.merge produce the cross product with the stop sequence:

# Vectorized alternative to the itertuples inner loop.
# Assumes `frequencies` already carries the parsed start_td / end_td / headway_td
# columns and `freq_stop_times` already carries relative_offset_td, both computed
# the same way as inside expand_frequencies.
trip_starts_frames = []
for row in frequencies.itertuples(index=False):  # one iteration per frequency window
    starts = pd.to_timedelta(
        np.arange(
            int(row.start_td.total_seconds()),
            int(row.end_td.total_seconds()),  # exclusive window end
            int(row.headway_td.total_seconds()),
        ),
        unit="s",
    )
    trip_starts_frames.append(
        pd.DataFrame({
            "trip_id": row.trip_id,
            "trip_start_td": starts,
            "exact_times": row.exact_times,
        })
    )

all_trip_starts = pd.concat(trip_starts_frames, ignore_index=True)

# Now merge the full stop sequence against all trip starts in a single join
expanded = freq_stop_times.merge(all_trip_starts, on="trip_id", how="inner")
expanded["departure_time"] = expanded["trip_start_td"] + expanded["relative_offset_td"]

This approach delegates the cross-product to pandas’ optimized C-layer merge rather than Python interpreter loops, cutting expansion time by 5–10x on feeds with hundreds of frequency-governed routes.

Frequently Asked Questions

What is the difference between exact_times=0 and exact_times=1 in frequencies.txt?

When exact_times is 1, the agency guarantees departures at the precise headway interval starting from start_time — routing engines can treat these as fixed schedule points. When exact_times is 0 (or omitted), the headway is a target interval but individual departures may vary due to real-time conditions. For normalization pipelines, both modes are typically expanded to discrete timestamps, but exact_times=0 records should be flagged so downstream systems can apply uncertainty buffers rather than treating them as guaranteed departure times.

Can a GTFS trip appear in both stop_times.txt and frequencies.txt?

Yes. When a trip_id appears in frequencies.txt, the stop_times.txt rows define the stop sequence and relative time offsets from trip start — not absolute departure times. The absolute wall-clock departures come from expanding frequencies.txt. Treating such trips as pure timetable records is a common ETL error that produces phantom departures derived from the arbitrary offsets the agency entered as placeholders.

How do I handle GTFS times greater than 24:00:00?

Parse the time string manually into a total-seconds integer before using pd.to_timedelta. Standard datetime.strptime rejects values like 25:30:00. Once converted to a timedelta, add it to a timezone-aware pd.Timestamp that anchors the service day. The GTFS reference defines that anchor as noon minus 12 hours, which equals local midnight except on days with a DST change. The arithmetic maps 25:30:00 to 01:30 AM of the next calendar day, and converting the result with tz_convert("UTC") yields the correct UTC instant across DST transitions.
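
To make the DST point concrete, the sketch below compares a noon-minus-12-hours anchor with a plain local-midnight anchor on a spring-forward day. The helper name service_day_anchor is hypothetical; the date and timezone were chosen only because America/New_York moved its clocks forward on 2024-03-10.

```python
import pandas as pd

def service_day_anchor(service_date: str, tz: str) -> pd.Timestamp:
    """GTFS time zero: local noon on the service day minus 12 hours."""
    noon = pd.Timestamp(f"{service_date} 12:00:00").tz_localize(tz)
    return noon - pd.Timedelta(hours=12)

tz = "America/New_York"
eight_hours = pd.Timedelta(hours=8)  # GTFS time 08:00:00

# Anchor defined as noon minus 12h: 08:00:00 lands on 08:00 local time
anchor = service_day_anchor("2024-03-10", tz)
local_from_anchor = (anchor + eight_hours).tz_convert(tz)

# Anchor defined as local midnight: the skipped hour pushes the result to 09:00
midnight = pd.Timestamp("2024-03-10").tz_localize(tz)
local_from_midnight = (midnight + eight_hours).tz_convert(tz)
```

On ordinary days the two anchors coincide, so the difference only shows up on the two DST-change service days per year.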

Related guides

Converting GTFS frequencies.txt to exact departure times — deep dive on the expansion arithmetic and exact_times flag handling
Automating feed updates with GTFS-Kit — production pipeline for recurring feed ingestion with validation gating
Memory-efficient processing for large feeds — dtype optimization and Parquet partitioning for metro-scale feeds
GTFS validation rules and common schema errors — referential integrity checks that catch orphaned trip_ids before normalization
Timezone handling and schedule normalization — agency timezone resolution and DST-safe UTC conversion
