Handling Frequency-Based vs Timetable Schedules

Transit agencies publish schedules using fundamentally different operational paradigms. Some routes run on fixed, published departure times, while others operate on headway-based intervals that adapt to demand, traffic, or driver availability. In the General Transit Feed Specification (GTFS), these paradigms map to two distinct data structures: timetable schedules (stop_times.txt) and frequency-based schedules (frequencies.txt).

For transit analysts, urban tech developers, and mobility platform teams, handling frequency-based and timetable schedules correctly is not optional. Routing engines, real-time prediction models, and accessibility calculators all require a unified, time-normalized departure table. This guide walks through a workflow for ingesting, differentiating, expanding, and merging both schedule types into a single analytical dataset.

Understanding GTFS Schedule Paradigms

The GTFS specification explicitly separates exact-time and headway-based operations. Timetable schedules define every vehicle arrival and departure at each stop along a trip. Each row in stop_times.txt contains a trip_id, stop_id, stop_sequence, arrival_time, and departure_time. These schedules are deterministic and suit commuter rail, intercity buses, and scheduled urban routes where precise timing is mandated.

Frequency-based schedules, by contrast, define a window of operation and a headway interval. The frequencies.txt file links to a trip_id and specifies start_time, end_time, headway_secs, and an optional exact_times flag. When exact_times is 0 (or omitted), departures are approximate and follow a rolling headway. When exact_times is 1, the agency guarantees exact departures at the specified interval, effectively mimicking a timetable but stored more compactly.
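
As a rough illustration of the exact_times = 1 case, the sketch below lists the departures implied by a single frequencies.txt row. The helper names (to_seconds, to_hhmmss, window_departures) and the sample values are hypothetical, and end_time is treated as exclusive, so no run starts at or after it.

```python
def to_seconds(hhmmss: str) -> int:
    """Convert a GTFS HH:MM:SS string (hours may exceed 23) to seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hhmmss(total_seconds: int) -> str:
    """Format seconds as HH:MM:SS without wrapping at 24 hours."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def window_departures(start_time: str, end_time: str, headway_secs: int) -> list[str]:
    """List first-stop departures for one frequency window (end exclusive)."""
    if headway_secs <= 0:
        raise ValueError("headway_secs must be positive")
    first = to_seconds(start_time)
    last_exclusive = to_seconds(end_time)
    return [to_hhmmss(t) for t in range(first, last_exclusive, headway_secs)]


# A 15-minute headway between 06:00 and 07:00 yields four departures.
print(window_departures("06:00:00", "07:00:00", 900))
# ['06:00:00', '06:15:00', '06:30:00', '06:45:00']
```

With exact_times = 0 the same list is only an approximation of when vehicles might run, which is why the flag is worth carrying through the pipeline.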

Many metropolitan feeds use both paradigms simultaneously. A single route may run on exact times during peak hours and switch to frequency-based operation during midday or weekends. Normalizing these into a consistent structure requires careful time arithmetic, deduplication logic, and explicit handling of day-wrap scenarios (times exceeding 24:00:00). For authoritative schema definitions, consult the official GTFS Schedule Reference. Mastering these distinctions is foundational to any Python Parsing & Data Normalization pipeline.
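
To make the day-wrap case concrete, the following sketch parses a few GTFS time strings with pandas and splits them into a day offset and elapsed seconds. The sample values and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical departure_time values, two of them past midnight
gtfs_times = pd.Series(["23:50:00", "24:00:00", "25:30:00"])

# to_timedelta accepts hour values above 23, so no manual wrapping is needed
elapsed = pd.to_timedelta(gtfs_times)

# Whole days past the start of the service day, and total elapsed seconds
day_offset = elapsed.dt.days
seconds = elapsed.dt.total_seconds().astype("int64")

print(pd.DataFrame({"gtfs_time": gtfs_times, "day_offset": day_offset, "seconds": seconds}))
```

Keeping times as elapsed durations, rather than wrapping them back to clock times, preserves the ordering of late-night trips relative to the rest of the service day.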

Prerequisites & Environment Setup

Before implementing the normalization pipeline, ensure your environment meets the following requirements:

  • Python 3.9+ with pandas>=1.5.0, pyarrow, and numpy
  • GTFS ingestion libraries: partridge for lightweight feed parsing, or gtfs-kit for higher-level abstractions. If you are new to feed ingestion, review our guide on Parsing GTFS with Pandas and Partridge for optimized memory patterns.
  • Memory allocation: 8GB+ RAM for feeds exceeding 500MB uncompressed
  • Timezone awareness: Agency timezone must be known (read from the agency_timezone field in agency.txt)
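
As a small sketch of the timezone prerequisite, the snippet below reads agency_timezone from agency.txt and insists on a single value per feed. The function name feed_timezone and the inline sample rows are hypothetical; in a real pipeline the file handle would point at the feed's agency.txt.

```python
import io

import pandas as pd

# Stand-in for open("agency.txt"); these rows are invented for illustration
agency_csv = io.StringIO(
    "agency_id,agency_name,agency_url,agency_timezone\n"
    "A1,Example Transit,https://example.org,America/New_York\n"
    "A2,Example Rail,https://example.org,America/New_York\n"
)


def feed_timezone(agency_file) -> str:
    """Return the single timezone shared by all agencies in a feed."""
    agency = pd.read_csv(agency_file, dtype=str)
    zones = agency["agency_timezone"].dropna().unique()
    if len(zones) != 1:
        # GTFS expects all agencies in one feed to share a timezone
        raise ValueError(f"expected one agency_timezone, found {list(zones)}")
    return str(zones[0])


agency_tz = feed_timezone(agency_csv)
print(agency_tz)
```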

This workflow assumes familiarity with vectorized time arithmetic. We will lean on pandas for batch operations and keep row-by-row iteration to a minimum so that processing stays fast on large metropolitan feeds.

Production Workflow: Normalizing Mixed Schedules

Step 1: Ingest & Isolate Schedule Types

The first step is to isolate timetable trips from frequency-based trips. Every trip referenced in frequencies.txt also has rows in stop_times.txt, but for those trips the rows describe the stop pattern and relative timing rather than literal departures, so we must flag trips that appear in both files. We’ll load the core tables, filter by trip_id, and tag the schedule type.

python
import pandas as pd
import numpy as np

# Load core GTFS tables
stop_times = pd.read_csv("stop_times.txt", dtype={"trip_id": str, "stop_id": str})
frequencies = pd.read_csv("frequencies.txt", dtype={"trip_id": str})

# Identify trip types
freq_trip_ids = set(frequencies["trip_id"].unique())
stop_times["schedule_type"] = np.where(
    stop_times["trip_id"].isin(freq_trip_ids), "frequency", "timetable"
)

# Split for separate processing. Keep every stop row of the frequency-based
# trips: those rows are the stop-pattern template that Step 2 expands.
timetable_trips = stop_times[stop_times["schedule_type"] == "timetable"]
freq_trips = stop_times[stop_times["schedule_type"] == "frequency"]
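
Before expanding anything, it can help to confirm that every trip in frequencies.txt actually has template rows in stop_times.txt. The check below is a minimal sketch on invented data; the variable names orphaned and stops_per_trip are illustrative only.

```python
import pandas as pd

# Tiny invented tables standing in for the loaded GTFS files
stop_times = pd.DataFrame({
    "trip_id": ["T1", "T1", "F1", "F1"],
    "stop_id": ["S1", "S2", "S1", "S2"],
    "stop_sequence": [1, 2, 1, 2],
})
frequencies = pd.DataFrame({
    "trip_id": ["F1", "F2"],
    "start_time": ["06:00:00", "06:00:00"],
    "end_time": ["09:00:00", "09:00:00"],
    "headway_secs": [600, 900],
})

# Frequency trips with no stop_times rows cannot be expanded
orphaned = sorted(set(frequencies["trip_id"]) - set(stop_times["trip_id"]))

# Stop count per trip, useful for spotting truncated templates
stops_per_trip = stop_times.groupby("trip_id").size()

print("frequency trips without a template:", orphaned)
print(stops_per_trip)
```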

Step 2: Expand Frequency Windows to Exact Departures

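Expanding headways into discrete departure times requires careful interval generation. The stop_times.txt rows of a frequency-based trip act as a template: each generated run starts at start_time + n * headway_secs and reaches every later stop at the same offset as in the template. Runs are generated up to, but not including, end_time. When exact_times == 1, these departures are the schedule. When exact_times == 0, they only approximate a rolling headway, but routing engines still need materialized timestamps, so we generate them the same way and carry the flag along. The loop below runs over frequency windows, which are few, while the per-stop arithmetic stays vectorized in pandas.

python
def expand_frequencies(freq_df, stop_times_df):
    """Expand frequency windows into one row per (run, stop).

    stop_times_df must hold every stop row of the frequency-based trips;
    those rows act as the stop-pattern template for each generated run.
    """
    template = stop_times_df.copy()
    # GTFS times can exceed 24:00:00; to_timedelta parses them directly
    template["template_td"] = pd.to_timedelta(template["departure_time"])
    # Offset of each stop from the first departure of its template trip
    first_td = template.groupby("trip_id")["template_td"].transform("min")
    template["offset_td"] = template["template_td"] - first_td

    columns = ["trip_id", "stop_id", "stop_sequence"]
    expanded_rows = []
    # Loop over frequency windows (a small table), not over stop_times rows
    for freq in freq_df.itertuples(index=False):
        stops = template[template["trip_id"] == freq.trip_id]
        if stops.empty or freq.headway_secs <= 0:
            continue  # nothing to expand, or an invalid headway
        start_s = int(pd.to_timedelta(freq.start_time).total_seconds())
        end_s = int(pd.to_timedelta(freq.end_time).total_seconds())
        # end_time is exclusive: no run starts at or after it
        run_starts = pd.to_timedelta(
            np.arange(start_s, end_s, int(freq.headway_secs)), unit="s"
        )
        for run_start in run_starts:
            run_df = stops[columns].copy()
            run_df["departure_time"] = run_start + stops["offset_td"]
            # exact_times is optional in frequencies.txt; default to 0
            run_df["exact_times"] = getattr(freq, "exact_times", 0)
            run_df["schedule_type"] = "frequency"
            expanded_rows.append(run_df)

    if not expanded_rows:
        return pd.DataFrame(
            columns=columns + ["departure_time", "exact_times", "schedule_type"]
        )
    return pd.concat(expanded_rows, ignore_index=True)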

For details on pandas.to_timedelta and timedelta arithmetic, refer to the official pandas documentation.
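
If the Python-level loop in expand_frequencies becomes a bottleneck, one possible alternative is to build all runs with NumPy and attach stop offsets through a single merge. The function below is a sketch under stated assumptions, not a drop-in replacement: the name expand_vectorized is hypothetical, end_time is treated as exclusive, and stop_times rows with blank departure times are not handled.

```python
import numpy as np
import pandas as pd


def expand_vectorized(frequencies: pd.DataFrame, stop_times: pd.DataFrame) -> pd.DataFrame:
    """Expand frequency windows into per-stop departures without row loops."""
    freq = frequencies.reset_index(drop=True)
    start_s = pd.to_timedelta(freq["start_time"]).dt.total_seconds().to_numpy().astype("int64")
    end_s = pd.to_timedelta(freq["end_time"]).dt.total_seconds().to_numpy().astype("int64")
    headway_s = freq["headway_secs"].to_numpy().astype("int64")
    if (headway_s <= 0).any():
        raise ValueError("headway_secs must be positive")

    # Runs per window with end_time exclusive: ceil((end - start) / headway)
    n_runs = np.maximum(-(-(end_s - start_s) // headway_s), 0)

    # Repeat each window once per run and number the runs within a window
    window = np.repeat(np.arange(len(freq)), n_runs)
    first_row = np.repeat(np.cumsum(n_runs) - n_runs, n_runs)
    run_no = np.arange(len(window)) - first_row
    runs = pd.DataFrame({
        "trip_id": freq["trip_id"].to_numpy()[window],
        "run_start_s": start_s[window] + run_no * headway_s[window],
    })

    # Offset of every stop from the first departure of its template trip
    template = stop_times[["trip_id", "stop_id", "stop_sequence"]].copy()
    dep_s = pd.to_timedelta(stop_times["departure_time"]).dt.total_seconds()
    template["offset_s"] = dep_s - dep_s.groupby(stop_times["trip_id"]).transform("min")

    # One row per (run, stop): run start plus the stop's template offset
    out = runs.merge(template, on="trip_id", how="inner")
    out["departure_time"] = pd.to_timedelta(out["run_start_s"] + out["offset_s"], unit="s")
    return out[["trip_id", "stop_id", "stop_sequence", "departure_time"]]


# Invented example: a two-stop template run every 10 minutes for half an hour
demo_stop_times = pd.DataFrame({
    "trip_id": ["F1", "F1"],
    "stop_id": ["S1", "S2"],
    "stop_sequence": [1, 2],
    "departure_time": ["08:00:00", "08:05:00"],
})
demo_frequencies = pd.DataFrame({
    "trip_id": ["F1"],
    "start_time": ["06:00:00"],
    "end_time": ["06:30:00"],
    "headway_secs": [600],
})
expanded = expand_vectorized(demo_frequencies, demo_stop_times)
print(expanded)
```

The merge multiplies runs by stops, so memory grows with the number of generated departures; for very dense feeds it may be worth expanding one route or service day at a time.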

Step 3: Merge, Deduplicate, & Resolve Overlaps

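Once frequencies are expanded, they must be merged with the timetable rows from stop_times.txt. Collisions occur when agencies publish redundant rows or overlapping frequency windows for the same trip ID, or when a timetable row and a frequency expansion describe the same departure. A common policy is to prioritize the timetable during peak hours and fall back to frequency expansions during off-peak windows. The function below covers the baseline: it removes exact duplicates and reports how many it found so they can be reviewed.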

python
def merge_and_deduplicate(timetable_df, expanded_freq_df):
    # Combine both datasets
    combined = pd.concat([timetable_df, expanded_freq_df], ignore_index=True)

    # stop_times.txt stores HH:MM:SS strings while the expanded rows hold
    # timedeltas; convert to one dtype so sorting and comparison work
    combined["departure_time"] = pd.to_timedelta(combined["departure_time"])

    # Sort by trip_id and departure_time for consistent ordering
    combined = combined.sort_values(["trip_id", "departure_time"])

    # Count colliding rows before dropping them so they can be flagged for QA
    key = ["trip_id", "stop_id", "departure_time"]
    n_collisions = int(combined.duplicated(subset=key).sum())
    if n_collisions:
        print(f"Warning: {n_collisions} duplicate departures detected in merged dataset.")

    # Remove exact duplicate rows (common when agencies redundantly publish)
    combined = combined.drop_duplicates(subset=key)

    return combined
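
If a pipeline needs timetable rows to win whenever they collide with frequency expansions, one option is to rank rows by schedule type before dropping duplicates. The sketch below assumes a schedule_type column holding "timetable" or "frequency"; the PRIORITY mapping and the prefer_timetable helper are hypothetical names, and the sample rows are invented.

```python
import pandas as pd

# Lower rank wins when two rows share the same key (assumed policy)
PRIORITY = {"timetable": 0, "frequency": 1}


def prefer_timetable(df, key=("trip_id", "stop_id", "departure_time")):
    """Keep one row per key, preferring timetable rows over frequency rows."""
    ranked = df.assign(_rank=df["schedule_type"].map(PRIORITY))
    ranked = ranked.sort_values([*key, "_rank"])
    deduped = ranked.drop_duplicates(subset=list(key), keep="first")
    return deduped.drop(columns="_rank")


rows = pd.DataFrame({
    "trip_id": ["T1", "T1", "T1"],
    "stop_id": ["S1", "S1", "S1"],
    "departure_time": pd.to_timedelta(["08:00:00", "08:00:00", "08:10:00"]),
    "schedule_type": ["frequency", "timetable", "frequency"],
})
resolved = prefer_timetable(rows)
print(resolved)
```

A time-of-day rule (for example, applying the preference only in peak windows) could be layered on by filtering rows before calling the helper.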

Step 4: Normalize Day-Wrap & Timezone Alignment

GTFS represents times past midnight as 24:00:00, 25:30:00, etc. Routing engines require continuous POSIX timestamps. We’ll convert HH:MM:SS strings to timedelta, add them to a base service date, and normalize to UTC using the agency timezone.

python
def normalize_times(df, service_date_str="2024-01-01", agency_tz="America/New_York"):
    # GTFS measures times from "noon minus 12h" on the service day, which
    # differs from local midnight on days with a daylight saving change
    noon = pd.Timestamp(f"{service_date_str} 12:00:00", tz=agency_tz)
    base_date = noon - pd.Timedelta(hours=12)

    df = df.copy()  # leave the caller's DataFrame untouched

    # Parse departure_time (strings or timedeltas) to timedelta
    df["departure_td"] = pd.to_timedelta(df["departure_time"])

    # Add to base date; values past 24:00:00 roll into the next day
    df["departure_utc"] = base_date + df["departure_td"]

    # Convert to UTC for cross-agency consistency
    df["departure_utc"] = df["departure_utc"].dt.tz_convert("UTC")

    # Drop intermediate columns
    return df.drop(columns=["departure_time", "departure_td"])
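
The choice of base instant matters on daylight saving transitions, because GTFS measures times from noon minus 12 hours on the service day rather than from local midnight. The sketch below compares both anchors for a made-up 08:00:00 departure on 2024-03-10, when clocks in America/New_York move forward; the variable names are illustrative.

```python
import pandas as pd

tz = "America/New_York"
service_date = "2024-03-10"  # clocks spring forward at 02:00 local time
gtfs_time = pd.to_timedelta("08:00:00")

# Anchor 1: local midnight of the service day
midnight_base = pd.Timestamp(service_date, tz=tz)

# Anchor 2: noon minus 12 hours, the reference GTFS defines for its times
noon_base = pd.Timestamp(f"{service_date} 12:00:00", tz=tz) - pd.Timedelta(hours=12)

from_midnight = midnight_base + gtfs_time
from_noon = noon_base + gtfs_time

print(from_midnight)  # lands at 09:00 local, one hour late
print(from_noon)      # lands at 08:00 local, as published
```

On days without a clock change the two anchors coincide, so the difference only shows up twice a year, which makes it easy to miss in testing.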

Validation & Data Quality Checks

Normalized schedules must pass structural validation before downstream consumption. Common failure points include negative headway values, missing stop_sequence continuity, and overlapping service windows. Implement automated assertions to catch these early.

python
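def validate_normalized_schedule(df):
    # Compare consecutive departures within each trip only; a plain diff()
    # over the whole table would also compare across trip boundaries.
    # Assumes rows are ordered as returned by merge_and_deduplicate.
    previous = df.groupby("trip_id")["departure_utc"].shift()
    per_trip_diff = df["departure_utc"] - previous

    checks = {
        "missing_departures": int(df["departure_utc"].isna().sum()),
        "negative_headways": int((per_trip_diff < pd.Timedelta(0)).sum()),
        "duplicate_trip_stops": int(
            df.duplicated(subset=["trip_id", "stop_id", "departure_utc"]).sum()
        ),
    }

    for metric, count in checks.items():
        if count > 0:
            print(f"[WARN] {metric}: {count} records flagged")

    return all(v == 0 for v in checks.values())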
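
Some problems are easier to catch in frequencies.txt itself, before expansion. The following sketch counts non-positive headways, empty windows, and overlapping windows per trip; the function name check_frequencies and the sample rows are hypothetical.

```python
import pandas as pd


def check_frequencies(frequencies: pd.DataFrame) -> dict:
    """Count structural problems in a frequencies.txt table."""
    f = frequencies.copy()
    f["start_td"] = pd.to_timedelta(f["start_time"])
    f["end_td"] = pd.to_timedelta(f["end_time"])
    f = f.sort_values(["trip_id", "start_td"])

    # End of the previous window for the same trip (NaT for the first one)
    prev_end = f.groupby("trip_id")["end_td"].shift()

    return {
        "non_positive_headway": int((f["headway_secs"] <= 0).sum()),
        "empty_window": int((f["end_td"] <= f["start_td"]).sum()),
        "overlapping_windows": int((f["start_td"] < prev_end).sum()),
    }


# Invented rows: F1 has overlapping windows, F2 has an empty window and zero headway
sample = pd.DataFrame({
    "trip_id": ["F1", "F1", "F2"],
    "start_time": ["06:00:00", "08:30:00", "10:00:00"],
    "end_time": ["09:00:00", "10:00:00", "10:00:00"],
    "headway_secs": [600, 900, 0],
})
report = check_frequencies(sample)
print(report)
```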

For teams managing recurring feed drops, integrating validation into an update pipeline is critical. See our workflow on Automating Feed Updates with GTFS-Kit for production deployment patterns.

Scaling for Multi-Agency Pipelines

When processing multi-agency feeds or historical archives, memory constraints become the primary bottleneck. Exporting the normalized departure table to Parquet, partitioned by agency_id and service_date, enables downstream analytics without reloading raw text files. Column pruning during ingestion and explicit dtype casting can reduce peak RAM usage by 40–60% compared to default pandas behavior.
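
As a rough sketch of those two techniques, the snippet below prunes columns and casts dtypes while reading stop_times.txt, then defines a partitioned Parquet export. The inline CSV, the export_partitioned helper, and the assumption that the table already carries agency_id and service_date columns are all illustrative; the export relies on pyarrow being installed.

```python
import io

import pandas as pd

# Stand-in for open("stop_times.txt"); rows are invented for illustration
stop_times_csv = io.StringIO(
    "trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,shape_dist_traveled\n"
    "T1,08:00:00,08:00:00,S1,1,Downtown,0.0\n"
    "T1,08:05:00,08:05:00,S2,2,Downtown,1.2\n"
)

# Read only the columns the pipeline uses and store repeated IDs as categories
stop_times = pd.read_csv(
    stop_times_csv,
    usecols=["trip_id", "departure_time", "stop_id", "stop_sequence"],
    dtype={
        "trip_id": "category",
        "stop_id": "category",
        "departure_time": str,
        "stop_sequence": "int32",
    },
)
print(stop_times.dtypes)


def export_partitioned(df: pd.DataFrame, root_dir: str) -> None:
    """Write one Parquet directory per (agency_id, service_date) pair."""
    df.to_parquet(
        root_dir,
        engine="pyarrow",
        partition_cols=["agency_id", "service_date"],
        index=False,
    )


# Example call, assuming normalized_df has agency_id and service_date columns:
# export_partitioned(normalized_df, "departures_parquet")
```

Actual memory savings depend on how repetitive the ID columns are, so it is worth measuring with DataFrame.memory_usage(deep=True) on a representative feed.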

For a deep dive into the mathematical expansion logic and edge-case handling, refer to our companion guide on Converting GTFS Frequency.txt to Exact Departure Times. By standardizing both schedule types into a single, time-aligned departure matrix, your routing algorithms, isochrone generators, and performance dashboards will operate on consistent, production-grade data.