GTFS Validation Rules and Common Schema Errors

Public transit data pipelines fail silently when schema violations slip through ingestion. For mobility platform teams, transit analysts, and Python GIS engineers, catching these anomalies before they reach routing engines, fare calculators, or passenger-facing applications is non-negotiable. This guide breaks down GTFS validation rules and common schema errors using production-tested Python patterns. Building on the architectural principles outlined in GTFS Feed Architecture & Fundamentals, we will move from theoretical compliance to executable validation workflows that scale across enterprise data ecosystems.

Prerequisites for Automated GTFS Validation

Before implementing validation logic, ensure your environment meets the following baseline requirements:

  • Python 3.9+ with pandas>=2.0, pydantic>=2.0, and zipfile (standard library)
  • Working knowledge of relational referential integrity (primary keys, foreign keys, cardinality constraints)
  • Access to a raw GTFS static feed (ZIP archive containing .txt files)
  • Familiarity with the official specification published by MobilityData, which defines mandatory files, field types, and business rules: GTFS Specification Reference

Validation is not a single script; it is a layered process. Schema validation catches structural violations, referential checks enforce relational integrity, and business rule validation ensures operational plausibility. Skipping any layer introduces silent data degradation that compounds downstream.

The Validation Pipeline Architecture

A robust GTFS validation pipeline follows a deterministic sequence. Deviating from this order often produces false positives, masks root causes, or triggers cascading failures during DataFrame joins.

1. Archive Extraction & File Discovery

Unpack the ZIP archive and verify the presence of core files (agency.txt, stops.txt, routes.txt, trips.txt, stop_times.txt, and at least one of calendar.txt or calendar_dates.txt). Missing mandatory files should trigger immediate pipeline failure. Python’s built-in zipfile module provides reliable extraction, but you must handle encoding edge cases (most often UTF-8 with a byte-order mark) and confirm that the feed files sit at the root level of the archive. Nested directories frequently break downstream parsers, and unexpected extra files are worth flagging. For implementation details on safe archive handling, consult the official Python zipfile documentation.
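
The discovery step can be sketched with the standard library alone. This is a minimal illustration, not a complete validator: the function names (discover_feed_files, read_text_member) and the report keys are assumptions made for this example, and the required-file set simply mirrors the list above.

```python
import io
import zipfile

# Assumed core file set, mirroring the list in the text above.
REQUIRED_FILES = {"agency.txt", "stops.txt", "routes.txt", "trips.txt", "stop_times.txt"}
CALENDAR_FILES = {"calendar.txt", "calendar_dates.txt"}  # at least one must exist


def discover_feed_files(archive) -> dict:
    """Inspect a GTFS ZIP (path or file-like object) without extracting it."""
    with zipfile.ZipFile(archive) as zf:
        names = [n for n in zf.namelist() if not n.endswith("/")]
    root_files = {n for n in names if "/" not in n}
    missing = sorted(REQUIRED_FILES - root_files)
    if not (CALENDAR_FILES & root_files):
        missing.append("calendar.txt or calendar_dates.txt")
    return {
        "missing": missing,
        "nested": sorted(n for n in names if "/" in n),
        "non_txt": sorted(n for n in root_files if not n.endswith(".txt")),
    }


def read_text_member(archive, member: str) -> str:
    """Decode a member as UTF-8, dropping a leading byte-order mark if present."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(member).decode("utf-8-sig")


# Demo on an in-memory archive with a BOM, a nested file, and a stray file.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("agency.txt", b"\xef\xbb\xbfagency_name\n")
    zf.writestr("stops.txt", "stop_id\n")
    zf.writestr("sub/routes.txt", "route_id\n")
    zf.writestr("notes.md", "scratch")
demo.seek(0)
discovery = discover_feed_files(demo)
agency_text = read_text_member(demo, "agency.txt")
```

A non-empty "missing" list would be the signal to fail the pipeline immediately, while "nested" and "non_txt" entries are better treated as warnings.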

2. Schema & Type Validation

Load each .txt file into a structured DataFrame. Validate column presence, data types, and allowed enumerations. This stage aligns with the structural expectations detailed in Understanding GTFS Static Feed Structure. Common failures here include:

  • Missing required columns: agency.txt lacking agency_name or agency_timezone (agency_id is only required when the feed contains more than one agency).
  • Type mismatches: stop_lat or stop_lon parsed as strings instead of floats because of stray whitespace, non-numeric placeholder values, or locale-specific decimal separators.
  • Invalid enumerations: location_type in stops.txt containing values outside the 0–4 range, or wheelchair_boarding using non-standard flags.

Using pydantic models to define expected schemas before DataFrame ingestion catches these issues early, preventing silent coercion errors that corrupt geospatial indexing.
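
Column-presence and enumeration checks can also run directly on a DataFrame before any model is built. The sketch below reads every column as a string so that nothing is coerced silently. The rule tables (REQUIRED_COLUMNS, ALLOWED_VALUES) cover only a small illustrative subset, and check_schema is a hypothetical helper name.

```python
import io

import pandas as pd

# Illustrative subset of rules; extend per file as needed.
REQUIRED_COLUMNS = {
    "agency.txt": {"agency_name", "agency_url", "agency_timezone"},
    "stops.txt": {"stop_id"},
}
# Empty string is allowed because both fields are optional.
ALLOWED_VALUES = {
    "location_type": {"", "0", "1", "2", "3", "4"},
    "wheelchair_boarding": {"", "0", "1", "2"},
}


def check_schema(file_name: str, df: pd.DataFrame) -> list[str]:
    """Report missing required columns and out-of-enum values."""
    problems = []
    required = REQUIRED_COLUMNS.get(file_name, set())
    for column in sorted(required - set(df.columns)):
        problems.append(f"{file_name}: missing required column '{column}'")
    for column, allowed in ALLOWED_VALUES.items():
        if column not in df.columns:
            continue
        invalid = df.loc[~df[column].isin(allowed), column]
        for row_index, value in invalid.items():
            problems.append(
                f"{file_name} row {row_index}: {column}={value!r} is not an allowed value"
            )
    return problems


raw_stops = "stop_id,stop_name,location_type\nS1,Main St,0\nS2,Depot,7\nS3,Park,\n"
# dtype=str and keep_default_na=False keep the raw text exactly as published.
stops = pd.read_csv(io.StringIO(raw_stops), dtype=str, keep_default_na=False)
stop_problems = check_schema("stops.txt", stops)
agency_problems = check_schema(
    "agency.txt", pd.DataFrame(columns=["agency_name", "agency_url"])
)
```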

3. Referential Integrity Checks

Cross-reference foreign keys across files. Every stop_id in stop_times.txt must exist in stops.txt. Every trip_id must map to a valid route_id and service_id. This relational mapping is where most enterprise pipelines break. Orphaned records, duplicated primary keys, and mismatched cardinalities cascade into routing failures. For a deeper dive into how these tables interact, see Mastering stops.txt and stop_times.txt Relationships. Implementing set-based joins or pandas merge operations with indicator=True flags quickly surfaces missing or extraneous references.
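
The merge-with-indicator approach might look like the following. This is a minimal sketch: find_orphans is a hypothetical helper, and the tiny DataFrames stand in for real stops.txt and stop_times.txt data.

```python
import pandas as pd


def find_orphans(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return rows of `child` whose `key` has no match in `parent`."""
    # De-duplicate parent keys so the left join cannot multiply child rows.
    parent_keys = parent[[key]].drop_duplicates()
    merged = child.merge(parent_keys, on=key, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


stops = pd.DataFrame({"stop_id": ["S1", "S2"]})
stop_times = pd.DataFrame(
    {"trip_id": ["T1", "T1", "T2"], "stop_id": ["S1", "S9", "S2"]}
)
orphans = find_orphans(stop_times, stops, "stop_id")
```

The same helper can be reused for trips.txt against routes.txt (key route_id). Swapping the arguments surfaces parent rows that nothing references, which is usually a warning rather than an error.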

4. Business Rule & Plausibility Validation

Enforce domain-specific constraints: valid WGS84 coordinate ranges, non-negative travel times, chronological stop sequences, and timezone-aware schedule normalization. Within a single stop_times.txt row, arrival_time must be less than or equal to departure_time, because a vehicle cannot depart a stop before it arrives. Across a trip, stop_sequence values must be unique and increase along the route (they need not be consecutive), and times must never decrease from one stop to the next. These rules prevent logical impossibilities that routing engines will otherwise interpret as infinite loops or negative travel durations.
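
A vectorized sketch of the stop_times rules is shown below. It assumes times have already been converted to integer seconds in two hypothetical columns, arrival_secs and departure_secs. The flag column names are likewise invented for this example.

```python
import pandas as pd


def check_stop_time_rules(stop_times: pd.DataFrame) -> pd.DataFrame:
    """Return rows that break ordering rules, with one boolean column per rule."""
    st = stop_times.sort_values(["trip_id", "stop_sequence"]).copy()
    # Departure time of the previous stop on the same trip (NaN for the first stop).
    prev_departure = st.groupby("trip_id")["departure_secs"].shift(1)
    st["departs_before_arrival"] = st["arrival_secs"] > st["departure_secs"]
    st["time_goes_backwards"] = st["arrival_secs"] < prev_departure
    st["duplicate_sequence"] = st.duplicated(["trip_id", "stop_sequence"], keep=False)
    flags = ["departs_before_arrival", "time_goes_backwards", "duplicate_sequence"]
    return st[st[flags].any(axis=1)]


stop_times = pd.DataFrame(
    {
        "trip_id": ["T1", "T1", "T1", "T2", "T2"],
        "stop_sequence": [1, 2, 3, 1, 1],
        "arrival_secs": [100, 90, 200, 0, 50],
        "departure_secs": [100, 120, 190, 0, 50],
    }
)
violations = check_stop_time_rules(stop_times)
```

Here row 1 arrives before the previous stop's departure, row 2 departs before it arrives, and rows 3 and 4 share a stop_sequence within trip T2.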

5. Error Aggregation & Reporting

Compile violations into a structured JSON/CSV report. Categorize by severity (error, warning, info) and attach actionable metadata: file name, row index, violated rule ID, and suggested remediation. Standardized reporting enables automated alerting, feed publisher feedback loops, and compliance tracking. Tools like MobilityData’s open-source gtfs-validator can be integrated for cross-verification, but custom Python pipelines offer greater flexibility for enterprise-specific business logic and CI/CD gating.
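
One possible shape for such a report, using only the standard library, is sketched here. The Violation fields follow the metadata listed above. The rule IDs and remediation strings are made up for illustration and are not taken from any existing validator.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass
class Violation:
    severity: str  # "error", "warning", or "info"
    file_name: str
    row_index: int
    rule_id: str  # hypothetical rule identifier
    remediation: str


def build_report(violations: list) -> dict:
    """Group violations into a JSON-serializable report with a severity summary."""
    by_severity = Counter(v.severity for v in violations)
    return {
        "summary": dict(by_severity),
        "violations": [asdict(v) for v in violations],
    }


found = [
    Violation("error", "stop_times.txt", 41, "FK_STOP_ID",
              "Add the stop to stops.txt or remove the row."),
    Violation("warning", "stops.txt", 7, "COORD_NEAR_ORIGIN",
              "Confirm the coordinates with the publisher."),
]
report = build_report(found)
report_json = json.dumps(report, indent=2)
```

A CI gate could then fail the build whenever report["summary"].get("error", 0) is greater than zero.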

Common Schema Errors & Mitigation Strategies

Real-world GTFS feeds rarely conform perfectly to the specification. Below are the most frequent violations encountered in production environments and how to resolve them programmatically.

1. Duplicate Primary Keys

  • Symptom: Multiple rows in stops.txt or routes.txt share the same stop_id or route_id.
  • Impact: Ambiguous joins, unpredictable routing behavior, and silent overwrites during ingestion.
  • Fix: Deduplicate using deterministic rules (e.g., keep the row from the highest-priority source, or the most recently ingested row if your pipeline records an ingestion timestamp; GTFS itself defines no last_modified field). Log duplicates for manual review and enforce unique constraints at the database level post-validation.
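
Surfacing every member of a duplicate group, rather than only the second occurrence, makes manual review easier. A minimal pandas sketch follows, with find_duplicate_keys as a hypothetical helper name.

```python
import pandas as pd


def find_duplicate_keys(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return all rows whose `key` value appears more than once."""
    # keep=False marks every occurrence, not just the repeats.
    dupes = df[df.duplicated(subset=[key], keep=False)]
    return dupes.sort_values(key, kind="stable")


stops = pd.DataFrame(
    {
        "stop_id": ["S1", "S2", "S1", "S3"],
        "stop_name": ["Main St", "Depot", "Main Street", "Park"],
    }
)
duplicates = find_duplicate_keys(stops, "stop_id")
```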

2. Invalid Coordinate Ranges

  • Symptom: stop_lat falls outside [-90, 90], stop_lon falls outside [-180, 180], or the two axes are swapped.
  • Impact: Geospatial queries fail, map rendering breaks, and distance calculations return NaN or nonsensical values.
  • Fix: Apply strict range validation during ingestion. Use pyproj or shapely for coordinate system verification if feeds claim non-WGS84 projections (though GTFS mandates WGS84). Flag coordinates near 0,0 as likely null-value placeholders.
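
Beyond plain range checks, two cheap heuristics catch placeholder and swapped coordinates. The sketch below is an assumption-laden illustration: the 0.001-degree tolerance is arbitrary, and the swap test only fires when the value stored as latitude exceeds 90 in magnitude while the stored longitude would still be a legal latitude.

```python
import pandas as pd


def flag_suspicious_coordinates(stops: pd.DataFrame, tolerance: float = 0.001) -> pd.DataFrame:
    """Flag stops near (0, 0) and stops whose axes look swapped."""
    lat, lon = stops["stop_lat"], stops["stop_lon"]
    flags = pd.DataFrame(
        {
            "stop_id": stops["stop_id"],
            "near_origin": lat.abs().lt(tolerance) & lon.abs().lt(tolerance),
            "possibly_swapped": lat.abs().gt(90) & lon.abs().le(90),
        }
    )
    return flags[flags["near_origin"] | flags["possibly_swapped"]]


stops = pd.DataFrame(
    {
        "stop_id": ["S1", "S2", "S3"],
        "stop_lat": [52.52, 0.0, 139.69],  # S3 has latitude and longitude swapped
        "stop_lon": [13.40, 0.0, 35.68],
    }
)
suspicious = flag_suspicious_coordinates(stops)
```

Swaps in regions where both values stay within ±90 cannot be detected this way. Comparing stops against the agency's expected bounding box is a stronger check.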

3. Time Format Violations & Day Rollover

  • Symptom: stop_times.txt contains values that are not valid HH:MM:SS times (H:MM:SS is also accepted), or post-midnight stops wrap back to 00:xx:xx instead of continuing past 24:00:00 on the same service day.
  • Impact: Schedule normalization fails, causing trips to appear on the wrong calendar day or to travel backwards in time.
  • Fix: Parse times as strings first, then convert to pandas.Timedelta to handle values of 24 hours or more correctly. GTFS explicitly allows times beyond 24:00:00 to represent service that continues past midnight, so parsers must treat them as offsets from the start of the service day rather than as standard clock time.

4. Missing Calendar Coverage

  • Symptom: calendar.txt defines service days, but calendar_dates.txt lacks exceptions for holidays, or vice versa.
  • Impact: Routing engines assume daily service when none exists, or drop valid trips due to missing date coverage.
  • Fix: Validate that every service_id in trips.txt has at least one active day in either calendar.txt or calendar_dates.txt. Cross-check against known holiday schedules and generate coverage matrices to identify service gaps.
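
The first half of that fix, confirming that each service_id has some active day, can be sketched as follows. The helper name services_without_coverage is hypothetical. The check is deliberately coarse: it does not verify that removal exceptions (exception_type 2) leave at least one date standing, and it ignores start_date and end_date.

```python
import pandas as pd

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def services_without_coverage(
    trips: pd.DataFrame, calendar: pd.DataFrame, calendar_dates: pd.DataFrame
) -> list:
    """List service_ids used by trips that have no weekly pattern and no added dates."""
    weekly = set(calendar.loc[calendar[WEEKDAYS].sum(axis=1) > 0, "service_id"])
    # exception_type 1 adds service on a date; 2 removes it.
    added = set(calendar_dates.loc[calendar_dates["exception_type"] == 1, "service_id"])
    return sorted(set(trips["service_id"]) - weekly - added)


calendar = pd.DataFrame(
    [["WK", 1, 1, 1, 1, 1, 0, 0], ["DEAD", 0, 0, 0, 0, 0, 0, 0]],
    columns=["service_id"] + WEEKDAYS,
)
calendar_dates = pd.DataFrame(
    {
        "service_id": ["HOL", "DEAD"],
        "date": ["20250101", "20250102"],
        "exception_type": [1, 2],
    }
)
trips = pd.DataFrame({"trip_id": ["T1", "T2", "T3", "T4"],
                      "service_id": ["WK", "HOL", "DEAD", "GHOST"]})
uncovered = services_without_coverage(trips, calendar, calendar_dates)
```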

Production-Ready Python Validation Patterns

Implementing these checks requires a balance of performance and readability. Below is a minimal, production-grade pattern using pandas and pydantic for schema validation.

python
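import pandas as pd
from pydantic import BaseModel, ValidationError, field_validator
from typing import Optional

class StopSchema(BaseModel):
    stop_id: str
    stop_name: str
    stop_lat: float
    stop_lon: float
    location_type: Optional[int] = 0

    @field_validator("stop_lat")
    @classmethod
    def validate_latitude(cls, v: float) -> float:
        # WGS84 latitude is bounded by the poles.
        if not (-90.0 <= v <= 90.0):
            raise ValueError(f"Latitude out of WGS84 range: {v}")
        return v

    @field_validator("stop_lon")
    @classmethod
    def validate_longitude(cls, v: float) -> float:
        # Longitude spans -180..180, so it needs its own bounds.
        if not (-180.0 <= v <= 180.0):
            raise ValueError(f"Longitude out of WGS84 range: {v}")
        return v

def validate_stops(df: pd.DataFrame) -> list[str]:
    """Validate each row against StopSchema and collect readable error messages.

    Read ID columns with dtype=str; pydantic v2 does not coerce integers to strings.
    """
    errors = []
    for idx, row in df.iterrows():
        # Drop missing cells so optional fields fall back to their defaults
        # instead of failing on NaN.
        record = {k: v for k, v in row.to_dict().items() if pd.notna(v)}
        try:
            StopSchema(**record)
        except ValidationError as e:
            errors.append(f"Row {idx}: {e}")
    return errors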

While row-by-row validation works for small feeds, enterprise pipelines should leverage vectorized operations to avoid Python-level iteration overhead:

python
def vectorized_coordinate_check(df: pd.DataFrame) -> pd.DataFrame:
    mask = (df["stop_lat"].between(-90, 90)) & (df["stop_lon"].between(-180, 180))
    return df[~mask].copy()

Vectorization can cut execution time by orders of magnitude on multi-million-row tables such as a large stop_times.txt, because the comparison runs in compiled code rather than in a Python loop. For memory-constrained environments, process feeds in chunks using pd.read_csv(..., chunksize=50000) or consider Polars, whose lazy API supports streaming execution on data that does not fit in memory. For comprehensive implementation strategies, review How to Validate a GTFS Feed with Python, which covers batch processing, memory optimization, and CI/CD integration.
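
Chunked processing combines naturally with the vectorized range check. The sketch below assumes a CSV source with stop_lat and stop_lon columns; invalid_coordinates_chunked is a hypothetical helper, and the tiny chunk size in the demo exists only to force several chunks.

```python
import io

import pandas as pd


def invalid_coordinates_chunked(source, chunksize: int = 50_000) -> pd.DataFrame:
    """Stream a CSV in chunks and collect only rows with out-of-range coordinates."""
    flagged = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        in_range = chunk["stop_lat"].between(-90, 90) & chunk["stop_lon"].between(-180, 180)
        bad = chunk[~in_range]
        if not bad.empty:
            flagged.append(bad)
    # Only the offending rows are held in memory, never the whole file.
    return pd.concat(flagged) if flagged else pd.DataFrame()


raw = (
    "stop_id,stop_lat,stop_lon\n"
    "S1,10.0,20.0\n"
    "S2,95.0,20.0\n"
    "S3,-45.0,170.0\n"
    "S4,0.5,0.5\n"
    "S5,45.0,-190.0\n"
)
bad_rows = invalid_coordinates_chunked(io.StringIO(raw), chunksize=2)
```

Checks that need the whole table at once, such as duplicate detection or per-trip ordering, do not chunk this cleanly and need either a keyed accumulator or an engine that can spill to disk.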

Scaling Validation in Enterprise Workflows

As transit networks grow, validation must transition from ad-hoc scripts to governed pipelines. Key architectural considerations include:

  • Idempotent Processing: Ensure validation runs can be re-executed without side effects. Cache intermediate DataFrames and use transactional writes for error logs.
  • Incremental Validation: For feeds that update daily, diff against the previous version to isolate regressions rather than reprocessing the entire archive. Hash-based change detection minimizes compute costs.
  • Schema Versioning: GTFS evolves. Track feed specification versions and map validation rules to specific spec releases to avoid breaking changes when publishers adopt newer parts of the static specification (for example, GTFS-Fares v2 or GTFS-Flex) or other new fields.
  • Observability: Integrate validation metrics into monitoring dashboards. Track error rates, feed size, processing duration, and severity distributions over time. Alert on threshold breaches before feeds reach production routing engines.
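
Hash-based change detection, mentioned under incremental validation, can be sketched with hashlib and zipfile. The function names (file_digests, changed_files) and the in-memory demo feeds are assumptions for this illustration. A real pipeline would persist the previous digests between runs.

```python
import hashlib
import io
import zipfile


def file_digests(archive) -> dict:
    """Map each file in a GTFS ZIP to the SHA-256 digest of its contents."""
    with zipfile.ZipFile(archive) as zf:
        return {
            name: hashlib.sha256(zf.read(name)).hexdigest()
            for name in zf.namelist()
            if not name.endswith("/")
        }


def changed_files(previous: dict, current: dict) -> list:
    """Return files that were added, removed, or modified between two feeds."""
    names = previous.keys() | current.keys()
    return sorted(n for n in names if previous.get(n) != current.get(n))


def _make_feed(files: dict) -> io.BytesIO:
    """Build an in-memory ZIP for the demo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, body in files.items():
            zf.writestr(name, body)
    buf.seek(0)
    return buf


yesterday = _make_feed({"stops.txt": "stop_id\nS1\n", "routes.txt": "route_id\nR1\n"})
today = _make_feed(
    {"stops.txt": "stop_id\nS1\nS2\n", "routes.txt": "route_id\nR1\n", "shapes.txt": "shape_id\n"}
)
changed = changed_files(file_digests(yesterday), file_digests(today))
```

Only the files in the changed list need their single-file checks re-run, although referential checks should still cover every table that joins against a changed file.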

Implementing a centralized validation service decouples quality checks from ingestion and routing. This separation of concerns allows mobility teams to enforce data contracts, publish compliance reports, and provide actionable feedback to feed publishers. When combined with automated testing and continuous integration, validation becomes a proactive quality gate rather than a reactive debugging exercise.

Conclusion

GTFS validation is the foundation of reliable transit data infrastructure. By enforcing strict schema rules, verifying referential integrity, and applying domain-specific business logic, engineering teams can prevent silent failures and ensure passenger-facing applications deliver accurate schedule information. Adopting standardized Python patterns and integrating validation into CI/CD pipelines transforms compliance from a manual chore into an automated, scalable workflow. As mobility ecosystems expand, disciplined validation practices will remain the critical differentiator between resilient transit platforms and brittle data pipelines.