GTFS Validation Rules and Common Schema Errors
Public transit data pipelines fail silently when schema violations slip through ingestion. For mobility platform teams, transit analysts, and Python GIS engineers, catching these anomalies before they reach routing engines, fare calculators, or passenger-facing applications is non-negotiable. This guide breaks down GTFS validation rules and common schema errors using production-tested Python patterns. Building on the architectural principles outlined in GTFS Feed Architecture & Fundamentals, we will move from theoretical compliance to executable validation workflows that scale across enterprise data ecosystems.
Prerequisites for Automated GTFS Validation
Before implementing validation logic, ensure your environment meets the following baseline requirements:
- Python 3.9+ with pandas>=2.0, pydantic>=2.0, and zipfile (standard library)
- Working knowledge of relational referential integrity (primary keys, foreign keys, cardinality constraints)
- Access to a raw GTFS static feed (ZIP archive containing .txt files)
- Familiarity with the official specification published by MobilityData, which defines mandatory files, field types, and business rules: GTFS Specification Reference
Validation is not a single script; it is a layered process. Schema validation catches structural violations, referential checks enforce relational integrity, and business rule validation ensures operational plausibility. Skipping any layer introduces silent data degradation that compounds downstream.
The Validation Pipeline Architecture
A robust GTFS validation pipeline follows a deterministic sequence. Deviating from this order often produces false positives, masks root causes, or triggers cascading failures during DataFrame joins.
1. Archive Extraction & File Discovery
Unpack the ZIP archive and verify the presence of core files (agency.txt, stops.txt, routes.txt, trips.txt, stop_times.txt, and at least one of calendar.txt or calendar_dates.txt). Missing mandatory files should trigger immediate pipeline failure. Python’s built-in zipfile module provides reliable extraction, but you must handle encoding edge cases (often UTF-8-BOM) and validate that the .txt files sit at the root level of the archive rather than inside a subdirectory. Nested directories or extraneous files frequently break downstream parsers. For implementation details on safe archive handling, consult the official Python zipfile documentation.
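The file-discovery step can be sketched with the standard library alone. This is a minimal illustration, not a full extractor: the function name `discover_feed_files` and the shape of its return value are assumptions made for this example.

```python
import io
import zipfile

# Files treated as mandatory in this sketch. The calendar files are handled
# separately because either one of them can define service on its own.
REQUIRED_FILES = {"agency.txt", "stops.txt", "routes.txt", "trips.txt", "stop_times.txt"}
CALENDAR_FILES = {"calendar.txt", "calendar_dates.txt"}


def discover_feed_files(archive) -> dict:
    """Report missing mandatory files and entries nested below the archive root.

    `archive` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        # Skip pure directory entries, which end with a slash.
        names = [n for n in zf.namelist() if not n.endswith("/")]
    root_names = {n for n in names if "/" not in n}
    missing = sorted(REQUIRED_FILES - root_names)
    if not (CALENDAR_FILES & root_names):
        missing.append("calendar.txt or calendar_dates.txt")
    nested = sorted(n for n in names if "/" in n)
    return {"missing": missing, "nested": nested}
```

A pipeline would typically abort when `missing` is non-empty and log `nested` entries as a structural warning.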
2. Schema & Type Validation
Load each .txt file into a structured DataFrame. Validate column presence, data types, and allowed enumerations. This stage aligns with the structural expectations detailed in Understanding GTFS Static Feed Structure. Common failures here include:
- Missing required columns: agency.txt lacking agency_timezone, or lacking agency_id in a feed that defines more than one agency.
- Type mismatches: stop_lat or stop_lon parsed as strings instead of floats due to malformed CSV headers or stray non-numeric values.
- Invalid enumerations: location_type in stops.txt containing values outside the 0–4 range, or wheelchair_boarding using non-standard flags.
Using pydantic models to define expected schemas before DataFrame ingestion catches these issues early, preventing silent coercion errors that corrupt geospatial indexing.
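Before any model-based validation, a cheap structural pass over the raw columns can surface the failures listed above. The sketch below reads everything as strings so nothing is coerced silently. The function name `check_stops_schema` is hypothetical, and treating stop_name, stop_lat, and stop_lon as always required is a simplifying assumption (the specification makes them conditional on location_type).

```python
import io

import pandas as pd

# Assumption for this sketch: treat these columns as always required.
REQUIRED_STOP_COLUMNS = {"stop_id", "stop_name", "stop_lat", "stop_lon"}
# Empty is allowed because location_type is optional and defaults to 0.
ALLOWED_LOCATION_TYPES = {"", "0", "1", "2", "3", "4"}


def check_stops_schema(raw: bytes) -> list[str]:
    """Return human-readable schema problems found in a raw stops.txt payload."""
    # utf-8-sig strips a leading byte-order mark; dtype=str disables type inference;
    # keep_default_na=False keeps empty cells as "" instead of NaN.
    df = pd.read_csv(io.BytesIO(raw), dtype=str, keep_default_na=False, encoding="utf-8-sig")
    problems = [f"missing column: {c}" for c in sorted(REQUIRED_STOP_COLUMNS - set(df.columns))]
    if "location_type" in df.columns:
        bad = df.loc[~df["location_type"].isin(ALLOWED_LOCATION_TYPES), "location_type"]
        problems.extend(f"row {i}: invalid location_type {v!r}" for i, v in bad.items())
    for col in ("stop_lat", "stop_lon"):
        if col in df.columns:
            parsed = pd.to_numeric(df[col], errors="coerce")
            # Non-empty cells that failed numeric parsing.
            unparsable = parsed[parsed.isna() & (df[col] != "")].index
            problems.extend(f"row {i}: non-numeric {col}" for i in unparsable)
    return problems
```

Row numbers here are zero-based DataFrame positions, not line numbers in the source file.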
3. Referential Integrity Checks
Cross-reference foreign keys across files. Every stop_id in stop_times.txt must exist in stops.txt. Every trip_id must map to a valid route_id and service_id. This relational mapping is where most enterprise pipelines break. Orphaned records, duplicated primary keys, and mismatched cardinalities cascade into routing failures. For a deeper dive into how these tables interact, see Mastering stops.txt and stop_times.txt Relationships. Implementing set-based joins or pandas merge operations with indicator=True flags quickly surfaces missing or extraneous references.
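As an illustration of the merge-with-indicator approach, the following sketch returns child rows whose key has no match in the parent table. The helper name `find_orphans` is an assumption for this example.

```python
import pandas as pd


def find_orphans(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return rows of `child` whose `key` value does not exist in `parent`."""
    # Deduplicate the parent keys so a duplicated primary key cannot multiply child rows.
    parent_keys = parent[[key]].drop_duplicates()
    merged = child.merge(parent_keys, on=key, how="left", indicator=True)
    # "left_only" marks child rows with no matching parent key.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
```

For example, `find_orphans(stop_times, stops, "stop_id")` lists stop_times rows that reference unknown stops, and the same helper can be applied to trips against routes with `route_id`.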
4. Business Rule & Plausibility Validation
Enforce domain-specific constraints: valid WGS84 coordinate ranges, non-negative travel times, chronological stop sequences, and timezone-aware schedule normalization. A trip cannot depart a stop before it arrives, so arrival_time must always be less than or equal to departure_time on the same stop_times.txt row. Additionally, stop_times.txt requires strictly increasing stop_sequence values per trip, although the values need not be consecutive. These rules prevent logical impossibilities that routing engines will otherwise interpret as infinite loops or negative travel durations.
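The time and sequence rules above can be checked in a vectorized way. This is a sketch under stated assumptions: every row carries both arrival_time and departure_time (GTFS allows them to be empty at non-timepoint stops, which this example does not handle), and the helper names and flag column names are hypothetical.

```python
import pandas as pd


def to_seconds(hhmmss: pd.Series) -> pd.Series:
    """Convert HH:MM:SS strings (hours may exceed 24) to seconds since service-day start."""
    parts = hhmmss.str.split(":", expand=True).astype(int)
    return parts[0] * 3600 + parts[1] * 60 + parts[2]


def check_stop_times(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that violate at least one plausibility rule, with boolean flags."""
    out = df.copy()
    out["arr_s"] = to_seconds(out["arrival_time"])
    out["dep_s"] = to_seconds(out["departure_time"])
    out = out.sort_values(["trip_id", "stop_sequence"], kind="stable")
    # Departure time of the previous stop on the same trip (NaN for the first stop).
    prev_dep = out.groupby("trip_id")["dep_s"].shift()
    out["departs_before_arrival"] = out["dep_s"] < out["arr_s"]
    out["negative_travel_time"] = out["arr_s"] < prev_dep
    # A repeated stop_sequence within a trip means the sequence is not strictly increasing.
    out["duplicate_sequence"] = out.duplicated(["trip_id", "stop_sequence"], keep=False)
    flags = ["departs_before_arrival", "negative_travel_time", "duplicate_sequence"]
    return out.loc[out[flags].any(axis=1), ["trip_id", "stop_sequence", *flags]]
```

Working in integer seconds keeps comparisons exact and sidesteps clock-time parsing for values past 24:00:00.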
5. Error Aggregation & Reporting
Compile violations into a structured JSON/CSV report. Categorize by severity (error, warning, info) and attach actionable metadata: file name, row index, violated rule ID, and suggested remediation. Standardized reporting enables automated alerting, feed publisher feedback loops, and compliance tracking. Tools like MobilityData’s open-source gtfs-validator can be integrated for cross-verification, but custom Python pipelines offer greater flexibility for enterprise-specific business logic and CI/CD gating.
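One possible shape for such a report is sketched below. The `Violation` fields mirror the metadata listed above, but the class, the rule identifiers, and the JSON layout are assumptions made for illustration rather than a standard format.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass
class Violation:
    file: str         # e.g. "stops.txt"
    row: int          # zero-based row index within the file
    rule_id: str      # hypothetical rule identifier chosen by the pipeline
    severity: str     # "error", "warning", or "info"
    remediation: str  # short suggested fix for the feed publisher


def build_report(violations: list[Violation]) -> str:
    """Serialize violations to JSON with a per-severity summary for alerting."""
    summary = Counter(v.severity for v in violations)
    payload = {
        "summary": dict(summary),
        "violations": [asdict(v) for v in violations],
    }
    return json.dumps(payload, indent=2)
```

A CI gate could then fail the build whenever `summary` contains a non-zero `error` count.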
Common Schema Errors & Mitigation Strategies
Real-world GTFS feeds rarely conform perfectly to the specification. Below are the most frequent violations encountered in production environments and how to resolve them programmatically.
1. Duplicate Primary Keys
Symptom: Multiple rows in stops.txt or routes.txt share the same stop_id or route_id.
Impact: Ambiguous joins, unpredictable routing behavior, and silent overwrites during ingestion.
Fix: Deduplicate using deterministic rules (e.g., keep the row from the highest-priority source, or the most recently modified row if your ingestion layer records a last_modified timestamp; GTFS itself defines no such field). Log duplicates for manual review and enforce unique constraints at the database level post-validation.
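A minimal sketch of duplicate detection and a deterministic fallback rule follows. Keeping the first occurrence is only a placeholder policy assumed for this example, and both function names are hypothetical.

```python
import pandas as pd


def find_duplicate_keys(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return every row whose primary key value appears more than once."""
    # keep=False marks all members of a duplicate group, not just the later ones.
    return df[df.duplicated(subset=key, keep=False)]


def deduplicate_keep_first(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Placeholder policy: keep the first row seen for each key, in file order."""
    return df.drop_duplicates(subset=key, keep="first")
```

Logging the output of `find_duplicate_keys` before deduplicating preserves the evidence needed for manual review.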
2. Invalid Coordinate Ranges
Symptom: stop_lat falls outside [-90, 90], stop_lon falls outside [-180, 180], or the two axes are swapped.
Impact: Geospatial queries fail, map rendering breaks, and distance calculations return meaningless values.
Fix: Apply strict range validation during ingestion. Use pyproj or shapely for coordinate system verification if feeds claim non-WGS84 projections (though GTFS mandates WGS84). Flag coordinates near 0,0 as likely null-value placeholders.
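Range checks do not catch placeholder or swapped values that happen to be in range, so a heuristic pass can help. In the sketch below, the function name, the 0.001-degree tolerance, and the swap heuristic (latitude out of range while longitude would be a valid latitude) are all assumptions; swapped pairs where both values lie within [-90, 90] cannot be detected this way.

```python
import pandas as pd


def flag_suspicious_coordinates(df: pd.DataFrame, tolerance: float = 0.001) -> pd.DataFrame:
    """Return boolean flags per stop for likely placeholder or swapped coordinates."""
    lat, lon = df["stop_lat"], df["stop_lon"]
    flags = pd.DataFrame(index=df.index)
    # Coordinates at or very near (0, 0) are usually null-value placeholders.
    flags["near_null_island"] = (lat.abs() < tolerance) & (lon.abs() < tolerance)
    # Heuristic: an impossible latitude paired with a longitude that fits the latitude range.
    flags["maybe_swapped"] = (lat.abs() > 90) & (lon.abs() <= 90)
    return flags
```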
3. Time Format Violations & Day Rollover
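Symptom: stop_times.txt contains values that are not valid HH:MM:SS times (for example, minutes or seconds above 59), or represents overnight service inconsistently, mixing times past 24:00:00 with times that wrap back to 00:00:00.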
Impact: Schedule normalization fails, causing trips to appear on the wrong calendar day.
Fix: Parse times as strings first, then convert to pandas.Timedelta to handle values >24 hours correctly. GTFS explicitly allows times beyond 24:00:00 to represent overnight service, but parsers must handle this mathematically rather than as standard clock time.
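A minimal sketch of this string-first parsing is shown below. The function name `parse_gtfs_time` and the choice to cap hours at three digits are assumptions; values that do not match the pattern come back as NaT so they can be reported rather than silently coerced.

```python
import pandas as pd


def parse_gtfs_time(times: pd.Series) -> pd.Series:
    """Parse H:MM:SS / HH:MM:SS strings into Timedelta values; invalid entries become NaT."""
    # Minutes and seconds must be 00-59; hours are unbounded by clock time.
    parts = times.str.extract(r"^(\d{1,3}):([0-5]\d):([0-5]\d)$")
    seconds = (
        parts[0].astype(float) * 3600
        + parts[1].astype(float) * 60
        + parts[2].astype(float)
    )
    return pd.to_timedelta(seconds, unit="s")
```

Adding the resulting Timedelta to midnight of the service date yields the actual timestamp, so `25:30:00` lands at 01:30 on the following calendar day.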
4. Missing Calendar Coverage
Symptom: calendar.txt defines service days, but calendar_dates.txt lacks exceptions for holidays, or vice versa.
Impact: Routing engines assume daily service when none exists, or drop valid trips due to missing date coverage.
Fix: Validate that every service_id in trips.txt has at least one active day in either calendar.txt or calendar_dates.txt. Cross-check against known holiday schedules and generate coverage matrices to identify service gaps.
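The first part of that check can be sketched as a set difference. The function name is hypothetical, and the sketch makes a simplifying assumption: it does not test whether removal exceptions (exception_type 2) cancel every day that calendar.txt would otherwise activate.

```python
import pandas as pd

DAY_COLUMNS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def services_without_coverage(
    trips: pd.DataFrame, calendar: pd.DataFrame, calendar_dates: pd.DataFrame
) -> list[str]:
    """Return service_ids used by trips that have no active day in either calendar file."""
    # Services with at least one weekday flag set to 1 in calendar.txt.
    has_weekday = calendar[DAY_COLUMNS].astype(int).sum(axis=1) > 0
    weekly = set(calendar.loc[has_weekday, "service_id"])
    # Services with at least one added date (exception_type 1) in calendar_dates.txt.
    is_added = calendar_dates["exception_type"].astype(int) == 1
    added = set(calendar_dates.loc[is_added, "service_id"])
    return sorted(set(trips["service_id"]) - weekly - added)
```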
Production-Ready Python Validation Patterns
Implementing these checks requires a balance of performance and readability. Below is a minimal, production-grade pattern using pandas and pydantic for schema validation.
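import pandas as pd
from pydantic import BaseModel, ValidationError, field_validator
from typing import Optional


class StopSchema(BaseModel):
    # Read stops.txt with dtype={"stop_id": str} so numeric-looking IDs stay strings.
    stop_id: str
    stop_name: str
    stop_lat: float
    stop_lon: float
    location_type: Optional[int] = 0

    @field_validator("stop_lat")
    @classmethod
    def validate_latitude(cls, v: float) -> float:
        # WGS84 latitude is bounded by [-90, 90].
        if not (-90.0 <= v <= 90.0):
            raise ValueError(f"Latitude out of WGS84 range: {v}")
        return v

    @field_validator("stop_lon")
    @classmethod
    def validate_longitude(cls, v: float) -> float:
        # WGS84 longitude is bounded by [-180, 180].
        if not (-180.0 <= v <= 180.0):
            raise ValueError(f"Longitude out of WGS84 range: {v}")
        return v


def validate_stops(df: pd.DataFrame) -> list[str]:
    """Validate each row against StopSchema and collect error messages."""
    errors = []
    for idx, row in df.iterrows():
        try:
            StopSchema(**row.to_dict())
        except ValidationError as e:
            errors.append(f"Row {idx}: {e}")
    return errors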
While row-by-row validation works for small feeds, enterprise pipelines should leverage vectorized operations to avoid Python-level iteration overhead:
def vectorized_coordinate_check(df: pd.DataFrame) -> pd.DataFrame:
    # Rows with NaN coordinates fail between() and are returned as invalid too.
    mask = (df["stop_lat"].between(-90, 90)) & (df["stop_lon"].between(-180, 180))
    return df[~mask].copy()
Vectorization can cut execution time substantially on multi-million-row feeds, because the comparisons run in compiled code rather than in a Python loop. For memory-constrained environments, process feeds in chunks using pd.read_csv(..., chunksize=50000) or migrate to Polars for out-of-core execution. For comprehensive implementation strategies, review How to Validate a GTFS Feed with Python, which covers batch processing, memory optimization, and CI/CD integration.
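The chunked approach might look like the sketch below, which applies the same coordinate range check one chunk at a time and keeps only the offending rows in memory. The function name is an assumption, and `source` can be a file path or any file-like object accepted by pd.read_csv.

```python
import io

import pandas as pd


def invalid_coordinates_chunked(source, chunksize: int = 50_000) -> pd.DataFrame:
    """Stream a stops file in chunks and collect rows with out-of-range coordinates."""
    bad_chunks = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        valid = chunk["stop_lat"].between(-90, 90) & chunk["stop_lon"].between(-180, 180)
        if not valid.all():
            bad_chunks.append(chunk[~valid])
    # Return an empty frame when every chunk was clean.
    return pd.concat(bad_chunks) if bad_chunks else pd.DataFrame()
```

Peak memory then scales with the chunk size plus the number of invalid rows, not with the size of the whole file.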
Scaling Validation in Enterprise Workflows
As transit networks grow, validation must transition from ad-hoc scripts to governed pipelines. Key architectural considerations include:
- Idempotent Processing: Ensure validation runs can be re-executed without side effects. Cache intermediate DataFrames and use transactional writes for error logs.
- Incremental Validation: For feeds that update daily, diff against the previous version to isolate regressions rather than reprocessing the entire archive. Hash-based change detection minimizes compute costs.
- Schema Versioning: GTFS evolves. Track feed specification versions and map validation rules to specific spec releases to avoid breaking changes when publishers adopt spec extensions or new static fields.
- Observability: Integrate validation metrics into monitoring dashboards. Track error rates, feed size, processing duration, and severity distributions over time. Alert on threshold breaches before feeds reach production routing engines.
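Hash-based change detection, as mentioned under incremental validation, can be sketched with hashlib and zipfile. The function names are hypothetical. Comparing per-file digests between two feed versions tells the pipeline which tables need revalidation, though a changed file may still force rechecks of the tables that reference it.

```python
import hashlib
import io
import zipfile


def file_digests(archive) -> dict[str, str]:
    """Map each file in a GTFS archive to the SHA-256 digest of its contents."""
    with zipfile.ZipFile(archive) as zf:
        return {
            name: hashlib.sha256(zf.read(name)).hexdigest()
            for name in zf.namelist()
            if not name.endswith("/")  # skip directory entries
        }


def changed_files(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return names that were added, removed, or modified between two digest maps."""
    names = previous.keys() | current.keys()
    return sorted(n for n in names if previous.get(n) != current.get(n))
```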
Implementing a centralized validation service decouples quality checks from ingestion and routing. This separation of concerns allows mobility teams to enforce data contracts, publish compliance reports, and provide actionable feedback to feed publishers. When combined with automated testing and continuous integration, validation becomes a proactive quality gate rather than a reactive debugging exercise.
Conclusion
GTFS validation is the foundation of reliable transit data infrastructure. By enforcing strict schema rules, verifying referential integrity, and applying domain-specific business logic, engineering teams can prevent silent failures and ensure passenger-facing applications deliver accurate schedule information. Adopting standardized Python patterns and integrating validation into CI/CD pipelines transforms compliance from a manual chore into an automated, scalable workflow. As mobility ecosystems expand, disciplined validation practices will remain the critical differentiator between resilient transit platforms and brittle data pipelines.