How to Validate a GTFS Feed with Python
The most reliable production approach to validate a GTFS feed with Python is to orchestrate the industry-standard MobilityData GTFS Validator via subprocess, parse its structured JSON report, and gracefully degrade to a lightweight pandas-based schema checker when a Java runtime is unavailable. GTFS validation extends far beyond CSV parsing; it requires cross-file referential integrity checks, business-rule enforcement, and strict type validation. Wrapping the official Java validator in Python guarantees deterministic, spec-compliant results, while native Python fallbacks keep CI/CD pipelines from stalling due to missing dependencies.
Why GTFS Validation Requires Cross-File Integrity Checks
A GTFS feed is fundamentally a ZIP archive containing interrelated CSV files that map to transit domains: agencies, stops, routes, trips, stop times, calendars, and fare rules. Understanding the GTFS Feed Architecture & Fundamentals is essential because validation failures rarely stem from malformed CSV syntax alone. They typically arise from broken foreign keys (e.g., a trip_id in stop_times.txt missing from trips.txt), invalid coordinate bounds, or overlapping service calendars. Automated validation must catch these structural violations before the feed reaches routing engines, schedule planners, or passenger-facing applications.
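As a minimal sketch of what such cross-file checks look like, the snippet below runs two of them with pandas on small in-memory tables rather than a real feed. The helper names `find_orphans` and `out_of_bounds_stops` are hypothetical, and the sample rows are invented for illustration.

```python
import pandas as pd


def find_orphans(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> set:
    """Return key values present in the child table but absent from the parent."""
    return set(child[key].dropna()) - set(parent[key].dropna())


def out_of_bounds_stops(stops: pd.DataFrame) -> pd.DataFrame:
    """Return stops whose coordinates are missing, non-numeric, or outside WGS84 range."""
    lat = pd.to_numeric(stops["stop_lat"], errors="coerce")
    lon = pd.to_numeric(stops["stop_lon"], errors="coerce")
    # NaN (missing or non-numeric) fails between(), so those rows are flagged too.
    valid = lat.between(-90, 90) & lon.between(-180, 180)
    return stops[~valid]


# Invented sample data: trip T2 is referenced by stop_times but not defined in trips.
trips = pd.DataFrame({"trip_id": ["T1"]})
stop_times = pd.DataFrame({"trip_id": ["T1", "T2"]})
stops = pd.DataFrame({
    "stop_id": ["S1", "S2"],
    "stop_lat": ["45.5", "95.0"],  # 95.0 is not a valid latitude
    "stop_lon": ["-73.6", "10.0"],
})

orphan_trip_ids = find_orphans(stop_times, trips, "trip_id")
bad_stops = out_of_bounds_stops(stops)
```

Note that this simplified bounds check also flags empty coordinates, which the specification permits for some location types, so a real pipeline would need to account for `location_type`.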
Production Architecture: Java Validator + Python Orchestration
The official validator is Java-based and applies the same rule set on every run, which keeps results deterministic across thousands of edge cases. Python acts as the orchestration layer: downloading feeds, invoking the JAR, parsing JSON outputs, and routing failures to alerting systems. When the Java runtime is missing or the validator times out, a native Python fallback uses pandas to verify required files, required column headers, and one basic foreign-key relationship. This dual-path architecture balances accuracy with resilience.
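One possible way to decide between the two paths up front, rather than waiting for the subprocess call to fail, is sketched below. The function name `choose_validation_path` and its return values are hypothetical; the check only looks for a `java` executable on the PATH and for the JAR file on disk.

```python
import shutil
from pathlib import Path


def choose_validation_path(jar_path: Path) -> str:
    """Return "java" when a Java executable and the JAR are both available, else "fallback"."""
    java_on_path = shutil.which("java") is not None
    if java_on_path and jar_path.is_file():
        return "java"
    return "fallback"


# Example: a JAR path that does not exist always selects the fallback.
selected = choose_validation_path(Path("does-not-exist/gtfs-validator.jar"))
```

A pre-flight check like this does not replace error handling around the actual invocation, since the runtime can still fail after being found.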
For environments where you must invoke external binaries, Python’s subprocess module provides robust process management, timeout controls, and stream capture. See the official Python subprocess documentation for best practices on secure command execution and error handling.
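The timeout and stream-capture behaviour can be sketched in a few lines. In this illustration `sys.executable` stands in for the external binary so the example runs without Java, and the helper name `run_with_timeout` is hypothetical.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_timeout(cmd: List[str], timeout_s: float) -> Optional[subprocess.CompletedProcess]:
    """Run cmd, capturing stdout and stderr as text; return None if it times out."""
    try:
        # Passing a list (not a shell string) avoids shell injection issues.
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None


# A fast command completes normally; a slow one is killed once the timeout expires.
ok = run_with_timeout([sys.executable, "-c", "print('done')"], timeout_s=30)
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5)
```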
Complete Validation Script
The following script downloads a GTFS ZIP, runs the MobilityData validator, parses the output, and gracefully degrades to a Python-native schema check if the Java runtime is missing. It uses tempfile to avoid leaving extraction artifacts, logging for production-grade output, and explicit type hints.
import json
import logging
import subprocess
import tempfile
import urllib.request
import zipfile
from pathlib import Path
from typing import Any, Dict, List

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

# Configuration
GTFS_URL = "https://transit.agency/gtfs-static.zip"
VALIDATOR_JAR = "gtfs-validator.jar"
# Files that are always required. calendar.txt is only conditionally required
# (calendar_dates.txt may replace it), so the pair is checked separately.
REQUIRED_FILES = {"agency.txt", "stops.txt", "routes.txt", "trips.txt", "stop_times.txt"}
CALENDAR_FILES = {"calendar.txt", "calendar_dates.txt"}
# Columns checked by the fallback. agency_id and route_short_name are only
# conditionally required, so they are left to the full validator. The coordinate
# and time columns are also conditional but are kept here as a pragmatic check.
REQUIRED_HEADERS = {
    "agency.txt": ["agency_name", "agency_url", "agency_timezone"],
    "stops.txt": ["stop_id", "stop_lat", "stop_lon"],
    "routes.txt": ["route_id", "route_type"],
    "trips.txt": ["route_id", "trip_id", "service_id"],
    "stop_times.txt": ["trip_id", "arrival_time", "departure_time", "stop_id", "stop_sequence"],
}


def download_gtfs(url: str, dest: Path) -> Path:
    """Download the feed ZIP to dest, creating parent directories as needed."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return dest


def run_mobility_validator(zip_path: Path, output_dir: Path, jar_path: str = VALIDATOR_JAR) -> Dict[str, Any]:
    """Invoke the official MobilityData GTFS Validator and return the JSON report."""
    output_dir.mkdir(parents=True, exist_ok=True)
    cmd = [
        "java", "-Xmx2g", "-jar", jar_path,
        "-i", str(zip_path),
        "-o", str(output_dir),
    ]
    try:
        # A non-zero exit code is logged rather than raised: the report is still
        # read below whenever the validator managed to write one.
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
    except subprocess.TimeoutExpired:
        logging.error("Validator timed out after 300 seconds.")
        return {"status": "timeout", "message": "Validation exceeded time limit"}
    except FileNotFoundError:
        # Raised when the "java" executable itself cannot be found.
        logging.error("Java runtime not found. Falling back to native schema check.")
        return {"status": "missing_runtime", "message": "Java not installed"}
    if proc.returncode != 0:
        logging.warning(f"Validator finished with exit code {proc.returncode}. Stderr: {proc.stderr[:200]}")

    report_path = output_dir / "report.json"
    if report_path.exists():
        with open(report_path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"status": "failed", "message": "No validation report generated"}


def fallback_schema_check(zip_path: Path) -> Dict[str, Any]:
    """Python-native fallback: verifies required files, headers, and basic foreign keys."""
    results: List[Dict[str, Any]] = []
    with tempfile.TemporaryDirectory() as tmpdir:
        with zipfile.ZipFile(zip_path, "r") as z:
            z.extractall(tmpdir)
        extracted = Path(tmpdir)

        present = {f.name for f in extracted.iterdir() if f.suffix == ".txt"}
        missing_files = REQUIRED_FILES - present
        if missing_files:
            results.append({"severity": "ERROR", "message": f"Missing required files: {', '.join(sorted(missing_files))}"})
        if not (CALENDAR_FILES & present):
            results.append({"severity": "ERROR", "message": "Neither calendar.txt nor calendar_dates.txt is present"})
        if results:
            return {"status": "fallback_error", "findings": results}

        for file_name, required_cols in REQUIRED_HEADERS.items():
            file_path = extracted / file_name
            if not file_path.exists():
                continue
            try:
                # nrows=0 reads only the header row; utf-8-sig tolerates a leading BOM.
                df = pd.read_csv(file_path, dtype=str, nrows=0, encoding="utf-8-sig")
                missing_cols = set(required_cols) - set(df.columns)
                if missing_cols:
                    results.append({"severity": "ERROR", "message": f"{file_name} missing columns: {', '.join(sorted(missing_cols))}"})
            except (pd.errors.ParserError, pd.errors.EmptyDataError) as e:
                results.append({"severity": "ERROR", "message": f"Failed to parse {file_name}: {e}"})

        # Basic foreign key check: every trip_id in stop_times.txt must exist in trips.txt
        try:
            trips = pd.read_csv(extracted / "trips.txt", dtype=str, usecols=["trip_id"], encoding="utf-8-sig")
            stop_times = pd.read_csv(extracted / "stop_times.txt", dtype=str, usecols=["trip_id"], encoding="utf-8-sig")
            orphaned = set(stop_times["trip_id"].unique()) - set(trips["trip_id"].unique())
            if orphaned:
                results.append({"severity": "WARNING", "message": f"{len(orphaned)} orphaned trip_ids in stop_times.txt"})
        except Exception as e:
            results.append({"severity": "WARNING", "message": f"FK check skipped: {e}"})

    return {"status": "fallback_complete", "findings": results}


def validate_feed(url: str, jar_path: str = VALIDATOR_JAR) -> Dict[str, Any]:
    """Download a feed, run the Java validator, and fall back to the native check if needed."""
    zip_dest = Path("./gtfs_feed.zip")
    output_dir = Path("./validation_output")

    logging.info(f"Downloading feed from {url}")
    download_gtfs(url, zip_dest)
    try:
        report = run_mobility_validator(zip_dest, output_dir, jar_path=jar_path)
        if report.get("status") in ("missing_runtime", "timeout"):
            logging.info("Running native Python fallback validation...")
            report = fallback_schema_check(zip_dest)
    finally:
        # Clean up the downloaded archive even if validation raised
        zip_dest.unlink(missing_ok=True)
    return report


if __name__ == "__main__":
    result = validate_feed(GTFS_URL)
    print(json.dumps(result, indent=2, default=str))
Parsing Results & CI/CD Integration
The MobilityData validator outputs a hierarchical JSON report containing notices categorized by severity (ERROR, WARNING, INFO). In production, parse this report to extract actionable metrics:
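def summarize_report(report: Dict[str, Any]) -> Dict[str, int]:
    """Count reported notices per severity level."""
    counts = {"ERROR": 0, "WARNING": 0, "INFO": 0}
    for notice in report.get("notices", []):
        severity = notice.get("severity", "INFO")
        if severity in counts:
            # Each entry groups one notice code; totalNotices (when present) holds
            # the number of occurrences. Count the entry once if the field is absent.
            counts[severity] += int(notice.get("totalNotices", 1))
    return counts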
Integrate this into your pipeline by gating deployments on ERROR counts. For Airflow or GitHub Actions workflows, fail the job if counts["ERROR"] > 0, but allow WARNING thresholds to trigger Slack alerts instead. When orchestrating external processes in CI, always set explicit timeouts and capture stderr to avoid silent hangs. Refer to the MobilityData GTFS Validator repository for the latest release binaries and configuration flags.
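The gating rule described above might look like the following sketch. The function name `ci_exit_code` and the `max_warnings` threshold are hypothetical, and the `print` call is only a placeholder for whatever alerting hook (such as a Slack webhook) the pipeline actually uses.

```python
from typing import Dict


def ci_exit_code(counts: Dict[str, int], max_warnings: int = 50) -> int:
    """Map severity counts to a process exit code: 1 fails the job, 0 passes."""
    if counts.get("ERROR", 0) > 0:
        return 1
    if counts.get("WARNING", 0) > max_warnings:
        # Hypothetical policy: too many warnings raise an alert but do not block.
        print(f"ALERT: {counts['WARNING']} warnings exceed threshold {max_warnings}")
    return 0


# Example severity counts, shaped like the output of summarize_report.
blocked = ci_exit_code({"ERROR": 2, "WARNING": 0, "INFO": 5})
passed = ci_exit_code({"ERROR": 0, "WARNING": 120, "INFO": 5})
```

In a CI job the result would typically be passed to `sys.exit`, so that a non-zero value fails the step.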
Common Pitfalls & Next Steps
- Memory Limits: Validating large metropolitan feeds can exhaust the 2 GB heap set by -Xmx2g in the script above. Pass -Xmx4g or higher to the JVM, or chunk validation by agency if your infrastructure supports it.
- Timezone & Calendar Edge Cases: Feeds spanning daylight saving transitions or holiday exceptions often trigger false positives. Cross-reference calendar_dates.txt overrides before marking calendar overlaps as critical.
- Pure Python Limitations: Regex or CSV-only parsers miss referential integrity violations. Always pair lightweight checks with a spec-compliant validator for production feeds.
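One way to cross-reference calendar_dates.txt overrides is to resolve which services actually run on a given date before judging an overlap. The sketch below assumes both tables were loaded as strings (for example with `dtype=str`); the function name `services_active_on` and the sample rows are invented for illustration. In calendar_dates.txt, an `exception_type` of 1 adds service on a date and 2 removes it.

```python
from datetime import datetime

import pandas as pd

DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def services_active_on(calendar: pd.DataFrame, calendar_dates: pd.DataFrame, date: str) -> set:
    """Return service_ids active on date (YYYYMMDD) after applying exceptions."""
    weekday = DAYS[datetime.strptime(date, "%Y%m%d").weekday()]
    # Zero-padded YYYYMMDD strings compare correctly as plain strings.
    in_range = (calendar["start_date"] <= date) & (calendar["end_date"] >= date)
    active = set(calendar.loc[in_range & (calendar[weekday] == "1"), "service_id"])
    today = calendar_dates[calendar_dates["date"] == date]
    active |= set(today.loc[today["exception_type"] == "1", "service_id"])  # service added
    active -= set(today.loc[today["exception_type"] == "2", "service_id"])  # service removed
    return active


# Invented sample: weekday service WKD is replaced by holiday service HOL on 2024-01-01.
calendar = pd.DataFrame([{
    "service_id": "WKD", "monday": "1", "tuesday": "1", "wednesday": "1",
    "thursday": "1", "friday": "1", "saturday": "0", "sunday": "0",
    "start_date": "20240101", "end_date": "20241231",
}])
calendar_dates = pd.DataFrame([
    {"service_id": "WKD", "date": "20240101", "exception_type": "2"},
    {"service_id": "HOL", "date": "20240101", "exception_type": "1"},
])

holiday_services = services_active_on(calendar, calendar_dates, "20240101")
regular_services = services_active_on(calendar, calendar_dates, "20240102")
```

Here the weekday and holiday services overlap on paper, but the exceptions show that only one of them runs on the holiday.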
Once your pipeline reliably catches structural violations, expand coverage to real-time GTFS-RT feeds, validate fare rule hierarchies, and automate regression testing against historical feed versions. For a deeper dive into enforcement logic and spec compliance, review GTFS Validation Rules and Common Schema Errors to align your alerting thresholds with transit authority requirements.