Best Practices for GTFS Agency Metadata
Implementing best practices for GTFS agency metadata starts with treating agency.txt as the immutable anchor of your transit feed. While the specification marks several fields as optional, production pipelines must enforce strict schema compliance, stable identifiers, and standardized locale codes. Proper implementation prevents routing engine failures, ensures accurate fare attribution, and maintains compatibility across feed versioning cycles. For teams building scalable mobility platforms, understanding how agency data propagates through GTFS Feed Architecture & Fundamentals is critical before deploying validation logic.
Core agency.txt Requirements & Validation Rules
The agency.txt file defines the operating entity behind every route, trip, and stop. To avoid downstream parsing errors, enforce these production standards:
- Stable agency_id: Must be a persistent, non-numeric string (e.g., MTA-NYCT, BART-SF). Reusing IDs, rotating auto-generated UUIDs, or relying on integers breaks historical analytics and real-time subscriptions, and corrupts trip-to-vehicle joins.
- Mandatory Core Fields: Treat agency_name, agency_url, agency_timezone, and agency_lang as required, even though the specification leaves agency_lang optional. Omission can cause silent failures in consumer SDKs and accessibility tools.
- IANA Timezones Only: Use exact identifiers from the IANA Time Zone Database (e.g., America/New_York). Abbreviations like EST or PST are ambiguous, region-dependent, and do not carry daylight-saving rules.
- ISO 639-1 Language Codes: Restrict agency_lang to two-letter lowercase codes (en, es, fr). The specification accepts any IETF BCP 47 tag, and two-letter codes are valid BCP 47 tags, so avoid longer or extended tags unless your consumer stack explicitly supports them.
- HTTPS Enforcement: agency_url and the optional agency_fare_url must resolve to secure endpoints. Mixed-content warnings break mobile apps and violate modern transit API security baselines.
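The identifier and URL rules above can be checked with the standard library alone. The following is a minimal sketch, not a spec-mandated check: the helper names is_stable_agency_id and is_https_url are hypothetical, "non-numeric" is interpreted here as "not made up only of digits", and the URL check inspects the scheme and host without contacting the endpoint.

```python
from urllib.parse import urlparse


def is_stable_agency_id(agency_id: str) -> bool:
    """Return True for a non-empty ID that is not purely numeric.

    Assumption: "non-numeric" means "not made up only of digits", so an
    ID such as "MTA-NYCT" passes while "42" fails.
    """
    candidate = agency_id.strip()
    return bool(candidate) and not candidate.isdigit()


def is_https_url(url: str) -> bool:
    """Return True when the URL uses the https scheme and names a host.

    This checks syntax only; it does not verify that the URL resolves.
    """
    parsed = urlparse(url.strip())
    return parsed.scheme == "https" and bool(parsed.netloc)


if __name__ == "__main__":
    for agency_id in ("MTA-NYCT", "42", ""):
        print(repr(agency_id), is_stable_agency_id(agency_id))
    for url in ("https://example.com", "http://example.com", "example.com"):
        print(url, is_https_url(url))
```

Checks like these are cheap enough to run before the heavier schema validation, so obviously bad rows can be rejected early.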
Production-Ready Python Validation
Transit automation pipelines should validate metadata before ingestion. The following routine uses pandas for CSV parsing and Pydantic v2 for schema enforcement. It normalizes inputs, rejects malformed records, and logs actionable errors.
import logging
import zoneinfo
from typing import Optional

import pandas as pd
from pydantic import BaseModel, Field, ValidationError, field_validator

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("gtfs_agency_validator")


class AgencyRecord(BaseModel):
    """One row of agency.txt, validated against the rules listed above."""

    agency_id: str = Field(..., min_length=2, max_length=50)
    agency_name: str = Field(..., min_length=2)
    # HTTPS only, matching the HTTPS enforcement rule.
    agency_url: str = Field(..., pattern=r"^https://")
    agency_timezone: str
    agency_lang: str = Field(..., min_length=2, max_length=2)
    agency_phone: Optional[str] = None
    agency_fare_url: Optional[str] = None
    agency_email: Optional[str] = None

    @field_validator("agency_timezone")
    @classmethod
    def validate_timezone(cls, v: str) -> str:
        # ZoneInfo raises when the key is not in the available tz database.
        try:
            zoneinfo.ZoneInfo(v)
        except Exception as exc:
            raise ValueError(f"Invalid IANA timezone: {v}") from exc
        return v

    @field_validator("agency_lang")
    @classmethod
    def validate_language(cls, v: str) -> str:
        if not v.isalpha() or not v.islower() or len(v) != 2:
            raise ValueError("agency_lang must be a lowercase ISO 639-1 code (e.g., 'en')")
        return v

    @field_validator("agency_fare_url")
    @classmethod
    def validate_fare_url(cls, v: Optional[str]) -> Optional[str]:
        # Blank CSV cells arrive as "" because keep_default_na=False is used below.
        if v and not v.startswith("https://"):
            raise ValueError("agency_fare_url must use HTTPS")
        return v


def validate_agency_csv(filepath: str) -> list[dict]:
    """Validate every row of an agency.txt file and return the valid ones."""
    # dtype=str keeps IDs as text; keep_default_na=False keeps blanks as "".
    df = pd.read_csv(filepath, dtype=str, keep_default_na=False)
    valid_records = []
    for idx, row in df.iterrows():
        try:
            record = AgencyRecord(**row.to_dict())
            valid_records.append(record.model_dump())
        except ValidationError as e:
            logger.warning(f"Row {idx} validation failed: {e}")
    if not valid_records:
        raise ValueError("No valid agency records found. Feed rejected.")
    return valid_records
Key implementation notes:
- Uses Pydantic v2 syntax (@field_validator, model_dump()) for current compatibility and faster serialization.
- dtype=str during CSV read prevents pandas from coercing IDs into integers or floats or stripping leading zeros.
- Logs an explicit warning for each invalid row and rejects the feed outright when no valid records remain, preventing silent data corruption in downstream routing engines.
Handling Multi-Agency Feeds & Mergers
Regional transit hubs often consolidate multiple operators into a single GTFS package. In these cases, agency_id collisions become a critical failure point. When merging feeds:
- Prefix IDs with a regional namespace (e.g., BAYAREA_SFMTA, BAYAREA_BART) before concatenation.
- Maintain a crosswalk table mapping legacy IDs to canonical identifiers.
- Validate that agency_url and agency_fare_url point to operator-specific endpoints, not generic portal pages.
Without namespace isolation, routing engines will misattribute trips, fare calculators will apply incorrect rules, and real-time vehicle positions will detach from scheduled routes.
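The namespacing and crosswalk steps above can be sketched as follows. This is an illustrative outline under stated assumptions: the function name namespace_agency_ids, the underscore separator, and the namespace labels in the usage note are hypothetical, and each feed's agency.txt is assumed to be already parsed into a list of dictionaries.

```python
def namespace_agency_ids(
    feeds: dict[str, list[dict[str, str]]],
) -> tuple[list[dict[str, str]], dict[tuple[str, str], str]]:
    """Prefix each agency_id with its feed's namespace before merging.

    `feeds` maps a namespace label (hypothetical, e.g. "REGION_A") to the
    rows parsed from that feed's agency.txt. Returns the merged rows and a
    crosswalk of (namespace, legacy_id) -> canonical_id.
    """
    merged: list[dict[str, str]] = []
    crosswalk: dict[tuple[str, str], str] = {}
    seen_canonical: set[str] = set()

    for namespace, rows in feeds.items():
        for row in rows:
            legacy_id = row["agency_id"]
            canonical_id = f"{namespace}_{legacy_id}"
            # Reject duplicates inside one feed and clashes across feeds.
            if canonical_id in seen_canonical:
                raise ValueError(f"agency_id collision on {canonical_id!r}")
            seen_canonical.add(canonical_id)
            crosswalk[(namespace, legacy_id)] = canonical_id
            # Copy the row so the caller's input is left untouched.
            merged.append({**row, "agency_id": canonical_id})

    return merged, crosswalk
```

Called with two feeds that both use agency_id "1" under the labels "REGION_A" and "REGION_B", the merged rows carry "REGION_A_1" and "REGION_B_1". The same crosswalk then needs to be applied to every file that references agency_id, such as routes.txt, so the references stay consistent.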
CI/CD Integration & Automated Guardrails
Manual validation is insufficient for high-frequency feed updates. Embed agency checks directly into your CI/CD pipeline:
- Pre-commit hooks: Run lightweight schema checks before agency.txt enters version control.
- Scheduled validation: Trigger full pipeline runs on every feed export using GitHub Actions, GitLab CI, or Airflow DAGs.
- Threshold enforcement: Block feed publication if the agency_id count changes unexpectedly or if timezone/language codes deviate from the approved allowlist.
Automated guardrails catch drift before consumers ingest broken data. For teams managing frequent schedule updates, aligning validation with Agency Metadata and Feed Versioning Practices ensures backward compatibility and clean changelog generation.
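The threshold-enforcement guardrail described above might look like the sketch below. It is one possible shape rather than a standard tool: the function name check_agency_guardrails and its parameters are hypothetical, and it compares the full set of agency_id values instead of only the count, which is a stricter interpretation of "changes unexpectedly".

```python
def check_agency_guardrails(
    previous_ids: set[str],
    current_rows: list[dict[str, str]],
    allowed_timezones: set[str],
    allowed_langs: set[str],
) -> list[str]:
    """Return human-readable violations; an empty list means safe to publish."""
    violations: list[str] = []
    current_ids = {row.get("agency_id", "") for row in current_rows}

    # Any ID that appears or disappears is treated as unexpected drift.
    added = sorted(current_ids - previous_ids)
    removed = sorted(previous_ids - current_ids)
    if added:
        violations.append(f"new agency_id values: {added}")
    if removed:
        violations.append(f"missing agency_id values: {removed}")

    # Timezone and language codes must come from the approved allowlists.
    for row in current_rows:
        agency_id = row.get("agency_id", "")
        timezone = row.get("agency_timezone", "")
        lang = row.get("agency_lang", "")
        if timezone not in allowed_timezones:
            violations.append(f"{agency_id}: timezone {timezone!r} not in allowlist")
        if lang not in allowed_langs:
            violations.append(f"{agency_id}: language {lang!r} not in allowlist")

    return violations


if __name__ == "__main__":
    rows = [{"agency_id": "MTA-NYCT", "agency_timezone": "EST", "agency_lang": "en"}]
    for problem in check_agency_guardrails(
        {"MTA-NYCT"}, rows, {"America/New_York"}, {"en", "es"}
    ):
        print("BLOCK:", problem)
```

In a CI job, a non-empty result would typically be turned into a non-zero exit code so that the publication step never runs.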
Downstream Impact & Versioning Strategy
Validated agency metadata must propagate cleanly to GTFS-Realtime consumers, routing engines, and fare calculators. When agency_id drifts between static and realtime feeds, vehicle positions detach from scheduled trips, causing blank maps and ETA failures. Similarly, mismatched timezones break schedule interpolation during daylight-saving shifts.
To maintain consistency across updates, implement automated diffing and semantic versioning. Track metadata changes alongside route and stop updates, and publish changelogs that explicitly flag agency_id rotations or timezone corrections. Additionally, align your validation thresholds with the official GTFS Specification. The spec evolves, and consumer platforms increasingly reject feeds that omit formerly optional fields. Treat the specification as a living contract, not a minimum viable baseline.
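As a rough illustration of the automated diffing described above, the sketch below compares two versions of agency.txt that have already been parsed into lists of dictionaries. The function names diff_agency_metadata and suggest_bump are hypothetical, and the mapping from changes to version bumps (ID or timezone changes as "major", other field edits as "minor") is an assumed policy, not something the GTFS specification defines.

```python
def diff_agency_metadata(
    old_rows: list[dict[str, str]],
    new_rows: list[dict[str, str]],
) -> dict[str, list]:
    """Summarize added, removed, and changed agencies between two versions."""
    old = {row["agency_id"]: row for row in old_rows}
    new = {row["agency_id"]: row for row in new_rows}

    changed: list[tuple[str, str, str, str]] = []
    for agency_id in sorted(old.keys() & new.keys()):
        fields = sorted(old[agency_id].keys() | new[agency_id].keys())
        for field in fields:
            before = old[agency_id].get(field, "")
            after = new[agency_id].get(field, "")
            if before != after:
                changed.append((agency_id, field, before, after))

    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": changed,
    }


def suggest_bump(diff: dict[str, list]) -> str:
    """Map a diff to a version bump under the assumed policy described above."""
    timezone_changed = any(
        field == "agency_timezone" for _, field, _, _ in diff["changed"]
    )
    if diff["added"] or diff["removed"] or timezone_changed:
        return "major"
    if diff["changed"]:
        return "minor"
    return "none"
```

Each entry in the "changed" list is an (agency_id, field, before, after) tuple, which can be rendered directly into a changelog line that flags ID rotations or timezone corrections.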