# Padelnomics Extraction
Fetches raw data from external sources into the local landing zone. The pipeline then reads from the landing zone; extraction and transformation are fully decoupled.
## Running
```bash
# Run all extractors sequentially
LANDING_DIR=data/landing uv run extract

# Run a single extractor
LANDING_DIR=data/landing uv run extract-overpass
LANDING_DIR=data/landing uv run extract-eurostat
LANDING_DIR=data/landing uv run extract-playtomic-tenants
LANDING_DIR=data/landing uv run extract-playtomic-availability
```
## Architecture: one file per source
Each data source lives in its own module with a dedicated CLI entry point:
```
src/padelnomics_extract/
├── __init__.py
├── _shared.py                 # LANDING_DIR, logger, run_extractor() wrapper
├── utils.py                   # SQLite state tracking, atomic I/O helpers
├── overpass.py                # OSM padel courts via Overpass API
├── eurostat.py                # Eurostat city demographics (urb_cpop1, ilc_di03)
├── playtomic_tenants.py       # Playtomic venue listings (tenant search)
├── playtomic_availability.py  # Playtomic booking slots (next-day availability)
└── all.py                     # Runs all extractors sequentially
```
## Adding a new extractor
- Create `my_source.py` following the pattern:

  ```python
  from ._shared import run_extractor, setup_logging
  from .utils import landing_path, write_gzip_atomic

  logger = setup_logging("padelnomics.extract.my_source")
  EXTRACTOR_NAME = "my_source"


  def extract(landing_dir, year_month, conn, session):
      """Returns {"files_written": N, "bytes_written": N, ...}."""
      year, month = year_month.split("/")
      dest_dir = landing_path(landing_dir, "my_source", year, month)
      # ... fetch data, write to dest_dir ...
      return {"files_written": 1, "files_skipped": 0, "bytes_written": n}


  def main():
      run_extractor(EXTRACTOR_NAME, extract)
  ```
- Add an entry point to `pyproject.toml`:

  ```toml
  extract-my-source = "padelnomics_extract.my_source:main"
  ```
- Import in `all.py` and add to the `EXTRACTORS` list.
- Add a staging model in `transform/sqlmesh_padelnomics/models/staging/`.
## Design: filesystem as state
The landing zone is an append-only store of raw files:
- Idempotency: running twice writes nothing if the source hasn't changed
- Debugging: every historical raw file is preserved
- Safety: extraction never mutates existing files, only appends new ones
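Under these rules, an extractor's write path reduces to a check-then-write. A minimal sketch of that idempotency guarantee; the helper name `write_once` is hypothetical, and the return dict loosely mirrors the stats shape used by the extractors:

```python
from pathlib import Path


def write_once(dest: Path, data: bytes) -> dict:
    """Append-only write: skip when the file already exists.

    Hypothetical helper; the real extractors combine this check with
    ETag or date-based naming, as described below.
    """
    if dest.exists():  # second run with unchanged source: nothing to do
        return {"files_written": 0, "files_skipped": 1, "bytes_written": 0}
    dest.write_bytes(data)  # first run: append a new file
    return {"files_written": 1, "files_skipped": 0, "bytes_written": len(data)}
```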
### ETag-based dedup (Eurostat)
When the source provides an `ETag` header, store it in a sibling `.etag` file.
On the next request, send `If-None-Match`; a `304 Not Modified` response means the source is unchanged and the download is skipped.
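The flow can be sketched as follows. `fetch_with_etag` is a hypothetical helper (the real eurostat.py may structure this differently), and `session` is any requests-style object exposing `.get(url, headers=...)`:

```python
import gzip
from pathlib import Path


def fetch_with_etag(url: str, dest: Path, session) -> dict:
    """Conditional GET: send the stored ETag as If-None-Match; 304 means skip.

    Sketch only. `session.get` must return a response with .status_code,
    .headers, .content, and .raise_for_status() (requests-style).
    """
    etag_file = dest.with_name(dest.name + ".etag")  # sibling .etag file
    headers = {}
    if etag_file.exists():
        headers["If-None-Match"] = etag_file.read_text().strip()

    resp = session.get(url, headers=headers)
    if resp.status_code == 304:  # unchanged since last run
        return {"files_written": 0, "files_skipped": 1, "bytes_written": 0}

    resp.raise_for_status()
    payload = gzip.compress(resp.content)
    dest.write_bytes(payload)  # the real code writes atomically (see below)
    if "ETag" in resp.headers:
        etag_file.write_text(resp.headers["ETag"])
    return {"files_written": 1, "files_skipped": 0, "bytes_written": len(payload)}
```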
### Content-addressed (Overpass, Playtomic)
Files are named by date or content. `write_gzip_atomic()` writes to a `.tmp` sibling,
then renames, so a crash never leaves partial files behind.
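A minimal sketch of that write-then-rename pattern, assuming a `(dest, data)` signature that the real `write_gzip_atomic()` may not share:

```python
import gzip
import os
from pathlib import Path


def write_gzip_atomic(dest: Path, data: bytes) -> int:
    """Gzip data, write it to a .tmp sibling, then atomically rename.

    Sketch of the utils.py helper described above; signature assumed.
    Returns the number of compressed bytes written.
    """
    tmp = dest.with_name(dest.name + ".tmp")
    payload = gzip.compress(data)
    tmp.write_bytes(payload)
    os.replace(tmp, dest)  # atomic on POSIX: readers never see a partial file
    return len(payload)
```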
## State tracking
Every run writes one row to `data/landing/.state.sqlite`:
```bash
sqlite3 data/landing/.state.sqlite \
  "SELECT extractor, started_at, status, files_written, cursor_value
   FROM extraction_runs ORDER BY run_id DESC LIMIT 10"
```
| Column | Type | Description |
|---|---|---|
| `run_id` | INTEGER | Auto-increment primary key |
| `extractor` | TEXT | Extractor name (e.g. `overpass`, `eurostat`) |
| `started_at` | TEXT | ISO 8601 UTC timestamp |
| `finished_at` | TEXT | ISO 8601 UTC timestamp |
| `status` | TEXT | `running` → `success` or `failed` |
| `files_written` | INTEGER | New files written this run |
| `files_skipped` | INTEGER | Files already present |
| `bytes_written` | INTEGER | Compressed bytes written |
| `cursor_value` | TEXT | Resume cursor (date, index, etc.) |
| `error_message` | TEXT | Exception message if failed |
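A sketch of how the `run_extractor()` wrapper might maintain this table. The column names come from the schema above, but the control flow and the `run_with_state` name here are assumptions:

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """CREATE TABLE IF NOT EXISTS extraction_runs (
    run_id INTEGER PRIMARY KEY AUTOINCREMENT,
    extractor TEXT, started_at TEXT, finished_at TEXT, status TEXT,
    files_written INTEGER, files_skipped INTEGER, bytes_written INTEGER,
    cursor_value TEXT, error_message TEXT)"""


def run_with_state(conn: sqlite3.Connection, name: str, extract_fn):
    """Wrap one extractor run in a row of extraction_runs (sketch)."""
    conn.execute(SCHEMA)
    now = lambda: datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO extraction_runs (extractor, started_at, status) "
        "VALUES (?, ?, 'running')", (name, now()))
    run_id = cur.lastrowid
    try:
        stats = extract_fn()
        conn.execute(
            "UPDATE extraction_runs SET finished_at=?, status='success', "
            "files_written=?, files_skipped=?, bytes_written=? WHERE run_id=?",
            (now(), stats.get("files_written", 0), stats.get("files_skipped", 0),
             stats.get("bytes_written", 0), run_id))
    except Exception as exc:
        # Record the failure, then re-raise so the caller still sees it.
        conn.execute(
            "UPDATE extraction_runs SET finished_at=?, status='failed', "
            "error_message=? WHERE run_id=?", (now(), str(exc), run_id))
        raise
    finally:
        conn.commit()
```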
## Landing zone structure
```
data/landing/
├── .state.sqlite
├── overpass/{year}/{month}/courts.json.gz
├── eurostat/{year}/{month}/urb_cpop1.json.gz
├── eurostat/{year}/{month}/ilc_di03.json.gz
├── playtomic/{year}/{month}/tenants.json.gz
└── playtomic/{year}/{month}/availability_{date}.json.gz
```
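The `landing_path()` helper from utils.py presumably just assembles and creates this `{source}/{year}/{month}` hierarchy; a sketch under that assumption:

```python
from pathlib import Path


def landing_path(landing_dir, source: str, year: str, month: str) -> Path:
    """Build (and create) the landing directory for one source and month.

    Sketch only; the real helper's signature and behavior may differ.
    """
    p = Path(landing_dir) / source / year / month
    p.mkdir(parents=True, exist_ok=True)  # safe to call on every run
    return p
```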
## Data sources
| Source | Module | Schedule | Notes |
|---|---|---|---|
| Overpass API | `overpass.py` | Daily | OSM padel courts, ~5K nodes |
| Eurostat | `eurostat.py` | Daily (304 most runs) | `urb_cpop1`, `ilc_di03`; ETag dedup |
| Playtomic tenants | `playtomic_tenants.py` | Daily | ~8K venues, bounded pagination |
| Playtomic availability | `playtomic_availability.py` | Daily | Next-day slots, ~4.5h runtime |