Part A: Data Layer — Sprints 1-5

Sprint 1 — Eurostat SDMX city labels (unblocks EU population):
- New extractor: eurostat_city_labels.py — fetches the ESTAT/CITIES codelist (city_code → city_name mapping) with ETag dedup
- New staging model: stg_city_labels.sql — grain city_code
- Updated dim_cities.sql — joins Eurostat population via city-code lookup; replaces the hardcoded 0::BIGINT population

Sprint 2 — Market score formula v2:
- city_market_profile.sql: 30pt population (LN/1M), 25pt income PPS (/200), 30pt demand (occupancy or density), 15pt data confidence
- Moved the venue_pricing_benchmarks join into the base CTE so median_occupancy_rate is available to the scoring formula

Sprint 3 — US Census ACS extractor:
- New extractor: census_usa.py — ACS 5-year place population (vintage 2023)
- New staging model: stg_population_usa.sql — grain (place_fips, ref_year)

Sprint 4 — ONS UK extractor:
- New extractor: ons_uk.py — 2021 Census LAD population via the ONS beta API
- New staging model: stg_population_uk.sql — grain (lad_code, ref_year)

Sprint 5 — GeoNames global extractor:
- New extractor: geonames.py — cities15000.zip bulk download, filtered to ≥50K pop
- New staging model: stg_population_geonames.sql — grain geoname_id
- dim_cities: 5-source population cascade (Eurostat > Census > ONS > GeoNames > 0) with case/whitespace-insensitive city name matching

Registered all 4 new CLI entrypoints in pyproject.toml and all.py.
Padelnomics Transform (SQLMesh)
3-layer SQL transformation pipeline using SQLMesh + DuckDB. Reads from the landing zone, produces analytics-ready tables consumed by the web app via an atomically-swapped serving DB.
Running
# From repo root — plan all changes (shows what will run)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan
# Apply to production
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod
# Run model tests
uv run sqlmesh -p transform/sqlmesh_padelnomics test
# Format SQL
uv run sqlmesh -p transform/sqlmesh_padelnomics format
# Export serving tables to analytics.duckdb (run after SQLMesh)
DUCKDB_PATH=data/lakehouse.duckdb SERVING_DUCKDB_PATH=data/analytics.duckdb \
uv run python -m padelnomics.export_serving
3-layer architecture
landing/ ← raw files (extraction output)
├── overpass/*/*/courts.json.gz
├── eurostat/*/*/urb_cpop1.json.gz
└── playtomic/*/*/tenants.json.gz
staging/ ← reads landing files directly, type casting, dedup
├── staging.stg_padel_courts
├── staging.stg_playtomic_venues
└── staging.stg_population
foundation/ ← business logic, dimensions, facts
├── foundation.dim_venues ← conformed venue dimension (Playtomic + OSM)
├── foundation.dim_cities ← conformed city dimension (venue-derived + Eurostat)
├── foundation.dim_venue_capacity ← static capacity attributes per venue
├── foundation.fct_availability_slot ← event-grain: one row per deduplicated slot
└── foundation.fct_daily_availability ← venue-day aggregate: occupancy + revenue estimates
serving/ ← pre-aggregated for web app
├── serving.city_market_profile
└── serving.planner_defaults
staging/ — read landing files + type casting
- Reads landing zone JSON files directly with read_json(..., format='auto', filename=true)
- Uses the @LANDING_DIR variable for file path discovery
- Casts all columns to correct types: TRY_CAST(... AS DOUBLE)
- Deduplicates where a source produces duplicates (ROW_NUMBER partitioned on ID)
- Validates coordinates, nulls, and data quality inline
- Naming: staging.stg_<source>
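Putting those conventions together, a staging model might look like the sketch below. This is a hypothetical example — the source name, glob path, and columns are illustrative, not an actual model in this repo:

```sql
-- Hypothetical staging model; source name, path, and columns are illustrative.
MODEL (
  name staging.stg_example_source,
  kind FULL
);

SELECT
  raw.id AS source_id,
  TRY_CAST(raw.lat AS DOUBLE) AS latitude,   -- cast to correct types
  TRY_CAST(raw.lon AS DOUBLE) AS longitude,
  raw.filename AS landing_file               -- populated by filename=true
FROM read_json(
  @LANDING_DIR || '/example_source/*/*/records.json.gz',
  format = 'auto',
  filename = true
) AS raw
-- keep one row per ID where the source produces duplicates
QUALIFY ROW_NUMBER() OVER (PARTITION BY raw.id ORDER BY raw.filename DESC) = 1;
```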
foundation/ — business logic
- Dimensions (dim_*): slowly changing attributes, one row per entity
- Facts (fct_*): events and measurements, one row per event
- May join across multiple staging models from different sources
- Naming: foundation.dim_<entity>, foundation.fct_<event>
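As one illustration of a cross-source join, the 5-source population cascade in dim_cities (Sprint 5) could be sketched as below. The CTE and column names are assumptions for illustration, not the actual model:

```sql
-- Illustrative fragment only; table and column names are assumed.
SELECT
  c.city_name,
  COALESCE(
    eu.population,   -- Eurostat
    us.population,   -- US Census ACS
    uk.population,   -- ONS UK
    gn.population,   -- GeoNames
    0                -- fallback when no source matches
  ) AS population
FROM venue_derived_cities AS c
-- case/whitespace-insensitive city name matching
LEFT JOIN staging.stg_population AS eu
  ON LOWER(TRIM(c.city_name)) = LOWER(TRIM(eu.city_name))
LEFT JOIN staging.stg_population_usa AS us
  ON LOWER(TRIM(c.city_name)) = LOWER(TRIM(us.place_name))
LEFT JOIN staging.stg_population_uk AS uk
  ON LOWER(TRIM(c.city_name)) = LOWER(TRIM(uk.lad_name))
LEFT JOIN staging.stg_population_geonames AS gn
  ON LOWER(TRIM(c.city_name)) = LOWER(TRIM(gn.city_name));
```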
serving/ — analytics-ready aggregates
- Pre-aggregated for specific web app query patterns
- These are the only tables the web app reads (via analytics.duckdb)
- Queried from analytics.py via fetch_analytics()
- Naming: serving.<purpose>
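The market score v2 weighting in city_market_profile (Sprint 2: 30pt population, 25pt income PPS, 30pt demand, 15pt confidence) could be expressed along the lines below. The capping and column names are assumptions; the exact expressions in city_market_profile.sql may differ:

```sql
-- Hedged sketch of the v2 score components; exact formulas may differ.
SELECT
  city_code,
  LEAST(30, 30 * LN(GREATEST(population, 1)) / LN(1000000)) AS population_pts, -- "LN/1M"
  LEAST(25, 25 * income_pps / 200)                          AS income_pts,     -- "/200"
  LEAST(30, 30 * COALESCE(median_occupancy_rate,
                          venue_density_score))             AS demand_pts,
  15 * data_confidence                                      AS confidence_pts
FROM base;
```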
Two-DuckDB architecture
data/lakehouse.duckdb ← SQLMesh exclusive write (DUCKDB_PATH)
├── staging.*
├── foundation.*
└── serving.*
data/analytics.duckdb ← web app read-only (SERVING_DUCKDB_PATH)
└── serving.* ← atomically replaced by export_serving.py
SQLMesh holds an exclusive write lock on lakehouse.duckdb during plan/run.
The web app needs read-only access at all times. export_serving.py copies
serving.* tables to a temp file, then atomically renames it to analytics.duckdb.
The web app detects the inode change on next query — no restart needed.
Never point DUCKDB_PATH and SERVING_DUCKDB_PATH to the same file.
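The copy step could be sketched in DuckDB SQL roughly as follows; the actual export_serving.py implementation may differ, and the temp-file path is illustrative. The atomic rename itself happens at the OS level, outside SQL:

```sql
-- Run against lakehouse.duckdb; path and table list are illustrative.
ATTACH 'data/analytics.duckdb.tmp' AS tmp;
CREATE SCHEMA tmp.serving;
CREATE TABLE tmp.serving.city_market_profile AS
  SELECT * FROM serving.city_market_profile;
CREATE TABLE tmp.serving.planner_defaults AS
  SELECT * FROM serving.planner_defaults;
DETACH tmp;
-- then, in Python: os.replace('data/analytics.duckdb.tmp', 'data/analytics.duckdb')
```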
Adding a new data source
- Add an extractor in extract/padelnomics_extract/ (see the extraction README)
- Add a staging model: models/staging/stg_<source>.sql that reads landing files directly
- Join into foundation or serving models as needed
Model materialization
| Layer | Default kind | Rationale |
|---|---|---|
| staging | FULL | Re-reads all landing files; cheap with DuckDB parallel scan |
| foundation | FULL | Business logic rarely changes; recompute is fast |
| serving | FULL | Small aggregates; web app needs latest at all times |
For large historical tables, switch to kind INCREMENTAL_BY_TIME_RANGE with a time partition column.
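A hedged sketch of what that switch might look like for the availability fact — the time column name slot_start_ts is an assumption, not the actual schema:

```sql
-- Hypothetical incremental variant; the time column name is assumed.
MODEL (
  name foundation.fct_availability_slot,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column slot_start_ts
  )
);

SELECT *
FROM staging.stg_playtomic_venues
-- SQLMesh substitutes the interval being (re)computed
WHERE slot_start_ts BETWEEN @start_ts AND @end_ts;
```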
Environment variables
| Variable | Default | Description |
|---|---|---|
| LANDING_DIR | data/landing | Root of the landing zone |
| DUCKDB_PATH | data/lakehouse.duckdb | DuckDB file (SQLMesh exclusive write access) |
| SERVING_DUCKDB_PATH | data/analytics.duckdb | Serving DB (the web app reads from here) |