# Padelnomics Transform (SQLMesh)
3-layer SQL transformation pipeline using SQLMesh + DuckDB. Reads from the landing zone, produces analytics-ready tables consumed by the web app via an atomically-swapped serving DB.
## Running

```bash
# From repo root — plan all changes (shows what will run)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_padelnomics test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_padelnomics format

# Export serving tables to analytics.duckdb (run after SQLMesh)
DUCKDB_PATH=data/lakehouse.duckdb SERVING_DUCKDB_PATH=data/analytics.duckdb \
  uv run python -m padelnomics.export_serving
```
## 3-layer architecture

```
landing/     ← raw files (extraction output)
├── overpass/*/*/courts.json.gz
├── eurostat/*/*/urb_cpop1.json.gz
└── playtomic/*/*/tenants.json.gz

staging/     ← reads landing files directly, type casting, dedup
├── staging.stg_padel_courts
├── staging.stg_playtomic_venues
└── staging.stg_population

foundation/  ← business logic, dimensions, facts
├── foundation.dim_venues             ← conformed venue dimension (Playtomic + OSM)
├── foundation.dim_cities             ← conformed city dimension (venue-derived + Eurostat)
├── foundation.dim_venue_capacity     ← static capacity attributes per venue
├── foundation.fct_availability_slot  ← event-grain: one row per deduplicated slot
└── foundation.fct_daily_availability ← venue-day aggregate: occupancy + revenue estimates

serving/     ← pre-aggregated for web app
├── serving.city_market_profile
└── serving.planner_defaults
```
### staging/ — read landing files + type casting

- Reads landing zone JSON files directly with `read_json(..., format='auto', filename=true)`
- Uses the `@LANDING_DIR` variable for file path discovery
- Casts all columns to correct types: `TRY_CAST(... AS DOUBLE)`
- Deduplicates where the source produces duplicates (`ROW_NUMBER` partitioned on ID)
- Validates coordinates, nulls, and data quality inline
- Naming: `staging.stg_<source>`
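Put together, a staging model following these conventions might look like the sketch below. Column names, the glob path, and the dedup ordering are illustrative, not the actual `stg_padel_courts` model:

```sql
MODEL (
  name staging.stg_padel_courts,
  kind FULL
);

SELECT *
FROM (
  SELECT
    raw.id::VARCHAR             AS court_id,
    TRY_CAST(raw.lat AS DOUBLE) AS latitude,
    TRY_CAST(raw.lon AS DOUBLE) AS longitude,
    raw.filename                AS source_file,
    -- keep one row per court, newest landing file wins
    ROW_NUMBER() OVER (
      PARTITION BY raw.id
      ORDER BY raw.filename DESC
    ) AS rn
  FROM read_json(
    @LANDING_DIR || '/overpass/*/*/courts.json.gz',
    format = 'auto',
    filename = true
  ) AS raw
  -- inline data-quality validation
  WHERE TRY_CAST(raw.lat AS DOUBLE) BETWEEN -90 AND 90
    AND TRY_CAST(raw.lon AS DOUBLE) BETWEEN -180 AND 180
)
WHERE rn = 1
```

`filename = true` makes DuckDB expose the source file path as a column, which is what the dedup ordering and lineage rely on.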
### foundation/ — business logic

- Dimensions (`dim_*`): slowly changing attributes, one row per entity
- Facts (`fct_*`): events and measurements, one row per event
- May join across multiple staging models from different sources
- Naming: `foundation.dim_<entity>`, `foundation.fct_<event>`
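As a sketch, a conformed dimension like `dim_venues` could union the two staging sources and keep one row per entity. The columns and the preference rule here are illustrative assumptions, not the real model:

```sql
MODEL (
  name foundation.dim_venues,
  kind FULL
);

WITH all_venues AS (
  SELECT venue_id, venue_name, latitude, longitude, 'playtomic' AS source
  FROM staging.stg_playtomic_venues
  UNION ALL
  SELECT court_id, court_name, latitude, longitude, 'osm' AS source
  FROM staging.stg_padel_courts
)
SELECT *
FROM all_venues
-- when both sources cover a venue, prefer the richer Playtomic record
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY venue_id
  ORDER BY (source = 'playtomic') DESC
) = 1
```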
### serving/ — analytics-ready aggregates

- Pre-aggregated for specific web app query patterns
- These are the only tables the web app reads (via `analytics.duckdb`)
- Queried from `analytics.py` via `fetch_analytics()`
- Naming: `serving.<purpose>`
## Two-DuckDB architecture

```
data/lakehouse.duckdb  ← SQLMesh exclusive write (DUCKDB_PATH)
├── staging.*
├── foundation.*
└── serving.*

data/analytics.duckdb  ← web app read-only (SERVING_DUCKDB_PATH)
└── serving.*          ← atomically replaced by export_serving.py
```
SQLMesh holds an exclusive write lock on `lakehouse.duckdb` during plan/run,
while the web app needs read-only access at all times. `export_serving.py` copies
the `serving.*` tables to a temp file, then atomically renames it to `analytics.duckdb`.
The web app detects the inode change on its next query — no restart needed.

Never point `DUCKDB_PATH` and `SERVING_DUCKDB_PATH` to the same file.
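The copy-then-rename step can be sketched with stdlib primitives. This is not the real `export_serving.py` (the function name and suffix are illustrative), just the atomic-publish pattern it relies on:

```python
import os
import shutil
import tempfile


def atomic_publish(src: str, dst: str) -> None:
    """Copy src to a temp file next to dst, then atomically rename it
    over dst. Readers see either the old or the new file, never a
    partially written one."""
    dst_dir = os.path.dirname(os.path.abspath(dst)) or "."
    # The temp file must live on the same filesystem as dst, otherwise
    # os.replace degrades to a non-atomic copy.
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir, suffix=".duckdb.tmp")
    os.close(fd)
    try:
        shutil.copyfile(src, tmp_path)
        os.replace(tmp_path, dst)  # atomic on POSIX and Windows
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # don't leave temp debris behind
        raise
```

Because `os.replace` swaps in a brand-new file, the published DB gets a new inode, which is exactly the change the web app watches for.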
## Adding a new data source

- Add an extractor in `extract/padelnomics_extract/` (see extraction README)
- Add a staging model `models/staging/stg_<source>.sql` that reads landing files directly
- Join it into foundation or serving models as needed
## Model materialization
| Layer | Default kind | Rationale |
|---|---|---|
| staging | FULL | Re-reads all landing files; cheap with DuckDB parallel scan |
| foundation | FULL | Business logic rarely changes; recompute is fast |
| serving | FULL | Small aggregates; web app needs latest at all times |
For large historical tables, switch to `kind INCREMENTAL_BY_TIME_RANGE` with a time partition column.
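A minimal sketch of what that switch could look like, using the slot fact as an example. The model and column names are assumptions; check the SQLMesh docs for the exact kind options and time macros:

```sql
MODEL (
  name foundation.fct_availability_slot,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column slot_start_ts
  )
);

SELECT
  venue_id,
  slot_start_ts,
  price_eur
FROM staging.stg_playtomic_venues
-- SQLMesh substitutes the interval being (back)filled:
WHERE slot_start_ts BETWEEN @start_ts AND @end_ts
```

Each run then processes only the requested time interval instead of recomputing the full table.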
## Environment variables

| Variable | Default | Description |
|---|---|---|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `data/lakehouse.duckdb` | DuckDB file (SQLMesh exclusive write access) |
| `SERVING_DUCKDB_PATH` | `data/analytics.duckdb` | Serving DB (web app reads from here) |