# Padelnomics Transform (SQLMesh)

3-layer SQL transformation pipeline using SQLMesh + DuckDB. Reads from the landing zone and produces analytics-ready tables consumed by the web app via an atomically-swapped serving DB.

## Running

```bash
# From repo root — plan all changes (shows what will run)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_padelnomics test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_padelnomics format

# Export serving tables to analytics.duckdb (run after SQLMesh)
DUCKDB_PATH=data/lakehouse.duckdb SERVING_DUCKDB_PATH=data/analytics.duckdb \
  uv run python -m padelnomics.export_serving
```

## 3-layer architecture

```
landing/     ← raw files (extraction output)
├── overpass/*/*/courts.json.gz
├── eurostat/*/*/urb_cpop1.json.gz
└── playtomic/*/*/tenants.json.gz

staging/     ← reads landing files directly, type casting, dedup
├── staging.stg_padel_courts
├── staging.stg_playtomic_venues
└── staging.stg_population

foundation/  ← business logic, dimensions, facts
├── foundation.dim_venues
└── foundation.dim_cities

serving/     ← pre-aggregated for web app
├── serving.city_market_profile
└── serving.planner_defaults
```

### staging/ — read landing files + type casting

- Reads landing zone JSON files directly with `read_json(..., format='auto', filename=true)`
- Uses the `@LANDING_DIR` variable for file path discovery
- Casts all columns to correct types: `TRY_CAST(... AS DOUBLE)`
- Deduplicates where the source produces duplicates (ROW_NUMBER partitioned on ID)
- Validates coordinates, nulls, and data quality inline
- Naming: `staging.stg_`

See the example staging model at the end of this README.

### foundation/ — business logic

- Dimensions (`dim_*`): slowly changing attributes, one row per entity
- Facts (`fact_*`): events and measurements, one row per event
- May join across multiple staging models from different sources
- Naming: `foundation.dim_`, `foundation.fact_`

### serving/ — analytics-ready aggregates

- Pre-aggregated for specific web app query patterns
- These are the only tables the web app reads (via `analytics.duckdb`)
- Queried from `analytics.py` via `fetch_analytics()`
- Naming: `serving.`

## Two-DuckDB architecture

```
data/lakehouse.duckdb    ← SQLMesh exclusive write (DUCKDB_PATH)
├── staging.*
├── foundation.*
└── serving.*

data/analytics.duckdb    ← web app read-only (SERVING_DUCKDB_PATH)
└── serving.*            ← atomically replaced by export_serving.py
```

SQLMesh holds an exclusive write lock on `lakehouse.duckdb` during plan/run, while the web app needs read-only access at all times. To reconcile the two, `export_serving.py` copies the `serving.*` tables to a temp file, then atomically renames it to `analytics.duckdb`. The web app detects the inode change on the next query — no restart needed.

**Never point DUCKDB_PATH and SERVING_DUCKDB_PATH to the same file.**

## Adding a new data source

1. Add an extractor in `extract/padelnomics_extract/` (see extraction README)
2. Add a staging model: `models/staging/stg_.sql` that reads landing files directly
3. Join into foundation or serving models as needed

## Model materialization

| Layer | Default kind | Rationale |
|-------|-------------|-----------|
| staging | FULL | Re-reads all landing files; cheap with DuckDB parallel scan |
| foundation | FULL | Business logic rarely changes; recompute is fast |
| serving | FULL | Small aggregates; web app needs latest at all times |

For large historical tables, switch to `kind INCREMENTAL_BY_TIME_RANGE` with a time partition column, as sketched below.
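A minimal sketch of what that switch looks like. The model name `foundation.fact_court_snapshots`, its columns, and the `snapshot_date` time column are hypothetical (they are not part of this project); the essential pieces are the kind declaration and the `@start_ds`/`@end_ds` filter on the time column that SQLMesh uses to process each time range incrementally:

```sql
MODEL (
  name foundation.fact_court_snapshots,   -- hypothetical model name
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column snapshot_date             -- hypothetical time partition column
  )
);

SELECT
  venue_id,
  court_count,
  snapshot_date
FROM staging.stg_padel_courts             -- upstream columns are illustrative
WHERE snapshot_date BETWEEN @start_ds AND @end_ds;  -- only the planned interval is recomputed
```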
## Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `data/lakehouse.duckdb` | DuckDB file (SQLMesh exclusive write access) |
| `SERVING_DUCKDB_PATH` | `data/analytics.duckdb` | Serving DB (web app reads from here) |
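## Example staging model

To make the staging conventions concrete, here is a minimal sketch in the shape of `staging.stg_padel_courts`. The landing glob, column names, and dedup key are illustrative assumptions rather than the project's actual schema, and exactly how the path is assembled from `@LANDING_DIR` may differ in the real models; the point is the pattern: `read_json` over the landing zone, `TRY_CAST` for typing, `ROW_NUMBER` for dedup, and inline coordinate validation.

```sql
MODEL (
  name staging.stg_padel_courts,
  kind FULL
);

WITH raw AS (
  SELECT
    *,
    -- keep the row from the lexicographically latest landing file per id (dedup)
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY filename DESC) AS rn
  FROM read_json(
    @LANDING_DIR || '/overpass/*/*/courts.json.gz',  -- glob and path assembly are illustrative
    format = 'auto',
    filename = true
  )
)
SELECT
  CAST(id AS BIGINT)      AS court_id,
  TRY_CAST(lat AS DOUBLE) AS latitude,
  TRY_CAST(lon AS DOUBLE) AS longitude
FROM raw
WHERE rn = 1
  -- inline data-quality checks: drop rows with invalid coordinates
  AND TRY_CAST(lat AS DOUBLE) BETWEEN -90 AND 90
  AND TRY_CAST(lon AS DOUBLE) BETWEEN -180 AND 180;
```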