feat: migrate transform to 3-layer architecture with per-layer schemas
Remove raw/ layer — staging models now read landing JSON directly. Rename all model schemas from padelnomics.* to staging.*/foundation.*/serving.*. Web app queries updated to serving.planner_defaults via SERVING_DUCKDB_PATH. Supervisor gets daily sleep interval between pipeline runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Padelnomics Transform (SQLMesh)
3-layer SQL transformation pipeline using SQLMesh + DuckDB. Reads from the landing zone, produces analytics-ready tables consumed by the web app via an atomically-swapped serving DB.
## Running

```
# Run tests
uv run sqlmesh -p transform/sqlmesh_padelnomics test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_padelnomics format

# Export serving tables to analytics.duckdb (run after SQLMesh)
DUCKDB_PATH=data/lakehouse.duckdb SERVING_DUCKDB_PATH=data/analytics.duckdb \
  uv run python -m padelnomics.export_serving
```
## 3-layer architecture

```
landing/      ← raw files (extraction output)
├── overpass/*/*/courts.json.gz
├── eurostat/*/*/urb_cpop1.json.gz
└── playtomic/*/*/tenants.json.gz

staging/      ← reads landing files directly, type casting, dedup
├── staging.stg_padel_courts
├── staging.stg_playtomic_venues
└── staging.stg_population

foundation/   ← business logic, dimensions, facts
├── foundation.dim_venues
└── foundation.dim_cities

serving/      ← pre-aggregated for web app
├── serving.city_market_profile
└── serving.planner_defaults
```
### staging/ — read landing files + type casting

- Reads landing zone JSON files directly with `read_json(..., format='auto', filename=true)`
- Uses `@LANDING_DIR` variable for file path discovery
- Casts all columns to correct types: `TRY_CAST(... AS DOUBLE)`
- Deduplicates where source produces duplicates (ROW_NUMBER partitioned on ID)
- Validates coordinates, nulls, and data quality inline
- Naming: `staging.stg_<source>`
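
The dedup rule can be mimicked in plain Python for illustration. The models express it in SQL as `ROW_NUMBER()` partitioned on ID; the `filename` ordering and the record fields below are illustrative assumptions, not taken from the models:

```python
# Illustration of the staging dedup rule: keep one row per ID,
# like ROW_NUMBER() OVER (PARTITION BY id ORDER BY filename DESC) = 1.
# Field names and the filename ordering are assumptions for this sketch.
def dedup_latest(rows: list[dict]) -> list[dict]:
    best: dict[str, dict] = {}
    for row in rows:
        prev = best.get(row["id"])
        # Later landing filename wins, mirroring ORDER BY filename DESC
        if prev is None or row["filename"] > prev["filename"]:
            best[row["id"]] = row
    return list(best.values())

rows = [
    {"id": "venue-1", "filename": "2023/abc.json.gz", "name": "Old name"},
    {"id": "venue-1", "filename": "2024/def.json.gz", "name": "New name"},
    {"id": "venue-2", "filename": "2024/xyz.json.gz", "name": "Other"},
]
deduped = dedup_latest(rows)
```

Preferring the most recent landing file means a re-extracted venue supersedes its older copy, which is the behavior staging needs when the same entity appears in multiple landing files.
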
### foundation/ — business logic

- Dimensions (`dim_*`): slowly changing attributes, one row per entity
- Facts (`fact_*`): events and measurements, one row per event
- May join across multiple staging models from different sources
- Surrogate keys: `MD5(business_key)` for stable joins
- Naming: `foundation.dim_<entity>`, `foundation.fact_<event>`
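
The surrogate-key convention can be checked from Python: `hashlib.md5` over the same input string yields the same hex digest as an `MD5()` call in SQL, so keys stay stable across runs. The business-key composition below is an illustrative assumption:

```python
import hashlib

def surrogate_key(business_key: str) -> str:
    # Hex MD5 digest of the business key; the same entity gets the
    # same key on every run, which is what stable joins rely on.
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()

# e.g. a venue keyed by a source-qualified identifier (illustrative)
venue_sk = surrogate_key("playtomic:venue-123")
```
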

### serving/ — analytics-ready aggregates

- Pre-aggregated for specific web app query patterns
- These are the only tables the web app reads (via `analytics.duckdb`)
- Queried from `analytics.py` via `fetch_analytics()`
- Named to match what the frontend expects
- Naming: `serving.<purpose>`
## Two-DuckDB architecture

```
data/lakehouse.duckdb   ← SQLMesh exclusive write (DUCKDB_PATH)
├── staging.*
├── foundation.*
└── serving.*

data/analytics.duckdb   ← web app read-only (SERVING_DUCKDB_PATH)
└── serving.*           ← atomically replaced by export_serving.py
```

SQLMesh holds an exclusive write lock on `lakehouse.duckdb` during plan/run. The web app needs read-only access at all times. `export_serving.py` copies `serving.*` tables to a temp file, then atomically renames it to `analytics.duckdb`. The web app detects the inode change on next query — no restart needed.
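
A minimal sketch of that publish step using stdlib primitives. The real `export_serving.py` copies the `serving.*` tables with DuckDB; here the build result is stood in for by a plain file copy:

```python
import os
import shutil
import tempfile

def atomic_publish(built_db: str, serving_path: str = "data/analytics.duckdb") -> None:
    """Publish a freshly built serving DB without ever exposing a partial file."""
    dirname = os.path.dirname(serving_path) or "."
    os.makedirs(dirname, exist_ok=True)
    # Write the new file in the same directory as the target so the final
    # rename stays on one filesystem and os.replace() is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".duckdb.tmp")
    os.close(fd)
    shutil.copyfile(built_db, tmp_path)
    # Readers holding the old file keep a valid handle; new opens see the
    # new inode — which is how the web app detects the swap.
    os.replace(tmp_path, serving_path)
```

`os.replace` never leaves a moment where `analytics.duckdb` is missing or half-written, which is why no restart or coordination with the web app is needed.
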
**Never point DUCKDB_PATH and SERVING_DUCKDB_PATH to the same file.**
## Adding a new data source

1. Add an extractor in `extract/padelnomics_extract/` (see extraction README)
2. Add a staging model: `models/staging/stg_<source>.sql` that reads landing files directly
3. Join into foundation or serving models as needed
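
Before writing the staging model for step 2, it can help to confirm the landing glob actually matches files. The helper and its `*.json.gz` pattern are assumptions based on the landing layout shown above, not project code:

```python
from pathlib import Path

def landing_files(source: str, landing_dir: str = "data/landing") -> list[str]:
    # Same shape the staging models glob over, e.g.
    # data/landing/overpass/**/*.json.gz (pattern assumed from the tree above)
    return sorted(str(p) for p in Path(landing_dir).glob(f"{source}/**/*.json.gz"))
```

An empty result usually means the extractor has not run yet or the source directory name does not match the model's glob.
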
## Model materialization

| Layer | Default kind | Rationale |
|-------|-------------|-----------|
| staging | FULL | Re-reads all landing files; cheap with DuckDB parallel scan |
| foundation | FULL | Business logic rarely changes; recompute is fast |
| serving | FULL | Small aggregates; web app needs latest at all times |

For large historical tables, switch to `kind INCREMENTAL_BY_TIME_RANGE` with a time partition column.
## Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `data/lakehouse.duckdb` | DuckDB file (SQLMesh exclusive write access) |
| `SERVING_DUCKDB_PATH` | `data/analytics.duckdb` | Serving DB (web app reads from here) |
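
Since pointing both variables at the same file would let the web app collide with SQLMesh's exclusive write lock, a defensive resolver can refuse to start. This guard is a sketch, not code from the project:

```python
import os

def resolve_paths() -> tuple[str, str]:
    # Defaults match the table above.
    lake = os.environ.get("DUCKDB_PATH", "data/lakehouse.duckdb")
    serving = os.environ.get("SERVING_DUCKDB_PATH", "data/analytics.duckdb")
    # SQLMesh holds an exclusive write lock on the lakehouse file during
    # plan/run, so the web app must never be pointed at the same file.
    if os.path.realpath(lake) == os.path.realpath(serving):
        raise ValueError("DUCKDB_PATH and SERVING_DUCKDB_PATH must differ")
    return lake, serving
```

Comparing `realpath`s also catches the two variables reaching one file through a symlink.
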