feat: migrate transform to 3-layer architecture with per-layer schemas
Remove raw/ layer — staging models now read landing JSON directly. Rename all model schemas from padelnomics.* to staging.*/foundation.*/serving.*. Web app queries updated to serving.planner_defaults via SERVING_DUCKDB_PATH. Supervisor gets daily sleep interval between pipeline runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Padelnomics Transform (SQLMesh)
3-layer SQL transformation pipeline using SQLMesh + DuckDB. Reads from the landing zone, produces analytics-ready tables consumed by the web app via an atomically-swapped serving DB.
## Running

```
# Run tests
uv run sqlmesh -p transform/sqlmesh_padelnomics test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_padelnomics format

# Export serving tables to analytics.duckdb (run after SQLMesh)
DUCKDB_PATH=data/lakehouse.duckdb SERVING_DUCKDB_PATH=data/analytics.duckdb \
  uv run python -m padelnomics.export_serving
```
## 3-layer architecture

```
landing/      ← raw files (extraction output)
├── overpass/*/*/courts.json.gz
├── eurostat/*/*/urb_cpop1.json.gz
└── playtomic/*/*/tenants.json.gz

staging/      ← reads landing files directly, type casting, dedup
├── staging.stg_padel_courts
├── staging.stg_playtomic_venues
└── staging.stg_population

foundation/   ← business logic, dimensions, facts
├── foundation.dim_venues
└── foundation.dim_cities

serving/      ← pre-aggregated for web app
├── serving.city_market_profile
└── serving.planner_defaults
```
### staging/ — read landing files + type casting

- Reads landing zone JSON files directly with `read_json(..., format='auto', filename=true)`
- Uses `@LANDING_DIR` variable for file path discovery
- Casts all columns to correct types: `TRY_CAST(... AS DOUBLE)`
- Deduplicates where source produces duplicates (ROW_NUMBER partitioned on ID)
- Validates coordinates, nulls, and data quality inline
- Naming: `staging.stg_<source>`
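
The dedup rule can be mimicked in plain Python for illustration. The models express it in SQL as `ROW_NUMBER()` partitioned on ID; the `filename` ordering and the record fields below are illustrative assumptions, not taken from the models:

```python
# Illustration of the staging dedup rule: keep one row per ID,
# like ROW_NUMBER() OVER (PARTITION BY id ORDER BY filename DESC) = 1.
# Field names and the filename ordering are assumptions for this sketch.
def dedup_latest(rows: list[dict]) -> list[dict]:
    best: dict[str, dict] = {}
    for row in rows:
        prev = best.get(row["id"])
        # Later landing filename wins, mirroring ORDER BY filename DESC
        if prev is None or row["filename"] > prev["filename"]:
            best[row["id"]] = row
    return list(best.values())

rows = [
    {"id": "venue-1", "filename": "2023/abc.json.gz", "name": "Old name"},
    {"id": "venue-1", "filename": "2024/def.json.gz", "name": "New name"},
    {"id": "venue-2", "filename": "2024/xyz.json.gz", "name": "Other"},
]
deduped = dedup_latest(rows)
```

Preferring the most recent landing file means a re-extracted venue supersedes its older copy, which is the behavior staging needs when the same entity appears in multiple landing files.
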
### foundation/ — business logic

- Dimensions (`dim_*`): slowly changing attributes, one row per entity
- Facts (`fact_*`): events and measurements, one row per event
- May join across multiple staging models from different sources
- Surrogate keys: `MD5(business_key)` for stable joins
- Naming: `foundation.dim_<entity>`, `foundation.fact_<event>`
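
The surrogate-key convention can be checked from Python: `hashlib.md5` over the same input string yields the same hex digest as an `MD5()` call in SQL, so keys stay stable across runs. The business-key composition below is an illustrative assumption:

```python
import hashlib

def surrogate_key(business_key: str) -> str:
    # Hex MD5 digest of the business key; the same entity gets the
    # same key on every run, which is what stable joins rely on.
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()

# e.g. a venue keyed by a source-qualified identifier (illustrative)
venue_sk = surrogate_key("playtomic:venue-123")
```
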

### serving/ — analytics-ready aggregates

- Pre-aggregated for specific web app query patterns
- These are the only tables the web app reads (via `analytics.duckdb`)
- Queried from `analytics.py` via `fetch_analytics()`
- Named to match what the frontend expects
- Naming: `serving.<purpose>`
## Two-DuckDB architecture

```
data/lakehouse.duckdb   ← SQLMesh exclusive write (DUCKDB_PATH)
├── staging.*
├── foundation.*
└── serving.*

data/analytics.duckdb   ← web app read-only (SERVING_DUCKDB_PATH)
└── serving.*           ← atomically replaced by export_serving.py
```

SQLMesh holds an exclusive write lock on `lakehouse.duckdb` during plan/run. The web app needs read-only access at all times. `export_serving.py` copies `serving.*` tables to a temp file, then atomically renames it to `analytics.duckdb`. The web app detects the inode change on next query — no restart needed.
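
A minimal sketch of that publish step using stdlib primitives. The real `export_serving.py` copies the `serving.*` tables with DuckDB; here the build result is stood in for by a plain file copy:

```python
import os
import shutil
import tempfile

def atomic_publish(built_db: str, serving_path: str = "data/analytics.duckdb") -> None:
    """Publish a freshly built serving DB without ever exposing a partial file."""
    dirname = os.path.dirname(serving_path) or "."
    os.makedirs(dirname, exist_ok=True)
    # Write the new file in the same directory as the target so the final
    # rename stays on one filesystem and os.replace() is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".duckdb.tmp")
    os.close(fd)
    shutil.copyfile(built_db, tmp_path)
    # Readers holding the old file keep a valid handle; new opens see the
    # new inode — which is how the web app detects the swap.
    os.replace(tmp_path, serving_path)
```

`os.replace` never leaves a moment where `analytics.duckdb` is missing or half-written, which is why no restart or coordination with the web app is needed.
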
**Never point DUCKDB_PATH and SERVING_DUCKDB_PATH to the same file.**
## Adding a new data source

1. Add an extractor in `extract/padelnomics_extract/` (see extraction README)
2. Add a staging model: `models/staging/stg_<source>.sql` that reads landing files directly
3. Join into foundation or serving models as needed
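
Before writing the staging model for step 2, it can help to confirm the landing glob actually matches files. The helper and its `*.json.gz` pattern are assumptions based on the landing layout shown above, not project code:

```python
from pathlib import Path

def landing_files(source: str, landing_dir: str = "data/landing") -> list[str]:
    # Same shape the staging models glob over, e.g.
    # data/landing/overpass/**/*.json.gz (pattern assumed from the tree above)
    return sorted(str(p) for p in Path(landing_dir).glob(f"{source}/**/*.json.gz"))
```

An empty result usually means the extractor has not run yet or the source directory name does not match the model's glob.
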
## Model materialization

| Layer | Default kind | Rationale |
|-------|-------------|-----------|
| staging | FULL | Re-reads all landing files; cheap with DuckDB parallel scan |
| foundation | FULL | Business logic rarely changes; recompute is fast |
| serving | FULL | Small aggregates; web app needs latest at all times |

For large historical tables, switch to `kind INCREMENTAL_BY_TIME_RANGE` with a time partition column.
## Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `data/lakehouse.duckdb` | DuckDB file (SQLMesh exclusive write access) |
| `SERVING_DUCKDB_PATH` | `data/analytics.duckdb` | Serving DB (web app reads from here) |
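
Since pointing both variables at the same file would let the web app collide with SQLMesh's exclusive write lock, a defensive resolver can refuse to start. This guard is a sketch, not code from the project:

```python
import os

def resolve_paths() -> tuple[str, str]:
    # Defaults match the table above.
    lake = os.environ.get("DUCKDB_PATH", "data/lakehouse.duckdb")
    serving = os.environ.get("SERVING_DUCKDB_PATH", "data/analytics.duckdb")
    # SQLMesh holds an exclusive write lock on the lakehouse file during
    # plan/run, so the web app must never be pointed at the same file.
    if os.path.realpath(lake) == os.path.realpath(serving):
        raise ValueError("DUCKDB_PATH and SERVING_DUCKDB_PATH must differ")
    return lake, serving
```

Comparing `realpath`s also catches the two variables reaching one file through a symlink.
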