# Padelnomics Transform (SQLMesh)

4-layer SQL transformation pipeline using SQLMesh + DuckDB. Reads from the landing zone and produces analytics-ready tables consumed by the web app.

## Running

```bash
# From repo root — plan all changes (shows what will run)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_padelnomics test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_padelnomics format
```

## 4-layer architecture

```
landing/     ← raw files (extraction output)
└── padelnomics/
    └── {year}/{etag}.csv.gz

raw/         ← reads files verbatim
└── raw.padelnomics

staging/     ← type casting, deduplication
└── staging.stg_padelnomics

foundation/  ← business logic, dimensions, facts
└── foundation.dim_category

serving/     ← pre-aggregated for web app
└── serving.padelnomics_metrics
```

### raw/ — verbatim source reads

- Reads landing zone files directly with `read_csv(..., all_varchar=true)`
- No transformations, no business logic
- Column names match the source exactly
- Uses a macro (`@padelnomics_glob()`) so new landing files are picked up automatically
- Naming: `raw.<source>`

### staging/ — type casting and cleansing

- One model per raw model (1:1)
- Cast all columns to the correct types: `TRY_CAST(report_date AS DATE)`
- Deduplicate if the source produces duplicates
- Minimal renaming — only where raw names are genuinely unclear
- Naming: `staging.stg_<source>`

### foundation/ — business logic

- Dimensions (`dim_*`): slowly changing attributes, one row per entity
- Facts (`fact_*`): events and measurements, one row per event
- May join across multiple staging models from different sources
- Surrogate keys: `MD5(business_key)` for stable joins
- Naming: `foundation.dim_<entity>`, `foundation.fact_<event>`

### serving/ — analytics-ready aggregates

- Pre-aggregated for the web app's specific query patterns
- These are the only tables the web app reads
- Queried from `analytics.py` via
`fetch_analytics()`
- Named to match what the frontend expects
- Naming: `serving.<table>`

## Adding a new data source

1. Add a landing zone directory in the extraction package
2. Add a glob macro in `macros/__init__.py`:

   ```python
   import os

   from sqlmesh import macro


   @macro()
   def my_source_glob(evaluator) -> str:
       landing_dir = evaluator.var("LANDING_DIR") or os.environ.get("LANDING_DIR", "data/landing")
       return f"'{landing_dir}/my_source/**/*.csv.gz'"
   ```

3. Add a raw model: `models/raw/raw_my_source.sql`
4. Add a staging model: `models/staging/stg_my_source.sql`
5. Join into foundation or serving models as needed

## Model materialization

| Layer | Default kind | Rationale |
|------------|------|-----------|
| raw        | FULL | Always re-reads all files; cheap with DuckDB's parallel scan |
| staging    | FULL | 1:1 with raw; same cost |
| foundation | FULL | Business logic rarely changes; recompute is fast |
| serving    | FULL | Small aggregates; the web app needs the latest at all times |

For large historical tables, switch to `kind INCREMENTAL_BY_TIME_RANGE` with a time partition column. SQLMesh handles the incremental logic automatically.

## Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `local.duckdb` | DuckDB file (SQLMesh has exclusive write access) |

The web app reads from a **separate** `analytics.duckdb` file, exported by `export_serving.py`. Never point `DUCKDB_PATH` and `SERVING_DUCKDB_PATH` at the same file — SQLMesh holds an exclusive write lock during plan/run.
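As an illustration of the raw layer described above, a verbatim-read model might look like the following sketch. The model name `raw.padelnomics` and the `@padelnomics_glob()` macro come from this README; the exact file contents are an assumption, not the project's actual model.

```sql
-- models/raw/raw_padelnomics.sql (hypothetical sketch)
MODEL (
  name raw.padelnomics,
  kind FULL
);

-- Read every landing file matched by the glob macro verbatim.
-- all_varchar keeps every column as text so no rows are lost to
-- type-inference failures at this layer.
SELECT *
FROM read_csv(@padelnomics_glob(), all_varchar = true);
```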
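A matching staging model (1:1 with the raw model) could apply the `TRY_CAST` and deduplication rules from the staging section. The column names `report_date`, `amount`, and `category` are assumptions for illustration only.

```sql
-- models/staging/stg_padelnomics.sql (hypothetical sketch)
MODEL (
  name staging.stg_padelnomics,
  kind FULL
);

SELECT DISTINCT  -- deduplicate if the source produces duplicates
  TRY_CAST(report_date AS DATE)  AS report_date,  -- NULL on unparseable values
  TRY_CAST(amount AS DOUBLE)     AS amount,
  category
FROM raw.padelnomics;
```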
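The foundation layer's `MD5(business_key)` surrogate-key convention can be sketched with the `dim_category` model named in the architecture tree; treating `category` as the business key is an assumption.

```sql
-- models/foundation/dim_category.sql (hypothetical sketch)
MODEL (
  name foundation.dim_category,
  kind FULL
);

SELECT DISTINCT
  MD5(category) AS category_key,  -- surrogate key derived from the business key
  category
FROM staging.stg_padelnomics;
```

Hashing the business key rather than using row numbers keeps the key stable across full rebuilds, which matters since every layer defaults to `kind FULL`.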
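A serving model pre-aggregates for a specific web app query, as in this sketch of `serving.padelnomics_metrics` from the architecture tree. The aggregated columns are invented for illustration; the real model matches whatever the frontend expects.

```sql
-- models/serving/padelnomics_metrics.sql (hypothetical sketch)
MODEL (
  name serving.padelnomics_metrics,
  kind FULL
);

SELECT
  category,
  COUNT(*)    AS n_records,
  SUM(amount) AS total_amount
FROM staging.stg_padelnomics
GROUP BY category;
```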
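The switch to `INCREMENTAL_BY_TIME_RANGE` mentioned under model materialization might look roughly like this; the model and column names are hypothetical. SQLMesh substitutes `@start_date` and `@end_date` for each incremental interval it processes.

```sql
-- hypothetical incremental fact model
MODEL (
  name foundation.fact_padelnomics_events,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column report_date  -- the time partition column
  )
);

SELECT
  report_date,
  category,
  amount
FROM staging.stg_padelnomics
WHERE report_date BETWEEN @start_date AND @end_date;
```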