# Padelnomics Transform (SQLMesh)
4-layer SQL transformation pipeline using SQLMesh + DuckDB. Reads from the landing zone, produces analytics-ready tables consumed by the web app.
## Running

```bash
# From repo root — plan all changes (shows what will run)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_padelnomics test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_padelnomics format
```
## 4-layer architecture

```
landing/                        ← raw files (extraction output)
└── padelnomics/
    └── {year}/{etag}.csv.gz

raw/                            ← reads files verbatim
└── raw.padelnomics

staging/                        ← type casting, deduplication
└── staging.stg_padelnomics

foundation/                     ← business logic, dimensions, facts
└── foundation.dim_category

serving/                        ← pre-aggregated for web app
└── serving.padelnomics_metrics
```
### raw/ — verbatim source reads

- Reads landing zone files directly with `read_csv(..., all_varchar=true)`
- No transformations, no business logic
- Column names match the source exactly
- Uses a macro (`@padelnomics_glob()`) so new landing files are picked up automatically (see the sketch below)
- Naming: `raw.<source>`
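A minimal sketch of a raw model following these conventions; the file name and the exact `read_csv` call are assumptions, so the real `raw.padelnomics` model may differ:

```sql
-- models/raw/raw_padelnomics.sql (illustrative sketch)
MODEL (
  name raw.padelnomics,
  kind FULL
);

-- Read every landing file verbatim; all_varchar defers typing to staging
SELECT *
FROM read_csv(@padelnomics_glob(), all_varchar = true)
```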
### staging/ — type casting and cleansing

- One model per raw model (1:1)
- Cast all columns to correct types: `TRY_CAST(report_date AS DATE)` (see the sketch below)
- Deduplicate if source produces duplicates
- Minimal renaming — only where raw names are genuinely unclear
- Naming: `staging.stg_<source>`
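A sketch of a staging model in this style; apart from `report_date`, the column names are invented for illustration and will not match the real `stg_padelnomics`:

```sql
-- models/staging/stg_padelnomics.sql (illustrative; column names assumed)
MODEL (
  name staging.stg_padelnomics,
  kind FULL
);

SELECT DISTINCT                                  -- drop exact duplicates
  TRY_CAST(report_date AS DATE)  AS report_date,
  TRY_CAST(score AS INTEGER)     AS score,
  category
FROM raw.padelnomics
```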
### foundation/ — business logic

- Dimensions (`dim_*`): slowly changing attributes, one row per entity
- Facts (`fact_*`): events and measurements, one row per event
- May join across multiple staging models from different sources
- Surrogate keys: `MD5(business_key)` for stable joins (see the sketch below)
- Naming: `foundation.dim_<entity>`, `foundation.fact_<event>`
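A sketch of a dimension built this way; using `category` as the business key is an assumption carried over from the staging sketch above:

```sql
-- models/foundation/dim_category.sql (illustrative)
MODEL (
  name foundation.dim_category,
  kind FULL
);

SELECT DISTINCT
  MD5(category) AS category_key,   -- surrogate key derived from the business key
  category      AS category_name
FROM staging.stg_padelnomics
```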
### serving/ — analytics-ready aggregates

- Pre-aggregated for specific web app query patterns (see the sketch below)
- These are the only tables the web app reads
- Queried from `analytics.py` via `fetch_analytics()`
- Named to match what the frontend expects
- Naming: `serving.<purpose>`
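A sketch of a serving aggregate in this shape, reusing the invented columns from the earlier sketches; the real `serving.padelnomics_metrics` will expose whatever the web app queries actually need:

```sql
-- models/serving/padelnomics_metrics.sql (illustrative)
MODEL (
  name serving.padelnomics_metrics,
  kind FULL
);

-- One narrow, pre-aggregated table per web app query pattern
SELECT
  d.category_name,
  COUNT(*)      AS report_count,
  AVG(s.score)  AS avg_score
FROM staging.stg_padelnomics AS s
JOIN foundation.dim_category AS d
  ON MD5(s.category) = d.category_key
GROUP BY d.category_name
```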
## Adding a new data source

- Add a landing zone directory in the extraction package
- Add a glob macro in `macros/__init__.py`:

  ```python
  import os

  from sqlmesh import macro  # both imports will already be present if the file defines other macros


  @macro()
  def my_source_glob(evaluator) -> str:
      landing_dir = evaluator.var("LANDING_DIR") or os.environ.get("LANDING_DIR", "data/landing")
      return f"'{landing_dir}/my_source/**/*.csv.gz'"
  ```

- Add a raw model: `models/raw/raw_my_source.sql`
- Add a staging model: `models/staging/stg_my_source.sql`
- Join into foundation or serving models as needed
## Model materialization
| Layer | Default kind | Rationale |
|---|---|---|
| raw | FULL | Always re-reads all files; cheap with DuckDB parallel scan |
| staging | FULL | 1:1 with raw; same cost |
| foundation | FULL | Business logic rarely changes; recompute is fast |
| serving | FULL | Small aggregates; web app needs latest at all times |
For large historical tables, switch to `kind INCREMENTAL_BY_TIME_RANGE` with a time partition column. SQLMesh handles the incremental logic automatically.
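A sketch of what that switch could look like; the model name `foundation.fact_match` and the `match_date`, `category`, and `score` columns are placeholders, not real padelnomics models:

```sql
-- Illustrative only: an incremental fact partitioned on a date column
MODEL (
  name foundation.fact_match,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column match_date
  )
);

SELECT
  match_date,
  category,
  score
FROM staging.stg_padelnomics
-- SQLMesh substitutes the interval being (back)filled for @start_ds / @end_ds
WHERE match_date BETWEEN @start_ds AND @end_ds
```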
## Environment variables

| Variable | Default | Description |
|---|---|---|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `local.duckdb` | DuckDB file (SQLMesh has exclusive write access) |
The web app reads from a separate `analytics.duckdb` file via `export_serving.py`. Never point `DUCKDB_PATH` and `SERVING_DUCKDB_PATH` to the same file: SQLMesh holds an exclusive write lock during `plan`/`run`.