# Padelnomics Transform (SQLMesh)

4-layer SQL transformation pipeline using SQLMesh + DuckDB. Reads from the landing zone and produces analytics-ready tables consumed by the web app.

## Running

```bash
# From repo root — plan all changes (shows what will run)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_padelnomics test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_padelnomics format
```

## 4-layer architecture

```
landing/     <- raw files (extraction output)
  +-- padelnomics/
      +-- {year}/{etag}.csv.gz

raw/         <- reads files verbatim
  +-- raw.padelnomics

staging/     <- type casting, deduplication
  +-- staging.stg_padelnomics

foundation/  <- business logic, dimensions, facts
  +-- foundation.dim_category

serving/     <- pre-aggregated for web app
  +-- serving.padelnomics_metrics
```

### raw/ — verbatim source reads

- Reads landing zone files directly with `read_csv(..., all_varchar=true)`
- No transformations, no business logic
- Column names match the source exactly
- Uses a macro (`@padelnomics_glob()`) so new landing files are picked up automatically
- Naming: `raw.<source>`
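
A raw model following these rules might look roughly like the sketch below (the file name and the bare `SELECT *` are assumptions; the macro, `all_varchar=true`, and the `raw.padelnomics` name come from this README):

```sql
-- models/raw/raw_padelnomics.sql (illustrative sketch, not the actual model)
MODEL (
  name raw.padelnomics,
  kind FULL
);

-- Read every landing file verbatim; all_varchar keeps every column as text
-- so nothing is parsed before the staging layer.
SELECT *
FROM read_csv(@padelnomics_glob(), all_varchar = true);
```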

### staging/ — type casting and cleansing

- One model per raw model (1:1)
- Cast all columns to correct types: `TRY_CAST(report_date AS DATE)`
- Deduplicate if source produces duplicates
- Minimal renaming — only where raw names are genuinely unclear
- Naming: `staging.stg_<source>`
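
A minimal staging model under these conventions could look like the following sketch (the `bookings` and `club_name` columns are invented, and `SELECT DISTINCT` is just one way to deduplicate):

```sql
-- models/staging/stg_padelnomics.sql (illustrative sketch; column names invented)
MODEL (
  name staging.stg_padelnomics,
  kind FULL
);

SELECT DISTINCT                                   -- drop exact duplicate rows
  TRY_CAST(report_date AS DATE)    AS report_date,
  TRY_CAST(bookings    AS INTEGER) AS bookings,
  club_name                                       -- raw name kept, still text
FROM raw.padelnomics;
```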

### foundation/ — business logic

- Dimensions (`dim_*`): slowly changing attributes, one row per entity
- Facts (`fact_*`): events and measurements, one row per event
- May join across multiple staging models from different sources
- Surrogate keys: `MD5(business_key)` for stable joins
- Naming: `foundation.dim_<entity>`, `foundation.fact_<event>`
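
As a sketch of the surrogate-key convention, a hypothetical dimension (entity and column names invented, not the real `dim_category`) might be built like this:

```sql
-- models/foundation/dim_club.sql (hypothetical example)
MODEL (
  name foundation.dim_club,
  kind FULL
);

SELECT
  MD5(club_name)   AS club_key,      -- surrogate key derived from the business key
  club_name,
  MIN(report_date) AS first_seen_on  -- one attribute per entity, derived from staging
FROM staging.stg_padelnomics
GROUP BY club_name;
```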

### serving/ — analytics-ready aggregates

- Pre-aggregated for specific web app query patterns
- These are the only tables the web app reads
- Queried from `analytics.py` via `fetch_analytics()`
- Named to match what the frontend expects
- Naming: `serving.<purpose>`
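
A serving model is typically just a pre-aggregation over upstream models, shaped for one frontend view. A sketch (the metric columns are invented; the real `serving.padelnomics_metrics` follows whatever the frontend expects):

```sql
-- models/serving/padelnomics_metrics.sql (illustrative sketch; metrics invented)
MODEL (
  name serving.padelnomics_metrics,
  kind FULL
);

SELECT
  report_date,
  COUNT(*)      AS row_count,
  SUM(bookings) AS total_bookings
FROM staging.stg_padelnomics
GROUP BY report_date
ORDER BY report_date;
```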

## Adding a new data source

1. Add a landing zone directory in the extraction package
2. Add a glob macro in `macros/__init__.py`:

   ```python
   import os

   from sqlmesh import macro


   @macro()
   def my_source_glob(evaluator) -> str:
       landing_dir = evaluator.var("LANDING_DIR") or os.environ.get("LANDING_DIR", "data/landing")
       return f"'{landing_dir}/my_source/**/*.csv.gz'"
   ```

3. Add a raw model: `models/raw/raw_my_source.sql` (a sketch follows this list)
4. Add a staging model: `models/staging/stg_my_source.sql`
5. Join into foundation or serving models as needed
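
For step 3, the raw model for the new source mirrors the existing one and simply consumes the new macro, roughly:

```sql
-- models/raw/raw_my_source.sql (illustrative sketch for step 3)
MODEL (
  name raw.my_source,
  kind FULL
);

SELECT *
FROM read_csv(@my_source_glob(), all_varchar = true);
```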

## Model materialization

| Layer | Default kind | Rationale |
|-------|-------------|-----------|
| raw | FULL | Always re-reads all files; cheap with DuckDB parallel scan |
| staging | FULL | 1:1 with raw; same cost |
| foundation | FULL | Business logic rarely changes; recompute is fast |
| serving | FULL | Small aggregates; web app needs latest at all times |

For large historical tables, switch to `kind INCREMENTAL_BY_TIME_RANGE` with a time partition column. SQLMesh handles the incremental logic automatically.
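
An incremental model declaration could look roughly like the sketch below (model, column, and source names are hypothetical; `@start_date` / `@end_date` are SQLMesh's built-in interval macros):

```sql
-- Illustrative sketch of an incremental model (names are hypothetical)
MODEL (
  name foundation.fact_booking,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column report_date
  )
);

SELECT
  report_date,
  club_name,
  bookings
FROM staging.stg_padelnomics
-- SQLMesh substitutes the interval being (re)computed
WHERE report_date BETWEEN @start_date AND @end_date;
```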

## Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `local.duckdb` | DuckDB file (SQLMesh exclusive write access) |

The web app reads from a **separate** `analytics.duckdb` file via `export_serving.py`. Never point `DUCKDB_PATH` and `SERVING_DUCKDB_PATH` to the same file — SQLMesh holds an exclusive write lock during plan/run.