feat: copier update v0.9.0 — extraction docs, state tracking, architecture guides

Sync template from 29ac25b → v0.9.0 (29 template commits). Due to template's
_subdirectory migration, new files were manually rendered rather than
auto-merged by copier.

New files:
- .claude/CLAUDE.md + coding_philosophy.md (agent instructions)
- extract utils.py: SQLite state tracking for extraction runs
- extract/transform READMEs: architecture & pattern documentation
- infra/supervisor: systemd service + orchestration script
- Per-layer model READMEs (raw, staging, foundation, serving)

Also fixes copier-answers.yml (adds 4 feature toggles, removes stale
payment_provider key) and scopes CLAUDE.md gitignore to root only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 15:44:48 +01:00
parent b76e87a0b6
commit 18ee24818b
14 changed files with 1084 additions and 2 deletions
--- a/transform/sqlmesh_padelnomics/README.md
+++ b/transform/sqlmesh_padelnomics/README.md
@@ -0,0 +1,107 @@
+# Padelnomics Transform (SQLMesh)
+
+4-layer SQL transformation pipeline using SQLMesh + DuckDB. Reads from the landing zone, produces analytics-ready tables consumed by the web app.
+
+## Running
+
+```bash
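+# From repo root: plan all changes (prints a summary and prompts before applying;
+# targets prod when no environment is given)
+uv run sqlmesh -p transform/sqlmesh_padelnomics plan
+
+# Same plan with the prod environment named explicitly
+uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod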
+
+# Run model tests
+uv run sqlmesh -p transform/sqlmesh_padelnomics test
+
+# Format SQL
+uv run sqlmesh -p transform/sqlmesh_padelnomics format
+```
+
+## 4-layer architecture
+
+```
+landing/                    <- raw files (extraction output)
+  +-- padelnomics/
+      +-- {year}/{etag}.csv.gz
+
+raw/                        <- reads files verbatim
+  +-- raw.padelnomics
+
+staging/                    <- type casting, deduplication
+  +-- staging.stg_padelnomics
+
+foundation/                 <- business logic, dimensions, facts
+  +-- foundation.dim_category
+
+serving/                    <- pre-aggregated for web app
+  +-- serving.padelnomics_metrics
+```
+
+### raw/ — verbatim source reads
+
+- Reads landing zone files directly with `read_csv(..., all_varchar=true)`
+- No transformations, no business logic
+- Column names match the source exactly
+- Uses a macro (`@padelnomics_glob()`) so new landing files are picked up automatically
+- Naming: `raw.<source>`
+
+### staging/ — type casting and cleansing
+
+- One model per raw model (1:1)
+- Cast all columns to correct types: `TRY_CAST(report_date AS DATE)`
+- Deduplicate if source produces duplicates
+- Minimal renaming — only where raw names are genuinely unclear
+- Naming: `staging.stg_<source>`
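+
+The staging rules above can be approximated outside SQL. The following is a minimal Python sketch, not the project's model: `try_cast_date` imitates the `TRY_CAST(... AS DATE)` behaviour of yielding NULL (here `None`) instead of failing, and `stage` drops exact duplicate rows. The `category` column and both function names are hypothetical; DuckDB's own date parsing accepts more formats than this helper.
+
+```python
+from datetime import date
+
+
+def try_cast_date(value):
+    """Return a date for ISO 'YYYY-MM-DD' strings, or None when parsing fails."""
+    try:
+        return date.fromisoformat(value)
+    except (TypeError, ValueError):
+        return None
+
+
+def stage(rows):
+    """Cast report_date and drop exact duplicates, keeping the first occurrence."""
+    seen = set()
+    staged = []
+    for row in rows:
+        typed = {**row, "report_date": try_cast_date(row["report_date"])}
+        # Sort by column name only, so None and date values are never compared.
+        key = tuple(sorted(typed.items(), key=lambda item: item[0]))
+        if key not in seen:
+            seen.add(key)
+            staged.append(typed)
+    return staged
+
+
+# Raw rows arrive as strings only, as with all_varchar=true (hypothetical data).
+raw_rows = [
+    {"report_date": "2026-01-31", "category": "courts"},
+    {"report_date": "2026-01-31", "category": "courts"},
+    {"report_date": "not a date", "category": "courts"},
+]
+staged_rows = stage(raw_rows)
+```
+
+The unparseable date survives as a row with a null `report_date` rather than aborting the run, which is the reason to prefer `TRY_CAST` over `CAST` in this layer.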
+
+### foundation/ — business logic
+
+- Dimensions (`dim_*`): slowly changing attributes, one row per entity
+- Facts (`fact_*`): events and measurements, one row per event
+- May join across multiple staging models from different sources
+- Surrogate keys: `MD5(business_key)` for stable joins
+- Naming: `foundation.dim_<entity>`, `foundation.fact_<event>`
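+
+To illustrate why an MD5-based surrogate key gives stable joins, here is a small Python sketch using the standard library. It assumes the business key is a UTF-8 string; the function name `surrogate_key` is hypothetical and not part of this repo.
+
+```python
+import hashlib
+
+
+def surrogate_key(business_key: str) -> str:
+    """Hex MD5 digest of the business key: the same input always gives the same key."""
+    return hashlib.md5(business_key.encode("utf-8")).hexdigest()
+
+
+# The key depends only on the business key, not on load order or run time,
+# so a fact row and its dimension row derive matching keys independently.
+key_from_dim = surrogate_key("courts")
+key_from_fact = surrogate_key("courts")
+other_key = surrogate_key("memberships")
+```
+
+Because the key is derived rather than assigned from a sequence, a full rebuild of the foundation layer reproduces the same keys.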
+
+### serving/ — analytics-ready aggregates
+
+- Pre-aggregated for specific web app query patterns
+- These are the only tables the web app reads
+- Queried from `analytics.py` via `fetch_analytics()`
+- Named to match what the frontend expects
+- Naming: `serving.<purpose>`
+
+## Adding a new data source
+
+1. Add a landing zone directory in the extraction package
+2. Add a glob macro in `macros/__init__.py`:
+   ```python
+   import os
+
+   from sqlmesh import macro
+
+   @macro()
+   def my_source_glob(evaluator) -> str:
+       landing_dir = evaluator.var("LANDING_DIR") or os.environ.get("LANDING_DIR", "data/landing")
+       return f"'{landing_dir}/my_source/**/*.csv.gz'"
+   ```
+3. Add a raw model: `models/raw/raw_my_source.sql`
+4. Add a staging model: `models/staging/stg_my_source.sql`
+5. Join into foundation or serving models as needed
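+
+Before wiring up the raw model, it can help to check which landing files the pattern from step 2 would pick up. The sketch below is a standalone approximation using `pathlib`, not SQLMesh or DuckDB; the helper `matching_landing_files` and the file names are hypothetical, and DuckDB's glob matching may differ in edge cases.
+
+```python
+import tempfile
+from pathlib import Path
+
+
+def matching_landing_files(landing_dir: str, source: str) -> list[str]:
+    """List files under <landing_dir>/<source> that match **/*.csv.gz."""
+    root = Path(landing_dir) / source
+    return sorted(
+        path.relative_to(landing_dir).as_posix()
+        for path in root.glob("**/*.csv.gz")
+    )
+
+
+# Build a throwaway landing zone with two data files and one unrelated file.
+with tempfile.TemporaryDirectory() as landing_dir:
+    for relative in (
+        "my_source/2025/abc123.csv.gz",
+        "my_source/2026/def456.csv.gz",
+        "my_source/2026/notes.txt",
+    ):
+        path = Path(landing_dir) / relative
+        path.parent.mkdir(parents=True, exist_ok=True)
+        path.touch()
+    matched = matching_landing_files(landing_dir, "my_source")
+```
+
+Only the two `.csv.gz` files are matched, regardless of which year directory they sit in.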
+
+## Model materialization
+
+| Layer | Default kind | Rationale |
+|-------|-------------|-----------|
+| raw | FULL | Always re-reads all files; cheap with DuckDB parallel scan |
+| staging | FULL | 1:1 with raw; same cost |
+| foundation | FULL | Business logic rarely changes; recompute is fast |
+| serving | FULL | Small aggregates; web app needs latest at all times |
+
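+For large historical tables, switch to `kind INCREMENTAL_BY_TIME_RANGE` and declare a `time_column`. SQLMesh tracks which time intervals have already been processed and only runs the missing ones, but the model query must filter on that column with the `@start_ds` / `@end_ds` macro variables.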
+
+## Environment variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `LANDING_DIR` | `data/landing` | Root of the landing zone |
+| `DUCKDB_PATH` | `local.duckdb` | DuckDB file (SQLMesh exclusive write access) |
+
+The web app reads from a **separate** `analytics.duckdb` file (produced via `export_serving.py`).
+Never point `DUCKDB_PATH` and `SERVING_DUCKDB_PATH` at the same file:
+SQLMesh holds an exclusive write lock during plan/run, which would block the web app from opening it.
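+
+A startup guard can catch this misconfiguration early. The following is a minimal sketch, not code from this repo: the function name is hypothetical, and the `analytics.duckdb` fallback for `SERVING_DUCKDB_PATH` is an assumption, since its default is not documented here.
+
+```python
+import os
+from pathlib import Path
+
+
+def assert_distinct_duckdb_files(env=os.environ) -> None:
+    """Raise if the pipeline and serving variables resolve to the same file."""
+    pipeline = Path(env.get("DUCKDB_PATH", "local.duckdb")).resolve()
+    # Assumed fallback; the real default of SERVING_DUCKDB_PATH may differ.
+    serving = Path(env.get("SERVING_DUCKDB_PATH", "analytics.duckdb")).resolve()
+    if pipeline == serving:
+        raise ValueError(
+            f"DUCKDB_PATH and SERVING_DUCKDB_PATH both resolve to {pipeline}"
+        )
+
+
+# Distinct files pass silently.
+assert_distinct_duckdb_files(
+    {"DUCKDB_PATH": "local.duckdb", "SERVING_DUCKDB_PATH": "analytics.duckdb"}
+)
+```
+
+Resolving both paths first means that `local.duckdb` and `./local.duckdb` are treated as the same file.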