# Padelnomics Transform (SQLMesh)
4-layer SQL transformation pipeline using SQLMesh + DuckDB. Reads from the landing zone, produces analytics-ready tables consumed by the web app.
## Running

```bash
# From repo root — plan all changes (shows what will run)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_padelnomics test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_padelnomics format
```
## 4-layer architecture

```
landing/                        ← raw files (extraction output)
└── padelnomics/
    └── {year}/{etag}.csv.gz

raw/                            ← reads files verbatim
└── raw.padelnomics

staging/                        ← type casting, deduplication
└── staging.stg_padelnomics

foundation/                     ← business logic, dimensions, facts
└── foundation.dim_category

serving/                        ← pre-aggregated for web app
└── serving.padelnomics_metrics
```
### raw/ — verbatim source reads

- Reads landing zone files directly with `read_csv(..., all_varchar=true)`
- No transformations, no business logic
- Column names match the source exactly
- Uses a macro (`@padelnomics_glob()`) so new landing files are picked up automatically (see the sketch below)
- Naming: `raw.<source>`
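A minimal sketch of a raw model following these conventions; the file name and the exact `read_csv` call are assumptions, so the real `raw.padelnomics` model may differ:

```sql
-- models/raw/raw_padelnomics.sql (illustrative sketch)
MODEL (
  name raw.padelnomics,
  kind FULL
);

-- Read every landing file verbatim; all_varchar defers typing to staging
SELECT *
FROM read_csv(@padelnomics_glob(), all_varchar = true)
```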
### staging/ — type casting and cleansing

- One model per raw model (1:1)
- Cast all columns to correct types: `TRY_CAST(report_date AS DATE)` (see the sketch below)
- Deduplicate if source produces duplicates
- Minimal renaming — only where raw names are genuinely unclear
- Naming: `staging.stg_<source>`
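A sketch of a staging model in this style; apart from `report_date`, the column names are invented for illustration and will not match the real `stg_padelnomics`:

```sql
-- models/staging/stg_padelnomics.sql (illustrative; column names assumed)
MODEL (
  name staging.stg_padelnomics,
  kind FULL
);

SELECT DISTINCT                                  -- drop exact duplicates
  TRY_CAST(report_date AS DATE)  AS report_date,
  TRY_CAST(score AS INTEGER)     AS score,
  category
FROM raw.padelnomics
```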
### foundation/ — business logic

- Dimensions (`dim_*`): slowly changing attributes, one row per entity
- Facts (`fact_*`): events and measurements, one row per event
- May join across multiple staging models from different sources
- Surrogate keys: `MD5(business_key)` for stable joins (see the sketch below)
- Naming: `foundation.dim_<entity>`, `foundation.fact_<event>`
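A sketch of a dimension built this way; using `category` as the business key is an assumption carried over from the staging sketch above:

```sql
-- models/foundation/dim_category.sql (illustrative)
MODEL (
  name foundation.dim_category,
  kind FULL
);

SELECT DISTINCT
  MD5(category) AS category_key,   -- surrogate key derived from the business key
  category      AS category_name
FROM staging.stg_padelnomics
```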
### serving/ — analytics-ready aggregates

- Pre-aggregated for specific web app query patterns (see the sketch below)
- These are the only tables the web app reads
- Queried from `analytics.py` via `fetch_analytics()`
- Named to match what the frontend expects
- Naming: `serving.<purpose>`
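A sketch of a serving aggregate in this shape, reusing the invented columns from the earlier sketches; the real `serving.padelnomics_metrics` will expose whatever the web app queries actually need:

```sql
-- models/serving/padelnomics_metrics.sql (illustrative)
MODEL (
  name serving.padelnomics_metrics,
  kind FULL
);

-- One narrow, pre-aggregated table per web app query pattern
SELECT
  d.category_name,
  COUNT(*)      AS report_count,
  AVG(s.score)  AS avg_score
FROM staging.stg_padelnomics AS s
JOIN foundation.dim_category AS d
  ON MD5(s.category) = d.category_key
GROUP BY d.category_name
```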
## Adding a new data source

- Add a landing zone directory in the extraction package
- Add a glob macro in `macros/__init__.py`:

  ```python
  import os

  from sqlmesh import macro  # both imports will already be present if the file defines other macros


  @macro()
  def my_source_glob(evaluator) -> str:
      landing_dir = evaluator.var("LANDING_DIR") or os.environ.get("LANDING_DIR", "data/landing")
      return f"'{landing_dir}/my_source/**/*.csv.gz'"
  ```

- Add a raw model: `models/raw/raw_my_source.sql`
- Add a staging model: `models/staging/stg_my_source.sql`
- Join into foundation or serving models as needed
## Model materialization
| Layer | Default kind | Rationale |
|---|---|---|
| raw | FULL | Always re-reads all files; cheap with DuckDB parallel scan |
| staging | FULL | 1:1 with raw; same cost |
| foundation | FULL | Business logic rarely changes; recompute is fast |
| serving | FULL | Small aggregates; web app needs latest at all times |
For large historical tables, switch to `kind INCREMENTAL_BY_TIME_RANGE` with a time partition column. SQLMesh handles the incremental logic automatically.
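A sketch of what that switch could look like; the model name `foundation.fact_match` and the `match_date`, `category`, and `score` columns are placeholders, not real padelnomics models:

```sql
-- Illustrative only: an incremental fact partitioned on a date column
MODEL (
  name foundation.fact_match,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column match_date
  )
);

SELECT
  match_date,
  category,
  score
FROM staging.stg_padelnomics
-- SQLMesh substitutes the interval being (back)filled for @start_ds / @end_ds
WHERE match_date BETWEEN @start_ds AND @end_ds
```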
## Environment variables

| Variable | Default | Description |
|---|---|---|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `local.duckdb` | DuckDB file (SQLMesh has exclusive write access) |
The web app reads from a separate `analytics.duckdb` file via `export_serving.py`. Never point `DUCKDB_PATH` and `SERVING_DUCKDB_PATH` to the same file: SQLMesh holds an exclusive write lock during `plan`/`run`.