
Padelnomics Transform (SQLMesh)

A 4-layer SQL transformation pipeline using SQLMesh + DuckDB. It reads from the landing zone and produces analytics-ready tables consumed by the web app.

Running

# From repo root — plan all changes (shows what will run)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_padelnomics test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_padelnomics format

4-layer architecture

landing/                    ← raw files (extraction output)
  └── padelnomics/
      └── {year}/{etag}.csv.gz

raw/                        ← reads files verbatim
  └── raw.padelnomics

staging/                    ← type casting, deduplication
  └── staging.stg_padelnomics

foundation/                 ← business logic, dimensions, facts
  └── foundation.dim_category

serving/                    ← pre-aggregated for web app
  └── serving.padelnomics_metrics

raw/ — verbatim source reads

  • Reads landing zone files directly with read_csv(..., all_varchar=true)
  • No transformations, no business logic
  • Column names match the source exactly
  • Uses a macro (@padelnomics_glob()) so new landing files are picked up automatically
  • Naming: raw.<source>
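
A minimal sketch of what such a model can look like (illustrative; the real model in models/raw/ may differ):

MODEL (
  name raw.padelnomics,
  kind FULL
);

-- The macro expands to a quoted glob over the landing zone;
-- all_varchar keeps every column as text so nothing is lost.
SELECT *
FROM read_csv(@padelnomics_glob(), all_varchar = true);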

staging/ — type casting and cleansing

  • One model per raw model (1:1)
  • Cast all columns to correct types: TRY_CAST(report_date AS DATE)
  • Deduplicate if the source produces duplicates
  • Minimal renaming — only where raw names are genuinely unclear
  • Naming: staging.stg_<source>
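
A sketch of the matching staging model (score and category are illustrative column names):

MODEL (
  name staging.stg_padelnomics,
  kind FULL
);

-- DISTINCT drops exact duplicates the source may produce;
-- TRY_CAST yields NULL instead of failing on bad values.
SELECT DISTINCT
  TRY_CAST(report_date AS DATE) AS report_date,
  TRY_CAST(score AS INTEGER) AS score,
  category
FROM raw.padelnomics;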

foundation/ — business logic

  • Dimensions (dim_*): slowly changing attributes, one row per entity
  • Facts (fact_*): events and measurements, one row per event
  • May join across multiple staging models from different sources
  • Surrogate keys: MD5(business_key) for stable joins
  • Naming: foundation.dim_<entity>, foundation.fact_<event>
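
A dimension sketch using the MD5 surrogate-key convention (category as the business key is illustrative):

MODEL (
  name foundation.dim_category,
  kind FULL
);

SELECT DISTINCT
  MD5(category) AS category_key,  -- surrogate key derived from the business key
  category AS category_name
FROM staging.stg_padelnomics;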

serving/ — analytics-ready aggregates

  • Pre-aggregated for specific web app query patterns
  • These are the only tables the web app reads
  • Queried from analytics.py via fetch_analytics()
  • Named to match what the frontend expects
  • Naming: serving.<purpose>
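
A serving sketch pre-aggregating to the shape the web app queries (the measures are illustrative):

MODEL (
  name serving.padelnomics_metrics,
  kind FULL
);

SELECT
  report_date,
  category,
  COUNT(*) AS n_rows,      -- illustrative measure
  AVG(score) AS avg_score  -- illustrative measure
FROM staging.stg_padelnomics
GROUP BY report_date, category;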

Adding a new data source

  1. Add a landing zone directory in the extraction package
  2. Add a glob macro in macros/__init__.py:
    import os

    from sqlmesh import macro

    @macro()
    def my_source_glob(evaluator) -> str:
        # Prefer the SQLMesh variable, fall back to the environment, then the local default
        landing_dir = evaluator.var("LANDING_DIR") or os.environ.get("LANDING_DIR", "data/landing")
        return f"'{landing_dir}/my_source/**/*.csv.gz'"
  3. Add a raw model: models/raw/raw_my_source.sql
  4. Add a staging model: models/staging/stg_my_source.sql
  5. Join into foundation or serving models as needed

Model materialization

Layer        Default kind   Rationale
raw          FULL           Always re-reads all files; cheap with DuckDB parallel scan
staging      FULL           1:1 with raw; same cost
foundation   FULL           Business logic rarely changes; recompute is fast
serving      FULL           Small aggregates; web app needs latest at all times

For large historical tables, switch to kind INCREMENTAL_BY_TIME_RANGE with a time partition column. SQLMesh handles the incremental logic automatically.
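
A hypothetical incremental model (the name and time column are illustrative):

MODEL (
  name foundation.fact_match,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column report_date
  )
);

SELECT
  report_date,
  category,
  score
FROM staging.stg_padelnomics
-- SQLMesh substitutes the interval being (re)computed for these variables
WHERE report_date BETWEEN @start_date AND @end_date;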

Environment variables

Variable      Default        Description
LANDING_DIR   data/landing   Root of the landing zone
DUCKDB_PATH   local.duckdb   DuckDB file SQLMesh writes to (exclusive write access)

The web app reads from a separate analytics.duckdb file via export_serving.py. Never point DUCKDB_PATH and SERVING_DUCKDB_PATH to the same file — SQLMesh holds an exclusive write lock during plan/run.
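
One way such an atomic swap can be done in DuckDB SQL (a sketch of the general pattern, not necessarily what export_serving.py actually does): write the serving tables into a temporary file, then rename it over analytics.duckdb.

-- From a connection to the SQLMesh database:
ATTACH 'analytics.duckdb.tmp' AS export_db;
CREATE TABLE export_db.padelnomics_metrics AS
SELECT * FROM serving.padelnomics_metrics;
DETACH export_db;
-- Then atomically replace the live file, e.g. with os.replace()
-- ("analytics.duckdb.tmp" -> "analytics.duckdb"), so readers never
-- see a half-written database.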