# Padelnomics Transform (SQLMesh)

4-layer SQL transformation pipeline using SQLMesh + DuckDB. Reads from the landing zone and produces analytics-ready tables consumed by the web app.

## Running

```bash
# From repo root — plan all changes (shows what will run)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_padelnomics test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_padelnomics format
```

## 4-layer architecture

```
landing/     <- raw files (extraction output)
  +-- padelnomics/
      +-- {year}/{etag}.csv.gz

raw/         <- reads files verbatim
  +-- raw.padelnomics

staging/     <- type casting, deduplication
  +-- staging.stg_padelnomics

foundation/  <- business logic, dimensions, facts
  +-- foundation.dim_category

serving/     <- pre-aggregated for web app
  +-- serving.padelnomics_metrics
```

### raw/ — verbatim source reads

- Reads landing zone files directly with `read_csv(..., all_varchar=true)`
- No transformations, no business logic
- Column names match the source exactly
- Uses a macro (`@padelnomics_glob()`) so new landing files are picked up automatically
- Naming: `raw.<source>`
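
A raw model following these rules might look roughly like the sketch below (the file name and the bare `SELECT *` are assumptions; the macro, `all_varchar=true`, and the `raw.padelnomics` name come from this README):

```sql
-- models/raw/raw_padelnomics.sql (illustrative sketch, not the actual model)
MODEL (
  name raw.padelnomics,
  kind FULL
);

-- Read every landing file verbatim; all_varchar keeps every column as text
-- so nothing is parsed before the staging layer.
SELECT *
FROM read_csv(@padelnomics_glob(), all_varchar = true);
```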

### staging/ — type casting and cleansing

- One model per raw model (1:1)
- Cast all columns to correct types: `TRY_CAST(report_date AS DATE)`
- Deduplicate if source produces duplicates
- Minimal renaming — only where raw names are genuinely unclear
- Naming: `staging.stg_<source>`
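
A minimal staging model under these conventions could look like the following sketch (the `bookings` and `club_name` columns are invented, and `SELECT DISTINCT` is just one way to deduplicate):

```sql
-- models/staging/stg_padelnomics.sql (illustrative sketch; column names invented)
MODEL (
  name staging.stg_padelnomics,
  kind FULL
);

SELECT DISTINCT                                   -- drop exact duplicate rows
  TRY_CAST(report_date AS DATE)    AS report_date,
  TRY_CAST(bookings    AS INTEGER) AS bookings,
  club_name                                       -- raw name kept, still text
FROM raw.padelnomics;
```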

### foundation/ — business logic

- Dimensions (`dim_*`): slowly changing attributes, one row per entity
- Facts (`fact_*`): events and measurements, one row per event
- May join across multiple staging models from different sources
- Surrogate keys: `MD5(business_key)` for stable joins
- Naming: `foundation.dim_<entity>`, `foundation.fact_<event>`
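
As a sketch of the surrogate-key convention, a hypothetical dimension (entity and column names invented, not the real `dim_category`) might be built like this:

```sql
-- models/foundation/dim_club.sql (hypothetical example)
MODEL (
  name foundation.dim_club,
  kind FULL
);

SELECT
  MD5(club_name)   AS club_key,      -- surrogate key derived from the business key
  club_name,
  MIN(report_date) AS first_seen_on  -- one attribute per entity, derived from staging
FROM staging.stg_padelnomics
GROUP BY club_name;
```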

### serving/ — analytics-ready aggregates

- Pre-aggregated for specific web app query patterns
- These are the only tables the web app reads
- Queried from `analytics.py` via `fetch_analytics()`
- Named to match what the frontend expects
- Naming: `serving.<purpose>`
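
A serving model is typically just a pre-aggregation over upstream models, shaped for one frontend view. A sketch (the metric columns are invented; the real `serving.padelnomics_metrics` follows whatever the frontend expects):

```sql
-- models/serving/padelnomics_metrics.sql (illustrative sketch; metrics invented)
MODEL (
  name serving.padelnomics_metrics,
  kind FULL
);

SELECT
  report_date,
  COUNT(*)      AS row_count,
  SUM(bookings) AS total_bookings
FROM staging.stg_padelnomics
GROUP BY report_date
ORDER BY report_date;
```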

## Adding a new data source

1. Add a landing zone directory in the extraction package
2. Add a glob macro in `macros/__init__.py`:

   ```python
   import os

   from sqlmesh import macro


   @macro()
   def my_source_glob(evaluator) -> str:
       landing_dir = evaluator.var("LANDING_DIR") or os.environ.get("LANDING_DIR", "data/landing")
       return f"'{landing_dir}/my_source/**/*.csv.gz'"
   ```

3. Add a raw model: `models/raw/raw_my_source.sql` (a sketch follows this list)
4. Add a staging model: `models/staging/stg_my_source.sql`
5. Join into foundation or serving models as needed
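
For step 3, the raw model for the new source mirrors the existing one and simply consumes the new macro, roughly:

```sql
-- models/raw/raw_my_source.sql (illustrative sketch for step 3)
MODEL (
  name raw.my_source,
  kind FULL
);

SELECT *
FROM read_csv(@my_source_glob(), all_varchar = true);
```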

## Model materialization

| Layer | Default kind | Rationale |
|-------|-------------|-----------|
| raw | FULL | Always re-reads all files; cheap with DuckDB parallel scan |
| staging | FULL | 1:1 with raw; same cost |
| foundation | FULL | Business logic rarely changes; recompute is fast |
| serving | FULL | Small aggregates; web app needs latest at all times |

For large historical tables, switch to `kind INCREMENTAL_BY_TIME_RANGE` with a time partition column. SQLMesh handles the incremental logic automatically.
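
An incremental model declaration could look roughly like the sketch below (model, column, and source names are hypothetical; `@start_date` / `@end_date` are SQLMesh's built-in interval macros):

```sql
-- Illustrative sketch of an incremental model (names are hypothetical)
MODEL (
  name foundation.fact_booking,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column report_date
  )
);

SELECT
  report_date,
  club_name,
  bookings
FROM staging.stg_padelnomics
-- SQLMesh substitutes the interval being (re)computed
WHERE report_date BETWEEN @start_date AND @end_date;
```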

## Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `local.duckdb` | DuckDB file (SQLMesh exclusive write access) |

The web app reads from a **separate** `analytics.duckdb` file via `export_serving.py`. Never point `DUCKDB_PATH` and `SERVING_DUCKDB_PATH` to the same file — SQLMesh holds an exclusive write lock during plan/run.