padelnomics/transform/sqlmesh_padelnomics/README.md

# Padelnomics Transform (SQLMesh)

Three-layer SQL transformation pipeline built on SQLMesh and DuckDB. It reads from the landing zone and produces analytics-ready tables that the web app consumes through an atomically swapped serving DB.

## Running

```bash
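# From repo root: plan all changes (shows what will run and prompts before applying;
# targets prod unless an environment is named or a default target is configured)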
uv run sqlmesh -p transform/sqlmesh_padelnomics plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_padelnomics test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_padelnomics format

# Export serving tables to analytics.duckdb (run after SQLMesh)
DUCKDB_PATH=data/lakehouse.duckdb SERVING_DUCKDB_PATH=data/analytics.duckdb \
    uv run python -m padelnomics.export_serving
```

## 3-layer architecture

```
landing/                    ← raw files (extraction output)
  ├── overpass/*/*/courts.json.gz
  ├── eurostat/*/*/urb_cpop1.json.gz
  └── playtomic/*/*/tenants.json.gz

staging/                    ← reads landing files directly, type casting, dedup
  ├── staging.stg_padel_courts
  ├── staging.stg_playtomic_venues
  └── staging.stg_population

foundation/                 ← business logic, dimensions, facts
  ├── foundation.dim_venues            ← conformed venue dimension (Playtomic + OSM)
  ├── foundation.dim_cities            ← conformed city dimension (venue-derived + Eurostat)
  ├── foundation.dim_venue_capacity    ← static capacity attributes per venue
  ├── foundation.fct_availability_slot ← event-grain: one row per deduplicated slot
  └── foundation.fct_daily_availability← venue-day aggregate: occupancy + revenue estimates

serving/                    ← pre-aggregated for web app
  ├── serving.city_market_profile
  └── serving.planner_defaults
```

### staging/ — read landing files + type casting

- Reads landing zone JSON files directly with `read_json(..., format='auto', filename=true)`
- Uses `@LANDING_DIR` variable for file path discovery
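- Casts columns to their target types with `TRY_CAST`, e.g. `TRY_CAST(... AS DOUBLE)`, so malformed values become NULL instead of failing the model
- Deduplicates where the source produces duplicates (`ROW_NUMBER()` partitioned by the source ID)
- Validates coordinates and handles nulls and other data-quality checks inline in the model SQL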
- Naming: `staging.stg_<source>`
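
The deduplication and inline validation steps listed above can be sketched as follows. This is a minimal illustration, not the project's model code: SQLite from the Python standard library stands in for DuckDB so the example has no dependencies, and the table and column names (`raw_venues`, `venue_id`, `extracted_at`) are hypothetical. The `ROW_NUMBER()` pattern itself carries over to DuckDB unchanged.

```python
import sqlite3

# ASSUMPTION: raw_venues, venue_id and extracted_at are illustrative names only.
# SQLite stands in for DuckDB; window functions need SQLite 3.25 or newer.
DEDUP_SQL = """
SELECT venue_id, name, lat, lon
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY venue_id
               ORDER BY extracted_at DESC
           ) AS rn
    FROM raw_venues
) AS ranked
WHERE rn = 1                      -- keep the most recent row per ID
  AND lat BETWEEN -90 AND 90      -- inline coordinate validation
  AND lon BETWEEN -180 AND 180
ORDER BY venue_id
"""


def dedup_venues(rows):
    """Load rows into an in-memory table and return one valid row per venue_id."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE raw_venues "
            "(venue_id TEXT, name TEXT, lat REAL, lon REAL, extracted_at TEXT)"
        )
        con.executemany("INSERT INTO raw_venues VALUES (?, ?, ?, ?, ?)", rows)
        return con.execute(DEDUP_SQL).fetchall()
    finally:
        con.close()


rows = [
    ("v1", "Club A (old)", 52.52, 13.40, "2024-01-01"),
    ("v1", "Club A", 52.52, 13.40, "2024-02-01"),    # newer duplicate of v1, kept
    ("v2", "Club B", 999.0, 13.40, "2024-02-01"),    # invalid latitude, dropped
    ("v3", "Club C", 48.14, 11.58, "2024-02-01"),
]
result = dedup_venues(rows)
```

Ordering by a timestamp inside the window is one possible tie-break; the real staging models may rank duplicates differently.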

### foundation/ — business logic

- Dimensions (`dim_*`): slowly changing attributes, one row per entity
- Facts (`fct_*`): events and measurements, one row per event or per aggregate grain (e.g. venue-day)
- May join across multiple staging models from different sources
- Naming: `foundation.dim_<entity>`, `foundation.fct_<event>`

### serving/ — analytics-ready aggregates

- Pre-aggregated for specific web app query patterns
- These are the only tables the web app reads (via `analytics.duckdb`)
- Queried from `analytics.py` via `fetch_analytics()`
- Naming: `serving.<purpose>`

## Two-DuckDB architecture

```
data/lakehouse.duckdb       ← SQLMesh exclusive write (DUCKDB_PATH)
  ├── staging.*
  ├── foundation.*
  └── serving.*

data/analytics.duckdb       ← web app read-only (SERVING_DUCKDB_PATH)
  └── serving.*             ← atomically replaced by export_serving.py
```

SQLMesh holds an exclusive write lock on `lakehouse.duckdb` during plan/run.
The web app needs read-only access at all times. `export_serving.py` copies
`serving.*` tables to a temp file, then atomically renames it to `analytics.duckdb`.
The web app detects the inode change on next query — no restart needed.
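
The copy-then-rename step and the inode check described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `export_serving.py` or the web app: plain bytes stand in for a DuckDB file, and `publish_atomically` and `InodeWatcher` are hypothetical names.

```python
import os
import tempfile


def publish_atomically(dest_path, data):
    """Write data to a temp file next to dest_path, then rename it into place.

    ASSUMPTION: `data` (bytes) stands in for the exported serving database.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # The temp file lives in the destination directory so the rename stays
    # on one filesystem, which is what makes it atomic on POSIX systems.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


class InodeWatcher:
    """Reader side: report when the file at `path` has been replaced."""

    def __init__(self, path):
        self.path = path
        self._inode = os.stat(path).st_ino

    def changed(self):
        """Return True once after each replacement of the watched file."""
        current = os.stat(self.path).st_ino
        if current != self._inode:
            self._inode = current
            return True
        return False
```

A reader would call `changed()` before each query and reopen its connection when it returns `True`. Because the new file is created while the old one still exists, the two always have different inode numbers on POSIX filesystems.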

**Never point DUCKDB_PATH and SERVING_DUCKDB_PATH to the same file.**

## Adding a new data source

1. Add an extractor in `extract/padelnomics_extract/` (see extraction README)
2. Add a staging model: `models/staging/stg_<source>.sql` that reads landing files directly
3. Join into foundation or serving models as needed

## Model materialization

| Layer | Default kind | Rationale |
|-------|-------------|-----------|
| staging | FULL | Re-reads all landing files; cheap with DuckDB parallel scan |
| foundation | FULL | Business logic rarely changes; recompute is fast |
| serving | FULL | Small aggregates; web app needs latest at all times |

For large historical tables, switch to `kind INCREMENTAL_BY_TIME_RANGE`, which requires a `time_column` so that SQLMesh can process only the requested time interval.

## Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `data/lakehouse.duckdb` | DuckDB file (SQLMesh exclusive write access) |
| `SERVING_DUCKDB_PATH` | `data/analytics.duckdb` | Serving DB (web app reads from here) |
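
As a rough sketch of how a script might resolve these variables, the helper below falls back to the defaults in the table and rejects a configuration where both database variables point to the same file. `resolve_paths` is a hypothetical helper for illustration; whether the project's scripts apply the defaults in exactly this way is an assumption.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "LANDING_DIR": "data/landing",
    "DUCKDB_PATH": "data/lakehouse.duckdb",
    "SERVING_DUCKDB_PATH": "data/analytics.duckdb",
}


def resolve_paths(env=None):
    """Return the three paths, using the documented default when a variable is unset.

    ASSUMPTION: an empty string is treated the same as an unset variable.
    """
    env = os.environ if env is None else env
    paths = {name: env.get(name) or default for name, default in DEFAULTS.items()}
    # The pipeline DB and the serving DB must be different files.
    pipeline_db = os.path.abspath(paths["DUCKDB_PATH"])
    serving_db = os.path.abspath(paths["SERVING_DUCKDB_PATH"])
    if pipeline_db == serving_db:
        raise ValueError(
            "DUCKDB_PATH and SERVING_DUCKDB_PATH must not point to the same file"
        )
    return paths
```

Comparing absolute paths catches spellings such as `x.duckdb` and `./x.duckdb`; it does not follow symlinks, so a stricter check could use `os.path.realpath` instead.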