Three deviations from the quart_saas_boilerplate methodology corrected:
1. Fix dim_cities LIKE join (data quality bug)
- Old: FROM eurostat_cities LEFT JOIN venue_counts LIKE '%country_code%'
→ cartesian product (2.6M rows vs ~5500 expected)
- New: FROM venue_cities (dim_venues) as primary table, Eurostat for
enrichment only. Grain: (country_code, city_slug).
- Also applies LOWER() before the REGEXP_REPLACE slug regex so uppercase
city names aren't stripped to '-'
2. Rename fct_venue_capacity → dim_venue_capacity
- Static venue attributes with no time key are a dimension, not a fact
- No SQL logic changes; update fct_daily_availability reference
3. Add fct_availability_slot at event grain
- New: grain (snapshot_date, tenant_id, resource_id, slot_start_time)
- Recheck dedup logic moves here from fct_daily_availability
- fct_daily_availability now reads fct_availability_slot (cleaner DAG)
Downstream fixes:
- city_market_profile, planner_defaults grain → (country_code, city_slug)
- pseo_city_costs_de, pseo_city_pricing add city_key composite natural key
(country_slug || '-' || city_slug) to avoid URL collisions across countries
- planner_defaults join in pseo_city_costs_de uses both country_code + city_slug
- Templates updated: natural_key city_slug → city_key
Added transform/sqlmesh_padelnomics/CLAUDE.md documenting data modeling rules,
conformed dimension map, and source integration architecture.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Padelnomics Transform (SQLMesh)

3-layer SQL transformation pipeline using SQLMesh + DuckDB. Reads from the landing zone, produces analytics-ready tables consumed by the web app via an atomically-swapped serving DB.

## Running

```bash
# From repo root — plan all changes (shows what will run)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_padelnomics test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_padelnomics format

# Export serving tables to analytics.duckdb (run after SQLMesh)
DUCKDB_PATH=data/lakehouse.duckdb SERVING_DUCKDB_PATH=data/analytics.duckdb \
uv run python -m padelnomics.export_serving
```

## 3-layer architecture

```
landing/      ← raw files (extraction output)
├── overpass/*/*/courts.json.gz
├── eurostat/*/*/urb_cpop1.json.gz
└── playtomic/*/*/tenants.json.gz

staging/      ← reads landing files directly, type casting, dedup
├── staging.stg_padel_courts
├── staging.stg_playtomic_venues
└── staging.stg_population

foundation/   ← business logic, dimensions, facts
├── foundation.dim_venues              ← conformed venue dimension (Playtomic + OSM)
├── foundation.dim_cities              ← conformed city dimension (venue-derived + Eurostat)
├── foundation.dim_venue_capacity      ← static capacity attributes per venue
├── foundation.fct_availability_slot   ← event grain: one row per deduplicated slot
└── foundation.fct_daily_availability  ← venue-day aggregate: occupancy + revenue estimates

serving/      ← pre-aggregated for web app
├── serving.city_market_profile
└── serving.planner_defaults
```

### staging/ — read landing files + type casting

- Reads landing zone JSON files directly with `read_json(..., format='auto', filename=true)`
- Uses `@LANDING_DIR` variable for file path discovery
- Casts all columns to correct types: `TRY_CAST(... AS DOUBLE)`
- Deduplicates where a source produces duplicates (ROW_NUMBER partitioned on ID)
- Validates coordinates, nulls, and data quality inline
- Naming: `staging.stg_<source>`
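
Put together, a staging model following these conventions might look like the sketch below. This is illustrative only: the column and ID names are assumptions, not the real `stg_playtomic_venues` definition.

```sql
MODEL (
  name staging.stg_playtomic_venues,
  kind FULL
);

SELECT
  tenant_id,
  TRY_CAST(lat AS DOUBLE) AS latitude,   -- cast raw JSON values to typed columns
  TRY_CAST(lon AS DOUBLE) AS longitude,
  filename AS source_file                -- provenance column from filename=true
FROM read_json(@LANDING_DIR || '/playtomic/*/*/tenants.json.gz',
               format = 'auto', filename = true)
-- keep one row per ID where the source emits duplicates
QUALIFY ROW_NUMBER() OVER (PARTITION BY tenant_id ORDER BY filename DESC) = 1
```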

### foundation/ — business logic

- Dimensions (`dim_*`): slowly changing attributes, one row per entity
- Facts (`fct_*`): events and measurements, one row per event
- May join across multiple staging models from different sources
- Naming: `foundation.dim_<entity>`, `foundation.fct_<event>`
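
For instance, the venue-day fact could aggregate the event-grain slot fact roughly as follows. A sketch only: the measure columns and the `is_available` flag are assumptions about the slot schema.

```sql
MODEL (
  name foundation.fct_daily_availability,
  kind FULL
);

SELECT
  snapshot_date,
  tenant_id,
  COUNT(*) AS total_slots,
  SUM(CASE WHEN is_available THEN 1 ELSE 0 END) AS available_slots
FROM foundation.fct_availability_slot
GROUP BY snapshot_date, tenant_id
```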

### serving/ — analytics-ready aggregates

- Pre-aggregated for specific web app query patterns
- These are the only tables the web app reads (via `analytics.duckdb`)
- Queried from `analytics.py` via `fetch_analytics()`
- Naming: `serving.<purpose>`
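
A serving aggregate at the `(country_code, city_slug)` grain might be sketched like this. The `venue_id` column and the join keys are assumptions about the dimension schemas, not the real `city_market_profile` definition.

```sql
MODEL (
  name serving.city_market_profile,
  kind FULL
);

SELECT
  c.country_code,
  c.city_slug,
  COUNT(DISTINCT v.venue_id) AS venue_count
FROM foundation.dim_cities AS c
LEFT JOIN foundation.dim_venues AS v
  ON v.country_code = c.country_code
 AND v.city_slug = c.city_slug
GROUP BY c.country_code, c.city_slug
```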

## Two-DuckDB architecture

```
data/lakehouse.duckdb    ← SQLMesh exclusive write (DUCKDB_PATH)
├── staging.*
├── foundation.*
└── serving.*

data/analytics.duckdb    ← web app read-only (SERVING_DUCKDB_PATH)
└── serving.*            ← atomically replaced by export_serving.py
```

SQLMesh holds an exclusive write lock on `lakehouse.duckdb` during plan/run.
The web app needs read-only access at all times. `export_serving.py` copies
`serving.*` tables to a temp file, then atomically renames it to `analytics.duckdb`.
The web app detects the inode change on next query — no restart needed.
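
The write-to-temp-then-rename pattern can be illustrated with stdlib-only Python. This is a simplified sketch of the pattern, not the real `export_serving.py` (which copies `serving.*` tables with DuckDB rather than writing raw bytes):

```python
import os
import tempfile

def atomic_publish(data: bytes, dest: str) -> None:
    """Write to a temp file in the destination's directory, then
    atomically rename it over the destination. Readers that already
    opened the old file keep reading the old inode; new opens see
    the fresh file."""
    dest_dir = os.path.dirname(dest) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # bytes on disk before the rename
        os.replace(tmp_path, dest)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The temp file must live on the same filesystem as the destination (here, the same directory); otherwise `os.replace` fails with a cross-device error instead of renaming atomically.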

**Never point `DUCKDB_PATH` and `SERVING_DUCKDB_PATH` to the same file.**

## Adding a new data source

1. Add an extractor in `extract/padelnomics_extract/` (see extraction README)
2. Add a staging model: `models/staging/stg_<source>.sql` that reads landing files directly
3. Join into foundation or serving models as needed

## Model materialization

| Layer | Default kind | Rationale |
|-------|--------------|-----------|
| staging | FULL | Re-reads all landing files; cheap with DuckDB parallel scan |
| foundation | FULL | Business logic rarely changes; recompute is fast |
| serving | FULL | Small aggregates; web app needs latest at all times |

For large historical tables, switch to `kind INCREMENTAL_BY_TIME_RANGE` with a time partition column.
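
An incremental model might look like the following sketch. The source table name is a placeholder, and `@start_ds` / `@end_ds` are SQLMesh's built-in time-range macro variables.

```sql
MODEL (
  name foundation.fct_availability_slot,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column snapshot_date
  )
);

SELECT
  snapshot_date,
  tenant_id,
  resource_id,
  slot_start_time
FROM staging.stg_availability  -- placeholder source name
WHERE snapshot_date BETWEEN @start_ds AND @end_ds
```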

## Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `data/lakehouse.duckdb` | DuckDB file (SQLMesh exclusive write access) |
| `SERVING_DUCKDB_PATH` | `data/analytics.duckdb` | Serving DB (web app reads from here) |