# CLAUDE.md — padelnomics SQLMesh transform Data engineering guidance for working in this directory. Read the `data-engineer` skill (`/data-engineer`) before making modeling decisions. ## 3-layer architecture rules ### staging/ — read + cast + dedup only - Reads landing zone files directly: `read_json(@LANDING_DIR || '...', ...)` or `read_csv(...)` - Casts every column to the correct type here: `TRY_CAST(... AS DOUBLE)`, `TRY_CAST(... AS DATE)` - Deduplicates on the source's natural key if the source can produce duplicates - **No business logic.** No joins across sources. No derived metrics. - Naming: `staging.stg_` ### foundation/ — business logic, conformed dimensions and facts - **Dimensions (`dim_*`)**: one row per entity (venue, city, country). Slowly changing or static. - Conformed = shared across fact tables. `dim_cities` and `dim_venues` are conformed. - May integrate multiple staging sources (e.g. `dim_cities` joins venues + Eurostat + income). - Use `QUALIFY ROW_NUMBER()` to ensure exactly one row per grain. - Surrogate keys (if needed): `MD5(business_key)` for stable joins. - **Facts (`fact_*`)**: one row per **event or measurement**. Always have a time key. - `fct_availability_slot`: grain `(snapshot_date, tenant_id, resource_id, slot_start_time)` - `fct_daily_availability`: grain `(snapshot_date, tenant_id)` — aggregates fct_availability_slot - Facts reference conformed dimensions by their natural key (tenant_id, city_slug, etc.) - **Dimension attributes** with no time key must be `dim_*`, not `fct_*`. - e.g. `dim_venue_capacity` — static venue capacity attributes, grain `tenant_id` ### serving/ — pre-aggregated, web app ready - Read by the web app via `analytics.duckdb` (exported by `export_serving.py`) - One model per query pattern / page type - Column names match what the frontend/template expects — no renaming at query time - Joins across foundation models to produce wide denormalized rows - Only tables with `serving.*` names are exported to `analytics.duckdb` ## Grain declarations Every model must declare its grain in the `MODEL(...)` block: ```sql MODEL ( name foundation.fct_availability_slot, kind FULL, grain (snapshot_date, tenant_id, resource_id, slot_start_time) ); ``` If a model's grain is a single column, use `grain column_name` (no parens). Grain must match reality — use `QUALIFY ROW_NUMBER()` to enforce it. ## Conformed dimensions in this project | Dimension | Grain | Used by | |-----------|-------|---------| | `foundation.dim_venues` | `venue_id` | `dim_cities`, `dim_venue_capacity`, `fct_daily_availability` (via capacity join) | | `foundation.dim_cities` | `(country_code, city_slug)` | `serving.city_market_profile` → all pSEO serving models | | `foundation.dim_locations` | `(country_code, geoname_id)` | `serving.location_opportunity_profile` — all GeoNames locations (pop ≥1K), incl. zero-court locations | | `foundation.dim_venue_capacity` | `tenant_id` | `foundation.fct_daily_availability` | ## Source integration map ``` stg_playtomic_venues ─┐ stg_playtomic_resources─┤→ dim_venues ─┬→ dim_cities ──────────────→ city_market_profile stg_padel_courts ─┘ └→ dim_venue_capacity (Marktreife-Score) ↓ stg_playtomic_availability ──→ fct_availability_slot ──→ fct_daily_availability ↓ venue_pricing_benchmarks ↓ stg_population ──→ dim_cities ─────────────────────────────┘ stg_income ──→ dim_cities stg_population_geonames ─┐ stg_padel_courts ─┤→ dim_locations ──→ location_opportunity_profile stg_tennis_courts ─┤ (Marktpotenzial-Score) stg_income ─┘ ``` ## Distance calculation pattern (ST_Distance_Sphere) Use a bounding-box pre-filter before calling `ST_Distance_Sphere` to avoid full cross-joins: ```sql -- Nearest padel court (km) per location SELECT l.geoname_id, MIN(ST_Distance_Sphere( ST_Point(l.lon, l.lat), ST_Point(p.lon, p.lat) ) / 1000.0) AS nearest_km FROM locations l JOIN padel_courts p ON ABS(l.lat - p.lat) < 0.5 -- ~55km pre-filter AND ABS(l.lon - p.lon) < 0.5 GROUP BY l.geoname_id ``` Requires `extensions: [spatial]` in `config.yaml` (already set). DuckDB spatial must `INSTALL spatial; LOAD spatial;` before `ST_Distance_Sphere` / `ST_Point` are available. ## Common pitfalls - **Don't add business logic to staging.** Even a CASE statement renaming values = business logic → move it to foundation. - **Don't aggregate in foundation facts.** `fct_availability_slot` is event-grain. The daily rollup lives in `fct_daily_availability`. If you need a different aggregation, add a new serving model — don't collapse the fact further. - **dim_cities population is approximate.** Eurostat uses city codes (DE001C) not names. Population enrichment succeeds for ~10% of cities. `market_score` degrades gracefully (population component = 0) for unmatched cities. To improve: add a Eurostat city-code→name lookup extract. - **DuckDB lowercases column names at rest.** camelCase columns like `"ratePeak"` are stored as `ratepeak`. The content engine uses a case-insensitive reverse map to match DEFAULTS keys. - **Never point DUCKDB_PATH and SERVING_DUCKDB_PATH to the same file.** SQLMesh holds an exclusive write lock during plan/run; the web app needs concurrent read access. ## Running ```bash # Preview changes (no writes) uv run sqlmesh -p transform/sqlmesh_padelnomics plan # Apply to dev environment uv run sqlmesh -p transform/sqlmesh_padelnomics plan --auto-apply # Apply to prod virtual layer uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod --auto-apply # Export serving tables to analytics.duckdb DUCKDB_PATH=$(pwd)/data/lakehouse.duckdb \ SERVING_DUCKDB_PATH=$(pwd)/analytics.duckdb \ uv run python -m padelnomics.export_serving ```