Three deviations from the quart_saas_boilerplate methodology corrected:
1. Fix dim_cities LIKE join (data quality bug)
- Old: FROM eurostat_cities LEFT JOIN venue_counts LIKE '%country_code%'
→ cartesian product (2.6M rows vs ~5500 expected)
- New: FROM venue_cities (dim_venues) as primary table, Eurostat for
enrichment only. grain (country_code, city_slug).
- Also fixes REGEXP_REPLACE to LOWER() before regex so uppercase city
names aren't stripped to '-'
2. Rename fct_venue_capacity → dim_venue_capacity
- Static venue attributes with no time key are a dimension, not a fact
- No SQL logic changes; update fct_daily_availability reference
3. Add fct_availability_slot at event grain
- New: grain (snapshot_date, tenant_id, resource_id, slot_start_time)
- Recheck dedup logic moves here from fct_daily_availability
- fct_daily_availability now reads fct_availability_slot (cleaner DAG)
Downstream fixes:
- city_market_profile, planner_defaults grain → (country_code, city_slug)
- pseo_city_costs_de, pseo_city_pricing add city_key composite natural key
(country_slug || '-' || city_slug) to avoid URL collisions across countries
- planner_defaults join in pseo_city_costs_de uses both country_code + city_slug
- Templates updated: natural_key city_slug → city_key
Added transform/sqlmesh_padelnomics/CLAUDE.md documenting data modeling rules,
conformed dimension map, and source integration architecture.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
109 lines
5.0 KiB
Markdown
109 lines
5.0 KiB
Markdown
# CLAUDE.md — padelnomics SQLMesh transform
|
|
|
|
Data engineering guidance for working in this directory. Read the `data-engineer` skill
|
|
(`/data-engineer`) before making modeling decisions.
|
|
|
|
## 3-layer architecture rules
|
|
|
|
### staging/ — read + cast + dedup only
|
|
|
|
- Reads landing zone files directly: `read_json(@LANDING_DIR || '...', ...)` or `read_csv(...)`
|
|
- Casts every column to the correct type here: `TRY_CAST(... AS DOUBLE)`, `TRY_CAST(... AS DATE)`
|
|
- Deduplicates on the source's natural key if the source can produce duplicates
|
|
- **No business logic.** No joins across sources. No derived metrics.
|
|
- Naming: `staging.stg_<source_dataset>`
|
|
|
|
### foundation/ — business logic, conformed dimensions and facts
|
|
|
|
- **Dimensions (`dim_*`)**: one row per entity (venue, city, country). Slowly changing or static.
|
|
- Conformed = shared across fact tables. `dim_cities` and `dim_venues` are conformed.
|
|
- May integrate multiple staging sources (e.g. `dim_cities` joins venues + Eurostat + income).
|
|
- Use `QUALIFY ROW_NUMBER()` to ensure exactly one row per grain.
|
|
- Surrogate keys (if needed): `MD5(business_key)` for stable joins.
|
|
- **Facts (`fact_*`)**: one row per **event or measurement**. Always have a time key.
|
|
- `fct_availability_slot`: grain `(snapshot_date, tenant_id, resource_id, slot_start_time)`
|
|
- `fct_daily_availability`: grain `(snapshot_date, tenant_id)` — aggregates fct_availability_slot
|
|
- Facts reference conformed dimensions by their natural key (tenant_id, city_slug, etc.)
|
|
- **Dimension attributes** with no time key must be `dim_*`, not `fct_*`.
|
|
- e.g. `dim_venue_capacity` — static venue capacity attributes, grain `tenant_id`
|
|
|
|
### serving/ — pre-aggregated, web app ready
|
|
|
|
- Read by the web app via `analytics.duckdb` (exported by `export_serving.py`)
|
|
- One model per query pattern / page type
|
|
- Column names match what the frontend/template expects — no renaming at query time
|
|
- Joins across foundation models to produce wide denormalized rows
|
|
- Only tables with `serving.*` names are exported to `analytics.duckdb`
|
|
|
|
## Grain declarations
|
|
|
|
Every model must declare its grain in the `MODEL(...)` block:
|
|
|
|
```sql
|
|
MODEL (
|
|
name foundation.fct_availability_slot,
|
|
kind FULL,
|
|
grain (snapshot_date, tenant_id, resource_id, slot_start_time)
|
|
);
|
|
```
|
|
|
|
If a model's grain is a single column, use `grain column_name` (no parens).
|
|
Grain must match reality — use `QUALIFY ROW_NUMBER()` to enforce it.
|
|
|
|
## Conformed dimensions in this project
|
|
|
|
| Dimension | Grain | Used by |
|
|
|-----------|-------|---------|
|
|
| `foundation.dim_venues` | `venue_id` | `dim_cities`, `dim_venue_capacity`, `fct_daily_availability` (via capacity join) |
|
|
| `foundation.dim_cities` | `city_slug` | `serving.city_market_profile` → all pSEO serving models |
|
|
| `foundation.dim_venue_capacity` | `tenant_id` | `foundation.fct_daily_availability` |
|
|
|
|
## Source integration map
|
|
|
|
```
|
|
stg_playtomic_venues ─┐
|
|
stg_playtomic_resources─┤→ dim_venues ─┬→ dim_cities ─→ city_market_profile
|
|
stg_padel_courts ─┘ └→ dim_venue_capacity
|
|
↓
|
|
stg_playtomic_availability ──→ fct_availability_slot ──→ fct_daily_availability
|
|
↓
|
|
venue_pricing_benchmarks
|
|
↓
|
|
stg_population ──→ dim_cities ─────────────────────────────┘
|
|
stg_income ──→ dim_cities
|
|
```
|
|
|
|
## Common pitfalls
|
|
|
|
- **Don't add business logic to staging.** Even a CASE statement renaming values = business
|
|
logic → move it to foundation.
|
|
- **Don't aggregate in foundation facts.** `fct_availability_slot` is event-grain. The daily
|
|
rollup lives in `fct_daily_availability`. If you need a different aggregation, add a new
|
|
serving model — don't collapse the fact further.
|
|
- **dim_cities population is approximate.** Eurostat uses city codes (DE001C) not names.
|
|
Population enrichment succeeds for ~10% of cities. `market_score` degrades gracefully
|
|
(population component = 0) for unmatched cities. To improve: add a Eurostat city-code→name
|
|
lookup extract.
|
|
- **DuckDB lowercases column names at rest.** camelCase columns like `"ratePeak"` are stored
|
|
as `ratepeak`. The content engine uses a case-insensitive reverse map to match DEFAULTS keys.
|
|
- **Never point DUCKDB_PATH and SERVING_DUCKDB_PATH to the same file.** SQLMesh holds an
|
|
exclusive write lock during plan/run; the web app needs concurrent read access.
|
|
|
|
## Running
|
|
|
|
```bash
|
|
# Preview changes (no writes)
|
|
uv run sqlmesh -p transform/sqlmesh_padelnomics plan
|
|
|
|
# Apply to dev environment
|
|
uv run sqlmesh -p transform/sqlmesh_padelnomics plan --auto-apply
|
|
|
|
# Apply to prod virtual layer
|
|
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod --auto-apply
|
|
|
|
# Export serving tables to analytics.duckdb
|
|
DUCKDB_PATH=$(pwd)/data/lakehouse.duckdb \
|
|
SERVING_DUCKDB_PATH=$(pwd)/analytics.duckdb \
|
|
uv run python -m padelnomics.export_serving
|
|
```
|