Three deviations from the quart_saas_boilerplate methodology corrected:
1. Fix dim_cities LIKE join (data quality bug)
- Old: FROM eurostat_cities LEFT JOIN venue_counts LIKE '%country_code%'
→ cartesian product (2.6M rows vs ~5500 expected)
- New: FROM venue_cities (dim_venues) as the primary table; Eurostat is used for
  enrichment only. Grain: (country_code, city_slug).
- Also applies LOWER() before REGEXP_REPLACE so uppercase city
  names aren't stripped to '-'
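A minimal sketch of the corrected slug derivation (table and column names here are illustrative, not the exact model):

```sql
-- Lowercase first, then slugify; otherwise an all-caps name like 'MUNICH'
-- has every character stripped by the [^a-z0-9]+ pattern and collapses to '-'.
-- (venue_cities / city_name are placeholder names for this sketch.)
SELECT
  country_code,
  REGEXP_REPLACE(LOWER(city_name), '[^a-z0-9]+', '-', 'g') AS city_slug
FROM venue_cities
QUALIFY ROW_NUMBER() OVER (PARTITION BY country_code, city_slug ORDER BY city_name) = 1
```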
2. Rename fct_venue_capacity → dim_venue_capacity
- Static venue attributes with no time key are a dimension, not a fact
- No SQL logic changes; update fct_daily_availability reference
3. Add fct_availability_slot at event grain
- New: grain (snapshot_date, tenant_id, resource_id, slot_start_time)
- Recheck dedup logic moves here from fct_daily_availability
- fct_daily_availability now reads fct_availability_slot (cleaner DAG)
Downstream fixes:
- city_market_profile, planner_defaults grain → (country_code, city_slug)
- pseo_city_costs_de, pseo_city_pricing add city_key composite natural key
(country_slug || '-' || city_slug) to avoid URL collisions across countries
- planner_defaults join in pseo_city_costs_de uses both country_code + city_slug
- Templates updated: natural_key city_slug → city_key
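As a sketch, the composite natural key described above is just a concatenation (expression only; the surrounding model is not shown):

```sql
-- Two countries can share a city_slug; prefixing the country slug
-- keeps pSEO URLs unique across countries.
country_slug || '-' || city_slug AS city_key
```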
Added transform/sqlmesh_padelnomics/CLAUDE.md documenting data modeling rules,
conformed dimension map, and source integration architecture.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# CLAUDE.md — padelnomics SQLMesh transform
Data engineering guidance for working in this directory. Read the data-engineer skill
(/data-engineer) before making modeling decisions.
## 3-layer architecture rules

### staging/ — read + cast + dedup only

- Reads landing zone files directly: `read_json(@LANDING_DIR || '...', ...)` or `read_csv(...)`
- Casts every column to the correct type here: `TRY_CAST(... AS DOUBLE)`, `TRY_CAST(... AS DATE)`
- Deduplicates on the source's natural key if the source can produce duplicates
- No business logic. No joins across sources. No derived metrics.
- Naming: `staging.stg_<source_dataset>`
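A staging model following these rules might look like the sketch below (the path, columns, and tie-break ordering are assumptions for illustration, not the real model):

```sql
MODEL (
  name staging.stg_playtomic_availability,
  kind FULL,
  grain (tenant_id, resource_id, slot_start_time)
);

-- Read + cast + dedup only: no joins, no business logic.
SELECT
  tenant_id,
  resource_id,
  TRY_CAST(slot_start_time AS TIMESTAMP) AS slot_start_time,
  TRY_CAST(price AS DOUBLE) AS price
FROM read_json(@LANDING_DIR || '/playtomic/availability/*.json', auto_detect = true)
QUALIFY ROW_NUMBER() OVER (
  -- dedup on the source's natural key; tie-break is arbitrary here
  PARTITION BY tenant_id, resource_id, slot_start_time
  ORDER BY tenant_id
) = 1
```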
### foundation/ — business logic, conformed dimensions and facts

- Dimensions (`dim_*`): one row per entity (venue, city, country). Slowly changing or static.
  - Conformed = shared across fact tables. `dim_cities` and `dim_venues` are conformed.
  - May integrate multiple staging sources (e.g. `dim_cities` joins venues + Eurostat + income).
  - Use `QUALIFY ROW_NUMBER()` to ensure exactly one row per grain.
  - Surrogate keys (if needed): `MD5(business_key)` for stable joins.
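A hedged sketch of a conformed dimension built this way — the enrichment columns and join keys below are assumptions, not the actual `dim_cities` definition:

```sql
MODEL (
  name foundation.dim_cities,
  kind FULL,
  grain city_slug
);

SELECT
  v.city_slug,
  v.country_code,
  MD5(v.city_slug) AS city_sk,  -- optional surrogate key for stable joins
  p.population                  -- Eurostat enrichment; NULL when unmatched
FROM foundation.dim_venues AS v
LEFT JOIN staging.stg_population AS p
  ON p.city_slug = v.city_slug
-- enforce exactly one row per grain
QUALIFY ROW_NUMBER() OVER (PARTITION BY v.city_slug ORDER BY v.venue_id) = 1
```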
- Facts (`fct_*`): one row per event or measurement. Always have a time key.
  - `fct_availability_slot`: grain `(snapshot_date, tenant_id, resource_id, slot_start_time)`
  - `fct_daily_availability`: grain `(snapshot_date, tenant_id)` — aggregates fct_availability_slot
  - Facts reference conformed dimensions by their natural key (tenant_id, city_slug, etc.)
- Dimension attributes with no time key must be `dim_*`, not `fct_*`.
  - e.g. `dim_venue_capacity` — static venue capacity attributes, grain `tenant_id`
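The slot→daily relationship could be sketched as follows (the measure columns, including `is_available`, are illustrative assumptions):

```sql
MODEL (
  name foundation.fct_daily_availability,
  kind FULL,
  grain (snapshot_date, tenant_id)
);

-- Aggregate the event-grain fact; dedup already happened upstream
-- in fct_availability_slot, so this is a pure rollup.
SELECT
  snapshot_date,
  tenant_id,
  COUNT(*) AS slots_total,
  COUNT(*) FILTER (WHERE is_available) AS slots_available
FROM foundation.fct_availability_slot
GROUP BY snapshot_date, tenant_id
```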
### serving/ — pre-aggregated, web app ready

- Read by the web app via `analytics.duckdb` (exported by `export_serving.py`)
- One model per query pattern / page type
- Column names match what the frontend/template expects — no renaming at query time
- Joins across foundation models to produce wide denormalized rows
- Only tables with `serving.*` names are exported to `analytics.duckdb`
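A serving model shaped by these rules might look like this sketch (the columns and aggregates are assumptions; the real `city_market_profile` is wider):

```sql
MODEL (
  name serving.city_market_profile,
  kind FULL,
  grain city_slug
);

-- One wide, denormalized row per city; column names match the template.
SELECT
  c.city_slug,
  c.country_code,
  c.population,
  COUNT(DISTINCT v.venue_id) AS venue_count
FROM foundation.dim_cities AS c
LEFT JOIN foundation.dim_venues AS v
  ON v.city_slug = c.city_slug
GROUP BY c.city_slug, c.country_code, c.population
```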
## Grain declarations

Every model must declare its grain in the MODEL(...) block:

```sql
MODEL (
  name foundation.fct_availability_slot,
  kind FULL,
  grain (snapshot_date, tenant_id, resource_id, slot_start_time)
);
```

If a model's grain is a single column, use `grain column_name` (no parens), e.g. `grain tenant_id`.
Grain must match reality — use `QUALIFY ROW_NUMBER()` to enforce it.
## Conformed dimensions in this project

| Dimension | Grain | Used by |
|---|---|---|
| `foundation.dim_venues` | `venue_id` | dim_cities, dim_venue_capacity, fct_daily_availability (via capacity join) |
| `foundation.dim_cities` | `city_slug` | serving.city_market_profile → all pSEO serving models |
| `foundation.dim_venue_capacity` | `tenant_id` | foundation.fct_daily_availability |
## Source integration map

```
stg_playtomic_venues    ─┐
stg_playtomic_resources ─┤→ dim_venues ─┬→ dim_cities ─→ city_market_profile
stg_padel_courts        ─┘              └→ dim_venue_capacity
                                              ↓
stg_playtomic_availability ──→ fct_availability_slot ──→ fct_daily_availability
                                              ↓
                                  venue_pricing_benchmarks
                                              ↓
stg_population ──→ dim_cities ─────────────────────────────┘
stg_income     ──→ dim_cities
```
## Common pitfalls

- Don't add business logic to staging. Even a CASE statement renaming values = business logic → move it to foundation.
- Don't aggregate in foundation facts. `fct_availability_slot` is event-grain. The daily rollup lives in `fct_daily_availability`. If you need a different aggregation, add a new serving model — don't collapse the fact further.
- dim_cities population is approximate. Eurostat uses city codes (DE001C) not names. Population enrichment succeeds for ~10% of cities. `market_score` degrades gracefully (population component = 0) for unmatched cities. To improve: add a Eurostat city-code→name lookup extract.
- DuckDB lowercases column names at rest. camelCase columns like `"ratePeak"` are stored as `ratepeak`. The content engine uses a case-insensitive reverse map to match DEFAULTS keys.
- Never point DUCKDB_PATH and SERVING_DUCKDB_PATH to the same file. SQLMesh holds an exclusive write lock during plan/run; the web app needs concurrent read access.
## Running

```shell
# Preview changes (no writes)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan

# Apply to dev environment
uv run sqlmesh -p transform/sqlmesh_padelnomics plan --auto-apply

# Apply to prod virtual layer
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod --auto-apply

# Export serving tables to analytics.duckdb
DUCKDB_PATH=$(pwd)/data/lakehouse.duckdb \
SERVING_DUCKDB_PATH=$(pwd)/analytics.duckdb \
uv run python -m padelnomics.export_serving
```