Deeman ebba46f700 refactor: align transform layer with template methodology
Three deviations from the quart_saas_boilerplate methodology corrected:

1. Fix dim_cities LIKE join (data quality bug)
   - Old: FROM eurostat_cities LEFT JOIN venue_counts ON a LIKE '%country_code%'
     match → cartesian product (2.6M rows vs ~5,500 expected)
   - New: FROM venue_cities (derived from dim_venues) as the primary table,
     with Eurostat joined for enrichment only. Grain: (country_code, city_slug).
   - Also applies LOWER() before the REGEXP_REPLACE slugification so uppercase
     city names aren't stripped to '-'

2. Rename fct_venue_capacity → dim_venue_capacity
   - Static venue attributes with no time key are a dimension, not a fact
   - No SQL logic changes; update fct_daily_availability reference

3. Add fct_availability_slot at event grain
   - New: grain (snapshot_date, tenant_id, resource_id, slot_start_time)
   - Recheck dedup logic moves here from fct_daily_availability
   - fct_daily_availability now reads fct_availability_slot (cleaner DAG)

Downstream fixes:
- city_market_profile, planner_defaults grain → (country_code, city_slug)
- pseo_city_costs_de, pseo_city_pricing add city_key composite natural key
  (country_slug || '-' || city_slug) to avoid URL collisions across countries
  (sketch below)
- planner_defaults join in pseo_city_costs_de uses both country_code + city_slug
- Templates updated: natural_key city_slug → city_key
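
For illustration, the composite key is a plain concatenation. A self-contained
sketch with made-up slugs:

-- two countries sharing a city slug get distinct city_keys (values illustrative)
SELECT country_slug || '-' || city_slug AS city_key
FROM (VALUES ('es', 'valencia'), ('ve', 'valencia')) AS t(country_slug, city_slug);
-- → 'es-valencia', 've-valencia'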

Added transform/sqlmesh_padelnomics/CLAUDE.md documenting data modeling rules,
conformed dimension map, and source integration architecture.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-23 21:17:04 +01:00


CLAUDE.md — padelnomics SQLMesh transform

Data engineering guidance for working in this directory. Read the data-engineer skill (/data-engineer) before making modeling decisions.

3-layer architecture rules

staging/ — read + cast + dedup only

  • Reads landing zone files directly: read_json(@LANDING_DIR || '...', ...) or read_csv(...)
  • Casts every column to the correct type here: TRY_CAST(... AS DOUBLE), TRY_CAST(... AS DATE)
  • Deduplicates on the source's natural key if the source can produce duplicates (see the sketch after this list)
  • No business logic. No joins across sources. No derived metrics.
  • Naming: staging.stg_<source_dataset>
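
A minimal sketch of a staging model following these rules. The landing path,
the raw column names, and the loaded_at tie-breaker are assumptions, not the
project's actual schema:

MODEL (
  name staging.stg_playtomic_availability,
  kind FULL,
  grain (tenant_id, resource_id, slot_start_time)
);

SELECT
  tenant_id,
  resource_id,
  TRY_CAST(start_time AS TIMESTAMP) AS slot_start_time,  -- cast every column here
  TRY_CAST(price AS DOUBLE) AS price
FROM read_json(@LANDING_DIR || '/playtomic/availability/*.json')
-- dedup on the source's natural key; loaded_at is an assumed ingestion timestamp
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY tenant_id, resource_id, start_time
  ORDER BY loaded_at DESC
) = 1;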

foundation/ — business logic, conformed dimensions and facts

  • Dimensions (dim_*): one row per entity (venue, city, country). Slowly changing or static.
    • Conformed = shared across fact tables. dim_cities and dim_venues are conformed.
    • May integrate multiple staging sources (e.g. dim_cities joins venues + Eurostat + income).
    • Use QUALIFY ROW_NUMBER() to ensure exactly one row per grain (sketch after this list).
    • Surrogate keys (if needed): MD5(business_key) for stable joins.
  • Facts (fct_*): one row per event or measurement. Always have a time key.
    • fct_availability_slot: grain (snapshot_date, tenant_id, resource_id, slot_start_time)
    • fct_daily_availability: grain (snapshot_date, tenant_id) — aggregates fct_availability_slot
    • Facts reference conformed dimensions by their natural key (tenant_id, city_slug, etc.)
  • Dimension attributes with no time key must be dim_*, not fct_*.
    • e.g. dim_venue_capacity — static venue capacity attributes, grain tenant_id
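
A sketch of a dimension enforcing its grain. The column names, slugification
rule, and tie-breaker are assumptions; the LOWER-before-regex ordering mirrors
the dim_cities fix described in the commit message above:

MODEL (
  name foundation.dim_venues,
  kind FULL,
  grain venue_id
);

SELECT
  venue_id,
  country_code,
  -- LOWER() before the regex so uppercase names aren't collapsed to '-'
  REGEXP_REPLACE(LOWER(city_name), '[^a-z0-9]+', '-', 'g') AS city_slug,
  MD5(CAST(venue_id AS VARCHAR)) AS venue_sk  -- optional surrogate key
FROM staging.stg_playtomic_venues
-- exactly one row per grain
QUALIFY ROW_NUMBER() OVER (PARTITION BY venue_id ORDER BY city_name) = 1;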

serving/ — pre-aggregated, web app ready

  • Read by the web app via analytics.duckdb (exported by export_serving.py)
  • One model per query pattern / page type
  • Column names match what the frontend/template expects — no renaming at query time
  • Joins across foundation models to produce wide denormalized rows (see the sketch below)
  • Only tables with serving.* names are exported to analytics.duckdb
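
A sketch of a serving model in that shape: one wide row per city, joined across
foundation models by natural key (the venue_count measure, the population
column, and the join columns are assumptions):

MODEL (
  name serving.city_market_profile,
  kind FULL,
  grain (country_code, city_slug)
);

-- wide, denormalized row per city; column names match the page template
SELECT
  c.country_code,
  c.city_slug,
  c.population,                     -- NULL where Eurostat enrichment didn't match
  COUNT(v.venue_id) AS venue_count  -- assumed measure
FROM foundation.dim_cities AS c
LEFT JOIN foundation.dim_venues AS v
  ON v.country_code = c.country_code AND v.city_slug = c.city_slug
GROUP BY c.country_code, c.city_slug, c.population;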

Grain declarations

Every model must declare its grain in the MODEL(...) block:

MODEL (
  name foundation.fct_availability_slot,
  kind FULL,
  grain (snapshot_date, tenant_id, resource_id, slot_start_time)
);

If a model's grain is a single column, use grain column_name (no parens). Grain must match reality — use QUALIFY ROW_NUMBER() to enforce it.
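
For a single-column grain, using dim_venue_capacity from the table below:

MODEL (
  name foundation.dim_venue_capacity,
  kind FULL,
  grain tenant_id
);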

Conformed dimensions in this project

Dimension                      Grain                      Used by
foundation.dim_venues          venue_id                   dim_cities, dim_venue_capacity, fct_daily_availability (via capacity join)
foundation.dim_cities          (country_code, city_slug)  serving.city_market_profile → all pSEO serving models
foundation.dim_venue_capacity  tenant_id                  foundation.fct_daily_availability

Source integration map

stg_playtomic_venues   ─┐
stg_playtomic_resources─┤→ dim_venues ─┬→ dim_cities ─→ city_market_profile
stg_padel_courts       ─┘              └→ dim_venue_capacity
                                                            ↓
stg_playtomic_availability ──→ fct_availability_slot ──→ fct_daily_availability
                                                            ↓
                                               venue_pricing_benchmarks
                                                            ↓
stg_population ──→ dim_cities ─────────────────────────────┘
stg_income     ──→ dim_cities

Common pitfalls

  • Don't add business logic to staging. Even a CASE statement renaming values = business logic → move it to foundation.
  • Don't aggregate in foundation facts. fct_availability_slot is event-grain. The daily rollup lives in fct_daily_availability (see the rollup sketch after this list). If you need a different aggregation, add a new serving model — don't collapse the fact further.
  • dim_cities population is approximate. Eurostat uses city codes (DE001C) not names. Population enrichment succeeds for ~10% of cities. market_score degrades gracefully (population component = 0) for unmatched cities. To improve: add a Eurostat city-code→name lookup extract.
  • DuckDB lowercases column names at rest. camelCase columns like "ratePeak" are stored as ratepeak. The content engine uses a case-insensitive reverse map to match DEFAULTS keys.
  • Never point DUCKDB_PATH and SERVING_DUCKDB_PATH to the same file. SQLMesh holds an exclusive write lock during plan/run; the web app needs concurrent read access.
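
A sketch of the layering behind the second pitfall: the event-grain fact stays
untouched and the daily rollup is its own model (measure names are assumptions):

MODEL (
  name foundation.fct_daily_availability,
  kind FULL,
  grain (snapshot_date, tenant_id)
);

-- daily rollup of the event-grain fact; further aggregation belongs in serving
SELECT
  snapshot_date,
  tenant_id,
  COUNT(*) AS slot_count,                              -- assumed measure
  COUNT(DISTINCT resource_id) AS resources_with_slots  -- assumed measure
FROM foundation.fct_availability_slot
GROUP BY snapshot_date, tenant_id;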

Running

# Preview changes (no writes)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan

# Apply to dev environment
uv run sqlmesh -p transform/sqlmesh_padelnomics plan --auto-apply

# Apply to prod virtual layer
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod --auto-apply

# Export serving tables to analytics.duckdb
DUCKDB_PATH=$(pwd)/data/lakehouse.duckdb \
  SERVING_DUCKDB_PATH=$(pwd)/analytics.duckdb \
  uv run python -m padelnomics.export_serving