Files

Deeman 2e68cfbe4f feat(transform): individualise article costs with per-country Eurostat data

Add real per-country cost data to ~30 calculator fields so pSEO articles
show country-specific CAPEX/OPEX instead of hardcoded DE defaults.

Extractor:
- eurostat.py: add 8 new datasets (nrg_pc_205, nrg_pc_203, lc_lci_lev,
  5×prc_ppp_ind variants); add optional `dataset_code` field so multiple
  dict entries can share one Eurostat API endpoint

Staging (4 new models):
- stg_electricity_prices — EUR/kWh by country, semi-annual
- stg_gas_prices         — EUR/GJ by country, semi-annual
- stg_labour_costs       — EUR/hour by country, annual (future staffed scenario)
- stg_price_levels       — PLI indices (EU27=100) for 5 categories, annual

Foundation:
- dim_countries (new) — conformed country dimension; eliminates ~50-line CASE
  blocks duplicated in dim_cities/dim_locations; computes ~29 calculator cost
  override columns from PLI ratios and energy price ratios vs DE baseline;
  NULL for DE so calculator falls through to DEFAULTS unchanged
- dim_cities — replace country_name/slug CASE blocks + country_income CTE
  with JOIN dim_countries
- dim_locations — same refactor as dim_cities

Serving:
- pseo_city_costs_de — JOIN dim_countries; add 29 camelCase override columns
  auto-applied by calculator (electricity, heating, rentSqm, hallCostSqm, …)
- planner_defaults — JOIN dim_countries; same 29 cost columns flow through
  to /api/market-data endpoint

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-04 10:09:48 +01:00

6.3 KiB

Raw Blame History

CLAUDE.md — padelnomics SQLMesh transform

Data engineering guidance for working in this directory. Read the data-engineer skill (/data-engineer) before making modeling decisions.

3-layer architecture rules

staging/ — read + cast + dedup only

Reads landing zone files directly: read_json(@LANDING_DIR || '...', ...) or read_csv(...)
Casts every column to the correct type here: TRY_CAST(... AS DOUBLE), TRY_CAST(... AS DATE)
Deduplicates on the source's natural key if the source can produce duplicates
No business logic. No joins across sources. No derived metrics.
Naming: staging.stg_<source_dataset>

foundation/ — business logic, conformed dimensions and facts

Dimensions (dim_*): one row per entity (venue, city, country). Slowly changing or static.
- Conformed = shared across fact tables. dim_cities and dim_venues are conformed.
- May integrate multiple staging sources (e.g. dim_cities joins venues + Eurostat + income).
- Use QUALIFY ROW_NUMBER() to ensure exactly one row per grain.
- Surrogate keys (if needed): MD5(business_key) for stable joins.
Facts (fact_*): one row per event or measurement. Always have a time key.
- fct_availability_slot: grain (snapshot_date, tenant_id, resource_id, slot_start_time)
- fct_daily_availability: grain (snapshot_date, tenant_id) — aggregates fct_availability_slot
- Facts reference conformed dimensions by their natural key (tenant_id, city_slug, etc.)
Dimension attributes with no time key must be dim_*, not fct_*.
- e.g. dim_venue_capacity — static venue capacity attributes, grain tenant_id

serving/ — pre-aggregated, web app ready

Read by the web app via analytics.duckdb (exported by export_serving.py)
One model per query pattern / page type
Column names match what the frontend/template expects — no renaming at query time
Joins across foundation models to produce wide denormalized rows
Only tables with serving.* names are exported to analytics.duckdb

Grain declarations

Every model must declare its grain in the MODEL(...) block:

MODEL (
  name foundation.fct_availability_slot,
  kind FULL,
  grain (snapshot_date, tenant_id, resource_id, slot_start_time)
);

If a model's grain is a single column, use grain column_name (no parens). Grain must match reality — use QUALIFY ROW_NUMBER() to enforce it.

Conformed dimensions in this project

Dimension	Grain	Used by
`foundation.dim_countries`	`country_code`	`dim_cities`, `dim_locations`, `pseo_city_costs_de`, `planner_defaults` — single source for country names, income, PLI/cost overrides
`foundation.dim_venues`	`venue_id`	`dim_cities`, `dim_venue_capacity`, `fct_daily_availability` (via capacity join)
`foundation.dim_cities`	`(country_code, city_slug)`	`serving.city_market_profile` → all pSEO serving models
`foundation.dim_locations`	`(country_code, geoname_id)`	`serving.location_opportunity_profile` — all GeoNames locations (pop ≥1K), incl. zero-court locations
`foundation.dim_venue_capacity`	`tenant_id`	`foundation.fct_daily_availability`

Source integration map

stg_playtomic_venues  ─┐
stg_playtomic_resources─┤→ dim_venues ─┬→ dim_cities ──────────────→ city_market_profile
stg_padel_courts      ─┘               └→ dim_venue_capacity              (Marktreife-Score)
                                                            ↓
stg_playtomic_availability ──→ fct_availability_slot ──→ fct_daily_availability
                                                            ↓
                                               venue_pricing_benchmarks
                                                            ↓
stg_population ──→ dim_cities ─────────────────────────────┘
stg_income     ──→ dim_cities

stg_population_geonames ─┐
stg_padel_courts         ─┤→ dim_locations ──→ location_opportunity_profile
stg_tennis_courts        ─┤                         (Marktpotenzial-Score)
stg_income               ─┘

Distance calculation pattern (ST_Distance_Sphere)

Use a bounding-box pre-filter before calling ST_Distance_Sphere to avoid full cross-joins:

-- Nearest padel court (km) per location
SELECT l.geoname_id,
  MIN(ST_Distance_Sphere(
    ST_Point(l.lon, l.lat), ST_Point(p.lon, p.lat)
  ) / 1000.0) AS nearest_km
FROM locations l
JOIN padel_courts p
  ON ABS(l.lat - p.lat) < 0.5   -- ~55km pre-filter
 AND ABS(l.lon - p.lon) < 0.5
GROUP BY l.geoname_id

Requires extensions: [spatial] in config.yaml (already set). DuckDB spatial must INSTALL spatial; LOAD spatial; before ST_Distance_Sphere / ST_Point are available.

Common pitfalls

Don't add business logic to staging. Even a CASE statement renaming values = business logic → move it to foundation.
Don't aggregate in foundation facts. fct_availability_slot is event-grain. The daily rollup lives in fct_daily_availability. If you need a different aggregation, add a new serving model — don't collapse the fact further.
dim_cities population is approximate. Eurostat uses city codes (DE001C) not names. Population enrichment succeeds for ~10% of cities. market_score degrades gracefully (population component = 0) for unmatched cities. To improve: add a Eurostat city-code→name lookup extract.
DuckDB lowercases column names at rest. camelCase columns like "ratePeak" are stored as ratepeak. The content engine uses a case-insensitive reverse map to match DEFAULTS keys.
Never point DUCKDB_PATH and SERVING_DUCKDB_PATH to the same file. SQLMesh holds an exclusive write lock during plan/run; the web app needs concurrent read access.

Running

# Preview changes (no writes)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan

# Apply to dev environment
uv run sqlmesh -p transform/sqlmesh_padelnomics plan --auto-apply

# Apply to prod virtual layer
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod --auto-apply

# Export serving tables to analytics.duckdb
DUCKDB_PATH=$(pwd)/data/lakehouse.duckdb \
  SERVING_DUCKDB_PATH=$(pwd)/analytics.duckdb \
  uv run python src/padelnomics/export_serving.py

6.3 KiB Raw Blame History