Files
padelnomics/transform/sqlmesh_padelnomics/CLAUDE.md
Deeman 405efcfd19 docs: update docs and PROJECT.md for dual score pipeline
Task 8: documentation updates for the dual market score feature.

- CHANGELOG.md: comprehensive [Unreleased] entries for all additions
  (Marktpotenzial-Score, tennis courts, dim_locations, GeoNames expansion,
  DuckDB spatial, SOPS secrets, methodology page updates)
- docs/data-sources-inventory.md: add tennis courts Overpass row, update
  GeoNames entry (cities1000, username=padelnomics, higher score)
- transform/sqlmesh_padelnomics/CLAUDE.md: add dim_locations to conformed
  dimensions table, update source integration map with new pipeline branch,
  document ST_Distance_Sphere bounding-box pattern
- PROJECT.md: add dual score to In Progress, add Gemeinde pSEO + top-50
  ranking page to Next Up, add data backlog items (sports_centre, NUTS-3,
  opportunity map), add Decisions Log entry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 17:12:22 +01:00

6.1 KiB

CLAUDE.md — padelnomics SQLMesh transform

Data engineering guidance for working in this directory. Read the data-engineer skill (/data-engineer) before making modeling decisions.

3-layer architecture rules

staging/ — read + cast + dedup only

  • Reads landing zone files directly: read_json(@LANDING_DIR || '...', ...) or read_csv(...)
  • Casts every column to the correct type here: TRY_CAST(... AS DOUBLE), TRY_CAST(... AS DATE)
  • Deduplicates on the source's natural key if the source can produce duplicates
  • No business logic. No joins across sources. No derived metrics.
  • Naming: staging.stg_<source_dataset>

foundation/ — business logic, conformed dimensions and facts

  • Dimensions (dim_*): one row per entity (venue, city, country). Slowly changing or static.
    • Conformed = shared across fact tables. dim_cities and dim_venues are conformed.
    • May integrate multiple staging sources (e.g. dim_cities joins venues + Eurostat + income).
    • Use QUALIFY ROW_NUMBER() to ensure exactly one row per grain.
    • Surrogate keys (if needed): MD5(business_key) for stable joins.
  • Facts (fact_*): one row per event or measurement. Always have a time key.
    • fct_availability_slot: grain (snapshot_date, tenant_id, resource_id, slot_start_time)
    • fct_daily_availability: grain (snapshot_date, tenant_id) — aggregates fct_availability_slot
    • Facts reference conformed dimensions by their natural key (tenant_id, city_slug, etc.)
  • Dimension attributes with no time key must be dim_*, not fct_*.
    • e.g. dim_venue_capacity — static venue capacity attributes, grain tenant_id

serving/ — pre-aggregated, web app ready

  • Read by the web app via analytics.duckdb (exported by export_serving.py)
  • One model per query pattern / page type
  • Column names match what the frontend/template expects — no renaming at query time
  • Joins across foundation models to produce wide denormalized rows
  • Only tables with serving.* names are exported to analytics.duckdb

Grain declarations

Every model must declare its grain in the MODEL(...) block:

MODEL (
  name foundation.fct_availability_slot,
  kind FULL,
  grain (snapshot_date, tenant_id, resource_id, slot_start_time)
);

If a model's grain is a single column, use grain column_name (no parens). Grain must match reality — use QUALIFY ROW_NUMBER() to enforce it.

Conformed dimensions in this project

Dimension Grain Used by
foundation.dim_venues venue_id dim_cities, dim_venue_capacity, fct_daily_availability (via capacity join)
foundation.dim_cities (country_code, city_slug) serving.city_market_profile → all pSEO serving models
foundation.dim_locations (country_code, geoname_id) serving.location_opportunity_profile — all GeoNames locations (pop ≥1K), incl. zero-court locations
foundation.dim_venue_capacity tenant_id foundation.fct_daily_availability

Source integration map

stg_playtomic_venues  ─┐
stg_playtomic_resources─┤→ dim_venues ─┬→ dim_cities ──────────────→ city_market_profile
stg_padel_courts      ─┘               └→ dim_venue_capacity              (Marktreife-Score)
                                                            ↓
stg_playtomic_availability ──→ fct_availability_slot ──→ fct_daily_availability
                                                            ↓
                                               venue_pricing_benchmarks
                                                            ↓
stg_population ──→ dim_cities ─────────────────────────────┘
stg_income     ──→ dim_cities

stg_population_geonames ─┐
stg_padel_courts         ─┤→ dim_locations ──→ location_opportunity_profile
stg_tennis_courts        ─┤                         (Marktpotenzial-Score)
stg_income               ─┘

Distance calculation pattern (ST_Distance_Sphere)

Use a bounding-box pre-filter before calling ST_Distance_Sphere to avoid full cross-joins:

-- Nearest padel court (km) per location
SELECT l.geoname_id,
  MIN(ST_Distance_Sphere(
    ST_Point(l.lon, l.lat), ST_Point(p.lon, p.lat)
  ) / 1000.0) AS nearest_km
FROM locations l
JOIN padel_courts p
  ON ABS(l.lat - p.lat) < 0.5   -- ~55km pre-filter
 AND ABS(l.lon - p.lon) < 0.5
GROUP BY l.geoname_id

Requires extensions: [spatial] in config.yaml (already set). DuckDB spatial must INSTALL spatial; LOAD spatial; before ST_Distance_Sphere / ST_Point are available.

Common pitfalls

  • Don't add business logic to staging. Even a CASE statement renaming values = business logic → move it to foundation.
  • Don't aggregate in foundation facts. fct_availability_slot is event-grain. The daily rollup lives in fct_daily_availability. If you need a different aggregation, add a new serving model — don't collapse the fact further.
  • dim_cities population is approximate. Eurostat uses city codes (DE001C) not names. Population enrichment succeeds for ~10% of cities. market_score degrades gracefully (population component = 0) for unmatched cities. To improve: add a Eurostat city-code→name lookup extract.
  • DuckDB lowercases column names at rest. camelCase columns like "ratePeak" are stored as ratepeak. The content engine uses a case-insensitive reverse map to match DEFAULTS keys.
  • Never point DUCKDB_PATH and SERVING_DUCKDB_PATH to the same file. SQLMesh holds an exclusive write lock during plan/run; the web app needs concurrent read access.

Running

# Preview changes (no writes)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan

# Apply to dev environment
uv run sqlmesh -p transform/sqlmesh_padelnomics plan --auto-apply

# Apply to prod virtual layer
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod --auto-apply

# Export serving tables to analytics.duckdb
DUCKDB_PATH=$(pwd)/data/lakehouse.duckdb \
  SERVING_DUCKDB_PATH=$(pwd)/analytics.duckdb \
  uv run python -m padelnomics.export_serving