17 Commits

Author SHA1 Message Date
Deeman
4e82907a70 refactor(transform): conform geographic dimension hierarchy via city_slug
Propagates the conformed city key (city_slug) from dim_venues through the
full pricing pipeline, eliminating 3 fragile LOWER(TRIM(...)) fuzzy string
joins with deterministic key joins.

Changes (cascading, task-by-task):
- dim_venues: add city_slug computed column (REGEXP_REPLACE slug derivation)
- dim_venue_capacity: join foundation.dim_venues instead of stg_playtomic_venues;
  carry city_slug alongside country_code/city
- fct_daily_availability: carry city_slug from dim_venue_capacity
- venue_pricing_benchmarks: carry city_slug from fct_daily_availability;
  add to venue_stats GROUP BY and final SELECT/GROUP BY
- city_market_profile: join vpb on city_slug = city_slug (was LOWER(TRIM))
- planner_defaults: add city_slug to city_benchmarks CTE; join on city_slug
- pseo_city_pricing: join city_market_profile on city_slug (was LOWER(TRIM))
- pipeline_routes._DAG: dim_venue_capacity now depends on dim_venues, not stg_playtomic_venues

Result: dim_venues.city_slug → dim_cities.(country_code, city_slug) forms a
fully conformed geographic hierarchy with no fuzzy string comparisons.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-27 13:23:03 +01:00
Deeman
c3531bd75d feat(data): Phase 2b complete — EU NUTS-2 spatial join + US state income
- stg_regional_income: expanded NUTS-1+2 (LENGTH IN 3,4), nuts_code rename, nuts_level
- stg_nuts2_boundaries: new — ST_Read GISCO GeoJSON, bbox columns for spatial pre-filter
- stg_income_usa: new — Census ACS state-level income staging model
- dim_locations: spatial join replaces admin1_to_nuts1 VALUES CTE; us_income CTE with
  PPS normalisation (income/80610×30000); income cascade: NUTS-2→NUTS-1→US state→country
- init_landing_seeds: compress=False for ST_Read files; gisco GeoJSON + census income seeds
- CHANGELOG + PROJECT.md updated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-27 11:03:16 +01:00
Deeman
5ade38eeaf feat(data): Phase 2a — NUTS-1 regional income for opportunity score
- eurostat.py: add nama_10r_2hhinc dataset config; append filter params to
  request URL so server pre-filters the large cube before download
- stg_regional_income.sql: new staging model — reads nama_10r_2hhinc.json.gz,
  filters to NUTS-1 codes (3-char), normalises EL→GR / UK→GB
- dim_locations.sql: add admin1_to_nuts1 VALUES CTE (16 German Bundesländer)
  + regional_income CTE; final SELECT uses COALESCE(regional, country) income
- init_landing_seeds.py: add empty seed for nama_10r_2hhinc.json.gz

Munich/Bayern now scores ~29K PPS vs Chemnitz/Sachsen ~19K PPS instead of
both inheriting the same national average (~25.5K PPS).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-27 10:26:15 +01:00
Deeman
3aa30ab419 feat(sql): dim_cities — GeoNames spatial population fallback
Adds a coordinate-based population lookup as a fallback when string name
matching fails (~29% of cities). Uses bbox pre-filter (0.14° ≈ 15 km) then
ST_Distance_Sphere to find the nearest GeoNames location in the same country.

Fixes localization mismatches: Milano≠Milan, Wien≠Vienna, München≠Munich.

Population cascade: Eurostat EU > US Census > ONS UK > GeoNames string >
GeoNames spatial > 0.

Coverage: 70.5% → 98.5% (5,401 / 5,481 cities with population > 0).
Key cities before/after:
  Wien:   0 → 1,691,468
  Milano: 0 → 1,371,498
  München: already matched by string; verified still correct at 1,488,719

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-27 08:47:26 +01:00
Deeman
7186d4582a feat(sql): thread opportunity_score from location_opportunity_profile into pSEO serving chain
- dim_cities: add geoname_id to geonames_pop CTE and final SELECT
  Creates FK between dim_cities (city-with-padel-venues) and dim_locations (all GeoNames),
  enabling joins to location_opportunity_profile for the first time.
- city_market_profile: pass geoname_id through base CTE and final SELECT
- pseo_city_costs_de: LEFT JOIN location_opportunity_profile on (country_code, geoname_id),
  add opportunity_score to output columns
- pseo_country_overview: add avg_opportunity_score, top_opportunity_score, top_opportunity_slugs,
  top_opportunity_names aggregates

Cities with no GeoNames name match get opportunity_score = NULL; templates guard with
{% if opportunity_score %}.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 20:29:57 +01:00
Deeman
55f179ba54 fix(transform): increase geonames object size limit and remove stale column ref
- stg_population_geonames: add maximum_object_size=40MB to read_json() call;
  geonames cities_global.json.gz is ~30MB, exceeding DuckDB's 16MB default
- dim_locations: remove stale 'population_year AS population_year' column ref;
  stg_population_geonames has ref_year, not population_year — caused BinderException

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 09:56:05 +01:00
Deeman
ebfdc84a94 feat(transform): add dim_locations + dual market scoring models
dim_locations (foundation):
- Seeded from stg_population_geonames (all locations, not venue-dependent)
- Grain: (country_code, geoname_id)
- Enriched with: padel venues within 5km, nearest court distance (ST_Distance_Sphere),
  tennis courts within 25km, country income
- Covers zero-court Gemeinden for opportunity scoring

location_opportunity_profile (serving) — Padelnomics Marktpotenzial-Score:
- Answers "Where should I build?" — no padel_venue_count filter
- Formula: population (25) + income (20) + supply gap inverted (30) +
           catchment gap (15) + tennis culture (10) = 100pts
- Sorted by opportunity_score DESC

city_market_profile (serving) — Padelnomics Marktreife-Score:
- Add saturation discount (×0.85 when venues_per_100k > 8)
- Update header comment to reference Marktreife-Score branding
- Kept WHERE padel_venue_count > 0 (established markets only)
- column name market_score unchanged (avoids downstream breakage)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 16:28:16 +01:00
Deeman
0960990373 feat(data): Sprint 1-5 population pipeline — city labels, US/UK/Global extractors
Part A: Data Layer — Sprints 1-5

Sprint 1 — Eurostat SDMX city labels (unblocks EU population):
- New extractor: eurostat_city_labels.py — fetches ESTAT/CITIES codelist
  (city_code → city_name mapping) with ETag dedup
- New staging model: stg_city_labels.sql — grain city_code
- Updated dim_cities.sql — joins Eurostat population via city code lookup;
  replaces hardcoded 0::BIGINT population

Sprint 2 — Market score formula v2:
- city_market_profile.sql: 30pt population (LN/1M), 25pt income PPS (/200),
  30pt demand (occupancy or density), 15pt data confidence
- Moved venue_pricing_benchmarks join into base CTE so median_occupancy_rate
  is available to the scoring formula

Sprint 3 — US Census ACS extractor:
- New extractor: census_usa.py — ACS 5-year place population (vintage 2023)
- New staging model: stg_population_usa.sql — grain (place_fips, ref_year)

Sprint 4 — ONS UK extractor:
- New extractor: ons_uk.py — 2021 Census LAD population via ONS beta API
- New staging model: stg_population_uk.sql — grain (lad_code, ref_year)

Sprint 5 — GeoNames global extractor:
- New extractor: geonames.py — cities15000.zip bulk download, filtered to ≥50K pop
- New staging model: stg_population_geonames.sql — grain geoname_id
- dim_cities: 5-source population cascade (Eurostat > Census > ONS > GeoNames > 0)
  with case/whitespace-insensitive city name matching

Registered all 4 new CLI entrypoints in pyproject.toml and all.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 00:07:08 +01:00
Deeman
ebba46f700 refactor: align transform layer with template methodology
Three deviations from the quart_saas_boilerplate methodology corrected:

1. Fix dim_cities LIKE join (data quality bug)
   - Old: FROM eurostat_cities LEFT JOIN venue_counts LIKE '%country_code%'
     → cartesian product (2.6M rows vs ~5500 expected)
   - New: FROM venue_cities (dim_venues) as primary table, Eurostat for
     enrichment only. grain (country_code, city_slug).
   - Also fixes REGEXP_REPLACE to LOWER() before regex so uppercase city
     names aren't stripped to '-'

2. Rename fct_venue_capacity → dim_venue_capacity
   - Static venue attributes with no time key are a dimension, not a fact
   - No SQL logic changes; update fct_daily_availability reference

3. Add fct_availability_slot at event grain
   - New: grain (snapshot_date, tenant_id, resource_id, slot_start_time)
   - Recheck dedup logic moves here from fct_daily_availability
   - fct_daily_availability now reads fct_availability_slot (cleaner DAG)

Downstream fixes:
- city_market_profile, planner_defaults grain → (country_code, city_slug)
- pseo_city_costs_de, pseo_city_pricing add city_key composite natural key
  (country_slug || '-' || city_slug) to avoid URL collisions across countries
- planner_defaults join in pseo_city_costs_de uses both country_code + city_slug
- Templates updated: natural_key city_slug → city_key

Added transform/sqlmesh_padelnomics/CLAUDE.md documenting data modeling rules,
conformed dimension map, and source integration architecture.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-23 21:17:04 +01:00
Deeman
b517e3e58d feat(transform): add country_name_en + country_slug to dim_cities, pass through city_market_profile
Prerequisite for all pSEO serving models. Adds CASE-based country_name_en
and URL-safe country_slug to foundation.dim_cities, then selects them through
serving.city_market_profile so downstream models inherit them automatically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 18:37:43 +01:00
Deeman
a1faddbed6 feat: Python supervisor + feature flags
Supervisor (replaces supervisor.sh):
- supervisor.py — cron-based pipeline orchestration, reads workflows.toml
  on every tick, runs due extractors in topological waves with parallel
  execution, then SQLMesh transform + serving export
- workflows.toml — workflow registry: overpass (monthly), eurostat (monthly),
  playtomic_tenants (weekly), playtomic_availability (daily),
  playtomic_recheck (hourly 6–23)
- padelnomics-supervisor.service — updated ExecStart to Python supervisor

Extraction enhancements:
- proxy.py — optional round-robin/sticky proxy rotation via PROXY_URLS env
- playtomic_availability.py — parallel fetch (EXTRACT_WORKERS), recheck mode
  (main_recheck) re-queries imminent slots for accurate occupancy measurement
- _shared.py — realistic browser User-Agent on all extractor sessions
- stg_playtomic_availability.sql — reads morning + recheck snapshots, tags each
- fct_daily_availability.sql — prefers recheck over morning for same slot

Feature flags (replaces WAITLIST_MODE env var):
- migration 0019 — feature_flags table, 5 initial flags:
  markets (on), payments/planner_export/supplier_signup/lead_unlock (off)
- core.py — is_flag_enabled() + feature_gate() decorator
- routes — payments, markets, planner_export, supplier_signup, lead_unlock gated
- admin flags UI — /admin/flags toggle page + nav link
- app.py — flag() injected as Jinja2 global

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 13:53:45 +01:00
Deeman
13c86ebf84 Merge branch 'worktree-extraction-overhaul'
# Conflicts:
#	transform/sqlmesh_padelnomics/models/foundation/dim_cities.sql
#	transform/sqlmesh_padelnomics/models/staging/stg_playtomic_venues.sql
2026-02-23 01:01:26 +01:00
Deeman
79f7fc6fad feat: Playtomic pricing/occupancy pipeline + email i18n + audience restructure
Three workstreams:

1. Playtomic full data extraction & transform pipeline:
   - Expand venue bounding boxes from 4 to 23 regions (global coverage)
   - New staging models for court resources, opening hours, and slot-level
     availability with real prices from the Playtomic API
   - Foundation fact tables for venue capacity and daily occupancy/revenue
   - City-level pricing benchmarks replacing hardcoded country estimates
   - Planner defaults now use 3-tier cascade: city data → country → fallback

2. Transactional email i18n:
   - _t() helper in worker.py with ~70 translation keys (EN + DE)
   - All 8 email handlers translated, lang passed in task payloads

3. Resend audiences restructured to 3 named audiences (free plan limit)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 00:54:53 +01:00
Deeman
5a1bb21624 fix: eurostat JSON-stat parsing + staging model corrections
Eurostat JSON-stat format (4-7 dimension sparse dict with 583K values)
causes DuckDB OOM — pre-process in extractor to flat records.
Also fix dim_cities unused CTE bug and playtomic venue lat/lon path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 20:52:25 +01:00
Deeman
2db66efe77 feat: migrate transform to 3-layer architecture with per-layer schemas
Remove raw/ layer — staging models now read landing JSON directly.
Rename all model schemas from padelnomics.* to staging.*/foundation.*/serving.*.
Web app queries updated to serving.planner_defaults via SERVING_DUCKDB_PATH.
Supervisor gets daily sleep interval between pipeline runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 19:04:40 +01:00
Deeman
18ee24818b feat: copier update v0.9.0 — extraction docs, state tracking, architecture guides
Sync template from 29ac25b → v0.9.0 (29 template commits). Due to
template's _subdirectory migration, new files were manually rendered
rather than auto-merged by copier.

New files:
- .claude/CLAUDE.md + coding_philosophy.md (agent instructions)
- extract utils.py: SQLite state tracking for extraction runs
- extract/transform READMEs: architecture & pattern documentation
- infra/supervisor: systemd service + orchestration script
- Per-layer model READMEs (raw, staging, foundation, serving)

Also fixes copier-answers.yml (adds 4 feature toggles, removes stale
payment_provider key) and scopes CLAUDE.md gitignore to root only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 15:44:48 +01:00
Deeman
4ae00b35d1 refactor: flatten padelnomics/padelnomics/ → repo root
git mv all tracked files from the nested padelnomics/ workspace
directory to the git repo root. Merged .gitignore files.
No code changes — pure path rename.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 00:44:40 +01:00