feat(data): Sprint 1-5 population pipeline — city labels, US/UK/Global extractors

Part A: Data Layer — Sprints 1-5

Sprint 1 — Eurostat SDMX city labels (unblocks EU population):
- New extractor: eurostat_city_labels.py — fetches ESTAT/CITIES codelist (city_code → city_name mapping) with ETag dedup
- New staging model: stg_city_labels.sql — grain city_code
- Updated dim_cities.sql — joins Eurostat population via city code lookup; replaces hardcoded 0::BIGINT population

Sprint 2 — Market score formula v2:
- city_market_profile.sql: 30pt population (LN/1M), 25pt income PPS (/200), 30pt demand (occupancy or density), 15pt data confidence
- Moved venue_pricing_benchmarks join into base CTE so median_occupancy_rate is available to the scoring formula

Sprint 3 — US Census ACS extractor:
- New extractor: census_usa.py — ACS 5-year place population (vintage 2023)
- New staging model: stg_population_usa.sql — grain (place_fips, ref_year)

Sprint 4 — ONS UK extractor:
- New extractor: ons_uk.py — 2021 Census LAD population via ONS beta API
- New staging model: stg_population_uk.sql — grain (lad_code, ref_year)

Sprint 5 — GeoNames global extractor:
- New extractor: geonames.py — cities15000.zip bulk download, filtered to ≥50K pop
- New staging model: stg_population_geonames.sql — grain geoname_id
- dim_cities: 5-source population cascade (Eurostat > Census > ONS > GeoNames > 0) with case/whitespace-insensitive city name matching

Registered all 4 new CLI entrypoints in pyproject.toml and all.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 00:07:08 +01:00
parent e76b6b4715
commit 0960990373
12 changed files with 860 additions and 32 deletions
--- a/transform/sqlmesh_padelnomics/models/serving/city_market_profile.sql
+++ b/transform/sqlmesh_padelnomics/models/serving/city_market_profile.sql
@@ -1,10 +1,11 @@
 -- One Big Table: per-city padel market intelligence.
 -- Consumed by: SEO article generation, planner city-select pre-fill, API endpoints.
 --
-- Market score (0–100) is a simple composite:
--   40% population (log-scaled, city > 500K = max)
--   40% venue density (courts per 100K residents)
--   20% data confidence (completeness of both population + venue data)
+-- Market score v2 (0–100):
+--   30 pts  population  — log-scaled to 1M+ city ceiling (was 40pts/500K)
+--   25 pts  income PPS  — normalised to 200 ceiling (covers CH/NO/LU outliers)
+--   30 pts  demand      — observed occupancy if available, else venue density
+--   15 pts  data quality — completeness discount, not a market signal

 MODEL (
  name serving.city_market_profile,
@@ -37,19 +38,41 @@ WITH base AS (
      WHEN c.population > 0 AND c.padel_venue_count > 0 THEN 1.0
      WHEN c.population > 0 OR  c.padel_venue_count > 0 THEN 0.5
      ELSE 0.0
-    END                          AS data_confidence
+    END                          AS data_confidence,
+    -- Pricing / occupancy from Playtomic (NULL when no availability data)
+    vpb.median_hourly_rate,
+    vpb.median_peak_rate,
+    vpb.median_offpeak_rate,
+    vpb.median_occupancy_rate,
+    vpb.median_daily_revenue_per_venue,
+    vpb.price_currency
  FROM foundation.dim_cities c
+  LEFT JOIN serving.venue_pricing_benchmarks vpb
+    ON c.country_code = vpb.country_code
+    AND LOWER(TRIM(c.city_name)) = LOWER(TRIM(vpb.city))
  WHERE c.padel_venue_count > 0
 ),
 scored AS (
  SELECT *,
    ROUND(
-      -- Population component (log scale, 500K+ city → 40 pts)
-      40.0 * LEAST(1.0, LN(GREATEST(population, 1)) / LN(500000))
-      -- Density component (5 courts/100K → 40 pts)
-    + 40.0 * LEAST(1.0, COALESCE(venues_per_100k, 0) / 5.0)
-      -- Confidence component
-    + 20.0 * data_confidence
+      -- Population (30 pts): log-scale, 1M+ city = full marks.
+      -- LN(1) = 0 so unpopulated cities score 0 here — they still score on demand.
+      30.0 * LEAST(1.0, LN(GREATEST(population, 1)) / LN(1000000))
+      -- Economic power (25 pts): income PPS normalised to 200 ceiling.
+      -- 200 covers high-income outliers (CH ~190, NO ~180, LU ~200+).
+      -- Drives pricing power and willingness-to-pay directly.
+      + 25.0 * LEAST(1.0, COALESCE(median_income_pps, 100) / 200.0)
+      -- Demand evidence (30 pts): observed occupancy is the best signal
+      -- (proves real demand). If unavailable, venue density is the proxy
+      -- (proves market exists; caps at 4/100K to avoid penalising dense cities).
+      + 30.0 * CASE
+          WHEN median_occupancy_rate IS NOT NULL
+            THEN LEAST(1.0, median_occupancy_rate / 0.65)
+          ELSE LEAST(1.0, COALESCE(venues_per_100k, 0) / 4.0)
+        END
+      -- Data quality (15 pts): measures completeness, not market quality.
+      -- Reduced from 20pts — kept as confidence discount, not market signal.
+      + 15.0 * data_confidence
    , 1)                         AS market_score
  FROM base
 )
@@ -69,16 +92,12 @@ SELECT
  s.market_score,
  s.median_income_pps,
  s.income_year,
-  -- Playtomic pricing/occupancy (NULL when no availability data)
-  vpb.median_hourly_rate,
-  vpb.median_peak_rate,
-  vpb.median_offpeak_rate,
-  vpb.median_occupancy_rate,
-  vpb.median_daily_revenue_per_venue,
-  vpb.price_currency,
+  s.median_hourly_rate,
+  s.median_peak_rate,
+  s.median_offpeak_rate,
+  s.median_occupancy_rate,
+  s.median_daily_revenue_per_venue,
+  s.price_currency,
  CURRENT_DATE                   AS refreshed_date
 FROM scored s
-LEFT JOIN serving.venue_pricing_benchmarks vpb
-  ON s.country_code = vpb.country_code
-  AND LOWER(TRIM(s.city_name)) = LOWER(TRIM(vpb.city))
 ORDER BY s.market_score DESC
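
Reviewer note: the v2 scoring expression in the `scored` CTE can be sanity-checked outside the warehouse with a small Python sketch. This is only an illustration that mirrors the SQL above, not code from this commit. The function name `market_score_v2` and its parameter names are hypothetical; the weights, ceilings, and the PPS default of 100 are taken from the diff. Python's `round` may differ from SQL `ROUND` on exact .05 ties, so treat tie cases with care.

```python
import math
from typing import Optional


def market_score_v2(
    population: int,
    median_income_pps: Optional[float],
    median_occupancy_rate: Optional[float],
    venues_per_100k: Optional[float],
    data_confidence: float,
) -> float:
    """Hypothetical mirror of the market_score expression in the scored CTE."""
    # Population (30 pts): log-scaled, 1M+ residents earns full marks.
    # max(population, 1) keeps log() defined; log(1) = 0, so 0 points.
    pop_pts = 30.0 * min(1.0, math.log(max(population, 1)) / math.log(1_000_000))

    # Income (25 pts): PPS normalised to a ceiling of 200.
    # A missing value defaults to 100, as COALESCE(median_income_pps, 100) does.
    income = 100.0 if median_income_pps is None else median_income_pps
    income_pts = 25.0 * min(1.0, income / 200.0)

    # Demand (30 pts): observed occupancy if present (0.65 earns full marks),
    # otherwise venue density (4 venues per 100K earns full marks).
    if median_occupancy_rate is not None:
        demand = min(1.0, median_occupancy_rate / 0.65)
    else:
        density = 0.0 if venues_per_100k is None else venues_per_100k
        demand = min(1.0, density / 4.0)
    demand_pts = 30.0 * demand

    # Data quality (15 pts): completeness factor of 0.0, 0.5, or 1.0.
    quality_pts = 15.0 * data_confidence

    return round(pop_pts + income_pts + demand_pts + quality_pts, 1)


if __name__ == "__main__":
    # A city at every ceiling reaches the maximum score.
    print(market_score_v2(1_000_000, 200.0, 0.65, None, 1.0))
    # No occupancy data: the density proxy is used instead.
    print(market_score_v2(1_000, 100.0, None, 2.0, 0.5))
```

A city with no population, no income figure, and no demand data still receives 12.5 points from the income default alone, which is worth keeping in mind when reading low scores.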