merge: opportunity score data quality improvements

Phase 0 — income ceiling fix (opportunity_score):
  PPS normalisation /200→/35000; economic power now differentiates countries
  (DE 13.2, ES 10.7, SE 14.3 pts; was 20.0 everywhere)

Phase 1b — overpass_tennis in workflows.toml:
  Monthly schedule added; was only in combined extractor

Phase 2b — dim_cities spatial population fallback:
  GeoNames spatial CTE (ST_Distance_Sphere, 0.14° bbox) resolves localization mismatches:
  Wien→1.69M, Milano→1.37M, München→1.49M
  Coverage: 70.5% → 98.5% (5,401/5,481 cities with population)
2026-02-27 08:52:35 +01:00
parent eef3ad2954 e32f7ba4b8
commit 5fa8a98903
5 changed files with 63 additions and 15 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,12 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 ## [Unreleased]
+### Changed
+- **Opportunity Score v2 — income ceiling fix** (`location_opportunity_profile.sql`): income PPS normalisation changed from `/200.0` (caused LEAST(1.0, 115)=1.0 for ALL countries — no differentiation) to `/35000.0` with country-spread-matched ceiling. Default for missing data changed from 100 to 15000 (developing-market assumption). Country scores now reflect real PPS spread: LU 20.0, SE 14.3, DE 13.2, ES 10.7, GB 10.5 pts (was 20.0 everywhere).
+- **dim_cities population coverage 70.5% → 98.5%** — added GeoNames spatial fallback CTE that finds the nearest GeoNames location within ~15 km when string name matching fails (~29% of cities). Fixes localization mismatches (Milano≠Milan, Wien≠Vienna, München≠Munich): Wien 0→1,691,468; Milano 0→1,371,498. Population cascade now: Eurostat EU > US Census > ONS UK > GeoNames string > GeoNames spatial > 0.
 ### Added
+- **overpass_tennis** workflow added to `infra/supervisor/workflows.toml` — tennis courts extraction was only in the combined `all.py` extractor; now scheduled monthly by the supervisor so it runs automatically in production.
 - **Market Score v3 (Marktreife-Score recalibration)** — fixes ranking inversion where early-stage markets (Germany 1/100k) outscored mature markets (Spain 36/100k):
  - **Formula rewrite** (`city_market_profile.sql`): supply development now 40 pts (log-scaled density LN(d+1)/LN(21) × count gate min(1,count/5)); demand evidence 25 pts (occupancy or 40% density proxy); population reduced to 15 pts (context); income to 10 pts (context); data quality to 10 pts; saturation discount removed
  - **Count gate** eliminates small-town inflation: a single venue in a 5k-resident town can no longer outscore Berlin (was 92.7 → now 43.9 for Bernau bei Berlin)
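Reviewer note: the supply-development term described in the "Formula rewrite" bullet can be sketched in Python as follows. This is only a rough illustration. The function name and signature are hypothetical, the real logic lives in SQL in `city_market_profile.sql`, and capping the log term at 1.0 is an assumption that the changelog does not state.

```python
import math


def supply_development_points(density_per_100k, venue_count, max_points=40.0):
    """Hypothetical sketch of the 40-pt supply term from the changelog entry."""
    # Log-scaled density: LN(d + 1) / LN(21), so 20 venues per 100k maps to 1.0.
    # The min(1.0, ...) cap is an assumption, not confirmed by the source.
    density_term = min(1.0, math.log(density_per_100k + 1.0) / math.log(21.0))
    # Count gate: full weight only once a city has at least 5 venues.
    count_gate = min(1.0, venue_count / 5.0)
    return max_points * density_term * count_gate


# One venue in a 5k-resident town is 20 venues per 100k residents.
small_town = supply_development_points(density_per_100k=20.0, venue_count=1)
# Same density, but backed by five venues.
gated_open = supply_development_points(density_per_100k=20.0, venue_count=5)
print(round(small_town, 1), round(gated_open, 1))
```

The sketch covers the supply term only, so it does not reproduce the full Bernau or Berlin scores quoted above; it just shows how the gate scales down a high density that rests on a single venue.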
--- a/PROJECT.md
+++ b/PROJECT.md
@@ -1,7 +1,7 @@
 # Padelnomics — Project Tracker
 > Move tasks across columns as you work. Add new tasks at the top of the relevant column.
-> Last updated: 2026-02-27.
+> Last updated: 2026-02-27 (opportunity score data quality improvements).
 ---
@@ -89,6 +89,9 @@
 - [x] **Opportunity Score integration** — `opportunity_score` (Marktpotenzial) wired into city + country templates; `geoname_id` threaded through SQL chain (dim_cities → city_market_profile → pseo_city_costs_de); 71.4% city match rate; stats strip, intro paragraphs, market tables, and FAQ updated in both DE + EN
 - [x] **Market Score v3 recalibration** — fixes ranking inversion (Germany 1/100k was outscoring Spain 36/100k); log-scaled density + count gate replaces linear formula; saturation discount removed; template thresholds updated across all 3 pSEO templates; verified: Málaga 70.1, Barcelona 67.4, Madrid 66.9, Amsterdam 58.4, Bernau 43.9 (was 92.7), Berlin 42.2, London 44.1
 - [x] **Opportunity Score v2** — supply gap ceiling raised 4→8/100k (gentler gradient, accounts for 87% data undercount); formula documentation added (DuckDB LEAST NULL behaviour, income saturation, tennis data gap)
+- [x] **Opportunity Score v2 — income ceiling fix** — PPS normalisation `/200.0` → `/35000.0`; economic power component now differentiates countries (DE 13.2, ES 10.7, SE 14.3 pts; was 20.0 everywhere)
+- [x] **dim_cities population coverage 70.5% → 98.5%** — GeoNames spatial fallback CTE (ST_Distance_Sphere, 0.14° bbox) resolves localization mismatches (Wien→Vienna 1.69M, Milano→Milan 1.37M); population cascade: Eurostat > Census > ONS > GeoNames string > GeoNames spatial > 0
+- [x] **overpass_tennis added to supervisor workflows** — monthly schedule in `workflows.toml`; was only in combined extractor
 ### Data Pipeline (DaaS)
 - [x] Overpass API extractor (OSM padel courts)
--- a/infra/supervisor/workflows.toml
+++ b/infra/supervisor/workflows.toml
@@ -13,6 +13,10 @@
 module = "padelnomics_extract.overpass"
 schedule = "monthly"
+[overpass_tennis]
+module = "padelnomics_extract.overpass_tennis"
+schedule = "monthly"
 [eurostat]
 module = "padelnomics_extract.eurostat"
 schedule = "monthly"
--- a/transform/sqlmesh_padelnomics/models/foundation/dim_cities.sql
+++ b/transform/sqlmesh_padelnomics/models/foundation/dim_cities.sql
@@ -12,7 +12,9 @@
 --   stg_population_uk    → ONS LAD population
 --   stg_population_geonames → GeoNames global fallback
 --
--- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames > 0.
+-- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames string > GeoNames spatial > 0.
+-- GeoNames spatial fallback: finds nearest location within ~15km when string name match fails.
+-- Fixes localization mismatches: Milano≠Milan, Wien≠Vienna, München≠Munich (~29% of cities).
 -- City name matching is case/whitespace-insensitive within each country.
 --
 -- Grain: (country_code, city_slug) — two cities in different countries can share a
@@ -75,9 +77,33 @@ uk_pop AS (
 ),
 -- GeoNames global fallback (all cities ≥50K)
 geonames_pop AS (
-  SELECT geoname_id, city_name, country_code, population, ref_year
+  SELECT geoname_id, city_name, country_code, lat, lon, population, ref_year
  FROM staging.stg_population_geonames
  QUALIFY ROW_NUMBER() OVER (PARTITION BY geoname_id ORDER BY ref_year DESC) = 1
 ),
+-- GeoNames spatial fallback: for cities where string name match fails,
+-- find the nearest GeoNames location within ~15km.
+-- Fixes localization mismatches: Milano≠Milan, Wien≠Vienna, München≠Munich.
+-- Uses bbox pre-filter (ABS < 0.14°) then exact sphere distance, picks nearest.
+geonames_spatial AS (
+  SELECT
+    vc.country_code,
+    vc.city_slug,
+    gn.geoname_id                                            AS spatial_geoname_id,
+    gn.population                                            AS spatial_population,
+    gn.ref_year                                              AS spatial_ref_year
+  FROM venue_cities vc
+  JOIN geonames_pop gn
+    ON vc.country_code = gn.country_code
+    AND ABS(vc.centroid_lat - gn.lat) < 0.14   -- ~15km bbox pre-filter
+    AND ABS(vc.centroid_lon - gn.lon) < 0.14
+  QUALIFY ROW_NUMBER() OVER (
+    PARTITION BY vc.country_code, vc.city_slug
+    ORDER BY ST_Distance_Sphere(
+      ST_Point(vc.centroid_lon, vc.centroid_lat),
+      ST_Point(gn.lon, gn.lat)
+    )
+  ) = 1
+)
 SELECT
  vc.country_code,
@@ -135,13 +161,14 @@ SELECT
  ))                                                         AS country_slug,
  vc.centroid_lat                                            AS lat,
  vc.centroid_lon                                            AS lon,
-  -- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames > 0.
+  -- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames string > GeoNames spatial > 0.
-  -- City name match is case/whitespace-insensitive within each country.
+  -- Spatial fallback activates only when all string matches fail (~29% of cities).
  COALESCE(
    ep.population,
    usa.population,
    uk.population,
    gn.population,
+    gs.spatial_population,
    0
  )::BIGINT                                                  AS population,
  COALESCE(
@@ -149,14 +176,15 @@ SELECT
    usa.ref_year,
    uk.ref_year,
    gn.ref_year,
+    gs.spatial_ref_year,
    0
  )::INTEGER                                                 AS population_year,
  vc.padel_venue_count,
  ci.median_income_pps,
  ci.income_year,
  -- GeoNames ID: FK to dim_locations / location_opportunity_profile.
-  -- NULL when city name doesn't match any GeoNames entry.
+  -- String match preferred; spatial fallback used when name doesn't match (Milano→Milan, etc.)
-  gn.geoname_id
+  COALESCE(gn.geoname_id, gs.spatial_geoname_id)            AS geoname_id
 FROM venue_cities vc
 LEFT JOIN country_income ci ON vc.country_code = ci.country_code
 -- Eurostat EU population (via city code→name lookup)
@@ -171,10 +199,14 @@ LEFT JOIN us_pop usa
 LEFT JOIN uk_pop uk
  ON vc.country_code = uk.country_code
  AND LOWER(TRIM(vc.city_name)) = LOWER(TRIM(uk.city_name))
--- GeoNames global fallback
+-- GeoNames string match (primary)
 LEFT JOIN geonames_pop gn
  ON vc.country_code = gn.country_code
  AND LOWER(TRIM(vc.city_name)) = LOWER(TRIM(gn.city_name))
+-- GeoNames spatial fallback (nearest within ~15km, for when name match fails)
+LEFT JOIN geonames_spatial gs
+  ON vc.country_code = gs.country_code
+  AND vc.city_slug = gs.city_slug
 -- Enforce grain: if two cities in the same country have the same slug
 -- (e.g. 'São Paulo' and 'Sao Paulo'), keep the one with more venues
 QUALIFY ROW_NUMBER() OVER (
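Reviewer note: the `geonames_spatial` CTE in this file's diff combines a cheap bounding-box pre-filter with a nearest-by-sphere-distance pick. A minimal Python sketch of the same two-step idea is below. It is illustrative only: the helper names and the sample rows are hypothetical, the coordinates are approximate, a haversine formula stands in for `ST_Distance_Sphere`, and only the Vienna population figure is taken from the changelog entry in this commit.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres
BBOX_DEG = 0.14               # bbox half-width used by the SQL pre-filter


def sphere_distance_m(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_geoname(city, candidates):
    """Return the closest same-country candidate inside the bbox, or None."""
    in_box = [
        g for g in candidates
        if g["country_code"] == city["country_code"]
        and abs(city["lat"] - g["lat"]) < BBOX_DEG
        and abs(city["lon"] - g["lon"]) < BBOX_DEG
    ]
    if not in_box:
        return None
    return min(
        in_box,
        key=lambda g: sphere_distance_m(city["lat"], city["lon"], g["lat"], g["lon"]),
    )


# Hypothetical sample rows (approximate coordinates, made-up secondary populations).
wien = {"country_code": "AT", "city_slug": "wien", "lat": 48.21, "lon": 16.37}
geonames = [
    {"city_name": "Vienna", "country_code": "AT", "lat": 48.2085, "lon": 16.3721,
     "population": 1_691_468},
    {"city_name": "Nearby town", "country_code": "AT", "lat": 48.30, "lon": 16.32,
     "population": 27_000},
    {"city_name": "Far city", "country_code": "AT", "lat": 47.07, "lon": 15.44,
     "population": 290_000},
]
match = nearest_geoname(wien, geonames)
print(match["city_name"], match["population"])
```

Note that 0.14° of latitude is roughly 15.5 km, while 0.14° of longitude shrinks with the cosine of the latitude, so the box is narrower east-west at European latitudes. As in the SQL shown in this diff, the sketch ranks by distance inside the box and applies no separate distance cut-off.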
--- a/transform/sqlmesh_padelnomics/models/serving/location_opportunity_profile.sql
+++ b/transform/sqlmesh_padelnomics/models/serving/location_opportunity_profile.sql
@@ -8,11 +8,10 @@
 --
 --   25 pts  addressable market — log-scaled population, ceiling 500K
 --           (opportunity peaks in mid-size cities; megacities already served)
---   20 pts  economic power     — country income PPS, normalised to 200
---                               NOTE: PPS values are country-level constants in the range
---                               18k-37k — ALL EU countries saturate this component (20/20).
---                               Component is a flat uplift per country until city-level
---                               income data becomes available.
+--   20 pts  economic power     — country income PPS, normalised to 35,000
+--                               EU PPS values range 18k-37k; /35k gives real spread.
+--                               DE ≈ 13.2pts, ES ≈ 10.7pts, SE ≈ 14.3pts.
+--                               Previously /200 caused all countries to saturate at 20/20.
 --   30 pts  supply gap         — INVERTED venue density; 0 courts/100K = full marks.
 --                               Ceiling raised to 8/100K (was 4) for a gentler gradient
 --                               and to account for ~87% data undercount vs FIP totals.
@@ -57,9 +56,13 @@ SELECT
    -- that can support a court but aren't already saturated by large-city operators.
    25.0 * LEAST(1.0, LN(GREATEST(l.population, 1)) / LN(500000))
-    -- Economic power (20 pts): country-level income PPS normalised to 200.
+    -- Economic power (20 pts): country-level income PPS normalised to 35,000.
    -- Drives willingness-to-pay for court fees (€20-35/hr target range).
-    + 20.0 * LEAST(1.0, COALESCE(l.median_income_pps, 100) / 200.0)
+    -- EU PPS values range 18k-37k; ceiling 35k gives meaningful spread.
+    -- v1 used /200 which caused LEAST(1.0, 115) = 1.0 for ALL countries (flat, no differentiation).
+    -- v2: /35000 → DE 0.66×20=13.2pts, ES 0.53×20=10.7pts, SE 0.71×20=14.3pts.
+    -- Default 15000 for missing data = reasonable developing-market assumption (~0.43).
+    + 20.0 * LEAST(1.0, COALESCE(l.median_income_pps, 15000) / 35000.0)
    -- Supply gap (30 pts): INVERTED venue density.
    -- 0 courts/100K = full 30 pts (white space); ≥8/100K = 0 pts (served market).
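Reviewer note: the effect of the divisor change on the economic power term can be checked with a small Python sketch. The function below is a hypothetical re-statement of the SQL expression `20.0 * LEAST(1.0, COALESCE(pps, default) / divisor)`, and the PPS input of 23,000 is an illustrative value chosen to match the ratio of 115 quoted in the comments, not a figure from the data.

```python
def economic_power_points(median_income_pps, divisor, default_pps, max_points=20.0):
    """Hypothetical mirror of 20.0 * LEAST(1.0, COALESCE(pps, default) / divisor)."""
    # COALESCE: fall back to the default when the country has no PPS value.
    pps = default_pps if median_income_pps is None else median_income_pps
    # LEAST(1.0, ...): the component saturates once the ratio reaches 1.0.
    return max_points * min(1.0, pps / divisor)


sample_pps = 23_000  # illustrative country-level PPS, inside the 18k-37k range

# v1: divisor 200 gives a ratio of 115, which clamps to 1.0 for any realistic PPS.
v1_points = economic_power_points(sample_pps, divisor=200.0, default_pps=100)
# v2: divisor 35,000 keeps the ratio below 1.0, so countries separate.
v2_points = economic_power_points(sample_pps, divisor=35_000.0, default_pps=15_000)
# v2 with missing data: the 15,000 default yields a ratio of about 0.43.
v2_missing = economic_power_points(None, divisor=35_000.0, default_pps=15_000)

print(v1_points, round(v2_points, 1), round(v2_missing, 1))
```

Under v1 every input in the stated 18k-37k range saturates at the full 20 points, which is why the component could not differentiate countries before this change.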