From 9835176e8744dc1edbfad180054b27f5e9296d02 Mon Sep 17 00:00:00 2001
From: Deeman
Date: Fri, 27 Feb 2026 07:58:57 +0100
Subject: [PATCH 1/4] fix(sql): opportunity_score income ceiling /200→/35000 (economic power)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PPS values are 18k–37k but /200 normalisation caused LEAST(1.0, 115)=1.0
for ALL countries — 20pts flat uplift, zero differentiation.

Fix: /35000 creates real country spread:
LU 20.0pts, DE 13.2pts, ES 10.7pts, GB 10.5pts (vs 20.0 everywhere before)

Default for missing data 100→15000 (developing-market assumption, ~0.43).
Header comment updated to document v2 formula behaviour.

Co-Authored-By: Claude Sonnet 4.6
---
 .../serving/location_opportunity_profile.sql | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/transform/sqlmesh_padelnomics/models/serving/location_opportunity_profile.sql b/transform/sqlmesh_padelnomics/models/serving/location_opportunity_profile.sql
index 1258c30..b746cab 100644
--- a/transform/sqlmesh_padelnomics/models/serving/location_opportunity_profile.sql
+++ b/transform/sqlmesh_padelnomics/models/serving/location_opportunity_profile.sql
@@ -8,11 +8,10 @@
 --
 -- 25 pts addressable market — log-scaled population, ceiling 500K
 -- (opportunity peaks in mid-size cities; megacities already served)
--- 20 pts economic power — country income PPS, normalised to 200
--- NOTE: PPS values are country-level constants in the range
--- 18k-37k — ALL EU countries saturate this component (20/20).
--- Component is a flat uplift per country until city-level
--- income data becomes available.
+-- 20 pts economic power — country income PPS, normalised to 35,000
+-- EU PPS values range 18k-37k; /35k gives real spread.
+-- DE ≈ 13.2pts, ES ≈ 10.7pts, SE ≈ 14.3pts.
+-- Previously /200 caused all countries to saturate at 20/20.
 -- 30 pts supply gap — INVERTED venue density; 0 courts/100K = full marks.
 -- Ceiling raised to 8/100K (was 4) for a gentler gradient
 -- and to account for ~87% data undercount vs FIP totals.
@@ -57,9 +56,13 @@ SELECT
     -- that can support a court but aren't already saturated by large-city operators.
     25.0 * LEAST(1.0, LN(GREATEST(l.population, 1)) / LN(500000))

-    -- Economic power (20 pts): country-level income PPS normalised to 200.
+    -- Economic power (20 pts): country-level income PPS normalised to 35,000.
     -- Drives willingness-to-pay for court fees (€20-35/hr target range).
-    + 20.0 * LEAST(1.0, COALESCE(l.median_income_pps, 100) / 200.0)
+    -- EU PPS values range 18k-37k; ceiling 35k gives meaningful spread.
+    -- v1 used /200 which caused LEAST(1.0, 115) = 1.0 for ALL countries (flat, no differentiation).
+    -- v2: /35000 → DE 0.66×20=13.2pts, ES 0.53×20=10.7pts, SE 0.71×20=14.3pts.
+    -- Default 15000 for missing data = reasonable developing-market assumption (~0.43).
+    + 20.0 * LEAST(1.0, COALESCE(l.median_income_pps, 15000) / 35000.0)

     -- Supply gap (30 pts): INVERTED venue density.
     -- 0 courts/100K = full 30 pts (white space); ≥8/100K = 0 pts (served market).
From 6586eca921a2b544342279c762769f2150a19a75 Mon Sep 17 00:00:00 2001
From: Deeman
Date: Fri, 27 Feb 2026 07:59:12 +0100
Subject: [PATCH 2/4] feat(infra): add overpass_tennis to supervisor workflows
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Tennis extraction was missing from workflows.toml — only ran via the
combined `uv run extract` command, not automatically in production.

Schedule: monthly (same cadence as padel courts, OSM tennis data updates slowly).

Co-Authored-By: Claude Sonnet 4.6
---
 infra/supervisor/workflows.toml | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/infra/supervisor/workflows.toml b/infra/supervisor/workflows.toml
index 4f91a6b..5f5c43d 100644
--- a/infra/supervisor/workflows.toml
+++ b/infra/supervisor/workflows.toml
@@ -13,6 +13,10 @@
 module = "padelnomics_extract.overpass"
 schedule = "monthly"

+[overpass_tennis]
+module = "padelnomics_extract.overpass_tennis"
+schedule = "monthly"
+
 [eurostat]
 module = "padelnomics_extract.eurostat"
 schedule = "monthly"
From 3aa30ab419cae3242a45e6153010ff15d650841f Mon Sep 17 00:00:00 2001
From: Deeman
Date: Fri, 27 Feb 2026 08:47:26 +0100
Subject: [PATCH 3/4] feat(sql): dim_cities — GeoNames spatial population fallback
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a coordinate-based population lookup as a fallback when string name
matching fails (~29% of cities). Uses bbox pre-filter (0.14° ≈ 15 km) then
ST_Distance_Sphere to find the nearest GeoNames location in the same country.

Fixes localization mismatches: Milano≠Milan, Wien≠Vienna, München≠Munich.

Population cascade:
Eurostat EU > US Census > ONS UK > GeoNames string > GeoNames spatial > 0.

Coverage: 70.5% → 98.5% (5,401 / 5,481 cities with population > 0).
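
For readers who want to check the lookup logic outside the warehouse, here is a minimal Python sketch of the bbox pre-filter followed by a nearest-neighbour pick. It is an illustration under assumptions, not the model's code: the function names, the dict layout, and every coordinate are hypothetical, and a plain haversine formula stands in for the `ST_Distance_Sphere` call used in the SQL. Only the 0.14° threshold and the Wien population figure come from this patch.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption of this sketch


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_geoname(city, candidates, bbox_deg=0.14):
    """Return the closest same-country candidate inside the bbox, or None."""
    # Cheap pre-filter: same country and within bbox_deg on both axes.
    in_box = [
        g for g in candidates
        if g["country_code"] == city["country_code"]
        and abs(city["lat"] - g["lat"]) < bbox_deg
        and abs(city["lon"] - g["lon"]) < bbox_deg
    ]
    if not in_box:
        return None
    # Exact distance only for the survivors; keep the nearest one.
    return min(
        in_box,
        key=lambda g: haversine_m(city["lat"], city["lon"], g["lat"], g["lon"]),
    )


# Hypothetical rows: coordinates are approximate and for illustration only.
candidates = [
    {"name": "Vienna", "country_code": "AT", "lat": 48.2085, "lon": 16.3721, "population": 1_691_468},
    {"name": "Hypothetical town", "country_code": "AT", "lat": 48.30, "lon": 16.32, "population": 27_000},
    {"name": "Other-country point", "country_code": "SK", "lat": 48.20, "lon": 16.38, "population": 5_000},
]
wien = {"name": "Wien", "country_code": "AT", "lat": 48.21, "lon": 16.37}
far_away = {"name": "Nowhere", "country_code": "AT", "lat": 47.00, "lon": 10.00}

match = nearest_geoname(wien, candidates)
no_match = nearest_geoname(far_away, candidates)
print(match["name"], match["population"])
print(no_match)
```

A fixed 0.14° window is narrower in kilometres east-west than north-south at European latitudes, so the effective search radius in this sketch is somewhat below 15 km along the longitude axis.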

Key cities before/after:
  Wien: 0 → 1,691,468
  Milano: 0 → 1,371,498
  München: already matched by string; verified still correct at 1,488,719

Co-Authored-By: Claude Sonnet 4.6
---
 .../models/foundation/dim_cities.sql | 46 ++++++++++++++++---
 1 file changed, 39 insertions(+), 7 deletions(-)

diff --git a/transform/sqlmesh_padelnomics/models/foundation/dim_cities.sql b/transform/sqlmesh_padelnomics/models/foundation/dim_cities.sql
index 49ea369..b1b1067 100644
--- a/transform/sqlmesh_padelnomics/models/foundation/dim_cities.sql
+++ b/transform/sqlmesh_padelnomics/models/foundation/dim_cities.sql
@@ -12,7 +12,9 @@
 -- stg_population_uk → ONS LAD population
 -- stg_population_geonames → GeoNames global fallback
 --
--- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames > 0.
+-- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames string > GeoNames spatial > 0.
+-- GeoNames spatial fallback: finds nearest location within ~15km when string name match fails.
+-- Fixes localization mismatches: Milano≠Milan, Wien≠Vienna, München≠Munich (~29% of cities).
 -- City name matching is case/whitespace-insensitive within each country.
 --
 -- Grain: (country_code, city_slug) — two cities in different countries can share a
@@ -75,9 +77,33 @@ uk_pop AS (
 ),
 -- GeoNames global fallback (all cities ≥50K)
 geonames_pop AS (
-    SELECT geoname_id, city_name, country_code, population, ref_year
+    SELECT geoname_id, city_name, country_code, lat, lon, population, ref_year
     FROM staging.stg_population_geonames
     QUALIFY ROW_NUMBER() OVER (PARTITION BY geoname_id ORDER BY ref_year DESC) = 1
+),
+-- GeoNames spatial fallback: for cities where string name match fails,
+-- find the nearest GeoNames location within ~15km.
+-- Fixes localization mismatches: Milano≠Milan, Wien≠Vienna, München≠Munich.
+-- Uses bbox pre-filter (ABS < 0.14°) then exact sphere distance, picks nearest.
+geonames_spatial AS (
+    SELECT
+        vc.country_code,
+        vc.city_slug,
+        gn.geoname_id AS spatial_geoname_id,
+        gn.population AS spatial_population,
+        gn.ref_year AS spatial_ref_year
+    FROM venue_cities vc
+    JOIN geonames_pop gn
+        ON vc.country_code = gn.country_code
+        AND ABS(vc.centroid_lat - gn.lat) < 0.14 -- ~15km bbox pre-filter
+        AND ABS(vc.centroid_lon - gn.lon) < 0.14
+    QUALIFY ROW_NUMBER() OVER (
+        PARTITION BY vc.country_code, vc.city_slug
+        ORDER BY ST_Distance_Sphere(
+            ST_Point(vc.centroid_lon, vc.centroid_lat),
+            ST_Point(gn.lon, gn.lat)
+        )
+    ) = 1
 )
 SELECT
     vc.country_code,
@@ -135,13 +161,14 @@ SELECT
     )) AS country_slug,
     vc.centroid_lat AS lat,
     vc.centroid_lon AS lon,
-    -- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames > 0.
-    -- City name match is case/whitespace-insensitive within each country.
+    -- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames string > GeoNames spatial > 0.
+    -- Spatial fallback activates only when all string matches fail (~29% of cities).
     COALESCE(
         ep.population,
         usa.population,
         uk.population,
         gn.population,
+        gs.spatial_population,
         0
     )::BIGINT AS population,
     COALESCE(
@@ -149,14 +176,15 @@ SELECT
         usa.ref_year,
         uk.ref_year,
         gn.ref_year,
+        gs.spatial_ref_year,
         0
     )::INTEGER AS population_year,
     vc.padel_venue_count,
     ci.median_income_pps,
     ci.income_year,
     -- GeoNames ID: FK to dim_locations / location_opportunity_profile.
-    -- NULL when city name doesn't match any GeoNames entry.
-    gn.geoname_id
+    -- String match preferred; spatial fallback used when name doesn't match (Milano→Milan, etc.)
+    COALESCE(gn.geoname_id, gs.spatial_geoname_id) AS geoname_id
 FROM venue_cities vc
 LEFT JOIN country_income ci ON vc.country_code = ci.country_code
 -- Eurostat EU population (via city code→name lookup)
@@ -171,10 +199,14 @@ LEFT JOIN us_pop usa
 LEFT JOIN uk_pop uk
     ON vc.country_code = uk.country_code
     AND LOWER(TRIM(vc.city_name)) = LOWER(TRIM(uk.city_name))
--- GeoNames global fallback
+-- GeoNames string match (primary)
 LEFT JOIN geonames_pop gn
     ON vc.country_code = gn.country_code
     AND LOWER(TRIM(vc.city_name)) = LOWER(TRIM(gn.city_name))
+-- GeoNames spatial fallback (nearest within ~15km, for when name match fails)
+LEFT JOIN geonames_spatial gs
+    ON vc.country_code = gs.country_code
+    AND vc.city_slug = gs.city_slug
 -- Enforce grain: if two cities in the same country have the same slug
 -- (e.g. 'São Paulo' and 'Sao Paulo'), keep the one with more venues
 QUALIFY ROW_NUMBER() OVER (
From e32f7ba4b8a47c5ebb622f882475df61914cddd1 Mon Sep 17 00:00:00 2001
From: Deeman
Date: Fri, 27 Feb 2026 08:48:16 +0100
Subject: [PATCH 4/4] docs: CHANGELOG + PROJECT.md for opportunity score data quality improvements
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Documents Phase 0 (income ceiling fix), Phase 1b (overpass_tennis workflow),
and Phase 2b (dim_cities spatial population fallback, 70.5%→98.5% coverage).

Co-Authored-By: Claude Sonnet 4.6
---
 CHANGELOG.md | 6 ++++++
 PROJECT.md   | 5 ++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 1bae5b5..c9633c4 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,12 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 ## [Unreleased]

 ### Changed
+- **Opportunity Score v2 — income ceiling fix** (`location_opportunity_profile.sql`): income PPS normalisation changed from `/200.0` (caused LEAST(1.0, 115)=1.0 for ALL countries — no differentiation) to `/35000.0`, a ceiling matched to the 18k-37k range of country PPS values. Default for missing data changed from 100 to 15000 (developing-market assumption). Country scores now reflect real PPS spread: LU 20.0, SE 14.3, DE 13.2, ES 10.7, GB 10.5 pts (was 20.0 everywhere).
+- **dim_cities population coverage 70.5% → 98.5%** — added GeoNames spatial fallback CTE that finds the nearest GeoNames location within ~15 km when string name matching fails (~29% of cities). Fixes localization mismatches (Milano≠Milan, Wien≠Vienna, München≠Munich): Wien 0→1,691,468; Milano 0→1,371,498. Population cascade now: Eurostat EU > US Census > ONS UK > GeoNames string > GeoNames spatial > 0.
+
+### Added
+- **overpass_tennis** workflow added to `infra/supervisor/workflows.toml` — tennis courts extraction was only in the combined `all.py` extractor; now scheduled monthly by the supervisor so it runs automatically in production.
+
 - **Market Score v3 (Marktreife-Score recalibration)** — fixes ranking inversion where early-stage markets (Germany 1/100k) outscored mature markets (Spain 36/100k):
   - **Formula rewrite** (`city_market_profile.sql`): supply development now 40 pts (log-scaled density LN(d+1)/LN(21) × count gate min(1,count/5)); demand evidence 25 pts (occupancy or 40% density proxy); population reduced to 15 pts (context); income to 10 pts (context); data quality to 10 pts; saturation discount removed
   - **Count gate** eliminates small-town inflation: a single venue in a 5k-resident town can no longer outscore Berlin (was 92.7 → now 43.9 for Bernau bei Berlin)
diff --git a/PROJECT.md b/PROJECT.md
index 3dc0649..5411ca5 100644
--- a/PROJECT.md
+++ b/PROJECT.md
@@ -1,7 +1,7 @@
 # Padelnomics — Project Tracker

 > Move tasks across columns as you work. Add new tasks at the top of the relevant column.
-> Last updated: 2026-02-27.
+> Last updated: 2026-02-27 (opportunity score data quality improvements).

 ---

@@ -89,6 +89,9 @@
 - [x] **Opportunity Score integration** — `opportunity_score` (Marktpotenzial) wired into city + country templates; `geoname_id` threaded through SQL chain (dim_cities → city_market_profile → pseo_city_costs_de); 71.4% city match rate; stats strip, intro paragraphs, market tables, and FAQ updated in both DE + EN
 - [x] **Market Score v3 recalibration** — fixes ranking inversion (Germany 1/100k was outscoring Spain 36/100k); log-scaled density + count gate replaces linear formula; saturation discount removed; template thresholds updated across all 3 pSEO templates; verified: Málaga 70.1, Barcelona 67.4, Madrid 66.9, Amsterdam 58.4, Bernau 43.9 (was 92.7), Berlin 42.2, London 44.1
 - [x] **Opportunity Score v2** — supply gap ceiling raised 4→8/100k (gentler gradient, accounts for 87% data undercount); formula documentation added (DuckDB LEAST NULL behaviour, income saturation, tennis data gap)
+- [x] **Opportunity Score v2 — income ceiling fix** — PPS normalisation `/200.0` → `/35000.0`; economic power component now differentiates countries (DE 13.2, ES 10.7, SE 14.3 pts; was 20.0 everywhere)
+- [x] **dim_cities population coverage 70.5% → 98.5%** — GeoNames spatial fallback CTE (ST_Distance_Sphere, 0.14° bbox) resolves localization mismatches (Wien→Vienna 1.69M, Milano→Milan 1.37M); population cascade: Eurostat > Census > ONS > GeoNames string > GeoNames spatial > 0
+- [x] **overpass_tennis added to supervisor workflows** — monthly schedule in `workflows.toml`; was only in combined extractor

 ### Data Pipeline (DaaS)
 - [x] Overpass API extractor (OSM padel courts)
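
The income ceiling change in PATCH 1/4 can be checked by hand with a small Python sketch of the expression `20.0 * LEAST(1.0, COALESCE(l.median_income_pps, 15000) / 35000.0)`. The function name and the sample PPS inputs are hypothetical and chosen only to show the shape of the curve; they are not the Eurostat figures behind the country scores quoted in the patches.

```python
def economic_power_points(median_income_pps=None, ceiling=35_000.0,
                          default=15_000.0, max_points=20.0):
    """Sketch of the economic power component: capped linear scaling of income PPS."""
    # COALESCE: fall back to the default when income is missing.
    income = default if median_income_pps is None else median_income_pps
    # LEAST(1.0, ...): cap the ratio so the component never exceeds max_points.
    return max_points * min(1.0, income / ceiling)


# v2 behaviour (ceiling 35,000, default 15,000); inputs are illustrative only.
print(f"{economic_power_points(35_000):.1f}")  # 20.0, at the ceiling
print(f"{economic_power_points(23_000):.1f}")  # 13.1, mid-range value
print(f"{economic_power_points(None):.1f}")    # 8.6, missing data uses the default

# v1 behaviour (ceiling 200, default 100): any realistic PPS value saturates.
print(f"{economic_power_points(23_000, ceiling=200.0, default=100.0):.1f}")  # 20.0
```

With the old ceiling every input in the 18k-37k range hits the cap, which is why the component added a flat 20 points for every country before the fix.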