merge: opportunity score data quality improvements

Phase 0 — income ceiling fix (opportunity_score):
  PPS normalisation /200→/35000; economic power now differentiates
  countries (DE 13.2, ES 10.7, SE 14.3 pts; was 20.0 everywhere)

Phase 1b — overpass_tennis in workflows.toml:
  Monthly schedule added; was only in combined extractor

Phase 2b — dim_cities spatial population fallback:
  GeoNames spatial CTE (ST_Distance_Sphere, 0.14° bbox) resolves
  localization mismatches: Wien→1.69M, Milano→1.37M, München→1.49M
  Coverage: 70.5% → 98.5% (5,401/5,481 cities with population)
Deeman
2026-02-27 08:52:35 +01:00
5 changed files with 63 additions and 15 deletions


@@ -7,6 +7,12 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 ## [Unreleased]
 ### Changed
+- **Opportunity Score v2 — income ceiling fix** (`location_opportunity_profile.sql`): income PPS normalisation changed from `/200.0` (caused LEAST(1.0, 115)=1.0 for ALL countries — no differentiation) to `/35000.0` with country-spread-matched ceiling. Default for missing data changed from 100 to 15000 (developing-market assumption). Country scores now reflect real PPS spread: LU 20.0, SE 14.3, DE 13.2, ES 10.7, GB 10.5 pts (was 20.0 everywhere).
+- **dim_cities population coverage 70.5% → 98.5%** — added GeoNames spatial fallback CTE that finds the nearest GeoNames location within ~15 km when string name matching fails (~29% of cities). Fixes localization mismatches (Milano≠Milan, Wien≠Vienna, München≠Munich): Wien 0→1,691,468; Milano 0→1,371,498. Population cascade now: Eurostat EU > US Census > ONS UK > GeoNames string > GeoNames spatial > 0.
+### Added
+- **overpass_tennis** workflow added to `infra/supervisor/workflows.toml` — tennis courts extraction was only in the combined `all.py` extractor; now scheduled monthly by the supervisor so it runs automatically in production.
 - **Market Score v3 (Marktreife-Score recalibration)** — fixes ranking inversion where early-stage markets (Germany 1/100k) outscored mature markets (Spain 36/100k):
 - **Formula rewrite** (`city_market_profile.sql`): supply development now 40 pts (log-scaled density LN(d+1)/LN(21) × count gate min(1,count/5)); demand evidence 25 pts (occupancy or 40% density proxy); population reduced to 15 pts (context); income to 10 pts (context); data quality to 10 pts; saturation discount removed
 - **Count gate** eliminates small-town inflation: a single venue in a 5k-resident town can no longer outscore Berlin (was 92.7 → now 43.9 for Bernau bei Berlin)
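
The supply-development term quoted in the Market Score v3 entry can be sketched in a few lines of Python. This is a minimal illustration, not the SQL in `city_market_profile.sql`: the function name is hypothetical, and capping the density term at 1.0 (so that densities above 20/100k do not exceed 40 pts) is an assumption that the changelog text does not state.

```python
import math


def supply_development_points(density_per_100k, venue_count):
    """Hypothetical helper: 40 pts x log-scaled density x count gate."""
    # LN(d+1)/LN(21) reaches 1.0 at 20 venues per 100k residents.
    # Capping at 1.0 is an assumption, not confirmed by the changelog.
    density_term = min(1.0, math.log(density_per_100k + 1) / math.log(21))
    # Count gate: full weight only from 5 venues upwards.
    count_gate = min(1.0, venue_count / 5)
    return 40.0 * density_term * count_gate


# One venue in a 5k-resident town is a density of 20 per 100k.
small_town = supply_development_points(density_per_100k=20, venue_count=1)
# A large early-stage market: low density, many venues in absolute terms.
early_stage = supply_development_points(density_per_100k=1, venue_count=100)
print(f"small town: {small_town:.1f} pts, early stage: {early_stage:.1f} pts")
```

Under these assumptions the single-venue town is held to one fifth of the 40 pts its density alone would earn, which is the small-town inflation the count gate is meant to remove.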


@@ -1,7 +1,7 @@
 # Padelnomics — Project Tracker
 > Move tasks across columns as you work. Add new tasks at the top of the relevant column.
-> Last updated: 2026-02-27.
+> Last updated: 2026-02-27 (opportunity score data quality improvements).
 ---
@@ -89,6 +89,9 @@
 - [x] **Opportunity Score integration** — `opportunity_score` (Marktpotenzial) wired into city + country templates; `geoname_id` threaded through SQL chain (dim_cities → city_market_profile → pseo_city_costs_de); 71.4% city match rate; stats strip, intro paragraphs, market tables, and FAQ updated in both DE + EN
 - [x] **Market Score v3 recalibration** — fixes ranking inversion (Germany 1/100k was outscoring Spain 36/100k); log-scaled density + count gate replaces linear formula; saturation discount removed; template thresholds updated across all 3 pSEO templates; verified: Málaga 70.1, Barcelona 67.4, Madrid 66.9, Amsterdam 58.4, Bernau 43.9 (was 92.7), Berlin 42.2, London 44.1
 - [x] **Opportunity Score v2** — supply gap ceiling raised 4→8/100k (gentler gradient, accounts for 87% data undercount); formula documentation added (DuckDB LEAST NULL behaviour, income saturation, tennis data gap)
+- [x] **Opportunity Score v2 — income ceiling fix** — PPS normalisation `/200.0` → `/35000.0`; economic power component now differentiates countries (DE 13.2, ES 10.7, SE 14.3 pts; was 20.0 everywhere)
+- [x] **dim_cities population coverage 70.5% → 98.5%** — GeoNames spatial fallback CTE (ST_Distance_Sphere, 0.14° bbox) resolves localization mismatches (Wien→Vienna 1.69M, Milano→Milan 1.37M); population cascade: Eurostat > Census > ONS > GeoNames string > GeoNames spatial > 0
+- [x] **overpass_tennis added to supervisor workflows** — monthly schedule in `workflows.toml`; was only in combined extractor
 ### Data Pipeline (DaaS)
 - [x] Overpass API extractor (OSM padel courts)


@@ -13,6 +13,10 @@
 module = "padelnomics_extract.overpass"
 schedule = "monthly"
+[overpass_tennis]
+module = "padelnomics_extract.overpass_tennis"
+schedule = "monthly"
 [eurostat]
 module = "padelnomics_extract.eurostat"
 schedule = "monthly"


@@ -12,7 +12,9 @@
 -- stg_population_uk → ONS LAD population
 -- stg_population_geonames → GeoNames global fallback
 --
--- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames > 0.
+-- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames string > GeoNames spatial > 0.
+-- GeoNames spatial fallback: finds nearest location within ~15km when string name match fails.
+-- Fixes localization mismatches: Milano≠Milan, Wien≠Vienna, München≠Munich (~29% of cities).
 -- City name matching is case/whitespace-insensitive within each country.
 --
 -- Grain: (country_code, city_slug) — two cities in different countries can share a
@@ -75,9 +77,33 @@ uk_pop AS (
 ),
 -- GeoNames global fallback (all cities ≥50K)
 geonames_pop AS (
-SELECT geoname_id, city_name, country_code, population, ref_year
+SELECT geoname_id, city_name, country_code, lat, lon, population, ref_year
 FROM staging.stg_population_geonames
 QUALIFY ROW_NUMBER() OVER (PARTITION BY geoname_id ORDER BY ref_year DESC) = 1
+),
+-- GeoNames spatial fallback: for cities where string name match fails,
+-- find the nearest GeoNames location within ~15km.
+-- Fixes localization mismatches: Milano≠Milan, Wien≠Vienna, München≠Munich.
+-- Uses bbox pre-filter (ABS < 0.14°) then exact sphere distance, picks nearest.
+geonames_spatial AS (
+SELECT
+vc.country_code,
+vc.city_slug,
+gn.geoname_id AS spatial_geoname_id,
+gn.population AS spatial_population,
+gn.ref_year AS spatial_ref_year
+FROM venue_cities vc
+JOIN geonames_pop gn
+ON vc.country_code = gn.country_code
+AND ABS(vc.centroid_lat - gn.lat) < 0.14 -- ~15km bbox pre-filter
+AND ABS(vc.centroid_lon - gn.lon) < 0.14
+QUALIFY ROW_NUMBER() OVER (
+PARTITION BY vc.country_code, vc.city_slug
+ORDER BY ST_Distance_Sphere(
+ST_Point(vc.centroid_lon, vc.centroid_lat),
+ST_Point(gn.lon, gn.lat)
+)
+) = 1
 )
 SELECT
 vc.country_code,
@@ -135,13 +161,14 @@ SELECT
 )) AS country_slug,
 vc.centroid_lat AS lat,
 vc.centroid_lon AS lon,
--- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames > 0.
--- City name match is case/whitespace-insensitive within each country.
+-- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames string > GeoNames spatial > 0.
+-- Spatial fallback activates only when all string matches fail (~29% of cities).
 COALESCE(
 ep.population,
 usa.population,
 uk.population,
 gn.population,
+gs.spatial_population,
 0
 )::BIGINT AS population,
 COALESCE(
@@ -149,14 +176,15 @@ SELECT
 usa.ref_year,
 uk.ref_year,
 gn.ref_year,
+gs.spatial_ref_year,
 0
 )::INTEGER AS population_year,
 vc.padel_venue_count,
 ci.median_income_pps,
 ci.income_year,
 -- GeoNames ID: FK to dim_locations / location_opportunity_profile.
--- NULL when city name doesn't match any GeoNames entry.
-gn.geoname_id
+-- String match preferred; spatial fallback used when name doesn't match (Milano→Milan, etc.)
+COALESCE(gn.geoname_id, gs.spatial_geoname_id) AS geoname_id
 FROM venue_cities vc
 LEFT JOIN country_income ci ON vc.country_code = ci.country_code
 -- Eurostat EU population (via city code→name lookup)
@@ -171,10 +199,14 @@ LEFT JOIN us_pop usa
 LEFT JOIN uk_pop uk
 ON vc.country_code = uk.country_code
 AND LOWER(TRIM(vc.city_name)) = LOWER(TRIM(uk.city_name))
--- GeoNames global fallback
+-- GeoNames string match (primary)
 LEFT JOIN geonames_pop gn
 ON vc.country_code = gn.country_code
 AND LOWER(TRIM(vc.city_name)) = LOWER(TRIM(gn.city_name))
+-- GeoNames spatial fallback (nearest within ~15km, for when name match fails)
+LEFT JOIN geonames_spatial gs
+ON vc.country_code = gs.country_code
+AND vc.city_slug = gs.city_slug
 -- Enforce grain: if two cities in the same country have the same slug
 -- (e.g. 'São Paulo' and 'Sao Paulo'), keep the one with more venues
 QUALIFY ROW_NUMBER() OVER (
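
The last two tiers of the population cascade (GeoNames string match, then GeoNames spatial fallback) can be illustrated outside the database with a short Python sketch. It mirrors the logic of the `geonames_spatial` CTE under stated assumptions: the haversine function stands in for `ST_Distance_Sphere`, the Eurostat, US Census and ONS tiers are left out, and the coordinates and the Graz population are illustrative values rather than rows from `stg_population_geonames`. Only the Vienna population is taken from the changelog entry.

```python
import math

EARTH_RADIUS_KM = 6371.0
BBOX_DEG = 0.14  # same pre-filter width as the SQL (~15 km of latitude)

# Illustrative GeoNames rows; coordinates are approximate.
GEONAMES = [
    {"country_code": "AT", "city_name": "Vienna", "lat": 48.2085,
     "lon": 16.3721, "population": 1_691_468},
    {"country_code": "AT", "city_name": "Graz", "lat": 47.0667,
     "lon": 15.4500, "population": 290_000},  # illustrative figure
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (stand-in for ST_Distance_Sphere)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def resolve_population(city, geonames):
    """Return (population, source) using string match, then nearest-in-bbox."""
    rows = [g for g in geonames if g["country_code"] == city["country_code"]]

    # Tier 1: case/whitespace-insensitive name match within the country.
    key = city["city_name"].strip().lower()
    for g in rows:
        if g["city_name"].strip().lower() == key:
            return g["population"], "geonames_string"

    # Tier 2: cheap bounding-box filter, then exact distance, keep nearest.
    nearby = [g for g in rows
              if abs(city["lat"] - g["lat"]) < BBOX_DEG
              and abs(city["lon"] - g["lon"]) < BBOX_DEG]
    if nearby:
        nearest = min(nearby, key=lambda g: haversine_km(
            city["lat"], city["lon"], g["lat"], g["lon"]))
        return nearest["population"], "geonames_spatial"

    return 0, "none"


wien = {"country_code": "AT", "city_name": "Wien", "lat": 48.21, "lon": 16.37}
print(resolve_population(wien, GEONAMES))
```

The bounding box keeps the expensive distance calculation to a handful of candidates per city. Note that a fixed 0.14° of longitude covers less ground at higher latitudes, so the effective search radius east to west is somewhat under 15 km for most of Europe.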


@@ -8,11 +8,10 @@
 --
 -- 25 pts addressable market — log-scaled population, ceiling 500K
 -- (opportunity peaks in mid-size cities; megacities already served)
--- 20 pts economic power — country income PPS, normalised to 200
--- NOTE: PPS values are country-level constants in the range
--- 18k-37k — ALL EU countries saturate this component (20/20).
--- Component is a flat uplift per country until city-level
--- income data becomes available.
+-- 20 pts economic power — country income PPS, normalised to 35,000
+-- EU PPS values range 18k-37k; /35k gives real spread.
+-- DE ≈ 13.2pts, ES ≈ 10.7pts, SE ≈ 14.3pts.
+-- Previously /200 caused all countries to saturate at 20/20.
 -- 30 pts supply gap — INVERTED venue density; 0 courts/100K = full marks.
 -- Ceiling raised to 8/100K (was 4) for a gentler gradient
 -- and to account for ~87% data undercount vs FIP totals.
@@ -57,9 +56,13 @@ SELECT
 -- that can support a court but aren't already saturated by large-city operators.
 25.0 * LEAST(1.0, LN(GREATEST(l.population, 1)) / LN(500000))
--- Economic power (20 pts): country-level income PPS normalised to 200.
+-- Economic power (20 pts): country-level income PPS normalised to 35,000.
 -- Drives willingness-to-pay for court fees (€20-35/hr target range).
-+ 20.0 * LEAST(1.0, COALESCE(l.median_income_pps, 100) / 200.0)
+-- EU PPS values range 18k-37k; ceiling 35k gives meaningful spread.
+-- v1 used /200 which caused LEAST(1.0, 115) = 1.0 for ALL countries (flat, no differentiation).
+-- v2: /35000 → DE 0.66×20=13.2pts, ES 0.53×20=10.7pts, SE 0.71×20=14.3pts.
+-- Default 15000 for missing data = reasonable developing-market assumption (~0.43).
++ 20.0 * LEAST(1.0, COALESCE(l.median_income_pps, 15000) / 35000.0)
 -- Supply gap (30 pts): INVERTED venue density.
 -- 0 courts/100K = full 30 pts (white space); ≥8/100K = 0 pts (served market).
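
The effect of the divisor change on the economic power component can be checked with a small Python sketch of the same expression. The function name is hypothetical, and the PPS inputs are back-solved from the scores quoted in the commit (for example 13.2 / 20 × 35000 = 23100 for DE); they are not actual Eurostat figures.

```python
def economic_power_points(median_income_pps, divisor, default):
    """20 pts x LEAST(1.0, COALESCE(pps, default) / divisor), in Python."""
    pps = default if median_income_pps is None else median_income_pps
    return 20.0 * min(1.0, pps / divisor)


# Hypothetical inputs, back-solved from the reported v2 scores.
sample_pps = {"DE": 23100, "ES": 18725, "SE": 25025}

for country, pps in sample_pps.items():
    v1 = economic_power_points(pps, divisor=200.0, default=100)
    v2 = economic_power_points(pps, divisor=35000.0, default=15000)
    print(f"{country}: v1={v1:.1f} pts, v2={v2:.1f} pts")

# A country with no income data falls back to the v2 default of 15000.
missing = economic_power_points(None, divisor=35000.0, default=15000)
print(f"missing data: {missing:.1f} pts")
```

With the old divisor every input in the 18k-37k range hits the `min(1.0, ...)` cap and scores the full 20 pts. With the new divisor only values at or above 35,000 saturate, and the missing-data default lands at roughly 0.43 of the maximum.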