merge: opportunity score data quality improvements

Phase 0 — income ceiling fix (opportunity_score):
  PPS normalisation /200→/35000; economic power now differentiates countries
  (DE 13.2, ES 10.7, SE 14.3 pts; was 20.0 everywhere)

Phase 1b — overpass_tennis in workflows.toml:
  Monthly schedule added; was only in combined extractor

Phase 2b — dim_cities spatial population fallback:
  GeoNames spatial CTE (ST_Distance_Sphere, 0.14° bbox) resolves localization
  mismatches: Wien→1.69M, Milano→1.37M, München→1.49M
  Coverage: 70.5% → 98.5% (5,401/5,481 cities with population)
@@ -7,6 +7,12 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 ## [Unreleased]

 ### Changed
+- **Opportunity Score v2 — income ceiling fix** (`location_opportunity_profile.sql`): income PPS normalisation changed from `/200.0` (caused LEAST(1.0, 115)=1.0 for ALL countries — no differentiation) to `/35000.0` with country-spread-matched ceiling. Default for missing data changed from 100 to 15000 (developing-market assumption). Country scores now reflect real PPS spread: LU 20.0, SE 14.3, DE 13.2, ES 10.7, GB 10.5 pts (was 20.0 everywhere).
+- **dim_cities population coverage 70.5% → 98.5%** — added GeoNames spatial fallback CTE that finds the nearest GeoNames location within ~15 km when string name matching fails (~29% of cities). Fixes localization mismatches (Milano≠Milan, Wien≠Vienna, München≠Munich): Wien 0→1,691,468; Milano 0→1,371,498. Population cascade now: Eurostat EU > US Census > ONS UK > GeoNames string > GeoNames spatial > 0.
+
+### Added
+- **overpass_tennis** workflow added to `infra/supervisor/workflows.toml` — tennis courts extraction was only in the combined `all.py` extractor; now scheduled monthly by the supervisor so it runs automatically in production.
+
 - **Market Score v3 (Marktreife-Score recalibration)** — fixes ranking inversion where early-stage markets (Germany 1/100k) outscored mature markets (Spain 36/100k):
   - **Formula rewrite** (`city_market_profile.sql`): supply development now 40 pts (log-scaled density LN(d+1)/LN(21) × count gate min(1,count/5)); demand evidence 25 pts (occupancy or 40% density proxy); population reduced to 15 pts (context); income to 10 pts (context); data quality to 10 pts; saturation discount removed
   - **Count gate** eliminates small-town inflation: a single venue in a 5k-resident town can no longer outscore Berlin (was 92.7 → now 43.9 for Bernau bei Berlin)
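
The supply-development term quoted in the Market Score v3 entry can be sketched in Python to show what the count gate does. This is only an illustration of the changelog formula, not the code in `city_market_profile.sql`: the function name is hypothetical, and capping the log term at 1.0 is an assumption, since the entry does not say how densities above 20/100k are handled.

```python
import math


def supply_development_points(density_per_100k: float, venue_count: int) -> float:
    """Supply development component (max 40 pts) as described in the changelog."""
    # Log-scaled density: LN(d + 1) / LN(21).
    # ASSUMPTION: the ratio is capped at 1.0 for densities above 20/100k.
    density_term = min(1.0, math.log(density_per_100k + 1) / math.log(21))
    # Count gate: full weight only once a city has at least 5 venues.
    count_gate = min(1.0, venue_count / 5)
    return 40.0 * density_term * count_gate


# One venue in a 5k-resident town is a density of 20/100k, but the gate
# scales the component down to a fifth of what five venues would earn.
single_venue = supply_development_points(20.0, 1)
five_venues = supply_development_points(20.0, 5)
```

Under these assumptions the density term alone would give the single-venue town full marks; the gate is what removes the small-town inflation described above.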
@@ -1,7 +1,7 @@
 # Padelnomics — Project Tracker

 > Move tasks across columns as you work. Add new tasks at the top of the relevant column.
-> Last updated: 2026-02-27.
+> Last updated: 2026-02-27 (opportunity score data quality improvements).

 ---

@@ -89,6 +89,9 @@
 - [x] **Opportunity Score integration** — `opportunity_score` (Marktpotenzial) wired into city + country templates; `geoname_id` threaded through SQL chain (dim_cities → city_market_profile → pseo_city_costs_de); 71.4% city match rate; stats strip, intro paragraphs, market tables, and FAQ updated in both DE + EN
 - [x] **Market Score v3 recalibration** — fixes ranking inversion (Germany 1/100k was outscoring Spain 36/100k); log-scaled density + count gate replaces linear formula; saturation discount removed; template thresholds updated across all 3 pSEO templates; verified: Málaga 70.1, Barcelona 67.4, Madrid 66.9, Amsterdam 58.4, Bernau 43.9 (was 92.7), Berlin 42.2, London 44.1
 - [x] **Opportunity Score v2** — supply gap ceiling raised 4→8/100k (gentler gradient, accounts for 87% data undercount); formula documentation added (DuckDB LEAST NULL behaviour, income saturation, tennis data gap)
+- [x] **Opportunity Score v2 — income ceiling fix** — PPS normalisation `/200.0` → `/35000.0`; economic power component now differentiates countries (DE 13.2, ES 10.7, SE 14.3 pts; was 20.0 everywhere)
+- [x] **dim_cities population coverage 70.5% → 98.5%** — GeoNames spatial fallback CTE (ST_Distance_Sphere, 0.14° bbox) resolves localization mismatches (Wien→Vienna 1.69M, Milano→Milan 1.37M); population cascade: Eurostat > Census > ONS > GeoNames string > GeoNames spatial > 0
+- [x] **overpass_tennis added to supervisor workflows** — monthly schedule in `workflows.toml`; was only in combined extractor

 ### Data Pipeline (DaaS)
 - [x] Overpass API extractor (OSM padel courts)
@@ -13,6 +13,10 @@
 module = "padelnomics_extract.overpass"
 schedule = "monthly"

+[overpass_tennis]
+module = "padelnomics_extract.overpass_tennis"
+schedule = "monthly"
+
 [eurostat]
 module = "padelnomics_extract.eurostat"
 schedule = "monthly"
@@ -12,7 +12,9 @@
 -- stg_population_uk → ONS LAD population
 -- stg_population_geonames → GeoNames global fallback
 --
--- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames > 0.
+-- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames string > GeoNames spatial > 0.
+-- GeoNames spatial fallback: finds nearest location within ~15km when string name match fails.
+-- Fixes localization mismatches: Milano≠Milan, Wien≠Vienna, München≠Munich (~29% of cities).
 -- City name matching is case/whitespace-insensitive within each country.
 --
 -- Grain: (country_code, city_slug) — two cities in different countries can share a
@@ -75,9 +77,33 @@ uk_pop AS (
 ),
 -- GeoNames global fallback (all cities ≥50K)
 geonames_pop AS (
-    SELECT geoname_id, city_name, country_code, population, ref_year
+    SELECT geoname_id, city_name, country_code, lat, lon, population, ref_year
     FROM staging.stg_population_geonames
     QUALIFY ROW_NUMBER() OVER (PARTITION BY geoname_id ORDER BY ref_year DESC) = 1
+),
+-- GeoNames spatial fallback: for cities where string name match fails,
+-- find the nearest GeoNames location within ~15km.
+-- Fixes localization mismatches: Milano≠Milan, Wien≠Vienna, München≠Munich.
+-- Uses bbox pre-filter (ABS < 0.14°) then exact sphere distance, picks nearest.
+geonames_spatial AS (
+    SELECT
+        vc.country_code,
+        vc.city_slug,
+        gn.geoname_id AS spatial_geoname_id,
+        gn.population AS spatial_population,
+        gn.ref_year AS spatial_ref_year
+    FROM venue_cities vc
+    JOIN geonames_pop gn
+        ON vc.country_code = gn.country_code
+        AND ABS(vc.centroid_lat - gn.lat) < 0.14 -- ~15km bbox pre-filter
+        AND ABS(vc.centroid_lon - gn.lon) < 0.14
+    QUALIFY ROW_NUMBER() OVER (
+        PARTITION BY vc.country_code, vc.city_slug
+        ORDER BY ST_Distance_Sphere(
+            ST_Point(vc.centroid_lon, vc.centroid_lat),
+            ST_Point(gn.lon, gn.lat)
+        )
+    ) = 1
 )
 SELECT
     vc.country_code,
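
The `geonames_spatial` CTE above filters candidates with a cheap bounding box and then picks the nearest one by great-circle distance. A minimal Python sketch of that two-step idea follows. It is not the pipeline's code: the helper names and the dict layout are hypothetical, and a plain haversine formula stands in for `ST_Distance_Sphere`.

```python
import math

EARTH_RADIUS_KM = 6371.0
BBOX_DEG = 0.14  # same pre-filter width as the CTE


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two WGS84 points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_geoname(city: dict, candidates: list) -> "dict | None":
    """Return the closest same-country candidate inside the bbox, or None."""
    # Step 1: bounding-box pre-filter, mirroring the two ABS(...) < 0.14 conditions.
    in_bbox = [
        g for g in candidates
        if g["country_code"] == city["country_code"]
        and abs(city["lat"] - g["lat"]) < BBOX_DEG
        and abs(city["lon"] - g["lon"]) < BBOX_DEG
    ]
    if not in_bbox:
        return None
    # Step 2: exact distance, keep the nearest (ROW_NUMBER() ... = 1 in the SQL).
    return min(
        in_bbox,
        key=lambda g: haversine_km(city["lat"], city["lon"], g["lat"], g["lon"]),
    )
```

Note that 0.14° is roughly 15.6 km of latitude, but the same angle of longitude covers less ground away from the equator, so the box is narrower east to west than the "~15km" comment suggests. As in the CTE, nothing beyond the box limits the final distance.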
@@ -135,13 +161,14 @@ SELECT
     )) AS country_slug,
     vc.centroid_lat AS lat,
     vc.centroid_lon AS lon,
-    -- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames > 0.
-    -- City name match is case/whitespace-insensitive within each country.
+    -- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames string > GeoNames spatial > 0.
+    -- Spatial fallback activates only when all string matches fail (~29% of cities).
     COALESCE(
         ep.population,
         usa.population,
         uk.population,
         gn.population,
+        gs.spatial_population,
         0
     )::BIGINT AS population,
     COALESCE(
@@ -149,14 +176,15 @@ SELECT
         usa.ref_year,
         uk.ref_year,
         gn.ref_year,
+        gs.spatial_ref_year,
         0
     )::INTEGER AS population_year,
     vc.padel_venue_count,
     ci.median_income_pps,
     ci.income_year,
     -- GeoNames ID: FK to dim_locations / location_opportunity_profile.
-    -- NULL when city name doesn't match any GeoNames entry.
-    gn.geoname_id
+    -- String match preferred; spatial fallback used when name doesn't match (Milano→Milan, etc.)
+    COALESCE(gn.geoname_id, gs.spatial_geoname_id) AS geoname_id
 FROM venue_cities vc
 LEFT JOIN country_income ci ON vc.country_code = ci.country_code
 -- Eurostat EU population (via city code→name lookup)
@@ -171,10 +199,14 @@ LEFT JOIN us_pop usa
 LEFT JOIN uk_pop uk
     ON vc.country_code = uk.country_code
     AND LOWER(TRIM(vc.city_name)) = LOWER(TRIM(uk.city_name))
--- GeoNames global fallback
+-- GeoNames string match (primary)
 LEFT JOIN geonames_pop gn
     ON vc.country_code = gn.country_code
     AND LOWER(TRIM(vc.city_name)) = LOWER(TRIM(gn.city_name))
+-- GeoNames spatial fallback (nearest within ~15km, for when name match fails)
+LEFT JOIN geonames_spatial gs
+    ON vc.country_code = gs.country_code
+    AND vc.city_slug = gs.city_slug
 -- Enforce grain: if two cities in the same country have the same slug
 -- (e.g. 'São Paulo' and 'Sao Paulo'), keep the one with more venues
 QUALIFY ROW_NUMBER() OVER (
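
The population cascade in this file relies on `COALESCE`, which returns its first non-NULL argument. The sketch below restates that ordering in Python for readers less familiar with the SQL idiom. The function and parameter names are hypothetical, and each argument stands for the value produced by the corresponding LEFT JOIN (`None` where the join found no row).

```python
from typing import Optional


def resolve_population(
    eurostat: Optional[int],
    us_census: Optional[int],
    ons_uk: Optional[int],
    geonames_string: Optional[int],
    geonames_spatial: Optional[int],
) -> int:
    """First non-missing value wins, in cascade order; 0 if every source is missing."""
    for value in (eurostat, us_census, ons_uk, geonames_string, geonames_spatial):
        if value is not None:
            return int(value)
    return 0


# A city whose name matches no source by string, but which has a spatial hit
# (figure taken from the changelog entry for Wien):
wien = resolve_population(None, None, None, None, 1_691_468)
```

One consequence of this ordering is that an earlier source reporting 0 is still non-missing and therefore wins over a later spatial match; the spatial value is only used when every string match returned no row at all.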
@@ -8,11 +8,10 @@
 --
 -- 25 pts addressable market — log-scaled population, ceiling 500K
 -- (opportunity peaks in mid-size cities; megacities already served)
--- 20 pts economic power — country income PPS, normalised to 200
--- NOTE: PPS values are country-level constants in the range
--- 18k-37k — ALL EU countries saturate this component (20/20).
--- Component is a flat uplift per country until city-level
--- income data becomes available.
+-- 20 pts economic power — country income PPS, normalised to 35,000
+-- EU PPS values range 18k-37k; /35k gives real spread.
+-- DE ≈ 13.2pts, ES ≈ 10.7pts, SE ≈ 14.3pts.
+-- Previously /200 caused all countries to saturate at 20/20.
 -- 30 pts supply gap — INVERTED venue density; 0 courts/100K = full marks.
 -- Ceiling raised to 8/100K (was 4) for a gentler gradient
 -- and to account for ~87% data undercount vs FIP totals.
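
The header comment fixes only the two endpoints of the supply gap component (0 courts/100K gives full marks, the ceiling gives none). The sketch below assumes a straight line between them, which is a guess at the shape rather than something this diff shows. The function name is hypothetical.

```python
def supply_gap_points(venues_per_100k: float, ceiling: float = 8.0) -> float:
    """Inverted venue density, max 30 pts.

    ASSUMPTION: linear falloff from 0/100K (30 pts) to `ceiling`/100K (0 pts).
    """
    served = min(1.0, max(0.0, venues_per_100k) / ceiling)
    return 30.0 * (1.0 - served)


# Effect of raising the ceiling from 4 to 8 for a city at 4 venues/100K:
old_score = supply_gap_points(4.0, ceiling=4.0)
new_score = supply_gap_points(4.0, ceiling=8.0)
```

Under the linear assumption, a city at 4 venues/100K moves from no supply-gap points to half of them, which is one way to read the "gentler gradient" in the comment.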
@@ -57,9 +56,13 @@ SELECT
     -- that can support a court but aren't already saturated by large-city operators.
     25.0 * LEAST(1.0, LN(GREATEST(l.population, 1)) / LN(500000))

-    -- Economic power (20 pts): country-level income PPS normalised to 200.
+    -- Economic power (20 pts): country-level income PPS normalised to 35,000.
     -- Drives willingness-to-pay for court fees (€20-35/hr target range).
-    + 20.0 * LEAST(1.0, COALESCE(l.median_income_pps, 100) / 200.0)
+    -- EU PPS values range 18k-37k; ceiling 35k gives meaningful spread.
+    -- v1 used /200 which caused LEAST(1.0, 115) = 1.0 for ALL countries (flat, no differentiation).
+    -- v2: /35000 → DE 0.66×20=13.2pts, ES 0.53×20=10.7pts, SE 0.71×20=14.3pts.
+    -- Default 15000 for missing data = reasonable developing-market assumption (~0.43).
+    + 20.0 * LEAST(1.0, COALESCE(l.median_income_pps, 15000) / 35000.0)

     -- Supply gap (30 pts): INVERTED venue density.
     -- 0 courts/100K = full 30 pts (white space); ≥8/100K = 0 pts (served market).
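
To see why the old divisor flattened the economic power component, the expression from this hunk can be mirrored in Python and evaluated with both parameter sets. This is a sketch of the arithmetic only: the function name is hypothetical, and the input of 23,000 is an illustrative PPS value chosen because 23,000 / 200 gives the 115 quoted in the comment.

```python
from typing import Optional


def economic_power_points(
    median_income_pps: Optional[float],
    divisor: float = 35000.0,
    default: float = 15000.0,
) -> float:
    """Mirror of 20.0 * LEAST(1.0, COALESCE(pps, default) / divisor)."""
    pps = default if median_income_pps is None else median_income_pps
    return 20.0 * min(1.0, pps / divisor)


# v1 parameters: any PPS in the 18k-37k range is clamped to the full 20 pts.
v1_low = economic_power_points(18000, divisor=200.0, default=100.0)
v1_mid = economic_power_points(23000, divisor=200.0, default=100.0)

# v2 parameters: the same inputs now land at different points on the scale.
v2_low = economic_power_points(18000)
v2_mid = economic_power_points(23000)
v2_missing = economic_power_points(None)  # falls back to 15000 / 35000, about 0.43
```

With the new divisor only incomes at or above 35,000 PPS still reach the 20-point cap.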