feat(data): Sprint 1-5 population pipeline — city labels, US/UK/Global extractors

Part A: Data Layer — Sprints 1-5

Sprint 1 — Eurostat SDMX city labels (unblocks EU population):
- New extractor: eurostat_city_labels.py — fetches ESTAT/CITIES codelist
  (city_code → city_name mapping) with ETag dedup
- New staging model: stg_city_labels.sql — grain city_code
- Updated dim_cities.sql — joins Eurostat population via city code lookup;
  replaces hardcoded 0::BIGINT population

Sprint 2 — Market score formula v2:
- city_market_profile.sql: 30pt population (LN/1M), 25pt income PPS (/200),
  30pt demand (occupancy or density), 15pt data confidence
- Moved venue_pricing_benchmarks join into base CTE so median_occupancy_rate
  is available to the scoring formula

Sprint 3 — US Census ACS extractor:
- New extractor: census_usa.py — ACS 5-year place population (vintage 2023)
- New staging model: stg_population_usa.sql — grain (place_fips, ref_year)

Sprint 4 — ONS UK extractor:
- New extractor: ons_uk.py — 2021 Census LAD population via ONS beta API
- New staging model: stg_population_uk.sql — grain (lad_code, ref_year)

Sprint 5 — GeoNames global extractor:
- New extractor: geonames.py — cities15000.zip bulk download, filtered to ≥50K pop
- New staging model: stg_population_geonames.sql — grain geoname_id
- dim_cities: 5-source population cascade (Eurostat > Census > ONS > GeoNames > 0)
  with case/whitespace-insensitive city name matching

Registered all 4 new CLI entrypoints in pyproject.toml and all.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Deeman
2026-02-24 00:07:08 +01:00
parent e76b6b4715
commit 0960990373
12 changed files with 860 additions and 32 deletions

View File

@@ -3,14 +3,17 @@
-- tracks cities where padel venues actually exist, not an administrative city list.
--
-- Conformed dimension: used by city_market_profile and all pSEO serving models.
-- Integrates two sources:
-- dim_venues → city list, venue count, coordinates (Playtomic + OSM)
-- stg_income → country-level median income (Eurostat)
-- Integrates four sources:
-- dim_venues → city list, venue count, coordinates (Playtomic + OSM)
-- stg_income → country-level median income (Eurostat)
-- stg_city_labels → Eurostat city_code → city_name mapping (EU cities)
-- stg_population → Eurostat city-level population (EU, joined via city code)
-- stg_population_usa → US Census ACS place population
-- stg_population_uk → ONS LAD population
-- stg_population_geonames → GeoNames global fallback
--
-- Population note: Eurostat uses coded identifiers (e.g. DE001C = Berlin) with no
-- city name column in the dataset we extract. City-level population requires a
-- separate code→name lookup extract (future improvement). Population is set to 0
-- until that source is available; market_score degrades gracefully.
-- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames > 0.
-- City name matching is case/whitespace-insensitive within each country.
--
-- Grain: (country_code, city_slug) — two cities in different countries can share a
-- city name. QUALIFY enforces no duplicate (country_code, city_slug) pairs.
@@ -42,6 +45,39 @@ country_income AS (
SELECT country_code, median_income_pps, ref_year AS income_year
FROM staging.stg_income
QUALIFY ROW_NUMBER() OVER (PARTITION BY country_code ORDER BY ref_year DESC) = 1
),
-- Eurostat EU population: join city labels (code→name) with population values.
-- QUALIFY keeps only the most recent year per (country, city name).
eurostat_pop AS (
SELECT
cl.country_code,
cl.city_name,
p.population,
p.ref_year
FROM staging.stg_city_labels cl
JOIN staging.stg_population p ON cl.city_code = p.city_code
QUALIFY ROW_NUMBER() OVER (
PARTITION BY cl.country_code, cl.city_name
ORDER BY p.ref_year DESC
) = 1
),
-- US Census ACS population (place-level, filtered to ≥50K)
us_pop AS (
SELECT city_name, country_code, population, ref_year
FROM staging.stg_population_usa
QUALIFY ROW_NUMBER() OVER (PARTITION BY place_fips ORDER BY ref_year DESC) = 1
),
-- ONS UK Local Authority District population
uk_pop AS (
SELECT lad_name AS city_name, country_code, population, ref_year
FROM staging.stg_population_uk
QUALIFY ROW_NUMBER() OVER (PARTITION BY lad_code ORDER BY ref_year DESC) = 1
),
-- GeoNames global fallback (all cities ≥50K)
geonames_pop AS (
SELECT city_name, country_code, population, ref_year
FROM staging.stg_population_geonames
QUALIFY ROW_NUMBER() OVER (PARTITION BY geoname_id ORDER BY ref_year DESC) = 1
)
SELECT
vc.country_code,
@@ -99,15 +135,43 @@ SELECT
)) AS country_slug,
vc.centroid_lat AS lat,
vc.centroid_lon AS lon,
-- Population: requires code→name Eurostat lookup (not yet extracted); defaults to 0.
-- market_score uses LOG(GREATEST(population, 1)) so 0 degrades score gracefully.
0::BIGINT AS population,
0::INTEGER AS population_year,
-- Population cascade: Eurostat EU > US Census > ONS UK > GeoNames > 0.
-- City name match is case/whitespace-insensitive within each country.
COALESCE(
ep.population,
usa.population,
uk.population,
gn.population,
0
)::BIGINT AS population,
COALESCE(
ep.ref_year,
usa.ref_year,
uk.ref_year,
gn.ref_year,
0
)::INTEGER AS population_year,
vc.padel_venue_count,
ci.median_income_pps,
ci.income_year
FROM venue_cities vc
LEFT JOIN country_income ci ON vc.country_code = ci.country_code
-- Eurostat EU population (via city code→name lookup)
LEFT JOIN eurostat_pop ep
ON vc.country_code = ep.country_code
AND LOWER(TRIM(vc.city_name)) = LOWER(TRIM(ep.city_name))
-- US Census population
LEFT JOIN us_pop usa
ON vc.country_code = usa.country_code
AND LOWER(TRIM(vc.city_name)) = LOWER(TRIM(usa.city_name))
-- ONS UK population
LEFT JOIN uk_pop uk
ON vc.country_code = uk.country_code
AND LOWER(TRIM(vc.city_name)) = LOWER(TRIM(uk.city_name))
-- GeoNames global fallback
LEFT JOIN geonames_pop gn
ON vc.country_code = gn.country_code
AND LOWER(TRIM(vc.city_name)) = LOWER(TRIM(gn.city_name))
-- Enforce grain: if two cities in the same country have the same slug
-- (e.g. 'São Paulo' and 'Sao Paulo'), keep the one with more venues
QUALIFY ROW_NUMBER() OVER (