feat(data): Phase 2b complete — EU NUTS-2 spatial join + US state income

- stg_regional_income: expanded NUTS-1+2 (LENGTH IN 3,4), nuts_code rename, nuts_level
- stg_nuts2_boundaries: new — ST_Read GISCO GeoJSON, bbox columns for spatial pre-filter
- stg_income_usa: new — Census ACS state-level income staging model
- dim_locations: spatial join replaces admin1_to_nuts1 VALUES CTE; us_income CTE with
  PPS normalisation (income/80610×30000); income cascade: NUTS-2→NUTS-1→US state→country
- init_landing_seeds: compress=False for ST_Read files; gisco GeoJSON + census income seeds
- CHANGELOG + PROJECT.md updated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Deeman
2026-02-27 11:03:16 +01:00
parent 409dc4bfac
commit c3531bd75d
6 changed files with 228 additions and 42 deletions

View File

@@ -14,6 +14,15 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
- `init_landing_seeds.py`: seed entry for `eurostat/1970/01/nama_10r_2hhinc.json.gz`.
- Verified income spread: Bayern (DE2) ~29K PPS > Hamburg (DE6) ~27K > Berlin (DE3) ~24K > Sachsen (DED) ~19K PPS. Non-mapped countries (ES, FR, IT) continue with country-level fallback.
- **Phase 2b — EU NUTS-2 spatial join + US state income** (`dim_locations`): all EU-27 + EFTA + UK locations now resolve to their NUTS-2 region automatically via a spatial join; US locations now use Census ACS state-level income instead of a flat country fallback.
- `stg_regional_income.sql`: expanded from NUTS-1 only (`LENGTH = 3`) to NUTS-1 + NUTS-2 (`LENGTH IN (3,4)`); column renamed `nuts1_code → nuts_code`; added `nuts_level` derived column (1 or 2).
- `scripts/download_gisco_nuts.py`: new one-time download script for NUTS-2 boundary GeoJSON from Eurostat GISCO (`NUTS_RG_20M_2021_4326_LEVL_2.geojson`, ~5 MB, NUTS revision 2021). Saves uncompressed — `ST_Read` cannot read `.gz`.
- `stg_nuts2_boundaries.sql`: new staging model — reads GeoJSON via `ST_Read`; extracts `nuts2_code`, `country_code`, `geometry`, and pre-computed bbox columns (`bbox_lat_min/max`, `bbox_lon_min/max`) for spatial pre-filter; normalises `EL→GR` / `UK→GB`. Grain: `nuts2_code`.
- `census_usa_income.py`: new extractor — fetches `B19013_001E` (median household income) at state level from Census ACS 5-year; saves to `census_usa/{year}/{month}/acs5_state_income.json.gz`; registered in `all.py` and `pyproject.toml`.
- `stg_income_usa.sql`: new staging model for US state income. Grain: `(state_fips, ref_year)`. Income kept in nominal USD — PPS conversion happens in `dim_locations`.
- `dim_locations.sql`: replaced `admin1_to_nuts1` VALUES CTE (16 DE rows) with full spatial join: `nuts2_match` (bbox pre-filter + `ST_Contains`) → `nuts2_income` / `nuts1_income` (latest year per level) → `regional_income` (COALESCE NUTS-2 → NUTS-1). Added `us_state_fips` (51-row VALUES CTE, admin1 abbreviation → FIPS) + `us_income` (PPS normalisation: `state_income / 80610.0 × 30000.0`). Final income cascade: EU NUTS-2 → EU NUTS-1 → US state → country-level. Germany now resolves to 38 Regierungsbezirke; Spain, France, Italy, Netherlands etc. all get NUTS-2 differentiation automatically.
- `init_landing_seeds.py`: `create_seed` extended with `compress=False` for files consumed by `ST_Read` (cannot read `.gz`); added `census_usa/1970/01/acs5_state_income.json.gz` seed and uncompressed `gisco/1970/01/nuts2_boundaries.geojson` empty-FeatureCollection seed.
### Changed
- **Opportunity Score v2 — income ceiling fix** (`location_opportunity_profile.sql`): income PPS normalisation changed from `/200.0` (caused LEAST(1.0, 115)=1.0 for ALL countries — no differentiation) to `/35000.0` with country-spread-matched ceiling. Default for missing data changed from 100 to 15000 (developing-market assumption). Country scores now reflect real PPS spread: LU 20.0, SE 14.3, DE 13.2, ES 10.7, GB 10.5 pts (was 20.0 everywhere).
- **dim_cities population coverage 70.5% → 98.5%** — added GeoNames spatial fallback CTE that finds the nearest GeoNames location within ~15 km when string name matching fails (~29% of cities). Fixes localization mismatches (Milano≠Milan, Wien≠Vienna, München≠Munich): Wien 0→1,691,468; Milano 0→1,371,498. Population cascade now: Eurostat EU > US Census > ONS UK > GeoNames string > GeoNames spatial > 0.

View File

@@ -1,7 +1,7 @@
# Padelnomics — Project Tracker
> Move tasks across columns as you work. Add new tasks at the top of the relevant column.
> Last updated: 2026-02-27 (opportunity score data quality improvements).
> Last updated: 2026-02-27 (Phase 2b — EU NUTS-2 spatial join + US state income).
---
@@ -92,6 +92,8 @@
- [x] **Opportunity Score v2 — income ceiling fix** — PPS normalisation `/200.0``/35000.0`; economic power component now differentiates countries (DE 13.2, ES 10.7, SE 14.3 pts; was 20.0 everywhere)
- [x] **dim_cities population coverage 70.5% → 98.5%** — GeoNames spatial fallback CTE (ST_Distance_Sphere, 0.14° bbox) resolves localization mismatches (Wien→Vienna 1.69M, Milano→Milan 1.37M); population cascade: Eurostat > Census > ONS > GeoNames string > GeoNames spatial > 0
- [x] **overpass_tennis added to supervisor workflows** — monthly schedule in `workflows.toml`; was only in combined extractor
- [x] **Phase 2a — NUTS-1 regional income**`eurostat.py` adds `nama_10r_2hhinc` dataset + URL filter params; `stg_regional_income.sql` new staging model (NUTS-1 codes, `EL→GR`/`UK→GB` normalisation); `dim_locations.sql` wires German Bundesland income via 16-row `admin1_to_nuts1` VALUES CTE; verified income spread Bayern > Hamburg > Berlin > Sachsen
- [x] **Phase 2b — EU NUTS-2 spatial join + US state income** — all EU-27+EFTA+UK locations auto-resolve to NUTS-2 via `ST_Contains` on GISCO boundary polygons; Germany now uses 38 Regierungsbezirke; US state income from Census ACS with PPS normalisation (`income / 80610 × 30000`); replaces brittle admin1 mapping CTE with zero-config spatial join; new files: `download_gisco_nuts.py`, `stg_nuts2_boundaries.sql`, `census_usa_income.py`, `stg_income_usa.sql`
### Data Pipeline (DaaS)
- [x] Overpass API extractor (OSM padel courts)

View File

@@ -15,14 +15,22 @@ import json
from pathlib import Path
def create_seed(dest: Path, content: bytes) -> None:
"""Write content to a gzip file atomically. Skips if the file already exists."""
def create_seed(dest: Path, content: bytes, *, compress: bool = True) -> None:
"""Write content to a seed file atomically. Skips if the file already exists.
compress=True (default) writes gzipped content, suitable for all landing zone
files. compress=False writes raw bytes — required for files consumed by DuckDB
ST_Read (e.g. GeoJSON), which cannot read .gz files.
"""
if dest.exists():
return
dest.parent.mkdir(parents=True, exist_ok=True)
tmp = dest.with_suffix(dest.suffix + ".tmp")
if compress:
with gzip.open(tmp, "wb") as f:
f.write(content)
else:
tmp.write_bytes(content)
tmp.rename(dest)
print(f" created: {dest}")
@@ -87,6 +95,8 @@ def main() -> None:
json.dumps({"rows": [], "count": 0}).encode(),
"eurostat/1970/01/nama_10r_2hhinc.json.gz":
json.dumps({"rows": [], "count": 0}).encode(),
"census_usa/1970/01/acs5_state_income.json.gz":
json.dumps({"rows": [], "count": 0}).encode(),
"eurostat_city_labels/1970/01/cities_codelist.json.gz":
json.dumps({"rows": [], "count": 0}).encode(),
@@ -97,9 +107,21 @@ def main() -> None:
json.dumps({"rows": [], "count": 0}).encode(),
}
# Uncompressed seeds — required for files consumed by ST_Read (cannot read .gz)
uncompressed_seeds = {
# Empty NUTS-2 boundary placeholder so stg_nuts2_boundaries can run before
# the real file is downloaded via scripts/download_gisco_nuts.py.
# ST_Read on an empty FeatureCollection returns 0 rows (graceful degradation:
# all locations fall back to country-level income until the real file lands).
"gisco/1970/01/nuts2_boundaries.geojson":
b'{"type":"FeatureCollection","features":[]}',
}
print(f"Initialising landing seeds in: {base}")
for rel_path, content in seeds.items():
create_seed(base / rel_path, content)
for rel_path, content in uncompressed_seeds.items():
create_seed(base / rel_path, content, compress=False)
print("Done.")

View File

@@ -6,14 +6,17 @@
-- covers all locations with population ≥ 1K so zero-court Gemeinden score fully.
--
-- Enriched with:
-- stg_income → country-level median income PPS (fallback)
-- stg_regional_income → NUTS-1 regional income PPS (preferred where mapped)
-- stg_nuts2_boundaries + stg_regional_income → EU NUTS-2/NUTS-1 income (spatial join)
-- stg_income_usa → US state-level income (PPS-normalised)
-- stg_income → country-level income (fallback for all countries)
-- stg_padel_courts → padel venue count + nearest court distance (km)
-- stg_tennis_courts → tennis court count within 25km radius
--
-- Income resolution: COALESCE(regional_income, country_income) AS median_income_pps.
-- Germany: all 16 Bundesländer mapped via admin1_code → NUTS-1.
-- All other countries fall back to country-level income (stg_income).
-- Income resolution cascade:
-- 1. EU NUTS-2 regional income (finest; spatial join via ST_Contains)
-- 2. EU NUTS-1 regional income (fallback when NUTS-2 income missing from dataset)
-- 3. US state income (ratio-normalised to PPS scale; see us_income CTE)
-- 4. Country-level income (global fallback from stg_income / ilc_di03)
--
-- Distance calculations use ST_Distance_Sphere (DuckDB spatial extension).
-- A bounding-box pre-filter (~0.5°, ≈55km) reduces the cross-join before the
@@ -44,43 +47,97 @@ locations AS (
FROM staging.stg_population_geonames
WHERE lat IS NOT NULL AND lon IS NOT NULL
),
-- Country income (same source and pattern as dim_cities) — fallback for unmapped regions
-- Country income (ilc_di03) — global fallback for all countries
country_income AS (
SELECT country_code, median_income_pps, ref_year AS income_year
FROM staging.stg_income
QUALIFY ROW_NUMBER() OVER (PARTITION BY country_code ORDER BY ref_year DESC) = 1
),
-- Static mapping: GeoNames admin1_code → NUTS-1 code (Germany only; extend by adding rows)
admin1_to_nuts1 (country_code, admin1_code, nuts1_code) AS (
VALUES
('DE', '01', 'DE1'), -- Baden-Württemberg
('DE', '02', 'DE2'), -- Bayern
('DE', '03', 'DE5'), -- Bremen
('DE', '04', 'DE6'), -- Hamburg
('DE', '05', 'DE7'), -- Hessen
('DE', '06', 'DE9'), -- Niedersachsen
('DE', '07', 'DEA'), -- Nordrhein-Westfalen
('DE', '08', 'DEB'), -- Rheinland-Pfalz
('DE', '09', 'DEC'), -- Saarland
('DE', '10', 'DEF'), -- Schleswig-Holstein
('DE', '11', 'DE4'), -- Brandenburg
('DE', '12', 'DE8'), -- Mecklenburg-Vorpommern
('DE', '13', 'DED'), -- Sachsen
('DE', '14', 'DEE'), -- Sachsen-Anhalt
('DE', '15', 'DEG'), -- Thüringen
('DE', '16', 'DE3') -- Berlin
-- ── EU NUTS-2 income via spatial join ──────────────────────────────────────
-- Each EU location's (lon, lat) is matched against NUTS-2 boundary polygons.
-- The bounding box pre-filter (bbox_lat/lon_min/max) eliminates most candidates
-- before the exact ST_Contains test runs.
nuts2_match AS (
SELECT
l.geoname_id,
b.nuts2_code,
-- Derive parent NUTS-1 code (first 3 chars of NUTS-2)
SUBSTR(b.nuts2_code, 1, 3) AS nuts1_code
FROM locations l
JOIN staging.stg_nuts2_boundaries b
-- Bounding-box pre-filter: only test ST_Contains for polygons whose bbox
-- overlaps the point, reducing 140K x 242 to ~140K x 3-5 actual tests.
ON l.lat BETWEEN b.bbox_lat_min AND b.bbox_lat_max
AND l.lon BETWEEN b.bbox_lon_min AND b.bbox_lon_max
WHERE ST_Contains(b.geometry, ST_Point(l.lon, l.lat))
-- A point should fall in exactly one polygon; QUALIFY guards against rare
-- boundary-precision duplicates.
QUALIFY ROW_NUMBER() OVER (PARTITION BY l.geoname_id ORDER BY b.nuts2_code) = 1
),
-- Latest NUTS-1 regional income per region
-- NUTS-2 income: latest year per NUTS-2 code
nuts2_income AS (
SELECT nuts_code, regional_income_pps, ref_year AS regional_income_year
FROM staging.stg_regional_income
WHERE nuts_level = 2
QUALIFY ROW_NUMBER() OVER (PARTITION BY nuts_code ORDER BY ref_year DESC) = 1
),
-- NUTS-1 income: fallback when NUTS-2 income not in dataset for this region
nuts1_income AS (
SELECT nuts_code, regional_income_pps, ref_year AS regional_income_year
FROM staging.stg_regional_income
WHERE nuts_level = 1
QUALIFY ROW_NUMBER() OVER (PARTITION BY nuts_code ORDER BY ref_year DESC) = 1
),
-- Combined EU regional income: NUTS-2 preferred, NUTS-1 fallback
regional_income AS (
SELECT
m.country_code,
nm.geoname_id,
COALESCE(n2.regional_income_pps, n1.regional_income_pps) AS regional_income_pps,
COALESCE(n2.regional_income_year, n1.regional_income_year) AS regional_income_year
FROM nuts2_match nm
LEFT JOIN nuts2_income n2 ON nm.nuts2_code = n2.nuts_code
LEFT JOIN nuts1_income n1 ON nm.nuts1_code = n1.nuts_code
),
-- ── US state-level income ──────────────────────────────────────────────────
-- GeoNames admin1_code for US = 2-letter state abbreviation (ISO 3166-2:US).
-- Census ACS uses 2-digit FIPS codes. This 51-row VALUES CTE bridges them.
us_state_fips (admin1_code, state_fips) AS (
VALUES
('AL', '01'), ('AK', '02'), ('AZ', '04'), ('AR', '05'), ('CA', '06'),
('CO', '08'), ('CT', '09'), ('DE', '10'), ('FL', '12'), ('GA', '13'),
('HI', '15'), ('ID', '16'), ('IL', '17'), ('IN', '18'), ('IA', '19'),
('KS', '20'), ('KY', '21'), ('LA', '22'), ('ME', '23'), ('MD', '24'),
('MA', '25'), ('MI', '26'), ('MN', '27'), ('MS', '28'), ('MO', '29'),
('MT', '30'), ('NE', '31'), ('NV', '32'), ('NH', '33'), ('NJ', '34'),
('NM', '35'), ('NY', '36'), ('NC', '37'), ('ND', '38'), ('OH', '39'),
('OK', '40'), ('OR', '41'), ('PA', '42'), ('RI', '44'), ('SC', '45'),
('SD', '46'), ('TN', '47'), ('TX', '48'), ('UT', '49'), ('VT', '50'),
('VA', '51'), ('WA', '53'), ('WV', '54'), ('WI', '55'), ('WY', '56'),
('DC', '11')
),
-- US state income normalised to PPS-equivalent scale.
-- US Census reports median household income in nominal USD. To compare with EU
-- PPS per inhabitant, we use a ratio normalisation:
-- state_pps ≈ US_PPS_CONSTANT × (state_income / national_median)
--
-- Constants (update when ACS vintage changes):
-- national_median = 80,610 — 2023 ACS national median household income (USD)
-- US_PPS_CONSTANT = 30,000 — approximate US per-capita income in PPS terms
-- (derived from OECD PPP tables, 2022)
--
-- This gives a realistic spread across states:
-- California ($91,905) → ~34,200 PPS [≈ wealthy EU region]
-- Texas ($67,321) → ~25,100 PPS [≈ Germany average]
-- Mississippi($52,985) → ~19,700 PPS [≈ lower-income EU region]
us_income AS (
SELECT
m.admin1_code,
r.regional_income_pps,
r.ref_year AS regional_income_year
FROM admin1_to_nuts1 m
JOIN staging.stg_regional_income r ON r.nuts1_code = m.nuts1_code
ROUND(s.median_income_usd / 80610.0 * 30000.0, 0) AS median_income_pps,
s.ref_year AS income_year
FROM us_state_fips m
JOIN staging.stg_income_usa s ON m.state_fips = s.state_fips
QUALIFY ROW_NUMBER() OVER (
PARTITION BY m.country_code, m.admin1_code ORDER BY r.ref_year DESC
PARTITION BY m.admin1_code ORDER BY s.ref_year DESC
) = 1
),
-- Padel court lat/lon for distance and density calculations
@@ -197,8 +254,17 @@ SELECT
l.admin2_code,
l.population,
l.ref_year AS population_year,
COALESCE(ri.regional_income_pps, ci.median_income_pps) AS median_income_pps,
COALESCE(ri.regional_income_year, ci.income_year) AS income_year,
-- Income cascade: EU NUTS-2 → EU NUTS-1 → US state → country-level
COALESCE(
ri.regional_income_pps, -- EU: NUTS-2 (finest) or NUTS-1 (fallback)
us.median_income_pps, -- US: state-level PPS-equivalent
ci.median_income_pps -- Global: country-level from ilc_di03
) AS median_income_pps,
COALESCE(
ri.regional_income_year,
us.income_year,
ci.income_year
) AS income_year,
COALESCE(pl.padel_venue_count, 0)::INTEGER AS padel_venue_count,
-- Venues per 100K residents (NULL if population = 0)
CASE WHEN l.population > 0
@@ -210,12 +276,13 @@ SELECT
CURRENT_DATE AS refreshed_date
FROM locations l
LEFT JOIN country_income ci ON l.country_code = ci.country_code
LEFT JOIN regional_income ri ON l.country_code = ri.country_code
AND l.admin1_code = ri.admin1_code
LEFT JOIN regional_income ri ON l.geoname_id = ri.geoname_id
LEFT JOIN us_income us ON l.country_code = 'US'
AND l.admin1_code = us.admin1_code
LEFT JOIN nearest_padel np ON l.geoname_id = np.geoname_id
LEFT JOIN padel_local pl ON l.geoname_id = pl.geoname_id
LEFT JOIN tennis_nearby tn ON l.geoname_id = tn.geoname_id
-- Enforce grain: deduplicate if city slug collides within same country
-- Enforce grain: deduplicate if location slug collides within same country
QUALIFY ROW_NUMBER() OVER (
PARTITION BY l.country_code, l.geoname_id
ORDER BY l.population DESC NULLS LAST

View File

@@ -0,0 +1,39 @@
-- US Census ACS 5-year state-level median household income (B19013_001E).
-- One row per (state_fips, ref_year). Income stored in nominal USD.
-- PPS conversion to EU-comparable scale happens in dim_locations.
--
-- Source: data/landing/census_usa/{year}/{month}/acs5_state_income.json.gz
-- Extracted by: census_usa_income.py
MODEL (
name staging.stg_income_usa,
kind FULL,
cron '@daily',
grain (state_fips, ref_year)
);
WITH parsed AS (
SELECT
row ->> 'state_fips' AS state_fips,
row ->> 'state_name' AS state_name,
TRY_CAST(row ->> 'median_income_usd' AS DOUBLE) AS median_income_usd,
TRY_CAST(row ->> 'ref_year' AS INTEGER) AS ref_year,
CURRENT_DATE AS extracted_date
FROM (
SELECT UNNEST(rows) AS row
FROM read_json(
@LANDING_DIR || '/census_usa/*/*/acs5_state_income.json.gz',
auto_detect = true
)
)
WHERE (row ->> 'state_fips') IS NOT NULL
)
SELECT
state_fips,
TRIM(state_name) AS state_name,
median_income_usd,
ref_year,
extracted_date
FROM parsed
WHERE median_income_usd IS NOT NULL
AND median_income_usd > 0

View File

@@ -0,0 +1,47 @@
-- NUTS-2 boundary polygons from Eurostat GISCO for spatial income resolution.
-- Used by dim_locations to resolve each location's (lat, lon) to its NUTS-2
-- region via ST_Contains point-in-polygon join.
--
-- Source: data/landing/gisco/2024/01/nuts2_boundaries.geojson
-- Downloaded by: scripts/download_gisco_nuts.py (run once; re-run on NUTS revision)
-- Format: GeoJSON FeatureCollection with NUTS_ID, CNTR_CODE, NAME_LATN properties
-- NUTS revision: 2021 (effective 2024-01-01); changes ~every 7 years
--
-- Coverage: EU-27 + EFTA (NO, CH, IS, LI) + UK + candidate countries
-- ~242 NUTS-2 polygons at 1:20,000,000 scale (simplified for performance)
MODEL (
name staging.stg_nuts2_boundaries,
kind FULL,
cron '@daily',
grain nuts2_code
);
WITH raw AS (
-- ST_Read reads GeoJSON via GDAL. File must be uncompressed (.geojson, not .gz).
-- Path is fixed (not glob) because the boundary file is a one-time download,
-- not a time-partitioned series. The partition "2024/01" reflects the NUTS
-- revision year, not the download date.
SELECT *
FROM ST_Read(@LANDING_DIR || '/gisco/2024/01/nuts2_boundaries.geojson')
)
SELECT
NUTS_ID AS nuts2_code,
-- Normalise country prefix to ISO 3166-1 alpha-2: EL→GR, UK→GB
CASE CNTR_CODE
WHEN 'EL' THEN 'GR'
WHEN 'UK' THEN 'GB'
ELSE CNTR_CODE
END AS country_code,
NAME_LATN AS region_name,
geom AS geometry,
-- Pre-compute bounding box for efficient spatial pre-filter in dim_locations.
-- Each location only calls ST_Contains against the ~3-5 polygons whose bbox
-- overlaps it, not all 242.
ST_YMin(geom) AS bbox_lat_min,
ST_YMax(geom) AS bbox_lat_max,
ST_XMin(geom) AS bbox_lon_min,
ST_XMax(geom) AS bbox_lon_max
FROM raw
WHERE NUTS_ID IS NOT NULL
AND LENGTH(NUTS_ID) = 4 -- NUTS-2 codes are exactly 4 characters