feat(data): Phase 2b complete — EU NUTS-2 spatial join + US state income

- stg_regional_income: expanded NUTS-1+2 (LENGTH IN 3,4), nuts_code rename, nuts_level
- stg_nuts2_boundaries: new — ST_Read GISCO GeoJSON, bbox columns for spatial pre-filter
- stg_income_usa: new — Census ACS state-level income staging model
- dim_locations: spatial join replaces admin1_to_nuts1 VALUES CTE; us_income CTE with
  PPS normalisation (income/80610×30000); income cascade: NUTS-2→NUTS-1→US state→country
- init_landing_seeds: compress=False for ST_Read files; gisco GeoJSON + census income seeds
- CHANGELOG + PROJECT.md updated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Deeman
2026-02-27 11:03:16 +01:00
parent 409dc4bfac
commit c3531bd75d
6 changed files with 228 additions and 42 deletions


@@ -14,6 +14,15 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
- `init_landing_seeds.py`: seed entry for `eurostat/1970/01/nama_10r_2hhinc.json.gz`.
- Verified income spread: Bayern (DE2) ~29K PPS > Hamburg (DE6) ~27K > Berlin (DE3) ~24K > Sachsen (DED) ~19K PPS. Non-mapped countries (ES, FR, IT) continue with country-level fallback.
- **Phase 2b — EU NUTS-2 spatial join + US state income** (`dim_locations`): all EU-27 + EFTA + UK locations now resolve to their NUTS-2 region automatically via a spatial join; US locations now use Census ACS state-level income instead of a flat country fallback.
- `stg_regional_income.sql`: expanded from NUTS-1 only (`LENGTH = 3`) to NUTS-1 + NUTS-2 (`LENGTH IN (3,4)`); column renamed `nuts1_code → nuts_code`; added `nuts_level` derived column (1 or 2).
- `scripts/download_gisco_nuts.py`: new one-time download script for NUTS-2 boundary GeoJSON from Eurostat GISCO (`NUTS_RG_20M_2021_4326_LEVL_2.geojson`, ~5 MB, NUTS revision 2021). Saves uncompressed — `ST_Read` cannot read `.gz`.
- `stg_nuts2_boundaries.sql`: new staging model — reads GeoJSON via `ST_Read`; extracts `nuts2_code`, `country_code`, `geometry`, and pre-computed bbox columns (`bbox_lat_min/max`, `bbox_lon_min/max`) for spatial pre-filter; normalises `EL→GR` / `UK→GB`. Grain: `nuts2_code`.
- `census_usa_income.py`: new extractor — fetches `B19013_001E` (median household income) at state level from Census ACS 5-year; saves to `census_usa/{year}/{month}/acs5_state_income.json.gz`; registered in `all.py` and `pyproject.toml`.
- `stg_income_usa.sql`: new staging model for US state income. Grain: `(state_fips, ref_year)`. Income kept in nominal USD — PPS conversion happens in `dim_locations`.
- `dim_locations.sql`: replaced `admin1_to_nuts1` VALUES CTE (16 DE rows) with full spatial join: `nuts2_match` (bbox pre-filter + `ST_Contains`) → `nuts2_income` / `nuts1_income` (latest year per level) → `regional_income` (COALESCE NUTS-2 → NUTS-1). Added `us_state_fips` (51-row VALUES CTE, admin1 abbreviation → FIPS) + `us_income` (PPS normalisation: `state_income / 80610.0 × 30000.0`). Final income cascade: EU NUTS-2 → EU NUTS-1 → US state → country-level. Germany now resolves to its 38 NUTS-2 regions (Regierungsbezirke or equivalents); Spain, France, Italy, the Netherlands and other covered countries all get NUTS-2 differentiation automatically.
- `init_landing_seeds.py`: `create_seed` extended with `compress=False` for files consumed by `ST_Read` (cannot read `.gz`); added `census_usa/1970/01/acs5_state_income.json.gz` seed and uncompressed `gisco/1970/01/nuts2_boundaries.geojson` empty-FeatureCollection seed.
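
The income cascade and the US normalisation described for `dim_locations.sql` can be sketched in Python as follows. This is a minimal illustration, not the model's SQL: the function names `us_income_to_pps` and `resolve_income` are hypothetical, and only the constants `80610.0` and `30000.0` and the fallback order come from the entry above.

```python
from typing import Optional

# Constants quoted in the changelog entry for the `us_income` CTE.
US_REFERENCE_INCOME_USD = 80610.0
PPS_SCALE = 30000.0


def us_income_to_pps(state_income_usd: float) -> float:
    """Rescale nominal USD state income: state_income / 80610.0 * 30000.0."""
    return state_income_usd / US_REFERENCE_INCOME_USD * PPS_SCALE


def resolve_income(
    nuts2: Optional[float],
    nuts1: Optional[float],
    us_state_usd: Optional[float],
    country: Optional[float],
) -> Optional[float]:
    """Return the first available income, in the order of a SQL COALESCE.

    Order: EU NUTS-2, EU NUTS-1, US state (converted to PPS), country level.
    """
    us_state_pps = None if us_state_usd is None else us_income_to_pps(us_state_usd)
    for candidate in (nuts2, nuts1, us_state_pps, country):
        if candidate is not None:
            return candidate
    return None
```

A location with no NUTS-2 match but a NUTS-1 value takes the NUTS-1 value; a US location with neither takes its state income after rescaling.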
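
The `nuts2_match` step pairs a cheap bounding-box test with an exact containment test. The sketch below shows that pattern in plain Python, assuming simple single-ring polygons and using a ray-casting test as a stand-in for `ST_Contains`; the `Region` type, the function names, and the region code used in any example call are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (lon, lat)
BBox = Tuple[float, float, float, float]  # (lon_min, lat_min, lon_max, lat_max)


@dataclass
class Region:
    code: str
    ring: List[Point]  # exterior ring only; holes and multipolygons are ignored

    def bbox(self) -> BBox:
        lons = [p[0] for p in self.ring]
        lats = [p[1] for p in self.ring]
        return min(lons), min(lats), max(lons), max(lats)


def in_bbox(pt: Point, box: BBox) -> bool:
    """Cheap pre-filter: is the point inside the axis-aligned bounding box?"""
    lon, lat = pt
    lon_min, lat_min, lon_max, lat_max = box
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max


def in_ring(pt: Point, ring: List[Point]) -> bool:
    """Exact test: ray casting with the even-odd rule."""
    lon, lat = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def match_region(pt: Point, regions: List[Region]) -> Optional[str]:
    """Return the code of the first region containing the point, else None."""
    for region in regions:
        # The bbox check rejects most regions before the costlier ring test runs.
        if in_bbox(pt, region.bbox()) and in_ring(pt, region.ring):
            return region.code
    return None
```

Precomputing the bbox columns in the staging model serves the same purpose as `Region.bbox()` here: the exact geometry test only runs for the few regions whose box contains the point.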
### Changed
- **Opportunity Score v2: income ceiling fix** (`location_opportunity_profile.sql`): income PPS normalisation changed from `/200.0` to `/35000.0`, a ceiling matched to the spread of country incomes. The old divisor saturated the term for every country (`LEAST(1.0, 115) = 1.0`), so income gave no differentiation. Default for missing data changed from 100 to 15000 (developing-market assumption). Country scores now reflect the real PPS spread: LU 20.0, SE 14.3, DE 13.2, ES 10.7, GB 10.5 pts (was 20.0 everywhere).
- **dim_cities population coverage 70.5% → 98.5%** — added GeoNames spatial fallback CTE that finds the nearest GeoNames location within ~15 km when string name matching fails (~29% of cities). Fixes localization mismatches (Milano≠Milan, Wien≠Vienna, München≠Munich): Wien 0→1,691,468; Milano 0→1,371,498. Population cascade now: Eurostat EU > US Census > ONS UK > GeoNames string > GeoNames spatial > 0.
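
The effect of the income ceiling fix can be reproduced with a few lines of arithmetic. This sketch assumes the income term is `weight * LEAST(1.0, income / divisor)` with a 20-point weight, which is inferred from the "20.0 everywhere" figure above rather than taken from the model; `income_points` and the two sample incomes are hypothetical.

```python
def income_points(income_pps: float, divisor: float, weight: float = 20.0) -> float:
    """Income contribution: weight * LEAST(1.0, income / divisor).

    The 20-point weight is an assumption inferred from the changelog.
    """
    return weight * min(1.0, income_pps / divisor)


# Illustrative PPS incomes, not values from the warehouse.
richer, poorer = 23000.0, 18000.0

# Old divisor: any realistic income exceeds 200, so the term always saturates.
old_scores = (income_points(richer, 200.0), income_points(poorer, 200.0))

# New divisor: incomes below 35000 scale linearly, so the two regions differ.
new_scores = (income_points(richer, 35000.0), income_points(poorer, 35000.0))
```

With the old divisor both sample incomes hit the cap; with the new one the richer region scores strictly higher, and only incomes at or above 35000 reach the full weight.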
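
The GeoNames spatial fallback amounts to a nearest-neighbour lookup with a distance cut-off. The sketch below assumes great-circle (haversine) distance and a 15 km radius; the actual CTE may use a different distance approximation. The function names are hypothetical, the coordinates are approximate city centres, and the populations are the ones quoted above.

```python
import math
from typing import Iterable, Optional, Tuple

EARTH_RADIUS_KM = 6371.0

# (name, lat, lon, population)
Place = Tuple[str, float, float, int]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_within(
    lat: float, lon: float, places: Iterable[Place], max_km: float = 15.0
) -> Optional[Place]:
    """Return the closest place within max_km, or None if none qualifies."""
    best: Optional[Place] = None
    best_km = max_km
    for place in places:
        d = haversine_km(lat, lon, place[1], place[2])
        if d <= best_km:
            best, best_km = place, d
    return best


# GeoNames-style rows; names differ from the local spellings Wien / Milano.
geonames = [
    ("Vienna", 48.2082, 16.3738, 1691468),
    ("Milan", 45.4642, 9.1900, 1371498),
]

# A city stored as "Wien" fails the string match but resolves by location.
wien_match = nearest_within(48.21, 16.37, geonames)
```

Because the lookup ignores names entirely, it should sit after the string match in the cascade, as the entry above describes, so that exact name matches are still preferred.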