feat(data): Phase 2b complete — EU NUTS-2 spatial join + US state income

- stg_regional_income: expanded NUTS-1+2 (LENGTH IN 3,4), nuts_code rename, nuts_level - stg_nuts2_boundaries: new — ST_Read GISCO GeoJSON, bbox columns for spatial pre-filter - stg_income_usa: new — Census ACS state-level income staging model - dim_locations: spatial join replaces admin1_to_nuts1 VALUES CTE; us_income CTE with PPS normalisation (income/80610×30000); income cascade: NUTS-2→NUTS-1→US state→country - init_landing_seeds: compress=False for ST_Read files; gisco GeoJSON + census income seeds - CHANGELOG + PROJECT.md updated Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-27 11:03:16 +01:00
parent 409dc4bfac
commit c3531bd75d
6 changed files with 228 additions and 42 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -14,6 +14,15 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
  - `init_landing_seeds.py`: seed entry for `eurostat/1970/01/nama_10r_2hhinc.json.gz`.
  - Verified income spread: Bayern (DE2) ~29K PPS > Hamburg (DE6) ~27K > Berlin (DE3) ~24K > Sachsen (DED) ~19K PPS. Non-mapped countries (ES, FR, IT) continue with country-level fallback.

+- **Phase 2b — EU NUTS-2 spatial join + US state income** (`dim_locations`): all EU-27 + EFTA + UK locations now resolve to their NUTS-2 region automatically via a spatial join; US locations now use Census ACS state-level income instead of a flat country fallback.
+  - `stg_regional_income.sql`: expanded from NUTS-1 only (`LENGTH = 3`) to NUTS-1 + NUTS-2 (`LENGTH IN (3,4)`); column renamed `nuts1_code → nuts_code`; added `nuts_level` derived column (1 or 2).
+  - `scripts/download_gisco_nuts.py`: new one-time download script for NUTS-2 boundary GeoJSON from Eurostat GISCO (`NUTS_RG_20M_2021_4326_LEVL_2.geojson`, ~5 MB, NUTS revision 2021). Saves uncompressed — `ST_Read` cannot read `.gz`.
+  - `stg_nuts2_boundaries.sql`: new staging model — reads GeoJSON via `ST_Read`; extracts `nuts2_code`, `country_code`, `geometry`, and pre-computed bbox columns (`bbox_lat_min/max`, `bbox_lon_min/max`) for spatial pre-filter; normalises `EL→GR` / `UK→GB`. Grain: `nuts2_code`.
+  - `census_usa_income.py`: new extractor — fetches `B19013_001E` (median household income) at state level from Census ACS 5-year; saves to `census_usa/{year}/{month}/acs5_state_income.json.gz`; registered in `all.py` and `pyproject.toml`.
+  - `stg_income_usa.sql`: new staging model for US state income. Grain: `(state_fips, ref_year)`. Income kept in nominal USD — PPS conversion happens in `dim_locations`.
+  - `dim_locations.sql`: replaced `admin1_to_nuts1` VALUES CTE (16 DE rows) with full spatial join: `nuts2_match` (bbox pre-filter + `ST_Contains`) → `nuts2_income` / `nuts1_income` (latest year per level) → `regional_income` (COALESCE NUTS-2 → NUTS-1). Added `us_state_fips` (51-row VALUES CTE, admin1 abbreviation → FIPS) + `us_income` (PPS normalisation: `state_income / 80610.0 × 30000.0`). Final income cascade: EU NUTS-2 → EU NUTS-1 → US state → country-level. Germany now resolves to 38 Regierungsbezirke; Spain, France, Italy, Netherlands etc. all get NUTS-2 differentiation automatically.
+  - `init_landing_seeds.py`: `create_seed` extended with `compress=False` for files consumed by `ST_Read` (cannot read `.gz`); added `census_usa/1970/01/acs5_state_income.json.gz` seed and uncompressed `gisco/1970/01/nuts2_boundaries.geojson` empty-FeatureCollection seed.
+
 ### Changed
 - **Opportunity Score v2 — income ceiling fix** (`location_opportunity_profile.sql`): income PPS normalisation changed from `/200.0` (caused LEAST(1.0, 115)=1.0 for ALL countries — no differentiation) to `/35000.0` with country-spread-matched ceiling. Default for missing data changed from 100 to 15000 (developing-market assumption). Country scores now reflect real PPS spread: LU 20.0, SE 14.3, DE 13.2, ES 10.7, GB 10.5 pts (was 20.0 everywhere).
 - **dim_cities population coverage 70.5% → 98.5%** — added GeoNames spatial fallback CTE that finds the nearest GeoNames location within ~15 km when string name matching fails (~29% of cities). Fixes localization mismatches (Milano≠Milan, Wien≠Vienna, München≠Munich): Wien 0→1,691,468; Milano 0→1,371,498. Population cascade now: Eurostat EU > US Census > ONS UK > GeoNames string > GeoNames spatial > 0.
--- a/PROJECT.md
+++ b/PROJECT.md
@@ -1,7 +1,7 @@
 # Padelnomics — Project Tracker

 > Move tasks across columns as you work. Add new tasks at the top of the relevant column.
-> Last updated: 2026-02-27 (opportunity score data quality improvements).
+> Last updated: 2026-02-27 (Phase 2b — EU NUTS-2 spatial join + US state income).

 ---

@@ -92,6 +92,8 @@
 - [x] **Opportunity Score v2 — income ceiling fix** — PPS normalisation `/200.0` → `/35000.0`; economic power component now differentiates countries (DE 13.2, ES 10.7, SE 14.3 pts; was 20.0 everywhere)
 - [x] **dim_cities population coverage 70.5% → 98.5%** — GeoNames spatial fallback CTE (ST_Distance_Sphere, 0.14° bbox) resolves localization mismatches (Wien→Vienna 1.69M, Milano→Milan 1.37M); population cascade: Eurostat > Census > ONS > GeoNames string > GeoNames spatial > 0
 - [x] **overpass_tennis added to supervisor workflows** — monthly schedule in `workflows.toml`; was only in combined extractor
+- [x] **Phase 2a — NUTS-1 regional income** — `eurostat.py` adds `nama_10r_2hhinc` dataset + URL filter params; `stg_regional_income.sql` new staging model (NUTS-1 codes, `EL→GR`/`UK→GB` normalisation); `dim_locations.sql` wires German Bundesland income via 16-row `admin1_to_nuts1` VALUES CTE; verified income spread Bayern > Hamburg > Berlin > Sachsen
+- [x] **Phase 2b — EU NUTS-2 spatial join + US state income** — all EU-27+EFTA+UK locations auto-resolve to NUTS-2 via `ST_Contains` on GISCO boundary polygons; Germany now uses 38 Regierungsbezirke; US state income from Census ACS with PPS normalisation (`income / 80610 × 30000`); replaces brittle admin1 mapping CTE with zero-config spatial join; new files: `download_gisco_nuts.py`, `stg_nuts2_boundaries.sql`, `census_usa_income.py`, `stg_income_usa.sql`

 ### Data Pipeline (DaaS)
 - [x] Overpass API extractor (OSM padel courts)
--- a/scripts/init_landing_seeds.py
+++ b/scripts/init_landing_seeds.py
@@ -15,14 +15,22 @@ import json
 from pathlib import Path


-def create_seed(dest: Path, content: bytes) -> None:
-    """Write content to a gzip file atomically. Skips if the file already exists."""
+def create_seed(dest: Path, content: bytes, *, compress: bool = True) -> None:
+    """Write content to a seed file atomically. Skips if the file already exists.
+
+    compress=True (default) writes gzipped content, suitable for all landing zone
+    files. compress=False writes raw bytes — required for files consumed by DuckDB
+    ST_Read (e.g. GeoJSON), which cannot read .gz files.
+    """
    if dest.exists():
        return
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_suffix(dest.suffix + ".tmp")
+    if compress:
        with gzip.open(tmp, "wb") as f:
            f.write(content)
+    else:
+        tmp.write_bytes(content)
    tmp.rename(dest)
    print(f"  created: {dest}")

@@ -87,6 +95,8 @@ def main() -> None:
            json.dumps({"rows": [], "count": 0}).encode(),
        "eurostat/1970/01/nama_10r_2hhinc.json.gz":
            json.dumps({"rows": [], "count": 0}).encode(),
+        "census_usa/1970/01/acs5_state_income.json.gz":
+            json.dumps({"rows": [], "count": 0}).encode(),
        "eurostat_city_labels/1970/01/cities_codelist.json.gz":
            json.dumps({"rows": [], "count": 0}).encode(),

@@ -97,9 +107,21 @@ def main() -> None:
            json.dumps({"rows": [], "count": 0}).encode(),
    }

+    # Uncompressed seeds — required for files consumed by ST_Read (cannot read .gz)
+    uncompressed_seeds = {
+        # Empty NUTS-2 boundary placeholder so stg_nuts2_boundaries can run before
+        # the real file is downloaded via scripts/download_gisco_nuts.py.
+        # ST_Read on an empty FeatureCollection returns 0 rows (graceful degradation:
+        # all locations fall back to country-level income until the real file lands).
+        "gisco/1970/01/nuts2_boundaries.geojson":
+            b'{"type":"FeatureCollection","features":[]}',
+    }
+
    print(f"Initialising landing seeds in: {base}")
    for rel_path, content in seeds.items():
        create_seed(base / rel_path, content)
+    for rel_path, content in uncompressed_seeds.items():
+        create_seed(base / rel_path, content, compress=False)
    print("Done.")


--- a/transform/sqlmesh_padelnomics/models/foundation/dim_locations.sql
+++ b/transform/sqlmesh_padelnomics/models/foundation/dim_locations.sql
@@ -6,14 +6,17 @@
 -- covers all locations with population ≥ 1K so zero-court Gemeinden score fully.
 --
 -- Enriched with:
--   stg_income           → country-level median income PPS (fallback)
--   stg_regional_income  → NUTS-1 regional income PPS (preferred where mapped)
+--   stg_nuts2_boundaries + stg_regional_income → EU NUTS-2/NUTS-1 income (spatial join)
+--   stg_income_usa       → US state-level income (PPS-normalised)
+--   stg_income           → country-level income (fallback for all countries)
 --   stg_padel_courts     → padel venue count + nearest court distance (km)
 --   stg_tennis_courts    → tennis court count within 25km radius
 --
-- Income resolution: COALESCE(regional_income, country_income) AS median_income_pps.
-- Germany: all 16 Bundesländer mapped via admin1_code → NUTS-1.
-- All other countries fall back to country-level income (stg_income).
+-- Income resolution cascade:
+--   1. EU NUTS-2 regional income  (finest; spatial join via ST_Contains)
+--   2. EU NUTS-1 regional income  (fallback when NUTS-2 income missing from dataset)
+--   3. US state income            (ratio-normalised to PPS scale; see us_income CTE)
+--   4. Country-level income       (global fallback from stg_income / ilc_di03)
 --
 -- Distance calculations use ST_Distance_Sphere (DuckDB spatial extension).
 -- A bounding-box pre-filter (~0.5°, ≈55km) reduces the cross-join before the
@@ -44,43 +47,97 @@ locations AS (
  FROM staging.stg_population_geonames
  WHERE lat IS NOT NULL AND lon IS NOT NULL
 ),
-- Country income (same source and pattern as dim_cities) — fallback for unmapped regions
+-- Country income (ilc_di03) — global fallback for all countries
 country_income AS (
  SELECT country_code, median_income_pps, ref_year AS income_year
  FROM staging.stg_income
  QUALIFY ROW_NUMBER() OVER (PARTITION BY country_code ORDER BY ref_year DESC) = 1
 ),
-- Static mapping: GeoNames admin1_code → NUTS-1 code (Germany only; extend by adding rows)
-admin1_to_nuts1 (country_code, admin1_code, nuts1_code) AS (
-  VALUES
-    ('DE', '01', 'DE1'),  -- Baden-Württemberg
-    ('DE', '02', 'DE2'),  -- Bayern
-    ('DE', '03', 'DE5'),  -- Bremen
-    ('DE', '04', 'DE6'),  -- Hamburg
-    ('DE', '05', 'DE7'),  -- Hessen
-    ('DE', '06', 'DE9'),  -- Niedersachsen
-    ('DE', '07', 'DEA'),  -- Nordrhein-Westfalen
-    ('DE', '08', 'DEB'),  -- Rheinland-Pfalz
-    ('DE', '09', 'DEC'),  -- Saarland
-    ('DE', '10', 'DEF'),  -- Schleswig-Holstein
-    ('DE', '11', 'DE4'),  -- Brandenburg
-    ('DE', '12', 'DE8'),  -- Mecklenburg-Vorpommern
-    ('DE', '13', 'DED'),  -- Sachsen
-    ('DE', '14', 'DEE'),  -- Sachsen-Anhalt
-    ('DE', '15', 'DEG'),  -- Thüringen
-    ('DE', '16', 'DE3')   -- Berlin
+-- ── EU NUTS-2 income via spatial join ──────────────────────────────────────
+-- Each EU location's (lon, lat) is matched against NUTS-2 boundary polygons.
+-- The bounding box pre-filter (bbox_lat/lon_min/max) eliminates most candidates
+-- before the exact ST_Contains test runs.
+nuts2_match AS (
+  SELECT
+    l.geoname_id,
+    b.nuts2_code,
+    -- Derive parent NUTS-1 code (first 3 chars of NUTS-2)
+    SUBSTR(b.nuts2_code, 1, 3)                   AS nuts1_code
+  FROM locations l
+  JOIN staging.stg_nuts2_boundaries b
+    -- Bounding-box pre-filter: only test ST_Contains for polygons whose bbox
+    -- overlaps the point, reducing 140K x 242 to ~140K x 3-5 actual tests.
+    ON l.lat BETWEEN b.bbox_lat_min AND b.bbox_lat_max
+    AND l.lon BETWEEN b.bbox_lon_min AND b.bbox_lon_max
+  WHERE ST_Contains(b.geometry, ST_Point(l.lon, l.lat))
+  -- A point should fall in exactly one polygon; QUALIFY guards against rare
+  -- boundary-precision duplicates.
+  QUALIFY ROW_NUMBER() OVER (PARTITION BY l.geoname_id ORDER BY b.nuts2_code) = 1
 ),
-- Latest NUTS-1 regional income per region
+-- NUTS-2 income: latest year per NUTS-2 code
+nuts2_income AS (
+  SELECT nuts_code, regional_income_pps, ref_year AS regional_income_year
+  FROM staging.stg_regional_income
+  WHERE nuts_level = 2
+  QUALIFY ROW_NUMBER() OVER (PARTITION BY nuts_code ORDER BY ref_year DESC) = 1
+),
+-- NUTS-1 income: fallback when NUTS-2 income not in dataset for this region
+nuts1_income AS (
+  SELECT nuts_code, regional_income_pps, ref_year AS regional_income_year
+  FROM staging.stg_regional_income
+  WHERE nuts_level = 1
+  QUALIFY ROW_NUMBER() OVER (PARTITION BY nuts_code ORDER BY ref_year DESC) = 1
+),
+-- Combined EU regional income: NUTS-2 preferred, NUTS-1 fallback
 regional_income AS (
  SELECT
-    m.country_code,
+    nm.geoname_id,
+    COALESCE(n2.regional_income_pps, n1.regional_income_pps)           AS regional_income_pps,
+    COALESCE(n2.regional_income_year, n1.regional_income_year)         AS regional_income_year
+  FROM nuts2_match nm
+  LEFT JOIN nuts2_income n2  ON nm.nuts2_code = n2.nuts_code
+  LEFT JOIN nuts1_income n1  ON nm.nuts1_code = n1.nuts_code
+),
+-- ── US state-level income ──────────────────────────────────────────────────
+-- GeoNames admin1_code for US = 2-letter state abbreviation (ISO 3166-2:US).
+-- Census ACS uses 2-digit FIPS codes. This 51-row VALUES CTE bridges them.
+us_state_fips (admin1_code, state_fips) AS (
+  VALUES
+    ('AL', '01'), ('AK', '02'), ('AZ', '04'), ('AR', '05'), ('CA', '06'),
+    ('CO', '08'), ('CT', '09'), ('DE', '10'), ('FL', '12'), ('GA', '13'),
+    ('HI', '15'), ('ID', '16'), ('IL', '17'), ('IN', '18'), ('IA', '19'),
+    ('KS', '20'), ('KY', '21'), ('LA', '22'), ('ME', '23'), ('MD', '24'),
+    ('MA', '25'), ('MI', '26'), ('MN', '27'), ('MS', '28'), ('MO', '29'),
+    ('MT', '30'), ('NE', '31'), ('NV', '32'), ('NH', '33'), ('NJ', '34'),
+    ('NM', '35'), ('NY', '36'), ('NC', '37'), ('ND', '38'), ('OH', '39'),
+    ('OK', '40'), ('OR', '41'), ('PA', '42'), ('RI', '44'), ('SC', '45'),
+    ('SD', '46'), ('TN', '47'), ('TX', '48'), ('UT', '49'), ('VT', '50'),
+    ('VA', '51'), ('WA', '53'), ('WV', '54'), ('WI', '55'), ('WY', '56'),
+    ('DC', '11')
+),
+-- US state income normalised to PPS-equivalent scale.
+-- US Census reports median household income in nominal USD. To compare with EU
+-- PPS per inhabitant, we use a ratio normalisation:
+--   state_pps ≈ US_PPS_CONSTANT × (state_income / national_median)
+--
+-- Constants (update when ACS vintage changes):
+--   national_median = 80,610  — 2023 ACS national median household income (USD)
+--   US_PPS_CONSTANT = 30,000  — approximate US per-capita income in PPS terms
+--                              (derived from OECD PPP tables, 2022)
+--
+-- This gives a realistic spread across states:
+--   California ($91,905) → ~34,200 PPS   [≈ wealthy EU region]
+--   Texas      ($67,321) → ~25,100 PPS   [≈ Germany average]
+--   Mississippi($52,985) → ~19,700 PPS   [≈ lower-income EU region]
+us_income AS (
+  SELECT
    m.admin1_code,
-    r.regional_income_pps,
-    r.ref_year AS regional_income_year
-  FROM admin1_to_nuts1 m
-  JOIN staging.stg_regional_income r ON r.nuts1_code = m.nuts1_code
+    ROUND(s.median_income_usd / 80610.0 * 30000.0, 0)  AS median_income_pps,
+    s.ref_year                                           AS income_year
+  FROM us_state_fips m
+  JOIN staging.stg_income_usa s ON m.state_fips = s.state_fips
  QUALIFY ROW_NUMBER() OVER (
-    PARTITION BY m.country_code, m.admin1_code ORDER BY r.ref_year DESC
+    PARTITION BY m.admin1_code ORDER BY s.ref_year DESC
  ) = 1
 ),
 -- Padel court lat/lon for distance and density calculations
@@ -197,8 +254,17 @@ SELECT
  l.admin2_code,
  l.population,
  l.ref_year                                                             AS population_year,
-  COALESCE(ri.regional_income_pps, ci.median_income_pps)               AS median_income_pps,
-  COALESCE(ri.regional_income_year, ci.income_year)                    AS income_year,
+  -- Income cascade: EU NUTS-2 → EU NUTS-1 → US state → country-level
+  COALESCE(
+    ri.regional_income_pps,    -- EU: NUTS-2 (finest) or NUTS-1 (fallback)
+    us.median_income_pps,      -- US: state-level PPS-equivalent
+    ci.median_income_pps       -- Global: country-level from ilc_di03
+  )                                                                      AS median_income_pps,
+  COALESCE(
+    ri.regional_income_year,
+    us.income_year,
+    ci.income_year
+  )                                                                      AS income_year,
  COALESCE(pl.padel_venue_count, 0)::INTEGER                            AS padel_venue_count,
  -- Venues per 100K residents (NULL if population = 0)
  CASE WHEN l.population > 0
@@ -210,12 +276,13 @@ SELECT
  CURRENT_DATE                                                          AS refreshed_date
 FROM locations l
 LEFT JOIN country_income ci   ON l.country_code = ci.country_code
-LEFT JOIN regional_income ri  ON l.country_code = ri.country_code
-                              AND l.admin1_code  = ri.admin1_code
+LEFT JOIN regional_income ri  ON l.geoname_id   = ri.geoname_id
+LEFT JOIN us_income us        ON l.country_code  = 'US'
+                              AND l.admin1_code  = us.admin1_code
 LEFT JOIN nearest_padel np    ON l.geoname_id   = np.geoname_id
 LEFT JOIN padel_local pl      ON l.geoname_id   = pl.geoname_id
 LEFT JOIN tennis_nearby tn    ON l.geoname_id   = tn.geoname_id
-- Enforce grain: deduplicate if city slug collides within same country
+-- Enforce grain: deduplicate if location slug collides within same country
 QUALIFY ROW_NUMBER() OVER (
  PARTITION BY l.country_code, l.geoname_id
  ORDER BY l.population DESC NULLS LAST
--- a/transform/sqlmesh_padelnomics/models/staging/stg_income_usa.sql
+++ b/transform/sqlmesh_padelnomics/models/staging/stg_income_usa.sql
@@ -0,0 +1,39 @@
+-- US Census ACS 5-year state-level median household income (B19013_001E).
+-- One row per (state_fips, ref_year). Income stored in nominal USD.
+-- PPS conversion to EU-comparable scale happens in dim_locations.
+--
+-- Source: data/landing/census_usa/{year}/{month}/acs5_state_income.json.gz
+-- Extracted by: census_usa_income.py
+
+MODEL (
+  name staging.stg_income_usa,
+  kind FULL,
+  cron '@daily',
+  grain (state_fips, ref_year)
+);
+
+WITH parsed AS (
+  SELECT
+    row ->> 'state_fips'                             AS state_fips,
+    row ->> 'state_name'                             AS state_name,
+    TRY_CAST(row ->> 'median_income_usd' AS DOUBLE)  AS median_income_usd,
+    TRY_CAST(row ->> 'ref_year'          AS INTEGER) AS ref_year,
+    CURRENT_DATE                                     AS extracted_date
+  FROM (
+    SELECT UNNEST(rows) AS row
+    FROM read_json(
+      @LANDING_DIR || '/census_usa/*/*/acs5_state_income.json.gz',
+      auto_detect = true
+    )
+  )
+  WHERE (row ->> 'state_fips') IS NOT NULL
+)
+SELECT
+  state_fips,
+  TRIM(state_name)    AS state_name,
+  median_income_usd,
+  ref_year,
+  extracted_date
+FROM parsed
+WHERE median_income_usd IS NOT NULL
+  AND median_income_usd > 0
--- a/transform/sqlmesh_padelnomics/models/staging/stg_nuts2_boundaries.sql
+++ b/transform/sqlmesh_padelnomics/models/staging/stg_nuts2_boundaries.sql
@@ -0,0 +1,47 @@
+-- NUTS-2 boundary polygons from Eurostat GISCO for spatial income resolution.
+-- Used by dim_locations to resolve each location's (lat, lon) to its NUTS-2
+-- region via ST_Contains point-in-polygon join.
+--
+-- Source: data/landing/gisco/2024/01/nuts2_boundaries.geojson
+-- Downloaded by: scripts/download_gisco_nuts.py (run once; re-run on NUTS revision)
+-- Format: GeoJSON FeatureCollection with NUTS_ID, CNTR_CODE, NAME_LATN properties
+-- NUTS revision: 2021 (effective 2024-01-01); changes ~every 7 years
+--
+-- Coverage: EU-27 + EFTA (NO, CH, IS, LI) + UK + candidate countries
+-- ~242 NUTS-2 polygons at 1:20,000,000 scale (simplified for performance)
+
+MODEL (
+  name staging.stg_nuts2_boundaries,
+  kind FULL,
+  cron '@daily',
+  grain nuts2_code
+);
+
+WITH raw AS (
+  -- ST_Read reads GeoJSON via GDAL. File must be uncompressed (.geojson, not .gz).
+  -- Path is fixed (not glob) because the boundary file is a one-time download,
+  -- not a time-partitioned series. The partition "2024/01" reflects the NUTS
+  -- revision year, not the download date.
+  SELECT *
+  FROM ST_Read(@LANDING_DIR || '/gisco/2024/01/nuts2_boundaries.geojson')
+)
+SELECT
+  NUTS_ID                                      AS nuts2_code,
+  -- Normalise country prefix to ISO 3166-1 alpha-2: EL→GR, UK→GB
+  CASE CNTR_CODE
+    WHEN 'EL' THEN 'GR'
+    WHEN 'UK' THEN 'GB'
+    ELSE CNTR_CODE
+  END                                          AS country_code,
+  NAME_LATN                                    AS region_name,
+  geom                                         AS geometry,
+  -- Pre-compute bounding box for efficient spatial pre-filter in dim_locations.
+  -- Each location only calls ST_Contains against the ~3-5 polygons whose bbox
+  -- overlaps it, not all 242.
+  ST_YMin(geom)                                AS bbox_lat_min,
+  ST_YMax(geom)                                AS bbox_lat_max,
+  ST_XMin(geom)                                AS bbox_lon_min,
+  ST_XMax(geom)                                AS bbox_lon_max
+FROM raw
+WHERE NUTS_ID IS NOT NULL
+  AND LENGTH(NUTS_ID) = 4  -- NUTS-2 codes are exactly 4 characters