docs(inventory): document population pipeline implementation findings

- Add section 9 to data-sources-inventory.md covering live API quirks: Eurostat SDMX city labels response shape, ONS CSV download path (observations API 404s), US Census ACS place endpoint, GeoNames cities15000 bulk format - Add population coverage summary table and DuckDB glob limitation note - fix(extract): census_usa + geonames write empty placeholder when credentials absent so SQLMesh staging models don't fail with "no files found" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 01:01:10 +01:00
parent 06cbdf80dc
commit 1762188f08
3 changed files with 154 additions and 2 deletions
--- a/research/data-sources-inventory.md
+++ b/research/data-sources-inventory.md
@@ -579,6 +579,148 @@ Token-based REST API. Free tier includes 50k requests/month and last 6 months of

 ---

+## 9. Live Implementation Findings (Feb 2026)
+
+Findings from implementing the population pipeline extractors and staging models.
+Pipeline stack: Python extractors → landing zone JSON.gz → SQLMesh / DuckDB.
+
+---
+
+### 9.1 Eurostat SDMX City Labels Codelist
+
+| Field | Value |
+|-------|-------|
+| Endpoint | `https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/codelist/ESTAT/CITIES?format=JSON` |
+| Credentials | None required |
+| Response size | ~190 KB |
+| Dedup | ETag header — only re-downloads when changed |
+
+The SDMX 2.1 codelist endpoint for `ESTAT/CITIES` returns a **compact dimension JSON**, not the SDMX 2.1 full XML/JSON structure. The useful content is:
+
+```json
+{"category": {"label": {"DE001C": "Berlin", "DE002C": "Hamburg", ...}}}
+```
+
+This is a flat `{city_code: city_name}` dict with ~1,800 entries covering EU cities.
+
+**Critical finding**: The response does NOT match the SDMX 2.1 spec's `data["structure"]["codelists"]` path. The correct path is `data["category"]["label"]`.
+
+Country-level entries (e.g. `"DE"`, `"FR"`) appear in the dict without digits — filter them by requiring `any(c.isdigit() for c in city_code)` to keep only proper city codes like `DE001C`.
+
+**Result:** 1,771 city codes extracted. Provides the missing city_code → city_name mapping needed to join Eurostat population data (`urb_cpop1`, which uses city codes) against dim_cities (which uses city names).
+
+---
+
+### 9.2 Eurostat Population — `urb_cpop1`
+
+(Existing extractor `eurostat.py`; no changes required.)
+
+The `urb_cpop1` dataset provides city-level population estimates at the `geoLevel=city` dimension. Data is keyed by city code (e.g. `DE001C`), not city name — the SDMX city labels codelist (9.1) provides the bridge.
+
+**Population pipeline join:** `stg_city_labels` + `stg_population` → `dim_cities` populates `~75%` of EU/UK cities matched by name. Cities without a matching city code in Eurostat (smaller/newer cities) fall through to the GeoNames fallback.
+
+---
+
+### 9.3 ONS UK — Mid-Year Population Estimates
+
+| Field | Value |
+|-------|-------|
+| Dataset | `mid-year-pop-est` |
+| Edition | `mid-2022-england-wales` |
+| Download URL | `https://api.beta.ons.gov.uk/v1/datasets/mid-year-pop-est/editions/mid-2022-england-wales/versions/{N}/downloads/csv/href` |
+| Credentials | None required |
+| File size | ~68 MB uncompressed CSV |
+| Reference year | 2022 (mid-year estimate) |
+
+**Critical finding**: The ONS observations API endpoint (`/observations?geography=*&age=0`) returns 404 for the datasets documented in the developer portal. The correct approach for bulk data is the **CSV download** path, reached by:
+
+1. `GET /v1/datasets/mid-year-pop-est/editions/mid-2022-england-wales/versions` — list versions, pick `max(version)`
+2. `GET versions[latest]["downloads"]["csv"]["href"]` — download the ~68 MB CSV
+
+CSV format: one row per `(year × LAD × sex × age-group)`. Filter:
+- `sex = 'all'` (aggregate row — do not sum individual sex rows as that double-counts)
+- `calendar-years = '2022'` (target year from edition name)
+- LAD codes starting with `E0`, `W0`, `S1`, `N0` (English/Welsh/Scottish/NI districts; excludes region/country aggregate codes)
+
+Sum the `v4_0` column per `administrative-geography` (LAD code) to get total population.
+
+**Result:** 316 UK Local Authority Districts with population ≥ 50,000, ref_year = 2022.
+
+---
+
+### 9.4 US Census Bureau — ACS 5-Year Place Population
+
+| Field | Value |
+|-------|-------|
+| Endpoint | `https://api.census.gov/data/2023/acs/acs5?get=B01003_001E,NAME&for=place:*&in=state:*` |
+| Credentials | `CENSUS_API_KEY` (free — register at https://api.census.gov/data/key_signup.html) |
+| Variable | `B01003_001E` = total population (ACS concept: Total Population) |
+| Vintage | 2023 (released late 2024) |
+| Coverage | ~30,000 Census places across all 50 states + DC |
+
+Response is a JSON array: first row = headers `["B01003_001E", "NAME", "state", "place"]`, subsequent rows = data.
+
+Place names follow the pattern `"Los Angeles city, California"` — strip the suffix (` city`, ` town`, ` CDP`, ` borough`, ` village`, ` municipality`) and take the part before the first comma.
+
+Filtered to population ≥ 50,000: ~1,500 US cities.
+
+**Status:** Extractor implemented. Requires `CENSUS_API_KEY` env var. Without it, a `{"rows": [], "count": 0}` placeholder is written so the SQLMesh staging model does not fail. US population data will be empty until the key is added.
+
+---
+
+### 9.5 GeoNames — cities15000 Global Bulk
+
+| Field | Value |
+|-------|-------|
+| Download URL | `https://download.geonames.org/export/dump/cities15000.zip` |
+| Credentials | `GEONAMES_USERNAME` (free — register at https://www.geonames.org/login) |
+| File size | ~1.5 MB compressed, ~26,000 entries (all cities ≥ 15,000 pop) |
+| Update frequency | Monthly (GeoNames updates continuously) |
+| License | CC BY 4.0 |
+
+The username is passed as `?username=...` in the URL query string (signals ToS acceptance to GeoNames; no auth gate).
+
+Tab-separated format (19 columns). Relevant columns:
+- col 0: `geoname_id` — stable numeric ID
+- col 1: `name` — Unicode name
+- col 2: `asciiname` — ASCII transliteration (preferred for matching)
+- col 7: `feature_code` — filter to `{PPLC, PPLA, PPLA2, PPL}` (excludes airports, parks)
+- col 8: `country_code` — ISO 2-letter
+- col 14: `population`
+
+Filtered to pop ≥ 50,000 and valid feature codes: ~7,000–9,000 cities globally.
+
+**Status:** Extractor implemented. Requires `GEONAMES_USERNAME` env var. Without it, a `{"rows": [], "count": 0}` placeholder is written so staging models do not fail. Acts as the final fallback in `dim_cities` for any city not matched by Eurostat, US Census, or ONS.
+
+---
+
+### 9.6 DuckDB `read_json()` glob limitation
+
+**Finding:** DuckDB's `glob()` is a **table function** (returns rows), not a scalar function. It cannot be used as an argument to `read_json()` as `read_json(glob('/path/*'))` or via a subquery `read_json((SELECT list(file) FROM glob(...)))`.
+
+**Workaround used:** Extractors that skip due to missing credentials write a `{"rows": [], "count": 0}` placeholder file, ensuring at least one file always exists for each source's glob pattern. The SQLMesh staging models use string glob patterns like:
+
+```sql
+read_json(@LANDING_DIR || '/census_usa/*/*/acs5_places.json.gz', auto_detect = true)
+```
+
+This pattern requires the files to physically exist. The placeholder approach is cleaner than conditional SQL in the model.
+
+---
+
+### 9.7 Population Coverage Summary (Feb 2026)
+
+| Source | Region | Cities extracted | Credentials needed |
+|--------|--------|------------------|--------------------|
+| Eurostat `urb_cpop1` + SDMX city labels | EU-27 + EEA | ~1,400 cities | None |
+| ONS mid-year estimates | England & Wales | 316 LADs | None |
+| US Census ACS 5-year | United States | ~1,500 places | `CENSUS_API_KEY` (free) |
+| GeoNames cities15000 | Global fallback | ~7,500 cities | `GEONAMES_USERNAME` (free) |
+
+Population cascade in `dim_cities`: Eurostat → US Census → ONS → GeoNames → 0.
+
+---
+
 ## Sources

 - [Reverse Engineering Playtomic](https://mattrighetti.com/2025/03/03/reverse-engineering-playtomic)