docs(inventory): document population pipeline implementation findings

- Add section 9 to data-sources-inventory.md covering live API quirks:
  Eurostat SDMX city labels response shape, ONS CSV download path (observations
  API 404s), US Census ACS place endpoint, GeoNames cities15000 bulk format
- Add population coverage summary table and DuckDB glob limitation note
- fix(extract): census_usa + geonames write empty placeholder when credentials
  absent so SQLMesh staging models don't fail with "no files found"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Deeman
2026-02-24 01:01:10 +01:00
parent 06cbdf80dc
commit 1762188f08
3 changed files with 154 additions and 2 deletions

View File

@@ -579,6 +579,148 @@ Token-based REST API. Free tier includes 50k requests/month and last 6 months of
---
## 9. Live Implementation Findings (Feb 2026)
Findings from implementing the population pipeline extractors and staging models.
Pipeline stack: Python extractors → landing zone JSON.gz → SQLMesh / DuckDB.
---
### 9.1 Eurostat SDMX City Labels Codelist
| Field | Value |
|-------|-------|
| Endpoint | `https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/codelist/ESTAT/CITIES?format=JSON` |
| Credentials | None required |
| Response size | ~190 KB |
| Dedup | ETag header — only re-downloads when changed |
The SDMX 2.1 codelist endpoint for `ESTAT/CITIES` returns a **compact dimension JSON**, not the SDMX 2.1 full XML/JSON structure. The useful content is:
```json
{"category": {"label": {"DE001C": "Berlin", "DE002C": "Hamburg", ...}}}
```
This is a flat `{city_code: city_name}` dict with ~1,800 entries covering EU cities.
**Critical finding**: The response does NOT match the SDMX 2.1 spec's `data["structure"]["codelists"]` path. The correct path is `data["category"]["label"]`.
Country-level entries (e.g. `"DE"`, `"FR"`) appear in the dict without digits — filter them by requiring `any(c.isdigit() for c in city_code)` to keep only proper city codes like `DE001C`.
**Result:** 1,771 city codes extracted. Provides the missing city_code → city_name mapping needed to join Eurostat population data (`urb_cpop1`, which uses city codes) against dim_cities (which uses city names).
---
### 9.2 Eurostat Population — `urb_cpop1`
(Existing extractor `eurostat.py`; no changes required.)
The `urb_cpop1` dataset provides city-level population estimates at the `geoLevel=city` dimension. Data is keyed by city code (e.g. `DE001C`), not city name — the SDMX city labels codelist (9.1) provides the bridge.
**Population pipeline join:** `stg_city_labels` + `stg_population``dim_cities` populates `~75%` of EU/UK cities matched by name. Cities without a matching city code in Eurostat (smaller/newer cities) fall through to the GeoNames fallback.
---
### 9.3 ONS UK — Mid-Year Population Estimates
| Field | Value |
|-------|-------|
| Dataset | `mid-year-pop-est` |
| Edition | `mid-2022-england-wales` |
| Download URL | `https://api.beta.ons.gov.uk/v1/datasets/mid-year-pop-est/editions/mid-2022-england-wales/versions/{N}/downloads/csv/href` |
| Credentials | None required |
| File size | ~68 MB uncompressed CSV |
| Reference year | 2022 (mid-year estimate) |
**Critical finding**: The ONS observations API endpoint (`/observations?geography=*&age=0`) returns 404 for the datasets documented in the developer portal. The correct approach for bulk data is the **CSV download** path, reached by:
1. `GET /v1/datasets/mid-year-pop-est/editions/mid-2022-england-wales/versions` — list versions, pick `max(version)`
2. `GET versions[latest]["downloads"]["csv"]["href"]` — download the ~68 MB CSV
CSV format: one row per `(year × LAD × sex × age-group)`. Filter:
- `sex = 'all'` (aggregate row — do not sum individual sex rows as that double-counts)
- `calendar-years = '2022'` (target year from edition name)
- LAD codes starting with `E0`, `W0`, `S1`, `N0` (English/Welsh/Scottish/NI districts; excludes region/country aggregate codes)
Sum the `v4_0` column per `administrative-geography` (LAD code) to get total population.
**Result:** 316 UK Local Authority Districts with population ≥ 50,000, ref_year = 2022.
---
### 9.4 US Census Bureau — ACS 5-Year Place Population
| Field | Value |
|-------|-------|
| Endpoint | `https://api.census.gov/data/2023/acs/acs5?get=B01003_001E,NAME&for=place:*&in=state:*` |
| Credentials | `CENSUS_API_KEY` (free — register at https://api.census.gov/data/key_signup.html) |
| Variable | `B01003_001E` = total population (ACS concept: Total Population) |
| Vintage | 2023 (released late 2024) |
| Coverage | ~30,000 Census places across all 50 states + DC |
Response is a JSON array: first row = headers `["B01003_001E", "NAME", "state", "place"]`, subsequent rows = data.
Place names follow the pattern `"Los Angeles city, California"` — strip the suffix (` city`, ` town`, ` CDP`, ` borough`, ` village`, ` municipality`) and take the part before the first comma.
Filtered to population ≥ 50,000: ~1,500 US cities.
**Status:** Extractor implemented. Requires `CENSUS_API_KEY` env var. Without it, a `{"rows": [], "count": 0}` placeholder is written so the SQLMesh staging model does not fail. US population data will be empty until the key is added.
---
### 9.5 GeoNames — cities15000 Global Bulk
| Field | Value |
|-------|-------|
| Download URL | `https://download.geonames.org/export/dump/cities15000.zip` |
| Credentials | `GEONAMES_USERNAME` (free — register at https://www.geonames.org/login) |
| File size | ~1.5 MB compressed, ~26,000 entries (all cities ≥ 15,000 pop) |
| Update frequency | Monthly (GeoNames updates continuously) |
| License | CC BY 4.0 |
The username is passed as `?username=...` in the URL query string (signals ToS acceptance to GeoNames; no auth gate).
Tab-separated format (19 columns). Relevant columns:
- col 0: `geoname_id` — stable numeric ID
- col 1: `name` — Unicode name
- col 2: `asciiname` — ASCII transliteration (preferred for matching)
- col 7: `feature_code` — filter to `{PPLC, PPLA, PPLA2, PPL}` (excludes airports, parks)
- col 8: `country_code` — ISO 2-letter
- col 14: `population`
Filtered to pop ≥ 50,000 and valid feature codes: ~7,0009,000 cities globally.
**Status:** Extractor implemented. Requires `GEONAMES_USERNAME` env var. Without it, a `{"rows": [], "count": 0}` placeholder is written so staging models do not fail. Acts as the final fallback in `dim_cities` for any city not matched by Eurostat, US Census, or ONS.
---
### 9.6 DuckDB `read_json()` glob limitation
**Finding:** DuckDB's `glob()` is a **table function** (returns rows), not a scalar function. It cannot be used as an argument to `read_json()` as `read_json(glob('/path/*'))` or via a subquery `read_json((SELECT list(file) FROM glob(...)))`.
**Workaround used:** Extractors that skip due to missing credentials write a `{"rows": [], "count": 0}` placeholder file, ensuring at least one file always exists for each source's glob pattern. The SQLMesh staging models use string glob patterns like:
```sql
read_json(@LANDING_DIR || '/census_usa/*/*/acs5_places.json.gz', auto_detect = true)
```
This pattern requires the files to physically exist. The placeholder approach is cleaner than conditional SQL in the model.
---
### 9.7 Population Coverage Summary (Feb 2026)
| Source | Region | Cities extracted | Credentials needed |
|--------|--------|------------------|--------------------|
| Eurostat `urb_cpop1` + SDMX city labels | EU-27 + EEA | ~1,400 cities | None |
| ONS mid-year estimates | England & Wales | 316 LADs | None |
| US Census ACS 5-year | United States | ~1,500 places | `CENSUS_API_KEY` (free) |
| GeoNames cities15000 | Global fallback | ~7,500 cities | `GEONAMES_USERNAME` (free) |
Population cascade in `dim_cities`: Eurostat → US Census → ONS → GeoNames → 0.
---
## Sources
- [Reverse Engineering Playtomic](https://mattrighetti.com/2025/03/03/reverse-engineering-playtomic)