docs(inventory): document population pipeline implementation findings

- Add section 9 to data-sources-inventory.md covering live API quirks:
  Eurostat SDMX city labels response shape, ONS CSV download path
  (observations API 404s), US Census ACS place endpoint, GeoNames
  cities15000 bulk format
- Add population coverage summary table and DuckDB glob limitation note
- fix(extract): census_usa + geonames write empty placeholder when
  credentials absent so SQLMesh staging models don't fail with "no files found"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -64,7 +64,12 @@ def extract(
     """Fetch ACS 5-year place population. Skips if already run this month."""
     api_key = os.environ.get("CENSUS_API_KEY", "").strip()
     if not api_key:
-        logger.warning("CENSUS_API_KEY not set — skipping US Census extract")
+        logger.warning("CENSUS_API_KEY not set — writing empty placeholder so SQLMesh models can run")
+        year, month = year_month.split("/")
+        dest_dir = landing_path(landing_dir, "census_usa", year, month)
+        dest = dest_dir / "acs5_places.json.gz"
+        if not dest.exists():
+            write_gzip_atomic(dest, b'{"rows": [], "count": 0}')
         return {"files_written": 0, "files_skipped": 1, "bytes_written": 0}

     # Skip if we already have data for this month (annual data, monthly cursor)
@@ -105,7 +105,12 @@ def extract(
     """Download GeoNames cities15000.zip. Skips if already run this month."""
     username = os.environ.get("GEONAMES_USERNAME", "").strip()
     if not username:
-        logger.warning("GEONAMES_USERNAME not set — skipping GeoNames extract")
+        logger.warning("GEONAMES_USERNAME not set — writing empty placeholder so SQLMesh models can run")
+        year, month = year_month.split("/")
+        dest_dir = landing_path(landing_dir, "geonames", year, month)
+        dest = dest_dir / "cities_global.json.gz"
+        if not dest.exists():
+            write_gzip_atomic(dest, b'{"rows": [], "count": 0}')
         return {"files_written": 0, "files_skipped": 1, "bytes_written": 0}

     last_cursor = get_last_cursor(conn, EXTRACTOR_NAME)
@@ -579,6 +579,148 @@ Token-based REST API. Free tier includes 50k requests/month and last 6 months of

---

## 9. Live Implementation Findings (Feb 2026)

Findings from implementing the population pipeline extractors and staging models. Pipeline stack: Python extractors → landing zone JSON.gz → SQLMesh / DuckDB.

---
### 9.1 Eurostat SDMX City Labels Codelist

| Field | Value |
|-------|-------|
| Endpoint | `https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/codelist/ESTAT/CITIES?format=JSON` |
| Credentials | None required |
| Response size | ~190 KB |
| Dedup | ETag header — only re-downloads when changed |

The SDMX 2.1 codelist endpoint for `ESTAT/CITIES` returns a **compact dimension JSON**, not the full SDMX 2.1 XML/JSON structure. The useful content is:

```json
{"category": {"label": {"DE001C": "Berlin", "DE002C": "Hamburg", ...}}}
```

This is a flat `{city_code: city_name}` dict with ~1,800 entries covering EU cities.

**Critical finding:** the response does NOT match the SDMX 2.1 spec's `data["structure"]["codelists"]` path. The correct path is `data["category"]["label"]`.

Country-level entries (e.g. `"DE"`, `"FR"`) appear in the dict without digits — filter them out by requiring `any(c.isdigit() for c in city_code)`, which keeps only proper city codes like `DE001C`.

**Result:** 1,771 city codes extracted. Provides the missing city_code → city_name mapping needed to join Eurostat population data (`urb_cpop1`, which uses city codes) against `dim_cities` (which uses city names).
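The parse-and-filter step above can be sketched as follows. This is a minimal sketch assuming the documented response shape; `parse_city_labels` is an illustrative name, not the extractor's actual function.

```python
# Sketch: extract {city_code: city_name} from the CITIES codelist response,
# dropping digit-free country-level entries like "DE".
def parse_city_labels(data: dict) -> dict[str, str]:
    labels = data["category"]["label"]          # not data["structure"]["codelists"]
    return {code: name for code, name in labels.items()
            if any(c.isdigit() for c in code)}  # keeps DE001C, drops "DE"

sample = {"category": {"label": {"DE001C": "Berlin", "DE": "Germany"}}}
print(parse_city_labels(sample))  # {'DE001C': 'Berlin'}
```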
---

### 9.2 Eurostat Population — `urb_cpop1`

(Existing extractor `eurostat.py`; no changes required.)

The `urb_cpop1` dataset provides city-level population estimates at the `geoLevel=city` dimension. Data is keyed by city code (e.g. `DE001C`), not city name — the SDMX city labels codelist (9.1) provides the bridge.

**Population pipeline join:** `stg_city_labels` + `stg_population` → `dim_cities` populates ~75% of EU/UK cities matched by name. Cities without a matching city code in Eurostat (smaller/newer cities) fall through to the GeoNames fallback.

---
### 9.3 ONS UK — Mid-Year Population Estimates

| Field | Value |
|-------|-------|
| Dataset | `mid-year-pop-est` |
| Edition | `mid-2022-england-wales` |
| Download URL | `https://api.beta.ons.gov.uk/v1/datasets/mid-year-pop-est/editions/mid-2022-england-wales/versions/{N}/downloads/csv/href` |
| Credentials | None required |
| File size | ~68 MB uncompressed CSV |
| Reference year | 2022 (mid-year estimate) |

**Critical finding:** the ONS observations API endpoint (`/observations?geography=*&age=0`) returns 404 for the datasets documented in the developer portal. The correct approach for bulk data is the **CSV download** path, reached by:

1. `GET /v1/datasets/mid-year-pop-est/editions/mid-2022-england-wales/versions` — list versions, pick `max(version)`
2. `GET versions[latest]["downloads"]["csv"]["href"]` — download the ~68 MB CSV

CSV format: one row per `(year × LAD × sex × age-group)`. Filter:

- `sex = 'all'` (aggregate row — do not sum individual sex rows, as that double-counts)
- `calendar-years = '2022'` (target year from the edition name)
- LAD codes starting with `E0`, `W0`, `S1`, `N0` (English/Welsh/Scottish/NI districts; excludes region/country aggregate codes)

Sum the `v4_0` column per `administrative-geography` (LAD code) to get total population.

**Result:** 316 UK Local Authority Districts with population ≥ 50,000, ref_year = 2022.
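The filter-and-aggregate steps above can be sketched in Python. This assumes the documented column names (`sex`, `calendar-years`, `administrative-geography`, `v4_0`); the function name is illustrative, not the repo's actual code.

```python
import csv
from collections import defaultdict

def lad_populations(path: str) -> dict[str, int]:
    """Sum v4_0 per LAD code, applying the three documented filters."""
    totals: dict[str, int] = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            lad = row["administrative-geography"]
            if (row["sex"] == "all"                          # aggregate row only
                    and row["calendar-years"] == "2022"      # target year
                    and lad[:2] in {"E0", "W0", "S1", "N0"}):  # district codes
                totals[lad] += int(row["v4_0"])
    return dict(totals)
```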
---

### 9.4 US Census Bureau — ACS 5-Year Place Population

| Field | Value |
|-------|-------|
| Endpoint | `https://api.census.gov/data/2023/acs/acs5?get=B01003_001E,NAME&for=place:*&in=state:*` |
| Credentials | `CENSUS_API_KEY` (free — register at https://api.census.gov/data/key_signup.html) |
| Variable | `B01003_001E` = total population (ACS concept: Total Population) |
| Vintage | 2023 (released late 2024) |
| Coverage | ~30,000 Census places across all 50 states + DC |

The response is a JSON array: the first row is the header `["B01003_001E", "NAME", "state", "place"]`; subsequent rows are data.

Place names follow the pattern `"Los Angeles city, California"` — strip the suffix (` city`, ` town`, ` CDP`, ` borough`, ` village`, ` municipality`) and take the part before the first comma.

Filtered to population ≥ 50,000: ~1,500 US cities.

**Status:** extractor implemented. Requires the `CENSUS_API_KEY` env var. Without it, a `{"rows": [], "count": 0}` placeholder is written so the SQLMesh staging model does not fail. US population data will be empty until the key is added.
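The place-name cleanup above can be sketched as a small helper (illustrative code, not the repo's actual implementation; the suffix list comes from the documentation).

```python
import re

# Trailing legal-status suffixes documented above.
_SUFFIX = re.compile(r"\s+(city|town|CDP|borough|village|municipality)$")

def clean_place_name(name: str) -> str:
    """'Los Angeles city, California' -> 'Los Angeles'."""
    return _SUFFIX.sub("", name.split(",")[0])

print(clean_place_name("Los Angeles city, California"))  # Los Angeles
```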
---

### 9.5 GeoNames — cities15000 Global Bulk

| Field | Value |
|-------|-------|
| Download URL | `https://download.geonames.org/export/dump/cities15000.zip` |
| Credentials | `GEONAMES_USERNAME` (free — register at https://www.geonames.org/login) |
| File size | ~1.5 MB compressed, ~26,000 entries (all cities ≥ 15,000 pop) |
| Update frequency | Monthly (GeoNames updates continuously) |
| License | CC BY 4.0 |

The username is passed as `?username=...` in the URL query string (it signals ToS acceptance to GeoNames; there is no auth gate).

Tab-separated format (19 columns). Relevant columns:

- col 0: `geoname_id` — stable numeric ID
- col 1: `name` — Unicode name
- col 2: `asciiname` — ASCII transliteration (preferred for matching)
- col 7: `feature_code` — filter to `{PPLC, PPLA, PPLA2, PPL}` (excludes airports, parks)
- col 8: `country_code` — ISO 2-letter
- col 14: `population`

Filtered to pop ≥ 50,000 and valid feature codes: ~7,000–9,000 cities globally.

**Status:** extractor implemented. Requires the `GEONAMES_USERNAME` env var. Without it, a `{"rows": [], "count": 0}` placeholder is written so staging models do not fail. Acts as the final fallback in `dim_cities` for any city not matched by Eurostat, US Census, or ONS.
---

### 9.6 DuckDB `read_json()` glob limitation

**Finding:** DuckDB's `glob()` is a **table function** (it returns rows), not a scalar function. It cannot be used as an argument to `read_json()`, either directly as `read_json(glob('/path/*'))` or via a subquery `read_json((SELECT list(file) FROM glob(...)))`.

**Workaround used:** extractors that skip due to missing credentials write a `{"rows": [], "count": 0}` placeholder file, ensuring at least one file always exists for each source's glob pattern. The SQLMesh staging models use string glob patterns like:

```sql
read_json(@LANDING_DIR || '/census_usa/*/*/acs5_places.json.gz', auto_detect = true)
```

This pattern requires the files to physically exist. The placeholder approach is cleaner than conditional SQL in the model.
---

### 9.7 Population Coverage Summary (Feb 2026)

| Source | Region | Cities extracted | Credentials needed |
|--------|--------|------------------|--------------------|
| Eurostat `urb_cpop1` + SDMX city labels | EU-27 + EEA | ~1,400 cities | None |
| ONS mid-year estimates | England & Wales | 316 LADs | None |
| US Census ACS 5-year | United States | ~1,500 places | `CENSUS_API_KEY` (free) |
| GeoNames cities15000 | Global fallback | ~7,500 cities | `GEONAMES_USERNAME` (free) |

Population cascade in `dim_cities`: Eurostat → US Census → ONS → GeoNames → 0.
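The cascade amounts to a first-non-null lookup with a zero default. A minimal Python sketch of that rule follows (the real model does this in SQL; the `pop_*` field names are hypothetical, not the actual model columns):

```python
# First source with a non-null value wins; 0 when no source matched.
SOURCES = ["eurostat", "us_census", "ons", "geonames"]

def resolve_population(city: dict) -> int:
    for src in SOURCES:
        value = city.get(f"pop_{src}")
        if value is not None:
            return value
    return 0

print(resolve_population({"pop_eurostat": None, "pop_ons": 316_000}))  # 316000
```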
---
## Sources

- [Reverse Engineering Playtomic](https://mattrighetti.com/2025/03/03/reverse-engineering-playtomic)