docs(inventory): document population pipeline implementation findings

- Add section 9 to data-sources-inventory.md covering live API quirks:
  Eurostat SDMX city labels response shape, ONS CSV download path
  (observations API 404s), US Census ACS place endpoint, GeoNames
  cities15000 bulk format
- Add population coverage summary table and DuckDB glob limitation note
- fix(extract): census_usa + geonames write empty placeholder when
  credentials absent so SQLMesh staging models don't fail with "no files found"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -64,7 +64,12 @@ def extract(
     """Fetch ACS 5-year place population. Skips if already run this month."""
     api_key = os.environ.get("CENSUS_API_KEY", "").strip()
     if not api_key:
-        logger.warning("CENSUS_API_KEY not set — skipping US Census extract")
+        logger.warning("CENSUS_API_KEY not set — writing empty placeholder so SQLMesh models can run")
+        year, month = year_month.split("/")
+        dest_dir = landing_path(landing_dir, "census_usa", year, month)
+        dest = dest_dir / "acs5_places.json.gz"
+        if not dest.exists():
+            write_gzip_atomic(dest, b'{"rows": [], "count": 0}')
         return {"files_written": 0, "files_skipped": 1, "bytes_written": 0}

     # Skip if we already have data for this month (annual data, monthly cursor)
@@ -105,7 +105,12 @@ def extract(
     """Download GeoNames cities15000.zip. Skips if already run this month."""
     username = os.environ.get("GEONAMES_USERNAME", "").strip()
     if not username:
-        logger.warning("GEONAMES_USERNAME not set — skipping GeoNames extract")
+        logger.warning("GEONAMES_USERNAME not set — writing empty placeholder so SQLMesh models can run")
+        year, month = year_month.split("/")
+        dest_dir = landing_path(landing_dir, "geonames", year, month)
+        dest = dest_dir / "cities_global.json.gz"
+        if not dest.exists():
+            write_gzip_atomic(dest, b'{"rows": [], "count": 0}')
         return {"files_written": 0, "files_skipped": 1, "bytes_written": 0}

     last_cursor = get_last_cursor(conn, EXTRACTOR_NAME)
@@ -579,6 +579,148 @@ Token-based REST API. Free tier includes 50k requests/month and last 6 months of

---

## 9. Live Implementation Findings (Feb 2026)

Findings from implementing the population pipeline extractors and staging models. Pipeline stack: Python extractors → landing zone JSON.gz → SQLMesh / DuckDB.

---
### 9.1 Eurostat SDMX City Labels Codelist

| Field | Value |
|-------|-------|
| Endpoint | `https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/codelist/ESTAT/CITIES?format=JSON` |
| Credentials | None required |
| Response size | ~190 KB |
| Dedup | ETag header — only re-downloads when changed |

The SDMX 2.1 codelist endpoint for `ESTAT/CITIES` returns a **compact dimension JSON**, not the full SDMX 2.1 XML/JSON structure. The useful content is:

```json
{"category": {"label": {"DE001C": "Berlin", "DE002C": "Hamburg", ...}}}
```

This is a flat `{city_code: city_name}` dict with ~1,800 entries covering EU cities.

**Critical finding:** the response does NOT match the SDMX 2.1 spec's `data["structure"]["codelists"]` path. The correct path is `data["category"]["label"]`.

Country-level entries (e.g. `"DE"`, `"FR"`) appear in the dict without digits — filter them out by requiring `any(c.isdigit() for c in city_code)`, which keeps only proper city codes like `DE001C`.

**Result:** 1,771 city codes extracted. Provides the missing city_code → city_name mapping needed to join Eurostat population data (`urb_cpop1`, which uses city codes) against `dim_cities` (which uses city names).
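The parse-and-filter step above can be sketched as follows. This is a minimal sketch assuming the documented response shape; `parse_city_labels` is an illustrative name, not the extractor's actual function.

```python
# Sketch: extract {city_code: city_name} from the CITIES codelist response,
# dropping digit-free country-level entries like "DE".
def parse_city_labels(data: dict) -> dict[str, str]:
    labels = data["category"]["label"]          # not data["structure"]["codelists"]
    return {code: name for code, name in labels.items()
            if any(c.isdigit() for c in code)}  # keeps DE001C, drops "DE"

sample = {"category": {"label": {"DE001C": "Berlin", "DE": "Germany"}}}
print(parse_city_labels(sample))  # {'DE001C': 'Berlin'}
```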
---

### 9.2 Eurostat Population — `urb_cpop1`

(Existing extractor `eurostat.py`; no changes required.)

The `urb_cpop1` dataset provides city-level population estimates at the `geoLevel=city` dimension. Data is keyed by city code (e.g. `DE001C`), not city name — the SDMX city labels codelist (9.1) provides the bridge.

**Population pipeline join:** `stg_city_labels` + `stg_population` → `dim_cities` populates ~75% of EU/UK cities matched by name. Cities without a matching city code in Eurostat (smaller/newer cities) fall through to the GeoNames fallback.

---
### 9.3 ONS UK — Mid-Year Population Estimates

| Field | Value |
|-------|-------|
| Dataset | `mid-year-pop-est` |
| Edition | `mid-2022-england-wales` |
| Download URL | `https://api.beta.ons.gov.uk/v1/datasets/mid-year-pop-est/editions/mid-2022-england-wales/versions/{N}/downloads/csv/href` |
| Credentials | None required |
| File size | ~68 MB uncompressed CSV |
| Reference year | 2022 (mid-year estimate) |

**Critical finding:** the ONS observations API endpoint (`/observations?geography=*&age=0`) returns 404 for the datasets documented in the developer portal. The correct approach for bulk data is the **CSV download** path, reached by:

1. `GET /v1/datasets/mid-year-pop-est/editions/mid-2022-england-wales/versions` — list versions, pick `max(version)`
2. `GET versions[latest]["downloads"]["csv"]["href"]` — download the ~68 MB CSV

CSV format: one row per `(year × LAD × sex × age-group)`. Filter:

- `sex = 'all'` (aggregate row — do not sum individual sex rows, as that double-counts)
- `calendar-years = '2022'` (target year from the edition name)
- LAD codes starting with `E0`, `W0`, `S1`, `N0` (English/Welsh/Scottish/NI districts; excludes region/country aggregate codes)

Sum the `v4_0` column per `administrative-geography` (LAD code) to get total population.

**Result:** 316 UK Local Authority Districts with population ≥ 50,000, ref_year = 2022.
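The filter-and-aggregate steps above can be sketched in Python. This assumes the documented column names (`sex`, `calendar-years`, `administrative-geography`, `v4_0`); the function name is illustrative, not the repo's actual code.

```python
import csv
from collections import defaultdict

def lad_populations(path: str) -> dict[str, int]:
    """Sum v4_0 per LAD code, applying the three documented filters."""
    totals: dict[str, int] = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            lad = row["administrative-geography"]
            if (row["sex"] == "all"                          # aggregate row only
                    and row["calendar-years"] == "2022"      # target year
                    and lad[:2] in {"E0", "W0", "S1", "N0"}):  # district codes
                totals[lad] += int(row["v4_0"])
    return dict(totals)
```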
---

### 9.4 US Census Bureau — ACS 5-Year Place Population

| Field | Value |
|-------|-------|
| Endpoint | `https://api.census.gov/data/2023/acs/acs5?get=B01003_001E,NAME&for=place:*&in=state:*` |
| Credentials | `CENSUS_API_KEY` (free — register at https://api.census.gov/data/key_signup.html) |
| Variable | `B01003_001E` = total population (ACS concept: Total Population) |
| Vintage | 2023 (released late 2024) |
| Coverage | ~30,000 Census places across all 50 states + DC |

The response is a JSON array: the first row is the header `["B01003_001E", "NAME", "state", "place"]`; subsequent rows are data.

Place names follow the pattern `"Los Angeles city, California"` — strip the suffix (` city`, ` town`, ` CDP`, ` borough`, ` village`, ` municipality`) and take the part before the first comma.

Filtered to population ≥ 50,000: ~1,500 US cities.

**Status:** extractor implemented. Requires the `CENSUS_API_KEY` env var. Without it, a `{"rows": [], "count": 0}` placeholder is written so the SQLMesh staging model does not fail. US population data will be empty until the key is added.
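The place-name cleanup above can be sketched as a small helper (illustrative code, not the repo's actual implementation; the suffix list comes from the documentation).

```python
import re

# Trailing legal-status suffixes documented above.
_SUFFIX = re.compile(r"\s+(city|town|CDP|borough|village|municipality)$")

def clean_place_name(name: str) -> str:
    """'Los Angeles city, California' -> 'Los Angeles'."""
    return _SUFFIX.sub("", name.split(",")[0])

print(clean_place_name("Los Angeles city, California"))  # Los Angeles
```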
---

### 9.5 GeoNames — cities15000 Global Bulk

| Field | Value |
|-------|-------|
| Download URL | `https://download.geonames.org/export/dump/cities15000.zip` |
| Credentials | `GEONAMES_USERNAME` (free — register at https://www.geonames.org/login) |
| File size | ~1.5 MB compressed, ~26,000 entries (all cities ≥ 15,000 pop) |
| Update frequency | Monthly (GeoNames updates continuously) |
| License | CC BY 4.0 |

The username is passed as `?username=...` in the URL query string (it signals ToS acceptance to GeoNames; there is no auth gate).

Tab-separated format (19 columns). Relevant columns:

- col 0: `geoname_id` — stable numeric ID
- col 1: `name` — Unicode name
- col 2: `asciiname` — ASCII transliteration (preferred for matching)
- col 7: `feature_code` — filter to `{PPLC, PPLA, PPLA2, PPL}` (excludes airports, parks)
- col 8: `country_code` — ISO 2-letter
- col 14: `population`

Filtered to pop ≥ 50,000 and valid feature codes: ~7,000–9,000 cities globally.

**Status:** extractor implemented. Requires the `GEONAMES_USERNAME` env var. Without it, a `{"rows": [], "count": 0}` placeholder is written so staging models do not fail. Acts as the final fallback in `dim_cities` for any city not matched by Eurostat, US Census, or ONS.
---

### 9.6 DuckDB `read_json()` glob limitation

**Finding:** DuckDB's `glob()` is a **table function** (it returns rows), not a scalar function. It cannot be used as an argument to `read_json()`, either directly as `read_json(glob('/path/*'))` or via a subquery `read_json((SELECT list(file) FROM glob(...)))`.

**Workaround used:** extractors that skip due to missing credentials write a `{"rows": [], "count": 0}` placeholder file, ensuring at least one file always exists for each source's glob pattern. The SQLMesh staging models use string glob patterns like:

```sql
read_json(@LANDING_DIR || '/census_usa/*/*/acs5_places.json.gz', auto_detect = true)
```

This pattern requires the files to physically exist. The placeholder approach is cleaner than conditional SQL in the model.
---

### 9.7 Population Coverage Summary (Feb 2026)

| Source | Region | Cities extracted | Credentials needed |
|--------|--------|------------------|--------------------|
| Eurostat `urb_cpop1` + SDMX city labels | EU-27 + EEA | ~1,400 cities | None |
| ONS mid-year estimates | England & Wales | 316 LADs | None |
| US Census ACS 5-year | United States | ~1,500 places | `CENSUS_API_KEY` (free) |
| GeoNames cities15000 | Global fallback | ~7,500 cities | `GEONAMES_USERNAME` (free) |

Population cascade in `dim_cities`: Eurostat → US Census → ONS → GeoNames → 0.
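The cascade amounts to a first-non-null lookup with a zero default. A minimal Python sketch of that rule follows (the real model does this in SQL; the `pop_*` field names are hypothetical, not the actual model columns):

```python
# First source with a non-null value wins; 0 when no source matched.
SOURCES = ["eurostat", "us_census", "ons", "geonames"]

def resolve_population(city: dict) -> int:
    for src in SOURCES:
        value = city.get(f"pop_{src}")
        if value is not None:
            return value
    return 0

print(resolve_population({"pop_eurostat": None, "pop_ons": 316_000}))  # 316000
```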
---
## Sources

- [Reverse Engineering Playtomic](https://mattrighetti.com/2025/03/03/reverse-engineering-playtomic)