Deeman 0960990373 feat(data): Sprint 1-5 population pipeline — city labels, US/UK/Global extractors
Part A: Data Layer — Sprints 1-5

Sprint 1 — Eurostat SDMX city labels (unblocks EU population):
- New extractor: eurostat_city_labels.py — fetches ESTAT/CITIES codelist
  (city_code → city_name mapping) with ETag dedup
- New staging model: stg_city_labels.sql — grain city_code
- Updated dim_cities.sql — joins Eurostat population via city code lookup;
  replaces hardcoded 0::BIGINT population

Sprint 2 — Market score formula v2:
- city_market_profile.sql: 30pt population (LN/1M), 25pt income PPS (/200),
  30pt demand (occupancy or density), 15pt data confidence
- Moved venue_pricing_benchmarks join into base CTE so median_occupancy_rate
  is available to the scoring formula

Sprint 3 — US Census ACS extractor:
- New extractor: census_usa.py — ACS 5-year place population (vintage 2023)
- New staging model: stg_population_usa.sql — grain (place_fips, ref_year)

Sprint 4 — ONS UK extractor:
- New extractor: ons_uk.py — 2021 Census LAD population via ONS beta API
- New staging model: stg_population_uk.sql — grain (lad_code, ref_year)

Sprint 5 — GeoNames global extractor:
- New extractor: geonames.py — cities15000.zip bulk download, filtered to ≥50K pop
- New staging model: stg_population_geonames.sql — grain geoname_id
- dim_cities: 5-source population cascade (Eurostat > Census > ONS > GeoNames > 0)
  with case/whitespace-insensitive city name matching

Registered all 4 new CLI entrypoints in pyproject.toml and all.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 00:07:08 +01:00

Padelnomics Extraction

Fetches raw data from external sources to the local landing zone. The pipeline then reads from the landing zone — extraction and transformation are fully decoupled.

Running

# Run all extractors sequentially
LANDING_DIR=data/landing uv run extract

# Run a single extractor
LANDING_DIR=data/landing uv run extract-overpass
LANDING_DIR=data/landing uv run extract-eurostat
LANDING_DIR=data/landing uv run extract-playtomic-tenants
LANDING_DIR=data/landing uv run extract-playtomic-availability

Architecture: one file per source

Each data source lives in its own module with a dedicated CLI entry point:

src/padelnomics_extract/
├── __init__.py
├── _shared.py                  # LANDING_DIR, logger, run_extractor() wrapper
├── utils.py                    # SQLite state tracking, atomic I/O helpers
├── overpass.py                 # OSM padel courts via Overpass API
├── eurostat.py                 # Eurostat city demographics (urb_cpop1, ilc_di03)
├── playtomic_tenants.py        # Playtomic venue listings (tenant search)
├── playtomic_availability.py   # Playtomic booking slots (next-day availability)
└── all.py                      # Runs all extractors sequentially

Adding a new extractor

  1. Create my_source.py following the pattern:
from ._shared import run_extractor, setup_logging
from .utils import landing_path, write_gzip_atomic

logger = setup_logging("padelnomics.extract.my_source")
EXTRACTOR_NAME = "my_source"

def extract(landing_dir, year_month, conn, session):
    """Returns {"files_written": N, "bytes_written": N, ...}."""
    year, month = year_month.split("/")
    dest_dir = landing_path(landing_dir, "my_source", year, month)
    # ... fetch data, write to dest_dir ...
    return {"files_written": 1, "files_skipped": 0, "bytes_written": n}

def main():
    run_extractor(EXTRACTOR_NAME, extract)
  2. Add an entry point to pyproject.toml:
extract-my-source = "padelnomics_extract.my_source:main"
  3. Import in all.py and add to the EXTRACTORS list.
  4. Add a staging model in transform/sqlmesh_padelnomics/models/staging/.
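Step 3's registry pattern can be sketched as follows. The callables below are stand-ins for illustration; in the real all.py each entry would be a module's main function:

```python
# Sketch of the all.py pattern: an ordered registry of (name, entrypoint) pairs.
# The two functions here are placeholders, not the real extractor entrypoints.

def run_overpass():
    # stands in for overpass.main
    return {"files_written": 1}

def run_my_source():
    # step 3: the newly added extractor goes here
    return {"files_written": 1}

EXTRACTORS = [
    ("overpass", run_overpass),
    ("my_source", run_my_source),
]

def run_all():
    """Run every registered extractor sequentially, like `uv run extract`."""
    results = {}
    for name, entrypoint in EXTRACTORS:
        results[name] = entrypoint()
    return results
```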

Design: filesystem as state

The landing zone is an append-only store of raw files:

  • Idempotency: running twice writes nothing if the source hasn't changed
  • Debugging: every historical raw file is preserved
  • Safety: extraction never mutates existing files, only appends new ones

ETag-based dedup (Eurostat)

When the source provides an ETag header, store it in a sibling .etag file. On the next request, send If-None-Match — 304 means skip.
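A minimal sketch of this conditional-GET pattern, using only the standard library. The function name and sidecar convention are illustrative, not the module's actual helpers:

```python
import gzip
import urllib.error
import urllib.request
from pathlib import Path


def fetch_if_changed(url: str, dest: Path) -> bool:
    """Conditional GET against `url`; returns True if a new file was written."""
    # Sibling .etag file, e.g. urb_cpop1.json.gz -> urb_cpop1.json.gz.etag
    etag_file = dest.with_suffix(dest.suffix + ".etag")
    req = urllib.request.Request(url)
    if etag_file.exists():
        req.add_header("If-None-Match", etag_file.read_text().strip())

    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            body = resp.read()
            etag = resp.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return False  # unchanged upstream — skip the write
        raise

    dest.write_bytes(gzip.compress(body))
    if etag:
        etag_file.write_text(etag)
    return True
```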

Content-addressed (Overpass, Playtomic)

Files are named by date or content. write_gzip_atomic() writes to a .tmp sibling and then renames, so a crash never leaves a partial file behind.
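The write-then-rename trick can be sketched like this (illustrative; the real write_gzip_atomic lives in utils.py and may differ):

```python
import gzip
import os
from pathlib import Path


def write_gzip_atomic(dest: Path, data: bytes) -> int:
    """Compress `data` to `dest` without ever exposing a partial file."""
    tmp = dest.with_name(dest.name + ".tmp")
    tmp.write_bytes(gzip.compress(data))
    # os.replace is atomic on POSIX: readers see the old file or the new
    # one, never a half-written file, even if the process crashes mid-write.
    os.replace(tmp, dest)
    return dest.stat().st_size
```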

State tracking

Every run writes one row to data/landing/.state.sqlite:

sqlite3 data/landing/.state.sqlite \
  "SELECT extractor, started_at, status, files_written, cursor_value
   FROM extraction_runs ORDER BY run_id DESC LIMIT 10"

Column          Type     Description
run_id          INTEGER  Auto-increment primary key
extractor       TEXT     Extractor name (e.g. overpass, eurostat)
started_at      TEXT     ISO 8601 UTC timestamp
finished_at     TEXT     ISO 8601 UTC timestamp
status          TEXT     running, success, or failed
files_written   INTEGER  New files written this run
files_skipped   INTEGER  Files already present
bytes_written   INTEGER  Compressed bytes written
cursor_value    TEXT     Resume cursor (date, index, etc.)
error_message   TEXT     Exception message if failed
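Recording a run needs nothing beyond the sqlite3 stdlib module. This sketch infers the DDL from the columns above; it is not the module's actual code:

```python
import sqlite3

# Schema inferred from the column table above; the real DDL lives in utils.py.
DDL = """
CREATE TABLE IF NOT EXISTS extraction_runs (
    run_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    extractor     TEXT NOT NULL,
    started_at    TEXT NOT NULL,
    finished_at   TEXT,
    status        TEXT NOT NULL,
    files_written INTEGER DEFAULT 0,
    files_skipped INTEGER DEFAULT 0,
    bytes_written INTEGER DEFAULT 0,
    cursor_value  TEXT,
    error_message TEXT
)
"""


def record_run(conn: sqlite3.Connection, extractor: str, stats: dict) -> int:
    """Insert one success row from an extractor's stats dict; return run_id."""
    conn.execute(DDL)
    cur = conn.execute(
        "INSERT INTO extraction_runs (extractor, started_at, finished_at,"
        " status, files_written, files_skipped, bytes_written)"
        " VALUES (?, datetime('now'), datetime('now'), 'success', ?, ?, ?)",
        (
            extractor,
            stats.get("files_written", 0),
            stats.get("files_skipped", 0),
            stats.get("bytes_written", 0),
        ),
    )
    conn.commit()
    return cur.lastrowid
```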

Landing zone structure

data/landing/
├── .state.sqlite
├── overpass/{year}/{month}/courts.json.gz
├── eurostat/{year}/{month}/urb_cpop1.json.gz
├── eurostat/{year}/{month}/ilc_di03.json.gz
├── playtomic/{year}/{month}/tenants.json.gz
└── playtomic/{year}/{month}/availability_{date}.json.gz
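A landing_path helper matching this layout might look like the following (illustrative; the real helper is in utils.py and its signature may differ):

```python
from pathlib import Path


def landing_path(landing_dir: str, source: str, year: str, month: str) -> Path:
    """Build data/landing/{source}/{year}/{month}/ and return it."""
    dest = Path(landing_dir) / source / year / month
    dest.mkdir(parents=True, exist_ok=True)
    return dest
```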

Data sources

Source                  Module                     Schedule               Notes
Overpass API            overpass.py                Daily                  OSM padel courts, ~5K nodes
Eurostat                eurostat.py                Daily (304 most runs)  urb_cpop1, ilc_di03 — ETag dedup
Playtomic tenants       playtomic_tenants.py       Daily                  ~8K venues, bounded pagination
Playtomic availability  playtomic_availability.py  Daily                  Next-day slots, ~4.5h runtime