padelnomics

Author	SHA1	Message	Date
Deeman	ad48f23cfc	fix: add precondition assertions in extract pipeline Assert landing_dir.is_dir() and year_month format (YYYY/MM) at the entry point of each extract function — turning silent wrong-path bugs into immediate AssertionError with a descriptive message. Files changed: - playtomic_availability.py: assert in _load_tenant_ids(), extract(), extract_recheck() - eurostat.py: assert in extract() Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 20:42:11 +01:00
Deeman	d3db830c98	Merge branch 'master' into worktree-dual-market-score # Conflicts: # .env.dev.sops	2026-02-24 16:38:25 +01:00
Deeman	c109488d9d	feat(extract): expand GeoNames to cities1000 + add tennis court extractor GeoNames: - cities15000 → cities1000 (~140K global locations, pop ≥ 1K) - Add lat/lon, admin1_code, admin2_code to output (needed for dim_locations) - Expand feature codes to include PPLA3/4/5 (Gemeinden, cantons, etc.) - Remove MIN_POPULATION=50K floor — cities1000 already pre-filters to ≥1K - Update assertions for new scale (~100K+ expected) Tennis courts: - New overpass_tennis.py extractor (sport=tennis, 180s Overpass timeout) - Registered as extract-overpass-tennis, added to EXTRACTORS list - New stg_tennis_courts.sql staging model (grain: osm_id) stg_population_geonames: add lat, lon, admin1_code, admin2_code columns Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 16:15:20 +01:00
Deeman	0b472e1a32	feat(extract): tiered proxy with circuit breaker for Playtomic availability Adds a two-tier proxy system for the Playtomic availability extractor: - Primary tier (PROXY_URLS): datacenter proxies, cheap and fast - Fallback tier (PROXY_URLS_FALLBACK): residential rotating gateway, reliable Circuit breaker opens after CIRCUIT_BREAKER_THRESHOLD (default: 10) consecutive failures, permanently switching to the fallback tier for the rest of the run. No auto-recovery — avoids flapping. If circuit opens with no fallback configured, logs an error and writes partial results rather than continuing on a dead proxy pool. Parallel mode submits futures in PARALLEL_BATCH_SIZE=100 batches so the circuit breaker can stop new submissions after it opens. New env vars added to .env.dev.sops (blank defaults): PROXY_URLS_FALLBACK — residential/rotating gateway URL CIRCUIT_BREAKER_THRESHOLD — consecutive failures before switching (default 10) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 16:01:50 +01:00
Deeman	1762188f08	docs(inventory): document population pipeline implementation findings - Add section 9 to data-sources-inventory.md covering live API quirks: Eurostat SDMX city labels response shape, ONS CSV download path (observations API 404s), US Census ACS place endpoint, GeoNames cities15000 bulk format - Add population coverage summary table and DuckDB glob limitation note - fix(extract): census_usa + geonames write empty placeholder when credentials absent so SQLMesh staging models don't fail with "no files found" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 01:01:10 +01:00
Deeman	06cbdf80dc	fix(extract): correct Eurostat + ONS response parsers against live APIs eurostat_city_labels: API returns compact dimension JSON (category.label dict), not SDMX 2.1 nested codelists structure. Fixed parser to read from data["category"]["label"]. 1771 city codes fetched successfully. ons_uk: observations endpoint (TS007A) is 404. Switched to CSV download via /datasets/mid-year-pop-est/editions/mid-2022-england-wales — fetches ~68MB CSV, filters to sex='all' + target year, aggregates population per LAD. 316 LADs ≥50K. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 00:26:13 +01:00
Deeman	0960990373	feat(data): Sprint 1-5 population pipeline — city labels, US/UK/Global extractors Part A: Data Layer — Sprints 1-5 Sprint 1 — Eurostat SDMX city labels (unblocks EU population): - New extractor: eurostat_city_labels.py — fetches ESTAT/CITIES codelist (city_code → city_name mapping) with ETag dedup - New staging model: stg_city_labels.sql — grain city_code - Updated dim_cities.sql — joins Eurostat population via city code lookup; replaces hardcoded 0::BIGINT population Sprint 2 — Market score formula v2: - city_market_profile.sql: 30pt population (LN/1M), 25pt income PPS (/200), 30pt demand (occupancy or density), 15pt data confidence - Moved venue_pricing_benchmarks join into base CTE so median_occupancy_rate is available to the scoring formula Sprint 3 — US Census ACS extractor: - New extractor: census_usa.py — ACS 5-year place population (vintage 2023) - New staging model: stg_population_usa.sql — grain (place_fips, ref_year) Sprint 4 — ONS UK extractor: - New extractor: ons_uk.py — 2021 Census LAD population via ONS beta API - New staging model: stg_population_uk.sql — grain (lad_code, ref_year) Sprint 5 — GeoNames global extractor: - New extractor: geonames.py — cities15000.zip bulk download, filtered to ≥50K pop - New staging model: stg_population_geonames.sql — grain geoname_id - dim_cities: 5-source population cascade (Eurostat > Census > ONS > GeoNames > 0) with case/whitespace-insensitive city name matching Registered all 4 new CLI entrypoints in pyproject.toml and all.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 00:07:08 +01:00
Deeman	a1faddbed6	feat: Python supervisor + feature flags Supervisor (replaces supervisor.sh): - supervisor.py — cron-based pipeline orchestration, reads workflows.toml on every tick, runs due extractors in topological waves with parallel execution, then SQLMesh transform + serving export - workflows.toml — workflow registry: overpass (monthly), eurostat (monthly), playtomic_tenants (weekly), playtomic_availability (daily), playtomic_recheck (hourly 6–23) - padelnomics-supervisor.service — updated ExecStart to Python supervisor Extraction enhancements: - proxy.py — optional round-robin/sticky proxy rotation via PROXY_URLS env - playtomic_availability.py — parallel fetch (EXTRACT_WORKERS), recheck mode (main_recheck) re-queries imminent slots for accurate occupancy measurement - _shared.py — realistic browser User-Agent on all extractor sessions - stg_playtomic_availability.sql — reads morning + recheck snapshots, tags each - fct_daily_availability.sql — prefers recheck over morning for same slot Feature flags (replaces WAITLIST_MODE env var): - migration 0019 — feature_flags table, 5 initial flags: markets (on), payments/planner_export/supplier_signup/lead_unlock (off) - core.py — is_flag_enabled() + feature_gate() decorator - routes — payments, markets, planner_export, supplier_signup, lead_unlock gated - admin flags UI — /admin/flags toggle page + nav link - app.py — flag() injected as Jinja2 global Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 13:53:45 +01:00
Deeman	a055660cd2	fix: replace broken bbox pagination with global page-based extraction Playtomic API ignores bbox params (min_latitude, etc.) and offset param. Discovered that `page` param works correctly for global enumeration. Result: 14,202 venues across 82 countries (was 100 with bbox approach). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 01:16:35 +01:00
Deeman	13c86ebf84	Merge branch 'worktree-extraction-overhaul' # Conflicts: # transform/sqlmesh_padelnomics/models/foundation/dim_cities.sql # transform/sqlmesh_padelnomics/models/staging/stg_playtomic_venues.sql	2026-02-23 01:01:26 +01:00
Deeman	79f7fc6fad	feat: Playtomic pricing/occupancy pipeline + email i18n + audience restructure Three workstreams: 1. Playtomic full data extraction & transform pipeline: - Expand venue bounding boxes from 4 to 23 regions (global coverage) - New staging models for court resources, opening hours, and slot-level availability with real prices from the Playtomic API - Foundation fact tables for venue capacity and daily occupancy/revenue - City-level pricing benchmarks replacing hardcoded country estimates - Planner defaults now use 3-tier cascade: city data → country → fallback 2. Transactional email i18n: - _t() helper in worker.py with ~70 translation keys (EN + DE) - All 8 email handlers translated, lang passed in task payloads 3. Resend audiences restructured to 3 named audiences (free plan limit) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 00:54:53 +01:00
Deeman	5a1bb21624	fix: eurostat JSON-stat parsing + staging model corrections Eurostat JSON-stat format (4-7 dimension sparse dict with 583K values) causes DuckDB OOM — pre-process in extractor to flat records. Also fix dim_cities unused CTE bug and playtomic venue lat/lon path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 20:52:25 +01:00
Deeman	c25e20f83a	fix: playtomic pagination stale-page exit + calculator test assertions Playtomic tenants API recycles results past its internal limit — stop after 3 consecutive pages with zero new unique IDs. Calculator tests: replace hardcoded default values (6 courts, specific sqm/capex) with DEFAULTS references so tests don't break when defaults change. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 20:06:48 +01:00
Deeman	2db66efe77	feat: migrate transform to 3-layer architecture with per-layer schemas Remove raw/ layer — staging models now read landing JSON directly. Rename all model schemas from padelnomics.* to staging./foundation./serving.*. Web app queries updated to serving.planner_defaults via SERVING_DUCKDB_PATH. Supervisor gets daily sleep interval between pipeline runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 19:04:40 +01:00
Deeman	53e9bbd66b	feat: restructure extraction to one file per source Split monolithic execute.py into per-source modules with separate CLI entry points. Each extractor now uses the framework from utils.py: - SQLite state tracking (start_run / end_run per extractor) - Proper logging (replace print() with logger) - Atomic gzip writes (write_gzip_atomic) - Connection pooling (niquests.Session) - Bounded pagination (MAX_PAGES_PER_BBOX = 500) New entry points: extract — run all 4 extractors sequentially extract-overpass — OSM padel courts extract-eurostat — city demographics (etag dedup) extract-playtomic-tenants — venue listings extract-playtomic-availability — booking slots + pricing (NEW) The availability extractor reads tenant IDs from the latest tenants.json.gz, queries next-day slots for each venue, and stores daily consolidated snapshots. Supports resumability via cursor and retry with backoff. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 18:56:41 +01:00
Deeman	ea86940b78	feat: copier update v0.9.0 → v0.10.0 Pulls in template changes: export_serving.py for atomic DuckDB swap, supervisor export step, SQLMesh glob macro, server provisioning script, imprint template, and formatting improvements. Template scaffold SQL models excluded (padelnomics has real models). Web app routes/analytics unchanged (padelnomics-specific customizations). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 17:50:36 +01:00
Deeman	18ee24818b	feat: copier update v0.9.0 — extraction docs, state tracking, architecture guides Sync template from 29ac25b → v0.9.0 (29 template commits). Due to template's _subdirectory migration, new files were manually rendered rather than auto-merged by copier. New files: - .claude/CLAUDE.md + coding_philosophy.md (agent instructions) - extract utils.py: SQLite state tracking for extraction runs - extract/transform READMEs: architecture & pattern documentation - infra/supervisor: systemd service + orchestration script - Per-layer model READMEs (raw, staging, foundation, serving) Also fixes copier-answers.yml (adds 4 feature toggles, removes stale payment_provider key) and scopes CLAUDE.md gitignore to root only. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 15:44:48 +01:00
Deeman	4ae00b35d1	refactor: flatten padelnomics/padelnomics/ → repo root git mv all tracked files from the nested padelnomics/ workspace directory to the git repo root. Merged .gitignore files. No code changes — pure path rename. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-22 00:44:40 +01:00

18 Commits
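
Note on commit 0b472e1a32 (tiered proxy with circuit breaker): the behaviour described in that message can be sketched roughly as below. This is a minimal illustration, not the repository's code. The class `TieredProxyPool`, the function `run_in_batches`, and the example proxy URLs are hypothetical names introduced here. Only the behaviour comes from the commit message: the breaker opens after a threshold of consecutive failures, switches to the fallback tier for the rest of the run with no auto-recovery, and work is submitted in batches so that no new batch starts on a dead pool.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


class TieredProxyPool:
    """Hypothetical two-tier proxy pool with a one-way circuit breaker."""

    def __init__(self, primary, fallback, threshold=10):
        assert threshold > 0, f"threshold must be positive, got {threshold}"
        # Round-robin over each tier; None means the tier is not configured.
        self._primary = itertools.cycle(primary) if primary else None
        self._fallback = itertools.cycle(fallback) if fallback else None
        self._threshold = threshold
        self._consecutive_failures = 0
        self._open = False
        self._lock = threading.Lock()

    @property
    def is_open(self):
        with self._lock:
            return self._open

    @property
    def exhausted(self):
        """True once the breaker has opened and there is no fallback tier."""
        with self._lock:
            return self._open and self._fallback is None

    def next_proxy(self):
        """Return the next proxy URL from the active tier, or None if it is empty."""
        with self._lock:
            tier = self._fallback if self._open else self._primary
            return next(tier) if tier is not None else None

    def record_success(self):
        # A success resets the failure streak but never closes an open breaker.
        with self._lock:
            self._consecutive_failures = 0

    def record_failure(self):
        with self._lock:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self._threshold:
                self._open = True  # stays open for the rest of the run


def run_in_batches(items, fetch, pool, batch_size=100, workers=4):
    """Submit work batch by batch; stop submitting once the pool is exhausted."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        for start in range(0, len(items), batch_size):
            if pool.exhausted:
                break  # dead proxy pool: return the partial results collected so far
            batch = items[start:start + batch_size]
            futures = [executor.submit(fetch, item, pool.next_proxy()) for item in batch]
            for future in futures:
                try:
                    results.append(future.result())
                    pool.record_success()
                except Exception:
                    pool.record_failure()
    return results
```

In this sketch, a caller would build the pool from the `PROXY_URLS` and `PROXY_URLS_FALLBACK` values and pass `CIRCUIT_BREAKER_THRESHOLD` as `threshold`. Checking `pool.exhausted` only between batches means the current batch always finishes before submissions stop, which is one plausible reading of the batching described in the commit message.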