Tests imported make_sticky_selector but it was never implemented.
Hash-based (MD5) consistent selector: the same key always returns the
same proxy, and keys are distributed across the pool.
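
A minimal sketch of the selector described above. Only the name make_sticky_selector and the use of MD5 come from this commit; the signature, the None return for an empty pool, and the plain modulo mapping (not a hash ring) are assumptions.

```python
import hashlib
from typing import Callable, Optional, Sequence


def make_sticky_selector(proxy_urls: Sequence[str]) -> Callable[[str], Optional[str]]:
    """Return a selector that maps a key to the same proxy on every call."""
    pool = list(proxy_urls)

    def select(key: str) -> Optional[str]:
        if not pool:
            return None  # assumed behaviour: empty pool means "no proxy"
        # MD5 is used only for stable bucketing, not for security.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(pool)
        return pool[index]

    return select
```

With a modulo mapping, changing the pool size remaps most keys, so stickiness holds only while the pool is unchanged.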
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
878 → 4212 cities. Broadens coverage to match the granularity of
Eurostat and GeoNames data for smaller metro markets.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace single hardcoded Chrome 131 UA with:
- BOT_UA: honest padelnomics-bot UA for Overpass, Eurostat, GeoNames etc.
- _UA_POOL + ua_for_proxy(): deterministic browser UA per proxy URL so each
IP presents a consistent, distinct fingerprint across runs.
Public-API extractors (shared session, no proxy) now send BOT_UA.
Playtomic extractors (proxy-backed) each get a stable pool UA keyed on
their proxy URL hash.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds OVERPASS_MIRRORS list (overpass-api.de, kumi.systems, openstreetmap.ru)
and a post_overpass() helper in _shared.py that tries mirrors in order,
logging a warning on each failure and re-raising the last RequestException
if all mirrors fail. Both overpass.py and overpass_tennis.py now call
post_overpass() instead of hard-coding the primary URL.
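
The failover loop described above might look roughly like this. The mirror URLs are placeholders, MirrorError stands in for the HTTP client's RequestException, and the injected post callable is an assumption made so the sketch runs without a network.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger(__name__)

# Placeholder endpoints; the real list holds the three mirrors named above.
OVERPASS_MIRRORS = [
    "https://mirror-a.example/api/interpreter",
    "https://mirror-b.example/api/interpreter",
    "https://mirror-c.example/api/interpreter",
]


class MirrorError(Exception):
    """Stand-in for the HTTP client's RequestException."""


def post_overpass(
    query: str,
    post: Callable[[str, str], object],
    mirrors: Sequence[str] = OVERPASS_MIRRORS,
) -> object:
    """Try each mirror in order; re-raise the last error if all fail."""
    assert mirrors, "at least one mirror is required"
    last_error = None
    for url in mirrors:
        try:
            return post(url, query)
        except MirrorError as exc:
            logger.warning("Overpass mirror %s failed: %s", url, exc)
            last_error = exc
    raise last_error
```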
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both stg_playtomic_resources and stg_playtomic_opening_hours lacked QUALIFY ROW_NUMBER()
dedup despite declaring a grain. When both tenants.json.gz (old) and tenants.jsonl.gz (new)
exist for the same month, the UNION ALL produced exactly 2× rows.
Fixes:
- stg_playtomic_resources: QUALIFY ROW_NUMBER() OVER (PARTITION BY tenant_id, resource_id)
- stg_playtomic_opening_hours: QUALIFY ROW_NUMBER() OVER (PARTITION BY tenant_id, day_of_week)
- playtomic_tenants.py: skip if old blob OR new JSONL already exists for the month,
preventing same-month dual-format writes that trigger the duplicate rows
Row counts after fix: ~43.8K resources, ~93.4K opening_hours (was 87.6K, 186.8K).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace single global Overpass query (150K+ elements, times out) with
10 regional bbox queries (~10-40K elements each, 150s server / 180s client).
- REGIONS: 10 bboxes covering all continents
- Crash recovery: working.jsonl accumulates per-region results;
already_seen_ids deduplication skips re-written elements on restart
- Overlapping bbox elements deduped by OSM id across regions
- Retry per region: up to 2 retries with 30s cooldown
- Polite 5s inter-region delay
- Skip if courts.jsonl.gz or courts.json.gz already exists for the month
stg_tennis_courts: UNION ALL transition (jsonl_elements + blob_elements)
- jsonl_elements: JSONL, explicit columns, COALESCE lat/lon with center coords
(supports both node direct lat/lon and way/relation Overpass out center)
- blob_elements: existing UNNEST(elements) pattern, unchanged
- Removed osm_type='node' filter — ways/relations now usable via center coords
- Dedup on (osm_id, extracted_date DESC) unchanged
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- availability_{date}.jsonl.gz replaces .json.gz for morning snapshots
- Each JSONL line = one venue object with date + captured_at_utc injected
- Eliminates in-memory consolidation: working.jsonl IS the final file
(compress_jsonl_atomic at end instead of write_gzip_atomic blob)
- Crash recovery unchanged: working.jsonl accumulates via flush_partial_batch
- _load_morning_availability tries .jsonl.gz first, falls back to .json.gz
- Skip check covers both formats during transition
- Recheck files stay blob format (small, infrequent)
stg_playtomic_availability: UNION ALL transition (morning_jsonl + morning_blob + recheck_blob)
- morning_jsonl: read_json JSONL, tenant_id direct column, no outer UNNEST
- morning_blob / recheck_blob: subquery + LATERAL UNNEST (unchanged semantics)
- All three produce (snapshot_date, captured_at_utc, snapshot_type, recheck_hour, tenant_id, slots_json)
- Downstream raw_resources / raw_slots CTEs unchanged
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- playtomic_tenants.py: write each tenant as a JSONL line after dedup,
compress via compress_jsonl_atomic → tenants.jsonl.gz
- playtomic_availability.py: update _load_tenant_ids() to prefer
tenants.jsonl.gz, fall back to tenants.json.gz (transition)
- stg_playtomic_venues.sql: UNION ALL jsonl+blob CTEs for transition;
JSONL reads top-level columns directly, no UNNEST(tenants) needed
- stg_playtomic_resources.sql: same UNION ALL pattern, single UNNEST
for resources in JSONL path vs double UNNEST in blob path
- stg_playtomic_opening_hours.sql: same UNION ALL pattern, opening_hours
as top-level JSON column in JSONL path
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Streams a JSONL working file to .jsonl.gz in 1MB chunks (constant memory),
atomic rename via .tmp sibling, deletes source on success. Companion to
write_gzip_atomic() for extractors that stream records incrementally.
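
A sketch of the helper using only the standard library. The name compress_jsonl_atomic and the behaviour (1MB chunks, .tmp sibling, delete source) come from this commit; the signature and the returned byte count are assumptions.

```python
import gzip
import os
import shutil
from pathlib import Path

CHUNK_BYTES = 1024 * 1024  # 1MB copy buffer keeps memory use constant


def compress_jsonl_atomic(source: Path, dest: Path) -> int:
    """Gzip source into dest via a .tmp sibling, then delete source."""
    tmp = dest.with_name(dest.name + ".tmp")
    with open(source, "rb") as src, gzip.open(tmp, "wb") as gz:
        shutil.copyfileobj(src, gz, length=CHUNK_BYTES)
    # os.replace is atomic when tmp and dest are on the same filesystem,
    # so readers never observe a half-written .jsonl.gz.
    os.replace(tmp, dest)
    source.unlink()
    return dest.stat().st_size
```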
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each slot is now rechecked once, at most 30 min before it starts.
Worst-case miss: a booking made 29 min before start.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
60-min window + hourly rechecks = each slot caught exactly once, 0-60 min
before it starts. 90-min window causes double-querying (T-90 and T-30).
Slot duration is irrelevant — it doesn't affect when the slot appears in
the window.
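
The window arithmetic can be checked with a small simulation. This assumes rechecks fire on the hour and that a recheck covers slots starting in the half-open range (0, window] minutes ahead; recheck_hits is a hypothetical helper, not part of the extractor.

```python
def recheck_hits(slot_start_min: int, window_min: int, interval_min: int = 60) -> list[int]:
    """Recheck times (minutes since midnight) whose window contains the slot start."""
    hits = []
    for recheck_at in range(0, 24 * 60, interval_min):
        lead = slot_start_min - recheck_at
        if 0 < lead <= window_min:
            hits.append(recheck_at)
    return hits


# A slot starting at 18:30 (minute 1110):
print(recheck_hits(1110, 60))  # one recheck, at 18:00
print(recheck_hits(1110, 90))  # two rechecks, at 17:00 (T-90) and 18:00 (T-30)
```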
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Data analysis of 5,115 venues with slots shows 24.8% have a 90-min minimum
slot duration. A 60-min window would miss those venues entirely with hourly
rechecks. 90 min is correct — covers 30/60/90-min minimum venues.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
With hourly rechecks and 60-min minimum slots, a 90-min window causes each
slot to be queried twice. 60-min window = each slot caught exactly once in
the recheck immediately before it starts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- playtomic_tenants.py: batch_size = len(proxy_urls) pages fired in parallel per
batch; each page gets its own session + proxy; sorted(results) ensures
deterministic done-detection; falls back to serial + THROTTLE_SECONDS when no
proxies. Expected speedup: ~2.5 min → ~15 s with 10 proxies.
- .env.dev.sops, .env.prod.sops: remove EXTRACT_WORKERS (now derived from
PROXY_URLS length)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- all.py: replace sequential loop with graphlib.TopologicalSorter + ThreadPoolExecutor
- EXTRACTORS dict declares (func, [deps]) — self-documenting dependency graph
- 8 extractors run in parallel immediately; availability starts as soon as
tenants finishes (not after all others complete)
- max_workers=len(EXTRACTORS) — all I/O-bound, no CPU contention
- playtomic_tenants.py: add proxy rotation via make_round_robin_cycler
- no throttle when PROXY_URLS set (IP rotation removes per-IP rate concern)
- keeps 2s throttle for direct runs
- _shared.py: add optional proxy_url param to run_extractor()
- any extractor can opt in to proxy support via the shared session
- overpass_tennis.py: fix query timeout (out body → out center, timeout 180 → 300)
- out center returns centroids only, not full geometry — fits within server limits
- playtomic_availability.py: fix CIRCUIT_BREAKER_THRESHOLD empty string crash
- int(os.environ.get(..., "10")) → int(os.environ.get(...) or "10")
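
The all.py scheduling described above can be sketched as follows. The (func, [deps]) shape matches the EXTRACTORS dict in this commit; run_all and its return value are hypothetical names for illustration.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_all(extractors: dict) -> list[str]:
    """extractors maps name -> (func, [dependency names]). Returns completion order."""
    sorter = TopologicalSorter({name: deps for name, (_, deps) in extractors.items()})
    sorter.prepare()  # raises CycleError if the graph has a cycle
    finished = []
    with ThreadPoolExecutor(max_workers=len(extractors)) as pool:
        pending = {}
        while sorter.is_active():
            # Submit everything whose dependencies are already done.
            for name in sorter.get_ready():
                func, _ = extractors[name]
                pending[pool.submit(func)] = name
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                name = pending.pop(future)
                future.result()  # re-raise any extractor exception
                sorter.done(name)  # unblocks dependents immediately
                finished.append(name)
    return finished
```

Because dependents are released as soon as their own dependencies finish, a dependent extractor does not wait for unrelated ones.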
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- export_serving.py: move `import re` to module level — was imported
inside a loop body on every iteration
- sitemap.py: add comment documenting that the in-memory TTL cache is
process-local (valid for single-worker deployment, Dockerfile --workers 1)
- playtomic_availability.py: use `or "10"` fallback for
CIRCUIT_BREAKER_THRESHOLD env var to handle empty-string case
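
Why the `or` form is needed: os.environ.get(name, "10") only applies the default when the variable is unset, so a variable set to "" reaches int("") and raises ValueError. A small sketch (env_int is a hypothetical helper name):

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an integer env var, treating unset and empty string the same."""
    return int(os.environ.get(name) or default)


os.environ["CIRCUIT_BREAKER_THRESHOLD"] = ""  # e.g. a blank default in an env file
threshold = env_int("CIRCUIT_BREAKER_THRESHOLD", 10)  # falls back to 10
```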
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Assert landing_dir.is_dir() and year_month format (YYYY/MM) at the
entry point of each extract function — turning silent wrong-path bugs
into immediate AssertionError with a descriptive message.
Files changed:
- playtomic_availability.py: assert in _load_tenant_ids(), extract(),
extract_recheck()
- eurostat.py: assert in extract()
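
The entry-point checks might look roughly like this. check_extract_args and the exact messages are hypothetical; the two conditions are the ones named above.

```python
import re
from pathlib import Path

_YEAR_MONTH_RE = re.compile(r"\d{4}/(0[1-9]|1[0-2])")


def check_extract_args(landing_dir: Path, year_month: str) -> None:
    """Fail fast on a wrong landing path or a malformed YYYY/MM partition."""
    assert landing_dir.is_dir(), f"landing_dir is not a directory: {landing_dir}"
    assert _YEAR_MONTH_RE.fullmatch(year_month), (
        f"year_month must be YYYY/MM, got {year_month!r}"
    )
```

Note that assert statements are stripped when Python runs with -O, so these checks only fire in normal mode.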
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a two-tier proxy system for the Playtomic availability extractor:
- Primary tier (PROXY_URLS): datacenter proxies, cheap and fast
- Fallback tier (PROXY_URLS_FALLBACK): residential rotating gateway, reliable
Circuit breaker opens after CIRCUIT_BREAKER_THRESHOLD (default: 10) consecutive
failures, permanently switching to the fallback tier for the rest of the run.
No auto-recovery — avoids flapping. If circuit opens with no fallback configured,
logs an error and writes partial results rather than continuing on a dead proxy pool.
Parallel mode submits futures in PARALLEL_BATCH_SIZE=100 batches so the circuit
breaker can stop new submissions after it opens.
New env vars added to .env.dev.sops (blank defaults):
PROXY_URLS_FALLBACK — residential/rotating gateway URL
CIRCUIT_BREAKER_THRESHOLD — consecutive failures before switching (default 10)
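
A minimal sketch of the one-way breaker described above. The class and method names are hypothetical; the behaviour (consecutive-failure threshold, permanent switch, no auto-recovery) follows this commit.

```python
import threading
from typing import List, Sequence


class ProxyCircuitBreaker:
    """Switch from the primary to the fallback tier after N consecutive failures."""

    def __init__(self, primary: Sequence[str], fallback: Sequence[str], threshold: int = 10):
        self.primary = list(primary)
        self.fallback = list(fallback)
        self.threshold = threshold
        self.consecutive_failures = 0
        self.is_open = False
        self._lock = threading.Lock()  # workers report results concurrently

    def record_success(self) -> None:
        with self._lock:
            self.consecutive_failures = 0  # resets the streak, never re-closes

    def record_failure(self) -> None:
        with self._lock:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.is_open = True  # one-way: stays open for the rest of the run

    def active_pool(self) -> List[str]:
        """Primary tier while closed, fallback tier once open (may be empty)."""
        with self._lock:
            return self.fallback if self.is_open else self.primary
```

A caller that sees is_open with an empty active_pool() would stop submitting batches and write partial results.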
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add section 9 to data-sources-inventory.md covering live API quirks:
Eurostat SDMX city labels response shape, ONS CSV download path (observations
API 404s), US Census ACS place endpoint, GeoNames cities15000 bulk format
- Add population coverage summary table and DuckDB glob limitation note
- fix(extract): census_usa + geonames write empty placeholder when credentials
absent so SQLMesh staging models don't fail with "no files found"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
eurostat_city_labels: API returns compact dimension JSON (category.label dict),
not SDMX 2.1 nested codelists structure. Fixed parser to read from
data["category"]["label"]. 1771 city codes fetched successfully.
ons_uk: observations endpoint (TS007A) returns 404. Switched to CSV download via
/datasets/mid-year-pop-est/editions/mid-2022-england-wales — fetches ~68MB CSV,
filters to sex='all' + target year, aggregates population per LAD. 316 LADs ≥50K.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Part A: Data Layer — Sprints 1-5
Sprint 1 — Eurostat SDMX city labels (unblocks EU population):
- New extractor: eurostat_city_labels.py — fetches ESTAT/CITIES codelist
(city_code → city_name mapping) with ETag dedup
- New staging model: stg_city_labels.sql — grain city_code
- Updated dim_cities.sql — joins Eurostat population via city code lookup;
replaces hardcoded 0::BIGINT population
Sprint 2 — Market score formula v2:
- city_market_profile.sql: 30pt population (LN/1M), 25pt income PPS (/200),
30pt demand (occupancy or density), 15pt data confidence
- Moved venue_pricing_benchmarks join into base CTE so median_occupancy_rate
is available to the scoring formula
Sprint 3 — US Census ACS extractor:
- New extractor: census_usa.py — ACS 5-year place population (vintage 2023)
- New staging model: stg_population_usa.sql — grain (place_fips, ref_year)
Sprint 4 — ONS UK extractor:
- New extractor: ons_uk.py — 2021 Census LAD population via ONS beta API
- New staging model: stg_population_uk.sql — grain (lad_code, ref_year)
Sprint 5 — GeoNames global extractor:
- New extractor: geonames.py — cities15000.zip bulk download, filtered to ≥50K pop
- New staging model: stg_population_geonames.sql — grain geoname_id
- dim_cities: 5-tier population cascade (Eurostat > Census > ONS > GeoNames > 0)
with case/whitespace-insensitive city name matching
Registered all 4 new CLI entrypoints in pyproject.toml and all.py.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Playtomic API ignores bbox params (min_latitude, etc.) and offset param.
Discovered that `page` param works correctly for global enumeration.
Result: 14,202 venues across 82 countries (was 100 with bbox approach).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three workstreams:
1. Playtomic full data extraction & transform pipeline:
- Expand venue bounding boxes from 4 to 23 regions (global coverage)
- New staging models for court resources, opening hours, and slot-level
availability with real prices from the Playtomic API
- Foundation fact tables for venue capacity and daily occupancy/revenue
- City-level pricing benchmarks replacing hardcoded country estimates
- Planner defaults now use 3-tier cascade: city data → country → fallback
2. Transactional email i18n:
- _t() helper in worker.py with ~70 translation keys (EN + DE)
- All 8 email handlers translated, lang passed in task payloads
3. Resend audiences restructured to 3 named audiences (free plan limit)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Eurostat JSON-stat format (4-7 dimension sparse dict with 583K values)
causes DuckDB OOM — pre-process in extractor to flat records.
Also fix dim_cities unused CTE bug and playtomic venue lat/lon path.
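
A sketch of the flattening step, assuming the JSON-stat 2.0 layout (id, size, dimension.<dim>.category.index, and a sparse value object keyed by row-major flat index). flatten_jsonstat is a hypothetical name, and this version handles only the object form of category.index.

```python
def flatten_jsonstat(dataset: dict) -> list[dict]:
    """Expand a sparse JSON-stat value dict into one flat record per observation."""
    dim_ids = dataset["id"]
    sizes = dataset["size"]
    # For each dimension, map position -> category code.
    codes = []
    for dim in dim_ids:
        index = dataset["dimension"][dim]["category"]["index"]
        codes.append({pos: code for code, pos in index.items()})

    records = []
    for flat_key, value in dataset["value"].items():
        remainder = int(flat_key)
        record = {}
        # Row-major order: the last dimension varies fastest, so peel it first.
        for dim, size, by_pos in zip(reversed(dim_ids), reversed(sizes), reversed(codes)):
            remainder, pos = divmod(remainder, size)
            record[dim] = by_pos[pos]
        record["value"] = value
        records.append(record)
    return records
```

Only observations present in the sparse dict produce records, so output size tracks the number of values rather than the full dimension product.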
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Playtomic tenants API recycles results past its internal limit —
stop after 3 consecutive pages with zero new unique IDs.
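
The stop rule can be sketched like this. fetch_page, the tenant_id key, the page cap, and the break on an empty page are assumptions for illustration; the 3-stale-page threshold is from this commit.

```python
from typing import Callable, Dict, List

MAX_STALE_PAGES = 3  # consecutive pages with zero new unique IDs
MAX_PAGES = 500      # hard bound so the loop always terminates


def enumerate_tenants(
    fetch_page: Callable[[int], List[dict]],
    max_pages: int = MAX_PAGES,
) -> List[dict]:
    """Page through results until the API starts recycling tenants."""
    seen: Dict[str, dict] = {}
    stale_pages = 0
    for page in range(max_pages):
        tenants = fetch_page(page)
        if not tenants:
            break  # assumed: an empty page also means the end
        new = [t for t in tenants if t["tenant_id"] not in seen]
        for tenant in new:
            seen[tenant["tenant_id"]] = tenant
        stale_pages = 0 if new else stale_pages + 1
        if stale_pages >= MAX_STALE_PAGES:
            break
    return list(seen.values())
```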
Calculator tests: replace hardcoded default values (6 courts, specific
sqm/capex) with DEFAULTS references so tests don't break when
defaults change.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove raw/ layer — staging models now read landing JSON directly.
Rename all model schemas from padelnomics.* to staging.*/foundation.*/serving.*.
Web app queries updated to serving.planner_defaults via SERVING_DUCKDB_PATH.
Supervisor gets daily sleep interval between pipeline runs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split monolithic execute.py into per-source modules with separate CLI
entry points. Each extractor now uses the framework from utils.py:
- SQLite state tracking (start_run / end_run per extractor)
- Proper logging (replace print() with logger)
- Atomic gzip writes (write_gzip_atomic)
- Connection pooling (niquests.Session)
- Bounded pagination (MAX_PAGES_PER_BBOX = 500)
New entry points:
extract — run all 4 extractors sequentially
extract-overpass — OSM padel courts
extract-eurostat — city demographics (etag dedup)
extract-playtomic-tenants — venue listings
extract-playtomic-availability — booking slots + pricing (NEW)
The availability extractor reads tenant IDs from the latest tenants.json.gz,
queries next-day slots for each venue, and stores daily consolidated snapshots.
Supports resumability via cursor and retry with backoff.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pulls in template changes: export_serving.py for atomic DuckDB swap,
supervisor export step, SQLMesh glob macro, server provisioning script,
imprint template, and formatting improvements.
Template scaffold SQL models excluded (padelnomics has real models).
Web app routes/analytics unchanged (padelnomics-specific customizations).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sync template from 29ac25b → v0.9.0 (29 template commits). Due to
the template's _subdirectory migration, new files were manually rendered
rather than auto-merged by copier.
New files:
- .claude/CLAUDE.md + coding_philosophy.md (agent instructions)
- extract utils.py: SQLite state tracking for extraction runs
- extract/transform READMEs: architecture & pattern documentation
- infra/supervisor: systemd service + orchestration script
- Per-layer model READMEs (raw, staging, foundation, serving)
Also fixes copier-answers.yml (adds 4 feature toggles, removes stale
payment_provider key) and scopes CLAUDE.md gitignore to root only.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
git mv all tracked files from the nested padelnomics/ workspace
directory to the git repo root. Merged .gitignore files.
No code changes — pure path rename.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>