Commit Graph

31 Commits

Author SHA1 Message Date
Deeman
7b03fd71f9 feat(extract): convert playtomic_availability to JSONL output
- availability_{date}.jsonl.gz replaces .json.gz for morning snapshots
- Each JSONL line = one venue object with date + captured_at_utc injected
- Eliminates in-memory consolidation: working.jsonl IS the final file
  (compress_jsonl_atomic at end instead of write_gzip_atomic blob)
- Crash recovery unchanged: working.jsonl accumulates via flush_partial_batch
- _load_morning_availability tries .jsonl.gz first, falls back to .json.gz
- Skip check covers both formats during transition
- Recheck files stay blob format (small, infrequent)

stg_playtomic_availability: UNION ALL transition (morning_jsonl + morning_blob + recheck_blob)
  - morning_jsonl: read_json JSONL, tenant_id direct column, no outer UNNEST
  - morning_blob / recheck_blob: subquery + LATERAL UNNEST (unchanged semantics)
  - All three produce (snapshot_date, captured_at_utc, snapshot_type, recheck_hour, tenant_id, slots_json)
  - Downstream raw_resources / raw_slots CTEs unchanged

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 12:14:38 +01:00
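
The "one venue object per JSONL line, with date + captured_at_utc injected" format from 7b03fd71f9 might look roughly like the sketch below. The function name, the sample payload, and the timestamp format are illustrative assumptions, not taken from playtomic_availability.py.

```python
import json


def venue_to_jsonl_line(venue: dict, date: str, captured_at_utc: str) -> str:
    """Serialize one venue object as a single JSONL line with snapshot metadata."""
    # Hypothetical helper: the injected keys follow the commit message,
    # everything else about the record shape is assumed.
    record = {**venue, "date": date, "captured_at_utc": captured_at_utc}
    return json.dumps(record, ensure_ascii=False) + "\n"


line = venue_to_jsonl_line(
    {"tenant_id": "t-1", "slots": []},  # made-up venue payload
    date="2026-02-26",
    captured_at_utc="2026-02-25T11:00:00Z",
)
print(line, end="")
```

Because every line is a complete record, the working file can be appended to incrementally and compressed as-is, with no consolidation pass at the end.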
Deeman
9bef055e6d feat(extract): convert playtomic_tenants to JSONL output
- playtomic_tenants.py: write each tenant as a JSONL line after dedup,
  compress via compress_jsonl_atomic → tenants.jsonl.gz
- playtomic_availability.py: update _load_tenant_ids() to prefer
  tenants.jsonl.gz, fall back to tenants.json.gz (transition)
- stg_playtomic_venues.sql: UNION ALL jsonl+blob CTEs for transition;
  JSONL reads top-level columns directly, no UNNEST(tenants) needed
- stg_playtomic_resources.sql: same UNION ALL pattern, single UNNEST
  for resources in JSONL path vs double UNNEST in blob path
- stg_playtomic_opening_hours.sql: same UNION ALL pattern, opening_hours
  as top-level JSON column in JSONL path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 12:07:53 +01:00
Deeman
6bede60ef8 feat(extract): add compress_jsonl_atomic() utility
Streams a JSONL working file to .jsonl.gz in 1MB chunks (constant memory),
atomic rename via .tmp sibling, deletes source on success. Companion to
write_gzip_atomic() for extractors that stream records incrementally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 11:50:17 +01:00
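
A minimal sketch of what 6bede60ef8 describes (stream in 1 MB chunks, write to a .tmp sibling, rename atomically, delete the source on success). The signature and return value are assumptions; only the behaviour is taken from the commit message.

```python
import gzip
import os
import shutil
from pathlib import Path

CHUNK_BYTES = 1024 * 1024  # 1 MB chunks, as stated in the commit message


def compress_jsonl_atomic(source: Path, dest: Path) -> Path:
    """Stream `source` into the gzip file `dest` using constant memory.

    Assumed signature; the real utility may differ.
    """
    # A sibling temp file keeps the rename on the same filesystem.
    tmp = dest.with_name(dest.name + ".tmp")
    with open(source, "rb") as src, gzip.open(tmp, "wb") as dst:
        shutil.copyfileobj(src, dst, CHUNK_BYTES)
    # os.replace is atomic when both paths are on the same filesystem,
    # so readers see either the old file or the complete new one.
    os.replace(tmp, dest)
    # Remove the working file only after the rename has succeeded.
    source.unlink()
    return dest
```

If the process dies mid-compression, only the .tmp file is left incomplete and the working JSONL is still on disk, so the next run can retry.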
Deeman
d834bdc59a feat(extract): recheck every 30 min with 30-min window for accurate occupancy
Each slot is now rechecked once, at most 30 min before it starts.
Worst-case miss: a booking made 29 min before start.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 09:39:30 +01:00
Deeman
b7c8568265 fix(extract): recheck window 90→60 min (correct reasoning this time)
60-min window + hourly rechecks = each slot caught exactly once, 0-60 min
before it starts. 90-min window causes double-querying (T-90 and T-30).
Slot duration is irrelevant — it doesn't affect when the slot appears in
the window.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 09:37:17 +01:00
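
The window arithmetic argued over in d15787caeb, be8872beb2, b7c8568265 and d834bdc59a can be checked with a small simulation. It assumes a recheck at time t queries slots starting in the half-open interval (t, t + window]; the actual boundary handling in the extractor is not shown in the log.

```python
def recheck_hits(slot_start_min: int, window_min: int, interval_min: int) -> list[int]:
    """Return the recheck times (minutes since midnight) that query the slot.

    Assumption: a recheck at time t covers slots starting in (t, t + window_min].
    """
    return [
        t
        for t in range(0, 24 * 60, interval_min)
        if t < slot_start_min <= t + window_min
    ]


SLOT_1030 = 10 * 60 + 30  # a slot starting at 10:30

# 60-min window, hourly rechecks: caught once, at 10:00.
print(recheck_hits(SLOT_1030, window_min=60, interval_min=60))  # [600]
# 90-min window, hourly rechecks: caught twice, at 09:00 (T-90) and 10:00 (T-30).
print(recheck_hits(SLOT_1030, window_min=90, interval_min=60))  # [540, 600]
# 30-min window, rechecks every 30 min: caught once, at 10:00.
print(recheck_hits(SLOT_1030, window_min=30, interval_min=30))  # [600]
```

Slot duration never enters the calculation, which matches the reasoning in b7c8568265: only the slot's start time decides which recheck windows contain it.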
Deeman
be8872beb2 revert: restore recheck window to 90 min
Data analysis of 5,115 venues with slots shows 24.8% have a 90-min minimum
slot duration. A 60-min window would miss those venues entirely with hourly
rechecks. 90 min is correct — covers 30/60/90-min minimum venues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 09:35:12 +01:00
Deeman
d15787caeb fix(extract): recheck window 90→60 min — matches hourly schedule and min slot duration
With hourly rechecks and 60-min minimum slots, a 90-min window causes each
slot to be queried twice. 60-min window = each slot caught exactly once in
the recheck immediately before it starts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 09:33:20 +01:00
Deeman
d5947af8d4 merge: maximum performance extraction (parallel pages + crash-safe partial JSONL)
# Conflicts:
#	.env.dev.sops
#	.env.prod.sops
#	extract/padelnomics_extract/src/padelnomics_extract/playtomic_tenants.py
2026-02-24 22:36:34 +01:00
Deeman
9f010d8c0c perf(extract): parallel page fetching in tenants, drop EXTRACT_WORKERS env var
- playtomic_tenants.py: batch_size = len(proxy_urls) pages fired in parallel per
  batch; each page gets its own session + proxy; sorted(results) ensures
  deterministic done-detection; falls back to serial + THROTTLE_SECONDS when no
  proxies. Expected speedup: ~2.5 min → ~15 s with 10 proxies.
- .env.dev.sops, .env.prod.sops: remove EXTRACT_WORKERS (now derived from
  PROXY_URLS length)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 22:30:28 +01:00
Deeman
6116445b56 perf(extract): auto-detect workers from proxies, skip throttle on success, crash-safe partial JSONL
- proxy.py: delete unused make_sticky_selector()
- utils.py: add load_partial_results() + flush_partial_batch() for crash-resumable extraction
- playtomic_availability.py:
  - drop MAX_WORKERS / EXTRACT_WORKERS — worker_count = len(proxy_urls) or 1
  - skip time.sleep(THROTTLE_SECONDS) on success when proxy_url is set; keep sleeps for 429/5xx
  - replace cursor-based resumption with .partial.jsonl sidecar (flush every 50 records)
  - _fetch_venues_parallel accepts on_result callback for incremental partial-file flushing
  - mirror auto-detect worker count in extract_recheck()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 22:21:05 +01:00
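
The .partial.jsonl sidecar from 6116445b56 could work along these lines. The function names come from the commit message, but the signatures, the id_key parameter, and the handling of torn lines are assumptions made for this sketch.

```python
import json
import os
from pathlib import Path


def load_partial_results(path: Path, id_key: str) -> tuple[list[dict], set]:
    """Read records flushed by an earlier, interrupted run (assumed signature)."""
    records: list[dict] = []
    if path.exists():
        with open(path, encoding="utf-8") as fh:
            for raw in fh:
                raw = raw.strip()
                if not raw:
                    continue
                try:
                    records.append(json.loads(raw))
                except json.JSONDecodeError:
                    continue  # skip a torn line left by a crash mid-write
    return records, {record[id_key] for record in records}


def flush_partial_batch(path: Path, batch: list[dict]) -> None:
    """Append a batch of records to the sidecar file and force it to disk."""
    with open(path, "a", encoding="utf-8") as fh:
        for record in batch:
            fh.write(json.dumps(record) + "\n")
        fh.flush()
        os.fsync(fh.fileno())
```

On restart, the extractor would call load_partial_results() first and skip every ID in the returned set, so a crash costs at most the records buffered since the last flush (50, per the commit).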
Deeman
79d1b0e672 feat(extract): tiered proxy with circuit breaker + proxy provider research
- playtomic_tenants.py: simplify proxy cycler call (cycler() instead of
  cycler["next_proxy"]()) — matches refactored proxy API
- docs/proxy-provider-inventory.md: proxy provider comparison table for
  Playtomic scraping (~14k req/day, residential IPs, pay-per-GB)
- .env.*.sops: updated encrypted secrets (re-encrypted)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 22:15:11 +01:00
Deeman
78ffbc313f feat(extract): parallel DAG scheduler + proxy rotation for tenants
- all.py: replace sequential loop with graphlib.TopologicalSorter + ThreadPoolExecutor
  - EXTRACTORS dict declares (func, [deps]) — self-documenting dependency graph
  - 8 extractors run in parallel immediately; availability starts as soon as
    tenants finishes (not after all others complete)
  - max_workers=len(EXTRACTORS) — all I/O-bound, no CPU contention
- playtomic_tenants.py: add proxy rotation via make_round_robin_cycler
  - no throttle when PROXY_URLS set (IP rotation removes per-IP rate concern)
  - keeps 2s throttle for direct runs
- _shared.py: add optional proxy_url param to run_extractor()
  - any extractor can opt in to proxy support via the shared session
- overpass_tennis.py: fix query timeout (out body → out center, timeout 180 → 300)
  - out center returns centroids only, not full geometry — fits within server limits
- playtomic_availability.py: fix CIRCUIT_BREAKER_THRESHOLD empty string crash
  - int(os.environ.get(..., "10")) → int(os.environ.get(...) or "10")

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 21:17:00 +01:00
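
The scheduler in 78ffbc313f combines graphlib.TopologicalSorter with ThreadPoolExecutor. A minimal sketch of that pattern follows; run_dag and the toy EXTRACTORS entries are hypothetical, and only the (func, [deps]) shape and max_workers=len(EXTRACTORS) come from the commit message.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(extractors: dict) -> list[str]:
    """Run {name: (func, [deps])}; each node starts as soon as its deps finish."""
    sorter = TopologicalSorter({name: deps for name, (_, deps) in extractors.items()})
    sorter.prepare()  # raises CycleError if the graph has a cycle
    finished: list[str] = []
    with ThreadPoolExecutor(max_workers=len(extractors)) as pool:
        running = {}  # future -> extractor name
        while sorter.is_active():
            # Submit everything whose dependencies are already done.
            for name in sorter.get_ready():
                running[pool.submit(extractors[name][0])] = name
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # re-raise any exception from the extractor
                finished.append(name)
                sorter.done(name)  # unblocks dependants immediately
    return finished


# Toy graph with placeholder functions (not the real extractors).
EXTRACTORS = {
    "overpass": (lambda: None, []),
    "tenants": (lambda: None, []),
    "availability": (lambda: None, ["tenants"]),
}
order = run_dag(EXTRACTORS)
print(order)
```

Marking a node done as soon as its future completes is what lets availability start right after tenants, rather than waiting for the whole first wave.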
Deeman
44c0dd0b8d refactor: minor TigerStyle cleanups
- export_serving.py: move `import re` to module level — was imported
  inside a loop body on every iteration
- sitemap.py: add comment documenting that the in-memory TTL cache is
  process-local (valid for single-worker deployment, Dockerfile --workers 1)
- playtomic_availability.py: use `or "10"` fallback for
  CIRCUIT_BREAKER_THRESHOLD env var to handle empty-string case

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 20:50:43 +01:00
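
The empty-string crash fixed in 78ffbc313f and 44c0dd0b8d can be reproduced in a few lines. The variable is set here by hand to simulate a blank default in an env file.

```python
import os

# Simulate a variable that is present but blank.
os.environ["CIRCUIT_BREAKER_THRESHOLD"] = ""

# The default passed to .get() is used only when the key is missing,
# so this returns "" and int("") raises ValueError.
try:
    int(os.environ.get("CIRCUIT_BREAKER_THRESHOLD", "10"))
except ValueError as exc:
    print(f"crash: {exc}")

# `or` falls through on any falsy value, including the empty string.
threshold = int(os.environ.get("CIRCUIT_BREAKER_THRESHOLD") or "10")
print(threshold)  # 10
```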
Deeman
ad48f23cfc fix: add precondition assertions in extract pipeline
Assert landing_dir.is_dir() and year_month format (YYYY/MM) at the
entry point of each extract function — turning silent wrong-path bugs
into immediate AssertionError with a descriptive message.

Files changed:
- playtomic_availability.py: assert in _load_tenant_ids(), extract(),
  extract_recheck()
- eurostat.py: assert in extract()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 20:42:11 +01:00
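
The precondition checks from ad48f23cfc might look like the sketch below. The helper name, the regular expression, and the message wording are assumptions; the commit only states that landing_dir.is_dir() and the YYYY/MM format are asserted at each entry point.

```python
import re
from pathlib import Path

# Assumed pattern for "YYYY/MM"; the real check may be looser or stricter.
YEAR_MONTH_RE = re.compile(r"\d{4}/(0[1-9]|1[0-2])")


def check_preconditions(landing_dir: Path, year_month: str) -> None:
    """Fail fast with a descriptive message instead of writing to a wrong path."""
    assert landing_dir.is_dir(), f"landing_dir is not a directory: {landing_dir}"
    assert YEAR_MONTH_RE.fullmatch(year_month), (
        f"year_month must be YYYY/MM, got {year_month!r}"
    )
```

Note that assert statements are stripped when Python runs with -O, so this style relies on the pipeline not being run in optimized mode.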
Deeman
d3db830c98 Merge branch 'master' into worktree-dual-market-score
# Conflicts:
#	.env.dev.sops
2026-02-24 16:38:25 +01:00
Deeman
c109488d9d feat(extract): expand GeoNames to cities1000 + add tennis court extractor
GeoNames:
- cities15000 → cities1000 (~140K global locations, pop ≥ 1K)
- Add lat/lon, admin1_code, admin2_code to output (needed for dim_locations)
- Expand feature codes to include PPLA3/4/5 (Gemeinden, cantons, etc.)
- Remove MIN_POPULATION=50K floor — cities1000 already pre-filters to ≥1K
- Update assertions for new scale (~100K+ expected)

Tennis courts:
- New overpass_tennis.py extractor (sport=tennis, 180s Overpass timeout)
- Registered as extract-overpass-tennis, added to EXTRACTORS list
- New stg_tennis_courts.sql staging model (grain: osm_id)

stg_population_geonames: add lat, lon, admin1_code, admin2_code columns

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 16:15:20 +01:00
Deeman
0b472e1a32 feat(extract): tiered proxy with circuit breaker for Playtomic availability
Adds a two-tier proxy system for the Playtomic availability extractor:
- Primary tier (PROXY_URLS): datacenter proxies, cheap and fast
- Fallback tier (PROXY_URLS_FALLBACK): residential rotating gateway, reliable

Circuit breaker opens after CIRCUIT_BREAKER_THRESHOLD (default: 10) consecutive
failures, permanently switching to the fallback tier for the rest of the run.
No auto-recovery — avoids flapping. If circuit opens with no fallback configured,
logs an error and writes partial results rather than continuing on a dead proxy pool.

Parallel mode submits futures in PARALLEL_BATCH_SIZE=100 batches so the circuit
breaker can stop new submissions after it opens.

New env vars added to .env.dev.sops (blank defaults):
  PROXY_URLS_FALLBACK          — residential/rotating gateway URL
  CIRCUIT_BREAKER_THRESHOLD    — consecutive failures before switching (default 10)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 16:01:50 +01:00
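
The one-way circuit breaker described in 0b472e1a32 can be sketched as a small thread-safe class. The class and method names are hypothetical; the behaviour (open after N consecutive failures, switch to the fallback tier, never auto-recover) follows the commit message.

```python
import threading


class ProxyCircuitBreaker:
    """Switch permanently from the primary to the fallback proxy tier."""

    def __init__(self, primary: list[str], fallback: list[str], threshold: int = 10):
        self.primary = list(primary)
        self.fallback = list(fallback)
        self.threshold = threshold
        self.consecutive_failures = 0
        self.is_open = False
        self._lock = threading.Lock()

    def record_success(self) -> None:
        with self._lock:
            # A success resets the streak but never closes an open circuit.
            self.consecutive_failures = 0

    def record_failure(self) -> None:
        with self._lock:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.is_open = True  # no auto-recovery, which avoids flapping

    def active_pool(self) -> list[str]:
        with self._lock:
            return self.fallback if self.is_open else self.primary

    def exhausted(self) -> bool:
        """True when the circuit is open and no fallback tier is configured."""
        with self._lock:
            return self.is_open and not self.fallback
```

A parallel fetch loop would check exhausted() between batches and stop submitting new work when it returns True, which is why the commit submits futures in batches instead of all at once.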
Deeman
1762188f08 docs(inventory): document population pipeline implementation findings
- Add section 9 to data-sources-inventory.md covering live API quirks:
  Eurostat SDMX city labels response shape, ONS CSV download path (observations
  API 404s), US Census ACS place endpoint, GeoNames cities15000 bulk format
- Add population coverage summary table and DuckDB glob limitation note
- fix(extract): census_usa + geonames write empty placeholder when credentials
  absent so SQLMesh staging models don't fail with "no files found"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 01:01:10 +01:00
Deeman
06cbdf80dc fix(extract): correct Eurostat + ONS response parsers against live APIs
eurostat_city_labels: API returns compact dimension JSON (category.label dict),
not SDMX 2.1 nested codelists structure. Fixed parser to read from
data["category"]["label"]. 1771 city codes fetched successfully.

ons_uk: observations endpoint (TS007A) is 404. Switched to CSV download via
/datasets/mid-year-pop-est/editions/mid-2022-england-wales — fetches ~68MB CSV,
filters to sex='all' + target year, aggregates population per LAD. 316 LADs ≥50K.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 00:26:13 +01:00
Deeman
0960990373 feat(data): Sprint 1-5 population pipeline — city labels, US/UK/Global extractors
Part A: Data Layer — Sprints 1-5

Sprint 1 — Eurostat SDMX city labels (unblocks EU population):
- New extractor: eurostat_city_labels.py — fetches ESTAT/CITIES codelist
  (city_code → city_name mapping) with ETag dedup
- New staging model: stg_city_labels.sql — grain city_code
- Updated dim_cities.sql — joins Eurostat population via city code lookup;
  replaces hardcoded 0::BIGINT population

Sprint 2 — Market score formula v2:
- city_market_profile.sql: 30pt population (LN/1M), 25pt income PPS (/200),
  30pt demand (occupancy or density), 15pt data confidence
- Moved venue_pricing_benchmarks join into base CTE so median_occupancy_rate
  is available to the scoring formula

Sprint 3 — US Census ACS extractor:
- New extractor: census_usa.py — ACS 5-year place population (vintage 2023)
- New staging model: stg_population_usa.sql — grain (place_fips, ref_year)

Sprint 4 — ONS UK extractor:
- New extractor: ons_uk.py — 2021 Census LAD population via ONS beta API
- New staging model: stg_population_uk.sql — grain (lad_code, ref_year)

Sprint 5 — GeoNames global extractor:
- New extractor: geonames.py — cities15000.zip bulk download, filtered to ≥50K pop
- New staging model: stg_population_geonames.sql — grain geoname_id
- dim_cities: 5-source population cascade (Eurostat > Census > ONS > GeoNames > 0)
  with case/whitespace-insensitive city name matching

Registered all 4 new CLI entrypoints in pyproject.toml and all.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 00:07:08 +01:00
Deeman
a1faddbed6 feat: Python supervisor + feature flags
Supervisor (replaces supervisor.sh):
- supervisor.py — cron-based pipeline orchestration, reads workflows.toml
  on every tick, runs due extractors in topological waves with parallel
  execution, then SQLMesh transform + serving export
- workflows.toml — workflow registry: overpass (monthly), eurostat (monthly),
  playtomic_tenants (weekly), playtomic_availability (daily),
  playtomic_recheck (hourly 6–23)
- padelnomics-supervisor.service — updated ExecStart to Python supervisor

Extraction enhancements:
- proxy.py — optional round-robin/sticky proxy rotation via PROXY_URLS env
- playtomic_availability.py — parallel fetch (EXTRACT_WORKERS), recheck mode
  (main_recheck) re-queries imminent slots for accurate occupancy measurement
- _shared.py — realistic browser User-Agent on all extractor sessions
- stg_playtomic_availability.sql — reads morning + recheck snapshots, tags each
- fct_daily_availability.sql — prefers recheck over morning for same slot

Feature flags (replaces WAITLIST_MODE env var):
- migration 0019 — feature_flags table, 5 initial flags:
  markets (on), payments/planner_export/supplier_signup/lead_unlock (off)
- core.py — is_flag_enabled() + feature_gate() decorator
- routes — payments, markets, planner_export, supplier_signup, lead_unlock gated
- admin flags UI — /admin/flags toggle page + nav link
- app.py — flag() injected as Jinja2 global

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 13:53:45 +01:00
Deeman
a055660cd2 fix: replace broken bbox pagination with global page-based extraction
Playtomic API ignores bbox params (min_latitude, etc.) and offset param.
Discovered that `page` param works correctly for global enumeration.

Result: 14,202 venues across 82 countries (was 100 with bbox approach).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 01:16:35 +01:00
Deeman
13c86ebf84 Merge branch 'worktree-extraction-overhaul'
# Conflicts:
#	transform/sqlmesh_padelnomics/models/foundation/dim_cities.sql
#	transform/sqlmesh_padelnomics/models/staging/stg_playtomic_venues.sql
2026-02-23 01:01:26 +01:00
Deeman
79f7fc6fad feat: Playtomic pricing/occupancy pipeline + email i18n + audience restructure
Three workstreams:

1. Playtomic full data extraction & transform pipeline:
   - Expand venue bounding boxes from 4 to 23 regions (global coverage)
   - New staging models for court resources, opening hours, and slot-level
     availability with real prices from the Playtomic API
   - Foundation fact tables for venue capacity and daily occupancy/revenue
   - City-level pricing benchmarks replacing hardcoded country estimates
   - Planner defaults now use 3-tier cascade: city data → country → fallback

2. Transactional email i18n:
   - _t() helper in worker.py with ~70 translation keys (EN + DE)
   - All 8 email handlers translated, lang passed in task payloads

3. Resend audiences restructured to 3 named audiences (free plan limit)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 00:54:53 +01:00
Deeman
5a1bb21624 fix: eurostat JSON-stat parsing + staging model corrections
Eurostat JSON-stat format (4-7 dimension sparse dict with 583K values)
causes DuckDB OOM — pre-process in extractor to flat records.
Also fix dim_cities unused CTE bug and playtomic venue lat/lon path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 20:52:25 +01:00
Deeman
c25e20f83a fix: playtomic pagination stale-page exit + calculator test assertions
Playtomic tenants API recycles results past its internal limit —
stop after 3 consecutive pages with zero new unique IDs.

Calculator tests: replace hardcoded default values (6 courts, specific
sqm/capex) with DEFAULTS references so tests don't break when
defaults change.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 20:06:48 +01:00
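
The stale-page exit from c25e20f83a (stop after 3 consecutive pages with zero new unique IDs) could be written as below. The function name, the fetch_page callback, the id_key default, the page cap, and the early exit on an empty page are all assumptions for illustration.

```python
from typing import Callable


def paginate_until_stale(
    fetch_page: Callable[[int], list[dict]],
    id_key: str = "tenant_id",
    max_stale_pages: int = 3,
    max_pages: int = 500,  # assumed hard bound so the loop always terminates
) -> list[dict]:
    """Collect unique records until the API starts recycling old results."""
    seen: dict = {}
    stale_pages = 0
    for page in range(max_pages):
        items = fetch_page(page)
        if not items:
            break  # assumed: an empty page also means the listing is exhausted
        new_count = 0
        for item in items:
            if item[id_key] not in seen:
                seen[item[id_key]] = item
                new_count += 1
        # Reset the streak on any new ID, otherwise extend it.
        stale_pages = 0 if new_count else stale_pages + 1
        if stale_pages >= max_stale_pages:
            break
    return list(seen.values())
```

Requiring several stale pages in a row, rather than one, tolerates a single page that happens to contain only duplicates while still stopping soon after the API begins to loop.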
Deeman
2db66efe77 feat: migrate transform to 3-layer architecture with per-layer schemas
Remove raw/ layer — staging models now read landing JSON directly.
Rename all model schemas from padelnomics.* to staging.*/foundation.*/serving.*.
Web app queries updated to serving.planner_defaults via SERVING_DUCKDB_PATH.
Supervisor gets daily sleep interval between pipeline runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 19:04:40 +01:00
Deeman
53e9bbd66b feat: restructure extraction to one file per source
Split monolithic execute.py into per-source modules with separate CLI
entry points. Each extractor now uses the framework from utils.py:
- SQLite state tracking (start_run / end_run per extractor)
- Proper logging (replace print() with logger)
- Atomic gzip writes (write_gzip_atomic)
- Connection pooling (niquests.Session)
- Bounded pagination (MAX_PAGES_PER_BBOX = 500)

New entry points:
  extract              — run all 4 extractors sequentially
  extract-overpass     — OSM padel courts
  extract-eurostat     — city demographics (etag dedup)
  extract-playtomic-tenants      — venue listings
  extract-playtomic-availability — booking slots + pricing (NEW)

The availability extractor reads tenant IDs from the latest tenants.json.gz,
queries next-day slots for each venue, and stores daily consolidated snapshots.
Supports resumability via cursor and retry with backoff.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:56:41 +01:00
Deeman
ea86940b78 feat: copier update v0.9.0 → v0.10.0
Pulls in template changes: export_serving.py for atomic DuckDB swap,
supervisor export step, SQLMesh glob macro, server provisioning script,
imprint template, and formatting improvements.

Template scaffold SQL models excluded (padelnomics has real models).
Web app routes/analytics unchanged (padelnomics-specific customizations).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 17:50:36 +01:00
Deeman
18ee24818b feat: copier update v0.9.0 — extraction docs, state tracking, architecture guides
Sync template from 29ac25b → v0.9.0 (29 template commits). Due to
template's _subdirectory migration, new files were manually rendered
rather than auto-merged by copier.

New files:
- .claude/CLAUDE.md + coding_philosophy.md (agent instructions)
- extract utils.py: SQLite state tracking for extraction runs
- extract/transform READMEs: architecture & pattern documentation
- infra/supervisor: systemd service + orchestration script
- Per-layer model READMEs (raw, staging, foundation, serving)

Also fixes copier-answers.yml (adds 4 feature toggles, removes stale
payment_provider key) and scopes CLAUDE.md gitignore to root only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 15:44:48 +01:00
Deeman
4ae00b35d1 refactor: flatten padelnomics/padelnomics/ → repo root
git mv all tracked files from the nested padelnomics/ workspace
directory to the git repo root. Merged .gitignore files.
No code changes — pure path rename.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 00:44:40 +01:00