Commit Graph

58 Commits

Deeman
75305935bd merge: GISCO extractor + wire all extractors + load_dotenv fix
2026-03-01 17:08:42 +01:00
Deeman
a15c32d398 fix(extract): load .env automatically via load_dotenv()
PROXY_URLS_* and other secrets were defined in .env but never loaded,
causing availability to run in slow serial mode (1 req/s) instead of
parallel mode with proxies.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 16:41:59 +01:00
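
A minimal sketch of the fix described in the commit above, assuming python-dotenv and comma-separated proxy env vars; the exact call site and variable handling in the repo may differ.

```python
import os

from dotenv import load_dotenv  # python-dotenv

# Load .env into os.environ before any secret lookups; without this call,
# PROXY_URLS_* stays unset and the extractor falls back to slow serial mode.
load_dotenv()

proxy_urls = [u for u in os.environ.get("PROXY_URLS_DATACENTER", "").split(",") if u]
```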
Deeman
97c5846d51 feat(extract): GISCO extractor + wire all unscheduled extractors
- New gisco.py: proper extractor module replacing scripts/download_gisco_nuts.py.
  Writes uncompressed .geojson (ST_Read can't handle .gz). Fixed partition path
  gisco/2024/01/nuts2_boundaries.geojson; cursor tracking skips re-downloading within the same month.
- all.py: import + register gisco in EXTRACTORS (9 independent, 1 dep)
- pyproject.toml: add extract-gisco entry point
- workflows.toml: add census_usa, census_usa_income, eurostat_city_labels,
  ons_uk, gisco — all monthly, no dependencies
- Delete scripts/download_gisco_nuts.py (superseded)

Unblocks: stg_nuts2_boundaries, stg_regional_income, stg_income_usa,
and 4 downstream models (dim_locations, pseo_city_costs_de,
location_opportunity_profile, pseo_country_overview).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 15:49:39 +01:00
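
A hypothetical sketch of the skip-if-present monthly download described above; the GISCO URL, function signature, and directory layout are illustrative, not the repo's gisco.py.

```python
from pathlib import Path

import requests

# Illustrative URL: the history references the NUTS_RG_20M_*_4326_LEVL_2 GeoJSON on GISCO.
GISCO_URL = "https://gisco-services.ec.europa.eu/distribution/v2/nuts/geojson/NUTS_RG_20M_2021_4326_LEVL_2.geojson"


def extract(landing_dir: Path, year_month: str) -> None:
    out = landing_dir / "gisco" / year_month / "nuts2_boundaries.geojson"
    if out.exists():  # cursor-style check: at most one download per monthly partition
        return
    out.parent.mkdir(parents=True, exist_ok=True)
    resp = requests.get(GISCO_URL, timeout=120)
    resp.raise_for_status()
    # Written uncompressed on purpose: DuckDB's ST_Read cannot open .gz files.
    out.write_bytes(resp.content)
```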
Deeman
42c49e383c fix(proxy): ignore stale-tier failures in record_failure()
With parallel workers, threads that fetch a proxy just before escalation
can report failures after the tier has already changed — those failures
were silently counting against the new tier, immediately exhausting it
before it ever got tried (Rayobyte being skipped entirely in favour of
DataImpulse because 10 in-flight Webshare failures hit the threshold).

Fix: build a proxy_url → tier_idx reverse map at construction time and
skip the tier-level circuit breaker when the failing proxy belongs to an
already-escalated tier.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 14:43:05 +01:00
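
A sketch of the reverse-map guard described above, assuming a dict-of-closures cycler with shared state; names and structure are illustrative rather than the repo's proxy.py.

```python
import threading


def make_tiered_cycler(tiers: list[list[str]], circuit_breaker_threshold: int = 10):
    lock = threading.Lock()
    state = {"tier_idx": 0, "failures": 0}
    # Built once at construction time: which tier each proxy URL belongs to.
    tier_of = {url: idx for idx, tier in enumerate(tiers) for url in tier}

    def record_failure(proxy_url: str | None = None) -> None:
        with lock:
            # A worker may report a failure for a proxy it fetched just before escalation;
            # such stale-tier failures must not count against the new tier's breaker.
            if proxy_url is not None and tier_of.get(proxy_url, state["tier_idx"]) < state["tier_idx"]:
                return
            state["failures"] += 1
            if state["failures"] >= circuit_breaker_threshold and state["tier_idx"] + 1 < len(tiers):
                state["tier_idx"] += 1  # escalate: free -> datacenter -> residential
                state["failures"] = 0

    return {"record_failure": record_failure, "state": state}
```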
Deeman
81a57db272 fix(proxy): skip URLs without scheme in load_proxy_tiers()
Validates each URL in PROXY_URLS_DATACENTER / PROXY_URLS_RESIDENTIAL:
logs a warning and skips any entry missing an http:// or https:// scheme
instead of passing malformed URLs that cause SSL or connection errors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 14:26:41 +01:00
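
A sketch of the scheme check described above; the env-var parsing and log wording are assumptions.

```python
import logging
import os

log = logging.getLogger(__name__)


def _urls_from_env(env_var: str) -> list[str]:
    valid = []
    for url in (u.strip() for u in os.environ.get(env_var, "").split(",")):
        if not url:
            continue
        if not url.startswith(("http://", "https://")):
            # Malformed entries previously reached requests and surfaced as SSL/connection errors.
            log.warning("skipping proxy URL without scheme in %s: %r", env_var, url)
            continue
        valid.append(url)
    return valid
```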
Deeman
a898a06575 feat(proxy): per-proxy dead tracking in tiered cycler
Add proxy_failure_limit param to make_tiered_cycler (default 3).
Individual proxies hitting the limit are marked dead and permanently
skipped. next_proxy() auto-escalates when all proxies in the active
tier are dead. Both mechanisms coexist: per-proxy dead tracking removes
broken individuals; tier-level threshold catches systemic failure.

- proxy.py: dead_proxies set + proxy_failure_counts dict in state;
  next_proxy skips dead proxies with bounded loop; record_failure/
  record_success accept optional proxy_url; dead_proxy_count() added
- playtomic_tenants.py: pass proxy_url to record_success/record_failure
- playtomic_availability.py: _worker returns (proxy_url, result);
  serial loops in extract + extract_recheck capture proxy_url
- test_supervisor.py: 11 new tests in TestTieredCyclerDeadProxyTracking

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 12:28:54 +01:00
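
A sketch of the dead-proxy skip loop described above, on top of hypothetical shared cycler state (tier_idx, cursor, dead_proxies); the real next_proxy() in proxy.py differs in detail.

```python
def next_proxy(state: dict, tiers: list[list[str]]) -> str:
    while True:
        tier = tiers[state["tier_idx"]]
        # Bounded loop: at most one pass over the active tier.
        for _ in range(len(tier)):
            url = tier[state["cursor"] % len(tier)]
            state["cursor"] += 1
            if url not in state["dead_proxies"]:
                return url
        # Every proxy in the active tier is dead: auto-escalate to the next tier.
        if state["tier_idx"] + 1 >= len(tiers):
            raise RuntimeError("all proxy tiers exhausted")
        state["tier_idx"] += 1
```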
Deeman
1aedf78ec6 fix(extract): use tiered cycler in playtomic_tenants
Previously the tenants extractor flattened all proxy tiers into a single
round-robin list, bypassing the circuit breaker entirely. When the free
Webshare tier runs out of bandwidth (402), all 20 free proxies fail and
the batch crashes — the paid datacenter/residential proxies are never tried.

Changes:
- Replace make_round_robin_cycler with make_tiered_cycler (same as availability)
- Add _fetch_page_via_cycler: retries per page across tiers, records
  success/failure in cycler so circuit breaker can escalate
- Fix batch_size to BATCH_SIZE=20 constant (was len(all_proxies) ≈ 22)
- Check cycler.is_exhausted() before each batch; catch RuntimeError mid-batch
  and write partial results rather than crashing with nothing
- CIRCUIT_BREAKER_THRESHOLD from env (default 10), matching availability

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 12:13:50 +01:00
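
A sketch of the per-page retry across tiers described above; the cycler interface shown here (a dict of callables) and the request details are assumptions.

```python
import requests


def _fetch_page_via_cycler(url: str, params: dict, cycler: dict) -> dict:
    last_exc: Exception | None = None
    while not cycler["is_exhausted"]():
        proxy_url = cycler["next_proxy"]()
        try:
            resp = requests.get(
                url,
                params=params,
                proxies={"http": proxy_url, "https": proxy_url},
                timeout=30,
            )
            resp.raise_for_status()
            cycler["record_success"](proxy_url)
            return resp.json()
        except requests.RequestException as exc:
            # Recorded so the circuit breaker can escalate to the next tier.
            cycler["record_failure"](proxy_url)
            last_exc = exc
    raise RuntimeError("all proxy tiers exhausted while fetching page") from last_exc
```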
Deeman
c1cdeec6be fix(extract): default worker count to 200 when proxies configured
Previously fell back to len(tiers[0]) (e.g. 10 for Webshare) when
PROXY_CONCURRENCY was unset. Default is now MAX_PROXY_CONCURRENCY=200
so single-URL rotating proxies (DC/residential) run at full concurrency
without needing an explicit env var.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-28 18:06:55 +01:00
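
A sketch of the worker-count derivation after this change; the constant and env-var names come from the commit messages, the function itself is illustrative.

```python
import os

MAX_PROXY_CONCURRENCY = 200


def resolve_worker_count(tiers: list[list[str]]) -> int:
    override = os.environ.get("PROXY_CONCURRENCY")
    if override:
        return min(int(override), MAX_PROXY_CONCURRENCY)
    if tiers:
        # Proxies configured: default to full concurrency so single-URL rotating
        # gateways are no longer capped at len(tiers[0]).
        return MAX_PROXY_CONCURRENCY
    return 1  # no proxies: serial, throttled mode
```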
Deeman
60659a5ec5 merge: daily tenant snapshots with date-based partition 2026-02-28 17:30:33 +01:00
Deeman
beb4195f16 feat(extract): daily tenant snapshots with date-based partition
- playtomic_tenants: partition by YYYY/MM/DD instead of ISO week;
  schedule changed from weekly to daily in workflows.toml
- playtomic_availability: _load_tenant_ids now tries 3-level glob
  (*/*/*/tenants.jsonl.gz) first for daily files, falls back to
  2-level for old monthly/weekly data

A reverse alphabetical sort would rank old monthly files above daily ones
('t' > '2' in ASCII, so 'tenants.jsonl.gz' in a monthly path outranks the day
directory in a daily path), hence the explicit fallback chain is required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-28 17:27:16 +01:00
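
A sketch of the three-level-then-two-level glob fallback described above; the landing subdirectory name and return shape are assumptions.

```python
from pathlib import Path


def _latest_tenants_file(landing_dir: Path) -> Path | None:
    base = landing_dir / "playtomic_tenants"
    # Daily layout first: YYYY/MM/DD/tenants.jsonl.gz
    daily = sorted(base.glob("*/*/*/tenants.jsonl.gz"), reverse=True)
    if daily:
        return daily[0]
    # Fall back to the old monthly/weekly layout: YYYY/MM or YYYY/Wnn
    legacy = sorted(base.glob("*/*/tenants.jsonl.gz"), reverse=True)
    return legacy[0] if legacy else None
```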
Deeman
88cc857f3a merge: weekly tenant snapshots via ISO week partition 2026-02-28 17:19:25 +01:00
Deeman
9116625884 feat(extract): weekly tenant snapshots via ISO week partition
Tenants extractor now partitions by ISO week (e.g. 2026/W09) instead of
month (2026/02), so each weekly run writes a fresh file rather than
skipping for the rest of the month.

_load_tenant_ids() in playtomic_availability already globs */*/tenants.jsonl.gz
and sorts reverse — 'W09' > '02' alphabetically so weekly files take priority
over old monthly ones automatically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-28 17:19:19 +01:00
Deeman
1af65bb46f feat(extract): add PROXY_CONCURRENCY override for rotating single-URL proxies
When DC/residential tiers have a single rotating endpoint, worker_count
defaulted to 1 (one URL = one worker). PROXY_CONCURRENCY lets you set
an explicit thread count (e.g. 100) for providers that handle concurrent
connections on a single URL.

Capped at MAX_PROXY_CONCURRENCY=200 to avoid overloading the endpoint.
Falls back to len(tiers[0]) when unset (existing behaviour).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-28 17:06:53 +01:00
Deeman
adf22924f6 feat(extract): three-tier proxy system with Webshare auto-fetch
Replace two-tier proxy setup (PROXY_URLS / PROXY_URLS_FALLBACK) with
N-tier escalation: free → datacenter → residential.

- proxy.py: fetch_webshare_proxies() auto-fetches the Webshare download
  API on each run (no more stale manually-copied lists). load_proxy_tiers()
  assembles tiers from WEBSHARE_DOWNLOAD_URL, PROXY_URLS_DATACENTER,
  PROXY_URLS_RESIDENTIAL. make_tiered_cycler() generalised to list[list[str]]
  with N-level escalation; is_fallback_active() replaced by is_exhausted().
  Old load_proxy_urls() / load_fallback_proxy_urls() deleted.

- playtomic_availability.py: both extract() and extract_recheck() use
  load_proxy_tiers() + generalised cycler. _fetch_venues_parallel fallback_urls
  param removed. All is_fallback_active() checks → is_exhausted().

- playtomic_tenants.py: flattens tiers for simple round-robin.

- test_supervisor.py: TestLoadProxyUrls removed (function deleted).
  Added TestFetchWebshareProxies, TestLoadProxyTiers, TestTieredCyclerNTier
  (11 tests covering parse format, error handling, escalation, thread safety).

47 tests pass, ruff clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-28 16:57:07 +01:00
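
A sketch of the tier assembly described above. The Webshare line format (ip:port:user:pass) and the env-var parsing are assumptions; the real fetch_webshare_proxies() and load_proxy_tiers() will differ.

```python
import os

import requests


def fetch_webshare_proxies(download_url: str) -> list[str]:
    resp = requests.get(download_url, timeout=30)
    resp.raise_for_status()
    proxies = []
    for line in resp.text.splitlines():
        parts = line.strip().split(":")
        if len(parts) != 4:  # assumed ip:port:user:pass download format
            continue
        ip, port, user, password = parts
        proxies.append(f"http://{user}:{password}@{ip}:{port}")
    return proxies


def load_proxy_tiers() -> list[list[str]]:
    tiers: list[list[str]] = []
    if os.environ.get("WEBSHARE_DOWNLOAD_URL"):
        tiers.append(fetch_webshare_proxies(os.environ["WEBSHARE_DOWNLOAD_URL"]))
    for var in ("PROXY_URLS_DATACENTER", "PROXY_URLS_RESIDENTIAL"):
        urls = [u.strip() for u in os.environ.get(var, "").split(",") if u.strip()]
        if urls:
            tiers.append(urls)
    return tiers  # escalation order: free -> datacenter -> residential
```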
Deeman
120f974970 merge: Phase 2a + 2b — EU NUTS-2 spatial join + US state income
Phase 2a: NUTS-1 regional income for Germany (16 Bundesländer via admin1→NUTS-1 mapping)
Phase 2b: EU-wide NUTS-2 via GISCO spatial join + US Census ACS state income
- All EU-27+EFTA+UK locations now auto-resolve to NUTS-2 via ST_Contains
- Germany gets sub-Bundesland (38 Regierungsbezirke) differentiation
- US gets state-level income with PPS normalisation
- Income cascade: NUTS-2 → NUTS-1 → US state → country-level

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-27 11:11:36 +01:00
Deeman
ede7983a77 fix(proxy): add missing make_sticky_selector function
Tests imported make_sticky_selector but it was never implemented.
Hash-based (MD5) consistent selector — same key always returns the
same proxy, distributes across the pool.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-27 11:10:20 +01:00
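
A sketch of the hash-based consistent selector described above; only the key-hashing idea is taken from the commit, the exact signature is an assumption.

```python
import hashlib


def make_sticky_selector(proxy_urls: list[str]):
    def select(key: str) -> str:
        # MD5 of the key gives a stable index: the same key always maps to the
        # same proxy, while distinct keys spread across the pool.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return proxy_urls[int(digest, 16) % len(proxy_urls)]

    return select
```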
Deeman
409dc4bfac feat(data): Phase 2b step 1 — expand stg_regional_income + Census income extractor
- stg_regional_income.sql: accept NUTS-1 (3-char) + NUTS-2 (4-char) codes;
  rename nuts1_code → nuts_code; add nuts_level column; NUTS-2 rows were
  already in the landing zone but discarded by LENGTH(geo_code) = 3
- scripts/download_gisco_nuts.py: one-time download of GISCO NUTS-2 boundary
  GeoJSON (NUTS_RG_20M_2021_4326_LEVL_2.geojson, ~5MB) to landing zone;
  uncompressed because ST_Read cannot read .gz files
- census_usa_income.py: new extractor for ACS B19013_001E state-level median
  household income; follows census_usa.py pattern; 51 states + DC
- all.py + pyproject.toml: register census_usa_income extractor

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-27 10:58:12 +01:00
Deeman
5ade38eeaf feat(data): Phase 2a — NUTS-1 regional income for opportunity score
- eurostat.py: add nama_10r_2hhinc dataset config; append filter params to
  request URL so server pre-filters the large cube before download
- stg_regional_income.sql: new staging model — reads nama_10r_2hhinc.json.gz,
  filters to NUTS-1 codes (3-char), normalises EL→GR / UK→GB
- dim_locations.sql: add admin1_to_nuts1 VALUES CTE (16 German Bundesländer)
  + regional_income CTE; final SELECT uses COALESCE(regional, country) income
- init_landing_seeds.py: add empty seed for nama_10r_2hhinc.json.gz

Munich/Bayern now scores ~29K PPS vs Chemnitz/Sachsen ~19K PPS instead of
both inheriting the same national average (~25.5K PPS).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-27 10:26:15 +01:00
Deeman
1fc348f10c feat(extract): lower US city population threshold to 10K
878 → 4212 cities. Broadens coverage to match the granularity of
Eurostat and GeoNames data for smaller metro markets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-26 00:02:40 +01:00
Deeman
e61aaa574b merge: proxy-pinned UA identities + honest bot UA for public APIs
# Conflicts:
#	extract/padelnomics_extract/src/padelnomics_extract/_shared.py
2026-02-25 22:12:34 +01:00
Deeman
c5b46376af feat(extract): proxy-pinned UA identities + honest bot UA for public APIs
Replace single hardcoded Chrome 131 UA with:
- BOT_UA: honest padelnomics-bot UA for Overpass, Eurostat, GeoNames etc.
- _UA_POOL + ua_for_proxy(): deterministic browser UA per proxy URL so each
  IP presents a consistent, distinct fingerprint across runs.

Public-API extractors (shared session, no proxy) now send BOT_UA.
Playtomic extractors (proxy-backed) each get a stable pool UA keyed on
their proxy URL hash.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 22:08:00 +01:00
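
A sketch of the proxy-pinned UA lookup described above; the pool entries and the BOT_UA string are placeholders, not the repo's values.

```python
import hashlib

# Honest bot identity for public APIs (Overpass, Eurostat, GeoNames, ...); placeholder string.
BOT_UA = "padelnomics-bot/1.0 (+contact URL)"

_UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
]


def ua_for_proxy(proxy_url: str) -> str:
    # Deterministic: hashing the proxy URL pins each IP to one stable browser fingerprint.
    digest = hashlib.md5(proxy_url.encode("utf-8")).hexdigest()
    return _UA_POOL[int(digest, 16) % len(_UA_POOL)]
```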
Deeman
73330b1aaa fix: add Overpass mirror fallback to eliminate 504 failures
Adds OVERPASS_MIRRORS list (overpass-api.de, kumi.systems, openstreetmap.ru)
and a post_overpass() helper in _shared.py that tries mirrors in order,
logging a warning on each failure and re-raising the last RequestException
if all mirrors fail. Both overpass.py and overpass_tennis.py now call
post_overpass() instead of hard-coding the primary URL.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 21:29:51 +01:00
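
A sketch of the mirror fallback described above; the full mirror URLs are expanded from the host names in the commit message and may not match the repo exactly.

```python
import logging

import requests

log = logging.getLogger(__name__)

OVERPASS_MIRRORS = [
    "https://overpass-api.de/api/interpreter",
    "https://overpass.kumi.systems/api/interpreter",
    "https://overpass.openstreetmap.ru/api/interpreter",
]


def post_overpass(session: requests.Session, query: str, timeout: int = 180) -> requests.Response:
    last_exc: requests.RequestException | None = None
    for url in OVERPASS_MIRRORS:
        try:
            resp = session.post(url, data={"data": query}, timeout=timeout)
            resp.raise_for_status()
            return resp
        except requests.RequestException as exc:
            log.warning("Overpass mirror %s failed: %s", url, exc)
            last_exc = exc
    # All mirrors failed: re-raise the last RequestException.
    raise last_exc
```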
Deeman
cee2e9babc merge: standardise recheck availability to JSONL + update docs 2026-02-25 15:45:23 +01:00
Deeman
b33dd51d76 feat: standardise recheck availability to JSONL output
- extract_recheck() now writes availability_{date}_recheck_{HH}.jsonl.gz
  (one venue per line with date/captured_at_utc/recheck_hour injected);
  uses compress_jsonl_atomic; removes write_gzip_atomic import
- stg_playtomic_availability: add recheck_jsonl CTE (newline_delimited
  read_json on *.jsonl.gz recheck files); include in all_venues UNION ALL;
  old recheck_blob CTE kept for transition
- init_landing_seeds.py: add JSONL recheck seed alongside blob seed
- Docs: README landing structure + data sources table updated; CHANGELOG
  availability bullets updated; data-sources-inventory paths corrected

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 14:52:47 +01:00
Deeman
a86f1ecd3a fix(staging): enforce grain dedup in resources + opening_hours + skip old blob in tenants
Both stg_playtomic_resources and stg_playtomic_opening_hours lacked QUALIFY ROW_NUMBER()
dedup despite declaring a grain. When both tenants.json.gz (old) and tenants.jsonl.gz (new)
exist for the same month, the UNION ALL produced exactly 2× rows.

Fixes:
- stg_playtomic_resources: QUALIFY ROW_NUMBER() OVER (PARTITION BY tenant_id, resource_id)
- stg_playtomic_opening_hours: QUALIFY ROW_NUMBER() OVER (PARTITION BY tenant_id, day_of_week)
- playtomic_tenants.py: skip if old blob OR new JSONL already exists for the month,
  preventing same-month dual-format writes that trigger the duplicate

Row counts after fix: ~43.8K resources, ~93.4K opening_hours (was 87.6K, 186.8K).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 13:41:23 +01:00
Deeman
b5b8493543 feat(extract): regional overpass_tennis splitting + JSONL output
Replace single global Overpass query (150K+ elements, times out) with
10 regional bbox queries (~10-40K elements each, 150s server / 180s client).

- REGIONS: 10 bboxes covering all continents
- Crash recovery: working.jsonl accumulates per-region results;
  already_seen_ids deduplication skips already-written elements on restart
- Overlapping bbox elements deduped by OSM id across regions
- Retry per region: up to 2 retries with 30s cooldown
- Polite 5s inter-region delay
- Skip if courts.jsonl.gz or courts.json.gz already exists for the month

stg_tennis_courts: UNION ALL transition (jsonl_elements + blob_elements)
  - jsonl_elements: JSONL, explicit columns, COALESCE lat/lon with center coords
    (supports both node direct lat/lon and way/relation Overpass out center)
  - blob_elements: existing UNNEST(elements) pattern, unchanged
  - Removed osm_type='node' filter — ways/relations now usable via center coords
  - Dedup on (osm_id, extracted_date DESC) unchanged

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 12:19:37 +01:00
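
A sketch of the per-region accumulation and OSM-id dedup described above; the file layout and field names are assumptions.

```python
import json
from pathlib import Path


def load_seen_ids(working: Path) -> set[int]:
    if not working.exists():
        return set()
    return {json.loads(line)["id"] for line in working.read_text().splitlines() if line.strip()}


def append_region(working: Path, elements: list[dict], seen: set[int]) -> None:
    with working.open("a", encoding="utf-8") as fh:
        for el in elements:
            # Overlapping bboxes and crash restarts both produce duplicates; skip by OSM id.
            if el["id"] in seen:
                continue
            seen.add(el["id"])
            fh.write(json.dumps(el) + "\n")
```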
Deeman
a4f246d69a feat(extract): convert geonames to JSONL output
- cities_global.jsonl.gz replaces .json.gz (one city object per line)
- Empty placeholder writes a minimal .jsonl.gz (null row, filtered in staging)
- Eliminates the {"rows": [...]} blob wrapper and maximum_object_size workaround

stg_population_geonames: UNION ALL transition (jsonl_rows + blob_rows)
  - jsonl_rows: read_json JSONL, explicit columns, no UNNEST
  - blob_rows: existing UNNEST(rows) pattern with 40MB size limit retained

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 12:16:59 +01:00
Deeman
7b03fd71f9 feat(extract): convert playtomic_availability to JSONL output
- availability_{date}.jsonl.gz replaces .json.gz for morning snapshots
- Each JSONL line = one venue object with date + captured_at_utc injected
- Eliminates in-memory consolidation: working.jsonl IS the final file
  (compress_jsonl_atomic at end instead of write_gzip_atomic blob)
- Crash recovery unchanged: working.jsonl accumulates via flush_partial_batch
- _load_morning_availability tries .jsonl.gz first, falls back to .json.gz
- Skip check covers both formats during transition
- Recheck files stay blob format (small, infrequent)

stg_playtomic_availability: UNION ALL transition (morning_jsonl + morning_blob + recheck_blob)
  - morning_jsonl: read_json JSONL, tenant_id direct column, no outer UNNEST
  - morning_blob / recheck_blob: subquery + LATERAL UNNEST (unchanged semantics)
  - All three produce (snapshot_date, captured_at_utc, snapshot_type, recheck_hour, tenant_id, slots_json)
  - Downstream raw_resources / raw_slots CTEs unchanged

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 12:14:38 +01:00
Deeman
9bef055e6d feat(extract): convert playtomic_tenants to JSONL output
- playtomic_tenants.py: write each tenant as a JSONL line after dedup,
  compress via compress_jsonl_atomic → tenants.jsonl.gz
- playtomic_availability.py: update _load_tenant_ids() to prefer
  tenants.jsonl.gz, fall back to tenants.json.gz (transition)
- stg_playtomic_venues.sql: UNION ALL jsonl+blob CTEs for transition;
  JSONL reads top-level columns directly, no UNNEST(tenants) needed
- stg_playtomic_resources.sql: same UNION ALL pattern, single UNNEST
  for resources in JSONL path vs double UNNEST in blob path
- stg_playtomic_opening_hours.sql: same UNION ALL pattern, opening_hours
  as top-level JSON column in JSONL path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 12:07:53 +01:00
Deeman
6bede60ef8 feat(extract): add compress_jsonl_atomic() utility
Streams a JSONL working file to .jsonl.gz in 1MB chunks (constant memory),
atomic rename via .tmp sibling, deletes source on success. Companion to
write_gzip_atomic() for extractors that stream records incrementally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 11:50:17 +01:00
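
A sketch of the utility described above: stream the working file through gzip in 1MB chunks, rename atomically via a .tmp sibling, delete the source on success. The exact signature is an assumption.

```python
import gzip
import shutil
from pathlib import Path


def compress_jsonl_atomic(src: Path, dst: Path) -> None:
    tmp = dst.with_name(dst.name + ".tmp")
    with src.open("rb") as fin, gzip.open(tmp, "wb") as fout:
        shutil.copyfileobj(fin, fout, length=1024 * 1024)  # constant memory: 1MB chunks
    tmp.rename(dst)  # atomic on the same filesystem
    src.unlink()     # drop the uncompressed working file only after success
```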
Deeman
d834bdc59a feat(extract): recheck every 30 min with 30-min window for accurate occupancy
Each slot is now rechecked once, at most 30 min before it starts.
Worst-case miss: a booking made 29 min before start.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 09:39:30 +01:00
Deeman
b7c8568265 fix(extract): recheck window 90→60 min (correct reasoning this time)
60-min window + hourly rechecks = each slot caught exactly once, 0-60 min
before it starts. 90-min window causes double-querying (T-90 and T-30).
Slot duration is irrelevant — it doesn't affect when the slot appears in
the window.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 09:37:17 +01:00
Deeman
be8872beb2 revert: restore recheck window to 90 min
Data analysis of 5,115 venues with slots shows 24.8% have a 90-min minimum
slot duration. A 60-min window would miss those venues entirely with hourly
rechecks. 90 min is correct — covers 30/60/90-min minimum venues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 09:35:12 +01:00
Deeman
d15787caeb fix(extract): recheck window 90→60 min — matches hourly schedule and min slot duration
With hourly rechecks and 60-min minimum slots, a 90-min window causes each
slot to be queried twice. 60-min window = each slot caught exactly once in
the recheck immediately before it starts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 09:33:20 +01:00
Deeman
d5947af8d4 merge: maximum performance extraction (parallel pages + crash-safe partial JSONL)
# Conflicts:
#	.env.dev.sops
#	.env.prod.sops
#	extract/padelnomics_extract/src/padelnomics_extract/playtomic_tenants.py
2026-02-24 22:36:34 +01:00
Deeman
9f010d8c0c perf(extract): parallel page fetching in tenants, drop EXTRACT_WORKERS env var
- playtomic_tenants.py: batch_size = len(proxy_urls) pages fired in parallel per
  batch; each page gets its own session + proxy; sorted(results) ensures
  deterministic done-detection; falls back to serial + THROTTLE_SECONDS when no
  proxies. Expected speedup: ~2.5 min → ~15 s with 10 proxies.
- .env.dev.sops, .env.prod.sops: remove EXTRACT_WORKERS (now derived from
  PROXY_URLS length)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 22:30:28 +01:00
Deeman
6116445b56 perf(extract): auto-detect workers from proxies, skip throttle on success, crash-safe partial JSONL
- proxy.py: delete unused make_sticky_selector()
- utils.py: add load_partial_results() + flush_partial_batch() for crash-resumable extraction
- playtomic_availability.py:
  - drop MAX_WORKERS / EXTRACT_WORKERS — worker_count = len(proxy_urls) or 1
  - skip time.sleep(THROTTLE_SECONDS) on success when proxy_url is set; keep sleeps for 429/5xx
  - replace cursor-based resumption with .partial.jsonl sidecar (flush every 50 records)
  - _fetch_venues_parallel accepts on_result callback for incremental partial-file flushing
  - mirror auto-detect worker count in extract_recheck()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 22:21:05 +01:00
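
A sketch of the crash-resumable .partial.jsonl sidecar described above; record keys and signatures are assumptions.

```python
import json
from pathlib import Path


def load_partial_results(partial: Path) -> dict[str, dict]:
    # On restart, anything already flushed to the sidecar is skipped rather than refetched.
    if not partial.exists():
        return {}
    results = {}
    for line in partial.read_text().splitlines():
        if line.strip():
            rec = json.loads(line)
            results[rec["tenant_id"]] = rec
    return results


def flush_partial_batch(partial: Path, batch: list[dict]) -> None:
    # Appended every ~50 records; a crash loses at most one unflushed batch.
    with partial.open("a", encoding="utf-8") as fh:
        for rec in batch:
            fh.write(json.dumps(rec) + "\n")
```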
Deeman
79d1b0e672 feat(extract): tiered proxy with circuit breaker + proxy provider research
- playtomic_tenants.py: simplify proxy cycler call (cycler() instead of
  cycler["next_proxy"]()) — matches refactored proxy API
- docs/proxy-provider-inventory.md: proxy provider comparison table for
  Playtomic scraping (~14k req/day, residential IPs, pay-per-GB)
- .env.*.sops: updated encrypted secrets (re-encrypted)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 22:15:11 +01:00
Deeman
78ffbc313f feat(extract): parallel DAG scheduler + proxy rotation for tenants
- all.py: replace sequential loop with graphlib.TopologicalSorter + ThreadPoolExecutor
  - EXTRACTORS dict declares (func, [deps]) — self-documenting dependency graph
  - 8 extractors run in parallel immediately; availability starts as soon as
    tenants finishes (not after all others complete)
  - max_workers=len(EXTRACTORS) — all I/O-bound, no CPU contention
- playtomic_tenants.py: add proxy rotation via make_round_robin_cycler
  - no throttle when PROXY_URLS set (IP rotation removes per-IP rate concern)
  - keeps 2s throttle for direct runs
- _shared.py: add optional proxy_url param to run_extractor()
  - any extractor can opt in to proxy support via the shared session
- overpass_tennis.py: fix query timeout (out body → out center, timeout 180 → 300)
  - out center returns centroids only, not full geometry — fits within server limits
- playtomic_availability.py: fix CIRCUIT_BREAKER_THRESHOLD empty string crash
  - int(os.environ.get(..., "10")) → int(os.environ.get(...) or "10")

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 21:17:00 +01:00
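
A sketch of the dependency-aware parallel runner described above, using graphlib.TopologicalSorter plus a thread pool as the commit states; the EXTRACTORS entries are placeholders for the real extractor functions.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Placeholder registry: name -> (callable, [dependency names]).
EXTRACTORS = {
    "playtomic_tenants": (lambda: None, []),
    "playtomic_availability": (lambda: None, ["playtomic_tenants"]),
    "overpass": (lambda: None, []),
}


def run_all() -> None:
    sorter = TopologicalSorter({name: deps for name, (_, deps) in EXTRACTORS.items()})
    sorter.prepare()
    with ThreadPoolExecutor(max_workers=len(EXTRACTORS)) as pool:
        futures = {}
        while sorter.is_active():
            # Submit everything whose dependencies have completed.
            for name in sorter.get_ready():
                futures[pool.submit(EXTRACTORS[name][0])] = name
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
            for fut in done:
                fut.result()  # surface exceptions from the extractor
                sorter.done(futures.pop(fut))
```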
Deeman
44c0dd0b8d refactor: minor TigerStyle cleanups
- export_serving.py: move `import re` to module level — was imported
  inside a loop body on every iteration
- sitemap.py: add comment documenting that the in-memory TTL cache is
  process-local (valid for single-worker deployment, Dockerfile --workers 1)
- playtomic_availability.py: use `or "10"` fallback for
  CIRCUIT_BREAKER_THRESHOLD env var to handle empty-string case

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 20:50:43 +01:00
Deeman
ad48f23cfc fix: add precondition assertions in extract pipeline
Assert landing_dir.is_dir() and year_month format (YYYY/MM) at the
entry point of each extract function — turning silent wrong-path bugs
into an immediate AssertionError with a descriptive message.

Files changed:
- playtomic_availability.py: assert in _load_tenant_ids(), extract(),
  extract_recheck()
- eurostat.py: assert in extract()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 20:42:11 +01:00
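
A sketch of the entry-point assertions described above; the message wording is illustrative.

```python
import re
from pathlib import Path


def extract(landing_dir: Path, year_month: str) -> None:
    # Preconditions at the entry point: fail fast with a descriptive message
    # instead of silently writing to a wrong path.
    assert landing_dir.is_dir(), f"landing dir does not exist: {landing_dir}"
    assert re.fullmatch(r"\d{4}/\d{2}", year_month), f"expected YYYY/MM, got {year_month!r}"
    ...
```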
Deeman
d3db830c98 Merge branch 'master' into worktree-dual-market-score
# Conflicts:
#	.env.dev.sops
2026-02-24 16:38:25 +01:00
Deeman
c109488d9d feat(extract): expand GeoNames to cities1000 + add tennis court extractor
GeoNames:
- cities15000 → cities1000 (~140K global locations, pop ≥ 1K)
- Add lat/lon, admin1_code, admin2_code to output (needed for dim_locations)
- Expand feature codes to include PPLA3/4/5 (Gemeinden, cantons, etc.)
- Remove MIN_POPULATION=50K floor — cities1000 already pre-filters to ≥1K
- Update assertions for new scale (~100K+ expected)

Tennis courts:
- New overpass_tennis.py extractor (sport=tennis, 180s Overpass timeout)
- Registered as extract-overpass-tennis, added to EXTRACTORS list
- New stg_tennis_courts.sql staging model (grain: osm_id)

stg_population_geonames: add lat, lon, admin1_code, admin2_code columns

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 16:15:20 +01:00
Deeman
0b472e1a32 feat(extract): tiered proxy with circuit breaker for Playtomic availability
Adds a two-tier proxy system for the Playtomic availability extractor:
- Primary tier (PROXY_URLS): datacenter proxies, cheap and fast
- Fallback tier (PROXY_URLS_FALLBACK): residential rotating gateway, reliable

Circuit breaker opens after CIRCUIT_BREAKER_THRESHOLD (default: 10) consecutive
failures, permanently switching to the fallback tier for the rest of the run.
No auto-recovery — avoids flapping. If circuit opens with no fallback configured,
logs an error and writes partial results rather than continuing on a dead proxy pool.

Parallel mode submits futures in PARALLEL_BATCH_SIZE=100 batches so the circuit
breaker can stop new submissions after it opens.

New env vars added to .env.dev.sops (blank defaults):
  PROXY_URLS_FALLBACK          — residential/rotating gateway URL
  CIRCUIT_BREAKER_THRESHOLD    — consecutive failures before switching (default 10)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 16:01:50 +01:00
Deeman
1762188f08 docs(inventory): document population pipeline implementation findings
- Add section 9 to data-sources-inventory.md covering live API quirks:
  Eurostat SDMX city labels response shape, ONS CSV download path (observations
  API 404s), US Census ACS place endpoint, GeoNames cities15000 bulk format
- Add population coverage summary table and DuckDB glob limitation note
- fix(extract): census_usa + geonames write empty placeholder when credentials
  absent so SQLMesh staging models don't fail with "no files found"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 01:01:10 +01:00
Deeman
06cbdf80dc fix(extract): correct Eurostat + ONS response parsers against live APIs
eurostat_city_labels: API returns compact dimension JSON (category.label dict),
not SDMX 2.1 nested codelists structure. Fixed parser to read from
data["category"]["label"]. 1771 city codes fetched successfully.

ons_uk: observations endpoint (TS007A) is 404. Switched to CSV download via
/datasets/mid-year-pop-est/editions/mid-2022-england-wales — fetches ~68MB CSV,
filters to sex='all' + target year, aggregates population per LAD. 316 LADs ≥50K.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 00:26:13 +01:00
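
A sketch of the corrected city-labels parse described above, based on the response shape named in the commit; the function name is illustrative.

```python
def parse_city_labels(payload: dict) -> dict[str, str]:
    # The live API returns a compact dimension object rather than nested SDMX 2.1 codelists:
    # {"category": {"label": {"<city_code>": "<city_name>", ...}}}
    return dict(payload["category"]["label"])
```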
Deeman
0960990373 feat(data): Sprint 1-5 population pipeline — city labels, US/UK/Global extractors
Part A: Data Layer — Sprints 1-5

Sprint 1 — Eurostat SDMX city labels (unblocks EU population):
- New extractor: eurostat_city_labels.py — fetches ESTAT/CITIES codelist
  (city_code → city_name mapping) with ETag dedup
- New staging model: stg_city_labels.sql — grain city_code
- Updated dim_cities.sql — joins Eurostat population via city code lookup;
  replaces hardcoded 0::BIGINT population

Sprint 2 — Market score formula v2:
- city_market_profile.sql: 30pt population (LN/1M), 25pt income PPS (/200),
  30pt demand (occupancy or density), 15pt data confidence
- Moved venue_pricing_benchmarks join into base CTE so median_occupancy_rate
  is available to the scoring formula

Sprint 3 — US Census ACS extractor:
- New extractor: census_usa.py — ACS 5-year place population (vintage 2023)
- New staging model: stg_population_usa.sql — grain (place_fips, ref_year)

Sprint 4 — ONS UK extractor:
- New extractor: ons_uk.py — 2021 Census LAD population via ONS beta API
- New staging model: stg_population_uk.sql — grain (lad_code, ref_year)

Sprint 5 — GeoNames global extractor:
- New extractor: geonames.py — cities15000.zip bulk download, filtered to ≥50K pop
- New staging model: stg_population_geonames.sql — grain geoname_id
- dim_cities: 5-source population cascade (Eurostat > Census > ONS > GeoNames > 0)
  with case/whitespace-insensitive city name matching

Registered all 4 new CLI entrypoints in pyproject.toml and all.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 00:07:08 +01:00
Deeman
a1faddbed6 feat: Python supervisor + feature flags
Supervisor (replaces supervisor.sh):
- supervisor.py — cron-based pipeline orchestration, reads workflows.toml
  on every tick, runs due extractors in topological waves with parallel
  execution, then SQLMesh transform + serving export
- workflows.toml — workflow registry: overpass (monthly), eurostat (monthly),
  playtomic_tenants (weekly), playtomic_availability (daily),
  playtomic_recheck (hourly 6–23)
- padelnomics-supervisor.service — updated ExecStart to Python supervisor

Extraction enhancements:
- proxy.py — optional round-robin/sticky proxy rotation via PROXY_URLS env
- playtomic_availability.py — parallel fetch (EXTRACT_WORKERS), recheck mode
  (main_recheck) re-queries imminent slots for accurate occupancy measurement
- _shared.py — realistic browser User-Agent on all extractor sessions
- stg_playtomic_availability.sql — reads morning + recheck snapshots, tags each
- fct_daily_availability.sql — prefers recheck over morning for same slot

Feature flags (replaces WAITLIST_MODE env var):
- migration 0019 — feature_flags table, 5 initial flags:
  markets (on), payments/planner_export/supplier_signup/lead_unlock (off)
- core.py — is_flag_enabled() + feature_gate() decorator
- routes — payments, markets, planner_export, supplier_signup, lead_unlock gated
- admin flags UI — /admin/flags toggle page + nav link
- app.py — flag() injected as Jinja2 global

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 13:53:45 +01:00
Deeman
a055660cd2 fix: replace broken bbox pagination with global page-based extraction
Playtomic API ignores bbox params (min_latitude, etc.) and offset param.
Discovered that `page` param works correctly for global enumeration.

Result: 14,202 venues across 82 countries (was 100 with bbox approach).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 01:16:35 +01:00
Deeman
13c86ebf84 Merge branch 'worktree-extraction-overhaul'
# Conflicts:
#	transform/sqlmesh_padelnomics/models/foundation/dim_cities.sql
#	transform/sqlmesh_padelnomics/models/staging/stg_playtomic_venues.sql
2026-02-23 01:01:26 +01:00