merge: maximum performance extraction (parallel pages + crash-safe partial JSONL)

# Conflicts:
#	.env.dev.sops
#	.env.prod.sops
#	extract/padelnomics_extract/src/padelnomics_extract/playtomic_tenants.py
Deeman
2026-02-24 22:36:34 +01:00
7 changed files with 312 additions and 209 deletions

@@ -39,6 +39,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
queries, geometry columns).
- **SOPS secrets** — `GEONAMES_USERNAME=padelnomics` and `CENSUS_API_KEY` added to both
`.env.dev.sops` and `.env.prod.sops`.
- **Crash-safe partial JSONL** — `utils.load_partial_results()` and `flush_partial_batch()`
provide a generic opt-in mechanism for incremental progress flushing during long extractions.
Any extractor processing items one-by-one can flush every N records and resume from a
`.partial.jsonl` sidecar file after a crash.
- **Methodology page updated** — `/en/market-score` now documents both scores with:
Two Scores intro section, component cards for each score (4 Marktreife + 5 Marktpotenzial),
score band interpretations, expanded FAQ (7 entries). Section headings use the padelnomics
@@ -52,6 +56,19 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
First "padelnomics Market Score" mention in each article template now links
to the methodology page (hub-and-spoke internal linking).
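
As an illustration of the crash-safe partial JSONL entry above, here is a minimal sketch of how a `.partial.jsonl` sidecar can work. The helper names match the changelog, but their signatures and bodies here are assumptions rather than code from the repository. `run_extraction`, the `id` key, and the `fetch` callback are hypothetical.

```python
import json
import os
from pathlib import Path


def load_partial_results(partial_path: Path, key: str) -> tuple[list[dict], set]:
    """Assumed signature: return (records, seen_keys) read from the sidecar file."""
    records, seen = [], set()
    if not partial_path.exists():
        return records, seen
    with partial_path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A crash mid-write can leave a truncated line; skip it.
                continue
            if isinstance(record, dict) and key in record:
                records.append(record)
                seen.add(record[key])
    return records, seen


def flush_partial_batch(partial_path: Path, batch: list[dict]) -> None:
    """Assumed signature: append a batch as JSON lines and force it to disk."""
    needs_newline = False
    if partial_path.exists() and partial_path.stat().st_size > 0:
        with partial_path.open("rb") as fh:
            fh.seek(-1, os.SEEK_END)
            # If the previous run died mid-line, start on a fresh line.
            needs_newline = fh.read(1) != b"\n"
    with partial_path.open("a", encoding="utf-8") as fh:
        if needs_newline:
            fh.write("\n")
        for record in batch:
            fh.write(json.dumps(record) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def run_extraction(item_ids, fetch, partial_path: Path, flush_every: int = 50) -> list[dict]:
    """Hypothetical driver: skip already-fetched ids, flush every N records."""
    results, seen = load_partial_results(partial_path, "id")
    batch: list[dict] = []
    for item_id in item_ids:
        if item_id in seen:
            continue  # fetched before the crash
        batch.append(fetch(item_id))
        if len(batch) >= flush_every:
            flush_partial_batch(partial_path, batch)
            results.extend(batch)
            batch = []
    if batch:
        flush_partial_batch(partial_path, batch)
        results.extend(batch)
    return results
```

In this sketch, records fetched after the last flush are lost on a crash and are fetched again on resume, so `flush_every` trades disk writes against repeated work.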
### Changed
- **`EXTRACT_WORKERS` env var removed** — worker count is now derived from `PROXY_URLS` length
(one worker per proxy). No proxies → single-threaded. No manual tuning needed.
- **Playtomic tenants extractor** — parallel batch page fetching when proxies are configured.
Each page in a batch fires concurrently using its own session + proxy. Expected speedup:
~2.5 min → ~15 s with 10 Webshare datacenter proxies.
- **Playtomic availability extractor** — three performance changes:
1. No per-request `time.sleep()` on success when a proxy is active (throttle only when
running direct). Retry/backoff sleeps for 429 and 5xx responses are unchanged.
2. Worker count auto-detected from proxy count (drops `EXTRACT_WORKERS`).
3. True crash resumption via `.partial.jsonl` sidecar: progress flushed every 50 venues,
resume skips already-fetched venues and merges prior results into the final file.
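
The worker-count and parallel-page changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the extractor's actual code: it assumes `PROXY_URLS` is a comma-separated list, and `proxy_urls_from_env`, `fetch_all_pages`, and the `fetch_page(page, proxy)` callback are hypothetical names. An empty page is assumed to mark the end of the listing.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def proxy_urls_from_env(env=os.environ) -> list[str]:
    """Parse PROXY_URLS (assumed comma-separated) into a list of proxy URLs."""
    raw = env.get("PROXY_URLS", "")
    return [url.strip() for url in raw.split(",") if url.strip()]


def fetch_all_pages(fetch_page, proxies: list[str], max_pages: int = 1000) -> list:
    """Fetch pages in batches of one page per proxy until an empty page appears.

    fetch_page(page, proxy) is a hypothetical callback returning a list of items;
    proxy is None when running direct.
    """
    # One worker per proxy; with no proxies, a single direct worker.
    slots = proxies or [None]
    results = []
    with ThreadPoolExecutor(max_workers=len(slots)) as pool:
        for start in range(0, max_pages, len(slots)):
            pages = range(start, min(start + len(slots), max_pages))
            # Each page in the batch is paired with its own proxy slot.
            batch = list(pool.map(fetch_page, pages, slots))
            for items in batch:
                if not items:
                    return results  # first empty page ends the listing
                results.extend(items)
    return results
```

Because a whole batch is requested before any page is inspected, this approach can fetch up to one batch of pages past the end of the listing; that is the cost of not knowing the page count in advance.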
### Fixed
- **`datetime.utcnow()` deprecation warnings** — replaced all 94 occurrences
across 22 files (source + tests) with `utcnow()` / `utcnow_iso()` helpers