feat: standardise recheck availability to JSONL output

- extract_recheck() now writes availability_{date}_recheck_{HH}.jsonl.gz
  (one venue per line with date/captured_at_utc/recheck_hour injected);
  uses compress_jsonl_atomic; removes write_gzip_atomic import
- stg_playtomic_availability: add recheck_jsonl CTE (newline_delimited
  read_json on *.jsonl.gz recheck files); include in all_venues UNION ALL;
  old recheck_blob CTE kept for transition
- init_landing_seeds.py: add JSONL recheck seed alongside blob seed
- Docs: README landing structure + data sources table updated; CHANGELOG
  availability bullets updated; data-sources-inventory paths corrected
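
Illustrative sketch (not the committed code): the first bullet above describes writing one venue per line with `date`/`captured_at_utc`/`recheck_hour` injected, then handing the file to `compress_jsonl_atomic`. A minimal version of that flow might look like the following. The `compress_jsonl_atomic(jsonl_path, dest_path)` signature and the file name pattern come from this commit; the function body, the `write_recheck` helper, the zero-padded hour, and the `tenant_id` field used in the usage note are assumptions made for illustration.

```python
import gzip
import json
import os
import shutil
from pathlib import Path

CHUNK_BYTES = 1024 * 1024  # 1 MB chunks, as the CHANGELOG describes


def compress_jsonl_atomic(jsonl_path: Path, dest_path: Path) -> None:
    """Assumed behaviour: gzip the source into a .tmp file, rename, delete source."""
    tmp_path = dest_path.with_name(dest_path.name + ".tmp")
    with open(jsonl_path, "rb") as src, gzip.open(tmp_path, "wb") as dst:
        shutil.copyfileobj(src, dst, CHUNK_BYTES)  # stream, never load whole file
    os.replace(tmp_path, dest_path)  # atomic when both paths share a filesystem
    jsonl_path.unlink()


def write_recheck(venues, landing_dir: Path, date: str,
                  captured_at_utc: str, recheck_hour: int) -> Path:
    """Hypothetical helper: write one venue per line, then compress atomically."""
    # Zero-padding the hour is an assumption about the {HH} placeholder.
    dest = landing_dir / f"availability_{date}_recheck_{recheck_hour:02d}.jsonl.gz"
    working = dest.with_suffix("")  # drops ".gz", leaving the ".jsonl" working file
    with open(working, "w", encoding="utf-8") as fh:
        for venue in venues:
            # Inject the three metadata fields into every venue row.
            row = {**venue, "date": date,
                   "captured_at_utc": captured_at_utc,
                   "recheck_hour": recheck_hour}
            fh.write(json.dumps(row) + "\n")
    compress_jsonl_atomic(working, dest)
    return dest
```

Calling `write_recheck([{"tenant_id": "a"}], Path("landing"), "2026-02-25", "2026-02-25T13:00:00Z", 14)` would, under these assumptions, leave only `availability_2026-02-25_recheck_14.jsonl.gz` in the `landing` directory (which must already exist), because the working `.jsonl` file is deleted after compression.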

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Deeman
2026-02-25 14:52:47 +01:00
parent 683ca3fc24
commit b33dd51d76
6 changed files with 63 additions and 31 deletions

@@ -9,7 +9,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
### Added
- **JSONL streaming landing format** — extractors now write one JSON object per line (`.jsonl.gz`) instead of a single large blob, eliminating in-memory accumulation and `maximum_object_size` workarounds:
- `playtomic_tenants.py` → `tenants.jsonl.gz` (one tenant per line; dedup still happens in memory before write)
-- `playtomic_availability.py` → `availability_{date}.jsonl.gz` (one venue per line with `date`/`captured_at_utc` injected; working file IS the final file — eliminates the consolidation step)
+- `playtomic_availability.py` → `availability_{date}.jsonl.gz` (morning) + `availability_{date}_recheck_{HH}.jsonl.gz` (recheck); one venue per line with `date`/`captured_at_utc`/`recheck_hour` injected
- `geonames.py` → `cities_global.jsonl.gz` (one city per line; eliminates 30 MB blob and its `maximum_object_size` workaround)
- `compress_jsonl_atomic(jsonl_path, dest_path)` utility added to `utils.py` — streams compression in 1 MB chunks, atomic `.tmp` rename, deletes source
- **Regional Overpass splitting for tennis courts** — replaces single global query (150K+ elements, timed out) with 10 regional bbox queries (~10-40K elements each, 150s server / 180s client):
@@ -22,7 +22,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
### Changed
- All modified staging SQL models use **UNION ALL transition CTEs** — both JSONL (new) and blob (old) formats are readable simultaneously; old `.json.gz` files in the landing zone continue working until they rotate out naturally:
- `stg_playtomic_venues`, `stg_playtomic_resources`, `stg_playtomic_opening_hours` — JSONL top-level columns (no `UNNEST(tenants)`)
-- `stg_playtomic_availability` — JSONL morning files + blob morning files + blob recheck files
+- `stg_playtomic_availability` — JSONL morning + recheck files; blob morning + recheck kept for transition
- `stg_population_geonames` — JSONL city rows (no `UNNEST(rows)`, no `maximum_object_size`)
- `stg_tennis_courts` — JSONL elements with `COALESCE(lat, center.lat)` for way/relation centre coords; blob UNNEST kept for old files
- **Marketplace admin dashboard** (`/admin/marketplace`) — single-screen health view for the two-sided market: