diff --git a/CHANGELOG.md b/CHANGELOG.md
index 6189c7e..63e16bc 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,24 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 ## [Unreleased]
 
 ### Added
+- **JSONL streaming landing format** — extractors now write one JSON object per line (`.jsonl.gz`) instead of a single large blob, eliminating in-memory accumulation and `maximum_object_size` workarounds:
+  - `playtomic_tenants.py` → `tenants.jsonl.gz` (one tenant per line; dedup still happens in memory before write)
+  - `playtomic_availability.py` → `availability_{date}.jsonl.gz` (one venue per line with `date`/`captured_at_utc` injected; working file IS the final file — eliminates the consolidation step)
+  - `geonames.py` → `cities_global.jsonl.gz` (one city per line; eliminates 30 MB blob and its `maximum_object_size` workaround)
+  - `compress_jsonl_atomic(jsonl_path, dest_path)` utility added to `utils.py` — streams compression in 1 MB chunks, atomic `.tmp` rename, deletes source
+- **Regional Overpass splitting for tennis courts** — replaces single global query (150K+ elements, timed out) with 10 regional bbox queries (~10-40K elements each, 150s server / 180s client):
+  - Regions: europe\_west, europe\_central, europe\_east, north\_america, south\_america, asia\_east, asia\_west, oceania, africa, asia\_north
+  - Per-region retry (2 attempts, 30s cooldown) + 5s inter-region polite delay
+  - Crash recovery via `working.jsonl` accumulation — already-written element IDs skipped on restart; completed regions produce 0 new elements on re-query
+  - Output: `courts.jsonl.gz` (one OSM element per line)
+- **`scripts/init_landing_seeds.py`** — creates minimal `.jsonl.gz` and `.json.gz` seed files in `1970/01/` so SQLMesh staging models can run before real extraction data arrives; idempotent
+
+### Changed
+- All modified staging SQL models use **UNION ALL transition CTEs** — both JSONL (new) and blob (old) formats are readable simultaneously; old `.json.gz` files in the landing zone continue working until they rotate out naturally:
+  - `stg_playtomic_venues`, `stg_playtomic_resources`, `stg_playtomic_opening_hours` — JSONL top-level columns (no `UNNEST(tenants)`)
+  - `stg_playtomic_availability` — JSONL morning files + blob morning files + blob recheck files
+  - `stg_population_geonames` — JSONL city rows (no `UNNEST(rows)`, no `maximum_object_size`)
+  - `stg_tennis_courts` — JSONL elements with `COALESCE(lat, center.lat)` for way/relation centre coords; blob UNNEST kept for old files
 - **Marketplace admin dashboard** (`/admin/marketplace`) — single-screen health view for the two-sided market:
   - **Lead funnel** — total / verified-new (ready to unlock) / unlocked / won / conversion rate
   - **Credit economy** — total credits issued, consumed (lead unlocks), outstanding balance across all paid suppliers, 30-day burn rate
diff --git a/PROJECT.md b/PROJECT.md
index 6e94289..9562f7f 100644
--- a/PROJECT.md
+++ b/PROJECT.md
@@ -93,6 +93,9 @@
 - [x] `dim_venues` (OSM + Playtomic deduped), `dim_cities` (Eurostat population)
 - [x] `city_market_profile` (market score OBT), `planner_defaults` (per-city calculator pre-fill)
 - [x] DuckDB analytics reader in app lifecycle
+- [x] **JSONL streaming landing format** — extractors write `.jsonl.gz` (one record per line); constant-memory compression via `compress_jsonl_atomic()`; eliminates `maximum_object_size` workarounds; all modified staging models use UNION ALL transition to support both formats
+- [x] **Regional Overpass tennis splitting** — 10 regional bbox queries replace the single global 150K-element query that timed out; crash recovery via `working.jsonl` accumulation
+- [x] **`init_landing_seeds.py`** — creates minimal seed files for both JSONL and blob formats so SQLMesh can run before real data arrives
 
 ### i18n
 - [x] Full i18n across entire app (EN + DE)
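
Reviewer note: the diff does not include the body of `compress_jsonl_atomic(jsonl_path, dest_path)`, so the following is only a minimal sketch of what the changelog entry describes (1 MB chunked streaming, `.tmp` rename, source deletion). The function name and parameters come from the changelog. The `CHUNK_BYTES` constant, the exact `.tmp` naming, and the absence of a return value are assumptions made for illustration, not the actual code in `utils.py`.

```python
import gzip
import os
from pathlib import Path

CHUNK_BYTES = 1024 * 1024  # 1 MB per read, per the changelog; the constant name is hypothetical


def compress_jsonl_atomic(jsonl_path, dest_path):
    """Gzip jsonl_path into dest_path without holding the whole file in memory."""
    jsonl_path, dest_path = Path(jsonl_path), Path(dest_path)
    # Assumed naming: write next to the destination so the rename stays on one filesystem.
    tmp_path = dest_path.with_name(dest_path.name + ".tmp")

    with open(jsonl_path, "rb") as src, gzip.open(tmp_path, "wb") as dst:
        # Copy in fixed-size chunks; memory use is bounded by CHUNK_BYTES.
        while chunk := src.read(CHUNK_BYTES):
            dst.write(chunk)

    # Readers never observe a half-written destination: the finished .tmp replaces it in one step.
    os.replace(tmp_path, dest_path)
    # Remove the uncompressed working file only after the rename has succeeded.
    jsonl_path.unlink()
```

Under these assumptions, a crash during compression leaves the uncompressed source and at most a stray `.tmp` file behind, so the extractor can simply run the compression again. `os.replace` is atomic on POSIX only when source and destination are on the same filesystem, which is why the sketch puts the temporary file next to `dest_path`.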