docs: update CHANGELOG and PROJECT.md for JSONL landing format
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
18
CHANGELOG.md
@@ -7,6 +7,24 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
## [Unreleased]

### Added
- **JSONL streaming landing format** — extractors now write one JSON object per line (`.jsonl.gz`) instead of a single large blob, eliminating in-memory accumulation and `maximum_object_size` workarounds:
- `playtomic_tenants.py` → `tenants.jsonl.gz` (one tenant per line; dedup still happens in memory before write)
- `playtomic_availability.py` → `availability_{date}.jsonl.gz` (one venue per line with `date`/`captured_at_utc` injected; working file IS the final file — eliminates the consolidation step)
- `geonames.py` → `cities_global.jsonl.gz` (one city per line; eliminates 30 MB blob and its `maximum_object_size` workaround)
- `compress_jsonl_atomic(jsonl_path, dest_path)` utility added to `utils.py` — streams compression in 1 MB chunks, atomic `.tmp` rename, deletes source
- **Regional Overpass splitting for tennis courts** — replaces single global query (150K+ elements, timed out) with 10 regional bbox queries (~10-40K elements each, 150s server / 180s client):
- Regions: europe_west, europe_central, europe_east, north_america, south_america, asia_east, asia_west, oceania, africa, asia_north
- Per-region retry (2 attempts, 30s cooldown) + 5s inter-region polite delay
- Crash recovery via `working.jsonl` accumulation — already-written element IDs skipped on restart; completed regions produce 0 new elements on re-query
- Output: `courts.jsonl.gz` (one OSM element per line)
- **`scripts/init_landing_seeds.py`** — creates minimal `.jsonl.gz` and `.json.gz` seed files in `1970/01/` so SQLMesh staging models can run before real extraction data arrives; idempotent
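
The crash-recovery behaviour described for the regional Overpass extractor (already-written element IDs are skipped on restart) can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the helper names `load_seen_ids` and `append_new_elements` are hypothetical, elements are keyed on `(type, id)` because OSM IDs are only unique per element type, and every line already in `working.jsonl` is assumed to be a complete JSON object.

```python
import json
from pathlib import Path


def load_seen_ids(working_path):
    """Collect (type, id) keys of elements already present in working.jsonl.

    Hypothetical helper: assumes one complete JSON object per line.
    """
    working_path = Path(working_path)
    seen = set()
    if not working_path.exists():
        return seen
    with open(working_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            element = json.loads(line)
            seen.add((element["type"], element["id"]))
    return seen


def append_new_elements(working_path, elements, seen):
    """Append only unseen elements, one per line; return how many were written.

    `seen` is updated in place so later regions skip the same elements.
    """
    written = 0
    with open(working_path, "a", encoding="utf-8") as fh:
        for element in elements:
            key = (element["type"], element["id"])
            if key in seen:
                continue
            fh.write(json.dumps(element, separators=(",", ":")) + "\n")
            seen.add(key)
            written += 1
    return written
```

On restart, a caller would rebuild `seen` with `load_seen_ids()` before re-querying; a region that had already completed then contributes 0 new elements, which matches the behaviour the entry describes.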
### Changed
- All modified staging SQL models use **UNION ALL transition CTEs** — both JSONL (new) and blob (old) formats are readable simultaneously; old `.json.gz` files in the landing zone continue working until they rotate out naturally:
- `stg_playtomic_venues`, `stg_playtomic_resources`, `stg_playtomic_opening_hours` — JSONL top-level columns (no `UNNEST(tenants)`)
- `stg_playtomic_availability` — JSONL morning files + blob morning files + blob recheck files
- `stg_population_geonames` — JSONL city rows (no `UNNEST(rows)`, no `maximum_object_size`)
- `stg_tennis_courts` — JSONL elements with `COALESCE(lat, center.lat)` for way/relation centre coords; blob UNNEST kept for old files
- **Marketplace admin dashboard** (`/admin/marketplace`) — single-screen health view for the two-sided market:
- **Lead funnel** — total / verified-new (ready to unlock) / unlocked / won / conversion rate
- **Credit economy** — total credits issued, consumed (lead unlocks), outstanding balance across all paid suppliers, 30-day burn rate
@@ -93,6 +93,9 @@
- [x] `dim_venues` (OSM + Playtomic deduped), `dim_cities` (Eurostat population)
- [x] `city_market_profile` (market score OBT), `planner_defaults` (per-city calculator pre-fill)
- [x] DuckDB analytics reader in app lifecycle
- [x] **JSONL streaming landing format** — extractors write `.jsonl.gz` (one record per line); constant-memory compression via `compress_jsonl_atomic()`; eliminates `maximum_object_size` workarounds; all modified staging models use UNION ALL transition to support both formats
- [x] **Regional Overpass tennis splitting** — 10 regional bbox queries replace the single global 150K-element query that timed out; crash recovery via `working.jsonl` accumulation
- [x] **`init_landing_seeds.py`** — creates minimal seed files for both JSONL and blob formats so SQLMesh can run before real data arrives
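
For orientation, the constant-memory compression step credited to `compress_jsonl_atomic()` could look something like the sketch below. Only the signature `compress_jsonl_atomic(jsonl_path, dest_path)` and the behaviour (1 MB chunks, atomic `.tmp` rename, source deleted) come from the changelog entry; the body, the `CHUNK_SIZE` name, the temp-file naming, and the return value are assumptions.

```python
import gzip
import os
import shutil
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # 1 MB chunks, as stated in the changelog entry


def compress_jsonl_atomic(jsonl_path, dest_path):
    """Gzip jsonl_path into dest_path without loading the file into memory.

    Sketch only: writes to a sibling "<dest>.tmp" file, renames it over the
    destination in one step, then deletes the uncompressed source.
    """
    jsonl_path = Path(jsonl_path)
    dest_path = Path(dest_path)
    # Assumed naming: the temp file sits next to the destination so the
    # rename stays on one filesystem.
    tmp_path = dest_path.with_name(dest_path.name + ".tmp")

    with open(jsonl_path, "rb") as src, gzip.open(tmp_path, "wb") as dst:
        # copyfileobj reads and writes CHUNK_SIZE bytes at a time.
        shutil.copyfileobj(src, dst, CHUNK_SIZE)

    # os.replace overwrites any existing destination atomically, so readers
    # never observe a partially written .jsonl.gz file.
    os.replace(tmp_path, dest_path)
    jsonl_path.unlink()
    return dest_path
```

Because the rename happens only after the gzip stream is fully written and closed, a crash mid-compression leaves at most a stray `.tmp` file and the untouched source, never a truncated landing file.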
### i18n
- [x] Full i18n across entire app (EN + DE)