merge: standardise recheck availability to JSONL + update docs

Deeman
2026-02-25 15:45:23 +01:00
6 changed files with 63 additions and 31 deletions

@@ -118,7 +118,7 @@ Playtomic covers 16,000+ courts globally. The platform is dominant in Spain, UK,
**Pipeline implementation (tenants):** ✅ Ingested
- Extractor: `extract-playtomic-tenants` — paginated global scrape of `GET /v1/tenants?sport_ids=PADEL`, page size 100, up to 500 pages
-- Landing: `data/landing/playtomic/{year}/{month}/tenants.json.gz` (~14K venues as of Feb 2026)
+- Landing: `data/landing/playtomic/{year}/{month}/tenants.jsonl.gz` (~14K venues as of Feb 2026)
- Throttle: 2 s between pages; deduplicates on `tenant_id`
- Staging models (all grain `tenant_id` or `(tenant_id, resource_id)`):
- `stg_playtomic_venues` — venue metadata: name, address, city, country, coordinates, booking type, status
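
For reviewers, the tenant extraction described above (paginate, throttle, deduplicate on `tenant_id`, land as gzipped JSONL) could be sketched roughly as follows. This is an illustration only, not the extractor's actual code: `fetch_page`, `iter_tenants`, and `write_jsonl_gz` are hypothetical names, and the HTTP call is left to the caller so that no request details are assumed.

```python
import gzip
import json
import time
from typing import Callable, Iterable, Iterator


def iter_tenants(
    fetch_page: Callable[[int], list],   # hypothetical: returns one page of tenant dicts
    max_pages: int = 500,                # "up to 500 pages" per the docs
    throttle_s: float = 2.0,             # "2 s between pages" per the docs
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield each tenant once, stopping at the first empty page."""
    seen = set()
    for page in range(max_pages):
        if page > 0:
            sleep(throttle_s)            # throttle between pages, not before the first
        batch = fetch_page(page)
        if not batch:
            break
        for tenant in batch:
            tenant_id = tenant["tenant_id"]
            if tenant_id in seen:        # deduplicate on tenant_id
                continue
            seen.add(tenant_id)
            yield tenant


def write_jsonl_gz(records: Iterable[dict], path: str) -> int:
    """Write one JSON object per line to a gzip file; return the record count."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Writing one record per line is what lets downstream readers stream the file instead of parsing a single large JSON document.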
@@ -127,9 +127,10 @@ Playtomic covers 16,000+ courts globally. The platform is dominant in Spain, UK,
**Pipeline implementation (availability):** ✅ Ingested
- Extractor: `extract-playtomic-availability` — reads tenant IDs from latest tenants file, queries `GET /v1/availability` for next-day slots per venue
-- Landing: `data/landing/playtomic/{year}/{month}/{date}/availability_morning.json.gz` + `availability_recheck.json.gz`
-- Recheck mode: re-queries slots starting within 90 min (controlled by `RECHECK_WINDOW_MINUTES`); captures near-real-time fill rates
-- Parallelism: `EXTRACT_WORKERS` env var; `PROXY_URLS` for distributed rate limiting; throttle 1 s per venue per worker
+- Landing: `data/landing/playtomic/{year}/{month}/availability_{date}.jsonl.gz` (morning) + `availability_{date}_recheck_{HH}.jsonl.gz` (recheck)
+- Old blob format (`.json.gz`) retained in landing zone alongside JSONL; staging reads both
+- Recheck mode: re-queries slots starting within `RECHECK_WINDOW_MINUTES` (default 30); captures near-real-time fill rates
+- Parallelism: worker count derived from `PROXY_URLS` length; throttle 1 s per venue per worker
- Staging: `stg_playtomic_availability`, grain `(snapshot_date, tenant_id, resource_id, slot_start_time, snapshot_type, captured_at_utc)`
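
As a rough illustration of the availability behaviour documented above, the sketch below shows one way the recheck window, the proxy-derived worker count, and a reader that accepts both landing formats might look. Several things here are assumptions rather than facts from this commit: `PROXY_URLS` is taken to be comma-separated, an empty value is taken to mean one worker, each slot is taken to be a dict with a timezone-aware `slot_start_time`, and the old `.json.gz` blob is taken to be a single JSON document. All function names are hypothetical.

```python
import gzip
import json
import os
from datetime import datetime, timedelta
from typing import Mapping


def recheck_window_minutes(env: Mapping[str, str] = os.environ) -> int:
    """Read RECHECK_WINDOW_MINUTES, defaulting to 30 as documented."""
    return int(env.get("RECHECK_WINDOW_MINUTES", "30"))


def slots_to_recheck(slots: list, now: datetime, window_minutes: int) -> list:
    """Keep slots that start between now and now + window (inclusive)."""
    cutoff = now + timedelta(minutes=window_minutes)
    return [s for s in slots if now <= s["slot_start_time"] <= cutoff]


def worker_count(env: Mapping[str, str] = os.environ) -> int:
    """One worker per proxy URL (assumed comma-separated); at least one worker."""
    urls = [u.strip() for u in env.get("PROXY_URLS", "").split(",") if u.strip()]
    return max(1, len(urls))


def read_landing(path: str) -> list:
    """Read either a .jsonl.gz file (one record per line) or an old .json.gz blob."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        if path.endswith(".jsonl.gz"):
            return [json.loads(line) for line in fh if line.strip()]
        data = json.load(fh)  # assumed: the old blob is one JSON document
    return data if isinstance(data, list) else [data]
```

Deriving the worker count from the proxy list removes the chance of `EXTRACT_WORKERS` and `PROXY_URLS` drifting out of sync, which appears to be the motivation for dropping the separate variable.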
---