feat: standardise recheck availability to JSONL output

- extract_recheck() now writes availability_{date}_recheck_{HH}.jsonl.gz
  (one venue per line with date/captured_at_utc/recheck_hour injected);
  uses compress_jsonl_atomic; removes write_gzip_atomic import
- stg_playtomic_availability: add recheck_jsonl CTE (newline_delimited
  read_json on *.jsonl.gz recheck files); include in all_venues UNION ALL;
  old recheck_blob CTE kept for transition
- init_landing_seeds.py: add JSONL recheck seed alongside blob seed
- Docs: README landing structure + data sources table updated; CHANGELOG
  availability bullets updated; data-sources-inventory paths corrected

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
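
For readers who want a concrete picture of the recheck output described in the first bullet, here is a minimal sketch of the idea: one venue per line, with date, captured_at_utc and recheck_hour injected, gzip-compressed and moved into place atomically. It uses only the standard library and does not reproduce the real compress_jsonl_atomic helper, whose signature is not shown in this commit. The function names write_recheck_jsonl and read_jsonl_gz, the tenant_id field, and the zero-padded two-digit {HH} are assumptions made for illustration.

```python
import gzip
import json
import os
import tempfile
from pathlib import Path


def write_recheck_jsonl(venues, dest_dir, date, recheck_hour, captured_at_utc):
    """Write one venue per line to availability_{date}_recheck_{HH}.jsonl.gz.

    Hypothetical stand-in for the extractor's write step; not the project's API.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: {HH} is the recheck hour, zero-padded to two digits.
    dest = dest_dir / f"availability_{date}_recheck_{recheck_hour:02d}.jsonl.gz"

    # Write to a temp file in the same directory, then rename it into place,
    # so a reader never sees a partially written file.
    fd, tmp_name = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    os.close(fd)
    try:
        with gzip.open(tmp_name, "wt", encoding="utf-8") as gz:
            for venue in venues:
                # Injected fields come last, so they win over same-named venue keys.
                record = {
                    **venue,
                    "date": date,
                    "captured_at_utc": captured_at_utc,
                    "recheck_hour": recheck_hour,
                }
                gz.write(json.dumps(record, separators=(",", ":")) + "\n")
        os.replace(tmp_name, dest)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return dest


def read_jsonl_gz(path):
    """Read a gzip-compressed JSONL file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as gz:
        return [json.loads(line) for line in gz if line.strip()]
```

Because every line is a self-contained JSON object that already carries its own date and recheck hour, a newline-delimited reader (such as the read_json call mentioned in the second bullet) can load many recheck files through one glob without parsing metadata out of the file name.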
@@ -37,7 +37,7 @@ src/padelnomics_extract/
 
 ```python
 from ._shared import run_extractor, setup_logging
-from .utils import landing_path, write_gzip_atomic
+from .utils import compress_jsonl_atomic, landing_path
 
 logger = setup_logging("padelnomics.extract.my_source")
 EXTRACTOR_NAME = "my_source"
@@ -108,18 +108,23 @@ sqlite3 data/landing/.state.sqlite \
 ```
 data/landing/
 ├── .state.sqlite
-├── overpass/{year}/{month}/courts.json.gz
+├── overpass/{year}/{month}/courts.{jsonl,json}.gz
+├── overpass_tennis/{year}/{month}/courts.{jsonl,json}.gz
 ├── eurostat/{year}/{month}/urb_cpop1.json.gz
 ├── eurostat/{year}/{month}/ilc_di03.json.gz
-├── playtomic/{year}/{month}/tenants.json.gz
-└── playtomic/{year}/{month}/availability_{date}.json.gz
+├── geonames/{year}/{month}/cities_global.{jsonl,json}.gz
+├── playtomic/{year}/{month}/tenants.{jsonl,json}.gz
+├── playtomic/{year}/{month}/availability_{date}.{jsonl,json}.gz
+└── playtomic/{year}/{month}/availability_{date}_recheck_{HH}.{jsonl,json}.gz
 ```
 
 ## Data sources
 
 | Source | Module | Schedule | Notes |
 |--------|--------|----------|-------|
-| Overpass API | `overpass.py` | Daily | OSM padel courts, ~5K nodes |
+| Overpass API (padel) | `overpass.py` | Daily | OSM padel courts, ~5K nodes; JSONL output |
+| Overpass API (tennis) | `overpass_tennis.py` | Daily | OSM tennis courts, ~150K+ nodes; regional splits; JSONL output |
 | Eurostat | `eurostat.py` | Daily (304 most runs) | urb_cpop1, ilc_di03 — etag dedup |
-| Playtomic tenants | `playtomic_tenants.py` | Daily | ~8K venues, bounded pagination |
-| Playtomic availability | `playtomic_availability.py` | Daily | Next-day slots, ~4.5h runtime |
+| GeoNames | `geonames.py` | Daily | ~140K locations (pop ≥1K); JSONL output |
+| Playtomic tenants | `playtomic_tenants.py` | Daily | ~14K venues, bounded pagination; JSONL output |
+| Playtomic availability | `playtomic_availability.py` | Daily + recheck | Morning: next-day slots; recheck: near-real-time fill; JSONL output |