# Padelnomics Extraction

Fetches raw data from external sources to the local landing zone. The pipeline then reads from the landing zone, so extraction and transformation are fully decoupled.

## Running

```bash
# Run all extractors sequentially
LANDING_DIR=data/landing uv run extract

# Run a single extractor
LANDING_DIR=data/landing uv run extract-overpass
LANDING_DIR=data/landing uv run extract-eurostat
LANDING_DIR=data/landing uv run extract-playtomic-tenants
LANDING_DIR=data/landing uv run extract-playtomic-availability
```

## Architecture: one file per source

Each data source lives in its own module with a dedicated CLI entry point:

```
src/padelnomics_extract/
├── __init__.py
├── _shared.py                 # LANDING_DIR, logger, run_extractor() wrapper
├── utils.py                   # SQLite state tracking, atomic I/O helpers
├── overpass.py                # OSM padel courts via Overpass API
├── eurostat.py                # Eurostat city demographics (urb_cpop1, ilc_di03)
├── playtomic_tenants.py       # Playtomic venue listings (tenant search)
├── playtomic_availability.py  # Playtomic booking slots (next-day availability)
└── all.py                     # Runs all extractors sequentially
```

### Adding a new extractor

1. Create `my_source.py` following the pattern:

```python
from ._shared import run_extractor, setup_logging
from .utils import landing_path, write_gzip_atomic

logger = setup_logging("padelnomics.extract.my_source")
EXTRACTOR_NAME = "my_source"


def extract(landing_dir, year_month, conn, session):
    """Returns {"files_written": N, "bytes_written": N, ...}."""
    # year_month arrives as "YYYY/MM"
    year, month = year_month.split("/")
    dest_dir = landing_path(landing_dir, "my_source", year, month)
    # ... fetch data, write to dest_dir ...
    # n is the number of compressed bytes written by the step above
    return {"files_written": 1, "files_skipped": 0, "bytes_written": n}


def main():
    run_extractor(EXTRACTOR_NAME, extract)
```

2. Add an entry point to `pyproject.toml` (under `[project.scripts]`):
```toml
extract-my-source = "padelnomics_extract.my_source:main"
```

3. Import the module in `all.py` and add it to the `EXTRACTORS` list.

4. Add a staging model in `transform/sqlmesh_padelnomics/models/staging/`.

## Design: filesystem as state

The landing zone is an append-only store of raw files:

- **Idempotency**: running twice writes nothing if the source hasn't changed
- **Debugging**: every historical raw file is preserved
- **Safety**: extraction never mutates existing files, only appends new ones

### ETag-based dedup (Eurostat)

When the source provides an `ETag` header, store it in a sibling `.etag` file.
On the next request, send the stored value in `If-None-Match`; a `304 Not Modified` response means the data is unchanged and the download is skipped.

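The conditional-request flow can be sketched roughly as below. This is a minimal illustration, not the code in `eurostat.py`: `fetch_if_changed` and the injected `http_get` callable are hypothetical names, and the callable stands in for whatever HTTP client the extractor actually uses.

```python
from pathlib import Path
from typing import Callable, Dict, Mapping, Tuple

# (status_code, response_headers, body): a stand-in for a real HTTP client call.
Response = Tuple[int, Mapping[str, str], bytes]
HttpGet = Callable[[str, Dict[str, str]], Response]


def fetch_if_changed(url: str, dest: Path, http_get: HttpGet) -> bool:
    """Download url to dest unless the server answers 304. Returns True if written."""
    # Sibling file holding the last seen ETag, e.g. urb_cpop1.json.gz.etag
    etag_path = dest.with_name(dest.name + ".etag")

    request_headers: Dict[str, str] = {}
    if etag_path.exists():
        request_headers["If-None-Match"] = etag_path.read_text().strip()

    status, response_headers, body = http_get(url, request_headers)
    if status == 304:
        return False  # unchanged since the stored ETag: skip the write
    if status != 200:
        raise RuntimeError(f"unexpected HTTP status {status} for {url}")

    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(body)

    # Remember the new ETag so the next run can send If-None-Match.
    etag = response_headers.get("ETag")
    if etag:
        etag_path.write_text(etag)
    return True
```

Passing the HTTP call in as a parameter keeps the sketch testable without network access; a real extractor would more likely call its session object directly.
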
### Content-addressed (Overpass, Playtomic)

Files are named by date or content. `write_gzip_atomic()` writes to a `.tmp` sibling
and then renames it into place, so a crash never leaves a partial file behind.

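The write-then-rename pattern might look like the sketch below. It is an assumption about how such a helper could be built, not the actual body of `write_gzip_atomic()` in `utils.py`; the name `write_gzip_atomic_sketch` and its signature are hypothetical.

```python
import gzip
import os
from pathlib import Path


def write_gzip_atomic_sketch(dest: Path, payload: bytes) -> int:
    """Gzip payload and publish it at dest atomically. Returns compressed size."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".tmp")  # sibling, so it is on the same filesystem

    compressed = gzip.compress(payload)
    with open(tmp, "wb") as fh:
        fh.write(compressed)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes are on disk before the rename

    # os.replace is an atomic rename when source and target share a filesystem,
    # so readers see either the old state or the complete new file.
    os.replace(tmp, dest)
    return len(compressed)
```

Keeping the temporary file next to its destination matters: a rename across filesystems is not atomic.
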
## State tracking

Every run writes one row to `data/landing/.state.sqlite`:

```bash
sqlite3 data/landing/.state.sqlite \
  "SELECT extractor, started_at, status, files_written, cursor_value
   FROM extraction_runs ORDER BY run_id DESC LIMIT 10"
```

| Column | Type | Description |
|--------|------|-------------|
| `run_id` | INTEGER | Auto-increment primary key |
| `extractor` | TEXT | Extractor name (e.g. `overpass`, `eurostat`) |
| `started_at` | TEXT | ISO 8601 UTC timestamp |
| `finished_at` | TEXT | ISO 8601 UTC timestamp |
| `status` | TEXT | `running` → `success` or `failed` |
| `files_written` | INTEGER | New files written this run |
| `files_skipped` | INTEGER | Files already present |
| `bytes_written` | INTEGER | Compressed bytes written |
| `cursor_value` | TEXT | Resume cursor (date, index, etc.) |
| `error_message` | TEXT | Exception message if failed |

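A table with these columns and the `running` → `success`/`failed` lifecycle could be maintained along the following lines. The column names come from the table above; the constraints, defaults, and the helper names `open_state`, `start_run`, and `finish_run` are assumptions made for illustration and may differ from what `utils.py` does.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS extraction_runs (
    run_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    extractor     TEXT NOT NULL,
    started_at    TEXT NOT NULL,
    finished_at   TEXT,
    status        TEXT NOT NULL,
    files_written INTEGER DEFAULT 0,
    files_skipped INTEGER DEFAULT 0,
    bytes_written INTEGER DEFAULT 0,
    cursor_value  TEXT,
    error_message TEXT
)
"""


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


def open_state(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def start_run(conn: sqlite3.Connection, extractor: str) -> int:
    """Insert a row in the 'running' state and return its run_id."""
    cur = conn.execute(
        "INSERT INTO extraction_runs (extractor, started_at, status) "
        "VALUES (?, ?, 'running')",
        (extractor, _utc_now()),
    )
    conn.commit()
    return cur.lastrowid


def finish_run(conn: sqlite3.Connection, run_id: int, result: dict,
               error: Optional[str] = None) -> None:
    """Mark the run 'success', or 'failed' when an error message is given."""
    conn.execute(
        "UPDATE extraction_runs SET finished_at = ?, status = ?, "
        "files_written = ?, files_skipped = ?, bytes_written = ?, "
        "cursor_value = ?, error_message = ? WHERE run_id = ?",
        (_utc_now(), "failed" if error else "success",
         result.get("files_written", 0), result.get("files_skipped", 0),
         result.get("bytes_written", 0), result.get("cursor_value"),
         error, run_id),
    )
    conn.commit()
```

Writing the `running` row before any work starts means a crashed run stays visible as `running` with no `finished_at`, which makes it easy to spot in the query shown earlier.
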
## Landing zone structure

```
data/landing/
├── .state.sqlite
├── overpass/{year}/{month}/courts.json.gz
├── eurostat/{year}/{month}/urb_cpop1.json.gz
├── eurostat/{year}/{month}/ilc_di03.json.gz
├── playtomic/{year}/{month}/tenants.json.gz
└── playtomic/{year}/{month}/availability_{date}.json.gz
```

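As a rough illustration of how the `{source}/{year}/{month}` layout above could be assembled, here is a hypothetical stand-in for the `landing_path` helper. The real function in `utils.py` may behave differently, for example in whether it creates directories.

```python
from pathlib import Path


def landing_path_sketch(landing_dir: str, source: str, year: str, month: str) -> Path:
    """Return <landing_dir>/<source>/<year>/<month>, creating it if needed."""
    dest_dir = Path(landing_dir) / source / year / month
    dest_dir.mkdir(parents=True, exist_ok=True)
    return dest_dir


def split_year_month(year_month: str) -> tuple:
    """Split a 'YYYY/MM' string into its (year, month) parts."""
    year, month = year_month.split("/")
    return year, month
```

With these, `landing_path_sketch("data/landing", "overpass", *split_year_month("2025/01")) / "courts.json.gz"` yields a path matching the `overpass/{year}/{month}/courts.json.gz` entry in the tree.
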
## Data sources

| Source | Module | Schedule | Notes |
|--------|--------|----------|-------|
| Overpass API | `overpass.py` | Daily | OSM padel courts, ~5K nodes |
| Eurostat | `eurostat.py` | Daily (most runs return 304) | urb_cpop1, ilc_di03; ETag dedup |
| Playtomic tenants | `playtomic_tenants.py` | Daily | ~8K venues, bounded pagination |
| Playtomic availability | `playtomic_availability.py` | Daily | Next-day slots, ~4.5 h runtime |