# Padelnomics Extraction
Fetches raw data from external sources into the local landing zone. The pipeline then reads from the landing zone — extraction and transformation are fully decoupled.
## Running
```bash
# Run all extractors sequentially
LANDING_DIR=data/landing uv run extract
# Run a single extractor
LANDING_DIR=data/landing uv run extract-overpass
LANDING_DIR=data/landing uv run extract-eurostat
LANDING_DIR=data/landing uv run extract-playtomic-tenants
LANDING_DIR=data/landing uv run extract-playtomic-availability
```
## Architecture: one file per source
Each data source lives in its own module with a dedicated CLI entry point:
```
src/padelnomics_extract/
├── __init__.py
├── _shared.py # LANDING_DIR, logger, run_extractor() wrapper
├── utils.py # SQLite state tracking, atomic I/O helpers
├── overpass.py # OSM padel courts via Overpass API
├── eurostat.py # Eurostat city demographics (urb_cpop1, ilc_di03)
├── playtomic_tenants.py # Playtomic venue listings (tenant search)
├── playtomic_availability.py # Playtomic booking slots (next-day availability)
└── all.py # Runs all extractors sequentially
```
### Adding a new extractor
1. Create `my_source.py` following the pattern:
```python
from ._shared import run_extractor, setup_logging
from .utils import compress_jsonl_atomic, landing_path

logger = setup_logging("padelnomics.extract.my_source")
EXTRACTOR_NAME = "my_source"

def extract(landing_dir, year_month, conn, session):
    """Returns {"files_written": N, "bytes_written": N, ...}."""
    year, month = year_month.split("/")
    dest_dir = landing_path(landing_dir, "my_source", year, month)
    # ... fetch data, write to dest_dir ...
    return {"files_written": 1, "files_skipped": 0, "bytes_written": n}

def main():
    run_extractor(EXTRACTOR_NAME, extract)
```
2. Add entry point to `pyproject.toml`:
```toml
extract-my-source = "padelnomics_extract.my_source:main"
```
3. Import in `all.py` and add to `EXTRACTORS` list.
4. Add a staging model in `transform/sqlmesh_padelnomics/models/staging/`.
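The registration in step 3 can be sketched as follows — a hypothetical stand-in for `all.py`, with stub extract functions in place of the real modules:

```python
# Hypothetical sketch of the all.py wiring; real entries import the
# extractor modules and reference their extract() functions.
def extract_overpass(landing_dir, year_month, conn, session):
    return {"files_written": 1, "files_skipped": 0, "bytes_written": 10}

def extract_my_source(landing_dir, year_month, conn, session):
    return {"files_written": 2, "files_skipped": 0, "bytes_written": 20}

# all.py iterates this list so `uv run extract` covers every source.
EXTRACTORS = [
    ("overpass", extract_overpass),
    ("my_source", extract_my_source),
]

def run_all(landing_dir="data/landing", year_month="2026/02"):
    """Run every registered extractor in order, collecting their stats."""
    return {name: fn(landing_dir, year_month, None, None)
            for name, fn in EXTRACTORS}
```

Because `EXTRACTORS` is a plain list, adding a source is a one-line change and run order stays explicit.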
## Design: filesystem as state
The landing zone is an append-only store of raw files:
- **Idempotency**: running twice writes nothing if the source hasn't changed
- **Debugging**: every historical raw file is preserved
- **Safety**: extraction never mutates existing files, only appends new ones
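The idempotency property can be sketched as a skip-if-present check; `write_if_absent` is illustrative here, not the actual `utils.py` API:

```python
from pathlib import Path

def write_if_absent(dest: Path, payload: bytes) -> dict:
    """Write payload to dest only if dest does not already exist.

    Re-running an extractor on unchanged data is then a no-op, and the
    returned dict matches the stats shape the extractors report.
    """
    if dest.exists():
        return {"files_written": 0, "files_skipped": 1, "bytes_written": 0}
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(payload)
    return {"files_written": 1, "files_skipped": 0,
            "bytes_written": len(payload)}
```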
### Etag-based dedup (Eurostat)
When the source provides an `ETag` header, store it in a sibling `.etag` file.
On the next request, send `If-None-Match` — 304 means skip.
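A standard-library sketch of that round-trip — the helper name and error handling are illustrative, not the actual `eurostat.py` code:

```python
import urllib.error
import urllib.request
from pathlib import Path

def fetch_with_etag(url: str, dest: Path) -> bool:
    """Download url to dest, skipping when the server answers 304.

    The validator from the previous run lives in a sibling .etag file;
    replaying it via If-None-Match lets the server short-circuit.
    """
    etag_file = Path(str(dest) + ".etag")
    req = urllib.request.Request(url)
    if etag_file.exists():
        req.add_header("If-None-Match", etag_file.read_text())
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            dest.write_bytes(resp.read())
            etag = resp.headers.get("ETag")
            if etag:
                etag_file.write_text(etag)
        return True
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return False  # unchanged since the last run
        raise
```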
### Content-addressed (Overpass, Playtomic)
Files named by date or content. `write_gzip_atomic()` writes to a `.tmp` sibling
then renames — never leaves partial files on crash.
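The tmp-then-rename pattern can be sketched like this — an illustrative reimplementation, not the actual `utils.py` code:

```python
import gzip
import os
from pathlib import Path

def write_gzip_atomic(dest: Path, payload: bytes) -> int:
    """Compress payload and publish it at dest in one atomic step.

    Data first lands in a .tmp sibling; os.replace() then swaps it into
    place, so a crash mid-write leaves at most a stray .tmp file and
    never a truncated dest.
    """
    tmp = dest.with_name(dest.name + ".tmp")
    compressed = gzip.compress(payload)
    tmp.write_bytes(compressed)
    os.replace(tmp, dest)  # atomic rename on the same filesystem
    return len(compressed)
```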
## State tracking
Every run writes one row to `data/landing/.state.sqlite`:
```bash
sqlite3 data/landing/.state.sqlite \
"SELECT extractor, started_at, status, files_written, cursor_value
FROM extraction_runs ORDER BY run_id DESC LIMIT 10"
```
| Column | Type | Description |
|--------|------|-------------|
| `run_id` | INTEGER | Auto-increment primary key |
| `extractor` | TEXT | Extractor name (e.g. `overpass`, `eurostat`) |
| `started_at` | TEXT | ISO 8601 UTC timestamp |
| `finished_at` | TEXT | ISO 8601 UTC timestamp |
| `status` | TEXT | `running`, then `success` or `failed` |
| `files_written` | INTEGER | New files written this run |
| `files_skipped` | INTEGER | Files already present |
| `bytes_written` | INTEGER | Compressed bytes written |
| `cursor_value` | TEXT | Resume cursor (date, index, etc.) |
| `error_message` | TEXT | Exception message if failed |
## Landing zone structure
```
data/landing/
├── .state.sqlite
├── overpass/{year}/{month}/courts.{jsonl,json}.gz
├── overpass_tennis/{year}/{month}/courts.{jsonl,json}.gz
├── eurostat/{year}/{month}/urb_cpop1.json.gz
├── eurostat/{year}/{month}/ilc_di03.json.gz
├── geonames/{year}/{month}/cities_global.{jsonl,json}.gz
├── playtomic/{year}/{month}/tenants.{jsonl,json}.gz
├── playtomic/{year}/{month}/availability_{date}.{jsonl,json}.gz
└── playtomic/{year}/{month}/availability_{date}_recheck_{HH}.{jsonl,json}.gz
```
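The `landing_path()` helper referenced in the extractor pattern plausibly builds these `{source}/{year}/{month}` directories; a minimal sketch (the actual `utils.py` implementation may differ):

```python
from pathlib import Path

def landing_path(landing_dir: str, source: str,
                 year: str, month: str) -> Path:
    """Return landing_dir/source/year/month, creating it if needed."""
    dest = Path(landing_dir) / source / year / month
    dest.mkdir(parents=True, exist_ok=True)
    return dest
```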
## Data sources
| Source | Module | Schedule | Notes |
|--------|--------|----------|-------|
| Overpass API (padel) | `overpass.py` | Daily | OSM padel courts, ~5K nodes; JSONL output |
| Overpass API (tennis) | `overpass_tennis.py` | Daily | OSM tennis courts, ~150K+ nodes; regional splits; JSONL output |
| Eurostat | `eurostat.py` | Daily (usually 304 Not Modified) | urb_cpop1, ilc_di03 — etag dedup |
| GeoNames | `geonames.py` | Daily | ~140K locations (pop ≥1K); JSONL output |
| Playtomic tenants | `playtomic_tenants.py` | Daily | ~14K venues, bounded pagination; JSONL output |
| Playtomic availability | `playtomic_availability.py` | Daily + recheck | Morning: next-day slots; recheck: near-real-time fill; JSONL output |