Deeman 53e9bbd66b feat: restructure extraction to one file per source
Split monolithic execute.py into per-source modules with separate CLI
entry points. Each extractor now uses the framework from utils.py:
- SQLite state tracking (start_run / end_run per extractor)
- Proper logging (replace print() with logger)
- Atomic gzip writes (write_gzip_atomic)
- Connection pooling (niquests.Session)
- Bounded pagination (MAX_PAGES_PER_BBOX = 500)

New entry points:
  extract              — run all 4 extractors sequentially
  extract-overpass     — OSM padel courts
  extract-eurostat     — city demographics (etag dedup)
  extract-playtomic-tenants      — venue listings
  extract-playtomic-availability — booking slots + pricing (NEW)

The availability extractor reads tenant IDs from the latest tenants.json.gz,
queries next-day slots for each venue, and stores daily consolidated snapshots.
Supports resumability via cursor and retry with backoff.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:56:41 +01:00

Padelnomics Extraction

Fetches raw data from external sources into the local landing zone. The downstream pipeline reads only from the landing zone, so extraction and transformation are fully decoupled.

Running

# One-shot (most recent data only)
LANDING_DIR=data/landing uv run extract

# First-time full backfill (add your own backfill entry point)
LANDING_DIR=data/landing uv run python -m padelnomics_extract.execute

Design: filesystem as state

The landing zone is an append-only store of raw files. Each file is named by its content fingerprint (etag or SHA256 hash), so:

  • Idempotency: running twice writes nothing if the source hasn't changed
  • Debugging: every historical raw file is preserved — reprocess any window by re-running transforms
  • Safety: extraction never mutates existing files, only appends new ones
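The append-only, content-addressed write can be sketched as follows. This is an illustrative standalone version under assumed names — the repo's real helper is write_gzip_atomic in utils.py, whose signature may differ:

```python
import gzip
import os
import tempfile
from pathlib import Path

def write_content_addressed(dest: Path, payload: bytes) -> bool:
    """Write payload to dest atomically; skip if the file already exists.

    Returns True if a new file was written, False if it was skipped.
    (Sketch only -- the repo's real helper is write_gzip_atomic.)
    """
    if dest.exists():
        # Same filename == same content fingerprint: nothing to do.
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename:
    # os.replace is atomic on POSIX, so readers never see a partial file.
    fd, tmp = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(gzip.compress(payload))
        os.replace(tmp, dest)
    except BaseException:
        os.unlink(tmp)
        raise
    return True
```

The second call with unchanged content is a no-op, which is exactly the idempotency property above.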

Etag-based dedup (preferred)

When the source provides an ETag header, use it as the filename:

data/landing/padelnomics/{year}/{month:02d}/{etag}.csv.gz

If the file already exists on disk, its content matches the server's current version — no content download is needed.
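The ETag path construction and skip check can be sketched like this (function names are hypothetical; the real logic lives in extract_file_by_etag()):

```python
from pathlib import Path

def etag_landing_path(landing_dir: Path, source: str, year: int,
                      month: int, etag: str) -> Path:
    """Content-addressed path mirroring the layout above:
    {source}/{year}/{month:02d}/{etag}.csv.gz."""
    # Normalize what servers send: strip the weak-validator prefix
    # (W/"...") and surrounding quotes so the ETag is filename-safe.
    clean = etag.removeprefix("W/").strip('"')
    return landing_dir / source / str(year) / f"{month:02d}" / f"{clean}.csv.gz"

def needs_download(landing_dir: Path, source: str, year: int,
                   month: int, etag: str) -> bool:
    """A HEAD request is enough to obtain the ETag; if the file named
    after it already exists, the current version is already landed."""
    return not etag_landing_path(landing_dir, source, year, month, etag).exists()
```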

Hash-based dedup (fallback)

When the source has no etag (static files that update in-place), download the content and use its SHA256 prefix as the filename:

data/landing/padelnomics/{year}/{date}_{sha256[:8]}.csv.gz

Two runs that produce identical content produce the same hash → same filename → skip.
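A sketch of the hash-based variant, under assumed names (the real logic lives in extract_file_by_hash()). Note the skip check globs across dates, since the date prefix changes daily while the hash suffix identifies the content:

```python
import hashlib
from datetime import date
from pathlib import Path

def hash_landing_path(landing_dir: Path, source: str, content: bytes,
                      day: date) -> Path:
    """Path for a hash-deduped file, per the layout above:
    {source}/{year}/{date}_{sha256[:8]}.csv.gz."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    return landing_dir / source / str(day.year) / f"{day.isoformat()}_{digest}.csv.gz"

def already_landed(landing_dir: Path, source: str, content: bytes) -> bool:
    """Skip check: a file with this hash suffix under any date means a
    previous run already landed identical content."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    return any(landing_dir.glob(f"{source}/*/*_{digest}.csv.gz"))
```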

State tracking

Every run writes one row to data/landing/.state.sqlite. Query it to answer operational questions:

# When did extraction last succeed?
sqlite3 data/landing/.state.sqlite \
  "SELECT extractor, started_at, status, files_written, files_skipped, cursor_value
   FROM extraction_runs ORDER BY run_id DESC LIMIT 10"

# Did anything fail in the last 7 days?
sqlite3 data/landing/.state.sqlite \
  "SELECT * FROM extraction_runs WHERE status = 'failed'
   AND started_at > datetime('now', '-7 days')"

State table schema:

Column         Type     Description
run_id         INTEGER  Auto-increment primary key
extractor      TEXT     Extractor name (e.g. padelnomics)
started_at     TEXT     ISO 8601 UTC timestamp
finished_at    TEXT     ISO 8601 UTC timestamp, NULL if still running
status         TEXT     running, success, or failed
files_written  INTEGER  New files written this run
files_skipped  INTEGER  Files already present (content unchanged)
bytes_written  INTEGER  Compressed bytes written
cursor_value   TEXT     Last successful cursor (date, etag, page, etc.)
error_message  TEXT     Exception message if status = failed
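DDL matching the columns above, plus minimal start_run / end_run helpers. This is a sketch: the real schema and helper signatures live in utils.py and may differ in constraints:

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS extraction_runs (
    run_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    extractor     TEXT NOT NULL,
    started_at    TEXT NOT NULL,            -- ISO 8601 UTC
    finished_at   TEXT,                     -- NULL while running
    status        TEXT NOT NULL
                  CHECK (status IN ('running', 'success', 'failed')),
    files_written INTEGER NOT NULL DEFAULT 0,
    files_skipped INTEGER NOT NULL DEFAULT 0,
    bytes_written INTEGER NOT NULL DEFAULT 0,
    cursor_value  TEXT,
    error_message TEXT
)
"""

def start_run(conn: sqlite3.Connection, extractor: str) -> int:
    """Insert a 'running' row and return its run_id."""
    cur = conn.execute(
        "INSERT INTO extraction_runs (extractor, started_at, status)"
        " VALUES (?, strftime('%Y-%m-%dT%H:%M:%SZ', 'now'), 'running')",
        (extractor,))
    conn.commit()
    return cur.lastrowid

def end_run(conn: sqlite3.Connection, run_id: int, status: str,
            error_message=None) -> None:
    """Close out a run row with its final status."""
    conn.execute(
        "UPDATE extraction_runs"
        " SET finished_at = strftime('%Y-%m-%dT%H:%M:%SZ', 'now'),"
        "     status = ?, error_message = ?"
        " WHERE run_id = ?",
        (status, error_message, run_id))
    conn.commit()
```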

Adding a new extractor

  1. Add a function in execute.py following the same pattern as extract_file_by_etag() or extract_file_by_hash()
  2. Call it from extract_dataset() with its own extractor name in start_run()
  3. Store files under a new subdirectory: landing_path(LANDING_DIR, "my_new_source", year)
  4. Add a new SQLMesh raw/ model that reads from the new subdirectory glob
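A skeleton for steps 1–3 might look like the following. Both function names and the landing_path signature are assumptions inferred from the helpers named above, not the real utils.py API:

```python
from pathlib import Path

def landing_path(landing_dir: str, source: str, year: int) -> Path:
    """Assumed shape of the utils.py helper: one subdirectory per source."""
    return Path(landing_dir) / source / str(year)

def extract_my_new_source(landing_dir: str, year: int) -> Path:
    """Hypothetical extractor skeleton. In the real code it would also:
      - register itself via start_run('my_new_source') / end_run(...)
      - fetch raw bytes through the shared niquests.Session
      - dedupe via extract_file_by_etag() or extract_file_by_hash()
    Here it only prepares the target directory (step 3)."""
    target = landing_path(landing_dir, "my_new_source", year)
    target.mkdir(parents=True, exist_ok=True)
    return target
```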

Landing zone structure

data/landing/
├── .state.sqlite              # extraction run history
└── padelnomics/               # one subdirectory per source
    └── {year}/
        └── {month:02d}/
            └── {etag}.csv.gz  # immutable, content-addressed files