# Padelnomics Extraction
Fetches raw data from external sources into the local landing zone. The transform pipeline then reads from the landing zone — extraction and transformation are fully decoupled.
## Running

```bash
# One-shot (most recent data only)
LANDING_DIR=data/landing uv run extract

# First-time full backfill (add your own backfill entry point)
LANDING_DIR=data/landing uv run python -m padelnomics_extract.execute
```
## Design: filesystem as state
The landing zone is an append-only store of raw files. Each file is named by its content fingerprint (ETag or SHA-256 hash), so:
- Idempotency: running twice writes nothing if the source hasn't changed
- Debugging: every historical raw file is preserved — reprocess any window by re-running transforms
- Safety: extraction never mutates existing files, only appends new ones
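All three guarantees reduce to one small invariant: write a new file only if its name is free. A minimal sketch of that invariant, assuming a hypothetical `write_if_absent` helper (not the actual code in `execute.py`):

```python
import os
import tempfile
from pathlib import Path


def write_if_absent(path: Path, content: bytes) -> bool:
    """Write content only if path does not exist yet; never overwrite.

    Returns True if a new file landed, False if it was already there.
    """
    if path.exists():
        return False  # content-addressed name => this content already landed
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(content)   # write to a temp file first ...
        os.replace(tmp, path)  # ... then atomically move into place
    except BaseException:
        os.unlink(tmp)  # a crash never leaves a partial file behind
        raise
    return True
```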
### ETag-based dedup (preferred)
When the source provides an ETag header, use it as the filename:
```
data/landing/padelnomics/{year}/{month:02d}/{etag}.csv.gz
```
If the file already exists on disk, its content matches the server's current version, so no download is needed.
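A sketch of this flow using `requests`; `extract_by_etag` and its signature are illustrative, not the project's actual API:

```python
from pathlib import Path

import requests


def extract_by_etag(url: str, dest_dir: Path) -> Path | None:
    """HEAD first; download only when the ETag is new to the landing zone."""
    head = requests.head(url, allow_redirects=True, timeout=30)
    head.raise_for_status()
    etag = head.headers["ETag"].strip('"')  # drop RFC 9110 quote wrapping
    dest = dest_dir / f"{etag}.csv.gz"
    if dest.exists():
        return None  # ETag already on disk => server content unchanged
    resp = requests.get(url, timeout=300)
    resp.raise_for_status()
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(resp.content)
    return dest
```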
### Hash-based dedup (fallback)
When the source has no ETag (static files that update in-place), download the content and use a prefix of its SHA-256 hash in the filename:
```
data/landing/padelnomics/{year}/{date}_{sha256[:8]}.csv.gz
```
Two runs that produce identical content produce the same hash → same filename → skip.
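A matching sketch of the hash fallback (same caveat: names are illustrative):

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

import requests


def extract_by_hash(url: str, dest_dir: Path) -> Path | None:
    """Download, fingerprint the bytes, and skip if that hash already landed."""
    resp = requests.get(url, timeout=300)
    resp.raise_for_status()
    digest = hashlib.sha256(resp.content).hexdigest()[:8]
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Any earlier file carrying this hash means identical content: skip,
    # even if it originally landed on a different date.
    if any(dest_dir.glob(f"*_{digest}.csv.gz")):
        return None
    today = datetime.now(timezone.utc).date().isoformat()
    dest = dest_dir / f"{today}_{digest}.csv.gz"
    dest.write_bytes(resp.content)
    return dest
```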
## State tracking
Every run writes one row to `data/landing/.state.sqlite`. Query it to answer operational questions:
```bash
# When did extraction last succeed?
sqlite3 data/landing/.state.sqlite \
  "SELECT extractor, started_at, status, files_written, files_skipped, cursor_value
   FROM extraction_runs ORDER BY run_id DESC LIMIT 10"

# Did anything fail in the last 7 days?
sqlite3 data/landing/.state.sqlite \
  "SELECT * FROM extraction_runs WHERE status = 'failed'
   AND started_at > datetime('now', '-7 days')"
```
State table schema:
| Column | Type | Description |
|---|---|---|
| `run_id` | INTEGER | Auto-increment primary key |
| `extractor` | TEXT | Extractor name (e.g. `padelnomics`) |
| `started_at` | TEXT | ISO 8601 UTC timestamp |
| `finished_at` | TEXT | ISO 8601 UTC timestamp, NULL if still running |
| `status` | TEXT | `running` → `success` or `failed` |
| `files_written` | INTEGER | New files written this run |
| `files_skipped` | INTEGER | Files already present (content unchanged) |
| `bytes_written` | INTEGER | Compressed bytes written |
| `cursor_value` | TEXT | Last successful cursor (date, etag, page, etc.) |
| `error_message` | TEXT | Exception message if `status = failed` |
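A sketch of the schema and run-tracking helpers consistent with the table above; the authoritative versions live in the extractor's `utils.py`, so the constraints, defaults, and helper signatures here are assumptions:

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS extraction_runs (
    run_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    extractor     TEXT NOT NULL,
    started_at    TEXT NOT NULL,              -- ISO 8601 UTC
    finished_at   TEXT,                       -- NULL while running
    status        TEXT NOT NULL DEFAULT 'running',
    files_written INTEGER NOT NULL DEFAULT 0,
    files_skipped INTEGER NOT NULL DEFAULT 0,
    bytes_written INTEGER NOT NULL DEFAULT 0,
    cursor_value  TEXT,
    error_message TEXT
);
"""


def _utcnow() -> str:
    return datetime.now(timezone.utc).isoformat()


def open_state_db(path: str = "data/landing/.state.sqlite") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def start_run(conn: sqlite3.Connection, extractor: str) -> int:
    """Insert a 'running' row and return its run_id."""
    cur = conn.execute(
        "INSERT INTO extraction_runs (extractor, started_at) VALUES (?, ?)",
        (extractor, _utcnow()),
    )
    conn.commit()
    return cur.lastrowid


def finish_run(conn: sqlite3.Connection, run_id: int, status: str, **fields) -> None:
    """Flip status to success/failed and record counters or the error."""
    # Column names come from trusted kwargs in our own code, not user input.
    assignments = "".join(f", {col} = ?" for col in fields)
    conn.execute(
        f"UPDATE extraction_runs SET finished_at = ?, status = ?{assignments} "
        "WHERE run_id = ?",
        (_utcnow(), status, *fields.values(), run_id),
    )
    conn.commit()
```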
## Adding a new extractor
- Add a function in `execute.py` following the same pattern as `extract_file_by_etag()` or `extract_file_by_hash()` (see the sketch after this list)
- Call it from `extract_dataset()` with its own `extractor` name in `start_run()`
- Store files under a new subdirectory: `landing_path(LANDING_DIR, "my_new_source", year)`
- Add a new SQLMesh `raw/` model that reads from the new subdirectory glob
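Wired together with the `start_run()`/`finish_run()` sketches above, the hypothetical `extract_by_etag()` from earlier, and a `landing_path()` like the one sketched at the end of this README, a new extractor might look like:

```python
import sqlite3
from pathlib import Path


def extract_my_new_source(landing_dir: Path, conn: sqlite3.Connection) -> None:
    """Hypothetical extractor following the extract_file_by_etag() pattern."""
    run_id = start_run(conn, extractor="my_new_source")
    try:
        dest_dir = landing_path(landing_dir, "my_new_source", 2025)
        written = extract_by_etag("https://example.com/data.csv.gz", dest_dir)
        finish_run(conn, run_id, status="success",
                   files_written=1 if written else 0,
                   files_skipped=0 if written else 1,
                   bytes_written=written.stat().st_size if written else 0)
    except Exception as exc:
        finish_run(conn, run_id, status="failed", error_message=str(exc))
        raise
```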
## Landing zone structure
```
data/landing/
├── .state.sqlite              # extraction run history
└── padelnomics/               # one subdirectory per source
    └── {year}/
        └── {month:02d}/
            └── {etag}.csv.gz  # immutable, content-addressed files
```
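For completeness, a `landing_path()` consistent with this layout could look as follows; the project's actual helper may have a different signature:

```python
from pathlib import Path


def landing_path(landing_dir: Path, source: str, year: int,
                 month: int | None = None) -> Path:
    """Build data/landing/{source}/{year}[/{month:02d}] per the tree above."""
    path = Path(landing_dir) / source / str(year)
    if month is not None:
        path = path / f"{month:02d}"  # ETag-style layouts add a month level
    return path
```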