feat: restructure extraction to one file per source

Split monolithic execute.py into per-source modules with separate CLI
entry points. Each extractor now uses the framework from utils.py:
- SQLite state tracking (start_run / end_run per extractor)
- Proper logging (replace print() with logger)
- Atomic gzip writes (write_gzip_atomic)
- Connection pooling (niquests.Session)
- Bounded pagination (MAX_PAGES_PER_BBOX = 500)
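
The atomic-write item above can be illustrated with a minimal sketch. Only the name write_gzip_atomic comes from this commit message. The signature, the return value, and the temp-file strategy are assumptions for illustration, not the actual code in utils.py.

```python
import gzip
import os
import tempfile
from pathlib import Path


def write_gzip_atomic(path: Path, data: bytes) -> int:
    """Hypothetical sketch: gzip `data` and move it into place in one step.

    Returns the size in bytes of the compressed file (assumed behavior).
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the destination directory so that os.replace
    # stays on one filesystem, where the rename is atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as raw, gzip.GzipFile(fileobj=raw, mode="wb") as gz:
            gz.write(data)
        # Readers see either the old file or the complete new one.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the partial temp file if compression or the rename failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return path.stat().st_size
```

Writing to a sibling temp file first means an interrupted run leaves at most a stray .tmp file, never a truncated .gz at the final path.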

New entry points:
  extract              — run all 4 extractors sequentially
  extract-overpass     — OSM padel courts
  extract-eurostat     — city demographics (etag dedup)
  extract-playtomic-tenants      — venue listings
  extract-playtomic-availability — booking slots + pricing (NEW)
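
A rough sketch of what running the extractors sequentially could look like. The run_all helper, its return value, and the choice to continue after a failure are assumptions. The commit message only states that the four extractors run one after another.

```python
import logging
from typing import Callable

logger = logging.getLogger("extract_all_sketch")


def run_all(extractors: dict[str, Callable[[], None]]) -> dict[str, bool]:
    """Hypothetical helper: run each extractor in insertion order.

    Returns a mapping of extractor name to whether it finished without error.
    """
    outcome: dict[str, bool] = {}
    for name, run in extractors.items():
        try:
            run()
            outcome[name] = True
        except Exception:
            # Assumption: one failing source should not block the others.
            logger.exception("extractor %s failed", name)
            outcome[name] = False
    return outcome
```

In this sketch each value would be the main function of one per-source module, passed in the order the sources should run.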

The availability extractor reads tenant IDs from the latest tenants.json.gz,
queries next-day slots for each venue, and stores daily consolidated snapshots.
It supports resuming via a cursor and retrying with backoff.
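
The cursor and backoff behavior described above might be sketched as follows. All names here (fetch_with_backoff, extract_availability, the cursor being the last completed tenant ID, the exponential delay schedule) are hypothetical and chosen for illustration. The actual extractor may track its cursor and delays differently.

```python
import time
from typing import Callable, Iterable, Optional


def fetch_with_backoff(fetch: Callable[[str], dict], tenant_id: str,
                       max_attempts: int = 3, base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch(tenant_id), retrying with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch(tenant_id)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")


def extract_availability(tenant_ids: Iterable[str],
                         fetch: Callable[[str], dict],
                         cursor: Optional[str] = None,
                         sleep: Callable[[float], None] = time.sleep):
    """Fetch every tenant after `cursor`; return (results, new_cursor).

    Assumption: the cursor is the ID of the last tenant that completed.
    """
    pending = list(tenant_ids)
    if cursor in pending:
        pending = pending[pending.index(cursor) + 1:]
    results = {}
    for tenant_id in pending:
        results[tenant_id] = fetch_with_backoff(fetch, tenant_id, sleep=sleep)
        cursor = tenant_id  # advance only after a successful fetch
    return results, cursor
```

Passing the sleep function as a parameter keeps the sketch testable without real delays. A real run would also persist the cursor between invocations.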

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deeman
2026-02-22 18:56:41 +01:00
parent ea86940b78
commit 53e9bbd66b
10 changed files with 625 additions and 223 deletions

@@ -1,6 +1,6 @@
[project]
name = "padelnomics_extract"
-version = "0.1.0"
+version = "0.2.0"
description = "Data extraction pipelines for padelnomics"
requires-python = ">=3.11"
dependencies = [
@@ -9,7 +9,11 @@ dependencies = [
]
[project.scripts]
-extract = "padelnomics_extract.execute:extract_dataset"
+extract = "padelnomics_extract.all:main"
+extract-overpass = "padelnomics_extract.overpass:main"
+extract-eurostat = "padelnomics_extract.eurostat:main"
+extract-playtomic-tenants = "padelnomics_extract.playtomic_tenants:main"
+extract-playtomic-availability = "padelnomics_extract.playtomic_availability:main"
[build-system]
requires = ["hatchling"]