From 8bb00ea9b0d5435eef8951c8b1b1ca926ae7c9e4 Mon Sep 17 00:00:00 2001
From: Deeman
Date: Tue, 24 Feb 2026 01:33:32 +0100
Subject: [PATCH] docs(inventory): pipeline tracker, scores, impl notes, FX section
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Replace Priority Summary Table with Pipeline Status Tracker: status (✅/🔲/⏸/—), score (1-5), credential requirements, and extractor refs for all 30+ sources
- Add implementation notes to §1.1 (Overpass), §1.2 (Playtomic tenants + availability), §5.1 (Eurostat urb_cpop1 + ilc_di03), §5.2 (Census), §5.3 (ONS)
- Update §8 DuckDB integration table with extractor names and status
- Add §10 FX / Currency Rates: ECB SDMX endpoint and Frankfurter.app wrapper, proposed landing format and stg_fx_rates staging model design

Co-Authored-By: Claude Opus 4.6
---
 research/data-sources-inventory.md | 215 +++++++++++++++++++++++------
 1 file changed, 171 insertions(+), 44 deletions(-)

diff --git a/research/data-sources-inventory.md b/research/data-sources-inventory.md
index 9d27ed7..f95f3a0 100644
--- a/research/data-sources-inventory.md
+++ b/research/data-sources-inventory.md
@@ -1,43 +1,51 @@
 # Padel Market Intelligence — Data Sources Inventory
 
-Compiled: 2026-02-21
-Purpose: Identify data sources to feed a DuckDB analytics pipeline for padel business intelligence.
+Compiled: 2026-02-21 · Updated: 2026-02-24
+Purpose: Identify and track data sources feeding the Padelnomics DuckDB analytics pipeline.
 
 ---
 
-## Priority Summary Table
+## Pipeline Status Tracker
 
-Sorted by Priority (High first), then by category.
+**Status:** ✅ Ingested — extractor + staging model live in `master` | 🔲 Planned — worth building | ⏸ On hold — blocked on cost/access | — Not targeted
 
-| Source | Category | Access Method | Priority | Notes |
-|--------|----------|---------------|----------|-------|
-| OpenStreetMap / Overpass API | Court Locations | Public API | High | Free, global, `sport=padel` tag, no auth |
-| Playtomic API (read-only) | Court Locations / Pricing | Public API | High | Some endpoints unauthenticated; official API needs club credentials |
-| Eurostat Statistics API | Demographics | Public API | High | Free, no auth, NUTS city-level data |
-| US Census Bureau API | Demographics | Public API | High | Free with API key, comprehensive |
-| ONS Beta API | Demographics | Public API | High | Free, no auth, 120 req/10 s limit |
-| FIP World Padel Report | Market Reports | Open Download | High | Free PDF; 2024 and 2025 editions available |
-| Playtomic Global Padel Report | Market Reports | Open Download | High | Free PDF; co-produced with PwC/Strategy& |
-| Sport England Active Lives | Demographics | Open Download | High | Free download; UK sports participation data |
-| USPA Court Directory | Court Locations | Scrape | High | Website scrape; 100+ member clubs listed |
-| DPV Standorte (Germany) | Court Locations | Scrape | High | German federation venue page, small dataset |
-| LTA Padel Venue Finder | Court Locations | Scrape | High | UK venue registry; The Padel Directory also available |
-| PadelAPI.org | Tournament Data | Public API | High | Free tier: 50k req, last 6 months of data |
-| padelapi.org MCP server | Tournament Data | Public API | High | AI-accessible padel tournament & player stats |
-| Google Maps Places API | Court Locations | Public API | Medium | $200 free/mo credit; text search for padel courts |
-| Playtomic third-party API | Pricing / Bookings | Public API | Medium | Club credential required; read-only; 1 req/min |
-| ImmoScout24 API | Real Estate | Public API | Medium | Developer portal; commercial use; auth required |
-| Immowelt API | Real Estate | Public API | Medium | API documented; aggregator EstateSync also available |
-| planning.data.gov.uk | Regulatory | Public API | Medium | UK planning data portal; some endpoints open |
-| FEP (Spanish federation) | Market Reports | Manual | Medium | Annual statistics published as press releases |
-| Statista (padel topic page) | Market Reports | Subscription | Low | Some charts free; full data requires subscription |
-| Playskan.com | Pricing / Bookings | Scrape | Low | No public API; consumer site; ToS unclear |
-| CoStar / LoopNet | Real Estate | Subscription | Low | No public API; subscription only; scraping violates ToS |
-| Rightmove Commercial API | Real Estate | Subscription | Low | ADF partner program only; not open to arbitrary developers |
-| JLL / CBRE Reports | Real Estate | Manual | Low | Published reports only; no API |
-| Court Metrics | Pricing / Utilisation | Subscription | Low | Aggregated padel club competitive intelligence platform |
-| Shovels.ai | Regulatory | Subscription | Low | US building permit intelligence; paid |
-| Matchi | Court Locations | Scrape | Low | No documented public API; consumer app |
+**Score (1–5):** Overall ingestion priority. Weighs data value to Padelnomics (market scores, financial planner, pSEO content) against implementation effort and access barriers. 5 = core infrastructure already ingested, 1 = marginal or inaccessible.
+
+| Source | Category | Status | Score | Credentials | Pipeline refs |
+|--------|----------|--------|-------|-------------|---------------|
+| OpenStreetMap / Overpass | Court locations | ✅ Ingested | 5 | None | `extract-overpass` → `stg_padel_courts` |
+| Playtomic — tenants | Court locations | ✅ Ingested | 5 | None | `extract-playtomic-tenants` → `stg_playtomic_venues/resources/opening_hours` |
+| Playtomic — availability | Pricing / utilisation | ✅ Ingested | 5 | None | `extract-playtomic-availability` → `stg_playtomic_availability` |
+| Eurostat `urb_cpop1` | Demographics — EU city population | ✅ Ingested | 5 | None | `extract-eurostat` → `stg_population` |
+| Eurostat `ilc_di03` | Demographics — EU income | ✅ Ingested | 5 | None | `extract-eurostat` → `stg_income` |
+| Eurostat SDMX city labels | Demographics — EU city lookup | ✅ Ingested | 4 | None | `extract-eurostat-city-labels` → `stg_city_labels` |
+| ONS UK mid-year estimates | Demographics — UK population | ✅ Ingested | 4 | None | `extract-ons-uk` → `stg_population_uk` |
+| US Census ACS 5-year | Demographics — US population | ✅ Ingested† | 3 | `CENSUS_API_KEY` (free) | `extract-census-usa` → `stg_population_usa` |
+| GeoNames cities15000 | Demographics — global fallback | ✅ Ingested† | 3 | `GEONAMES_USERNAME` (free) | `extract-geonames` → `stg_population_geonames` |
+| ECB / Frankfurter.app | FX rates | 🔲 Planned | 4 | None | `extract-fx` → `stg_fx_rates` (proposed) |
+| FIP World Padel Report | Market reports | 🔲 Planned | 4 | None (PDF) | Annual seed table |
+| PadelAPI.org | Tournament data | 🔲 Planned | 3 | Free-tier token | 50k req/mo |
+| Sport England Active Lives | Demographics — UK participation | 🔲 Planned | 3 | None (CSV) | Annual download |
+| DPV Standorte | Court locations | 🔲 Planned | 2 | None (scrape) | DE federation registry |
+| LTA Padel Venue Finder | Court locations | 🔲 Planned | 2 | None (scrape) | UK venue registry |
+| USPA Club Directory | Court locations | 🔲 Planned | 2 | None (scrape) | US member clubs |
+| UK Planning Data Portal | Regulatory | 🔲 Planned | 2 | None | Planning permissions, sports use |
+| Google Maps Places API | Court locations | ⏸ On hold | 2 | Paid ($200/mo credit) | Gap-fill for US/DE; data storage license required |
+| ImmoScout24 API | Real estate — DE | ⏸ On hold | 2 | Partner account | Commercial rent benchmarks |
+| Immowelt API | Real estate — DE | ⏸ On hold | 2 | Partner account | Commercial rent |
+| Rightmove Commercial | Real estate — UK | — | 1 | ADF partner only | Not accessible without partner agreement |
+| LoopNet / CoStar | Real estate — US/UK | — | 1 | Subscription | ToS prohibits scraping |
+| JLL / CBRE reports | Real estate | — | 1 | Manual (PDF) | Annual benchmark seed table only |
+| Statista | Market reports | — | 1 | Subscription | Primary data available from FIP/Playtomic for free |
+| Playskan | Pricing | — | 1 | No public API | Aggregates Playtomic/Matchi; go direct |
+| Court Metrics | Pricing | — | 1 | Subscription | Derived from Playtomic signals |
+| World Padel Rating | Tournament data | — | 1 | Scrape | Tournament venues only; limited utility |
+| Matchi | Court locations | — | 1 | No public API | ToS prohibits scraping |
+| GovData Germany | Regulatory | — | 1 | CKAN | Only aggregate permit counts available |
+| Shovels.ai | Regulatory | — | 1 | Subscription | US only |
+| Padel Biz Magazine | Market reports | — | 1 | Manual | No structured data |
+
+† Extractor and staging model are live; placeholder file written when credentials absent. Set `CENSUS_API_KEY` / `GEONAMES_USERNAME` env vars to activate real data.
 
 ---
 
@@ -70,6 +78,14 @@ Limitations: coverage is community-driven and incomplete in newer markets (Germa
 
 OSM wiki: https://wiki.openstreetmap.org/wiki/Tag:sport=padel
 
+**Pipeline implementation:** ✅ Ingested
+- Extractor: `extract-overpass` — single global query (all nodes/ways/relations with `sport=padel`), writes raw OSM JSON
+- Landing: `data/landing/overpass/{year}/{month}/courts.json.gz`
+- Staging: `staging.stg_padel_courts`, grain `osm_id`
+- Columns: `osm_type, osm_id, lat, lon, name, country_code, city_tag, postcode, operator_name, opening_hours, fee`
+- Cadence: monthly (OSM community changes are incremental; full re-query is cheap at ~1.5 MB response)
+- No auth; query timeout set to 60 s in extractor
+
 ---
 
 ### 1.2 Playtomic API
@@ -99,6 +115,22 @@ External API docs (Notion): https://playtomicio.notion.site/Playtomic-External-A
 
 Playtomic covers 16,000+ courts globally. The platform is dominant in Spain, UK, France, Germany, and expanding in the US.
 
+**Pipeline implementation (tenants):** ✅ Ingested
+- Extractor: `extract-playtomic-tenants` — paginated global scrape of `GET /v1/tenants?sport_ids=PADEL`, page size 100, up to 500 pages
+- Landing: `data/landing/playtomic/{year}/{month}/tenants.json.gz` (~14K venues as of Feb 2026)
+- Throttle: 2 s between pages; deduplicates on `tenant_id`
+- Staging models (all grain `tenant_id` or `(tenant_id, resource_id)`):
+  - `stg_playtomic_venues` — venue metadata: name, address, city, country, coordinates, booking type, status
+  - `stg_playtomic_resources` — court resources per venue: resource type, sport, surface, indoor/outdoor
+  - `stg_playtomic_opening_hours` — operating hours per venue per day of week
+
+**Pipeline implementation (availability):** ✅ Ingested
+- Extractor: `extract-playtomic-availability` — reads tenant IDs from latest tenants file, queries `GET /v1/availability` for next-day slots per venue
+- Landing: `data/landing/playtomic/{year}/{month}/{date}/availability_morning.json.gz` + `availability_recheck.json.gz`
+- Recheck mode: re-queries slots starting within 90 min (controlled by `RECHECK_WINDOW_MINUTES`); captures near-real-time fill rates
+- Parallelism: `EXTRACT_WORKERS` env var; `PROXY_URLS` for distributed rate limiting; throttle 1 s per venue per worker
+- Staging: `stg_playtomic_availability`, grain `(snapshot_date, tenant_id, resource_id, slot_start_time, snapshot_type, captured_at_utc)`
+
 ---
 
 ### 1.3 DPV — Deutscher Padel Verband Standorte
@@ -430,6 +462,16 @@ Key datasets:
 
 The R `eurostat` package and Python `eurostat` library provide typed wrappers. Data is queryable at NUTS2/NUTS3 and city level using `geoLevel=city`.
 
+**Pipeline implementation:** ✅ Ingested
+- Extractor: `extract-eurostat` — ETag deduplication (304 Not Modified skips the write; most runs are fast no-ops)
+- Landing: `data/landing/eurostat/{year}/{month}/{dataset}.json.gz`
+- Base URL: `https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/data/{datasetCode}`
+- Datasets fetched:
+  - `urb_cpop1` (city population): filter `indic_ur=DE1001V` (Population on 1 January, total), `geoLevel=city` → staging `stg_population`, grain `(city_code, ref_year)`. City codes are Eurostat format (`DE001C`).
+  - `ilc_di03` (median income): filter `indic_il=MED_E`, `unit=PPS`, `sex=T`, `age=TOTAL` → staging `stg_income`, grain `(country_code, ref_year)`. Income in Purchasing Power Standards for cross-country comparability.
+- City-code bridge: `extract-eurostat-city-labels` / `stg_city_labels` maps `DE001C → Berlin`. See §9.1 for the live implementation details (compact JSON response, not SDMX 2.1 spec).
+- Used in: `foundation.dim_cities` (Eurostat population + income joined via city labels → market score)
+
 ---
 
 ### 5.2 US Census Bureau API
@@ -447,6 +489,8 @@ The American Community Survey (ACS) API provides city and tract-level demographi
 
 Relevant for US market expansion analysis.
 
+**Pipeline implementation:** ✅ Ingested† — see §9.4 for full implementation details (endpoint, response format, place name parsing). Staging: `stg_population_usa`, grain `(place_fips, ref_year)`. Requires `CENSUS_API_KEY` env var; writes empty placeholder when absent.
+
 ---
 
 ### 5.3 ONS Beta API (UK)
@@ -462,6 +506,8 @@ Relevant for US market expansion analysis.
 
 The ONS Beta API at `https://api.beta.ons.gov.uk/v1` is open and unauthenticated. Rate limit: 120 requests/10 s, 200/min. Datasets include population estimates, deprivation indices, and 2021 census variables at MSOA/LAD level. Sports participation specifically comes from Sport England (see 5.4), not ONS directly.
 
+**Pipeline implementation:** ✅ Ingested — see §9.3 for full details (CSV download path, LAD code filtering, observations endpoint 404 bug). Staging: `stg_population_uk`, grain `(lad_code, ref_year)`. No credentials required.
+
 ---
 
 ### 5.4 Sport England — Active Lives Survey
@@ -557,17 +603,23 @@ Token-based REST API. Free tier includes 50k requests/month and last 6 months of
 
 ### Recommended ingestion patterns
 
-| Source | Ingestion Pattern |
-|--------|------------------|
-| Eurostat API | `httpfs` + JSON → staging table; run weekly |
-| Overpass API / OSM | Bulk `.osm.pbf` download via Geofabrik → DuckDB spatial extension; run monthly |
-| Playtomic unauthenticated API | Paginated scraper per city bounding box → Parquet; run nightly |
-| FIP / Playtomic PDFs | Manual parse → CSV seed files; run annually |
-| US Census ACS | `httpfs` REST → staging; run annually |
-| ONS Beta API | `httpfs` REST → staging; run annually |
-| Sport England CSV | Manual download → seed file; run annually |
-| ImmoScout24 / Immowelt | API → staging (requires partner account); run monthly |
-| planning.data.gov.uk | REST API → staging; run weekly for new permissions |
+| Source | Ingestion Pattern | Extractor |
+|--------|------------------|-----------|
+| Overpass / OSM | Single global query → JSON.gz; run monthly | `extract-overpass` ✅ |
+| Playtomic tenants | Paginated global scrape → JSON.gz; run monthly | `extract-playtomic-tenants` ✅ |
+| Playtomic availability | Per-venue slot query → JSON.gz; run daily | `extract-playtomic-availability` ✅ |
+| Eurostat `urb_cpop1` + `ilc_di03` | SDMX REST + ETag dedup → JSON.gz; run monthly | `extract-eurostat` ✅ |
+| Eurostat SDMX city labels | Codelist fetch + ETag dedup → JSON.gz; run monthly | `extract-eurostat-city-labels` ✅ |
+| ONS UK mid-year estimates | CSV download (~68 MB) → JSON.gz; run annually | `extract-ons-uk` ✅ |
+| US Census ACS 5-year | REST → JSON.gz; run annually | `extract-census-usa` ✅† |
+| GeoNames cities15000 | Bulk zip download → JSON.gz; run monthly | `extract-geonames` ✅† |
+| ECB / Frankfurter.app FX | REST → JSON.gz; run daily or monthly | `extract-fx` 🔲 planned |
+| FIP / Playtomic PDFs | Manual parse → CSV seed files; run annually | — |
+| Sport England CSV | Manual download → seed file; run annually | — |
+| ImmoScout24 / Immowelt | API → staging (requires partner account); run monthly | — |
+| planning.data.gov.uk | REST API → staging; run weekly for new permissions | — |
+
+† Placeholder file written when credentials absent; set `CENSUS_API_KEY` / `GEONAMES_USERNAME` to activate.
 
 ### Key technical constraints
 
@@ -721,6 +773,81 @@ Population cascade in `dim_cities`: Eurostat → US Census → ONS → GeoNames
 
 ---
 
+## 10. FX / Currency Rates
+
+Needed for two purposes:
+1. **Cross-market normalisation** — Playtomic venue prices are in local currency (GBP for UK, USD for US, EUR for eurozone). Benchmarking court rates across countries requires a common base.
+2. **Financial planner display** — the planner currently shows symbols (€/£/$) per country but applies no conversion. FX rates would let users toggle a "view in EUR" mode, or auto-convert EUR benchmark figures to the investor's local currency.
+
+---
+
+### 10.1 European Central Bank (ECB) Data Portal
+
+| Field | Value |
+|-------|-------|
+| URL | https://data-api.ecb.europa.eu/service/data/EXR |
+| Data Type | Daily exchange rates, EUR as base currency |
+| Access Method | Public SDMX REST API |
+| Credentials | None |
+| Update Frequency | Daily (business days) |
+| License | Public domain |
+| Score | **4** |
+| Status | 🔲 Planned |
+
+ECB publishes official daily reference rates for ~30 currencies against EUR via SDMX. Free, unauthenticated, stable.
+
+```
+GET https://data-api.ecb.europa.eu/service/data/EXR/D.USD+GBP+CHF+SEK+AED.EUR.SP00.A
+  ?format=jsondata&lastNObservations=1
+```
+
+Returns the most recent observation per currency pair. The SDMX JSON response is nested; rates live at `dataSets[0].series["{key}"].observations["0"][0]` where `{key}` is the colon-joined index of each series dimension. Currency is the second dimension, so only that position varies (0:0:0:0:0, 0:1:0:0:0, …).
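
The lookup described in §10.1 could be sketched in Python as below. This is an illustration, not the planned `extract-fx` code: the function name `latest_rates` is hypothetical, the `SAMPLE` payload is hand-written with made-up rates, and the sketch assumes the response lists its series dimensions under `structure.dimensions.series` with a dimension whose `id` is `CURRENCY`. Resolving the currency position from that list avoids hard-coding series key strings.

```python
def latest_rates(payload: dict) -> dict[str, float]:
    """Map quote currency -> most recent EUR reference rate (assumed SDMX-JSON shape)."""
    series_dims = payload["structure"]["dimensions"]["series"]
    # Position of the CURRENCY dimension inside each colon-separated series key.
    cur_pos = next(i for i, dim in enumerate(series_dims) if dim["id"] == "CURRENCY")
    codes = [value["id"] for value in series_dims[cur_pos]["values"]]

    rates: dict[str, float] = {}
    for key, series in payload["dataSets"][0]["series"].items():
        code = codes[int(key.split(":")[cur_pos])]
        observations = series["observations"]
        # Observation keys are stringified indices; take the highest one.
        last = max(observations, key=int)
        rates[code] = float(observations[last][0])
    return rates


# Hand-written sample in the assumed shape; the numbers are illustrative only.
SAMPLE = {
    "dataSets": [{"series": {
        "0:0:0:0:0": {"observations": {"0": [1.05]}},
        "0:1:0:0:0": {"observations": {"0": [0.84]}},
    }}],
    "structure": {"dimensions": {"series": [
        {"id": "FREQ", "values": [{"id": "D"}]},
        {"id": "CURRENCY", "values": [{"id": "USD"}, {"id": "GBP"}]},
        {"id": "CURRENCY_DENOM", "values": [{"id": "EUR"}]},
        {"id": "EXR_TYPE", "values": [{"id": "SP00"}]},
        {"id": "EXR_SUFFIX", "values": [{"id": "A"}]},
    ]}},
}

rates = latest_rates(SAMPLE)
print(rates)
```

The resulting mapping has the same shape as the `rates` object in the landing format proposed in §10.2, so either source could feed one staging model.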
+
+Key series for Padelnomics:
+- `D.USD.EUR.SP00.A` — EUR/USD
+- `D.GBP.EUR.SP00.A` — EUR/GBP
+- `D.CHF.EUR.SP00.A` — EUR/CHF (Switzerland)
+- `D.SEK.EUR.SP00.A` — EUR/SEK (Sweden)
+- `D.AED.EUR.SP00.A` — EUR/AED (UAE)
+
+**Note:** ECB only provides EUR-base rates. Cross rates (e.g. USD/GBP) require computation: `rate = eur_gbp / eur_usd`.
+
+---
+
+### 10.2 Frankfurter.app
+
+| Field | Value |
+|-------|-------|
+| URL | https://api.frankfurter.app |
+| Data Type | Daily exchange rates (ECB data re-served) |
+| Access Method | Public REST API |
+| Credentials | None |
+| Update Frequency | Daily |
+| License | MIT (open source) |
+| Score | **4** |
+| Status | 🔲 Planned |
+
+Frankfurter is an open-source wrapper around ECB data with a simpler interface than the raw SDMX endpoint. No auth, no documented rate limit. Preferred for implementation simplicity; self-host the open-source version if uptime SLA becomes a concern.
+
+```
+GET https://api.frankfurter.app/latest?from=EUR&to=USD,GBP,CHF,SEK,AED
+```
+
+Response:
+```json
+{"amount": 1.0, "base": "EUR", "date": "2026-02-24",
+ "rates": {"USD": 1.0531, "GBP": 0.8412, "CHF": 0.9374, "SEK": 10.932, "AED": 3.8669}}
+```
+
+**Proposed pipeline:**
+- Landing: `data/landing/fx/{year}/{month}/{date}/rates.json.gz`
+- Format: `{"date": "2026-02-24", "base": "EUR", "rates": {"USD": 1.05, "GBP": 0.84, ...}}`
+- Cadence: daily (or monthly — rates change daily but the pipeline only needs monthly snapshots for historical benchmarking)
+- Staging: `staging.stg_fx_rates`, grain `(date, quote_currency)` — columns: `date, base_currency ('EUR'), quote_currency, rate`
+- Downstream: join to `stg_playtomic_availability` price column to normalize to EUR; expose latest rate to planner for display conversion
+
+---
+
 ## Sources
 
 - [Reverse Engineering Playtomic](https://mattrighetti.com/2025/03/03/reverse-engineering-playtomic)