feat: migrate transform to 3-layer architecture with per-layer schemas
Remove raw/ layer — staging models now read landing JSON directly. Rename all model schemas from padelnomics.* to staging.*/foundation.*/serving.*. Web app queries updated to serving.planner_defaults via SERVING_DUCKDB_PATH. Supervisor gets daily sleep interval between pipeline runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -1,6 +1,6 @@
|
||||
# Padelnomics Transform (SQLMesh)
|
||||
|
||||
4-layer SQL transformation pipeline using SQLMesh + DuckDB. Reads from the landing zone, produces analytics-ready tables consumed by the web app.
|
||||
3-layer SQL transformation pipeline using SQLMesh + DuckDB. Reads from the landing zone, produces analytics-ready tables consumed by the web app via an atomically-swapped serving DB.
|
||||
|
||||
## Running
|
||||
|
||||
@@ -16,42 +16,41 @@ uv run sqlmesh -p transform/sqlmesh_padelnomics test
|
||||
|
||||
# Format SQL
|
||||
uv run sqlmesh -p transform/sqlmesh_padelnomics format
|
||||
|
||||
# Export serving tables to analytics.duckdb (run after SQLMesh)
|
||||
DUCKDB_PATH=data/lakehouse.duckdb SERVING_DUCKDB_PATH=data/analytics.duckdb \
|
||||
uv run python -m padelnomics.export_serving
|
||||
```
|
||||
|
||||
## 4-layer architecture
|
||||
## 3-layer architecture
|
||||
|
||||
```
|
||||
landing/ ← raw files (extraction output)
|
||||
└── padelnomics/
|
||||
└── {year}/{etag}.csv.gz
|
||||
├── overpass/*/*/courts.json.gz
|
||||
├── eurostat/*/*/urb_cpop1.json.gz
|
||||
└── playtomic/*/*/tenants.json.gz
|
||||
|
||||
raw/ ← reads files verbatim
|
||||
└── raw.padelnomics
|
||||
|
||||
staging/ ← type casting, deduplication
|
||||
└── staging.stg_padelnomics
|
||||
staging/ ← reads landing files directly, type casting, dedup
|
||||
├── staging.stg_padel_courts
|
||||
├── staging.stg_playtomic_venues
|
||||
└── staging.stg_population
|
||||
|
||||
foundation/ ← business logic, dimensions, facts
|
||||
└── foundation.dim_category
|
||||
├── foundation.dim_venues
|
||||
└── foundation.dim_cities
|
||||
|
||||
serving/ ← pre-aggregated for web app
|
||||
└── serving.padelnomics_metrics
|
||||
├── serving.city_market_profile
|
||||
└── serving.planner_defaults
|
||||
```
|
||||
|
||||
### raw/ — verbatim source reads
|
||||
### staging/ — read landing files + type casting
|
||||
|
||||
- Reads landing zone files directly with `read_csv(..., all_varchar=true)`
|
||||
- No transformations, no business logic
|
||||
- Column names match the source exactly
|
||||
- Uses a macro (`@padelnomics_glob()`) so new landing files are picked up automatically
|
||||
- Naming: `raw.<source>`
|
||||
|
||||
### staging/ — type casting and cleansing
|
||||
|
||||
- One model per raw model (1:1)
|
||||
- Cast all columns to correct types: `TRY_CAST(report_date AS DATE)`
|
||||
- Deduplicate if source produces duplicates
|
||||
- Minimal renaming — only where raw names are genuinely unclear
|
||||
- Reads landing zone JSON files directly with `read_json(..., format='auto', filename=true)`
|
||||
- Uses `@LANDING_DIR` variable for file path discovery
|
||||
- Casts all columns to correct types: `TRY_CAST(... AS DOUBLE)`
|
||||
- Deduplicates where source produces duplicates (ROW_NUMBER partitioned on ID)
|
||||
- Validates coordinates, nulls, and data quality inline
|
||||
- Naming: `staging.stg_<source>`
|
||||
|
||||
### foundation/ — business logic
|
||||
@@ -59,49 +58,54 @@ serving/ ← pre-aggregated for web app
|
||||
- Dimensions (`dim_*`): slowly changing attributes, one row per entity
|
||||
- Facts (`fact_*`): events and measurements, one row per event
|
||||
- May join across multiple staging models from different sources
|
||||
- Surrogate keys: `MD5(business_key)` for stable joins
|
||||
- Naming: `foundation.dim_<entity>`, `foundation.fact_<event>`
|
||||
|
||||
### serving/ — analytics-ready aggregates
|
||||
|
||||
- Pre-aggregated for specific web app query patterns
|
||||
- These are the only tables the web app reads
|
||||
- These are the only tables the web app reads (via `analytics.duckdb`)
|
||||
- Queried from `analytics.py` via `fetch_analytics()`
|
||||
- Named to match what the frontend expects
|
||||
- Naming: `serving.<purpose>`
|
||||
|
||||
## Two-DuckDB architecture
|
||||
|
||||
```
|
||||
data/lakehouse.duckdb ← SQLMesh exclusive write (DUCKDB_PATH)
|
||||
├── staging.*
|
||||
├── foundation.*
|
||||
└── serving.*
|
||||
|
||||
data/analytics.duckdb ← web app read-only (SERVING_DUCKDB_PATH)
|
||||
└── serving.* ← atomically replaced by export_serving.py
|
||||
```
|
||||
|
||||
SQLMesh holds an exclusive write lock on `lakehouse.duckdb` during plan/run.
|
||||
The web app needs read-only access at all times. `export_serving.py` copies
|
||||
`serving.*` tables to a temp file, then atomically renames it to `analytics.duckdb`.
|
||||
The web app detects the inode change on next query — no restart needed.
|
||||
|
||||
**Never point DUCKDB_PATH and SERVING_DUCKDB_PATH to the same file.**
|
||||
|
||||
## Adding a new data source
|
||||
|
||||
1. Add a landing zone directory in the extraction package
|
||||
2. Add a glob macro in `macros/__init__.py`:
|
||||
```python
|
||||
@macro()
|
||||
def my_source_glob(evaluator) -> str:
|
||||
landing_dir = evaluator.var("LANDING_DIR") or os.environ.get("LANDING_DIR", "data/landing")
|
||||
return f"'{landing_dir}/my_source/**/*.csv.gz'"
|
||||
```
|
||||
3. Add a raw model: `models/raw/raw_my_source.sql`
|
||||
4. Add a staging model: `models/staging/stg_my_source.sql`
|
||||
5. Join into foundation or serving models as needed
|
||||
1. Add an extractor in `extract/padelnomics_extract/` (see extraction README)
|
||||
2. Add a staging model: `models/staging/stg_<source>.sql` that reads landing files directly
|
||||
3. Join into foundation or serving models as needed
|
||||
|
||||
## Model materialization
|
||||
|
||||
| Layer | Default kind | Rationale |
|
||||
|-------|-------------|-----------|
|
||||
| raw | FULL | Always re-reads all files; cheap with DuckDB parallel scan |
|
||||
| staging | FULL | 1:1 with raw; same cost |
|
||||
| staging | FULL | Re-reads all landing files; cheap with DuckDB parallel scan |
|
||||
| foundation | FULL | Business logic rarely changes; recompute is fast |
|
||||
| serving | FULL | Small aggregates; web app needs latest at all times |
|
||||
|
||||
For large historical tables, switch to `kind INCREMENTAL_BY_TIME_RANGE` with a time partition column. SQLMesh handles the incremental logic automatically.
|
||||
For large historical tables, switch to `kind INCREMENTAL_BY_TIME_RANGE` with a time partition column.
|
||||
|
||||
## Environment variables
|
||||
|
||||
| Variable | Default | Description |
|
||||
|----------|---------|-------------|
|
||||
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
|
||||
| `DUCKDB_PATH` | `local.duckdb` | DuckDB file (SQLMesh exclusive write access) |
|
||||
|
||||
The web app reads from a **separate** `analytics.duckdb` file via `export_serving.py`.
|
||||
Never point `DUCKDB_PATH` and `SERVING_DUCKDB_PATH` to the same file —
|
||||
SQLMesh holds an exclusive write lock during plan/run.
|
||||
| `DUCKDB_PATH` | `data/lakehouse.duckdb` | DuckDB file (SQLMesh exclusive write access) |
|
||||
| `SERVING_DUCKDB_PATH` | `data/analytics.duckdb` | Serving DB (web app reads from here) |
|
||||
|
||||
@@ -3,7 +3,7 @@
|
||||
-- Cities without Eurostat coverage (US, non-EU) are derived from venue clusters.
|
||||
|
||||
MODEL (
|
||||
name padelnomics.dim_cities,
|
||||
name foundation.dim_cities,
|
||||
kind FULL,
|
||||
cron '@daily',
|
||||
grain city_code
|
||||
@@ -16,7 +16,7 @@ eurostat_cities AS (
|
||||
country_code,
|
||||
population,
|
||||
ref_year
|
||||
FROM padelnomics.stg_population
|
||||
FROM staging.stg_population
|
||||
QUALIFY ROW_NUMBER() OVER (PARTITION BY city_code ORDER BY ref_year DESC) = 1
|
||||
),
|
||||
-- Venue counts per (country_code, city) from dim_venues
|
||||
@@ -27,7 +27,7 @@ venue_counts AS (
|
||||
COUNT(*) AS venue_count,
|
||||
AVG(lat) AS centroid_lat,
|
||||
AVG(lon) AS centroid_lon
|
||||
FROM padelnomics.dim_venues
|
||||
FROM foundation.dim_venues
|
||||
WHERE city IS NOT NULL AND city != ''
|
||||
GROUP BY country_code, city
|
||||
),
|
||||
|
||||
@@ -4,7 +4,7 @@
|
||||
-- Proximity dedup uses haversine approximation: 1 degree lat ≈ 111 km.
|
||||
|
||||
MODEL (
|
||||
name padelnomics.dim_venues,
|
||||
name foundation.dim_venues,
|
||||
kind FULL,
|
||||
cron '@daily',
|
||||
grain venue_id
|
||||
@@ -22,7 +22,7 @@ WITH all_venues AS (
|
||||
postcode,
|
||||
NULL AS tenant_type,
|
||||
extracted_date
|
||||
FROM padelnomics.stg_padel_courts
|
||||
FROM staging.stg_padel_courts
|
||||
WHERE country_code IS NOT NULL
|
||||
|
||||
UNION ALL
|
||||
@@ -38,7 +38,7 @@ WITH all_venues AS (
|
||||
postcode,
|
||||
tenant_type,
|
||||
extracted_date
|
||||
FROM padelnomics.stg_playtomic_venues
|
||||
FROM staging.stg_playtomic_venues
|
||||
WHERE country_code IS NOT NULL
|
||||
),
|
||||
-- Rank venues so Playtomic records win ties in proximity dedup
|
||||
|
||||
@@ -1,6 +0,0 @@
|
||||
# raw
|
||||
|
||||
Read raw landing zone files directly with `read_csv_auto()`.
|
||||
No transformations — schema as-is from source.
|
||||
|
||||
Naming convention: `raw.<source>_<dataset>`
|
||||
@@ -1,64 +0,0 @@
|
||||
-- Raw Eurostat Urban Audit city population (dataset: urb_cpop1).
|
||||
-- Source: data/landing/eurostat/{year}/{month}/urb_cpop1.json.gz
|
||||
-- Format: Eurostat JSON Statistics API (dimensions + flat value array).
|
||||
--
|
||||
-- The Eurostat JSON format encodes dimensions separately from values:
|
||||
-- dimension.cities.category.index → maps city code to flat array position
|
||||
-- dimension.time.category.index → maps year to flat array position
|
||||
-- values → flat object {position_str: value}
|
||||
--
|
||||
-- This model stores one row per (city_code, year) by computing positions.
|
||||
-- Reference: https://wikis.ec.europa.eu/display/EUROSTATHELP/API+Statistics
|
||||
|
||||
MODEL (
|
||||
name padelnomics.raw_eurostat_population,
|
||||
kind FULL,
|
||||
cron '@daily',
|
||||
grain (city_code, ref_year)
|
||||
);
|
||||
|
||||
WITH raw AS (
|
||||
SELECT
|
||||
raw_json,
|
||||
filename
|
||||
FROM read_json(
|
||||
@LANDING_DIR || '/eurostat/*/*/urb_cpop1.json.gz',
|
||||
format = 'auto',
|
||||
filename = true,
|
||||
columns = { 'raw_json': 'JSON' }
|
||||
)
|
||||
),
|
||||
-- Unnest city codes with their ordinal positions
|
||||
cities AS (
|
||||
SELECT
|
||||
city_code,
|
||||
(city_pos)::INTEGER AS city_pos,
|
||||
filename,
|
||||
raw_json,
|
||||
(json_extract(raw_json, '$.size[1]'))::INTEGER AS n_times
|
||||
FROM raw,
|
||||
LATERAL (
|
||||
SELECT key AS city_code, value::INTEGER AS city_pos
|
||||
FROM json_each(json_extract(raw_json, '$.dimension.cities.category.index'))
|
||||
)
|
||||
),
|
||||
-- Unnest time (year) values with positions
|
||||
times AS (
|
||||
SELECT key AS ref_year, value::INTEGER AS time_pos
|
||||
FROM (SELECT raw_json FROM raw LIMIT 1),
|
||||
LATERAL (
|
||||
SELECT key, value
|
||||
FROM json_each(json_extract(raw_json, '$.dimension.time.category.index'))
|
||||
)
|
||||
)
|
||||
SELECT
|
||||
c.city_code,
|
||||
t.ref_year,
|
||||
TRY_CAST(
|
||||
json_extract(c.raw_json, '$.' || (c.city_pos * c.n_times + t.time_pos)::TEXT)
|
||||
AS DOUBLE
|
||||
) AS population,
|
||||
c.filename AS source_file,
|
||||
CURRENT_DATE AS extracted_date
|
||||
FROM cities c
|
||||
CROSS JOIN times t
|
||||
@@ -1,42 +0,0 @@
|
||||
-- Raw OpenStreetMap padel courts from Overpass API landing files.
|
||||
-- Source: data/landing/overpass/{year}/{month}/courts.json.gz
|
||||
-- Format: {"version": ..., "elements": [{type, id, lat, lon, tags}, ...]}
|
||||
--
|
||||
-- Only node elements carry direct lat/lon. Way and relation elements need
|
||||
-- centroid calculation from member nodes (not done here — filter to node only
|
||||
-- for the initial raw layer; ways/relations retained as-is for future enrichment).
|
||||
|
||||
MODEL (
|
||||
name padelnomics.raw_overpass_courts,
|
||||
kind FULL,
|
||||
cron '@daily',
|
||||
grain (osm_type, osm_id)
|
||||
);
|
||||
|
||||
SELECT
|
||||
elem ->> 'type' AS osm_type,
|
||||
(elem ->> 'id')::BIGINT AS osm_id,
|
||||
TRY_CAST(elem ->> 'lat' AS DOUBLE) AS lat,
|
||||
TRY_CAST(elem ->> 'lon' AS DOUBLE) AS lon,
|
||||
elem -> 'tags' ->> 'name' AS name,
|
||||
elem -> 'tags' ->> 'sport' AS sport,
|
||||
elem -> 'tags' ->> 'leisure' AS leisure,
|
||||
elem -> 'tags' ->> 'addr:country' AS country_code,
|
||||
elem -> 'tags' ->> 'addr:city' AS city_tag,
|
||||
elem -> 'tags' ->> 'addr:postcode' AS postcode,
|
||||
elem -> 'tags' ->> 'operator' AS operator_name,
|
||||
elem -> 'tags' ->> 'opening_hours' AS opening_hours,
|
||||
elem -> 'tags' ->> 'fee' AS fee,
|
||||
filename AS source_file,
|
||||
CURRENT_DATE AS extracted_date
|
||||
FROM (
|
||||
SELECT
|
||||
UNNEST(elements) AS elem,
|
||||
filename
|
||||
FROM read_json(
|
||||
@LANDING_DIR || '/overpass/*/*/courts.json.gz',
|
||||
format = 'auto',
|
||||
filename = true
|
||||
)
|
||||
)
|
||||
WHERE (elem ->> 'type') IS NOT NULL
|
||||
@@ -1,35 +0,0 @@
|
||||
-- Raw Playtomic venue (tenant) listings from unauthenticated tenant search API.
|
||||
-- Source: data/landing/playtomic/{year}/{month}/tenants.json.gz
|
||||
-- Format: {"tenants": [{tenant_id, name, address, sport_ids, ...}], "count": N}
|
||||
|
||||
MODEL (
|
||||
name padelnomics.raw_playtomic_tenants,
|
||||
kind FULL,
|
||||
cron '@daily',
|
||||
grain tenant_id
|
||||
);
|
||||
|
||||
SELECT
|
||||
tenant ->> 'tenant_id' AS tenant_id,
|
||||
tenant ->> 'tenant_name' AS tenant_name,
|
||||
tenant -> 'address' ->> 'street' AS street,
|
||||
tenant -> 'address' ->> 'city' AS city,
|
||||
tenant -> 'address' ->> 'postal_code' AS postal_code,
|
||||
tenant -> 'address' ->> 'country_code' AS country_code,
|
||||
TRY_CAST(tenant -> 'address' ->> 'coordinate_lat' AS DOUBLE) AS lat,
|
||||
TRY_CAST(tenant -> 'address' ->> 'coordinate_lon' AS DOUBLE) AS lon,
|
||||
tenant ->> 'sport_ids' AS sport_ids_raw,
|
||||
tenant ->> 'tenant_type' AS tenant_type,
|
||||
filename AS source_file,
|
||||
CURRENT_DATE AS extracted_date
|
||||
FROM (
|
||||
SELECT
|
||||
UNNEST(tenants) AS tenant,
|
||||
filename
|
||||
FROM read_json(
|
||||
@LANDING_DIR || '/playtomic/*/*/tenants.json.gz',
|
||||
format = 'auto',
|
||||
filename = true
|
||||
)
|
||||
)
|
||||
WHERE (tenant ->> 'tenant_id') IS NOT NULL
|
||||
@@ -7,7 +7,7 @@
|
||||
-- 20% data confidence (completeness of both population + venue data)
|
||||
|
||||
MODEL (
|
||||
name padelnomics.city_market_profile,
|
||||
name serving.city_market_profile,
|
||||
kind FULL,
|
||||
cron '@daily',
|
||||
grain city_slug
|
||||
@@ -35,7 +35,7 @@ WITH base AS (
|
||||
WHEN c.population > 0 OR c.padel_venue_count > 0 THEN 0.5
|
||||
ELSE 0.0
|
||||
END AS data_confidence
|
||||
FROM padelnomics.dim_cities c
|
||||
FROM foundation.dim_cities c
|
||||
WHERE c.padel_venue_count > 0
|
||||
),
|
||||
scored AS (
|
||||
|
||||
@@ -8,7 +8,7 @@
|
||||
-- Units are explicit in column names (EUR, %, h). All monetary values in EUR.
|
||||
|
||||
MODEL (
|
||||
name padelnomics.planner_defaults,
|
||||
name serving.planner_defaults,
|
||||
kind FULL,
|
||||
cron '@daily',
|
||||
grain city_slug
|
||||
@@ -43,7 +43,7 @@ city_venue_density AS (
|
||||
population,
|
||||
venues_per_100k,
|
||||
market_score
|
||||
FROM padelnomics.city_market_profile
|
||||
FROM serving.city_market_profile
|
||||
)
|
||||
SELECT
|
||||
cvd.city_slug,
|
||||
|
||||
@@ -1,30 +1,53 @@
|
||||
-- Cleaned OSM padel courts — node elements only (direct lat/lon available).
|
||||
-- Deduplicates on osm_id, keeps most recently extracted record.
|
||||
-- Country code resolved from addr:country tag or approximated by lat/lon bbox.
|
||||
-- Padel court locations from OpenStreetMap via Overpass API.
|
||||
-- Reads landing zone JSON directly, unnests elements, filters to nodes with
|
||||
-- valid coordinates, deduplicates on osm_id, and approximates country from bbox.
|
||||
--
|
||||
-- Source: data/landing/overpass/{year}/{month}/courts.json.gz
|
||||
|
||||
MODEL (
|
||||
name padelnomics.stg_padel_courts,
|
||||
name staging.stg_padel_courts,
|
||||
kind FULL,
|
||||
cron '@daily',
|
||||
grain osm_id
|
||||
);
|
||||
|
||||
WITH deduped AS (
|
||||
WITH parsed AS (
|
||||
SELECT
|
||||
elem ->> 'type' AS osm_type,
|
||||
(elem ->> 'id')::BIGINT AS osm_id,
|
||||
TRY_CAST(elem ->> 'lat' AS DOUBLE) AS lat,
|
||||
TRY_CAST(elem ->> 'lon' AS DOUBLE) AS lon,
|
||||
elem -> 'tags' ->> 'name' AS name,
|
||||
elem -> 'tags' ->> 'addr:country' AS country_code,
|
||||
elem -> 'tags' ->> 'addr:city' AS city_tag,
|
||||
elem -> 'tags' ->> 'addr:postcode' AS postcode,
|
||||
elem -> 'tags' ->> 'operator' AS operator_name,
|
||||
elem -> 'tags' ->> 'opening_hours' AS opening_hours,
|
||||
elem -> 'tags' ->> 'fee' AS fee,
|
||||
filename AS source_file,
|
||||
CURRENT_DATE AS extracted_date
|
||||
FROM (
|
||||
SELECT UNNEST(elements) AS elem, filename
|
||||
FROM read_json(
|
||||
@LANDING_DIR || '/overpass/*/*/courts.json.gz',
|
||||
format = 'auto',
|
||||
filename = true
|
||||
)
|
||||
)
|
||||
WHERE (elem ->> 'type') IS NOT NULL
|
||||
),
|
||||
deduped AS (
|
||||
SELECT *,
|
||||
ROW_NUMBER() OVER (PARTITION BY osm_id ORDER BY extracted_date DESC) AS rn
|
||||
FROM padelnomics.raw_overpass_courts
|
||||
FROM parsed
|
||||
WHERE osm_type = 'node'
|
||||
AND lat IS NOT NULL
|
||||
AND lon IS NOT NULL
|
||||
AND lat IS NOT NULL AND lon IS NOT NULL
|
||||
AND lat BETWEEN -90 AND 90
|
||||
AND lon BETWEEN -180 AND 180
|
||||
),
|
||||
-- Approximate country from lat/lon when addr:country tag is absent
|
||||
with_country AS (
|
||||
SELECT
|
||||
osm_id,
|
||||
lat,
|
||||
lon,
|
||||
osm_id, lat, lon,
|
||||
COALESCE(NULLIF(TRIM(UPPER(country_code)), ''), CASE
|
||||
WHEN lat BETWEEN 47.27 AND 55.06 AND lon BETWEEN 5.87 AND 15.04 THEN 'DE'
|
||||
WHEN lat BETWEEN 35.95 AND 43.79 AND lon BETWEEN -9.39 AND 4.33 THEN 'ES'
|
||||
@@ -37,26 +60,15 @@ with_country AS (
|
||||
ELSE NULL
|
||||
END) AS country_code,
|
||||
NULLIF(TRIM(name), '') AS name,
|
||||
NULLIF(TRIM(city_tag), '') AS city_tag,
|
||||
postcode,
|
||||
operator_name,
|
||||
opening_hours,
|
||||
fee,
|
||||
extracted_date
|
||||
NULLIF(TRIM(city_tag), '') AS city,
|
||||
postcode, operator_name, opening_hours, fee, extracted_date
|
||||
FROM deduped
|
||||
WHERE rn = 1
|
||||
)
|
||||
SELECT
|
||||
osm_id,
|
||||
'osm' AS source,
|
||||
lat,
|
||||
lon,
|
||||
country_code,
|
||||
name,
|
||||
city_tag AS city,
|
||||
postcode,
|
||||
operator_name,
|
||||
opening_hours,
|
||||
lat, lon, country_code, name, city, postcode, operator_name, opening_hours,
|
||||
CASE LOWER(fee) WHEN 'yes' THEN TRUE WHEN 'no' THEN FALSE ELSE NULL END AS is_paid,
|
||||
extracted_date
|
||||
FROM with_country
|
||||
|
||||
@@ -1,27 +1,53 @@
|
||||
-- Cleaned Playtomic padel venue records. One row per venue, deduped on tenant_id.
|
||||
-- Playtomic padel venue records from unauthenticated tenant search API.
|
||||
-- Reads landing zone JSON directly, unnests tenant array, deduplicates on
|
||||
-- tenant_id (keeps most recent), and normalizes address fields.
|
||||
--
|
||||
-- Source: data/landing/playtomic/{year}/{month}/tenants.json.gz
|
||||
|
||||
MODEL (
|
||||
name padelnomics.stg_playtomic_venues,
|
||||
name staging.stg_playtomic_venues,
|
||||
kind FULL,
|
||||
cron '@daily',
|
||||
grain tenant_id
|
||||
);
|
||||
|
||||
WITH deduped AS (
|
||||
WITH parsed AS (
|
||||
SELECT
|
||||
tenant ->> 'tenant_id' AS tenant_id,
|
||||
tenant ->> 'tenant_name' AS tenant_name,
|
||||
tenant -> 'address' ->> 'street' AS street,
|
||||
tenant -> 'address' ->> 'city' AS city,
|
||||
tenant -> 'address' ->> 'postal_code' AS postal_code,
|
||||
tenant -> 'address' ->> 'country_code' AS country_code,
|
||||
TRY_CAST(tenant -> 'address' ->> 'coordinate_lat' AS DOUBLE) AS lat,
|
||||
TRY_CAST(tenant -> 'address' ->> 'coordinate_lon' AS DOUBLE) AS lon,
|
||||
tenant ->> 'sport_ids' AS sport_ids_raw,
|
||||
tenant ->> 'tenant_type' AS tenant_type,
|
||||
filename AS source_file,
|
||||
CURRENT_DATE AS extracted_date
|
||||
FROM (
|
||||
SELECT UNNEST(tenants) AS tenant, filename
|
||||
FROM read_json(
|
||||
@LANDING_DIR || '/playtomic/*/*/tenants.json.gz',
|
||||
format = 'auto',
|
||||
filename = true
|
||||
)
|
||||
)
|
||||
WHERE (tenant ->> 'tenant_id') IS NOT NULL
|
||||
),
|
||||
deduped AS (
|
||||
SELECT *,
|
||||
ROW_NUMBER() OVER (PARTITION BY tenant_id ORDER BY extracted_date DESC) AS rn
|
||||
FROM padelnomics.raw_playtomic_tenants
|
||||
FROM parsed
|
||||
WHERE tenant_id IS NOT NULL
|
||||
AND lat IS NOT NULL
|
||||
AND lon IS NOT NULL
|
||||
AND lat IS NOT NULL AND lon IS NOT NULL
|
||||
AND lat BETWEEN -90 AND 90
|
||||
AND lon BETWEEN -180 AND 180
|
||||
)
|
||||
SELECT
|
||||
tenant_id,
|
||||
'playtomic' AS source,
|
||||
lat,
|
||||
lon,
|
||||
lat, lon,
|
||||
UPPER(country_code) AS country_code,
|
||||
NULLIF(TRIM(tenant_name), '') AS name,
|
||||
NULLIF(TRIM(city), '') AS city,
|
||||
|
||||
@@ -1,21 +1,65 @@
|
||||
-- Eurostat Urban Audit city population, cleaned and typed.
|
||||
-- Eurostat city codes follow the NUTS Urban Audit convention (e.g. DE001C).
|
||||
-- Country code is the first two characters of the city code.
|
||||
-- Eurostat Urban Audit city population (dataset: urb_cpop1).
|
||||
-- Reads landing zone JSON directly and parses the Eurostat multidimensional format.
|
||||
-- One row per (city_code, year) with validated population values.
|
||||
--
|
||||
-- Source: data/landing/eurostat/{year}/{month}/urb_cpop1.json.gz
|
||||
|
||||
MODEL (
|
||||
name padelnomics.stg_population,
|
||||
name staging.stg_population,
|
||||
kind FULL,
|
||||
cron '@daily',
|
||||
grain (city_code, ref_year)
|
||||
);
|
||||
|
||||
WITH raw AS (
|
||||
SELECT raw_json, filename
|
||||
FROM read_json(
|
||||
@LANDING_DIR || '/eurostat/*/*/urb_cpop1.json.gz',
|
||||
format = 'auto',
|
||||
filename = true,
|
||||
columns = { 'raw_json': 'JSON' }
|
||||
)
|
||||
),
|
||||
cities AS (
|
||||
SELECT
|
||||
city_code,
|
||||
(city_pos)::INTEGER AS city_pos,
|
||||
filename, raw_json,
|
||||
(json_extract(raw_json, '$.size[1]'))::INTEGER AS n_times
|
||||
FROM raw,
|
||||
LATERAL (
|
||||
SELECT key AS city_code, value::INTEGER AS city_pos
|
||||
FROM json_each(json_extract(raw_json, '$.dimension.cities.category.index'))
|
||||
)
|
||||
),
|
||||
times AS (
|
||||
SELECT key AS ref_year, value::INTEGER AS time_pos
|
||||
FROM (SELECT raw_json FROM raw LIMIT 1),
|
||||
LATERAL (
|
||||
SELECT key, value
|
||||
FROM json_each(json_extract(raw_json, '$.dimension.time.category.index'))
|
||||
)
|
||||
),
|
||||
parsed AS (
|
||||
SELECT
|
||||
c.city_code,
|
||||
t.ref_year,
|
||||
TRY_CAST(
|
||||
json_extract(c.raw_json, '$.' || (c.city_pos * c.n_times + t.time_pos)::TEXT)
|
||||
AS DOUBLE
|
||||
) AS population,
|
||||
c.filename AS source_file,
|
||||
CURRENT_DATE AS extracted_date
|
||||
FROM cities c
|
||||
CROSS JOIN times t
|
||||
)
|
||||
SELECT
|
||||
UPPER(city_code) AS city_code,
|
||||
UPPER(LEFT(city_code, 2)) AS country_code,
|
||||
ref_year::INTEGER AS ref_year,
|
||||
population::BIGINT AS population,
|
||||
extracted_date
|
||||
FROM padelnomics.raw_eurostat_population
|
||||
FROM parsed
|
||||
WHERE population IS NOT NULL
|
||||
AND population > 0
|
||||
AND ref_year ~ '^\d{4}$'
|
||||
|
||||
Reference in New Issue
Block a user