# Materia SQLMesh Transform Layer

Data transformation pipeline using SQLMesh and DuckDB, implementing a 3-layer architecture.

## Quick Start

```bash
# From repo root

# Plan changes (dev environment)
uv run sqlmesh -p transform/sqlmesh_materia plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_materia plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_materia test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format
```

## Architecture

### 3-Layer Data Model

```
landing/                     ← immutable files (extraction output)
├── psd/{year}/{month}/      ← USDA PSD
├── cot/{year}/              ← CFTC COT
├── prices/coffee_kc/        ← KC=F daily prices
├── ice_stocks/              ← ICE daily warehouse stocks
├── ice_aging/               ← ICE monthly aging report
└── ice_stocks_by_port/     ← ICE historical EOM by port

staging/                     ← read_csv + seed joins + cast (PSD)
└── staging.psdalldata__commodity

seeds/                       ← static lookup CSVs (PSD code mappings)
├── seeds.psd_commodity_codes
├── seeds.psd_attribute_codes
└── seeds.psd_unit_of_measure_codes

foundation/                  ← read_csv + cast + dedup (prices, COT, ICE)
├── foundation.fct_coffee_prices
├── foundation.fct_cot_positioning
├── foundation.fct_ice_warehouse_stocks
├── foundation.fct_ice_aging_stocks
├── foundation.fct_ice_warehouse_stocks_by_port
└── foundation.dim_commodity

serving/                     ← pre-aggregated for web app
├── serving.coffee_prices
├── serving.cot_positioning
├── serving.ice_warehouse_stocks
├── serving.ice_aging_stocks
├── serving.ice_warehouse_stocks_by_port
└── serving.commodity_metrics
```

### Layer responsibilities

**staging/** — PSD only: reads landing CSVs directly via `@psd_glob()`, joins seed lookup tables, casts types, and deduplicates. Uses `INCREMENTAL_BY_TIME_RANGE` (`ingest_date` derived from the filename path).

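A staging model following this pattern might look like the sketch below. Column names and the seed join are illustrative, and `ingest_date` is shown as if it were a plain column for simplicity (the real model derives it from the file path); `@psd_glob()` is the project macro named above, while `@start_ds` / `@end_ds` are SQLMesh's built-in incremental date-range variables.

```sql
-- Illustrative sketch only, not the actual model definition.
MODEL (
  name staging.psdalldata__commodity,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column ingest_date
  )
);

SELECT DISTINCT
  CAST(r.commodity_code AS TEXT) AS commodity_code,  -- hypothetical column
  c.commodity_name,                                  -- from the seed lookup
  CAST(r.value AS DOUBLE) AS value,                  -- hypothetical column
  CAST(r.ingest_date AS DATE) AS ingest_date         -- really derived from the file path
FROM read_csv(@psd_glob(), header = true) AS r
LEFT JOIN seeds.psd_commodity_codes AS c
  ON r.commodity_code = c.commodity_code
WHERE r.ingest_date BETWEEN @start_ds AND @end_ds
```
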
**seeds/** — Static lookup tables (commodity codes, attribute codes, unit-of-measure codes) loaded from `seeds/*.csv`. Referenced by staging.

**foundation/** — All other sources (prices, COT, ICE): reads landing data (e.g. CSVs) directly via glob macros, casts types, and deduplicates. Uses `INCREMENTAL_BY_TIME_RANGE`. Also holds `dim_commodity` (the cross-source identity mapping).

**serving/** — Analytics-ready aggregates consumed by the web app via `analytics.duckdb`. Pre-computes moving averages, COT indices, and MoM changes. These are the only tables the web app reads.

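A serving model that pre-computes a moving average could be sketched as follows; the window size, column names, and `FULL` materialization are assumptions based on the layer listing above, not the actual model.

```sql
-- Illustrative sketch only: pre-aggregating in the serving layer
-- so the web app never computes windows at read time.
MODEL (
  name serving.coffee_prices,
  kind FULL
);

SELECT
  trade_date,                          -- hypothetical column
  close_price,                         -- hypothetical column
  AVG(close_price) OVER (
    ORDER BY trade_date
    ROWS BETWEEN 49 PRECEDING AND CURRENT ROW
  ) AS ma_50d                          -- 50-day moving average
FROM foundation.fct_coffee_prices
```
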
### Why no raw layer?

Landing files are immutable and content-addressed — the landing directory is the audit trail. A SQL raw layer would just duplicate file bytes into DuckDB with no added value. The first SQL layer reads directly from landing.

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `local.duckdb` | DuckDB file (SQLMesh has exclusive write access) |

The web app reads from a separate `analytics.duckdb`, exported by `export_serving.py`.
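A consumer-side query against the exported database might look like this; the table comes from the serving layer listed above, but the column name is illustrative.

```sql
-- Run against analytics.duckdb (read-only on the web-app side);
-- serving tables are the only supported read surface.
SELECT *
FROM serving.coffee_prices
ORDER BY trade_date DESC  -- hypothetical column
LIMIT 30
```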