# Materia SQLMesh Transform Layer

Data transformation pipeline using SQLMesh and DuckDB, implementing a 3-layer architecture.

## Quick Start

```bash
# From repo root

# Plan changes (dev environment)
uv run sqlmesh -p transform/sqlmesh_materia plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_materia plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_materia test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format
```

## Architecture

### 3-Layer Data Model

```
landing/                      ← immutable files (extraction output)
├── psd/{year}/{month}/       ← USDA PSD
├── cot/{year}/               ← CFTC COT
├── prices/coffee_kc/         ← KC=F daily prices
├── ice_stocks/               ← ICE daily warehouse stocks
├── ice_aging/                ← ICE monthly aging report
└── ice_stocks_by_port/       ← ICE historical EOM by port

staging/                      ← read_csv + seed joins + cast (PSD)
└── staging.psdalldata__commodity

seeds/                        ← static lookup CSVs (PSD code mappings)
├── seeds.psd_commodity_codes
├── seeds.psd_attribute_codes
└── seeds.psd_unit_of_measure_codes

foundation/                   ← read_csv + cast + dedup (prices, COT, ICE)
├── foundation.fct_coffee_prices
├── foundation.fct_cot_positioning
├── foundation.fct_ice_warehouse_stocks
├── foundation.fct_ice_aging_stocks
├── foundation.fct_ice_warehouse_stocks_by_port
└── foundation.dim_commodity

serving/                      ← pre-aggregated for web app
├── serving.coffee_prices
├── serving.cot_positioning
├── serving.ice_warehouse_stocks
├── serving.ice_aging_stocks
├── serving.ice_warehouse_stocks_by_port
└── serving.commodity_metrics
```

### Layer responsibilities

**staging/** — PSD only: reads landing CSVs directly via `@psd_glob()`, joins seed lookup tables, casts types, deduplicates. Uses INCREMENTAL_BY_TIME_RANGE (`ingest_date` derived from the filename path).

**seeds/** — Static lookup tables (commodity codes, attribute codes, unit of measure) loaded from `seeds/*.csv`. Referenced by staging.

**foundation/** — All other sources (prices, COT, ICE): reads landing data (e.g. CSVs) directly via glob macros, casts types, deduplicates. Uses INCREMENTAL_BY_TIME_RANGE. Also holds `dim_commodity` (the cross-source identity mapping).

**serving/** — Analytics-ready aggregates consumed by the web app via `analytics.duckdb`. Pre-computes moving averages, COT indices, MoM changes. These are the only tables the web app reads.

### Why no raw layer?

Landing files are immutable and content-addressed — the landing directory is the audit trail. A SQL raw layer would just duplicate file bytes into DuckDB with no added value. The first SQL layer reads directly from landing.

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `local.duckdb` | DuckDB file (SQLMesh exclusive write access) |

The web app reads from a separate `analytics.duckdb` via `export_serving.py`.
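
## Example Model Sketches

Because the first SQL layer reads landing files directly, a staging or foundation model is essentially a `read_csv` over a landing glob plus casts, an incremental time filter, and a dedup step. The sketch below shows that general shape for a foundation model. The column names, the hard-coded landing glob, and the dedup key are illustrative assumptions only; the real models resolve paths through the project's glob macros and have their own schemas.

```sql
-- Sketch only: columns and the landing glob are assumptions, not the actual schema.
MODEL (
  name foundation.fct_coffee_prices,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column trade_date
  ),
  cron '@daily'
);

SELECT
  CAST(trade_date AS DATE) AS trade_date,   -- cast types
  CAST(close AS DOUBLE) AS close_price,
  CAST(volume AS BIGINT) AS volume
FROM read_csv('data/landing/prices/coffee_kc/*.csv', header = true, filename = true)
-- Only process the interval SQLMesh is backfilling
WHERE trade_date BETWEEN @start_date AND @end_date
-- Deduplicate: keep one row per trading day, preferring the latest landing file
QUALIFY ROW_NUMBER() OVER (PARTITION BY trade_date ORDER BY filename DESC) = 1
```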
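
Serving models read the foundation tables and pre-compute the figures the web app displays. The sketch below shows a rolling average in that spirit; the `ma_50d` column, the window length, and the FULL-refresh kind are assumptions for illustration, not the actual `serving.coffee_prices` definition.

```sql
-- Sketch only: a serving-layer aggregate; names and window length are illustrative.
MODEL (
  name serving.coffee_prices,
  kind FULL
);

SELECT
  trade_date,
  close_price,
  -- 50-day trailing moving average, pre-computed so the web app reads it as-is
  AVG(close_price) OVER (
    ORDER BY trade_date
    ROWS BETWEEN 49 PRECEDING AND CURRENT ROW
  ) AS ma_50d
FROM foundation.fct_coffee_prices
```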