# Materia SQLMesh Transform Layer

Data transformation pipeline using SQLMesh and DuckDB, implementing a 3-layer architecture.

## Quick Start

```bash
# From repo root

# Plan changes (dev environment)
uv run sqlmesh -p transform/sqlmesh_materia plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_materia plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_materia test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format
```

## Architecture

### 3-Layer Data Model

```
landing/                     ← immutable files (extraction output)
├── psd/{year}/{month}/      ← USDA PSD
├── cot/{year}/              ← CFTC COT
├── prices/coffee_kc/        ← KC=F daily prices
├── ice_stocks/              ← ICE daily warehouse stocks
├── ice_aging/               ← ICE monthly aging report
└── ice_stocks_by_port/     ← ICE historical EOM by port

staging/                     ← read_csv + seed joins + cast (PSD)
└── staging.psdalldata__commodity

seeds/                       ← static lookup CSVs (PSD code mappings)
├── seeds.psd_commodity_codes
├── seeds.psd_attribute_codes
└── seeds.psd_unit_of_measure_codes

foundation/                  ← read_csv + cast + dedup (prices, COT, ICE)
├── foundation.fct_coffee_prices
├── foundation.fct_cot_positioning
├── foundation.fct_ice_warehouse_stocks
├── foundation.fct_ice_aging_stocks
├── foundation.fct_ice_warehouse_stocks_by_port
└── foundation.dim_commodity

serving/                     ← pre-aggregated for web app
├── serving.coffee_prices
├── serving.cot_positioning
├── serving.ice_warehouse_stocks
├── serving.ice_aging_stocks
├── serving.ice_warehouse_stocks_by_port
└── serving.commodity_metrics
```

### Layer responsibilities

**staging/** — PSD only: reads landing CSVs directly via `@psd_glob()`, joins seed lookup tables, casts types, and deduplicates. Uses `INCREMENTAL_BY_TIME_RANGE` (`ingest_date` derived from the filename path).

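A staging model following this pattern might look like the sketch below. Column names and the seed join are illustrative, and `ingest_date` is shown as if it were a plain column for simplicity (the real model derives it from the file path); `@psd_glob()` is the project macro named above, while `@start_ds` / `@end_ds` are SQLMesh's built-in incremental date-range variables.

```sql
-- Illustrative sketch only, not the actual model definition.
MODEL (
  name staging.psdalldata__commodity,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column ingest_date
  )
);

SELECT DISTINCT
  CAST(r.commodity_code AS TEXT) AS commodity_code,  -- hypothetical column
  c.commodity_name,                                  -- from the seed lookup
  CAST(r.value AS DOUBLE) AS value,                  -- hypothetical column
  CAST(r.ingest_date AS DATE) AS ingest_date         -- really derived from the file path
FROM read_csv(@psd_glob(), header = true) AS r
LEFT JOIN seeds.psd_commodity_codes AS c
  ON r.commodity_code = c.commodity_code
WHERE r.ingest_date BETWEEN @start_ds AND @end_ds
```
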
**seeds/** — Static lookup tables (commodity codes, attribute codes, unit-of-measure codes) loaded from `seeds/*.csv`. Referenced by staging.

**foundation/** — All other sources (prices, COT, ICE): reads landing data (e.g. CSVs) directly via glob macros, casts types, and deduplicates. Uses `INCREMENTAL_BY_TIME_RANGE`. Also holds `dim_commodity` (the cross-source identity mapping).

**serving/** — Analytics-ready aggregates consumed by the web app via `analytics.duckdb`. Pre-computes moving averages, COT indices, and MoM changes. These are the only tables the web app reads.

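A serving model that pre-computes a moving average could be sketched as follows; the window size, column names, and `FULL` materialization are assumptions based on the layer listing above, not the actual model.

```sql
-- Illustrative sketch only: pre-aggregating in the serving layer
-- so the web app never computes windows at read time.
MODEL (
  name serving.coffee_prices,
  kind FULL
);

SELECT
  trade_date,                          -- hypothetical column
  close_price,                         -- hypothetical column
  AVG(close_price) OVER (
    ORDER BY trade_date
    ROWS BETWEEN 49 PRECEDING AND CURRENT ROW
  ) AS ma_50d                          -- 50-day moving average
FROM foundation.fct_coffee_prices
```
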
### Why no raw layer?

Landing files are immutable and content-addressed — the landing directory is the audit trail. A SQL raw layer would just duplicate file bytes into DuckDB with no added value. The first SQL layer reads directly from landing.

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `local.duckdb` | DuckDB file (SQLMesh has exclusive write access) |

The web app reads from a separate `analytics.duckdb`, exported by `export_serving.py`.
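A consumer-side query against the exported database might look like this; the table comes from the serving layer listed above, but the column name is illustrative.

```sql
-- Run against analytics.duckdb (read-only on the web-app side);
-- serving tables are the only supported read surface.
SELECT *
FROM serving.coffee_prices
ORDER BY trade_date DESC  -- hypothetical column
LIMIT 30
```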