refactor(transform): remove raw layer, read landing zone directly

- Delete 6 raw models (coffee_prices, cot_disaggregated, ice_*,
  psd_data) — pure read_csv passthroughs with no added value
- Move 3 PSD seed models raw/ → seeds/, rename schema raw.* → seeds.*
- Update staging.psdalldata__commodity: read_csv(@psd_glob()) directly,
  join seeds.psd_* instead of raw.psd_*
- Update 5 foundation models: inline read_csv() in a src CTE, removing
  the raw.* dependency (fct_coffee_prices, fct_cot_positioning, fct_ice_*)
- Remove fixture-based SQLMesh test that depended on raw.cot_disaggregated
  (unit tests incompatible with inline read_csv; integration run covers this)
- Update readme.md: 3-layer architecture (staging/foundation → serving)

Landing files are immutable and content-addressed — the landing directory
is the audit trail. A raw SQL layer duplicated file bytes into DuckDB
with no added value.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
commit c3c8333407 (parent 1814a76e74)
Author: Deeman
Date:   2026-02-22 17:30:18 +01:00
18 changed files with 266 additions and 643 deletions

readme.md
# Materia SQLMesh Transform Layer
Data transformation pipeline using SQLMesh and DuckDB, implementing a 3-layer architecture.
## Quick Start
```bash
# From repo root

# Plan changes (dev environment)
uv run sqlmesh -p transform/sqlmesh_materia plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_materia plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_materia test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format
```
## Architecture
### 3-Layer Data Model

```
landing/                      ← immutable files (extraction output)
├── psd/{year}/{month}/       ← USDA PSD
├── cot/{year}/               ← CFTC COT
├── prices/coffee_kc/         ← KC=F daily prices
├── ice_stocks/               ← ICE daily warehouse stocks
├── ice_aging/                ← ICE monthly aging report
└── ice_stocks_by_port/       ← ICE historical EOM by port

staging/                      ← read_csv + seed joins + cast (PSD)
└── staging.psdalldata__commodity

seeds/                        ← static lookup CSVs (PSD code mappings)
├── seeds.psd_commodity_codes
├── seeds.psd_attribute_codes
└── seeds.psd_unit_of_measure_codes

foundation/                   ← read_csv + cast + dedup (prices, COT, ICE)
├── foundation.fct_coffee_prices
├── foundation.fct_cot_positioning
├── foundation.fct_ice_warehouse_stocks
├── foundation.fct_ice_aging_stocks
├── foundation.fct_ice_warehouse_stocks_by_port
└── foundation.dim_commodity

serving/                      ← pre-aggregated for web app
├── serving.coffee_prices
├── serving.cot_positioning
├── serving.ice_warehouse_stocks
├── serving.ice_aging_stocks
├── serving.ice_warehouse_stocks_by_port
└── serving.commodity_metrics
```
### Layer responsibilities

**staging/** — PSD only: reads landing CSVs directly via `@psd_glob()`, joins seed lookup tables, casts types, deduplicates. Uses INCREMENTAL_BY_TIME_RANGE (`ingest_date` is derived from the filename path).
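For orientation, a minimal sketch of this pattern (not the repo's actual model; the PSD column names, the path regex, and what `@psd_glob()` expands to are assumptions):

```sql
MODEL (
  name staging.psdalldata__commodity,
  kind INCREMENTAL_BY_TIME_RANGE (time_column ingest_date)
);

WITH src AS (
  SELECT
    *,
    -- derive ingest_date from the landing path psd/{year}/{month}/
    make_date(
      CAST(regexp_extract(filename, 'psd/(\d{4})/(\d{2})/', 1) AS INT),
      CAST(regexp_extract(filename, 'psd/(\d{4})/(\d{2})/', 2) AS INT),
      1
    ) AS ingest_date
  FROM read_csv(@psd_glob(), filename = true)  -- filename = true exposes the path
)
SELECT
  s.ingest_date,
  c.commodity_name,                  -- resolved via the seed lookup
  CAST(s.value AS DOUBLE) AS value   -- hypothetical PSD column
FROM src AS s
JOIN seeds.psd_commodity_codes AS c
  ON s.commodity_code = c.commodity_code
WHERE s.ingest_date BETWEEN @start_date AND @end_date
```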
**seeds/** — Static lookup tables (commodity codes, attribute codes, unit of measure) loaded from `seeds/*.csv`. Referenced by staging.
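In SQLMesh, a seed is declared with the `SEED` model kind; a minimal sketch, assuming the CSV file sits next to the model definition:

```sql
MODEL (
  name seeds.psd_commodity_codes,
  kind SEED (
    path 'psd_commodity_codes.csv'  -- assumed filename, relative to the model file
  )
);
```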
**foundation/** — All other sources (prices, COT, ICE): reads landing CSVs directly via glob macros, casts types, deduplicates. Uses INCREMENTAL_BY_TIME_RANGE. Also holds `dim_commodity` (the cross-source identity mapping).
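The same inline read_csv pattern, sketched for `foundation.fct_coffee_prices`; the glob path, CSV column names, and dedup key are all assumptions:

```sql
MODEL (
  name foundation.fct_coffee_prices,
  kind INCREMENTAL_BY_TIME_RANGE (time_column price_date)
);

WITH src AS (
  -- hypothetical literal glob; the repo wraps paths like this in macros
  SELECT * FROM read_csv('data/landing/prices/coffee_kc/*.csv', filename = true)
)
SELECT
  CAST(trade_date AS DATE) AS price_date,  -- assumed CSV column
  CAST(settle AS DOUBLE)   AS close_price  -- assumed CSV column
FROM src
WHERE CAST(trade_date AS DATE) BETWEEN @start_date AND @end_date
-- dedup: if a day lands in several files, keep the row from the newest file
QUALIFY ROW_NUMBER() OVER (PARTITION BY trade_date ORDER BY filename DESC) = 1
```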
**serving/** — Analytics-ready aggregates consumed by the web app via `analytics.duckdb`. Pre-computes moving averages, COT indices, MoM changes. These are the only tables the web app reads.
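A sketch of what that pre-computation can look like; the model kind, the 20-day window, and the column names are assumptions:

```sql
MODEL (
  name serving.coffee_prices,
  kind FULL  -- recomputed from foundation on each run
);

SELECT
  price_date,
  close_price,
  -- moving average pre-computed so the web app does no windowing itself
  AVG(close_price) OVER (
    ORDER BY price_date
    ROWS BETWEEN 19 PRECEDING AND CURRENT ROW
  ) AS close_ma_20d
FROM foundation.fct_coffee_prices
```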
### Why no raw layer?

Landing files are immutable and content-addressed — the landing directory is the audit trail. A SQL raw layer would just duplicate file bytes into DuckDB with no added value. The first SQL layer reads directly from landing.
## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `local.duckdb` | DuckDB file (SQLMesh has exclusive write access) |

The web app reads from a separate `analytics.duckdb`, populated by `export_serving.py`.