- extract/cftc_cot: refactor extract_cot_year() to accept url_template and
  landing_subdir params; add _extract_cot() shared loop; add extract_cot_combined()
  entry point using com_disagg_txt_{year}.zip → landing/cot_combined/
- pyproject.toml: add extract_cot_combined script entry point
- macros/__init__.py: add @cot_combined_glob() for cot_combined/**/*.csv.gzip
- fct_cot_positioning.sql: union cot_glob and cot_combined_glob in src CTE (sketched
  below); add report_type column (FutOnly_or_Combined) to cast_and_clean + deduplicated;
  include FutOnly_or_Combined in hkey to avoid key collisions; add report_type to grain
- obt_cot_positioning.sql: add report_type = 'FutOnly' filter to preserve existing
  serving behavior
- obt_cot_positioning_combined.sql: new serving model filtered to report_type =
  'Combined'; identical analytics (COT index, net %, windows) on combined data
- pipelines.py: register extract_cot_combined; add to extract_all meta-pipeline

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
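For orientation, here is a minimal sketch of the `src` CTE change in `fct_cot_positioning.sql` described above. The glob macro names come from this change set, but the `read_csv` options and the literal tagging of `report_type` are assumptions, not the actual model:

```sql
-- Hypothetical sketch only: union the futures-only and combined landing
-- files and tag each branch with a report_type so downstream keys and
-- filters (report_type = 'FutOnly' / 'Combined') can tell them apart.
WITH src AS (
    SELECT *, 'FutOnly' AS report_type
    FROM read_csv(@cot_glob(), union_by_name = true, filename = true)

    UNION ALL

    SELECT *, 'Combined' AS report_type
    FROM read_csv(@cot_combined_glob(), union_by_name = true, filename = true)
)

SELECT * FROM src
```

Downstream, folding the `FutOnly_or_Combined` value into the `hkey` is what keeps the two report families from colliding on the same market/date grain, and the two serving models simply filter on `report_type`.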
# Materia SQLMesh Transform Layer
Data transformation pipeline using SQLMesh and DuckDB, implementing a 3-layer architecture.
## Quick Start

```bash
# From repo root

# Plan changes (dev environment)
uv run sqlmesh -p transform/sqlmesh_materia plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_materia plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_materia test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format
```
## Architecture

### 3-Layer Data Model
```
landing/                    ← immutable files (extraction output)
├── psd/{year}/{month}/     ← USDA PSD
├── cot/{year}/             ← CFTC COT
├── prices/coffee_kc/       ← KC=F daily prices
├── ice_stocks/             ← ICE daily warehouse stocks
├── ice_aging/              ← ICE monthly aging report
└── ice_stocks_by_port/     ← ICE historical EOM by port

staging/                    ← read_csv + seed joins + cast (PSD)
└── staging.psdalldata__commodity

seeds/                      ← static lookup CSVs (PSD code mappings)
├── seeds.psd_commodity_codes
├── seeds.psd_attribute_codes
└── seeds.psd_unit_of_measure_codes

foundation/                 ← read_csv + cast + dedup (prices, COT, ICE)
├── foundation.fct_coffee_prices
├── foundation.fct_cot_positioning
├── foundation.fct_ice_warehouse_stocks
├── foundation.fct_ice_aging_stocks
├── foundation.fct_ice_warehouse_stocks_by_port
└── foundation.dim_commodity

serving/                    ← pre-aggregated for web app
├── serving.coffee_prices
├── serving.cot_positioning
├── serving.ice_warehouse_stocks
├── serving.ice_aging_stocks
├── serving.ice_warehouse_stocks_by_port
└── serving.commodity_metrics
```
### Layer responsibilities

**staging/** — PSD only: reads landing CSVs directly via `@psd_glob()`, joins seed lookup tables, casts types, deduplicates. Uses `INCREMENTAL_BY_TIME_RANGE` (`ingest_date` derived from the filename path).

**seeds/** — Static lookup tables (commodity codes, attribute codes, unit of measure) loaded from `seeds/*.csv`. Referenced by staging.

**foundation/** — All other sources (prices, COT, ICE): reads landing CSVs directly via glob macros, casts types, deduplicates. Uses `INCREMENTAL_BY_TIME_RANGE`. Also holds `dim_commodity` (the cross-source identity mapping).

**serving/** — Analytics-ready aggregates consumed by the web app via `analytics.duckdb`. Pre-computes moving averages, COT indices, MoM changes. These are the only tables the web app reads.
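To make the read-cast-dedup pattern concrete, here is a minimal sketch of what a foundation model could look like. The model name matches the tree above, but the columns, the `@prices_glob()` macro, and the dedup key are illustrative assumptions rather than the actual model:

```sql
-- Hypothetical sketch of the foundation pattern: read landing CSVs via a
-- glob macro, cast types, restrict to the incremental window, deduplicate.
-- @prices_glob(), the column names, and the dedup key are assumptions.
MODEL (
    name foundation.fct_coffee_prices,
    kind INCREMENTAL_BY_TIME_RANGE (
        time_column ingest_date
    )
);

SELECT
    CAST(trade_date AS DATE)  AS trade_date,    -- assumed source column
    CAST(settle AS DOUBLE)    AS settle_price,  -- assumed source column
    CAST(ingest_date AS DATE) AS ingest_date    -- assumed; real models may derive this from the file path
FROM read_csv(@prices_glob(), union_by_name = true, filename = true)
WHERE ingest_date BETWEEN @start_ds AND @end_ds
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY trade_date
    ORDER BY filename DESC
) = 1
```

Staging follows the same shape for PSD, with joins to the seed lookup tables before the casts; serving models then read these foundation tables and pre-compute the window metrics listed above.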
### Why no raw layer?

Landing files are immutable and content-addressed — the landing directory is the audit trail. A SQL raw layer would just duplicate file bytes into DuckDB with no added value. The first SQL layer reads directly from landing.
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `local.duckdb` | DuckDB file (SQLMesh has exclusive write access) |
The web app reads from a separate `analytics.duckdb`, populated by `export_serving.py`.
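Conceptually, the export amounts to copying the serving tables out of the SQLMesh-owned file into the analytics file. A rough DuckDB SQL sketch of that idea, with assumed target table names (the actual `export_serving.py` may work differently):

```sql
-- Hypothetical sketch of the serving export, written as DuckDB SQL run
-- against local.duckdb; the real export_serving.py script may differ.
ATTACH 'analytics.duckdb' AS analytics;

CREATE OR REPLACE TABLE analytics.coffee_prices AS
SELECT * FROM serving.coffee_prices;

CREATE OR REPLACE TABLE analytics.cot_positioning AS
SELECT * FROM serving.cot_positioning;

-- ...repeat for the remaining serving tables...

DETACH analytics;
```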