- extract/cftc_cot: refactor extract_cot_year() to accept url_template and
  landing_subdir params; add _extract_cot() shared loop; add extract_cot_combined()
  entry point using com_disagg_txt_{year}.zip → landing/cot_combined/
- pyproject.toml: add extract_cot_combined script entry point
- macros/__init__.py: add @cot_combined_glob() for cot_combined/**/*.csv.gzip
- fct_cot_positioning.sql: union cot_glob and cot_combined_glob in src CTE (sketched
  below); add report_type column (FutOnly_or_Combined) to cast_and_clean + deduplicated;
  include FutOnly_or_Combined in hkey to avoid key collisions; add report_type to grain
- obt_cot_positioning.sql: add report_type = 'FutOnly' filter to preserve existing
  serving behavior
- obt_cot_positioning_combined.sql: new serving model filtered to report_type =
  'Combined'; identical analytics (COT index, net %, windows) on combined data
- pipelines.py: register extract_cot_combined; add to extract_all meta-pipeline

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
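For orientation, here is a minimal sketch of the `src` CTE change in `fct_cot_positioning.sql` described above. The glob macro names come from this change set, but the `read_csv` options and the literal tagging of `report_type` are assumptions, not the actual model:

```sql
-- Hypothetical sketch only: union the futures-only and combined landing
-- files and tag each branch with a report_type so downstream keys and
-- filters (report_type = 'FutOnly' / 'Combined') can tell them apart.
WITH src AS (
    SELECT *, 'FutOnly' AS report_type
    FROM read_csv(@cot_glob(), union_by_name = true, filename = true)

    UNION ALL

    SELECT *, 'Combined' AS report_type
    FROM read_csv(@cot_combined_glob(), union_by_name = true, filename = true)
)

SELECT * FROM src
```

Downstream, folding the `FutOnly_or_Combined` value into the `hkey` is what keeps the two report families from colliding on the same market/date grain, and the two serving models simply filter on `report_type`.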
# Materia SQLMesh Transform Layer
Data transformation pipeline using SQLMesh and DuckDB, implementing a 3-layer architecture.
## Quick Start

```bash
# From repo root

# Plan changes (dev environment)
uv run sqlmesh -p transform/sqlmesh_materia plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_materia plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_materia test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format
```
## Architecture

### 3-Layer Data Model
```
landing/                    ← immutable files (extraction output)
├── psd/{year}/{month}/     ← USDA PSD
├── cot/{year}/             ← CFTC COT
├── prices/coffee_kc/       ← KC=F daily prices
├── ice_stocks/             ← ICE daily warehouse stocks
├── ice_aging/              ← ICE monthly aging report
└── ice_stocks_by_port/     ← ICE historical EOM by port

staging/                    ← read_csv + seed joins + cast (PSD)
└── staging.psdalldata__commodity

seeds/                      ← static lookup CSVs (PSD code mappings)
├── seeds.psd_commodity_codes
├── seeds.psd_attribute_codes
└── seeds.psd_unit_of_measure_codes

foundation/                 ← read_csv + cast + dedup (prices, COT, ICE)
├── foundation.fct_coffee_prices
├── foundation.fct_cot_positioning
├── foundation.fct_ice_warehouse_stocks
├── foundation.fct_ice_aging_stocks
├── foundation.fct_ice_warehouse_stocks_by_port
└── foundation.dim_commodity

serving/                    ← pre-aggregated for web app
├── serving.coffee_prices
├── serving.cot_positioning
├── serving.ice_warehouse_stocks
├── serving.ice_aging_stocks
├── serving.ice_warehouse_stocks_by_port
└── serving.commodity_metrics
```
### Layer responsibilities

**staging/** — PSD only: reads landing CSVs directly via `@psd_glob()`, joins seed lookup tables, casts types, deduplicates. Uses `INCREMENTAL_BY_TIME_RANGE` (`ingest_date` derived from the filename path).

**seeds/** — Static lookup tables (commodity codes, attribute codes, unit of measure) loaded from `seeds/*.csv`. Referenced by staging.

**foundation/** — All other sources (prices, COT, ICE): reads landing CSVs directly via glob macros, casts types, deduplicates. Uses `INCREMENTAL_BY_TIME_RANGE`. Also holds `dim_commodity` (the cross-source identity mapping).

**serving/** — Analytics-ready aggregates consumed by the web app via `analytics.duckdb`. Pre-computes moving averages, COT indices, MoM changes. These are the only tables the web app reads.
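To make the read-cast-dedup pattern concrete, here is a minimal sketch of what a foundation model could look like. The model name matches the tree above, but the columns, the `@prices_glob()` macro, and the dedup key are illustrative assumptions rather than the actual model:

```sql
-- Hypothetical sketch of the foundation pattern: read landing CSVs via a
-- glob macro, cast types, restrict to the incremental window, deduplicate.
-- @prices_glob(), the column names, and the dedup key are assumptions.
MODEL (
    name foundation.fct_coffee_prices,
    kind INCREMENTAL_BY_TIME_RANGE (
        time_column ingest_date
    )
);

SELECT
    CAST(trade_date AS DATE)  AS trade_date,    -- assumed source column
    CAST(settle AS DOUBLE)    AS settle_price,  -- assumed source column
    CAST(ingest_date AS DATE) AS ingest_date    -- assumed; real models may derive this from the file path
FROM read_csv(@prices_glob(), union_by_name = true, filename = true)
WHERE ingest_date BETWEEN @start_ds AND @end_ds
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY trade_date
    ORDER BY filename DESC
) = 1
```

Staging follows the same shape for PSD, with joins to the seed lookup tables before the casts; serving models then read these foundation tables and pre-compute the window metrics listed above.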
### Why no raw layer?

Landing files are immutable and content-addressed — the landing directory is the audit trail. A SQL raw layer would just duplicate file bytes into DuckDB with no added value. The first SQL layer reads directly from landing.
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `local.duckdb` | DuckDB file (SQLMesh has exclusive write access) |
The web app reads from a separate `analytics.duckdb`, populated by `export_serving.py`.
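Conceptually, the export amounts to copying the serving tables out of the SQLMesh-owned file into the analytics file. A rough DuckDB SQL sketch of that idea, with assumed target table names (the actual `export_serving.py` may work differently):

```sql
-- Hypothetical sketch of the serving export, written as DuckDB SQL run
-- against local.duckdb; the real export_serving.py script may differ.
ATTACH 'analytics.duckdb' AS analytics;

CREATE OR REPLACE TABLE analytics.coffee_prices AS
SELECT * FROM serving.coffee_prices;

CREATE OR REPLACE TABLE analytics.cot_positioning AS
SELECT * FROM serving.cot_positioning;

-- ...repeat for the remaining serving tables...

DETACH analytics;
```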