refactor(transform): remove raw layer, read landing zone directly
- Delete 6 `raw` models (coffee_prices, cot_disaggregated, ice_*, psd_data) — pure read_csv passthroughs with no added value
- Move 3 PSD seed models from raw/ to seeds/; rename schema raw.* → seeds.*
- Update staging.psdalldata__commodity: read_csv(@psd_glob()) directly; join seeds.psd_* instead of raw.psd_*
- Update 5 foundation models: inline read_csv() in a src CTE, removing the raw.* dependency (fct_coffee_prices, fct_cot_positioning, fct_ice_*)
- Remove the fixture-based SQLMesh test that depended on raw.cot_disaggregated (unit tests are incompatible with inline read_csv; the integration run covers this)
- Update readme.md: 3-layer architecture (staging/foundation → serving)

Landing files are immutable and content-addressed — the landing directory is the audit trail. A raw SQL layer duplicated file bytes into DuckDB with no added value.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

# Materia SQLMesh Transform Layer

Data transformation pipeline using SQLMesh and DuckDB, implementing a 3-layer architecture.

## Quick Start

```bash
# From repo root

# Plan changes (dev environment)
uv run sqlmesh -p transform/sqlmesh_materia plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_materia plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_materia test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format
```

## Architecture

### 3-Layer Data Model

```
landing/                      ← immutable files (extraction output)
├── psd/{year}/{month}/       ← USDA PSD
├── cot/{year}/               ← CFTC COT
├── prices/coffee_kc/         ← KC=F daily prices
├── ice_stocks/               ← ICE daily warehouse stocks
├── ice_aging/                ← ICE monthly aging report
└── ice_stocks_by_port/       ← ICE historical EOM by port

staging/                      ← read_csv + seed joins + cast (PSD)
└── staging.psdalldata__commodity

seeds/                        ← static lookup CSVs (PSD code mappings)
├── seeds.psd_commodity_codes
├── seeds.psd_attribute_codes
└── seeds.psd_unit_of_measure_codes

foundation/                   ← read_csv + cast + dedup (prices, COT, ICE)
├── foundation.fct_coffee_prices
├── foundation.fct_cot_positioning
├── foundation.fct_ice_warehouse_stocks
├── foundation.fct_ice_aging_stocks
├── foundation.fct_ice_warehouse_stocks_by_port
└── foundation.dim_commodity

serving/                      ← pre-aggregated for web app
├── serving.coffee_prices
├── serving.cot_positioning
├── serving.ice_warehouse_stocks
├── serving.ice_aging_stocks
├── serving.ice_warehouse_stocks_by_port
└── serving.commodity_metrics
```

### Layer responsibilities

**staging/** — PSD only: reads landing CSVs directly via `@psd_glob()`, joins seed lookup tables, casts types, deduplicates. Uses INCREMENTAL_BY_TIME_RANGE (ingest_date derived from the filename path).
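
The path-to-ingest-date idea can be sketched in plain Python. This is a hypothetical illustration only: `ingest_date_from_path` is not the project's actual macro, and the exact path shape is assumed from the landing layout above.

```python
import datetime
import re

def ingest_date_from_path(path: str) -> datetime.date:
    """Derive an ingest date from a PSD landing path like 'psd/{year}/{month}/file.csv'.

    Hypothetical helper for illustration; the real logic lives in the
    SQLMesh model and the @psd_glob() macro.
    """
    m = re.search(r"psd/(\d{4})/(\d{2})/", path)
    if m is None:
        raise ValueError(f"not a PSD landing path: {path}")
    return datetime.date(int(m.group(1)), int(m.group(2)), 1)

print(ingest_date_from_path("data/landing/psd/2024/07/psd_alldata.csv"))  # 2024-07-01
```

Because the date comes from the directory structure rather than file contents, the incremental model can filter files to a time range before reading a single byte.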

**seeds/** — Static lookup tables (commodity codes, attribute codes, unit of measure) loaded from `seeds/*.csv`. Referenced by staging.

**foundation/** — All other sources (prices, COT, ICE): reads landing CSVs directly via glob macros, casts types, deduplicates. Uses INCREMENTAL_BY_TIME_RANGE. Also holds `dim_commodity` (the cross-source identity mapping).
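
The dedup step is SQL in the actual models; as a rough Python sketch of the semantics (keep the latest ingested row per natural key), with invented column names:

```python
# Keep-latest-row-per-key dedup, assuming each row carries its natural key
# ("date") and an ingest ordering ("ingest_ts"). Illustrative only: the
# foundation models express this in SQL, not Python.
rows = [
    {"date": "2024-07-01", "close": 227.5, "ingest_ts": 1},
    {"date": "2024-07-01", "close": 228.0, "ingest_ts": 2},  # re-extracted file wins
    {"date": "2024-07-02", "close": 230.1, "ingest_ts": 2},
]

latest = {}
for row in rows:
    key = row["date"]
    if key not in latest or row["ingest_ts"] > latest[key]["ingest_ts"]:
        latest[key] = row

deduped = sorted(latest.values(), key=lambda r: r["date"])
print([r["close"] for r in deduped])  # [228.0, 230.1]
```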

**serving/** — Analytics-ready aggregates consumed by the web app via `analytics.duckdb`. Pre-computes moving averages, COT indices, MoM changes. These are the only tables the web app reads.
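
The kinds of pre-computation named above can be illustrated in Python. The real implementations are SQL window functions; the 3-period window and sample values here are invented for the example.

```python
# Illustrative moving average and month-over-month (MoM) percentage change.
closes = [220.0, 225.0, 230.0, 228.0, 232.0]

def moving_average(values, window):
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]  # partial window at the start
        out.append(sum(chunk) / len(chunk))
    return out

ma3 = moving_average(closes, 3)
mom_pct = [round((curr - prev) / prev * 100, 2) for prev, curr in zip(closes, closes[1:])]

print(ma3[-1])   # 230.0
print(mom_pct)   # [2.27, 2.22, -0.87, 1.75]
```

Shipping these as serving tables means the web app issues simple point reads instead of recomputing windows on every request.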

### Why no raw layer?

Landing files are immutable and content-addressed — the landing directory is the audit trail. A SQL raw layer would just duplicate file bytes into DuckDB with no added value. The first SQL layer reads directly from landing.
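
"Content-addressed" means a file's name is derived from its bytes, so identical content always maps to the same path and re-extraction cannot silently mutate history. A minimal sketch of the idea (the project's actual naming scheme is not shown here and may differ):

```python
import hashlib

def content_address(data: bytes, suffix: str = ".csv") -> str:
    # Name a file by a prefix of its SHA-256 digest. Hypothetical scheme,
    # shown only to illustrate why such files need no raw-layer copy.
    return hashlib.sha256(data).hexdigest()[:16] + suffix

payload = b"date,close\n2024-07-01,227.5\n"
print(content_address(payload))  # same bytes always yield the same name
```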

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `local.duckdb` | DuckDB file (SQLMesh has exclusive write access) |

The web app reads from a separate `analytics.duckdb` via `export_serving.py`.
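
A script resolving these settings might look like the following sketch. The variable names and defaults match the table above; `resolve_paths` itself is a hypothetical helper, not project code.

```python
import os

def resolve_paths(env=os.environ):
    # Apply the documented defaults when the variables are unset.
    landing_dir = env.get("LANDING_DIR", "data/landing")
    duckdb_path = env.get("DUCKDB_PATH", "local.duckdb")
    return landing_dir, duckdb_path

print(resolve_paths({}))  # ('data/landing', 'local.duckdb')
```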