## Key Changes
1. **Simplified extraction logic**
   - Changed from downloading 220+ historical archives to checking only the latest available month
   - Tries the current month and falls back up to 3 months (handles USDA publication lag)
   - Architecture advisor insight: ETags naturally deduplicate downloads, so the historical year/month structure was unnecessary
2. **Flat storage structure**
- Old: `data/{year}/{month}/{etag}.zip`
- New: `data/{etag}.zip` (local) or `psd/{etag}.zip` (R2)
- Migrated 226 existing files to flat structure
3. **Dual storage modes**
- **Local mode**: Downloads to local directory (development)
- **R2 mode**: Uploads to Cloudflare R2 (production)
- Mode determined by presence of R2 environment variables
- Added boto3 dependency for S3-compatible R2 API
4. **Updated raw SQLMesh model**
- Changed pattern from `**/*.zip` to `*.zip` to match flat structure
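The month-fallback scan in change 1 could be sketched roughly like this (a minimal illustration; `candidate_months` and `MAX_FALLBACK_MONTHS` are hypothetical names, not the actual implementation):

```python
from datetime import date

MAX_FALLBACK_MONTHS = 3  # USDA sometimes publishes a month or more late

def candidate_months(today: date, max_back: int = MAX_FALLBACK_MONTHS):
    """Yield (year, month) pairs starting at the current month, stepping back."""
    year, month = today.year, today.month
    for _ in range(max_back + 1):
        yield year, month
        month -= 1
        if month == 0:
            year, month = year - 1, 12

# The extractor would try each candidate in order and stop at the first
# archive that exists; e.g. in November 2024:
months = list(candidate_months(date(2024, 11, 15)))
# months == [(2024, 11), (2024, 10), (2024, 9), (2024, 8)]
```

Because the flat layout keys files by ETag alone, "already downloaded" becomes a simple existence check on `{etag}.zip` rather than a walk over year/month directories.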
## Benefits
- Simpler: Single file check instead of 220+ URL attempts
- Efficient: ETag-based deduplication works naturally
- Flexible: Supports both local dev and production R2 storage
- Maintainable: Removed unnecessary complexity
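The dual storage modes and flat key layout described above might look roughly like this sketch (the env var names `R2_BUCKET`, `R2_ENDPOINT_URL`, `R2_ACCESS_KEY_ID`, and `R2_SECRET_ACCESS_KEY` are illustrative assumptions; only the boto3 calls shown, `client` and `put_object`, are standard S3 API):

```python
import os
from pathlib import Path

def storage_mode(env) -> str:
    """R2 mode when the R2 variables are all present; local otherwise."""
    required = ("R2_BUCKET", "R2_ENDPOINT_URL",
                "R2_ACCESS_KEY_ID", "R2_SECRET_ACCESS_KEY")
    return "r2" if all(k in env for k in required) else "local"

def object_key(etag: str, mode: str) -> str:
    """Flat layout: data/{etag}.zip locally, psd/{etag}.zip on R2."""
    prefix = "psd" if mode == "r2" else "data"
    return f"{prefix}/{etag}.zip"

def store(payload: bytes, etag: str, env=os.environ) -> str:
    mode = storage_mode(env)
    key = object_key(etag, mode)
    if mode == "r2":
        import boto3  # lazy import: only needed in production R2 mode
        s3 = boto3.client(
            "s3",
            endpoint_url=env["R2_ENDPOINT_URL"],
            aws_access_key_id=env["R2_ACCESS_KEY_ID"],
            aws_secret_access_key=env["R2_SECRET_ACCESS_KEY"],
        )
        s3.put_object(Bucket=env["R2_BUCKET"], Key=key, Body=payload)
    else:
        path = Path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(payload)
    return key
```

Deriving the mode from environment variables keeps local development zero-config: the R2 branch only activates when the production credentials are injected.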
## Testing
- ✅ Local extraction works and respects ETags
- ✅ Falls back correctly when current month unavailable
- ✅ Linting passes
# Materia SQLMesh Transform Layer

Data transformation pipeline using SQLMesh and DuckDB, implementing a 4-layer architecture.
## Quick Start

```bash
cd transform/sqlmesh_materia

# Local development (virtual environment)
sqlmesh plan dev_<username>

# Production
sqlmesh plan prod

# Run tests
sqlmesh test

# Format SQL
sqlmesh format
```
## Architecture

### Gateway Configuration

**Single gateway**: all environments connect to the Cloudflare R2 Data Catalog (Apache Iceberg).

- Production: `sqlmesh plan prod`
- Development: `sqlmesh plan dev_<username>` (isolated virtual environment)

SQLMesh manages environment isolation automatically - no need for separate local databases.
### 4-Layer Data Model

See `models/README.md` for detailed architecture documentation:

- **Raw** - immutable source data
- **Staging** - schema, types, basic cleansing
- **Cleaned** - business logic, integration
- **Serving** - analytics-ready (facts, dimensions, aggregates)
## Configuration

Config: `config.yaml`

- DuckDB in-memory with R2 Iceberg catalog
- Extensions: `httpfs`, `iceberg`
- Auto-apply enabled (no prompts)
- Initialization hooks for R2 secret/catalog attachment
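A `config.yaml` along these lines would match that description (a hedged sketch, not the actual file: the gateway name and the exact fields for auto-apply and the R2 hooks should be checked against SQLMesh's configuration reference):

```yaml
# Sketch only - verify field names against the real config.yaml.
gateways:
  r2:
    connection:
      type: duckdb
      extensions:
        - httpfs
        - iceberg

default_gateway: r2

model_defaults:
  dialect: duckdb

plan:
  auto_apply: true   # no prompts
```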
## Commands

```bash
# Plan changes for dev environment
sqlmesh plan dev_yourname

# Plan changes for prod
sqlmesh plan prod

# Run tests
sqlmesh test

# Validate models
sqlmesh validate

# Run audits
sqlmesh audit

# Format SQL files
sqlmesh format

# Start web UI
sqlmesh ui
```
## Environment Variables (Prod)

Required for the production R2 Iceberg catalog:

- `CLOUDFLARE_API_TOKEN` - R2 API token
- `ICEBERG_REST_URI` - R2 catalog REST endpoint
- `R2_WAREHOUSE_NAME` - warehouse name (default: "materia")
These are injected via Pulumi ESC (beanflows/prod) on the supervisor instance.
## Development Workflow

1. Make changes to models in `models/`
2. Test locally: `sqlmesh test`
3. Plan changes: `sqlmesh plan dev_yourname`
4. Review and apply changes
5. Commit and push to trigger CI/CD
SQLMesh will handle environment isolation, table versioning, and incremental updates automatically.