extract: wrap response.content in BytesIO before passing to
normalize_zipped_csv, and call .read() on the returned BytesIO before
write_bytes (two bugs: wrong type in, wrong type out)
sqlmesh: {{ var() }} inside SQL string literals is not substituted by
SQLMesh's Jinja (SQL parser treats them as opaque strings). Replace with
a @psd_glob() macro that evaluates LANDING_DIR at render time and returns
a quoted glob path string.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove distributed R2/Iceberg/SSH pipeline architecture in favor of
local subprocess execution with NVMe storage. Landing data backed up
to R2 via rclone timer.
- Strip Iceberg catalog, httpfs, boto3, paramiko, prefect, pyarrow
- Pipelines run via subprocess.run() with bounded timeouts
- Extract writes to {LANDING_DIR}/psd/{year}/{month}/{etag}.csv.gzip
- SQLMesh reads LANDING_DIR variable, writes to DUCKDB_PATH
- Delete unused provider stubs (ovh, scaleway, oracle)
- Add rclone systemd timer for R2 backup every 6h
- Update supervisor to run pipelines with env vars
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Key Changes
1. **Simplified extraction logic**
- Changed from downloading 220+ historical archives to checking only latest available month
- Tries current month and falls back up to 3 months (handles USDA publication lag)
- Architecture advisor insight: ETags naturally deduplicate, historical year/month structure was unnecessary
2. **Flat storage structure**
- Old: `data/{year}/{month}/{etag}.zip`
- New: `data/{etag}.zip` (local) or `psd/{etag}.zip` (R2)
- Migrated 226 existing files to flat structure
3. **Dual storage modes**
- **Local mode**: Downloads to local directory (development)
- **R2 mode**: Uploads to Cloudflare R2 (production)
- Mode determined by presence of R2 environment variables
- Added boto3 dependency for S3-compatible R2 API
4. **Updated raw SQLMesh model**
- Changed pattern from `**/*.zip` to `*.zip` to match flat structure
## Benefits
- Simpler: Single file check instead of 220+ URL attempts
- Efficient: ETag-based deduplication works naturally
- Flexible: Supports both local dev and production R2 storage
- Maintainable: Removed unnecessary complexity
## Testing
- ✅ Local extraction works and respects ETags
- ✅ Falls back correctly when current month unavailable
- ✅ Linting passes
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Update secret token: CLOUDFLARE_API_TOKEN → R2_ADMIN_API_TOKEN
- Update warehouse name: R2_WAREHOUSE_NAME → ICEBERG_WAREHOUSE_NAME
- Update endpoint: ICEBERG_REST_URI → ICEBERG_CATALOG_URI
- Remove CREATE SCHEMA and USE statements
- DuckDB has bug with Iceberg REST: missing Content-Type header
- Schema creation via SQL currently not supported
- Models will use fully-qualified table names instead
Successfully tested with real R2 credentials:
- Iceberg catalog attachment works ✓
- Plan dry-run executes ✓
- Only fails on missing source data (expected) ✓
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add catalog ATTACH statement in before_all with SECRET parameter
- References r2_secret created by connection configuration
- Uses proper DuckDB ATTACH syntax per Cloudflare docs
- Single-line format to avoid Jinja parsing issues
- Remove manual CREATE SECRET from before_all hooks
- Secret automatically created by SQLMesh from connection config
- Cleaner separation: connection defines credentials, hooks use them
Successfully tested - config validates without warnings.
Only fails on missing env vars (expected locally).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Move Iceberg secret from before_all hook to connection.secrets
- Fixes SQLMesh warning about unsupported @env_var syntax
- Uses Jinja templating {{ env_var() }} instead of @env_var()
- Remove database: ':memory:' (incompatible with catalogs)
- DuckDB doesn't allow both database and catalogs config
- Connection defaults to in-memory when no database specified
- Simplify before_all hooks to only handle ATTACH and schema setup
- Secret is now created automatically by SQLMesh
- Cleaner separation: connection config vs runtime setup
Based on:
- https://developers.cloudflare.com/r2/data-catalog/config-examples/duckdb/
- https://sqlmesh.readthedocs.io/en/latest/integrations/engines/duckdb/🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Remove dev gateway (local DuckDB file no longer needed)
- Single prod gateway connects to R2 Iceberg catalog
- Use virtual environments for dev isolation (e.g., dev_<username>)
- Update CLAUDE.md with new workflow and environment strategy
- Create comprehensive transform/sqlmesh_materia/README.md
Benefits:
- Simpler configuration (one gateway instead of two)
- All environments use same R2 Iceberg catalog
- SQLMesh handles environment isolation automatically
- No need to maintain local 13GB materia_dev.db file
- before_all hooks only run for prod gateway (no conditional logic needed)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>