Refactor PSD extraction: simplify to latest-only + add R2 support
## Key Changes
1. **Simplified extraction logic**
- Changed from downloading 220+ historical archives to checking only latest available month
- Tries current month and falls back up to 3 months (handles USDA publication lag)
- Architecture advisor insight: ETags naturally deduplicate, historical year/month structure was unnecessary
2. **Flat storage structure**
- Old: `data/{year}/{month}/{etag}.zip`
- New: `data/{etag}.zip` (local) or `psd/{etag}.zip` (R2)
- Migrated 226 existing files to flat structure
3. **Dual storage modes**
- **Local mode**: Downloads to local directory (development)
- **R2 mode**: Uploads to Cloudflare R2 (production)
- Mode determined by presence of R2 environment variables
- Added boto3 dependency for S3-compatible R2 API
4. **Updated raw SQLMesh model**
- Changed pattern from `**/*.zip` to `*.zip` to match flat structure
## Benefits
- Simpler: Single file check instead of 220+ URL attempts
- Efficient: ETag-based deduplication works naturally
- Flexible: Supports both local dev and production R2 storage
- Maintainable: Removed unnecessary complexity
## Testing
- ✅ Local extraction works and respects ETags
- ✅ Falls back correctly when current month unavailable
- ✅ Linting passes
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
@@ -21,4 +21,4 @@ MODEL (
|
||||
)
|
||||
);
|
||||
SELECT *
|
||||
FROM read_csv('zip://extract/psdonline/src/psdonline/data/**/*.zip/*.csv', header=true, union_by_name=true, filename=true, names = ['commodity_code', 'commodity_description', 'country_code', 'country_name', 'market_year', 'calendar_year', 'month', 'attribute_id', 'attribute_description', 'unit_id', 'unit_description', 'value'], all_varchar=true)
|
||||
FROM read_csv('zip://extract/psdonline/src/psdonline/data/*.zip/*.csv', header=true, union_by_name=true, filename=true, names = ['commodity_code', 'commodity_description', 'country_code', 'country_name', 'market_year', 'calendar_year', 'month', 'attribute_id', 'attribute_description', 'unit_id', 'unit_description', 'value'], all_varchar=true)
|
||||
|
||||
Reference in New Issue
Block a user