Refactor PSD extraction: simplify to latest-only + add R2 support
## Key Changes
1. **Simplified extraction logic**
- Changed from downloading 220+ historical archives to checking only the latest available month
- Tries the current month, then falls back up to 3 months (handles USDA publication lag)
- Architecture advisor insight: ETags naturally deduplicate, so the historical year/month structure was unnecessary
2. **Flat storage structure**
- Old: `data/{year}/{month}/{etag}.zip`
- New: `data/{etag}.zip` (local) or `psd/{etag}.zip` (R2)
- Migrated 226 existing files to flat structure
3. **Dual storage modes**
- **Local mode**: Downloads to local directory (development)
- **R2 mode**: Uploads to Cloudflare R2 (production)
- Mode determined by presence of R2 environment variables
- Added boto3 dependency for S3-compatible R2 API
4. **Updated raw SQLMesh model**
- Changed pattern from `**/*.zip` to `*.zip` to match flat structure
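A minimal sketch of the two decisions described above (month fallback and storage-mode selection), not the code in `execute.py`. The function names `candidate_months` and `storage_mode` are hypothetical, and the sketch assumes that R2 mode requires all four `R2_*` variables to be set; the commit only says the mode depends on their presence.

```python
from datetime import date

# Environment variable names as listed in CLAUDE.md for R2 mode.
R2_VARS = ("R2_ENDPOINT", "R2_BUCKET", "R2_ACCESS_KEY", "R2_SECRET_KEY")


def candidate_months(today: date, max_lag: int = 3) -> list[tuple[int, int]]:
    """Return (year, month) pairs: the current month first, then up to
    max_lag earlier months, rolling over year boundaries."""
    months = []
    year, month = today.year, today.month
    for _ in range(max_lag + 1):
        months.append((year, month))
        month -= 1
        if month == 0:  # stepped back past January
            year, month = year - 1, 12
    return months


def storage_mode(env: dict[str, str]) -> str:
    """Return "r2" when every R2 variable is set and non-empty, else "local".

    Assumption: all four variables are required to select R2 mode.
    """
    return "r2" if all(env.get(name) for name in R2_VARS) else "local"
```

An extractor built this way would try each candidate month in order and stop at the first one the server has published; passing `os.environ` to `storage_mode` would pick the backend.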
## Benefits
- Simpler: Single file check instead of 220+ URL attempts
- Efficient: ETag-based deduplication works naturally
- Flexible: Supports both local dev and production R2 storage
- Maintainable: Removed unnecessary complexity
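To illustrate why ETag-named files deduplicate on their own, here is a small sketch under stated assumptions. The helper names (`etag_to_name`, `storage_key`, `needs_download`) are hypothetical, and stripping the surrounding quotes and any weak-validator `W/` prefix from the ETag header is an assumption about normalization, not something this commit specifies.

```python
def etag_to_name(etag: str) -> str:
    """Turn an HTTP ETag header value into a flat file name.

    Assumption: quotes and a leading weak-validator marker (W/) are dropped.
    """
    tag = etag.strip()
    if tag.startswith("W/"):
        tag = tag[2:]
    return tag.strip('"') + ".zip"


def storage_key(etag: str, mode: str) -> str:
    """Build the flat path: data/{etag}.zip locally, psd/{etag}.zip on R2."""
    prefix = "psd" if mode == "r2" else "data"
    return f"{prefix}/{etag_to_name(etag)}"


def needs_download(etag: str, existing_names: set[str]) -> bool:
    """A snapshot is new only if no stored file already carries its ETag."""
    return etag_to_name(etag) not in existing_names
```

Because the name is derived only from the ETag, an unchanged snapshot maps to the same key regardless of which month it was fetched in, so no year/month bookkeeping is needed.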
## Testing
- ✅ Local extraction works and respects ETags
- ✅ Falls back correctly when current month unavailable
- ✅ Linting passes
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
CLAUDE.md
@@ -36,13 +36,24 @@ This is a uv workspace with three main components:
 ### 1. Extract Layer (`extract/`)
 Contains extraction packages for pulling data from external sources.
 
-- **`extract/psdonline/`**: Extracts USDA PSD commodity data from archives dating back to 2006
+- **`extract/psdonline/`**: Extracts USDA PSD commodity data
 - Entry point: `extract_psd` CLI command (defined in `extract/psdonline/src/psdonline/execute.py`)
-- Downloads monthly zip archives to `extract/psdonline/src/psdonline/data/`
+- Checks latest available monthly snapshot (tries current month and 3 months back)
 - Uses ETags to avoid re-downloading unchanged files
+- Storage modes:
+- **Local mode** (no R2 credentials): Downloads to `extract/psdonline/src/psdonline/data/{etag}.zip`
+- **R2 mode** (R2 credentials present): Uploads to `s3://bucket/psd/{etag}.zip`
+- Flat structure: files named by ETag for natural deduplication
 
 **Run extraction:**
 ```bash
+extract_psd # Local mode (default)
+
+# R2 mode (requires env vars: R2_ENDPOINT, R2_BUCKET, R2_ACCESS_KEY, R2_SECRET_KEY)
+export R2_ENDPOINT=...
+export R2_BUCKET=...
+export R2_ACCESS_KEY=...
+export R2_SECRET_KEY=...
 extract_psd
 ```
 