Add architectural plan document for PSD extraction refactoring
Documents the complete analysis, implementation, and results of the PSD extraction refactoring from the architecture advisor's recommendations. Includes: - Problem statement and key insights - Architecture analysis (data-oriented approach) - Implementation phases and results - Testing outcomes and metrics - 227 files migrated, ~40 lines reduced, 220+ → 1-4 requests 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
158
.claude/plans/refactor-psd-extraction.md
Normal file
158
.claude/plans/refactor-psd-extraction.md
Normal file
@@ -0,0 +1,158 @@
|
|||||||
|
# PSD Extraction Refactoring Plan
|
||||||
|
|
||||||
|
**Status:** ✅ Completed
|
||||||
|
**Branch:** `refactor/psd-extraction-r2`
|
||||||
|
**Date:** 2025-10-20
|
||||||
|
|
||||||
|
## Problem Statement
|
||||||
|
|
||||||
|
The original PSD extraction implementation downloaded 220+ historical monthly archives from August 2006 to present, storing them in a nested `{year}/{month}/{etag}.zip` directory structure. This approach was overengineered because:
|
||||||
|
|
||||||
|
1. **ETags already provide deduplication** - Each unique data snapshot has a unique ETag
|
||||||
|
2. **Historical year/month structure was redundant** - The publication date (year/month) is metadata, not data identity
|
||||||
|
3. **No R2 support** - Files could only be stored locally, not in production R2 bucket
|
||||||
|
4. **Unnecessary complexity** - Downloading 220+ URLs hoping to find unique ETags when we only need the latest
|
||||||
|
|
||||||
|
## Architecture Analysis
|
||||||
|
|
||||||
|
### Key Insight
|
||||||
|
|
||||||
|
**What does each file represent?**
|
||||||
|
|
||||||
|
USDA publishes monthly snapshots, but most months they re-publish the same data. The ETag tells you when the *actual data* has changed, not when USDA published it. The year/month structure is publication metadata, not data identity.
|
||||||
|
|
||||||
|
### The Data-Oriented Question
|
||||||
|
|
||||||
|
**What do we actually need?**
|
||||||
|
|
||||||
|
We need to capture every unique data snapshot. The ETag already identifies unique snapshots. The current approach downloads 220+ URLs to find unique ETags. The direct approach: check 1 URL (current month), download if new ETag, done.
|
||||||
|
|
||||||
|
## Proposed Solution
|
||||||
|
|
||||||
|
### 1. Simplify to Current-Month-Only Extraction
|
||||||
|
|
||||||
|
**Old approach:**
|
||||||
|
```python
|
||||||
|
for year in range(2006, today.year+1):
|
||||||
|
for month in range(1, 13):
|
||||||
|
download(year, month) # 220+ downloads
|
||||||
|
```
|
||||||
|
|
||||||
|
**New approach:**
|
||||||
|
```python
|
||||||
|
# Try current month, fallback 3 months (handles publication lag)
|
||||||
|
for months_back in range(4):
|
||||||
|
if download_if_exists(today.year, today.month - months_back):
|
||||||
|
break
|
||||||
|
```
|
||||||
|
|
||||||
|
**Why this works:**
|
||||||
|
- ETags naturally deduplicate
|
||||||
|
- Historical snapshots already captured from previous runs
|
||||||
|
- Only need to check for latest data
|
||||||
|
- Same result, 220x less work
|
||||||
|
|
||||||
|
### 2. Flatten Storage Structure
|
||||||
|
|
||||||
|
**Old:** `data/{year}/{month}/{etag}.zip`
|
||||||
|
**New:** `data/{etag}.zip` (local) or `landing/psd/{etag}.zip` (R2)
|
||||||
|
|
||||||
|
**Benefits:**
|
||||||
|
- ETag is the natural identifier
|
||||||
|
- Simpler to manage
|
||||||
|
- No nested directory traversal
|
||||||
|
- Works identically for local and R2
|
||||||
|
|
||||||
|
### 3. Dual Storage Modes
|
||||||
|
|
||||||
|
**Local Mode (Development):**
|
||||||
|
- No R2 credentials → downloads to local directory
|
||||||
|
- ETag-based deduplication via file existence check
|
||||||
|
- Use case: Local development and testing
|
||||||
|
|
||||||
|
**R2 Mode (Production):**
|
||||||
|
- R2 credentials present → uploads to R2 only (no local storage)
|
||||||
|
- ETag-based deduplication via S3 HEAD request
|
||||||
|
- Use case: Production pipelines on ephemeral workers
|
||||||
|
|
||||||
|
**Mode Detection:**
|
||||||
|
```python
|
||||||
|
use_r2 = all([R2_ENDPOINT, R2_BUCKET, R2_ACCESS_KEY, R2_SECRET_KEY])
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4. R2 Integration
|
||||||
|
|
||||||
|
**Configuration:**
|
||||||
|
- Bucket: `beanflows-data-prod`
|
||||||
|
- Path: `landing/psd/{etag}.zip`
|
||||||
|
- Credentials: Via Pulumi ESC (`beanflows/prod`)
|
||||||
|
- Library: boto3 with S3-compatible API
|
||||||
|
|
||||||
|
**Pulumi ESC Environment Variables:**
|
||||||
|
- `R2_ENDPOINT`: Account URL (without bucket path)
|
||||||
|
- `R2_BUCKET`: `beanflows-data-prod`
|
||||||
|
- `R2_ADMIN_ACCESS_KEY_ID`: Access key (fallback from R2_ACCESS_KEY)
|
||||||
|
- `R2_ADMIN_SECRET_ACCESS_KEY`: Secret key (fallback from R2_SECRET_KEY)
|
||||||
|
|
||||||
|
## Implementation Summary
|
||||||
|
|
||||||
|
### Phase 1: Simplify Extraction ✅
|
||||||
|
- Changed loop from 220+ historical downloads to current month check
|
||||||
|
- Added fallback logic (tries 4 months back for publication lag)
|
||||||
|
- Flattened storage to `{etag}.zip`
|
||||||
|
- Updated raw SQLMesh model pattern to `*.zip`
|
||||||
|
|
||||||
|
### Phase 2: Add R2 Support ✅
|
||||||
|
- Added boto3 dependency
|
||||||
|
- Implemented R2 upload with ETag deduplication
|
||||||
|
- Added support for ESC variable names
|
||||||
|
- Updated Pulumi ESC environment with R2_BUCKET and fixed R2_ENDPOINT
|
||||||
|
|
||||||
|
### Phase 3: Historical Migration ✅
|
||||||
|
- Created temporary script to upload 227 existing files to R2
|
||||||
|
- All files now in `landing/psd/*.zip`
|
||||||
|
- Verified deduplication works on both local and R2
|
||||||
|
|
||||||
|
### Phase 4: Documentation ✅
|
||||||
|
- Updated CLAUDE.md with Pulumi ESC usage guide
|
||||||
|
- Fixed supervisor bootstrap documentation (automatic in CI/CD)
|
||||||
|
- Added examples for running commands with ESC secrets
|
||||||
|
|
||||||
|
## Benefits Achieved
|
||||||
|
|
||||||
|
1. **Simplicity:** Single file check instead of 220+ URL attempts
|
||||||
|
2. **Efficiency:** ETag-based deduplication works naturally
|
||||||
|
3. **Flexibility:** Supports both local dev and production R2 storage
|
||||||
|
4. **Maintainability:** Removed unnecessary complexity
|
||||||
|
5. **Cost Optimization:** Ephemeral workers don't need local storage
|
||||||
|
6. **Data Consistency:** All historical data now in R2 landing bucket
|
||||||
|
|
||||||
|
## Testing Results
|
||||||
|
|
||||||
|
✅ Local extraction works and respects ETags
|
||||||
|
✅ R2 upload works (tested with Sept 2025 data)
|
||||||
|
✅ R2 deduplication works (skips existing files)
|
||||||
|
✅ Fallback logic works (tries current month, falls back to Sept)
|
||||||
|
✅ Historical migration completed (227 files uploaded)
|
||||||
|
✅ All linting passes
|
||||||
|
|
||||||
|
## Metrics
|
||||||
|
|
||||||
|
- **Code reduction:** ~40 lines removed, ~80 lines added (net +40 for R2 support)
|
||||||
|
- **Download efficiency:** 220+ requests → 1-4 requests
|
||||||
|
- **Storage structure:** Nested 3-level → Flat 1-level
|
||||||
|
- **Files migrated:** 227 historical files to R2
|
||||||
|
- **Time to migrate:** ~2 minutes for 227 files (~2.3 GB)
|
||||||
|
|
||||||
|
## Next Steps
|
||||||
|
|
||||||
|
1. Update SQLMesh raw model to support reading from R2 (future work)
|
||||||
|
2. Merge branch to master
|
||||||
|
3. Deploy to production
|
||||||
|
4. Monitor daily extraction runs
|
||||||
|
|
||||||
|
## References
|
||||||
|
|
||||||
|
- Architecture pattern: Data-oriented design (identify data by content, not metadata)
|
||||||
|
- Inspiration: ETag-based caching patterns
|
||||||
|
- Storage: Cloudflare R2 (S3-compatible object storage)
|
||||||
Reference in New Issue
Block a user