# PSD Extraction Refactoring Plan

**Status:** ✅ Completed
**Branch:** `refactor/psd-extraction-r2`
**Date:** 2025-10-20

## Problem Statement

The original PSD extraction implementation downloaded 220+ historical monthly archives, from August 2006 to the present, storing them in a nested `{year}/{month}/{etag}.zip` directory structure. This approach was overengineered because:

1. **ETags already provide deduplication** - Each unique data snapshot has a unique ETag
2. **The historical year/month structure was redundant** - The publication date (year/month) is metadata, not data identity
3. **No R2 support** - Files could only be stored locally, not in the production R2 bucket
4. **Unnecessary complexity** - Downloading 220+ URLs hoping to find unique ETags when only the latest is needed

## Architecture Analysis

### Key Insight

**What does each file represent?** USDA publishes monthly snapshots, but most months it re-publishes the same data. The ETag tells you when the *actual data* has changed, not when USDA published it. The year/month structure is publication metadata, not data identity.

### The Data-Oriented Question

**What do we actually need?** We need to capture every unique data snapshot. The ETag already identifies unique snapshots. The current approach downloads 220+ URLs to find unique ETags. The direct approach: check one URL (the current month), download only if the ETag is new, done.

## Proposed Solution

### 1. Simplify to Current-Month-Only Extraction

**Old approach:**

```python
for year in range(2006, today.year + 1):
    for month in range(1, 13):
        download(year, month)  # 220+ downloads
```

**New approach:**

```python
# Try the current month, then fall back up to 3 months (handles publication lag).
# divmod over a month count handles the year rollover (e.g. January - 1 → December).
for months_back in range(4):
    year, month0 = divmod(today.year * 12 + today.month - 1 - months_back, 12)
    if download_if_exists(year, month0 + 1):
        break
```

**Why this works:**

- ETags naturally deduplicate
- Historical snapshots were already captured by previous runs
- Only the latest data needs to be checked
- Same result, ~220× less work

### 2. Flatten Storage Structure

**Old:** `data/{year}/{month}/{etag}.zip`
**New:** `data/{etag}.zip` (local) or `landing/psd/{etag}.zip` (R2)

**Benefits:**

- The ETag is the natural identifier
- Simpler to manage
- No nested directory traversal
- Works identically for local and R2

### 3. Dual Storage Modes

**Local Mode (Development):**

- No R2 credentials → downloads to a local directory
- ETag-based deduplication via file existence check
- Use case: local development and testing

**R2 Mode (Production):**

- R2 credentials present → uploads to R2 only (no local storage)
- ETag-based deduplication via S3 HEAD request
- Use case: production pipelines on ephemeral workers

**Mode Detection:**

```python
use_r2 = all([R2_ENDPOINT, R2_BUCKET, R2_ACCESS_KEY, R2_SECRET_KEY])
```

### 4. R2 Integration

**Configuration:**

- Bucket: `beanflows-data-prod`
- Path: `landing/psd/{etag}.zip`
- Credentials: via Pulumi ESC (`beanflows/prod`)
- Library: boto3 with the S3-compatible API

**Pulumi ESC Environment Variables:**

- `R2_ENDPOINT`: Account URL (without bucket path)
- `R2_BUCKET`: `beanflows-data-prod`
- `R2_ADMIN_ACCESS_KEY_ID`: Access key (fallback from `R2_ACCESS_KEY`)
- `R2_ADMIN_SECRET_ACCESS_KEY`: Secret key (fallback from `R2_SECRET_KEY`)

## Implementation Summary

### Phase 1: Simplify Extraction ✅

- Changed the loop from 220+ historical downloads to a current-month check
- Added fallback logic (tries up to 4 months back for publication lag)
- Flattened storage to `{etag}.zip`
- Updated the raw SQLMesh model pattern to `*.zip`

### Phase 2: Add R2 Support ✅

- Added the boto3 dependency
- Implemented R2 upload with ETag deduplication
- Added support for ESC variable names
- Updated the Pulumi ESC environment with `R2_BUCKET` and fixed `R2_ENDPOINT`

### Phase 3: Historical Migration ✅

- Created a temporary script to upload the 227 existing files to R2
- All files are now in `landing/psd/*.zip`
- Verified deduplication works both locally and on R2

### Phase 4: Documentation ✅

- Updated CLAUDE.md with a Pulumi ESC usage guide
- Fixed supervisor bootstrap documentation (automatic in CI/CD)
- Added examples for running commands with ESC secrets

## Benefits Achieved

1. **Simplicity:** A single file check instead of 220+ URL attempts
2. **Efficiency:** ETag-based deduplication works naturally
3. **Flexibility:** Supports both local dev and production R2 storage
4. **Maintainability:** Removed unnecessary complexity
5. **Cost Optimization:** Ephemeral workers don't need local storage
6. **Data Consistency:** All historical data is now in the R2 landing bucket

## Testing Results

- ✅ Local extraction works and respects ETags
- ✅ R2 upload works (tested with Sept 2025 data)
- ✅ R2 deduplication works (skips existing files)
- ✅ Fallback logic works (tries the current month, falls back to Sept)
- ✅ Historical migration completed (227 files uploaded)
- ✅ All linting passes

## Metrics

- **Code change:** ~40 lines removed, ~80 lines added (net +40, for R2 support)
- **Download efficiency:** 220+ requests → 1-4 requests
- **Storage structure:** nested 3-level → flat 1-level
- **Files migrated:** 227 historical files to R2
- **Time to migrate:** ~2 minutes for 227 files (~2.3 GB)

## Next Steps

1. Update the raw SQLMesh model to support reading from R2 (future work)
2. Merge the branch to master
3. Deploy to production
4. Monitor daily extraction runs

## References

- Architecture pattern: data-oriented design (identify data by content, not metadata)
- Inspiration: ETag-based caching patterns
- Storage: Cloudflare R2 (S3-compatible object storage)
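
## Appendix: Deduplication Sketch

The dual-mode deduplication described in sections 2-3 can be sketched as below. This is a minimal illustration, not the actual implementation: the function names (`already_stored`, `exists_in_r2`) and signatures are hypothetical, and credential wiring is elided. The flat `{etag}.zip` layout makes local dedup a file-existence check, while R2 dedup is an S3 `HeadObject` call where a 404 means the snapshot is new.

```python
from pathlib import Path


def already_stored(etag: str, data_dir: Path) -> bool:
    """Local-mode dedup: with the flat {etag}.zip layout, dedup is a file check."""
    return (data_dir / f"{etag}.zip").exists()


def exists_in_r2(etag: str, bucket: str, endpoint: str) -> bool:
    """R2-mode dedup: HEAD the flat key; a 404 means this snapshot is new."""
    import boto3  # deferred import: only needed when R2 credentials are present
    import botocore.exceptions

    # Credentials are assumed to come from the environment (Pulumi ESC).
    client = boto3.client("s3", endpoint_url=endpoint)
    try:
        client.head_object(Bucket=bucket, Key=f"landing/psd/{etag}.zip")
        return True
    except botocore.exceptions.ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise  # auth/network errors should not be mistaken for "new data"
```

A caller would pick the mode with the `use_r2` flag from section 3 and skip the upload whenever the check returns `True`.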