From 320ddd5123168539c43527a29f295e197e406ea1 Mon Sep 17 00:00:00 2001
From: Deeman
Date: Mon, 20 Oct 2025 22:55:58 +0200
Subject: [PATCH] Add architectural plan document for PSD extraction refactoring
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Documents the complete analysis, implementation, and results of the PSD extraction refactoring from the architecture advisor's recommendations.

Includes:
- Problem statement and key insights
- Architecture analysis (data-oriented approach)
- Implementation phases and results
- Testing outcomes and metrics
- 227 files migrated, ~40 lines reduced, 220+ → 1-4 requests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude
---
 .claude/plans/refactor-psd-extraction.md | 158 +++++++++++++++++++++++
 1 file changed, 158 insertions(+)
 create mode 100644 .claude/plans/refactor-psd-extraction.md

diff --git a/.claude/plans/refactor-psd-extraction.md b/.claude/plans/refactor-psd-extraction.md
new file mode 100644
index 0000000..1799cc8
--- /dev/null
+++ b/.claude/plans/refactor-psd-extraction.md
@@ -0,0 +1,158 @@
+# PSD Extraction Refactoring Plan
+
+**Status:** ✅ Completed
+**Branch:** `refactor/psd-extraction-r2`
+**Date:** 2025-10-20
+
+## Problem Statement
+
+The original PSD extraction implementation downloaded 220+ historical monthly archives from August 2006 to the present, storing them in a nested `{year}/{month}/{etag}.zip` directory structure. This approach was overengineered because:
+
+1. **ETags already provide deduplication** - Each unique data snapshot has a unique ETag
+2. **Historical year/month structure was redundant** - The publication date (year/month) is metadata, not data identity
+3. **No R2 support** - Files could only be stored locally, not in the production R2 bucket
+4. **Unnecessary complexity** - Downloading 220+ URLs hoping to find unique ETags when we only need the latest
+
+## Architecture Analysis
+
+### Key Insight
+
+**What does each file represent?**
+
+USDA publishes monthly snapshots, but most months it re-publishes the same data. The ETag tells you when the *actual data* has changed, not when USDA published it. The year/month structure is publication metadata, not data identity.
+
+### The Data-Oriented Question
+
+**What do we actually need?**
+
+We need to capture every unique data snapshot, and the ETag already identifies unique snapshots. The current approach downloads 220+ URLs to find unique ETags. The direct approach: check one URL (the current month), download if the ETag is new, done.
+
+## Proposed Solution
+
+### 1. Simplify to Current-Month-Only Extraction
+
+**Old approach:**
+```python
+for year in range(2006, today.year + 1):
+    for month in range(1, 13):
+        download(year, month)  # 220+ downloads
+```
+
+**New approach:**
+```python
+# Try the current month, falling back up to 3 months (handles publication lag)
+for months_back in range(4):
+    year, month0 = divmod(today.year * 12 + today.month - 1 - months_back, 12)
+    if download_if_exists(year, month0 + 1):  # wraps correctly across year boundaries
+        break
+```
+
+**Why this works:**
+- ETags naturally deduplicate
+- Historical snapshots were already captured by previous runs
+- Only the latest data needs to be checked
+- Same result, 220× less work
+
+### 2. Flatten Storage Structure
+
+**Old:** `data/{year}/{month}/{etag}.zip`
+**New:** `data/{etag}.zip` (local) or `landing/psd/{etag}.zip` (R2)
+
+**Benefits:**
+- ETag is the natural identifier
+- Simpler to manage
+- No nested directory traversal
+- Works identically for local and R2
+
+### 3. Dual Storage Modes
+
+**Local Mode (Development):**
+- No R2 credentials → downloads to a local directory
+- ETag-based deduplication via file-existence check
+- Use case: local development and testing
+
+**R2 Mode (Production):**
+- R2 credentials present → uploads to R2 only (no local storage)
+- ETag-based deduplication via S3 HEAD request
+- Use case: production pipelines on ephemeral workers
+
+**Mode Detection:**
+```python
+use_r2 = all([R2_ENDPOINT, R2_BUCKET, R2_ACCESS_KEY, R2_SECRET_KEY])
+```
+
+### 4. R2 Integration
+
+**Configuration:**
+- Bucket: `beanflows-data-prod`
+- Path: `landing/psd/{etag}.zip`
+- Credentials: via Pulumi ESC (`beanflows/prod`)
+- Library: boto3 with the S3-compatible API
+
+**Pulumi ESC Environment Variables:**
+- `R2_ENDPOINT`: account URL (without the bucket path)
+- `R2_BUCKET`: `beanflows-data-prod`
+- `R2_ADMIN_ACCESS_KEY_ID`: access key (fallback from `R2_ACCESS_KEY`)
+- `R2_ADMIN_SECRET_ACCESS_KEY`: secret key (fallback from `R2_SECRET_KEY`)
+
+## Implementation Summary
+
+### Phase 1: Simplify Extraction ✅
+- Changed the loop from 220+ historical downloads to a current-month check
+- Added fallback logic (tries up to 4 months back for publication lag)
+- Flattened storage to `{etag}.zip`
+- Updated the raw SQLMesh model pattern to `*.zip`
+
+### Phase 2: Add R2 Support ✅
+- Added the boto3 dependency
+- Implemented R2 upload with ETag deduplication
+- Added support for the ESC variable names
+- Updated the Pulumi ESC environment with `R2_BUCKET` and fixed `R2_ENDPOINT`
+
+### Phase 3: Historical Migration ✅
+- Created a temporary script to upload the 227 existing files to R2
+- All files are now in `landing/psd/*.zip`
+- Verified deduplication works both locally and on R2
+
+### Phase 4: Documentation ✅
+- Updated CLAUDE.md with a Pulumi ESC usage guide
+- Fixed the supervisor bootstrap documentation (automatic in CI/CD)
+- Added examples for running commands with ESC secrets
+
+## Benefits Achieved
+
+1. **Simplicity:** A single file check instead of 220+ URL attempts
+2. **Efficiency:** ETag-based deduplication works naturally
+3. **Flexibility:** Supports both local dev and production R2 storage
+4. **Maintainability:** Removed unnecessary complexity
+5. **Cost optimization:** Ephemeral workers don't need local storage
+6. **Data consistency:** All historical data is now in the R2 landing bucket
+
+## Testing Results
+
+✅ Local extraction works and respects ETags
+✅ R2 upload works (tested with Sept 2025 data)
+✅ R2 deduplication works (skips existing files)
+✅ Fallback logic works (tries the current month, falls back to Sept)
+✅ Historical migration completed (227 files uploaded)
+✅ All linting passes
+
+## Metrics
+
+- **Code reduction:** ~40 lines removed, ~80 lines added (net +40 for R2 support)
+- **Download efficiency:** 220+ requests → 1-4 requests
+- **Storage structure:** nested 3-level → flat 1-level
+- **Files migrated:** 227 historical files to R2
+- **Time to migrate:** ~2 minutes for 227 files (~2.3 GB)
+
+## Next Steps
+
+1. Update the SQLMesh raw model to support reading from R2 (future work)
+2. Merge the branch to master
+3. Deploy to production
+4. Monitor daily extraction runs
+
+## References
+
+- Architecture pattern: data-oriented design (identify data by content, not metadata)
+- Inspiration: ETag-based caching patterns
+- Storage: Cloudflare R2 (S3-compatible object storage)
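
The month-fallback and ETag-keyed deduplication the plan describes can be sketched as follows. This is a minimal illustration, not the committed implementation: `candidate_months` and `already_stored` are hypothetical helper names, and only the `R2_*` variable names and the `landing/psd/{etag}.zip` layout come from the plan itself.

```python
import os
from datetime import date
from pathlib import Path


def candidate_months(today, max_back=4):
    """Yield (year, month) for the current month, then up to max_back - 1 earlier months."""
    for back in range(max_back):
        year, month0 = divmod(today.year * 12 + today.month - 1 - back, 12)
        yield year, month0 + 1  # wraps correctly across year boundaries


def already_stored(etag, data_dir=Path("data")):
    """ETag-keyed dedup: file-existence check locally, S3 HEAD request in R2 mode."""
    r2_vars = ("R2_ENDPOINT", "R2_BUCKET", "R2_ACCESS_KEY", "R2_SECRET_KEY")
    if not all(os.environ.get(v) for v in r2_vars):
        return (data_dir / f"{etag}.zip").exists()  # local mode

    import boto3  # imported lazily: only needed in R2 mode
    from botocore.exceptions import ClientError

    client = boto3.client(
        "s3",
        endpoint_url=os.environ["R2_ENDPOINT"],
        aws_access_key_id=os.environ["R2_ACCESS_KEY"],
        aws_secret_access_key=os.environ["R2_SECRET_KEY"],
    )
    try:
        client.head_object(Bucket=os.environ["R2_BUCKET"],
                           Key=f"landing/psd/{etag}.zip")
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise
```

An extraction run would then walk `candidate_months(date.today())`, skip any snapshot whose ETag is `already_stored`, and stop after the first successful download.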