
PSD Extraction Refactoring Plan

Status: Completed
Branch: refactor/psd-extraction-r2
Date: 2025-10-20

Problem Statement

The original PSD extraction implementation downloaded 220+ historical monthly archives from August 2006 to present, storing them in a nested {year}/{month}/{etag}.zip directory structure. This approach was overengineered because:

  1. ETags already provide deduplication - Each unique data snapshot has a unique ETag
  2. Historical year/month structure was redundant - The publication date (year/month) is metadata, not data identity
  3. No R2 support - Files could only be stored locally, not in production R2 bucket
  4. Unnecessary complexity - Downloading 220+ URLs hoping to find unique ETags when we only need the latest

Architecture Analysis

Key Insight

What does each file represent?

USDA publishes monthly snapshots, but most months they re-publish the same data. The ETag tells you when the actual data has changed, not when USDA published it. The year/month structure is publication metadata, not data identity.

The Data-Oriented Question

What do we actually need?

We need to capture every unique data snapshot. The ETag already identifies unique snapshots. The current approach downloads 220+ URLs to find unique ETags. The direct approach: check 1 URL (current month), download if new ETag, done.

Proposed Solution

1. Simplify to Current-Month-Only Extraction

Old approach:

for year in range(2006, today.year+1):
    for month in range(1, 13):
        download(year, month)  # 220+ downloads

New approach:

# Try the current month, then fall back up to 3 months (publication lag)
for months_back in range(4):
    year, month = today.year, today.month - months_back
    if month < 1:  # wrap into the previous year
        year, month = year - 1, month + 12
    if download_if_exists(year, month):
        break

Why this works:

  • ETags naturally deduplicate
  • Historical snapshots already captured from previous runs
  • Only need to check for latest data
  • Same result, 220x less work
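The fallback loop above needs month arithmetic that wraps across year boundaries (January minus one month is December of the previous year). A minimal sketch of that sequence, with `candidate_months` as a hypothetical helper name:

```python
from datetime import date

def candidate_months(today: date, fallback: int = 4):
    """Yield (year, month) pairs: the current month first, then up to
    `fallback - 1` months back, wrapping across year boundaries."""
    year, month = today.year, today.month
    for _ in range(fallback):
        yield year, month
        month -= 1
        if month == 0:  # stepped past January
            year, month = year - 1, 12
```

The extraction loop would then try `download_if_exists(year, month)` for each pair and stop at the first hit.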

2. Flatten Storage Structure

Old: data/{year}/{month}/{etag}.zip
New: data/{etag}.zip (local) or landing/psd/{etag}.zip (R2)

Benefits:

  • ETag is the natural identifier
  • Simpler to manage
  • No nested directory traversal
  • Works identically for local and R2
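With the flat layout, deriving a storage key is a one-liner. `storage_key` is a hypothetical helper name for illustration; only the path layout comes from this plan:

```python
def storage_key(etag: str, use_r2: bool) -> str:
    """The ETag is the whole identity: the key is just {etag}.zip
    under a mode-specific prefix."""
    prefix = "landing/psd/" if use_r2 else "data/"
    return f"{prefix}{etag}.zip"
```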

3. Dual Storage Modes

Local Mode (Development):

  • No R2 credentials → downloads to local directory
  • ETag-based deduplication via file existence check
  • Use case: Local development and testing

R2 Mode (Production):

  • R2 credentials present → uploads to R2 only (no local storage)
  • ETag-based deduplication via S3 HEAD request
  • Use case: Production pipelines on ephemeral workers

Mode Detection:

use_r2 = all([R2_ENDPOINT, R2_BUCKET, R2_ACCESS_KEY, R2_SECRET_KEY])

4. R2 Integration

Configuration:

  • Bucket: beanflows-data-prod
  • Path: landing/psd/{etag}.zip
  • Credentials: Via Pulumi ESC (beanflows/prod)
  • Library: boto3 with S3-compatible API

Pulumi ESC Environment Variables:

  • R2_ENDPOINT: Account URL (without bucket path)
  • R2_BUCKET: beanflows-data-prod
  • R2_ADMIN_ACCESS_KEY_ID: Access key (fallback from R2_ACCESS_KEY)
  • R2_ADMIN_SECRET_ACCESS_KEY: Secret key (fallback from R2_SECRET_KEY)
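Putting the variable names and fallbacks together, mode detection might look like the following (a sketch; the real code may differ):

```python
import os

def detect_r2_config():
    """Return R2 settings if all are present, else None (local mode).
    Prefers the ESC admin-key names, falling back to the short names."""
    cfg = {
        "endpoint": os.environ.get("R2_ENDPOINT"),
        "bucket": os.environ.get("R2_BUCKET"),
        "access_key": os.environ.get("R2_ADMIN_ACCESS_KEY_ID")
                      or os.environ.get("R2_ACCESS_KEY"),
        "secret_key": os.environ.get("R2_ADMIN_SECRET_ACCESS_KEY")
                      or os.environ.get("R2_SECRET_KEY"),
    }
    return cfg if all(cfg.values()) else None
```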

Implementation Summary

Phase 1: Simplify Extraction

  • Changed loop from 220+ historical downloads to current month check
  • Added fallback logic (tries the current month plus up to 3 months back for publication lag)
  • Flattened storage to {etag}.zip
  • Updated raw SQLMesh model pattern to *.zip

Phase 2: Add R2 Support

  • Added boto3 dependency
  • Implemented R2 upload with ETag deduplication
  • Added support for ESC variable names
  • Updated Pulumi ESC environment with R2_BUCKET and fixed R2_ENDPOINT

Phase 3: Historical Migration

  • Created temporary script to upload 227 existing files to R2
  • All files now in landing/psd/*.zip
  • Verified deduplication works on both local and R2

Phase 4: Documentation

  • Updated CLAUDE.md with Pulumi ESC usage guide
  • Fixed supervisor bootstrap documentation (automatic in CI/CD)
  • Added examples for running commands with ESC secrets

Benefits Achieved

  1. Simplicity: Single file check instead of 220+ URL attempts
  2. Efficiency: ETag-based deduplication works naturally
  3. Flexibility: Supports both local dev and production R2 storage
  4. Maintainability: Removed unnecessary complexity
  5. Cost Optimization: Ephemeral workers don't need local storage
  6. Data Consistency: All historical data now in R2 landing bucket

Testing Results

  • Local extraction works and respects ETags
  • R2 upload works (tested with Sept 2025 data)
  • R2 deduplication works (skips existing files)
  • Fallback logic works (tries current month, falls back to Sept)
  • Historical migration completed (227 files uploaded)
  • All linting passes

Metrics

  • Code change: ~40 lines removed, ~80 lines added (net +40, all for R2 support)
  • Download efficiency: 220+ requests → 1-4 requests
  • Storage structure: Nested 3-level → Flat 1-level
  • Files migrated: 227 historical files to R2
  • Time to migrate: ~2 minutes for 227 files (~2.3 GB)

Next Steps

  1. Update SQLMesh raw model to support reading from R2 (future work)
  2. Merge branch to master
  3. Deploy to production
  4. Monitor daily extraction runs

References

  • Architecture pattern: Data-oriented design (identify data by content, not metadata)
  • Inspiration: ETag-based caching patterns
  • Storage: Cloudflare R2 (S3-compatible object storage)