Deeman 38897617e7 Refactor PSD extraction: simplify to latest-only + add R2 support
## Key Changes

1. **Simplified extraction logic**
   - Changed from downloading 220+ historical archives to checking only the latest available month
   - Tries the current month and falls back up to 3 months (handles USDA publication lag)
   - Architecture advisor insight: ETags naturally deduplicate, so the historical year/month structure was unnecessary

2. **Flat storage structure**
   - Old: `data/{year}/{month}/{etag}.zip`
   - New: `data/{etag}.zip` (local) or `psd/{etag}.zip` (R2)
   - Migrated 226 existing files to flat structure

3. **Dual storage modes**
   - **Local mode**: Downloads to local directory (development)
   - **R2 mode**: Uploads to Cloudflare R2 (production)
   - Mode determined by presence of R2 environment variables
   - Added boto3 dependency for S3-compatible R2 API

4. **Updated raw SQLMesh model**
   - Changed pattern from `**/*.zip` to `*.zip` to match flat structure
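
The mode switch in (3) can be sketched in a few lines of Python (a minimal illustration, not the actual extraction code; the function name is hypothetical, the env var names are the ones this change introduces):

```python
import os

# R2 mode is selected only when every R2 credential is present in the
# environment; otherwise the extractor stays in local mode.
R2_VARS = ("R2_ENDPOINT", "R2_BUCKET", "R2_ACCESS_KEY", "R2_SECRET_KEY")

def storage_mode(env=os.environ) -> str:
    """Return 'r2' when all R2 credentials are set, otherwise 'local'."""
    return "r2" if all(env.get(var) for var in R2_VARS) else "local"
```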

## Benefits

- Simpler: Single file check instead of 220+ URL attempts
- Efficient: ETag-based deduplication works naturally
- Flexible: Supports both local dev and production R2 storage
- Maintainable: Removed unnecessary complexity

## Testing

- Local extraction works and respects ETags
- Falls back correctly when the current month is unavailable
- Linting passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-20 22:02:15 +02:00


CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Materia is a commodity data analytics platform built on a modern data engineering stack. The project extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB for analysis.

Tech Stack:

  • Python 3.13 with uv package manager
  • SQLMesh for SQL transformation and orchestration
  • DuckDB as the analytical database
  • Workspace structure with separate extract and transform packages

Environment Setup

Install dependencies:

uv sync

Setup pre-commit hooks:

pre-commit install

Add new dependencies:

uv add <package-name>

Project Structure

This is a uv workspace with three main components:

1. Extract Layer (extract/)

Contains extraction packages for pulling data from external sources.

  • extract/psdonline/: Extracts USDA PSD commodity data
    • Entry point: extract_psd CLI command (defined in extract/psdonline/src/psdonline/execute.py)
    • Checks the latest available monthly snapshot (tries the current month, then up to 3 months back)
    • Uses ETags to avoid re-downloading unchanged files
    • Storage modes:
      • Local mode (no R2 credentials): Downloads to extract/psdonline/src/psdonline/data/{etag}.zip
      • R2 mode (R2 credentials present): Uploads to s3://bucket/psd/{etag}.zip
    • Flat structure: files named by ETag for natural deduplication
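
The month-fallback logic can be sketched as follows (a minimal illustration, not the actual execute.py code; the function name is hypothetical):

```python
from datetime import date

def candidate_months(today: date, max_back: int = 3) -> list[tuple[int, int]]:
    """Months to try, newest first: the current month plus up to
    `max_back` earlier months, absorbing USDA's publication lag."""
    year, month = today.year, today.month
    months = []
    for _ in range(max_back + 1):
        months.append((year, month))
        year, month = (year, month - 1) if month > 1 else (year - 1, 12)
    return months
```

The extractor would request each month in turn, stop at the first archive that exists, and compare the response ETag against stored filenames to skip unchanged data.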

Run extraction:

extract_psd  # Local mode (default)

# R2 mode (requires env vars: R2_ENDPOINT, R2_BUCKET, R2_ACCESS_KEY, R2_SECRET_KEY)
export R2_ENDPOINT=...
export R2_BUCKET=...
export R2_ACCESS_KEY=...
export R2_SECRET_KEY=...
extract_psd
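
In R2 mode the upload can be sketched with boto3's S3-compatible client (illustrative only: `object_key` and `upload_to_r2` are hypothetical names, but the env vars and the flat `psd/{etag}.zip` layout are the ones documented above):

```python
def object_key(etag: str, mode: str) -> str:
    """Flat, ETag-named layout: psd/{etag}.zip on R2, data/{etag}.zip locally."""
    return f"psd/{etag}.zip" if mode == "r2" else f"data/{etag}.zip"

def upload_to_r2(path: str, etag: str) -> None:
    """Upload one archive to R2 via the S3-compatible API (requires boto3)."""
    import os
    import boto3
    client = boto3.client(
        "s3",
        endpoint_url=os.environ["R2_ENDPOINT"],
        aws_access_key_id=os.environ["R2_ACCESS_KEY"],
        aws_secret_access_key=os.environ["R2_SECRET_KEY"],
    )
    client.upload_file(path, os.environ["R2_BUCKET"], object_key(etag, "r2"))
```

Because files are named by ETag, re-uploading an unchanged archive simply overwrites an identical object, so deduplication falls out of the naming scheme.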

2. Transform Layer (transform/sqlmesh_materia/)

SQLMesh project implementing a layered data architecture.

Working directory: All SQLMesh commands must be run from transform/sqlmesh_materia/

Key commands:

cd transform/sqlmesh_materia

# Local development (creates virtual environment)
sqlmesh plan dev_<username>

# Production
sqlmesh plan prod

# Run tests
sqlmesh test

# Validate models
sqlmesh validate

# Run audits
sqlmesh audit

# Format SQL
sqlmesh format

# Start UI
sqlmesh ui

Configuration:

  • Config: transform/sqlmesh_materia/config.yaml
  • Single gateway: prod (connects to R2 Iceberg catalog)
  • Uses virtual environments for dev isolation (e.g., dev_deeman)
  • Production uses prod environment
  • Auto-apply enabled, no interactive prompts
  • DuckDB extensions: httpfs, iceberg

Environment Strategy:

  • All environments connect to the same R2 Iceberg catalog
  • Dev environments (e.g., dev_deeman) are isolated virtual environments
  • SQLMesh manages environment isolation and table versioning
  • No local DuckDB files needed
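
A config.yaml matching this setup might look roughly like the following (illustrative shape only, not the actual file; credentials come from the environment, never hardcoded):

```yaml
gateways:
  prod:
    connection:
      type: duckdb
      extensions:
        - httpfs
        - iceberg

default_gateway: prod

model_defaults:
  dialect: duckdb
  start: 2025-07-07
  cron: '@daily'
```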

3. Core Package (src/materia/)

Currently minimal; main logic resides in workspace packages.

Data Architecture

SQLMesh models follow a strict 4-layer architecture defined in transform/sqlmesh_materia/models/README.md:

Layer 1: Raw (models/raw/)

  • Purpose: Immutable archive of source data
  • Pattern: Directly reads from extraction outputs
  • Example: raw.psd_alldata reads zip files using DuckDB's read_csv('zip://...') function
  • Grain: Defines unique keys for each raw table

Layer 2: Staging (models/staging/)

  • Purpose: Apply schema, cast types, basic cleansing
  • Pattern: stg_[source]__[entity]
  • Example: stg_psdalldata__commodity.sql casts raw strings to proper types, joins lookup tables
  • Features:
    • Deduplication using hash keys
    • Extracts metadata (ingest_date) from file paths
    • 1:1 relationship with raw sources

Layer 3: Cleaned (models/cleaned/)

  • Purpose: Integration, business logic, unified models
  • Pattern: cln_[entity] or cln_[vault_component]_[entity]
  • Example: cln_psdalldata__commodity_pivoted.sql pivots commodity attributes into columns

Layer 4: Serving (models/serving/)

  • Purpose: Analytics-ready models (star schema, aggregates)
  • Patterns:
    • dim_[entity] for dimensions
    • fct_[process] for facts
    • agg_[description] for aggregates
    • obt_[description] for one-big-tables
  • Example: obt_commodity_metrics.sql provides wide table for analysis

Model Development

Incremental models:

  • Use INCREMENTAL_BY_TIME_RANGE kind
  • Define time_column (usually ingest_date)
  • Filter with WHERE time_column BETWEEN @start_ds AND @end_ds

Full refresh models:

  • Use FULL kind for small lookup tables and raw sources

Model properties:

  • grain: Define unique key columns for data quality
  • start: Historical backfill start date (project default: 2025-07-07)
  • cron: Schedule (project default: '@daily')

Linting and Formatting

Run linting:

ruff check .

Auto-fix issues:

ruff check --fix .

Format code:

ruff format .

Pre-commit hooks automatically run ruff on commits.

Testing

Run SQLMesh tests:

cd transform/sqlmesh_materia
sqlmesh test

Run Python tests (if configured):

pytest --cov=./ --cov-report=xml

CI/CD Pipeline and Production Architecture

CI/CD Pipeline (.gitlab-ci.yml)

3 Stages: Lint → Test → Deploy

1. Lint Stage

  • Runs ruff check on every commit
  • Validates code quality

2. Test Stage

  • test:cli: Runs pytest on materia CLI with 71% coverage
    • Tests secrets management (Pulumi ESC integration)
    • Tests worker lifecycle (create, list, destroy)
    • Tests pipeline execution (extract, transform)
    • Exports coverage reports to GitLab
  • test:sqlmesh: Runs SQLMesh model tests in transform layer

3. Deploy Stage (only on master branch)

  • deploy:infra: Runs pulumi up to ensure supervisor instance exists
    • Runs on every master push
    • Creates/updates Hetzner CPX11 supervisor instance (~€4.49/mo)
    • Uses Pulumi ESC (beanflows/prod) for all secrets
  • deploy:supervisor: Checks supervisor status
    • Verifies supervisor is bootstrapped
    • Supervisor auto-updates via git pull every 15 minutes (no CI/CD deployment needed)

Note: No build artifacts! Supervisor pulls code directly from git and runs via uv.

Production Architecture: Git-Based Deployment with Ephemeral Workers

Design Philosophy:

  • No always-on workers (cost optimization)
  • Supervisor pulls latest code from git (no artifact builds)
  • Supervisor dynamically creates/destroys workers on-demand
  • Simple, inspectable, easy to test locally
  • Multi-cloud abstraction for pricing optimization

Components:

1. Supervisor Instance (Small Hetzner VM)

  • Runs supervisor.sh - continuous orchestration loop (inspired by TigerBeetle's CFO supervisor)
  • Hetzner CPX11: 2 vCPU (shared), 2GB RAM (~€4.49/mo)
  • Always-on, minimal resource usage
  • Git-based deployment: git pull every 15 minutes for auto-updates
  • Runs pipelines on schedule:
    • Extract: Daily at 2 AM UTC
    • Transform: Daily at 3 AM UTC
  • Uses systemd service for automatic restart on failure
  • Pulls secrets from Pulumi ESC
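
The systemd service mentioned above might look roughly like this (unit name and paths are hypothetical):

```ini
[Unit]
Description=Materia supervisor loop
After=network-online.target

[Service]
ExecStart=/opt/materia/supervisor.sh
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target
```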

Bootstrap (one-time):

# Get supervisor IP from Pulumi
cd infra && pulumi stack output supervisor_ip -s prod

# Run bootstrap script
export PULUMI_ACCESS_TOKEN=<your-token>
ssh root@<supervisor-ip> 'bash -s' < infra/bootstrap_supervisor.sh

2. Ephemeral Workers (On-Demand)

  • Created for each pipeline execution by materia CLI
  • Receive secrets via SSH environment variable injection
  • Destroyed immediately after job completion
  • Different instance types per pipeline:
    • Extract: ccx12 (2 vCPU, 8GB RAM)
    • Transform: ccx22 (4 vCPU, 16GB RAM)

3. Secrets Flow

Pulumi ESC (beanflows/prod)
  ↓
Supervisor Instance (via esc CLI)
  ↓
Workers (injected as env vars via SSH)

4. Code Deployment Flow

GitLab (master branch)
  ↓
Supervisor: git pull origin master (every 15 min)
  ↓
Supervisor: uv sync (update dependencies)
  ↓
Supervisor: uv run materia pipeline run <pipeline>

5. Data Storage

  • All environments: DuckDB in-memory + Cloudflare R2 Data Catalog (Iceberg REST API)
    • ACID transactions on object storage
    • No persistent database on workers
    • Virtual environments for dev isolation (e.g., dev_deeman)

Execution Flow:

  1. Supervisor loop wakes up every 15 minutes
  2. Runs git fetch and checks if new commits on master
  3. If updates available: git pull && uv sync
  4. Checks if current time matches pipeline schedule (e.g., 2 AM for extract)
  5. If scheduled: uv run materia pipeline run extract
  6. CLI creates Hetzner worker with SSH key
  7. CLI injects secrets via SSH and executes pipeline
  8. Pipeline executes, writes to R2 Iceberg catalog
  9. Worker destroyed (entire lifecycle ~5-10 minutes)
  10. Supervisor logs results and continues loop
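
Steps 4-5 of the loop reduce to a schedule check, which can be sketched as (hypothetical helper; the real supervisor is supervisor.sh):

```python
from datetime import datetime

# Pipeline -> UTC hour, from the schedule above (extract 02:00, transform 03:00).
SCHEDULE = {"extract": 2, "transform": 3}

def due_pipelines(now: datetime, last_run: dict[str, str]) -> list[str]:
    """Pipelines whose UTC hour matches `now` and that have not run today.
    The loop wakes every 15 minutes, so last_run guards against re-runs."""
    today = now.strftime("%Y-%m-%d")
    return [
        name for name, hour in SCHEDULE.items()
        if now.hour == hour and last_run.get(name) != today
    ]
```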

Multi-Cloud Provider Abstraction:

  • Protocol-based interface (data-oriented design, no OOP)
  • Providers: Hetzner (implemented), OVH, Scaleway, Oracle (stubs)
  • Allows switching providers for cost optimization
  • Each provider implements: create_instance, destroy_instance, list_instances, wait_for_ssh
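
A protocol-based interface of this shape can be sketched with typing.Protocol (method names from the list above; signatures and the toy provider are illustrative):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class CloudProvider(Protocol):
    """Structural interface: any object with these methods satisfies it,
    no base class or inheritance required."""
    def create_instance(self, name: str, instance_type: str) -> str: ...
    def destroy_instance(self, instance_id: str) -> None: ...
    def list_instances(self) -> list[str]: ...
    def wait_for_ssh(self, instance_id: str, timeout: int = 300) -> bool: ...

class InMemoryProvider:
    """Toy stand-in showing that providers satisfy the protocol structurally."""
    def __init__(self) -> None:
        self._instances: dict[str, str] = {}
        self._next_id = 0
    def create_instance(self, name: str, instance_type: str) -> str:
        self._next_id += 1
        instance_id = f"i-{self._next_id}"
        self._instances[instance_id] = name
        return instance_id
    def destroy_instance(self, instance_id: str) -> None:
        self._instances.pop(instance_id, None)
    def list_instances(self) -> list[str]:
        return list(self._instances)
    def wait_for_ssh(self, instance_id: str, timeout: int = 300) -> bool:
        return instance_id in self._instances
```

Swapping Hetzner for OVH or Scaleway then means writing one new module with the same four functions, with no changes to the CLI that drives it.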

Key Design Patterns

Raw data ingestion:

  • DuckDB reads directly from zip archives using read_csv('zip://...')
  • filename=true captures source file path for metadata
  • union_by_name=true handles schema evolution

Deduplication:

  • Use hash() function to create unique keys
  • Use any_value() with GROUP BY hkey to deduplicate
  • Preserve all metadata in hash key for change detection
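
The same pattern expressed in Python terms (illustrative: MD5 stands in for DuckDB's hash(), and setdefault plays the role of any_value() per group):

```python
import hashlib

def dedupe(rows: list[dict]) -> list[dict]:
    """Keep one row per hash key. The key covers every column, so any
    change in the row's values produces a new key (change detection)."""
    seen: dict[str, dict] = {}
    for row in rows:
        hkey = hashlib.md5(repr(sorted(row.items())).encode()).hexdigest()
        seen.setdefault(hkey, row)  # like any_value(): first row per key wins
    return list(seen.values())
```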

Date handling:

  • Extract ingest dates from file paths: make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1)
  • Calculate market dates: last_day(make_date(market_year, month, 1))
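
Equivalent Python for both expressions (the [-4]/[-3] indices assume a nested zip path such as data/{year}/{month}/{etag}.zip/inner.csv, as produced by zip:// reads; helper names are hypothetical):

```python
import calendar
from datetime import date

def ingest_date(filename: str) -> date:
    """make_date(split(filename,'/')[-4]::int, split(filename,'/')[-3]::int, 1):
    year and month are the 4th- and 3rd-from-last path segments."""
    parts = filename.split("/")
    return date(int(parts[-4]), int(parts[-3]), 1)

def market_date(market_year: int, month: int) -> date:
    """last_day(make_date(market_year, month, 1)): last calendar day of the month."""
    return date(market_year, month, calendar.monthrange(market_year, month)[1])
```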

SQLMesh best practices:

  • Always define grain for data quality validation
  • Use meaningful model names following layer conventions
  • Leverage SQLMesh's built-in time macros (@start_ds, @end_ds)
  • Keep raw layer thin, push transformations to staging+

Data Storage

All data is stored in Cloudflare R2 Data Catalog (Apache Iceberg) via REST API:

  • Production environment: prod
  • Dev environments: dev_<username> (virtual environments)
  • SQLMesh manages environment isolation and table versioning
  • No local database files needed
  • We use a monorepo with uv workspaces
  • The pulumi env is called beanflows/prod
  • NEVER hardcode secrets in plaintext
  • Never add ssh keys to the git repo!
  • If a simpler, more direct solution exists with no other tradeoff, always choose the simpler solution