# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Materia is a commodity data analytics platform built on a modern data engineering stack. The project extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB for analysis.
Tech Stack:
- Python 3.13 with the `uv` package manager
- SQLMesh for SQL transformation and orchestration
- DuckDB as the analytical database
- Workspace structure with separate extract and transform packages
## Environment Setup

Install dependencies:

```bash
uv sync
```

Setup pre-commit hooks:

```bash
pre-commit install
```

Add new dependencies:

```bash
uv add <package-name>
```
## Secrets Management with Pulumi ESC

All secrets are managed via Pulumi ESC (Environments, Secrets, and Configuration). The production environment is `beanflows/prod`.
Load secrets into your shell:

```bash
# Login to Pulumi ESC (one-time)
esc login

# Load secrets as environment variables
eval $(esc env open beanflows/prod --format shell)

# Now all secrets are available as env vars
echo $R2_ENDPOINT  # Example: access R2 endpoint
```
Run commands with ESC secrets:

```bash
# Run a command with secrets loaded
esc run beanflows/prod -- uv run extract_psd

# Run multiple commands
esc run beanflows/prod -- bash -c "
  uv run extract_psd
  uv run sqlmesh -p transform/sqlmesh_materia plan prod
"
```
Available secrets in `beanflows/prod`:
- R2 storage: `R2_ENDPOINT`, `R2_BUCKET`, `R2_ACCESS_KEY`, `R2_SECRET_KEY`
- Hetzner Cloud: `HETZNER_TOKEN`, SSH keys
- GitLab: `GITLAB_READ_TOKEN`
- Iceberg catalog credentials
Note: Never hardcode secrets! Always use Pulumi ESC or environment variables.
## Project Structure

This is a `uv` workspace with three main components:
### 1. Extract Layer (`extract/`)

Contains extraction packages for pulling data from external sources.

`extract/psdonline/`: Extracts USDA PSD commodity data
- Entry point: `extract_psd` CLI command (defined in `extract/psdonline/src/psdonline/execute.py`)
- Checks latest available monthly snapshot (tries current month and 3 months back)
- Uses ETags to avoid re-downloading unchanged files
- Storage modes:
  - Local mode (no R2 credentials): Downloads to `extract/psdonline/src/psdonline/data/{etag}.zip`
  - R2 mode (R2 credentials present): Uploads to `s3://bucket/psd/{etag}.zip`
- Flat structure: files named by ETag for natural deduplication
Run extraction:

```bash
uv run extract_psd  # Local mode (default)

# R2 mode (requires env vars: R2_ENDPOINT, R2_BUCKET, R2_ACCESS_KEY, R2_SECRET_KEY)
export R2_ENDPOINT=...
export R2_BUCKET=...
export R2_ACCESS_KEY=...
export R2_SECRET_KEY=...
uv run extract_psd
```
### 2. Transform Layer (`transform/sqlmesh_materia/`)

SQLMesh project implementing a layered data architecture.

Working directory: All SQLMesh commands must be run from the project root with `-p transform/sqlmesh_materia`.
Key commands:

```bash
# Load secrets and run SQLMesh commands
# Always run from project root with -p flag and uv

# Local development (creates virtual environment)
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan dev_<username>

# Production
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan prod

# Run tests
uv run sqlmesh -p transform/sqlmesh_materia test

# Validate models
uv run sqlmesh -p transform/sqlmesh_materia validate

# Run audits (requires secrets)
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia audit

# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format

# Start UI (requires secrets)
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia ui
```
Configuration:
- Config: `transform/sqlmesh_materia/config.yaml`
- Single gateway: `prod` (connects to R2 Iceberg catalog)
- Uses virtual environments for dev isolation (e.g., `dev_deeman`)
- Production uses the `prod` environment
- Auto-apply enabled, no interactive prompts
- DuckDB extensions: httpfs, iceberg
Environment Strategy:
- All environments connect to the same R2 Iceberg catalog
- Dev environments (e.g., `dev_deeman`) are isolated virtual environments
- SQLMesh manages environment isolation and table versioning
- No local DuckDB files needed
### 3. Core Package (`src/materia/`)
Currently minimal; main logic resides in workspace packages.
## Data Architecture

SQLMesh models follow a strict 4-layer architecture defined in `transform/sqlmesh_materia/models/README.md`:
### Layer 1: Raw (`models/raw/`)
- Purpose: Immutable archive of source data
- Pattern: Directly reads from extraction outputs
- Example: `raw.psd_alldata` reads zip files using DuckDB's `read_csv('zip://...')` function
- Grain: Defines unique keys for each raw table
### Layer 2: Staging (`models/staging/`)
- Purpose: Apply schema, cast types, basic cleansing
- Pattern: `stg_[source]__[entity]`
- Example: `stg_psdalldata__commodity.sql` casts raw strings to proper types, joins lookup tables
- Features:
  - Deduplication using hash keys
  - Extracts metadata (`ingest_date`) from file paths
  - 1:1 relationship with raw sources
### Layer 3: Cleaned (`models/cleaned/`)
- Purpose: Integration, business logic, unified models
- Pattern: `cln_[entity]` or `cln_[vault_component]_[entity]`
- Example: `cln_psdalldata__commodity_pivoted.sql` pivots commodity attributes into columns (see the sketch below)
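The pivot pattern can be sketched with conditional aggregation. This is illustrative only: the source columns (`attribute`, `value`) and the attribute names are assumptions, not taken from the actual model.

```sql
-- Hypothetical sketch of pivoting attribute rows into columns
-- (source columns and attribute names are assumptions)
SELECT
  commodity_code,
  country_code,
  market_year,
  max(CASE WHEN attribute = 'Production' THEN value END) AS production,
  max(CASE WHEN attribute = 'Exports'    THEN value END) AS exports,
  max(CASE WHEN attribute = 'Imports'    THEN value END) AS imports
FROM staging.stg_psdalldata__commodity
GROUP BY commodity_code, country_code, market_year
```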
### Layer 4: Serving (`models/serving/`)
- Purpose: Analytics-ready models (star schema, aggregates)
- Patterns:
  - `dim_[entity]` for dimensions
  - `fct_[process]` for facts
  - `agg_[description]` for aggregates
  - `obt_[description]` for one-big-tables
- Example: `obt_commodity_metrics.sql` provides a wide table for analysis (illustrated below)
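As an illustration of the one-big-table shape, a serving model might join dimensions onto a fact and expose a wide, analysis-ready result. The source tables and columns below are hypothetical, not the repository's actual definitions.

```sql
-- Hypothetical one-big-table sketch in the spirit of obt_commodity_metrics
-- (all table and column names below are illustrative)
SELECT
  d.commodity_name,
  c.country_name,
  f.market_year,
  f.production,
  f.exports
FROM cleaned.cln_commodity AS f
JOIN serving.dim_commodity AS d USING (commodity_code)
JOIN serving.dim_country   AS c USING (country_code)
```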
## Model Development

Incremental models:
- Use the `INCREMENTAL_BY_TIME_RANGE` kind
- Define `time_column` (usually `ingest_date`)
- Filter with `WHERE time_column BETWEEN @start_ds AND @end_ds`

Full refresh models:
- Use the `FULL` kind for small lookup tables and raw sources

Model properties (see the combined sketch below):
- `grain`: Define unique key columns for data quality
- `start`: Historical backfill start date (project default: 2025-07-07)
- `cron`: Schedule (project default: '@daily')
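Putting these conventions together, a minimal incremental model might look like the following sketch. The model name, grain columns, and selected columns are assumptions for illustration, not an actual model from the repository.

```sql
-- Minimal sketch of an incremental SQLMesh model (names and columns assumed)
MODEL (
  name staging.stg_psdalldata__example,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column ingest_date
  ),
  grain (commodity_code, country_code, market_year, ingest_date),
  start '2025-07-07',
  cron '@daily'
);

SELECT
  commodity_code,
  country_code,
  market_year,
  value::DOUBLE AS value,   -- cast raw strings to proper types in staging
  ingest_date
FROM raw.psd_alldata
WHERE ingest_date BETWEEN @start_ds AND @end_ds
```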
## Linting and Formatting

Run linting:

```bash
ruff check .
```

Auto-fix issues:

```bash
ruff check --fix .
```

Format code:

```bash
ruff format .
```
Pre-commit hooks automatically run ruff on commits.
## Testing

Run SQLMesh tests:

```bash
uv run sqlmesh -p transform/sqlmesh_materia test
```

Run Python tests (if configured):

```bash
pytest --cov=./ --cov-report=xml
```
## CI/CD Pipeline and Production Architecture

### CI/CD Pipeline (`.gitlab-ci.yml`)

3 Stages: Lint → Test → Deploy
#### 1. Lint Stage
- Runs `ruff check` on every commit
- Validates code quality
2. Test Stage
test:cli: Runs pytest on materia CLI with 71% coverage- Tests secrets management (Pulumi ESC integration)
- Tests worker lifecycle (create, list, destroy)
- Tests pipeline execution (extract, transform)
- Exports coverage reports to GitLab
test:sqlmesh: Runs SQLMesh model tests in transform layer
#### 3. Deploy Stage (only on master branch)
- `deploy:infra`: Runs `pulumi up` to ensure the supervisor instance exists
  - Runs on every master push
  - Creates/updates the Hetzner CPX11 supervisor instance (~€4.49/mo)
  - Uses Pulumi ESC (`beanflows/prod`) for all secrets
- `deploy:supervisor`: Bootstraps and monitors the supervisor
  - Checks if the supervisor is already bootstrapped (`test -d /opt/materia/.git`)
  - If not bootstrapped: Runs `infra/bootstrap_supervisor.sh` automatically
  - If already bootstrapped: Verifies service status
  - After bootstrap: The supervisor auto-updates via `git pull` every 15 minutes

Note: No build artifacts! The supervisor pulls code directly from git and runs it via uv.
### Production Architecture: Git-Based Deployment with Ephemeral Workers
Design Philosophy:
- No always-on workers (cost optimization)
- Supervisor pulls latest code from git (no artifact builds)
- Supervisor dynamically creates/destroys workers on-demand
- Simple, inspectable, easy to test locally
- Multi-cloud abstraction for pricing optimization
Components:
#### 1. Supervisor Instance (Small Hetzner VM)
- Runs `supervisor.sh`, a continuous orchestration loop (inspired by TigerBeetle's CFO supervisor)
- Hetzner CPX11: 2 vCPU (shared), 2GB RAM (~€4.49/mo)
- Always-on, minimal resource usage
- Git-based deployment: `git pull` every 15 minutes for auto-updates
- Runs pipelines on schedule:
  - Extract: Daily at 2 AM UTC
  - Transform: Daily at 3 AM UTC
- Uses systemd service for automatic restart on failure
- Pulls secrets from Pulumi ESC
Bootstrap:
Bootstrapping happens automatically in CI/CD (`deploy:supervisor` stage). The pipeline:
- Checks if the supervisor is already bootstrapped
- If not: Runs `infra/bootstrap_supervisor.sh` with secrets injected
- If yes: Verifies systemd service status

Manual bootstrap (if needed):

```bash
cd infra && pulumi stack output supervisor_ip -s prod
export PULUMI_ACCESS_TOKEN=<your-token>
ssh root@<supervisor-ip> 'bash -s' < infra/bootstrap_supervisor.sh
```
#### 2. Ephemeral Workers (On-Demand)
- Created for each pipeline execution by the materia CLI
- Receive secrets via SSH environment variable injection
- Destroyed immediately after job completion
- Different instance types per pipeline:
  - Extract: `ccx12` (2 vCPU, 8GB RAM)
  - Transform: `ccx22` (4 vCPU, 16GB RAM)
#### 3. Secrets Flow

```
Pulumi ESC (beanflows/prod)
        ↓
Supervisor Instance (via esc CLI)
        ↓
Workers (injected as env vars via SSH)
```
#### 4. Code Deployment Flow

```
GitLab (master branch)
        ↓
Supervisor: git pull origin master (every 15 min)
        ↓
Supervisor: uv sync (update dependencies)
        ↓
Supervisor: uv run materia pipeline run <pipeline>
```
#### 5. Data Storage
- All environments: DuckDB in-memory + Cloudflare R2 Data Catalog (Iceberg REST API)
- ACID transactions on object storage
- No persistent database on workers
- Virtual environments for dev isolation (e.g., `dev_deeman`)
Execution Flow:
- Supervisor loop wakes up every 15 minutes
- Runs `git fetch` and checks for new commits on master
- If updates are available: `git pull && uv sync`
- Checks if the current time matches a pipeline schedule (e.g., 2 AM for extract)
- If scheduled: `uv run materia pipeline run extract`
- CLI creates a Hetzner worker with an SSH key
- CLI injects secrets via SSH and executes pipeline
- Pipeline executes, writes to R2 Iceberg catalog
- Worker destroyed (entire lifecycle ~5-10 minutes)
- Supervisor logs results and continues loop
Multi-Cloud Provider Abstraction:
- Protocol-based interface (data-oriented design, no OOP)
- Providers: Hetzner (implemented), OVH, Scaleway, Oracle (stubs)
- Allows switching providers for cost optimization
- Each provider implements: `create_instance`, `destroy_instance`, `list_instances`, `wait_for_ssh`
## Key Design Patterns

Raw data ingestion (see the sketch below):
- DuckDB reads directly from zip archives using `read_csv('zip://...')`
- `filename=true` captures the source file path for metadata
- `union_by_name=true` handles schema evolution
Deduplication (sketch below):
- Use the `hash()` function to create unique keys
- Use `any_value()` with `GROUP BY hkey` to deduplicate
- Preserve all metadata in the hash key for change detection
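For illustration, the pattern might look like the following. The table and column names are assumptions, and `concat_ws` is used here purely as one possible way to feed multiple columns into `hash()`.

```sql
-- Hypothetical deduplication sketch (table and columns assumed)
SELECT
  hash(concat_ws('|', commodity_code, country_code, market_year, value)) AS hkey,
  any_value(commodity_code) AS commodity_code,
  any_value(country_code)   AS country_code,
  any_value(market_year)    AS market_year,
  any_value(value)          AS value
FROM raw.psd_alldata
GROUP BY hkey
```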
Date handling (sketch below):
- Extract ingest dates from file paths: `make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1)`
- Calculate market dates: `last_day(make_date(market_year, month, 1))`
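The same expressions in context, as a hedged sketch; the source table and the assumption that the year and month sit at those path positions are illustrative.

```sql
-- Sketch of deriving dates (source table and path layout assumed)
SELECT
  make_date(
    split(filename, '/')[-4]::int,  -- year segment of the file path
    split(filename, '/')[-3]::int,  -- month segment of the file path
    1
  ) AS ingest_date,
  last_day(make_date(market_year, month, 1)) AS market_date  -- end of market month
FROM raw.psd_alldata
```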
SQLMesh best practices:
- Always define `grain` for data quality validation
- Use meaningful model names following layer conventions
- Leverage SQLMesh's built-in time macros (`@start_ds`, `@end_ds`)
- Keep the raw layer thin, push transformations to staging+
## Data Storage

All data is stored in the Cloudflare R2 Data Catalog (Apache Iceberg) via its REST API:
- Production environment: `prod`
- Dev environments: `dev_<username>` (virtual environments)
- SQLMesh manages environment isolation and table versioning
- No local database files needed
- We use a monorepo with uv workspaces
- The Pulumi ESC environment is called `beanflows/prod`
- NEVER hardcode secrets in plaintext
- Never add SSH keys to the git repo!
- If there is a simpler, more direct solution with no other tradeoff, always choose the simpler solution