# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Materia is a commodity data analytics platform built on a modern data engineering stack. The project extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB for analysis.
**Tech Stack:**

- Python 3.13 with the `uv` package manager
- SQLMesh for SQL transformation and orchestration
- DuckDB as the analytical database
- Workspace structure with separate extract and transform packages
## Environment Setup

Install dependencies:

```shell
uv sync
```

Set up pre-commit hooks:

```shell
pre-commit install
```

Add new dependencies:

```shell
uv add <package-name>
```
## Project Structure
This is a uv workspace with three main components:
### 1. Extract Layer (`extract/`)

Contains extraction packages for pulling data from external sources.

`extract/psdonline/`: Extracts USDA PSD commodity data from archives dating back to 2006.

- Entry point: `extract_psd` CLI command (defined in `extract/psdonline/src/psdonline/execute.py`)
- Downloads monthly zip archives to `extract/psdonline/src/psdonline/data/`
- Uses ETags to avoid re-downloading unchanged files

Run extraction:

```shell
extract_psd
```
### 2. Transform Layer (`transform/sqlmesh_materia/`)

SQLMesh project implementing a layered data architecture.

**Working directory:** All SQLMesh commands must be run from `transform/sqlmesh_materia/`.

Key commands:

```shell
cd transform/sqlmesh_materia

# Plan changes (no prompts, auto-apply enabled in config)
sqlmesh plan

# Run tests
sqlmesh test

# Validate models
sqlmesh validate

# Run audits
sqlmesh audit

# Format SQL
sqlmesh format

# Start UI
sqlmesh ui
```
**Configuration:**

- Config: `transform/sqlmesh_materia/config.yaml`
- Default gateway: `dev` (uses `materia_dev.db`)
- Production gateway: `prod` (uses `materia_prod.db`)
- Auto-apply enabled, no interactive prompts
- DuckDB extensions: zipfs, httpfs, iceberg
### 3. Core Package (`src/materia/`)

Currently minimal; the main logic resides in workspace packages.
## Data Architecture

SQLMesh models follow a strict 4-layer architecture defined in `transform/sqlmesh_materia/models/README.md`:
### Layer 1: Raw (`models/raw/`)

- **Purpose:** Immutable archive of source data
- **Pattern:** Directly reads from extraction outputs
- **Example:** `raw.psd_alldata` reads zip files using DuckDB's `read_csv('zip://...')` function
- **Grain:** Defines unique keys for each raw table
### Layer 2: Staging (`models/staging/`)

- **Purpose:** Apply schema, cast types, basic cleansing
- **Pattern:** `stg_[source]__[entity]`
- **Example:** `stg_psdalldata__commodity.sql` casts raw strings to proper types and joins lookup tables
- **Features:**
  - Deduplication using hash keys
  - Extracts metadata (`ingest_date`) from file paths
  - 1:1 relationship with raw sources
### Layer 3: Cleaned (`models/cleaned/`)

- **Purpose:** Integration, business logic, unified models
- **Pattern:** `cln_[entity]` or `cln_[vault_component]_[entity]`
- **Example:** `cln_psdalldata__commodity_pivoted.sql` pivots commodity attributes into columns
### Layer 4: Serving (`models/serving/`)

- **Purpose:** Analytics-ready models (star schema, aggregates)
- **Patterns:**
  - `dim_[entity]` for dimensions
  - `fct_[process]` for facts
  - `agg_[description]` for aggregates
  - `obt_[description]` for one-big-tables
- **Example:** `obt_commodity_metrics.sql` provides a wide table for analysis
## Model Development

**Incremental models:**

- Use the `INCREMENTAL_BY_TIME_RANGE` kind
- Define a `time_column` (usually `ingest_date`)
- Filter with `WHERE time_column BETWEEN @start_ds AND @end_ds`

**Full refresh models:**

- Use the `FULL` kind for small lookup tables and raw sources

**Model properties:**

- `grain`: Define unique key columns for data quality
- `start`: Historical backfill start date (project default: 2025-07-07)
- `cron`: Schedule (project default: `'@daily'`)
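Putting these conventions together, a minimal incremental staging model might look like the following sketch. The model name, source table, and columns are illustrative, not taken from this repository:

```sql
MODEL (
  name staging.stg_example__entity,  -- hypothetical name following stg_[source]__[entity]
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column ingest_date
  ),
  grain (entity_id, ingest_date),
  start '2025-07-07',
  cron '@daily'
);

SELECT
  entity_id,    -- illustrative columns
  value,
  ingest_date
FROM raw.example_source
-- SQLMesh substitutes the interval bounds for the built-in time macros
WHERE ingest_date BETWEEN @start_ds AND @end_ds
```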
## Linting and Formatting

Run linting:

```shell
ruff check .
```

Auto-fix issues:

```shell
ruff check --fix .
```

Format code:

```shell
ruff format .
```

Pre-commit hooks automatically run ruff on commits.
## Testing

Run SQLMesh tests:

```shell
cd transform/sqlmesh_materia
sqlmesh test
```

Run Python tests (if configured):

```shell
pytest --cov=./ --cov-report=xml
```
## CI/CD Pipeline and Production Architecture

### CI/CD Pipeline (`.gitlab-ci.yml`)

Four stages: Lint → Test → Build → Deploy
#### 1. Lint Stage

- Runs `ruff check` and `ruff format --check`
- Validates code quality on every commit
#### 2. Test Stage

- `test:cli`: Runs pytest on the materia CLI with 71% coverage
  - Tests secrets management (Pulumi ESC integration)
  - Tests worker lifecycle (create, list, destroy)
  - Tests pipeline execution (extract, transform)
  - Exports coverage reports to GitLab
- `test:sqlmesh`: Runs SQLMesh model tests in the transform layer
#### 3. Build Stage (only on master branch)

Creates separate artifacts for each workspace package:

- `build:extract`: Builds `materia-extract-latest.tar.gz` (psdonline package)
- `build:transform`: Builds `materia-transform-latest.tar.gz` (sqlmesh_materia package)
- `build:cli`: Builds `materia-cli-latest.tar.gz` (materia management CLI)

Each artifact is a self-contained tarball with all dependencies.
#### 4. Deploy Stage (only on master branch)

- `deploy:r2`: Uploads artifacts to Cloudflare R2 using rclone
  - Loads secrets from Pulumi ESC (`beanflows/prod`)
  - Only requires `PULUMI_ACCESS_TOKEN` in GitLab variables
  - All other secrets (R2 credentials, SSH keys, API tokens) come from ESC
- `deploy:infra`: Runs `pulumi up` to deploy infrastructure changes
  - Only triggers when `infra/**/*` files change
### Production Architecture: Ephemeral Worker Model

**Design Philosophy:**

- No always-on workers (cost optimization)
- A supervisor instance dynamically creates and destroys workers on demand
- Language-agnostic artifacts enable future migration to C/Rust/Go
- Multi-cloud abstraction for pricing optimization
**Components:**

#### 1. Supervisor Instance (Small Hetzner VM)

- Runs the `materia` management CLI
- Small, always-on instance (cheap)
- Pulls secrets from Pulumi ESC
- Orchestrates worker lifecycle via cloud provider APIs
#### 2. Ephemeral Workers (On-Demand)

- Created for each pipeline execution
- Download pre-built artifacts from R2 (no git, no uv on the worker)
- Receive secrets via SSH environment variable injection
- Destroyed immediately after job completion
- Different instance types per pipeline:
  - Extract: `ccx12` (2 vCPU, 8 GB RAM)
  - Transform: `ccx22` (4 vCPU, 16 GB RAM)
#### 3. Secrets Flow

```
Pulumi ESC (beanflows/prod)
        ↓
Supervisor Instance (materia CLI)
        ↓
Workers (injected as env vars via SSH)
```
#### 4. Artifact Flow

```
GitLab CI: uv build → tar.gz
        ↓
Cloudflare R2 (artifact storage)
        ↓
Worker: curl → extract → execute
```
#### 5. Data Storage

- Dev: Local DuckDB file (`materia_dev.db`)
- Prod: In-memory DuckDB + Cloudflare R2 Data Catalog (Iceberg REST API)
  - ACID transactions on object storage
  - No persistent database on workers
**Execution Flow:**

1. Supervisor receives a schedule trigger (cron or manual)
2. CLI runs `materia pipeline run extract`
3. CLI creates a Hetzner worker with an SSH key
4. Worker downloads `materia-extract-latest.tar.gz` from R2
5. CLI injects secrets via SSH: `export R2_ACCESS_KEY_ID=... && ./extract_psd`
6. Pipeline executes and writes to the R2 Iceberg catalog
7. Worker is destroyed (entire lifecycle: ~5-10 minutes)
**Multi-Cloud Provider Abstraction:**

- Protocol-based interface (data-oriented design, no OOP)
- Providers: Hetzner (implemented); OVH, Scaleway, Oracle (stubs)
- Allows switching providers for cost optimization
- Each provider implements `create_instance`, `destroy_instance`, `list_instances`, and `wait_for_ssh`
## Key Design Patterns

**Raw data ingestion:**

- DuckDB reads directly from zip archives using `read_csv('zip://...')`
- `filename=true` captures the source file path for metadata
- `union_by_name=true` handles schema evolution
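Combined, a raw-layer read might look like the following sketch. The archive path is illustrative (it assumes the zipfs extension is loaded and a `data/<year>/<month>/` layout):

```sql
-- Sketch: read CSV data directly out of a downloaded zip archive.
SELECT *
FROM read_csv(
  'zip://extract/psdonline/src/psdonline/data/2024/07/psd_alldata.zip/psd_alldata.csv',
  filename = true,      -- adds a `filename` column with the source path (used for metadata)
  union_by_name = true  -- aligns columns by name across files, tolerating schema evolution
)
```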
**Deduplication:**

- Use the `hash()` function to create unique keys
- Use `any_value()` with `GROUP BY hkey` to deduplicate
- Preserve all metadata in the hash key for change detection
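A minimal sketch of this pattern, with illustrative table and column names:

```sql
-- Sketch: collapse exact-duplicate rows behind a hash key.
SELECT
  -- include all meaningful columns in the hash so any change produces a new key
  hash(commodity_code, country_code, market_year, value, filename) AS hkey,
  any_value(commodity_code) AS commodity_code,
  any_value(country_code)   AS country_code,
  any_value(market_year)    AS market_year,
  any_value(value)          AS value,
  any_value(filename)       AS filename
FROM raw.example_source
GROUP BY hkey  -- rows sharing a key are identical, so any_value() is safe
```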
**Date handling:**

- Extract ingest dates from file paths: `make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1)`
- Calculate market dates: `last_day(make_date(market_year, month, 1))`
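In context, these expressions might be used as follows. This is a sketch with an illustrative source table; it assumes a directory layout where the year and month are the fourth- and third-from-last components of `filename`:

```sql
SELECT
  -- ingest date: year and month pulled from fixed positions in the file path
  make_date(
    split(filename, '/')[-4]::int,  -- year directory component
    split(filename, '/')[-3]::int,  -- month directory component
    1                               -- pin to the first of the month
  ) AS ingest_date,
  -- market date: last calendar day of the market year/month
  last_day(make_date(market_year, month, 1)) AS market_date
FROM raw.example_source
```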
**SQLMesh best practices:**

- Always define a `grain` for data quality validation
- Use meaningful model names following the layer conventions
- Leverage SQLMesh's built-in time macros (`@start_ds`, `@end_ds`)
- Keep the raw layer thin; push transformations to staging and beyond
## Database Location

- Dev database: `materia_dev.db` (13 GB, in the project root)
- Prod database: `materia_prod.db` (not yet created)

**Note:** The dev database is large and should not be committed to git (`.gitignore` is already configured).
## Additional Notes

- We use a monorepo with uv workspaces
- The Pulumi environment is called `beanflows/prod`