# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Materia is a commodity data analytics platform (product: **BeanFlows.coffee**) for coffee traders. It's a uv workspace monorepo with three packages: extraction (USDA PSD data), SQL transformation (SQLMesh + DuckDB), and a CLI for orchestrating cloud workers and pipelines.
## Commands
```bash
# Install dependencies
uv sync
# Lint & format
ruff check . # Check
ruff check --fix . # Auto-fix
ruff format . # Format
# Tests
uv run pytest tests/ -v --cov=src/materia # CLI/Python tests
cd transform/sqlmesh_materia && uv run sqlmesh test # SQLMesh model tests
# Run a single test (example test sketch below the block)
uv run pytest tests/test_cli.py::test_name -v
# Extract data
uv run extract_psd
# SQLMesh (from repo root)
uv run sqlmesh -p transform/sqlmesh_materia plan # Plans to dev_<username> by default
uv run sqlmesh -p transform/sqlmesh_materia plan prod # Production
uv run sqlmesh -p transform/sqlmesh_materia test # Run model tests
uv run sqlmesh -p transform/sqlmesh_materia format # Format SQL
# With production secrets
esc run beanflows/prod -- <command>
# CLI
uv run materia worker create|destroy|list
uv run materia pipeline run
uv run materia secrets get
```
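For the single-test invocation above, a minimal CLI test could look like the following sketch, assuming Typer's `CliRunner` and the `version` subcommand listed under the CLI modules (test name and assertion are illustrative, not the repo's actual tests):
```python
# tests/test_cli.py — illustrative sketch
from typer.testing import CliRunner
from materia.cli import app  # the Typer app described in the Architecture section

runner = CliRunner()

def test_version():
    # Invoke the documented `version` subcommand in-process
    result = runner.invoke(app, ["version"])
    assert result.exit_code == 0
```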
## Architecture
**Workspace packages** (declared in `pyproject.toml` under `[tool.uv.workspace]`; see the sketch after this list):
- `extract/psdonline/` — Downloads USDA PSD Online data, normalizes ZIP→gzip CSV, uploads to R2
- `transform/sqlmesh_materia/` — 4-layer SQL transformation pipeline (DuckDB + Iceberg)
- `src/materia/` — CLI (Typer) for worker management, pipeline orchestration, secrets
- `web/` — Future web frontend
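A minimal sketch of the workspace declaration in the root `pyproject.toml`, assuming the member paths above (exact entries may differ; `src/materia` is the workspace root package itself):
```toml
# Root pyproject.toml — illustrative workspace members
[tool.uv.workspace]
members = [
    "extract/psdonline",
    "transform/sqlmesh_materia",
]
```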
**Data flow** (the extract step is sketched below the diagram):
```
USDA API → extract (psdonline) → R2/local CSV → SQLMesh transforms → DuckDB/Iceberg
```
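A rough sketch of the extract step, assuming `httpx` for the download and `boto3` against R2's S3-compatible endpoint; the URL, bucket, key, and env var names are placeholders, not taken from the repo:
```python
# Illustrative shape of extract/psdonline's flow — all names are assumptions
import gzip
import io
import os
import zipfile

import boto3
import httpx

PSD_URL = "https://apps.fas.usda.gov/psdonline/..."  # hypothetical download URL

def extract_psd() -> None:
    raw = httpx.get(PSD_URL, timeout=60).content
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        csv_bytes = zf.read(zf.namelist()[0])  # ZIP → raw CSV
    gz_bytes = gzip.compress(csv_bytes)        # CSV → gzip

    # R2 is S3-compatible; endpoint/credentials come from Pulumi ESC (assumed env var name)
    s3 = boto3.client("s3", endpoint_url=os.environ["R2_ENDPOINT"])
    s3.put_object(Bucket="materia-raw", Key="psd/psd.csv.gz", Body=gz_bytes)
```
Normalizing ZIP to gzip presumably keeps the files directly readable by DuckDB's `read_csv`, which handles gzip-compressed CSVs without an unzip step.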
**SQLMesh 4-layer model structure** (`transform/sqlmesh_materia/models/`; example model after this list):
1. `raw/` — Immutable source reads (DuckDB `read_csv` over extracted files)
2. `staging/` — Type casting, lookup joins, basic cleansing
3. `cleaned/` — Business logic, pivoting, integration
4. `serving/` — Analytics-ready facts, dimensions, aggregates
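A staging model in this layout might look like the following SQLMesh SQL model; the model name and columns are hypothetical, while the DuckDB dialect and `@daily` cron come from the configuration section below:
```sql
-- models/staging/stg_psd.sql — hypothetical name and columns
MODEL (
  name staging.stg_psd,
  kind FULL,
  cron '@daily'
);
SELECT
  CAST(commodity_code AS INT) AS commodity_code,  -- type casting (staging layer)
  CAST(market_year AS INT)    AS market_year,
  CAST(value AS DOUBLE)       AS value
FROM raw.psd
```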
**CLI modules** (`src/materia/`; minimal Typer sketch after this list):
- `cli.py` — Typer app with subcommands: worker, pipeline, secrets, version
- `workers.py` — Ephemeral cloud instance management (Hetzner today; OVH/Scaleway/Oracle planned)
- `pipelines.py` — SSH-based pipeline execution on workers (download artifact, run, destroy)
- `secrets.py` — Pulumi ESC integration for environment secrets
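The subcommand layout maps onto Typer roughly like this (a minimal sketch; command bodies and the version string are illustrative):
```python
# Shape of src/materia/cli.py — illustrative sketch, not the actual file
import typer

app = typer.Typer()
worker_app = typer.Typer(help="Ephemeral cloud workers")
app.add_typer(worker_app, name="worker")

@worker_app.command("list")
def list_workers() -> None:
    """List active worker instances."""
    ...

@app.command()
def version() -> None:
    typer.echo("materia x.y.z")  # placeholder version string

if __name__ == "__main__":
    app()
```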
**Infrastructure** (`infra/`; sketch after this list):
- Pulumi IaC for Cloudflare R2 buckets and Hetzner compute
- Supervisor systemd service for always-on orchestration (pulls git every 15 min)
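As a rough shape of the Pulumi program, assuming the `pulumi-cloudflare` and `pulumi-hcloud` providers; all resource names, IDs, and sizes are placeholders:
```python
# infra/__main__.py — illustrative Pulumi sketch; every identifier is a placeholder
import pulumi_cloudflare as cloudflare
import pulumi_hcloud as hcloud

bucket = cloudflare.R2Bucket(
    "materia-raw",
    account_id="<cloudflare-account-id>",
    name="materia-raw",
)

supervisor = hcloud.Server(
    "supervisor",
    server_type="cx22",   # placeholder instance size
    image="debian-12",
    location="nbg1",
)
```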
## Coding Philosophy
Read `coding_philosophy.md` for the full guide. Key points (illustrated after this list):
- **Simple, procedural code** — Functions over classes, no inheritance hierarchies, no "Manager" patterns
- **Data-oriented** — Use dicts/lists/tuples, not objects hiding data behind getters
- **Keep logic in SQL** — Let DuckDB do the heavy lifting, don't pull data into Python to transform it
- **Build minimum that works** — No premature abstraction, three examples before generalizing
- **Explicit over implicit** — No framework magic, no metaprogramming, no hidden behavior
- **Question every dependency** — Can you write it simply yourself? Are you using 5% of a large framework?
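A contrived illustration of the first two points (names are hypothetical):
```python
# Preferred: a plain function over plain data — no Manager class, no getters
def total_exports(rows: list[dict]) -> float:
    """Sum exports from rows shaped like {'country': str, 'exports': float}."""
    return sum(row["exports"] for row in rows)

# Avoided: an ExportManager class wrapping the same list behind get_/set_ methods
```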
## Key Configuration
- **Python 3.13** (`.python-version`)
- **Ruff**: double quotes, spaces, E501 ignored (formatter handles line length; config sketch below)
- **SQLMesh**: DuckDB dialect, `@daily` cron, start date `2025-07-07`, default env `dev_{{ user() }}`
- **Storage**: Cloudflare R2 with Iceberg catalog (zero egress cost)
- **Secrets**: Pulumi ESC (`esc run beanflows/prod -- <cmd>`)
- **CI**: GitLab CI (`.gitlab/.gitlab-ci.yml`) — runs pytest and sqlmesh test on push/MR
- **Pre-commit hooks**: installed via `pre-commit install`
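The Ruff settings above correspond roughly to this `pyproject.toml` fragment (a sketch of the stated settings, not necessarily the repo's exact file):
```toml
# pyproject.toml fragment — sketch of the Ruff settings described above
[tool.ruff.lint]
ignore = ["E501"]        # formatter owns line length
[tool.ruff.format]
quote-style = "double"
indent-style = "space"
```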