# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Materia is a commodity data analytics platform (product: **BeanFlows.coffee**) for coffee traders. It's a uv workspace monorepo with three packages: extraction (USDA PSD data), SQL transformation (SQLMesh + DuckDB), and a CLI for worker management and local pipeline execution.

## Commands

```bash
# Install dependencies
uv sync

# Lint & format
ruff check .          # Check
ruff check --fix .    # Auto-fix
ruff format .         # Format

# Tests
uv run pytest tests/ -v --cov=src/materia            # CLI/Python tests
cd transform/sqlmesh_materia && uv run sqlmesh test  # SQLMesh model tests

# Run a single test
uv run pytest tests/test_cli.py::test_name -v

# Extract data
LANDING_DIR=data/landing uv run extract_psd

# SQLMesh (from repo root)
uv run sqlmesh -p transform/sqlmesh_materia plan       # Plans to dev_ by default
uv run sqlmesh -p transform/sqlmesh_materia plan prod  # Production
uv run sqlmesh -p transform/sqlmesh_materia test       # Run model tests
uv run sqlmesh -p transform/sqlmesh_materia format     # Format SQL

# CLI
uv run materia pipeline run extract|transform
uv run materia pipeline list
uv run materia worker create|destroy|list
uv run materia secrets get
```

## Architecture

**Workspace packages** (`pyproject.toml` → `tool.uv.workspace`):

- `extract/psdonline/` — Downloads USDA PSD Online data, normalizes ZIP→gzip CSV, writes to local landing directory
- `transform/sqlmesh_materia/` — 4-layer SQL transformation pipeline (local DuckDB)
- `src/materia/` — CLI (Typer) for pipeline execution, worker management, secrets
- `web/` — Future web frontend

**Data flow:**

```
USDA API → extract → /data/materia/landing/psd/{year}/{month}/{etag}.csv.gzip
  → rclone cron syncs landing/ to R2
  → SQLMesh raw → staging → cleaned → serving → /data/materia/lakehouse.duckdb
  → Web app reads lakehouse.duckdb (read-only)
```

**SQLMesh 4-layer model structure** (`transform/sqlmesh_materia/models/`):

1. `raw/` — Immutable source reads (read_csv from landing directory)
2. `staging/` — Type casting, lookup joins, basic cleansing
3. `cleaned/` — Business logic, pivoting, integration
4. `serving/` — Analytics-ready facts, dimensions, aggregates

**CLI modules** (`src/materia/`):

- `cli.py` — Typer app with subcommands: worker, pipeline, secrets, version
- `workers.py` — Hetzner cloud instance management (for ad-hoc compute)
- `pipelines.py` — Local subprocess pipeline execution with bounded timeouts
- `secrets.py` — Pulumi ESC integration for environment secrets

**Infrastructure** (`infra/`):

- Pulumi IaC for Cloudflare R2 buckets and Hetzner compute
- Supervisor systemd service for always-on orchestration (pulls git, runs pipelines)
- rclone systemd timer for landing data backup to R2

## Coding Philosophy

Read `coding_philosophy.md` for the full guide. Key points (a sketch of the preferred style follows this list):

- **Simple, procedural code** — Functions over classes, no inheritance hierarchies, no "Manager" patterns
- **Data-oriented** — Use dicts/lists/tuples, not objects hiding data behind getters
- **Keep logic in SQL** — Let DuckDB do the heavy lifting; don't pull data into Python to transform it
- **Build the minimum that works** — No premature abstraction; three examples before generalizing
- **Explicit over implicit** — No framework magic, no metaprogramming, no hidden behavior
- **Question every dependency** — Can you write it simply yourself? Are you using 5% of a large framework?
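A minimal, hypothetical sketch of that style: a plain function, built-in types in and out, and the transformation kept in DuckDB SQL. The table and column names are illustrative placeholders, not actual models from this repo; only `DUCKDB_PATH` comes from the configuration below.

```python
import os

import duckdb


def top_exporters(year: int, limit: int = 10) -> list[tuple[str, float]]:
    """Return (country, total_exports) pairs for a market year, computed in DuckDB."""
    # DUCKDB_PATH and its default mirror the Environment Variables table.
    db_path = os.environ.get("DUCKDB_PATH", "local.duckdb")
    con = duckdb.connect(db_path, read_only=True)
    try:
        # Aggregation stays in SQL; Python only passes parameters and
        # returns plain tuples. serving.psd_exports is a placeholder name.
        return con.execute(
            """
            SELECT country, SUM(exports) AS total_exports
            FROM serving.psd_exports
            WHERE market_year = ?
            GROUP BY country
            ORDER BY total_exports DESC
            LIMIT ?
            """,
            [year, limit],
        ).fetchall()
    finally:
        con.close()
```

No class, no getters, no abstraction layer: callers get a list of tuples and DuckDB does the aggregation.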
## Key Configuration

- **Python 3.13** (`.python-version`)
- **Ruff**: double quotes, spaces, E501 ignored (formatter handles line length)
- **SQLMesh**: DuckDB dialect, `@daily` cron, start date `2025-07-07`, default env `dev_{{ user() }}`
- **Storage**: Local NVMe (`LANDING_DIR`, `DUCKDB_PATH`), R2 for backup via rclone
- **Secrets**: Pulumi ESC (`esc run beanflows/prod -- `)
- **CI**: GitLab CI (`.gitlab/.gitlab-ci.yml`) — runs pytest and sqlmesh test on push/MR
- **Pre-commit hooks**: installed via `pre-commit install`

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root directory for extracted landing data |
| `DUCKDB_PATH` | `local.duckdb` | Path to the DuckDB lakehouse database |
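For illustration, a hedged sketch of how `LANDING_DIR` might be consumed to build the landing layout shown in the data-flow diagram. This exact helper does not necessarily exist in the repo, and the zero-padding of year/month is an assumption; the env var name and default come from the table above.

```python
import os
from pathlib import Path


def landing_path(source: str, year: int, month: int, etag: str) -> Path:
    """Build LANDING_DIR/<source>/<year>/<month>/<etag>.csv.gzip.

    Hypothetical helper; zero-padded month is an assumption, not confirmed
    by the repo's actual layout.
    """
    root = Path(os.environ.get("LANDING_DIR", "data/landing"))
    return root / source / f"{year:04d}" / f"{month:02d}" / f"{etag}.csv.gzip"


# Example: landing_path("psd", 2025, 7, "abc123")
# -> data/landing/psd/2025/07/abc123.csv.gzip
```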