# Materia

A commodity data analytics platform built on a modern data engineering stack. Extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB + Cloudflare R2 for analysis.

## Tech Stack

- **Python 3.13** with `uv` package manager
- **SQLMesh** for SQL transformation and orchestration
- **DuckDB** as the analytical database
- **Cloudflare R2** (Iceberg) for data storage
- **Pulumi ESC** for secrets management
- **Hetzner Cloud** for infrastructure

## Quick Start

### 1. Install UV

UV is our Python package manager for faster, more reliable dependency management.

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

📚 [UV Documentation](https://docs.astral.sh/uv/)

### 2. Install Dependencies

```bash
uv sync
```

This installs Python and all dependencies declared in `pyproject.toml`.

### 3. Setup Pre-commit Hooks

```bash
pre-commit install
```

This enables automatic linting with `ruff` on every commit.

### 4. Install Pulumi ESC (for running with secrets)

```bash
# Install ESC CLI
curl -fsSL https://get.pulumi.com/esc/install.sh | sh

# Login
esc login
```

## Project Structure

This is a `uv` workspace with three main packages:

### Extract Layer (`extract/`)

**psdonline** - Extracts USDA PSD commodity data

```bash
# Local development (downloads to local directory)
uv run extract_psd

# Production (uploads to R2)
esc run beanflows/prod -- uv run extract_psd
```

### Transform Layer (`transform/sqlmesh_materia/`)

SQLMesh project implementing a 4-layer data architecture (raw → staging → cleaned → serving).

**All commands run from project root with `-p transform/sqlmesh_materia`:**

```bash
# Local development
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan dev_

# Production
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan prod

# Run tests (no secrets needed)
uv run sqlmesh -p transform/sqlmesh_materia test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format
```

### Core Package (`src/materia/`)

CLI for managing infrastructure and pipelines (currently minimal).

## Development Workflow

### Adding Dependencies

For workspace root:

```bash
uv add <package>
```

For specific package:

```bash
uv add --package psdonline <package>
```

### Linting and Formatting

```bash
# Check for issues
ruff check .

# Auto-fix issues
ruff check --fix .

# Format code
ruff format .
```

### Running Tests

```bash
# Python tests
uv run pytest tests/ -v --cov=src/materia

# SQLMesh tests
uv run sqlmesh -p transform/sqlmesh_materia test
```

## Secrets Management

All secrets are managed via **Pulumi ESC** environment `beanflows/prod`.

### Load secrets into shell:

```bash
eval $(esc env open beanflows/prod --format shell)
```

### Run commands with secrets:

```bash
# Single command
esc run beanflows/prod -- uv run extract_psd

# Multiple commands
esc run beanflows/prod -- bash -c "
  uv run extract_psd
  uv run sqlmesh -p transform/sqlmesh_materia plan prod
"
```

## Production Architecture

### Git-Based Deployment

- **Supervisor** (Hetzner CPX11): Always-on orchestrator that pulls latest code every 15 minutes
- **Workers** (Ephemeral): Created on-demand for each pipeline run, destroyed after completion
- **Storage**: Cloudflare R2 Data Catalog (Apache Iceberg REST API)

### CI/CD Pipeline

**GitLab CI** runs on every push to master:

1. **Lint** - `ruff check`
2. **Test** - pytest + SQLMesh tests
3. **Deploy** - Updates supervisor infrastructure and bootstraps if needed

No build artifacts - supervisor pulls code directly from git!

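A pull-based supervisor of this kind could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual supervisor script: the function names `needs_update` and `sync_once`, the checkout path in the commented loop, and the 900-second sleep (15 minutes) are all hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a git-pull sync step for an always-on supervisor.
# Function names and paths are illustrative assumptions, not project code.

# Return 0 (true) when the local commit differs from the remote commit.
needs_update() {
  local local_sha="$1" remote_sha="$2"
  [ "$local_sha" != "$remote_sha" ]
}

# Fetch the remote branch and fast-forward the checkout if it has moved.
sync_once() {
  local repo_dir="$1" branch="${2:-master}"
  local local_sha remote_sha

  git -C "$repo_dir" fetch --quiet origin "$branch"
  local_sha="$(git -C "$repo_dir" rev-parse HEAD)"
  remote_sha="$(git -C "$repo_dir" rev-parse "origin/$branch")"

  if needs_update "$local_sha" "$remote_sha"; then
    # --ff-only refuses to merge if the local checkout has diverged.
    git -C "$repo_dir" merge --ff-only --quiet "origin/$branch"
    echo "updated to $remote_sha"
  else
    echo "already up to date"
  fi
}

# Example loop, commented out so the file can be sourced safely.
# The path /opt/materia is a placeholder.
# while true; do
#   sync_once /opt/materia master
#   sleep 900
# done
```

Using `--ff-only` keeps the sketch conservative: if the checkout ever diverges from the remote, the sync fails loudly instead of creating a merge commit on the server.
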
## Architecture Principles

- **Simplicity First** - Avoid unnecessary abstractions
- **Data-Oriented Design** - Identify data by content, not metadata
- **Cost Optimization** - Ephemeral workers, minimal always-on infrastructure
- **Inspectable** - Easy to understand, test locally, and debug

## Resources

- **Architecture Plans**: See `.claude/plans/` for design decisions
- **UV Docs**: https://docs.astral.sh/uv/
- **SQLMesh Docs**: https://sqlmesh.readthedocs.io/