diff --git a/README.md b/README.md
index e227e0f..cb83bf4 100644
--- a/README.md
+++ b/README.md
@@ -1,39 +1,181 @@
-# Materia Environment Setup
+# Materia
-We use `uv` as our Python package manager for faster, more reliable dependency management.
-https://docs.astral.sh/uv/
+A commodity data analytics platform built on a modern data engineering stack. Extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB + Cloudflare R2 for analysis.
-We recommend using vscode as your IDE.
-https://code.visualstudio.com/
+## Tech Stack
+
+- **Python 3.13** with `uv` package manager
+- **SQLMesh** for SQL transformation and orchestration
+- **DuckDB** as the analytical database
+- **Cloudflare R2** (Iceberg) for data storage
+- **Pulumi ESC** for secrets management
+- **Hetzner Cloud** for infrastructure
+
+## Quick Start
 ### 1. Install UV
+UV is our Python package manager for faster, more reliable dependency management.
+
 ```bash
 curl -LsSf https://astral.sh/uv/install.sh | sh
 ```
-### 2. Setup the env
-Simply run:
+📚 [UV Documentation](https://docs.astral.sh/uv/)
+
+### 2. Install Dependencies
 ```bash
 uv sync
 ```
-This will install python & the dependencies declared so far
-### 3. Setup pre-commit
+This installs Python and all dependencies declared in `pyproject.toml`.
+
+### 3. Setup Pre-commit Hooks
+
 ```bash
 pre-commit install
 ```
-### 4. Adding a dependency
+This enables automatic linting with `ruff` on every commit.
+
+### 4. Install Pulumi ESC (for running with secrets)
 ```bash
-uv add requests
+# Install ESC CLI
+curl -fsSL https://get.pulumi.com/esc/install.sh | sh
+
+# Login
+esc login
 ```
-# Managing a project with uv
+## Project Structure
-https://docs.astral.sh/uv/guides/projects/#managing-dependencies
+This is a `uv` workspace with three main packages:
+### Extract Layer (`extract/`)
-test
+**psdonline** - Extracts USDA PSD commodity data
+
+```bash
+# Local development (downloads to local directory)
+uv run extract_psd
+
+# Production (uploads to R2)
+esc run beanflows/prod -- uv run extract_psd
+```
+
+### Transform Layer (`transform/sqlmesh_materia/`)
+
+SQLMesh project implementing a 4-layer data architecture (raw → staging → cleaned → serving).
+
+**All commands run from project root with `-p transform/sqlmesh_materia`:**
+
+```bash
+# Local development
+esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan dev_
+
+# Production
+esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan prod
+
+# Run tests (no secrets needed)
+uv run sqlmesh -p transform/sqlmesh_materia test
+
+# Format SQL
+uv run sqlmesh -p transform/sqlmesh_materia format
+```
+
+### Core Package (`src/materia/`)
+
+CLI for managing infrastructure and pipelines (currently minimal).
+
+## Development Workflow
+
+### Adding Dependencies
+
+For workspace root:
+```bash
+uv add
+```
+
+For specific package:
+```bash
+uv add --package psdonline
+```
+
+### Linting and Formatting
+
+```bash
+# Check for issues
+ruff check .
+
+# Auto-fix issues
+ruff check --fix .
+
+# Format code
+ruff format .
+```
+
+### Running Tests
+
+```bash
+# Python tests
+uv run pytest tests/ -v --cov=src/materia
+
+# SQLMesh tests
+uv run sqlmesh -p transform/sqlmesh_materia test
+```
+
+## Secrets Management
+
+All secrets are managed via **Pulumi ESC** environment `beanflows/prod`.
+
+### Load secrets into shell:
+
+```bash
+eval $(esc env open beanflows/prod --format shell)
+```
+
+### Run commands with secrets:
+
+```bash
+# Single command
+esc run beanflows/prod -- uv run extract_psd
+
+# Multiple commands
+esc run beanflows/prod -- bash -c "
+  uv run extract_psd
+  uv run sqlmesh -p transform/sqlmesh_materia plan prod
+"
+```
+
+## Production Architecture
+
+### Git-Based Deployment
+
+- **Supervisor** (Hetzner CPX11): Always-on orchestrator that pulls latest code every 15 minutes
+- **Workers** (Ephemeral): Created on-demand for each pipeline run, destroyed after completion
+- **Storage**: Cloudflare R2 Data Catalog (Apache Iceberg REST API)
+
+### CI/CD Pipeline
+
+**GitLab CI** runs on every push to master:
+
+1. **Lint** - `ruff check`
+2. **Test** - pytest + SQLMesh tests
+3. **Deploy** - Updates supervisor infrastructure and bootstraps if needed
+
+No build artifacts - supervisor pulls code directly from git!
+
+## Architecture Principles
+
+- **Simplicity First** - Avoid unnecessary abstractions
+- **Data-Oriented Design** - Identify data by content, not metadata
+- **Cost Optimization** - Ephemeral workers, minimal always-on infrastructure
+- **Inspectable** - Easy to understand, test locally, and debug
+
+## Resources
+
+- **Architecture Plans**: See `.claude/plans/` for design decisions
+- **UV Docs**: https://docs.astral.sh/uv/
+- **SQLMesh Docs**: https://sqlmesh.readthedocs.io/
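
The "Run commands with secrets" pattern in the new README (extract first, then `sqlmesh plan prod`, both under `esc run beanflows/prod`) could be wrapped in a small helper. The following is only a sketch and not part of this change: the function names `with_secrets` and `run_pipeline` and the `DRY_RUN` variable are hypothetical, and only the `esc run beanflows/prod -- ...` commands themselves come from the README. With `DRY_RUN=1` the helper prints the commands instead of executing them, so it can be tried on a machine without `esc` or `uv`.

```shell
# Hypothetical helper; names below are illustrative assumptions, not repo code.
ESC_ENV="beanflows/prod"                     # ESC environment named in the README
SQLMESH_PROJECT="transform/sqlmesh_materia"  # SQLMesh project path from the README

# Run one command inside the ESC environment.
# When DRY_RUN=1, print the command line instead of executing it.
with_secrets() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "esc run ${ESC_ENV} -- $*"
  else
    esc run "${ESC_ENV}" -- "$@"
  fi
}

# Extract, then plan against prod, mirroring the README's "Multiple commands" example.
run_pipeline() {
  with_secrets uv run extract_psd
  with_secrets uv run sqlmesh -p "${SQLMESH_PROJECT}" plan prod
}
```

Calling `DRY_RUN=1 run_pipeline` prints the two `esc run` command lines. Calling `run_pipeline` without `DRY_RUN` would execute them and therefore needs a logged-in `esc` session.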