# Materia

A commodity data analytics platform built on a modern data engineering stack. It extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB and Cloudflare R2 for analysis.

## Tech Stack

- **Python 3.13** with the `uv` package manager
- **SQLMesh** for SQL transformation and orchestration
- **DuckDB** as the analytical database
- **Cloudflare R2** (Iceberg) for data storage
- **Pulumi ESC** for secrets management
- **Hetzner Cloud** for infrastructure

## Quick Start

### 1. Install uv

`uv` is our Python package manager, chosen for faster, more reliable dependency management.

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

📚 [uv Documentation](https://docs.astral.sh/uv/)

### 2. Install Dependencies

```bash
uv sync
```

This installs Python and all dependencies declared in `pyproject.toml`.

### 3. Set Up Pre-commit Hooks

```bash
pre-commit install
```

This enables automatic linting with `ruff` on every commit.

### 4. Install Pulumi ESC (for running with secrets)

```bash
# Install ESC CLI
curl -fsSL https://get.pulumi.com/esc/install.sh | sh

# Login
esc login
```

## Project Structure

This is a `uv` workspace with three main packages:

### Extract Layer (`extract/`)

**psdonline** - Extracts USDA PSD commodity data

```bash
# Local development (downloads to local directory)
uv run extract_psd

# Production (uploads to R2)
esc run beanflows/prod -- uv run extract_psd
```

### Transform Layer (`transform/sqlmesh_materia/`)

SQLMesh project implementing a 4-layer data architecture (raw → staging → cleaned → serving).

**Run all commands from the project root with `-p transform/sqlmesh_materia`:**

```bash
# Local development (plans into your personal dev_<username> environment)
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan dev_<username>

# Production
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan prod

# Run tests (no secrets needed)
uv run sqlmesh -p transform/sqlmesh_materia test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format
```

### Core Package (`src/materia/`)

CLI for managing infrastructure and pipelines (currently minimal).

## Development Workflow

### Adding Dependencies

For the workspace root:
```bash
uv add <package-name>
```

For a specific package:
```bash
uv add --package psdonline <package-name>
```

### Linting and Formatting

```bash
# Check for issues
ruff check .

# Auto-fix issues
ruff check --fix .

# Format code
ruff format .
```

### Running Tests

```bash
# Python tests
uv run pytest tests/ -v --cov=src/materia

# SQLMesh tests
uv run sqlmesh -p transform/sqlmesh_materia test
```

## Secrets Management

All secrets are managed via **Pulumi ESC** environment `beanflows/prod`.

### Load secrets into your shell

```bash
eval "$(esc env open beanflows/prod --format shell)"
```

### Run commands with secrets

```bash
# Single command
esc run beanflows/prod -- uv run extract_psd

# Multiple commands
esc run beanflows/prod -- bash -c "
  set -e  # stop at the first failing command
  uv run extract_psd
  uv run sqlmesh -p transform/sqlmesh_materia plan prod
"
```
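
Commands started through `esc run` receive the secrets as environment variables. As a minimal sketch of how Python code might read them and fail early when they are absent, consider the helper below. The variable names `R2_ACCESS_KEY_ID` and `R2_SECRET_ACCESS_KEY` and the function `load_secrets` are hypothetical; the real keys are whatever the `beanflows/prod` environment defines.

```python
import os

# Hypothetical variable names; the real keys are whatever `beanflows/prod` defines.
REQUIRED_SECRETS = ("R2_ACCESS_KEY_ID", "R2_SECRET_ACCESS_KEY")


def load_secrets(environ=None, required=REQUIRED_SECRETS) -> dict[str, str]:
    """Collect required secrets from the environment, failing fast if any are missing."""
    env = os.environ if environ is None else environ
    # Treat unset and empty values the same way: both count as missing.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing secrets: " + ", ".join(missing)
            + ". Run the command through `esc run beanflows/prod -- ...`."
        )
    return {name: env[name] for name in required}
```

Passing `environ` explicitly keeps the helper testable without touching the real process environment.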

## Production Architecture

### Git-Based Deployment

- **Supervisor** (Hetzner CPX11): Always-on orchestrator that pulls the latest code every 15 minutes
- **Workers** (ephemeral): Created on demand for each pipeline run and destroyed after completion
- **Storage**: Cloudflare R2 Data Catalog (Apache Iceberg REST API)
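
The ephemeral worker lifecycle described above can be sketched in Python as follows. This is not the supervisor's actual code: `run_pipeline_once` and its three callables are hypothetical stand-ins for whatever provisions and tears down Hetzner servers, and destroying the worker after a failed run is an assumption about what "destroyed after completion" covers.

```python
from typing import Callable


def run_pipeline_once(
    create_worker: Callable[[], str],
    run_on_worker: Callable[[str], None],
    destroy_worker: Callable[[str], None],
) -> None:
    """Create a worker, run the pipeline on it, and always destroy it afterwards."""
    worker_id = create_worker()
    try:
        run_on_worker(worker_id)
    finally:
        # Assumption: tear the worker down even when the run fails,
        # so no server is left running and accruing cost.
        destroy_worker(worker_id)
```

Injecting the three steps as callables keeps the lifecycle logic testable locally without any cloud credentials.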

### CI/CD Pipeline

**GitLab CI** runs on every push to `master`:

1. **Lint** - `ruff check`
2. **Test** - pytest + SQLMesh tests
3. **Deploy** - Updates supervisor infrastructure and bootstraps if needed

There are no build artifacts: the supervisor pulls code directly from Git.

## Architecture Principles

- **Simplicity First** - Avoid unnecessary abstractions
- **Data-Oriented Design** - Identify data by content, not metadata
- **Cost Optimization** - Ephemeral workers, minimal always-on infrastructure
- **Inspectable** - Easy to understand, test locally, and debug
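
One way to read "identify data by content, not metadata" is to derive an identifier from a file's bytes instead of its name or timestamp. The sketch below does that in Python. The function `content_id` and the choice of SHA-256 are assumptions for illustration, not a description of how this project names its files.

```python
import hashlib
from pathlib import Path


def content_id(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in fixed-size chunks so large files do not need to fit in memory.
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Two downloads with identical bytes get the same identifier regardless of filename, which makes it cheap to detect that an extract has not changed.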

## Resources

- **Architecture Plans**: See `.claude/plans/` for design decisions
- **uv Docs**: https://docs.astral.sh/uv/
- **SQLMesh Docs**: https://sqlmesh.readthedocs.io/