Update README with comprehensive project documentation

Added complete project overview including:
- Tech stack and architecture overview
- Quick start guide with UV and Pulumi ESC setup
- Project structure (extract, transform, core packages)
- Development workflow (dependencies, linting, testing)
- Secrets management with ESC examples
- Production architecture explanation
- Architecture principles

Removed outdated content and references to CLAUDE.md (internal memory only).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Deeman
2025-10-21 21:51:52 +02:00
parent d4e6c65f97
commit 3c7a99a699

README.md

@@ -1,39 +1,181 @@
-# Materia Environment Setup
-We use `uv` as our Python package manager for faster, more reliable dependency management.
-https://docs.astral.sh/uv/
-We recommend using vscode as your IDE.
-https://code.visualstudio.com/
# Materia
A commodity data analytics platform built on a modern data engineering stack. Extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB + Cloudflare R2 for analysis.
## Tech Stack
- **Python 3.13** with `uv` package manager
- **SQLMesh** for SQL transformation and orchestration
- **DuckDB** as the analytical database
- **Cloudflare R2** (Iceberg) for data storage
- **Pulumi ESC** for secrets management
- **Hetzner Cloud** for infrastructure
## Quick Start
### 1. Install UV
UV is our Python package manager for faster, more reliable dependency management.
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
-### 2. Setup the env
-Simply run:
📚 [UV Documentation](https://docs.astral.sh/uv/)
### 2. Install Dependencies
```bash
uv sync
```
-This will install python & the dependencies declared so far
-### 3. Setup pre-commit
This installs Python and all dependencies declared in `pyproject.toml`.
### 3. Setup Pre-commit Hooks
```bash
pre-commit install
```
-### 4. Adding a dependency
This enables automatic linting with `ruff` on every commit.
### 4. Install Pulumi ESC (for running with secrets)
```bash
-uv add requests
# Install ESC CLI
curl -fsSL https://get.pulumi.com/esc/install.sh | sh
# Login
esc login
```
-# Managing a project with uv
-https://docs.astral.sh/uv/guides/projects/#managing-dependencies
-test
## Project Structure
This is a `uv` workspace with three main packages:
### Extract Layer (`extract/`)
**psdonline** - Extracts USDA PSD commodity data
```bash
# Local development (downloads to local directory)
uv run extract_psd
# Production (uploads to R2)
esc run beanflows/prod -- uv run extract_psd
```
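The local/production split above can be pictured as a destination chosen from the environment that `esc run` provides. The sketch below is an assumption about how such a switch might look, not the actual `extract_psd` code: the variable name `R2_BUCKET`, the `data/raw` directory, and the `resolve_destination` helper are all hypothetical.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Destination:
    kind: str      # "local" or "r2"
    location: str  # directory path or bucket name


def resolve_destination(env: Mapping[str, str] = os.environ) -> Destination:
    """Pick R2 when a bucket is configured, otherwise a local directory.

    R2_BUCKET is a hypothetical variable name used only for illustration.
    """
    bucket = env.get("R2_BUCKET", "").strip()
    if bucket:
        return Destination(kind="r2", location=bucket)
    # No bucket configured: fall back to a (hypothetical) local directory.
    return Destination(kind="local", location=str(Path("data") / "raw"))
```

Passing the environment in as a mapping keeps the decision testable without touching real secrets.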
### Transform Layer (`transform/sqlmesh_materia/`)
SQLMesh project implementing a 4-layer data architecture (raw → staging → cleaned → serving).
**All commands run from project root with `-p transform/sqlmesh_materia`:**
```bash
# Local development
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan dev_<username>
# Production
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan prod
# Run tests (no secrets needed)
uv run sqlmesh -p transform/sqlmesh_materia test
# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format
```
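To make the raw → staging → cleaned → serving flow concrete, here is a minimal plain-Python sketch of what each layer is typically responsible for. It is illustrative only: the real models are SQLMesh SQL, and the column names (`commodity`, `year`, `value`) and the exact per-layer responsibilities are assumptions.

```python
from collections import defaultdict

# Raw layer: records exactly as they arrive, every field a string.
raw = [
    {"commodity": " Coffee ", "year": "2023", "value": "10.5"},
    {"commodity": "coffee", "year": "2023", "value": ""},
    {"commodity": "Cocoa", "year": "2023", "value": "4.0"},
    {"commodity": "Coffee", "year": "2024", "value": "11.0"},
]


def to_staging(rows):
    """Staging: cast types and normalise names, keeping every row."""
    return [
        {
            "commodity": r["commodity"].strip().lower(),
            "year": int(r["year"]),
            "value": float(r["value"]) if r["value"] else None,
        }
        for r in rows
    ]


def to_cleaned(rows):
    """Cleaned: drop rows that fail a basic quality check (missing value)."""
    return [r for r in rows if r["value"] is not None]


def to_serving(rows):
    """Serving: aggregate into an analysis-ready shape keyed by (commodity, year)."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["commodity"], r["year"])] += r["value"]
    return dict(totals)


serving = to_serving(to_cleaned(to_staging(raw)))
```

Each function consumes only the output of the layer before it, which is the property that lets a layer be tested in isolation.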
### Core Package (`src/materia/`)
CLI for managing infrastructure and pipelines (currently minimal).
## Development Workflow
### Adding Dependencies
For workspace root:
```bash
uv add <package-name>
```
For specific package:
```bash
uv add --package psdonline <package-name>
```
### Linting and Formatting
```bash
# Check for issues
ruff check .
# Auto-fix issues
ruff check --fix .
# Format code
ruff format .
```
### Running Tests
```bash
# Python tests
uv run pytest tests/ -v --cov=src/materia
# SQLMesh tests
uv run sqlmesh -p transform/sqlmesh_materia test
```
## Secrets Management
All secrets are managed via **Pulumi ESC** environment `beanflows/prod`.
### Load secrets into shell:
```bash
eval "$(esc env open beanflows/prod --format shell)"
```
### Run commands with secrets:
```bash
# Single command
esc run beanflows/prod -- uv run extract_psd
# Multiple commands
esc run beanflows/prod -- bash -c "
uv run extract_psd
uv run sqlmesh -p transform/sqlmesh_materia plan prod
"
```
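Because `esc run` hands secrets to the child process as environment variables, pipeline code can check for them up front and fail with a clear message when it is started without ESC. The following is a minimal sketch under that assumption; `require_env`, `MissingSecretError`, and the variable names in the usage note are hypothetical and not taken from the project.

```python
import os
from typing import Dict, Iterable, Mapping


class MissingSecretError(RuntimeError):
    """Raised when a required environment variable is absent or empty."""


def require_env(
    names: Iterable[str], env: Mapping[str, str] = os.environ
) -> Dict[str, str]:
    """Return the requested variables, or fail listing every missing one."""
    names = list(names)
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise MissingSecretError(
            "missing environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in names}
```

A caller might use `require_env(["R2_ACCESS_KEY_ID", "R2_SECRET_ACCESS_KEY"])` at startup (both names are placeholders), so a run without secrets stops before any data is downloaded.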
## Production Architecture
### Git-Based Deployment
- **Supervisor** (Hetzner CPX11): Always-on orchestrator that pulls latest code every 15 minutes
- **Workers** (Ephemeral): Created on-demand for each pipeline run, destroyed after completion
- **Storage**: Cloudflare R2 Data Catalog (Apache Iceberg REST API)
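
The supervisor/worker split described above can be sketched as a small control loop. This is a hedged illustration, not the project's supervisor: every callable (`pull_code`, `pipeline_due`, `create_worker`, `run_pipeline`, `destroy_worker`) is a hypothetical placeholder for whatever the real system does with git and Hetzner Cloud.

```python
import time
from typing import Callable, Optional

PULL_INTERVAL_SECONDS = 15 * 60  # "every 15 minutes", per the description above


def supervise(
    pull_code: Callable[[], None],
    pipeline_due: Callable[[], bool],
    create_worker: Callable[[], str],
    run_pipeline: Callable[[str], None],
    destroy_worker: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
    max_cycles: Optional[int] = None,
) -> None:
    """Pull code, run a pipeline on a fresh worker if due, then wait."""
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        pull_code()
        if pipeline_due():
            worker = create_worker()
            try:
                run_pipeline(worker)
            finally:
                # Destroy the worker even if the pipeline fails,
                # so no ephemeral machine is left running.
                destroy_worker(worker)
        sleep(PULL_INTERVAL_SECONDS)
        cycle += 1
```

The `try`/`finally` is the part that matters for cost: a failed run still tears down its worker. `max_cycles` and the injectable `sleep` exist only so the loop can be exercised locally.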
### CI/CD Pipeline
**GitLab CI** runs on every push to master:
1. **Lint** - `ruff check`
2. **Test** - pytest + SQLMesh tests
3. **Deploy** - Updates supervisor infrastructure and bootstraps if needed
No build artifacts - supervisor pulls code directly from git!
## Architecture Principles
- **Simplicity First** - Avoid unnecessary abstractions
- **Data-Oriented Design** - Identify data by content, not metadata
- **Cost Optimization** - Ephemeral workers, minimal always-on infrastructure
- **Inspectable** - Easy to understand, test locally, and debug
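
One possible reading of "identify data by content, not metadata" is to key each downloaded payload by a hash of its bytes rather than by filename or timestamp. The sketch below assumes that reading; `content_id` and `is_new` are hypothetical helpers, not functions from the project.

```python
import hashlib
from typing import Set


def content_id(data: bytes) -> str:
    """Identify a payload by the SHA-256 digest of its bytes."""
    return hashlib.sha256(data).hexdigest()


def is_new(data: bytes, seen: Set[str]) -> bool:
    """Record the payload's digest and report whether it was unseen."""
    digest = content_id(data)
    if digest in seen:
        return False
    seen.add(digest)
    return True
```

With this approach, re-downloading an unchanged file produces the same identifier, so a pipeline can skip it regardless of what the file is named or when it was fetched.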
## Resources
- **Architecture Plans**: See `.claude/plans/` for design decisions
- **UV Docs**: https://docs.astral.sh/uv/
- **SQLMesh Docs**: https://sqlmesh.readthedocs.io/