Update README with comprehensive project documentation
Added complete project overview including:
- Tech stack and architecture overview
- Quick start guide with UV and Pulumi ESC setup
- Project structure (extract, transform, core packages)
- Development workflow (dependencies, linting, testing)
- Secrets management with ESC examples
- Production architecture explanation
- Architecture principles

Removed outdated content and references to CLAUDE.md (internal memory only).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
README.md
@@ -1,39 +1,181 @@
# Materia Environment Setup
# Materia

We use `uv` as our Python package manager for faster, more reliable dependency management.
https://docs.astral.sh/uv/
A commodity data analytics platform built on a modern data engineering stack. It extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB + Cloudflare R2 for analysis.

We recommend using VS Code as your IDE.
https://code.visualstudio.com/
## Tech Stack

- **Python 3.13** with `uv` package manager
- **SQLMesh** for SQL transformation and orchestration
- **DuckDB** as the analytical database
- **Cloudflare R2** (Iceberg) for data storage
- **Pulumi ESC** for secrets management
- **Hetzner Cloud** for infrastructure

## Quick Start

### 1. Install UV

UV is our Python package manager for faster, more reliable dependency management.

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

### 2. Setup the env
Simply run:
📚 [UV Documentation](https://docs.astral.sh/uv/)

### 2. Install Dependencies

```bash
uv sync
```
This will install python & the dependencies declared so far

### 3. Setup pre-commit
This installs Python and all dependencies declared in `pyproject.toml`.

### 3. Setup Pre-commit Hooks

```bash
pre-commit install
```

### 4. Adding a dependency
This enables automatic linting with `ruff` on every commit.

### 4. Install Pulumi ESC (for running with secrets)

```bash
uv add requests
# Install ESC CLI
curl -fsSL https://get.pulumi.com/esc/install.sh | sh

# Login
esc login
```

# Managing a project with uv
## Project Structure

https://docs.astral.sh/uv/guides/projects/#managing-dependencies
This is a `uv` workspace with three main packages:

### Extract Layer (`extract/`)

test
**psdonline** - Extracts USDA PSD commodity data

```bash
# Local development (downloads to local directory)
uv run extract_psd

# Production (uploads to R2)
esc run beanflows/prod -- uv run extract_psd
```

### Transform Layer (`transform/sqlmesh_materia/`)

SQLMesh project implementing a 4-layer data architecture (raw → staging → cleaned → serving).

**All commands run from project root with `-p transform/sqlmesh_materia`:**

```bash
# Local development
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan dev_<username>

# Production
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan prod

# Run tests (no secrets needed)
uv run sqlmesh -p transform/sqlmesh_materia test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format
```
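
The raw → staging → cleaned → serving layering can be illustrated with a small, self-contained sketch. This is not the project's SQLMesh code: the table and column names (`raw_psd`, `commodity`, `year`, `value`) are hypothetical, the standard-library `sqlite3` module stands in for DuckDB, and plain SQL views stand in for SQLMesh models. It only shows how each layer reads exclusively from the layer before it.

```python
import sqlite3

# Hypothetical layer definitions; each view selects only from the previous layer.
LAYERS = [
    # staging: trim and cast the raw text columns, no business logic yet
    """CREATE VIEW staging_psd AS
       SELECT TRIM(commodity) AS commodity,
              CAST(year AS INTEGER) AS year,
              CAST(value AS REAL) AS value
       FROM raw_psd""",
    # cleaned: drop rows that fail basic validity checks
    """CREATE VIEW cleaned_psd AS
       SELECT commodity, year, value
       FROM staging_psd
       WHERE commodity <> '' AND value IS NOT NULL""",
    # serving: aggregate into the shape that analysis queries read
    """CREATE VIEW serving_psd_yearly AS
       SELECT commodity, year, SUM(value) AS total_value
       FROM cleaned_psd
       GROUP BY commodity, year""",
]


def build_pipeline(rows):
    """Load raw rows into an in-memory database and stack the three views on top."""
    conn = sqlite3.connect(":memory:")
    # raw: data stored exactly as extracted, everything as text
    conn.execute("CREATE TABLE raw_psd (commodity TEXT, year TEXT, value TEXT)")
    conn.executemany("INSERT INTO raw_psd VALUES (?, ?, ?)", rows)
    for ddl in LAYERS:
        conn.execute(ddl)
    return conn


if __name__ == "__main__":
    conn = build_pipeline([
        ("Coffee ", "2023", "10.5"),
        ("Coffee", "2023", "4.5"),
        ("", "2023", "1.0"),      # removed in the cleaned layer (empty commodity)
        ("Cocoa", "2024", None),  # removed in the cleaned layer (missing value)
    ])
    print(conn.execute("SELECT * FROM serving_psd_yearly").fetchall())
```

Keeping each layer dependent only on the one below it is what makes an individual step easy to test and debug locally.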

### Core Package (`src/materia/`)

CLI for managing infrastructure and pipelines (currently minimal).

## Development Workflow

### Adding Dependencies

For the workspace root:
```bash
uv add <package-name>
```

For a specific package:
```bash
uv add --package psdonline <package-name>
```

### Linting and Formatting

```bash
# Check for issues
ruff check .

# Auto-fix issues
ruff check --fix .

# Format code
ruff format .
```

### Running Tests

```bash
# Python tests
uv run pytest tests/ -v --cov=src/materia

# SQLMesh tests
uv run sqlmesh -p transform/sqlmesh_materia test
```

## Secrets Management

All secrets are managed via the **Pulumi ESC** environment `beanflows/prod`.

### Load secrets into shell:

```bash
# Quote the command substitution so values containing spaces survive word splitting
eval "$(esc env open beanflows/prod --format shell)"
```

### Run commands with secrets:

```bash
# Single command
esc run beanflows/prod -- uv run extract_psd

# Multiple commands
esc run beanflows/prod -- bash -c "
  uv run extract_psd
  uv run sqlmesh -p transform/sqlmesh_materia plan prod
"
```
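
Because `esc run` hands secrets to the child process as environment variables, code running under it can read them from `os.environ`. The sketch below shows one way to do that with a fail-fast check. The variable names (`R2_ACCESS_KEY_ID`, `R2_SECRET_ACCESS_KEY`, `R2_ENDPOINT`) and the `R2Settings` class are assumptions for illustration; this README does not list the keys that `beanflows/prod` actually exposes.

```python
import os
from dataclasses import dataclass

# Hypothetical variable names; replace with the keys your ESC environment defines.
REQUIRED_VARS = ("R2_ACCESS_KEY_ID", "R2_SECRET_ACCESS_KEY", "R2_ENDPOINT")


@dataclass(frozen=True)
class R2Settings:
    access_key_id: str
    secret_access_key: str
    endpoint: str


def load_settings(env=None):
    """Read required settings from the environment and report every missing name at once."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return R2Settings(
        access_key_id=env["R2_ACCESS_KEY_ID"],
        secret_access_key=env["R2_SECRET_ACCESS_KEY"],
        endpoint=env["R2_ENDPOINT"],
    )
```

Failing at startup with the full list of missing names tends to be easier to debug than a credentials error halfway through a pipeline run.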

## Production Architecture

### Git-Based Deployment

- **Supervisor** (Hetzner CPX11): Always-on orchestrator that pulls latest code every 15 minutes
- **Workers** (Ephemeral): Created on-demand for each pipeline run, destroyed after completion
- **Storage**: Cloudflare R2 Data Catalog (Apache Iceberg REST API)
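
The supervisor and worker lifecycle described above can be sketched as a small control loop. This is a sketch under stated assumptions, not the project's supervisor: every callable passed in (`pull_code`, `pipeline_due`, `run_once`, `create`, `run_pipeline`, `destroy`) is a hypothetical placeholder for whatever performs the git pull, scheduling check, and server provisioning in practice. Only the 15-minute interval comes from the README.

```python
import time

PULL_INTERVAL_SECONDS = 15 * 60  # "every 15 minutes", from the list above


def run_on_ephemeral_worker(create, run_pipeline, destroy):
    """Create a worker, run one pipeline on it, and destroy it even if the run fails."""
    worker = create()
    try:
        return run_pipeline(worker)
    finally:
        # Always tear down so a failed run does not leave a billable server behind.
        destroy(worker)


def supervise(pull_code, pipeline_due, run_once, sleep=time.sleep, max_cycles=None):
    """Always-on loop: refresh the code, start a run if one is due, then wait."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        pull_code()
        if pipeline_due():
            run_once()
        sleep(PULL_INTERVAL_SECONDS)
        cycles += 1
```

Injecting `sleep` and `max_cycles` is purely a testing convenience, so the loop can be exercised locally without waiting or running forever.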

### CI/CD Pipeline

**GitLab CI** runs on every push to master:

1. **Lint** - `ruff check`
2. **Test** - pytest + SQLMesh tests
3. **Deploy** - Updates supervisor infrastructure and bootstraps if needed

No build artifacts - supervisor pulls code directly from git!

## Architecture Principles

- **Simplicity First** - Avoid unnecessary abstractions
- **Data-Oriented Design** - Identify data by content, not metadata
- **Cost Optimization** - Ephemeral workers, minimal always-on infrastructure
- **Inspectable** - Easy to understand, test locally, and debug

## Resources

- **Architecture Plans**: See `.claude/plans/` for design decisions
- **UV Docs**: https://docs.astral.sh/uv/
- **SQLMesh Docs**: https://sqlmesh.readthedocs.io/