Update README with comprehensive project documentation

Added complete project overview including:

- Tech stack and architecture overview
- Quick start guide with UV and Pulumi ESC setup
- Project structure (extract, transform, core packages)
- Development workflow (dependencies, linting, testing)
- Secrets management with ESC examples
- Production architecture explanation
- Architecture principles

Removed outdated content and references to CLAUDE.md (internal memory only).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
README.md (170 lines changed)
@@ -1,39 +1,181 @@
-# Materia Environment Setup
+# Materia
 
-We use `uv` as our Python package manager for faster, more reliable dependency management.
-https://docs.astral.sh/uv/
+A commodity data analytics platform built on a modern data engineering stack. Extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB + Cloudflare R2 for analysis.
 
-We recommend using vscode as your IDE.
-https://code.visualstudio.com/
+## Tech Stack
+
+- **Python 3.13** with `uv` package manager
+- **SQLMesh** for SQL transformation and orchestration
+- **DuckDB** as the analytical database
+- **Cloudflare R2** (Iceberg) for data storage
+- **Pulumi ESC** for secrets management
+- **Hetzner Cloud** for infrastructure
 
+## Quick Start
+
 ### 1. Install UV
 
+UV is our Python package manager for faster, more reliable dependency management.
+
 ```bash
 curl -LsSf https://astral.sh/uv/install.sh | sh
 ```
 
-### 2. Setup the env
-Simply run:
+📚 [UV Documentation](https://docs.astral.sh/uv/)
+
+### 2. Install Dependencies
 
 ```bash
 uv sync
 ```
-This will install python & the dependencies declared so far
 
-### 3. Setup pre-commit
+This installs Python and all dependencies declared in `pyproject.toml`.
 
+### 3. Setup Pre-commit Hooks
+
 ```bash
 pre-commit install
 ```
 
-### 4. Adding a dependency
+This enables automatic linting with `ruff` on every commit.
 
+### 4. Install Pulumi ESC (for running with secrets)
+
 ```bash
-uv add requests
+# Install ESC CLI
+curl -fsSL https://get.pulumi.com/esc/install.sh | sh
+
+# Login
+esc login
 ```
 
-# Managing a project with uv
+## Project Structure
 
-https://docs.astral.sh/uv/guides/projects/#managing-dependencies
+This is a `uv` workspace with three main packages:
 
+### Extract Layer (`extract/`)
+
-test
+**psdonline** - Extracts USDA PSD commodity data
+
+```bash
+# Local development (downloads to local directory)
+uv run extract_psd
+
+# Production (uploads to R2)
+esc run beanflows/prod -- uv run extract_psd
+```
+
+### Transform Layer (`transform/sqlmesh_materia/`)
+
+SQLMesh project implementing a 4-layer data architecture (raw → staging → cleaned → serving).
+
+**All commands run from project root with `-p transform/sqlmesh_materia`:**
+
+```bash
+# Local development
+esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan dev_<username>
+
+# Production
+esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan prod
+
+# Run tests (no secrets needed)
+uv run sqlmesh -p transform/sqlmesh_materia test
+
+# Format SQL
+uv run sqlmesh -p transform/sqlmesh_materia format
+```
+
+### Core Package (`src/materia/`)
+
+CLI for managing infrastructure and pipelines (currently minimal).
+
+## Development Workflow
+
+### Adding Dependencies
+
+For workspace root:
+```bash
+uv add <package-name>
+```
+
+For specific package:
+```bash
+uv add --package psdonline <package-name>
+```
+
+### Linting and Formatting
+
+```bash
+# Check for issues
+ruff check .
+
+# Auto-fix issues
+ruff check --fix .
+
+# Format code
+ruff format .
+```
+
+### Running Tests
+
+```bash
+# Python tests
+uv run pytest tests/ -v --cov=src/materia
+
+# SQLMesh tests
+uv run sqlmesh -p transform/sqlmesh_materia test
+```
+
+## Secrets Management
+
+All secrets are managed via **Pulumi ESC** environment `beanflows/prod`.
+
+### Load secrets into shell:
+
+```bash
+eval $(esc env open beanflows/prod --format shell)
+```
+
+### Run commands with secrets:
+
+```bash
+# Single command
+esc run beanflows/prod -- uv run extract_psd
+
+# Multiple commands
+esc run beanflows/prod -- bash -c "
+uv run extract_psd
+uv run sqlmesh -p transform/sqlmesh_materia plan prod
+"
+```
+
+## Production Architecture
+
+### Git-Based Deployment
+
+- **Supervisor** (Hetzner CPX11): Always-on orchestrator that pulls latest code every 15 minutes
+- **Workers** (Ephemeral): Created on-demand for each pipeline run, destroyed after completion
+- **Storage**: Cloudflare R2 Data Catalog (Apache Iceberg REST API)
+
+### CI/CD Pipeline
+
+**GitLab CI** runs on every push to master:
+
+1. **Lint** - `ruff check`
+2. **Test** - pytest + SQLMesh tests
+3. **Deploy** - Updates supervisor infrastructure and bootstraps if needed
+
+No build artifacts - supervisor pulls code directly from git!
+
+## Architecture Principles
+
+- **Simplicity First** - Avoid unnecessary abstractions
+- **Data-Oriented Design** - Identify data by content, not metadata
+- **Cost Optimization** - Ephemeral workers, minimal always-on infrastructure
+- **Inspectable** - Easy to understand, test locally, and debug
+
+## Resources
+
+- **Architecture Plans**: See `.claude/plans/` for design decisions
+- **UV Docs**: https://docs.astral.sh/uv/
+- **SQLMesh Docs**: https://sqlmesh.readthedocs.io/
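
Note on the Secrets Management section added in this diff: the `esc run beanflows/prod -- <command>` pattern could be wrapped from Python roughly as follows. This is only a sketch and not part of the commit. The helper names `build_esc_command` and `run_with_secrets` are hypothetical; the environment name and the example command come from the README.

```python
"""Sketch: wrap a command in `esc run <environment> -- ...`. Helper names are hypothetical."""
import subprocess
from collections.abc import Sequence

ESC_ENVIRONMENT = "beanflows/prod"  # environment name taken from the README


def build_esc_command(command: Sequence[str], environment: str = ESC_ENVIRONMENT) -> list[str]:
    """Return the argv that runs `command` with the environment's secrets injected."""
    if not command:
        raise ValueError("command must contain at least one argument")
    # `--` separates esc's own arguments from the wrapped command.
    return ["esc", "run", environment, "--", *command]


def run_with_secrets(command: Sequence[str], environment: str = ESC_ENVIRONMENT) -> int:
    """Run the wrapped command and return its exit code.

    Assumes the esc CLI is installed and `esc login` has been run.
    """
    return subprocess.run(build_esc_command(command, environment), check=False).returncode


if __name__ == "__main__":
    # Only prints the command line; nothing is executed here.
    print(" ".join(build_esc_command(["uv", "run", "extract_psd"])))
```

Keeping the argv construction separate from the `subprocess` call means the command can be inspected or logged without the esc CLI being present.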
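
Note on the Transform Layer section added in this diff: every SQLMesh command there carries `-p transform/sqlmesh_materia`, and development plans target a `dev_<username>` environment. A small builder, sketched below, could keep those two conventions in one place. The function names are hypothetical, and using the local login name for `<username>` is an assumption, not something the README states.

```python
"""Sketch: build the SQLMesh invocations listed in the README. Helper names are hypothetical."""
import getpass
from typing import Optional

SQLMESH_PROJECT = "transform/sqlmesh_materia"  # project path from the README
ESC_PREFIX = ["esc", "run", "beanflows/prod", "--"]  # secrets wrapper from the README


def dev_environment(username: Optional[str] = None) -> str:
    """Return a per-developer environment name following the dev_<username> pattern.

    Falling back to the local login name is an assumption made for this sketch.
    """
    return f"dev_{username or getpass.getuser()}"


def sqlmesh_command(subcommand: str, *args: str, needs_secrets: bool = True) -> list[str]:
    """Return the argv for a SQLMesh subcommand run from the project root."""
    base = ["uv", "run", "sqlmesh", "-p", SQLMESH_PROJECT, subcommand, *args]
    # The README runs `plan` through esc, while `test` and `format` run without secrets.
    return [*ESC_PREFIX, *base] if needs_secrets else base


if __name__ == "__main__":
    print(" ".join(sqlmesh_command("plan", dev_environment("alice"))))
    print(" ".join(sqlmesh_command("test", needs_secrets=False)))
```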
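
Note on the Production Architecture section added in this diff: workers are described as created on demand for each pipeline run and destroyed after completion. One way to express that lifecycle, as a sketch only, is a context manager that always tears the worker down. The `create` and `destroy` callables are hypothetical stand-ins for whatever provisioning code the project uses; nothing here calls a real cloud API, and the supervisor's 15-minute pull loop is not modelled.

```python
"""Sketch of an ephemeral-worker lifecycle. create/destroy are hypothetical stand-ins."""
from collections.abc import Callable, Iterator
from contextlib import contextmanager


@contextmanager
def ephemeral_worker(create: Callable[[], str], destroy: Callable[[str], None]) -> Iterator[str]:
    """Create a worker for one pipeline run and always destroy it afterwards."""
    worker_id = create()
    try:
        yield worker_id
    finally:
        # Runs on success and on failure, so no worker outlives its run.
        destroy(worker_id)


def run_pipeline(
    create: Callable[[], str],
    destroy: Callable[[str], None],
    pipeline: Callable[[str], None],
) -> None:
    """Run `pipeline` on a freshly created worker, then tear the worker down."""
    with ephemeral_worker(create, destroy) as worker_id:
        pipeline(worker_id)


if __name__ == "__main__":
    events: list[str] = []
    run_pipeline(
        create=lambda: events.append("create") or "worker-1",
        destroy=lambda worker_id: events.append(f"destroy {worker_id}"),
        pipeline=lambda worker_id: events.append(f"run on {worker_id}"),
    )
    print(events)
```

Putting teardown in `finally` matches the cost goal stated under Architecture Principles: a failed run still releases its worker.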