# Materia

A commodity data analytics platform built on a modern data engineering stack. It extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB and Cloudflare R2 for analysis.

## Tech Stack

- **Python 3.13** with the `uv` package manager
- **SQLMesh** for SQL transformation and orchestration
- **DuckDB** as the analytical database
- **Cloudflare R2** (Iceberg) for data storage
- **Pulumi ESC** for secrets management
- **Hetzner Cloud** for infrastructure

## Quick Start

### 1. Install UV

UV is our Python package manager for faster, more reliable dependency management.

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

📚 [UV Documentation](https://docs.astral.sh/uv/)

### 2. Install Dependencies

```bash
uv sync
```

This installs Python and all dependencies declared in `pyproject.toml`.

### 3. Set Up Pre-commit Hooks

```bash
pre-commit install
```

This enables automatic linting with `ruff` on every commit.

### 4. Install Pulumi ESC (for running with secrets)

```bash
# Install ESC CLI
curl -fsSL https://get.pulumi.com/esc/install.sh | sh

# Login
esc login
```

## Project Structure

This is a `uv` workspace with three main packages:

### Extract Layer (`extract/`)

**psdonline** - Extracts USDA PSD commodity data

```bash
# Local development (downloads to local directory)
uv run extract_psd

# Production (uploads to R2)
esc run beanflows/prod -- uv run extract_psd
```

### Transform Layer (`transform/sqlmesh_materia/`)

SQLMesh project implementing a 4-layer data architecture (raw → staging → cleaned → serving).

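To make the four layers concrete, here is a small sketch of a raw → staging → cleaned → serving flow. It uses Python's built-in `sqlite3` instead of SQLMesh and DuckDB so that it runs without extra dependencies. The table names, columns, sample rows, and the responsibility assigned to each layer follow a common convention and are assumptions, not the project's actual models.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# raw: data exactly as extracted, everything stored as text.
con.execute("CREATE TABLE raw_psd (commodity TEXT, year TEXT, value TEXT)")
con.executemany(
    "INSERT INTO raw_psd VALUES (?, ?, ?)",
    [("Coffee", "2023", "100"), ("coffee ", "2024", "120"), ("Coffee", "2024", "")],
)

# staging: cast types and normalise names, no business logic yet.
con.execute(
    """
    CREATE VIEW staging_psd AS
    SELECT lower(trim(commodity)) AS commodity,
           CAST(year AS INTEGER) AS year,
           CAST(NULLIF(value, '') AS REAL) AS value
    FROM raw_psd
    """
)

# cleaned: drop rows that fail a basic quality check.
con.execute(
    "CREATE VIEW cleaned_psd AS SELECT * FROM staging_psd WHERE value IS NOT NULL"
)

# serving: aggregate into the shape an analyst would query.
con.execute(
    """
    CREATE VIEW serving_psd_yearly AS
    SELECT commodity, year, SUM(value) AS total_value
    FROM cleaned_psd
    GROUP BY commodity, year
    """
)

rows = con.execute("SELECT * FROM serving_psd_yearly ORDER BY year").fetchall()
print(rows)
```

Each layer reads only from the one before it, which is what makes the pipeline easy to inspect one step at a time.
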
**Run all commands from the project root, passing `-p transform/sqlmesh_materia`:**

```bash
# Local development
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan dev_<username>

# Production
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan prod

# Run tests (no secrets needed)
uv run sqlmesh -p transform/sqlmesh_materia test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format
```

### Core Package (`src/materia/`)

CLI for managing infrastructure and pipelines (currently minimal).

## Development Workflow

### Adding Dependencies

For the workspace root:

```bash
uv add <package-name>
```

For a specific package:

```bash
uv add --package psdonline <package-name>
```

### Linting and Formatting

```bash
# Check for issues
ruff check .

# Auto-fix issues
ruff check --fix .

# Format code
ruff format .
```

### Running Tests

```bash
# Python tests
uv run pytest tests/ -v --cov=src/materia

# SQLMesh tests
uv run sqlmesh -p transform/sqlmesh_materia test
```

## Secrets Management

All secrets are managed via the **Pulumi ESC** environment `beanflows/prod`.

### Load secrets into shell:

```bash
eval "$(esc env open beanflows/prod --format shell)"
```

### Run commands with secrets:

```bash
# Single command
esc run beanflows/prod -- uv run extract_psd

# Multiple commands
esc run beanflows/prod -- bash -c "
  uv run extract_psd
  uv run sqlmesh -p transform/sqlmesh_materia plan prod
"
```

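Conceptually, `esc run` opens the environment and starts the given command with the secrets set as environment variables for that process. The sketch below imitates the idea with Python's `subprocess` module. The `run_with_secrets` helper and the `EXAMPLE_TOKEN` variable are hypothetical, and this is not how ESC itself is implemented.

```python
import os
import subprocess
import sys


def run_with_secrets(secrets: dict[str, str], command: list[str]) -> subprocess.CompletedProcess:
    """Run `command` with `secrets` added to a copy of the current environment."""
    # Build a new mapping so the parent process environment is left untouched.
    child_env = {**os.environ, **secrets}
    return subprocess.run(
        command, env=child_env, capture_output=True, text=True, check=True
    )


result = run_with_secrets(
    {"EXAMPLE_TOKEN": "not-a-real-secret"},
    [sys.executable, "-c", "import os; print(os.environ['EXAMPLE_TOKEN'])"],
)
print(result.stdout.strip())
```

Only the child process sees the injected value. The parent shell stays clean, which is the main difference from loading secrets with `eval`.
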
## Production Architecture

### Git-Based Deployment

- **Supervisor** (Hetzner CPX11): Always-on orchestrator that pulls the latest code every 15 minutes
- **Workers** (Ephemeral): Created on demand for each pipeline run and destroyed after completion
- **Storage**: Cloudflare R2 Data Catalog (Apache Iceberg REST API)

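The supervisor and worker roles listed above can be sketched as two small functions. This is an illustrative outline under stated assumptions, not the project's supervisor: every callable passed in (`pull_latest`, `create_worker`, `run_pipeline`, `destroy_worker`) is a hypothetical stand-in, and a real supervisor would loop indefinitely rather than for a fixed number of cycles.

```python
import time
from collections.abc import Callable

PULL_INTERVAL_SECONDS = 15 * 60  # "every 15 minutes", as described above


def supervise(
    pull_latest: Callable[[], None],
    cycles: int,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Pull the latest code, then wait, for a fixed number of cycles."""
    for _ in range(cycles):
        pull_latest()
        sleep(PULL_INTERVAL_SECONDS)


def run_on_ephemeral_worker(
    create_worker: Callable[[], str],
    run_pipeline: Callable[[str], None],
    destroy_worker: Callable[[str], None],
) -> None:
    """Create a worker, run one pipeline on it, and always destroy it afterwards."""
    worker_id = create_worker()
    try:
        run_pipeline(worker_id)
    finally:
        # Runs on success and on failure, so no worker is left running.
        destroy_worker(worker_id)


# Demo with fake callables that only record what happened.
events: list[str] = []
supervise(lambda: events.append("pull"), cycles=2, sleep=lambda seconds: None)
run_on_ephemeral_worker(
    create_worker=lambda: events.append("create") or "worker-1",
    run_pipeline=lambda worker_id: events.append(f"run on {worker_id}"),
    destroy_worker=lambda worker_id: events.append(f"destroy {worker_id}"),
)
print(events)
```

The `try`/`finally` block is the important design choice: a failed pipeline run still tears the worker down, which keeps ephemeral infrastructure from lingering.
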
### CI/CD Pipeline

**GitLab CI** runs on every push to master:

1. **Lint** - `ruff check`
2. **Test** - pytest + SQLMesh tests
3. **Deploy** - Updates supervisor infrastructure and bootstraps if needed

There are no build artifacts: the supervisor pulls code directly from git.

## Architecture Principles

- **Simplicity First** - Avoid unnecessary abstractions
- **Data-Oriented Design** - Identify data by content, not metadata
- **Cost Optimization** - Ephemeral workers, minimal always-on infrastructure
- **Inspectable** - Easy to understand, test locally, and debug

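One way to read "identify data by content, not metadata" is content addressing: a file is keyed by a hash of its bytes rather than by its filename or download time. Below is a minimal sketch that assumes SHA-256 and a hypothetical `content_key` helper. The project may use a different scheme.

```python
import hashlib


def content_key(data: bytes, suffix: str = ".csv") -> str:
    """Return a storage key derived only from the bytes themselves."""
    return hashlib.sha256(data).hexdigest() + suffix


first = content_key(b"commodity,year,value\ncoffee,2024,120\n")
again = content_key(b"commodity,year,value\ncoffee,2024,120\n")
changed = content_key(b"commodity,year,value\ncoffee,2024,121\n")

print(first == again)    # identical bytes give the same key
print(first == changed)  # any change in content gives a different key
```

With keys like these, downloading the same file twice maps to the same object, so repeated extraction runs do not create duplicates.
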
## Resources

- **Architecture Plans**: See `.claude/plans/` for design decisions
- **UV Docs**: https://docs.astral.sh/uv/
- **SQLMesh Docs**: https://sqlmesh.readthedocs.io/