cleanup and prefect service setup
181
readme.md
Normal file
@@ -0,0 +1,181 @@
# Materia

A commodity data analytics platform built on a modern data engineering stack. Extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB + Cloudflare R2 for analysis.

## Tech Stack

- **Python 3.13** with `uv` package manager
- **SQLMesh** for SQL transformation and orchestration
- **DuckDB** as the analytical database
- **Cloudflare R2** (Iceberg) for data storage
- **Pulumi ESC** for secrets management
- **Hetzner Cloud** for infrastructure

## Quick Start

### 1. Install UV

UV is our Python package manager for faster, more reliable dependency management.

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

📚 [UV Documentation](https://docs.astral.sh/uv/)

### 2. Install Dependencies

```bash
uv sync
```

This installs Python and all dependencies declared in `pyproject.toml`.

### 3. Setup Pre-commit Hooks

```bash
pre-commit install
```

This enables automatic linting with `ruff` on every commit.

### 4. Install Pulumi ESC (for running with secrets)

```bash
# Install ESC CLI
curl -fsSL https://get.pulumi.com/esc/install.sh | sh

# Login
esc login
```
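
After the four steps above, a quick sanity check can confirm that the tools are reachable on `PATH`. The sketch below is not part of the repository: the function name `missing_tools` is hypothetical, and it only assumes the three CLIs named in this guide (`uv`, `pre-commit`, `esc`).

```bash
# Hypothetical helper (not part of the repo): print every tool from the
# argument list that cannot be found on PATH, one name per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Calling `missing_tools uv pre-commit esc` prints nothing when all three prerequisites are installed, and otherwise lists the ones that still need attention.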

## Project Structure

This is a `uv` workspace with three main packages:

### Extract Layer (`extract/`)

**psdonline** - Extracts USDA PSD commodity data

```bash
# Local development (downloads to local directory)
uv run extract_psd

# Production (uploads to R2)
esc run beanflows/prod -- uv run extract_psd
```

### Transform Layer (`transform/sqlmesh_materia/`)

SQLMesh project implementing a 4-layer data architecture (raw → staging → cleaned → serving).

**All commands run from project root with `-p transform/sqlmesh_materia`:**

```bash
# Local development
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan dev_<username>

# Production
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan prod

# Run tests (no secrets needed)
uv run sqlmesh -p transform/sqlmesh_materia test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format
```
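
Because every command above repeats the same `-p transform/sqlmesh_materia` flag, a small shell wrapper can cut down on typing. This is only a sketch: `sqlmesh_cmd`, `SQLMESH_PROJECT`, and the `DRY_RUN` switch are hypothetical names introduced here, not something the repository provides.

```bash
# Hypothetical convenience wrapper (not part of the repo): prepend the
# project flag so it does not have to be repeated on every call.
SQLMESH_PROJECT="transform/sqlmesh_materia"

sqlmesh_cmd() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command line instead of running it.
    echo "uv run sqlmesh -p $SQLMESH_PROJECT $*"
  else
    uv run sqlmesh -p "$SQLMESH_PROJECT" "$@"
  fi
}
```

With this in place, `sqlmesh_cmd test` and `sqlmesh_cmd format` stand in for the last two commands above, and `DRY_RUN=1 sqlmesh_cmd plan prod` shows what would be executed without running it. Commands that need secrets still have to be started under `esc run beanflows/prod --`.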

### Core Package (`src/materia/`)

CLI for managing infrastructure and pipelines (currently minimal).

## Development Workflow

### Adding Dependencies

For workspace root:
```bash
uv add <package-name>
```

For specific package:
```bash
uv add --package psdonline <package-name>
```

### Linting and Formatting

```bash
# Check for issues
ruff check .

# Auto-fix issues
ruff check --fix .

# Format code
ruff format .
```

### Running Tests

```bash
# Python tests
uv run pytest tests/ -v --cov=src/materia

# SQLMesh tests
uv run sqlmesh -p transform/sqlmesh_materia test
```

## Secrets Management

All secrets are managed via **Pulumi ESC** environment `beanflows/prod`.

### Load secrets into shell:

```bash
# Quote the substitution so values containing spaces are not word-split
eval "$(esc env open beanflows/prod --format shell)"
```

### Run commands with secrets:

```bash
# Single command
esc run beanflows/prod -- uv run extract_psd

# Multiple commands (set -e stops at the first failure)
esc run beanflows/prod -- bash -c "
  set -e
  uv run extract_psd
  uv run sqlmesh -p transform/sqlmesh_materia plan prod
"
```
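
When secrets are loaded into the shell, it can help to fail early if an expected variable is absent. The following pre-flight check is a sketch under assumptions: `require_env` is a hypothetical helper, and this README does not list the variable names that `beanflows/prod` defines, so any name passed to it is a placeholder to be replaced.

```bash
# Hypothetical pre-flight check (not part of the repo): report every name
# from the argument list that is missing or empty in the environment, and
# return non-zero if any was reported.
require_env() {
  local name missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, after loading the environment, `require_env EXAMPLE_SECRET || exit 1` would abort a script before any pipeline step runs (`EXAMPLE_SECRET` is a placeholder, not a real variable of this project).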

## Production Architecture

### Git-Based Deployment

- **Supervisor** (Hetzner CPX11): Always-on orchestrator that pulls latest code every 15 minutes
- **Workers** (Ephemeral): Created on-demand for each pipeline run, destroyed after completion
- **Storage**: Cloudflare R2 Data Catalog (Apache Iceberg REST API)
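
To make the supervisor's "pull every 15 minutes" behaviour concrete, here is a minimal polling sketch. It is an illustration only, not the project's actual supervisor code: `has_new_commits`, `supervise`, and `POLL_INTERVAL_SECONDS` are hypothetical names, and the sketch assumes it runs inside a git checkout whose branch has an upstream configured.

```bash
# Hypothetical sketch, not the project's actual supervisor implementation.
POLL_INTERVAL_SECONDS=$((15 * 60))  # "every 15 minutes" from the list above

# Succeeds when the checked-out commit differs from the upstream branch tip.
# Assumes the current directory is a git checkout with an upstream configured.
has_new_commits() {
  git fetch --quiet || return 2
  [ "$(git rev-parse HEAD)" != "$(git rev-parse '@{u}')" ]
}

# Poll forever: fast-forward when upstream has moved, then sleep until the
# next check. A failed fetch is treated like "nothing new".
supervise() {
  while true; do
    if has_new_commits; then
      git pull --ff-only --quiet
    fi
    sleep "$POLL_INTERVAL_SECONDS"
  done
}
```

Using `--ff-only` keeps the checkout a plain mirror of the remote branch, which fits the idea that the supervisor never carries local changes.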

### CI/CD Pipeline

**GitLab CI** runs on every push to master:

1. **Lint** - `ruff check`
2. **Test** - pytest + SQLMesh tests
3. **Deploy** - Updates supervisor infrastructure and bootstraps if needed

There are no build artifacts: the supervisor pulls code directly from git.
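
The Lint and Test stages can be approximated locally before pushing. The script below is a sketch that reuses the commands from the Development Workflow section; `run_stage` and `ci_local` are hypothetical names, it is not the project's CI configuration, and the Deploy stage is deliberately left out.

```bash
# Hypothetical local mirror of the Lint and Test stages (not the CI config).

# Run one named stage: $1 is the stage name, the remaining arguments are the
# command to execute. Returns non-zero if the command fails.
run_stage() {
  local stage="$1"
  shift
  echo "==> $stage"
  "$@" || { echo "stage failed: $stage" >&2; return 1; }
}

# Run the stages in order and stop at the first failure.
ci_local() {
  run_stage lint ruff check . &&
  run_stage test-python uv run pytest tests/ -v --cov=src/materia &&
  run_stage test-sqlmesh uv run sqlmesh -p transform/sqlmesh_materia test
}
```

Running `ci_local` from the project root stops at the first failing stage, which roughly matches how a staged pipeline behaves.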

## Architecture Principles

- **Simplicity First** - Avoid unnecessary abstractions
- **Data-Oriented Design** - Identify data by content, not metadata
- **Cost Optimization** - Ephemeral workers, minimal always-on infrastructure
- **Inspectable** - Easy to understand, test locally, and debug

## Resources

- **Architecture Plans**: See `.claude/plans/` for design decisions
- **UV Docs**: https://docs.astral.sh/uv/
- **SQLMesh Docs**: https://sqlmesh.readthedocs.io/