Refactor to local-first architecture on Hetzner NVMe

Remove distributed R2/Iceberg/SSH pipeline architecture in favor of
local subprocess execution with NVMe storage. Landing data backed up
to R2 via rclone timer.

- Strip Iceberg catalog, httpfs, boto3, paramiko, prefect, pyarrow
- Pipelines run via subprocess.run() with bounded timeouts (see the sketch after this list)
- Extract writes to {LANDING_DIR}/psd/{year}/{month}/{etag}.csv.gzip
- SQLMesh reads LANDING_DIR variable, writes to DUCKDB_PATH
- Delete unused provider stubs (ovh, scaleway, oracle)
- Add rclone systemd timer for R2 backup every 6h
- Update supervisor to run pipelines with env vars
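
As a rough illustration of the new execution model, here is a minimal sketch of a subprocess-based runner with bounded timeouts; the function name, command table, and timeout value are assumptions for illustration, not the actual `pipelines.py` implementation:

```
# Hypothetical sketch only: run a named pipeline locally with a bounded timeout.
import os
import subprocess

# Assumed command table; the real pipeline definitions may differ.
PIPELINES = {
    "extract": ["uv", "run", "extract_psd"],
    "transform": ["uv", "run", "sqlmesh", "-p", "transform/sqlmesh_materia", "run"],
}

def run_pipeline(name: str, timeout_s: int = 3600) -> int:
    """Run the pipeline as a child process; raises subprocess.TimeoutExpired past timeout_s."""
    env = {
        **os.environ,
        "LANDING_DIR": os.environ.get("LANDING_DIR", "data/landing"),
        "DUCKDB_PATH": os.environ.get("DUCKDB_PATH", "local.duckdb"),
    }
    result = subprocess.run(PIPELINES[name], env=env, timeout=timeout_s, check=False)
    return result.returncode
```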

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deeman
2026-02-18 18:05:41 +01:00
parent 910424c956
commit c1d00dcdc4
25 changed files with 231 additions and 1807 deletions


@@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
## Project Overview
Materia is a commodity data analytics platform (product: **BeanFlows.coffee**) for coffee traders. It's a uv workspace monorepo with three packages: extraction (USDA PSD data), SQL transformation (SQLMesh + DuckDB), and a CLI for orchestrating cloud workers and pipelines.
Materia is a commodity data analytics platform (product: **BeanFlows.coffee**) for coffee traders. It's a uv workspace monorepo with three packages: extraction (USDA PSD data), SQL transformation (SQLMesh + DuckDB), and a CLI for worker management and local pipeline execution.
## Commands
@@ -25,7 +25,7 @@ cd transform/sqlmesh_materia && uv run sqlmesh test # SQLMesh model tests
uv run pytest tests/test_cli.py::test_name -v
# Extract data
uv run extract_psd
LANDING_DIR=data/landing uv run extract_psd
# SQLMesh (from repo root)
uv run sqlmesh -p transform/sqlmesh_materia plan # Plans to dev_<username> by default
@@ -33,43 +33,45 @@ uv run sqlmesh -p transform/sqlmesh_materia plan prod # Production
uv run sqlmesh -p transform/sqlmesh_materia test # Run model tests
uv run sqlmesh -p transform/sqlmesh_materia format # Format SQL
# With production secrets
esc run beanflows/prod -- <command>
# CLI
uv run materia pipeline run extract|transform
uv run materia pipeline list
uv run materia worker create|destroy|list
uv run materia pipeline run
uv run materia secrets get
```
## Architecture
**Workspace packages** (`tool.uv.workspace` in `pyproject.toml`):
- `extract/psdonline/` — Downloads USDA PSD Online data, normalizes ZIP→gzip CSV, uploads to R2
- `transform/sqlmesh_materia/` — 4-layer SQL transformation pipeline (DuckDB + Iceberg)
- `src/materia/` — CLI (Typer) for worker management, pipeline orchestration, secrets
- `extract/psdonline/` — Downloads USDA PSD Online data, normalizes ZIP→gzip CSV, writes to local landing directory
- `transform/sqlmesh_materia/` — 4-layer SQL transformation pipeline (local DuckDB)
- `src/materia/` — CLI (Typer) for pipeline execution, worker management, secrets
- `web/` — Future web frontend
**Data flow:**
```
USDA API → extract (psdonline) → R2/local CSV → SQLMesh transforms → DuckDB/Iceberg
USDA API → extract → /data/materia/landing/psd/{year}/{month}/{etag}.csv.gzip
→ rclone cron syncs landing/ to R2
→ SQLMesh raw → staging → cleaned → serving → /data/materia/lakehouse.duckdb
→ Web app reads lakehouse.duckdb (read-only)
```
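A small, purely illustrative sketch of how the extractor could build the landing path shown in the flow above; the helper names and the use of the ETag as filename are assumptions:

```
# Illustrative only: builds {LANDING_DIR}/psd/{year}/{month}/{etag}.csv.gzip
# and writes gzip-compressed CSV bytes to it.
import gzip
import os
from datetime import date, datetime
from pathlib import Path

def landing_path(etag: str, run_date: date | None = None) -> Path:
    run_date = run_date or datetime.now().date()
    root = Path(os.environ.get("LANDING_DIR", "data/landing"))
    return root / "psd" / f"{run_date:%Y}" / f"{run_date:%m}" / f"{etag}.csv.gzip"

def write_landing_csv(csv_bytes: bytes, etag: str) -> Path:
    path = landing_path(etag)
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wb") as f:
        f.write(csv_bytes)
    return path
```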
**SQLMesh 4-layer model structure** (`transform/sqlmesh_materia/models/`):
1. `raw/` — Immutable source reads (read_csv from extracted files)
1. `raw/` — Immutable source reads (read_csv from landing directory)
2. `staging/` — Type casting, lookup joins, basic cleansing
3. `cleaned/` — Business logic, pivoting, integration
4. `serving/` — Analytics-ready facts, dimensions, aggregates
**CLI modules** (`src/materia/`):
- `cli.py` — Typer app with subcommands: worker, pipeline, secrets, version
- `workers.py` — Ephemeral cloud instance management (Hetzner, with planned OVH/Scaleway/Oracle)
- `pipelines.py` — SSH-based pipeline execution on workers (download artifact, run, destroy)
- `workers.py` — Hetzner cloud instance management (for ad-hoc compute)
- `pipelines.py` — Local subprocess pipeline execution with bounded timeouts
- `secrets.py` — Pulumi ESC integration for environment secrets
**Infrastructure** (`infra/`):
- Pulumi IaC for Cloudflare R2 buckets and Hetzner compute
- Supervisor systemd service for always-on orchestration (pulls git every 15 min)
- Supervisor systemd service for always-on orchestration (pulls git, runs pipelines)
- rclone systemd timer for landing data backup to R2
## Coding Philosophy
@@ -87,7 +89,14 @@ Read `coding_philosophy.md` for the full guide. Key points:
- **Python 3.13** (`.python-version`)
- **Ruff**: double quotes, spaces, E501 ignored (formatter handles line length)
- **SQLMesh**: DuckDB dialect, `@daily` cron, start date `2025-07-07`, default env `dev_{{ user() }}`
- **Storage**: Cloudflare R2 with Iceberg catalog (zero egress cost)
- **Storage**: Local NVMe (`LANDING_DIR`, `DUCKDB_PATH`), R2 for backup via rclone
- **Secrets**: Pulumi ESC (`esc run beanflows/prod -- <cmd>`)
- **CI**: GitLab CI (`.gitlab/.gitlab-ci.yml`) — runs pytest and sqlmesh test on push/MR
- **Pre-commit hooks**: installed via `pre-commit install`
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root directory for extracted landing data |
| `DUCKDB_PATH` | `local.duckdb` | Path to the DuckDB lakehouse database |
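
For reference, one way a package could resolve these variables (hypothetical snippet; the defaults mirror the table above):

```
# Hypothetical settings snippet; defaults match the table above.
import os
from pathlib import Path

LANDING_DIR = Path(os.environ.get("LANDING_DIR", "data/landing"))
DUCKDB_PATH = Path(os.environ.get("DUCKDB_PATH", "local.duckdb"))
```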