Refactor to git-based deployment: simplify CI/CD and supervisor

Addresses GitLab PR comments:
1. Remove hardcoded secrets from Pulumi.prod.yaml, use ESC environment
2. Simplify deployment by using git pull instead of R2 artifacts
3. Add bootstrap script for one-time supervisor setup

Major changes:
- **Pulumi config**: Use ESC environment (beanflows/prod) for all secrets
- **Supervisor script**: Git-based deployment (git pull every 15 min)
  * No more artifact downloads from R2
  * Runs code directly via `uv run materia`
  * Self-updating from master branch
- **Bootstrap script**: New infra/bootstrap_supervisor.sh for initial setup
  * One-time script to clone the repo and set up the systemd service
  * Idempotent and simple
- **CI/CD simplification**: Remove build and R2 deployment stages
  * Eliminated build:extract, build:transform, build:cli jobs
  * Eliminated deploy:r2 job
  * Simplified deploy:supervisor to just check bootstrap status
  * Reduced from 4 stages to 3 stages (Lint → Test → Deploy)
- **Documentation**: Updated CLAUDE.md with new architecture
  * Git-based deployment flow
  * Bootstrap instructions
  * Simplified execution model

Benefits:
- No hardcoded secrets in config files
- Simpler deployment (no artifact builds)
- Easy to test locally (just git clone + uv sync)
- Auto-updates every 15 minutes
- Fewer CI/CD jobs (faster pipelines)
- Cleaner separation of concerns

Inspired by TigerBeetle's CFO supervisor pattern.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Deeman
2025-10-13 20:31:38 +02:00
parent 60989675b0
commit 558829f70b
7 changed files with 285 additions and 296 deletions


@@ -168,11 +168,11 @@ pytest --cov=./ --cov-report=xml
### CI/CD Pipeline (`.gitlab-ci.yml`)
**4 Stages: Lint → Test → Build → Deploy**
**3 Stages: Lint → Test → Deploy**
#### 1. Lint Stage
- Runs `ruff check` and `ruff format --check`
- Validates code quality on every commit
- Runs `ruff check` on every commit
- Validates code quality
#### 2. Test Stage
- **`test:cli`**: Runs pytest on materia CLI with 71% coverage
@@ -182,53 +182,51 @@ pytest --cov=./ --cov-report=xml
- Exports coverage reports to GitLab
- **`test:sqlmesh`**: Runs SQLMesh model tests in transform layer
#### 3. Build Stage (only on master branch)
Creates separate artifacts for each workspace package:
- **`build:extract`**: Builds `materia-extract-latest.tar.gz` (psdonline package)
- **`build:transform`**: Builds `materia-transform-latest.tar.gz` (sqlmesh_materia package)
- **`build:cli`**: Builds `materia-cli-latest.tar.gz` (materia management CLI)
Each artifact is a self-contained tarball with all dependencies.
#### 4. Deploy Stage (only on master branch)
- **`deploy:r2`**: Uploads artifacts to Cloudflare R2 using rclone
- Loads secrets from Pulumi ESC (`beanflows/prod`)
- Only requires `PULUMI_ACCESS_TOKEN` in GitLab variables
- All other secrets (R2 credentials, SSH keys, API tokens) come from ESC
#### 3. Deploy Stage (only on master branch)
- **`deploy:infra`**: Runs `pulumi up` to ensure supervisor instance exists
- Runs on every master push (not just on infra changes)
- Creates/updates Hetzner CCX11 supervisor instance
- Configures Cloudflare R2 buckets (`beanflows-artifacts`, `beanflows-data-prod`)
- **`deploy:supervisor`**: Deploys supervisor script and materia CLI
- Runs after `deploy:r2` and `deploy:infra`
- Copies `supervisor.sh` and systemd service to supervisor instance
- Downloads and installs latest materia CLI from R2
- Restarts supervisor service to pick up changes
- Runs on every master push
- Creates/updates Hetzner CPX11 supervisor instance (~€4.49/mo)
- Uses Pulumi ESC (`beanflows/prod`) for all secrets
- **`deploy:supervisor`**: Checks supervisor status
- Verifies supervisor is bootstrapped
- Supervisor auto-updates via `git pull` every 15 minutes (no CI/CD deployment needed)
### Production Architecture: Ephemeral Worker Model
**Note:** No build artifacts! Supervisor pulls code directly from git and runs via `uv`.
### Production Architecture: Git-Based Deployment with Ephemeral Workers
**Design Philosophy:**
- No always-on workers (cost optimization)
- Supervisor instance dynamically creates/destroys workers on-demand
- Language-agnostic artifacts enable future migration to C/Rust/Go
- Supervisor pulls latest code from git (no artifact builds)
- Supervisor dynamically creates/destroys workers on-demand
- Simple, inspectable, easy to test locally
- Multi-cloud abstraction for pricing optimization
**Components:**
#### 1. Supervisor Instance (Small Hetzner VM)
- Runs `supervisor.sh` - continuous orchestration loop (inspired by TigerBeetle's CFO supervisor)
- Hetzner CCX11: 2 vCPU, 4GB RAM (~€4/mo)
- Hetzner CPX11: 2 vCPU (shared), 2GB RAM (~€4.49/mo)
- Always-on, minimal resource usage
- Checks for new CLI versions every hour (self-updating)
- Git-based deployment: `git pull` every 15 minutes for auto-updates
- Runs pipelines on schedule:
- Extract: Daily at 2 AM UTC
- Transform: Daily at 3 AM UTC
- Uses systemd service for automatic restart on failure
- Pulls secrets from Pulumi ESC and passes to workers
- Pulls secrets from Pulumi ESC
**Bootstrap (one-time):**
```bash
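# Get supervisor IP from Pulumi (the subshell keeps the working directory at the repo root)
(cd infra && pulumi stack output supervisor_ip -s prod)
# Run bootstrap script from the repo root. ssh does not forward locally exported
# variables by default, so the token is set on the remote command line
# (assuming the bootstrap script reads PULUMI_ACCESS_TOKEN on the supervisor).
export PULUMI_ACCESS_TOKEN=<your-token>
ssh root@<supervisor-ip> "PULUMI_ACCESS_TOKEN='$PULUMI_ACCESS_TOKEN' bash -s" < infra/bootstrap_supervisor.sh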
```
#### 2. Ephemeral Workers (On-Demand)
- Created for each pipeline execution
- Downloads pre-built artifacts from R2 (no git, no uv on worker)
- Created for each pipeline execution by materia CLI
- Receives secrets via SSH environment variable injection
- Destroyed immediately after job completion
- Different instance types per pipeline:
@@ -239,18 +237,20 @@ Each artifact is a self-contained tarball with all dependencies.
```
Pulumi ESC (beanflows/prod)
Supervisor Instance (materia CLI)
Supervisor Instance (via esc CLI)
Workers (injected as env vars via SSH)
```
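The last hop of this flow (env vars injected via SSH) might look roughly like the sketch below. It is an illustration under assumptions, not the materia CLI's actual code: `build_remote_command` is a hypothetical helper, and the only names taken from this document are `R2_ACCESS_KEY_ID` and `./extract_psd`. The helper prefixes a command with one `export NAME=value &&` per secret, quoting each value with bash's `printf '%q'` so that spaces or shell metacharacters in a secret survive the remote shell.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: build a remote command that exports secrets, then runs a job.
set -euo pipefail

# build_remote_command CMD NAME...
# Prints "export NAME=<quoted value> && ... && CMD" for each named variable.
# Values are read from the current environment via indirect expansion.
build_remote_command() {
  local cmd="$1"; shift
  local out="" name
  for name in "$@"; do
    # %q quotes the value so the remote shell reads it back unchanged.
    out+="export ${name}=$(printf '%q' "${!name}") && "
  done
  printf '%s%s' "$out" "$cmd"
}
```

A call could then look like `ssh "root@$WORKER_IP" "$(build_remote_command './extract_psd' R2_ACCESS_KEY_ID)"`, where `WORKER_IP` is a placeholder. One trade-off of this approach is that the secrets appear on the remote command line, which is easier to accept on a worker that is destroyed right after the job than on a long-lived host.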
#### 4. Artifact Flow
#### 4. Code Deployment Flow
```
GitLab CI: uv build → tar.gz
GitLab (master branch)
Cloudflare R2 (artifact storage)
Supervisor: git pull origin master (every 15 min)
Worker: curl → extract → execute
Supervisor: uv sync (update dependencies)
Supervisor: uv run materia pipeline run <pipeline>
```
#### 5. Data Storage
@@ -261,12 +261,12 @@ Worker: curl → extract → execute
**Execution Flow:**
1. Supervisor loop wakes up every 15 minutes
2. Checks if current time matches pipeline schedule (e.g., 2 AM for extract)
3. Checks for CLI updates (hourly) and self-updates if needed
4. CLI runs: `materia pipeline run extract`
5. Creates Hetzner worker with SSH key
6. Worker downloads `materia-extract-latest.tar.gz` from R2
7. CLI injects secrets via SSH: `export R2_ACCESS_KEY_ID=... && ./extract_psd`
2. Runs `git fetch` and checks for new commits on master
3. If updates available: `git pull && uv sync`
4. Checks if current time matches pipeline schedule (e.g., 2 AM for extract)
5. If scheduled: `uv run materia pipeline run extract`
6. CLI creates Hetzner worker with SSH key
7. CLI injects secrets via SSH and executes pipeline
8. Pipeline executes, writes to R2 Iceberg catalog
9. Worker destroyed (entire lifecycle ~5-10 minutes)
10. Supervisor logs results and continues loop
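
The execution flow above can be sketched as a small bash loop. This is a minimal illustration, not the repository's `supervisor.sh`: `REPO_DIR`, `STATE_DIR`, the `RUN_SUPERVISOR` guard, and every function name are assumptions made for the example. The once-per-day marker file is also an addition, included because a loop that wakes every 15 minutes sees the scheduled hour four times.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the supervisor loop; names and paths are assumptions.
set -euo pipefail

REPO_DIR="${REPO_DIR:-/opt/materia}"                    # assumed checkout location
STATE_DIR="${STATE_DIR:-/var/lib/materia-supervisor}"   # assumed marker directory
BRANCH="master"
INTERVAL_SECONDS=900                                    # 15 minutes

# Map a UTC hour (00-23) to the pipeline scheduled for it; prints nothing otherwise.
pipeline_for_hour() {
  case "$1" in
    02) echo "extract" ;;
    03) echo "transform" ;;
  esac
}

# Succeeds when the local and remote revisions differ.
needs_update() { [ "$1" != "$2" ]; }

# Succeeds when pipeline $1 was already recorded as run on date $2 (YYYY-MM-DD).
already_ran() {
  [ -f "$STATE_DIR/$1.last" ] && [ "$(cat "$STATE_DIR/$1.last")" = "$2" ]
}

# Record that pipeline $1 ran on date $2.
mark_ran() { mkdir -p "$STATE_DIR"; echo "$2" > "$STATE_DIR/$1.last"; }

# Fetch, and pull + sync dependencies only when master has moved.
update_code() {
  local local_rev remote_rev
  git -C "$REPO_DIR" fetch origin "$BRANCH" || return 1
  local_rev=$(git -C "$REPO_DIR" rev-parse HEAD) || return 1
  remote_rev=$(git -C "$REPO_DIR" rev-parse "origin/$BRANCH") || return 1
  if needs_update "$local_rev" "$remote_rev"; then
    git -C "$REPO_DIR" pull origin "$BRANCH" && (cd "$REPO_DIR" && uv sync)
  fi
}

# One loop iteration: update the code, then run the scheduled pipeline if it is due.
run_once() {
  local hour today pipeline
  update_code || echo "update failed; using current checkout" >&2
  hour=$(date -u +%H)
  today=$(date -u +%F)
  pipeline=$(pipeline_for_hour "$hour")
  if [ -n "$pipeline" ] && ! already_ran "$pipeline" "$today"; then
    if (cd "$REPO_DIR" && uv run materia pipeline run "$pipeline"); then
      mark_ran "$pipeline" "$today"
    fi
  fi
}

# Only loop when explicitly enabled, so the helpers can be sourced and tested.
if [ "${RUN_SUPERVISOR:-0}" = "1" ]; then
  while true; do
    run_once || echo "iteration failed; continuing" >&2
    sleep "$INTERVAL_SECONDS"
  done
fi
```

Running it with `RUN_SUPERVISOR=1` starts the loop; without that variable the file only defines the helpers. Recording the marker only after a successful run means a failed pipeline is retried on the next wake-up within the same scheduled hour.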