cleanup and prefect service setup

2026-02-04 22:24:55 +01:00
parent fc27d5f887
commit 6d4377ccf9
41 changed files with 15888 additions and 2591 deletions
--- a/infra/readme.md
+++ b/infra/readme.md
@@ -0,0 +1,161 @@
+# Materia Infrastructure
+
+Pulumi-managed infrastructure for BeanFlows.coffee
+
+## Stack Overview
+
+- **Storage:** Cloudflare R2 buckets with Iceberg Data Catalog
+- **Compute:** Hetzner Cloud CCX dedicated vCPU instances
+- **Orchestration:** Custom Python scheduler (see `src/orchestrator/`)
+
+## Prerequisites
+
+1. **Cloudflare Account**
+   - Sign up at https://dash.cloudflare.com
+   - Create API token with R2 + Data Catalog permissions
+   - Get your Account ID from dashboard
+
+2. **Hetzner Cloud Account**
+   - Sign up at https://console.hetzner.cloud
+   - Create API token with Read & Write permissions
+
+3. **Pulumi Account** (optional, can use local state)
+   - Sign up at https://app.pulumi.com
+   - Or use local state with `pulumi login --local`
+
+4. **SSH Key**
+   - Generate if needed: `ssh-keygen -t ed25519 -C "materia-deploy"`
+
+## Initial Setup
+
+```bash
+cd infra
+
+# Login to Pulumi (local or cloud)
+pulumi login  # or: pulumi login --local
+
+# Initialize the stack
+pulumi stack init dev
+
+# Configure credentials and settings (tokens are stored encrypted via --secret)
+pulumi config set --secret cloudflare:apiToken <your-cloudflare-token>
+pulumi config set cloudflare_account_id <your-account-id>
+pulumi config set --secret hcloud:token <your-hetzner-token>
+pulumi config set --secret ssh_public_key "$(cat ~/.ssh/id_ed25519.pub)"
+
+# Preview changes
+pulumi preview
+
+# Deploy infrastructure
+pulumi up
+```
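+
+Before running `pulumi preview`, it can help to confirm that every key from the commands above is actually set. The following is a minimal sketch, not part of the repository: it assumes the JSON printed by `pulumi config --json` is an object keyed by the full config name, and that un-namespaced keys such as `ssh_public_key` are stored under a `<project>:` prefix. The helper name `missing_config_keys` is hypothetical.
+
+```python
+import json
+
+# Keys set by the setup commands above.
+REQUIRED_KEYS = (
+    "cloudflare:apiToken",
+    "cloudflare_account_id",
+    "hcloud:token",
+    "ssh_public_key",
+)
+
+
+def missing_config_keys(config_json: str, required=REQUIRED_KEYS) -> list[str]:
+    """Return the required keys absent from `pulumi config --json` output."""
+    configured = json.loads(config_json).keys()
+    missing = []
+    for key in required:
+        # Assumption: un-namespaced keys appear as "<project>:<key>",
+        # so accept either an exact match or a matching suffix.
+        if not any(name == key or name.endswith(":" + key) for name in configured):
+            missing.append(key)
+    return missing
+```
+
+Feed it the captured output of `pulumi config --json`; an empty list means all four values are configured.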
+
+## What Gets Provisioned
+
+### Cloudflare R2 Buckets
+
+1. **materia-raw** - Raw data from extraction (immutable archives)
+2. **materia-lakehouse** - Iceberg tables for SQLMesh (ACID transactions)
+
+### Hetzner Cloud Servers and Firewall
+
+1. **materia-scheduler** (CCX12: 2 vCPU, 8GB RAM)
+   - Runs cron scheduler
+   - Lightweight orchestration tasks
+   - Always-on, low cost (~€6/mo)
+
+2. **materia-worker-01** (CCX22: 4 vCPU, 16GB RAM)
+   - Heavy SQLMesh transformations
+   - Can be stopped when not in use
+   - Scale up to CCX32/CCX42 for larger workloads (~€24-90/mo)
+
+3. **materia-firewall**
+   - SSH access (port 22)
+   - All outbound traffic allowed
+   - No inbound HTTP/HTTPS (we're not running web services yet)
+
+## Enabling R2 Data Catalog (Iceberg)
+
+As of October 2025, R2 Data Catalog is in public beta. Enable it manually:
+
+1. Go to Cloudflare Dashboard → R2
+2. Select the `materia-lakehouse` bucket
+3. Navigate to Settings → Data Catalog
+4. Click "Enable Data Catalog"
+
+Once enabled, you can connect DuckDB to the Iceberg REST catalog:
+
+```python
+import duckdb
+
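+# Get catalog URI from Pulumi outputs
+# pulumi stack output duckdb_r2_config
+
+conn = duckdb.connect()
+conn.execute("INSTALL iceberg; LOAD iceberg;")
+# Store the R2 API token as an Iceberg secret, then attach the REST catalog.
+# <warehouse_name> and <catalog_uri> are placeholders; both are shown on the
+# bucket's Data Catalog settings page once the catalog is enabled.
+conn.execute("CREATE SECRET r2_catalog_secret (TYPE ICEBERG, TOKEN '<r2_api_token>');")
+conn.execute("""
+    ATTACH '<warehouse_name>' AS lakehouse (
+        TYPE ICEBERG,
+        SECRET r2_catalog_secret,
+        ENDPOINT '<catalog_uri>'
+    );
+""")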
+```
+
+## Server Access
+
+Get server IPs from Pulumi outputs:
+
+```bash
+pulumi stack output scheduler_ip
+pulumi stack output worker_ip
+```
+
+SSH into servers:
+
+```bash
+ssh root@<scheduler_ip>
+ssh root@<worker_ip>
+```
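+
+The two steps above can be combined by reading all outputs at once. This is a small sketch that assumes `pulumi stack output --json` prints a flat JSON object containing the `scheduler_ip` and `worker_ip` outputs used above; the helper name `ssh_commands` is hypothetical.
+
+```python
+import json
+
+
+def ssh_commands(outputs_json: str, user: str = "root") -> dict[str, str]:
+    """Map each server IP output to the ssh command that reaches it."""
+    outputs = json.loads(outputs_json)
+    return {
+        name: f"ssh {user}@{outputs[name]}"
+        for name in ("scheduler_ip", "worker_ip")
+        if name in outputs  # skip servers that are not provisioned
+    }
+```
+
+Outputs that are missing from the stack are skipped rather than raising, so the helper also works when only the scheduler exists.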
+
+## Cost Estimates (Monthly)
+
+| Resource | Type / Usage | Cost |
+|----------|------|------|
+| R2 Storage | 10 GB | $0.15 |
+| R2 Operations | 1M reads | $0.36 |
+| R2 Egress | Unlimited | $0.00 (zero egress!) |
+| Scheduler | CCX12 | €6.00 |
+| Worker (on-demand) | CCX22 | €24.00 |
+| **Total** | | **~€30/mo (~$33)** |
+
+For comparison, a rough AWS equivalent (S3 + EC2 + egress fees) is estimated at ~$300-500/mo.
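+
+The total in the table can be reproduced with a few lines of arithmetic. The sketch below uses the unit prices implied by the table rows; the exchange rate of 1.10 USD per EUR is an assumption inferred from "~€30/mo (~$33)", and the function name is hypothetical.
+
+```python
+# Unit prices implied by the cost table.
+R2_STORAGE_USD_PER_GB = 0.015     # $0.15 for 10 GB
+R2_READS_USD_PER_MILLION = 0.36   # $0.36 for 1M reads
+SCHEDULER_EUR = 6.00              # CCX12
+WORKER_EUR = 24.00                # CCX22
+USD_PER_EUR = 1.10                # assumption: rate implied by the table total
+
+
+def monthly_cost_eur(storage_gb: float, million_reads: float) -> float:
+    """Estimate the monthly bill in EUR for a given R2 usage level."""
+    r2_usd = (
+        storage_gb * R2_STORAGE_USD_PER_GB
+        + million_reads * R2_READS_USD_PER_MILLION
+    )
+    servers_eur = SCHEDULER_EUR + WORKER_EUR
+    return round(servers_eur + r2_usd / USD_PER_EUR, 2)
+```
+
+With the table's usage (`monthly_cost_eur(10, 1)`), the R2 charges add under half a euro, so the servers dominate the estimate.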
+
+## Scaling Workers
+
+To add more worker capacity or different instance sizes:
+
+1. Edit `infra/__main__.py` to add new server resources
+2. Update worker config in `src/orchestrator/workers.yaml`
+3. Run `pulumi up` to provision
+
+Example worker sizes:
+- CCX12: 2 vCPU, 8GB RAM (light workloads)
+- CCX22: 4 vCPU, 16GB RAM (medium workloads)
+- CCX32: 8 vCPU, 32GB RAM (heavy workloads)
+- CCX42: 16 vCPU, 64GB RAM (very heavy workloads)
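+
+When choosing a size for a new worker, the list above can be treated as a lookup table. The following is an illustrative sketch only: `WorkerSize` and `smallest_fit` are hypothetical names and do not reflect the format of `src/orchestrator/workers.yaml`.
+
+```python
+from typing import NamedTuple, Optional
+
+
+class WorkerSize(NamedTuple):
+    name: str
+    vcpu: int
+    ram_gb: int
+
+
+# Sizes from the list above, ordered smallest to largest.
+WORKER_SIZES = [
+    WorkerSize("CCX12", 2, 8),
+    WorkerSize("CCX22", 4, 16),
+    WorkerSize("CCX32", 8, 32),
+    WorkerSize("CCX42", 16, 64),
+]
+
+
+def smallest_fit(min_vcpu: int, min_ram_gb: int) -> Optional[WorkerSize]:
+    """Return the smallest listed size meeting both requirements, or None."""
+    for size in WORKER_SIZES:
+        if size.vcpu >= min_vcpu and size.ram_gb >= min_ram_gb:
+            return size
+    return None
+```
+
+For example, a job needing 4 vCPU and 24 GB of RAM does not fit on a CCX22, so `smallest_fit(4, 24)` selects the CCX32.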
+
+## Destroying Infrastructure
+
+```bash
+cd infra
+pulumi destroy
+```
+
+**Warning:** This will delete all buckets and servers. Back up your data first!
+
+## Next Steps
+
+1. Deploy orchestrator to scheduler server (see `src/orchestrator/README.md`)
+2. Configure SQLMesh to use R2 lakehouse (see `transform/sqlmesh_materia/config.yaml`)
+3. Set up CI/CD pipeline to deploy on push (see `.gitlab-ci.yml`)