Deeman
2025-10-12 14:26:55 +02:00
parent 77dd277ebf
commit 790e802edd
6 changed files with 708 additions and 0 deletions

infra/Pulumi.dev.yaml Normal file

@@ -0,0 +1,15 @@
# Development stack configuration
# Set actual values with: pulumi config set <key> <value>
# Set secrets with: pulumi config set --secret <key> <value>
config:
  # Cloudflare configuration
  cloudflare:apiToken: # Set with: pulumi config set --secret cloudflare:apiToken <token>
  materia-infrastructure:cloudflare_account_id: # Set with: pulumi config set cloudflare_account_id <id>
  # Hetzner configuration
  hcloud:token: # Set with: pulumi config set --secret hcloud:token <token>
  materia-infrastructure:hetzner_location: "nbg1" # Nuremberg, Germany
  # SSH key for server access
  materia-infrastructure:ssh_public_key: # Set with: pulumi config set --secret ssh_public_key "$(cat ~/.ssh/id_ed25519.pub)"

infra/Pulumi.yaml Normal file

@@ -0,0 +1,6 @@
name: materia-infrastructure
runtime:
  name: python
  options:
    virtualenv: ../.venv
description: BeanFlows.coffee infrastructure on Cloudflare R2 + Hetzner

infra/README.md Normal file

@@ -0,0 +1,161 @@
# Materia Infrastructure
Pulumi-managed infrastructure for BeanFlows.coffee
## Stack Overview
- **Storage:** Cloudflare R2 buckets with Iceberg Data Catalog
- **Compute:** Hetzner Cloud CCX dedicated vCPU instances
- **Orchestration:** Custom Python scheduler (see `src/orchestrator/`)
## Prerequisites
1. **Cloudflare Account**
   - Sign up at https://dash.cloudflare.com
   - Create an API token with R2 + Data Catalog permissions
   - Get your Account ID from the dashboard
2. **Hetzner Cloud Account**
   - Sign up at https://console.hetzner.cloud
   - Create an API token with Read & Write permissions
3. **Pulumi Account** (optional, can use local state)
   - Sign up at https://app.pulumi.com
   - Or use local state with `pulumi login --local`
4. **SSH Key**
   - Generate if needed: `ssh-keygen -t ed25519 -C "materia-deploy"`
## Initial Setup
```bash
cd infra
# Login to Pulumi (local or cloud)
pulumi login # or: pulumi login --local
# Initialize the stack
pulumi stack init dev
# Configure secrets
pulumi config set --secret cloudflare:apiToken <your-cloudflare-token>
pulumi config set cloudflare_account_id <your-account-id>
pulumi config set --secret hcloud:token <your-hetzner-token>
pulumi config set --secret ssh_public_key "$(cat ~/.ssh/id_ed25519.pub)"
# Preview changes
pulumi preview
# Deploy infrastructure
pulumi up
```
## What Gets Provisioned
### Cloudflare R2 Buckets
1. **materia-raw** - Raw data from extraction (immutable archives)
2. **materia-lakehouse** - Iceberg tables for SQLMesh (ACID transactions)
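Both buckets speak R2's S3-compatible API, so extraction jobs can write to `materia-raw` with any S3 client. A minimal sketch (the per-account endpoint format is R2's standard S3 endpoint; the key layout and helper names are hypothetical, not part of this repo):

```python
def r2_endpoint(account_id: str) -> str:
    """R2 exposes an S3-compatible API at a per-account endpoint."""
    return f"https://{account_id}.r2.cloudflarestorage.com"

def raw_key(dataset: str, run_date: str, filename: str) -> str:
    """Hypothetical key layout for immutable extraction archives."""
    return f"{dataset}/{run_date}/{filename}"

# With boto3 (sketch; credentials come from an R2 API token):
#   s3 = boto3.client("s3", endpoint_url=r2_endpoint("<account_id>"),
#                     aws_access_key_id="<key_id>", aws_secret_access_key="<secret>")
#   s3.put_object(Bucket="materia-raw",
#                 Key=raw_key("orders", "2025-10-12", "orders.json.gz"),
#                 Body=payload)
```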
### Hetzner Cloud Resources
1. **materia-scheduler** (CCX12: 2 vCPU, 8GB RAM)
   - Runs the cron scheduler
   - Lightweight orchestration tasks
   - Always-on, low cost (~€6/mo)
2. **materia-worker-01** (CCX22: 4 vCPU, 16GB RAM)
   - Heavy SQLMesh transformations
   - Can be stopped when not in use
   - Scale up to CCX32/CCX42 for larger workloads (~€24-90/mo)
3. **materia-firewall**
   - SSH access (port 22)
   - All outbound traffic allowed
   - No inbound HTTP/HTTPS (we're not running web services yet)
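SSH is currently open to the world. One way to narrow it is a config-driven CIDR allowlist; a sketch assuming a hypothetical `allowed_ssh_cidrs` config key (the helper below is illustrative, not in the stack yet):

```python
def parse_cidrs(raw: str) -> list[str]:
    """Split a comma-separated CIDR list, dropping empty entries."""
    return [c.strip() for c in raw.split(",") if c.strip()]

# In infra/__main__.py (sketch):
#   pulumi config set allowed_ssh_cidrs "203.0.113.7/32"
#   ssh_sources = parse_cidrs(config.get("allowed_ssh_cidrs") or "0.0.0.0/0,::/0")
# ...then pass ssh_sources as source_ips on the inbound SSH firewall rule.
```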
## Enabling R2 Data Catalog (Iceberg)
As of October 2025, R2 Data Catalog is in public beta. Enable it manually:
1. Go to Cloudflare Dashboard → R2
2. Select the `materia-lakehouse` bucket
3. Navigate to Settings → Data Catalog
4. Click "Enable Data Catalog"
Once enabled, you can connect DuckDB to the Iceberg REST catalog:
```python
import duckdb

# Get catalog details from Pulumi outputs:
#   pulumi stack output duckdb_r2_config
conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg;")
# Authenticate to the catalog with an R2 API token (needs Data Catalog permissions)
conn.execute("CREATE SECRET r2_catalog (TYPE ICEBERG, TOKEN '<r2_api_token>');")
# Attach the catalog via its Iceberg REST endpoint; the warehouse name is
# shown in the bucket's Data Catalog settings
conn.execute("""
    ATTACH '<warehouse>' AS lakehouse (
        TYPE ICEBERG,
        ENDPOINT 'https://catalog.cloudflarestorage.com/<account_id>/r2-data-catalog'
    );
""")
```
## Server Access
Get server IPs from Pulumi outputs:
```bash
pulumi stack output scheduler_ip
pulumi stack output worker_ip
```
SSH into servers:
```bash
ssh root@<scheduler_ip>
ssh root@<worker_ip>
```
## Cost Estimates (Monthly)
| Resource | Type | Cost |
|----------|------|------|
| R2 Storage | 10 GB | $0.15 |
| R2 Operations | 1M reads | $0.36 |
| R2 Egress | Unlimited | $0.00 (zero egress!) |
| Scheduler | CCX12 | €6.00 |
| Worker (on-demand) | CCX22 | €24.00 |
| **Total** | | **~€30/mo (~$33)** |
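The total row can be sanity-checked in a few lines (the EUR/USD rate of ~1.10 is an assumption, not from the source):

```python
# Monthly costs from the table above
eur_items = {"scheduler_ccx12": 6.00, "worker_ccx22": 24.00}            # EUR
usd_items = {"r2_storage": 0.15, "r2_reads": 0.36, "r2_egress": 0.00}   # USD

EUR_USD = 1.10  # assumed exchange rate
total_eur = sum(eur_items.values()) + sum(usd_items.values()) / EUR_USD
total_usd = total_eur * EUR_USD
```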
Compare to AWS equivalent: ~$300-500/mo with S3 + EC2 + egress fees.
## Scaling Workers
To add more worker capacity or different instance sizes:
1. Edit `infra/__main__.py` to add new server resources
2. Update worker config in `src/orchestrator/workers.yaml`
3. Run `pulumi up` to provision
Example worker sizes:
- CCX12: 2 vCPU, 8GB RAM (light workloads)
- CCX22: 4 vCPU, 16GB RAM (medium workloads)
- CCX32: 8 vCPU, 32GB RAM (heavy workloads)
- CCX42: 16 vCPU, 64GB RAM (very heavy workloads)
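The size list above can be encoded so job configs pick an instance type by resource need; a sketch (the spec table mirrors the list above; the helper name is hypothetical):

```python
# vCPU / RAM (GB) per Hetzner CCX type, from the list above
CCX_SIZES = {
    "ccx12": {"vcpu": 2, "ram_gb": 8},
    "ccx22": {"vcpu": 4, "ram_gb": 16},
    "ccx32": {"vcpu": 8, "ram_gb": 32},
    "ccx42": {"vcpu": 16, "ram_gb": 64},
}

def pick_worker_type(ram_gb_needed: int) -> str:
    """Return the smallest CCX type with enough RAM for the job."""
    for name, spec in sorted(CCX_SIZES.items(), key=lambda kv: kv[1]["ram_gb"]):
        if spec["ram_gb"] >= ram_gb_needed:
            return name
    raise ValueError(f"no CCX type with >= {ram_gb_needed} GB RAM")
```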
## Destroying Infrastructure
```bash
cd infra
pulumi destroy
```
**Warning:** This will delete all buckets and servers. Backup data first!
## Next Steps
1. Deploy orchestrator to scheduler server (see `src/orchestrator/README.md`)
2. Configure SQLMesh to use R2 lakehouse (see `transform/sqlmesh_materia/config.yaml`)
3. Set up CI/CD pipeline to deploy on push (see `.gitlab-ci.yml`)

infra/__main__.py Normal file

@@ -0,0 +1,167 @@
"""
BeanFlows.coffee Infrastructure
Cloudflare R2 + Iceberg + Hetzner compute stack
"""
import pulumi
import pulumi_cloudflare as cloudflare
import pulumi_hcloud as hcloud
# Load configuration
config = pulumi.Config()
cloudflare_account_id = config.require("cloudflare_account_id")
hetzner_location = config.get("hetzner_location") or "nbg1" # Nuremberg datacenter
# ============================================================
# Cloudflare R2 Storage + Data Catalog (Iceberg)
# ============================================================
# R2 bucket for raw data (extraction outputs)
raw_bucket = cloudflare.R2Bucket(
    "materia-raw",
    account_id=cloudflare_account_id,
    name="materia-raw",
    location="weur",  # Western Europe
)

# R2 bucket for lakehouse (Iceberg tables)
lakehouse_bucket = cloudflare.R2Bucket(
    "materia-lakehouse",
    account_id=cloudflare_account_id,
    name="materia-lakehouse",
    location="weur",
)
# TODO: Enable R2 Data Catalog (Iceberg) on lakehouse bucket
# Note: As of Oct 2025, R2 Data Catalog is in public beta
# May need to enable via Cloudflare dashboard or API once SDK supports it
# For now, document manual step in README
# API token for R2 access (needs R2 + Data Catalog permissions)
# Note: Create this manually in Cloudflare dashboard and store in Pulumi config
# pulumi config set --secret cloudflare_r2_token <token>
# ============================================================
# Hetzner Cloud Infrastructure
# ============================================================
# SSH key for server access
ssh_key = hcloud.SshKey(
    "materia-ssh-key",
    name="materia-deployment-key",
    public_key=config.require_secret("ssh_public_key"),
)
# Cloud-init script shared by both servers.
# Note: Ubuntu 24.04 ships Python 3.12; python3.13 is not in the default apt
# repos (it needs e.g. the deadsnakes PPA), so install the distro default here.
base_user_data = """#!/bin/bash
# Basic server setup
apt-get update
apt-get install -y python3 python3-pip git curl
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Make uv available in login shells (older installers used ~/.cargo/bin,
# newer ones use ~/.local/bin, so put both on PATH)
echo 'export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"' >> /root/.bashrc
"""

# Small CCX instance for the scheduler/orchestrator
# This runs the cron scheduler + lightweight tasks
scheduler_server = hcloud.Server(
    "materia-scheduler",
    name="materia-scheduler",
    server_type="ccx12",  # 2 vCPU, 8GB RAM, ~€6/mo
    image="ubuntu-24.04",
    location=hetzner_location,
    ssh_keys=[ssh_key.id],
    labels={
        "role": "scheduler",
        "project": "materia",
    },
    user_data=base_user_data,
)

# Larger CCX instance for heavy SQLMesh workloads
# This gets spun up on demand for big transformations
worker_server = hcloud.Server(
    "materia-worker-01",
    name="materia-worker-01",
    server_type="ccx22",  # 4 vCPU, 16GB RAM, ~€24/mo
    image="ubuntu-24.04",
    location=hetzner_location,
    ssh_keys=[ssh_key.id],
    labels={
        "role": "worker",
        "project": "materia",
    },
    user_data=base_user_data,
)
# Firewall for servers (restrict to SSH + outbound only)
firewall = hcloud.Firewall(
    "materia-firewall",
    name="materia-firewall",
    rules=[
        # Allow SSH from anywhere (consider restricting to your IP)
        hcloud.FirewallRuleArgs(
            direction="in",
            protocol="tcp",
            port="22",
            source_ips=["0.0.0.0/0", "::/0"],
        ),
        # Allow all outbound traffic
        hcloud.FirewallRuleArgs(
            direction="out",
            protocol="tcp",
            port="any",
            destination_ips=["0.0.0.0/0", "::/0"],
        ),
        hcloud.FirewallRuleArgs(
            direction="out",
            protocol="udp",
            port="any",
            destination_ips=["0.0.0.0/0", "::/0"],
        ),
    ],
)

# Apply the firewall to each server
scheduler_firewall = hcloud.FirewallAttachment(
    "scheduler-firewall",
    firewall_id=firewall.id,
    server_ids=[scheduler_server.id],
)

worker_firewall = hcloud.FirewallAttachment(
    "worker-firewall",
    firewall_id=firewall.id,
    server_ids=[worker_server.id],
)
# ============================================================
# Outputs
# ============================================================
pulumi.export("raw_bucket_name", raw_bucket.name)
pulumi.export("lakehouse_bucket_name", lakehouse_bucket.name)
pulumi.export("scheduler_ip", scheduler_server.ipv4_address)
pulumi.export("worker_ip", worker_server.ipv4_address)
# Export connection info for DuckDB
pulumi.export(
    "duckdb_r2_config",
    pulumi.Output.all(cloudflare_account_id, lakehouse_bucket.name).apply(
        lambda args: {
            "account_id": args[0],
            "bucket": args[1],
            "catalog_uri": f"https://catalog.cloudflarestorage.com/{args[0]}/r2-data-catalog",
        }
    ),
)