# Materia Infrastructure

Pulumi-managed infrastructure for BeanFlows.coffee.

## Stack Overview

- **Storage:** Cloudflare R2 buckets with Iceberg Data Catalog
- **Compute:** Hetzner Cloud CCX dedicated vCPU instances
- **Orchestration:** Custom Python scheduler (see `src/orchestrator/`)

## Prerequisites

1. **Cloudflare Account**
   - Sign up at https://dash.cloudflare.com
   - Create API token with R2 + Data Catalog permissions
   - Get your Account ID from dashboard

2. **Hetzner Cloud Account**
   - Sign up at https://console.hetzner.cloud
   - Create API token with Read & Write permissions

3. **Pulumi Account** (optional, can use local state)
   - Sign up at https://app.pulumi.com
   - Or use local state with `pulumi login --local`

4. **SSH Key**
   - Generate if needed: `ssh-keygen -t ed25519 -C "materia-deploy"`

## Initial Setup

```bash
cd infra

# Login to Pulumi (local or cloud)
pulumi login  # or: pulumi login --local

# Initialize the stack
pulumi stack init dev

# Configure credentials and settings (values set with --secret are stored encrypted)
pulumi config set --secret cloudflare:apiToken <your-cloudflare-token>
pulumi config set cloudflare_account_id <your-account-id>
pulumi config set --secret hcloud:token <your-hetzner-token>
pulumi config set --secret ssh_public_key "$(cat ~/.ssh/id_ed25519.pub)"

# Preview changes
pulumi preview

# Deploy infrastructure
pulumi up
```
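Before running `pulumi preview`, it can help to confirm that all four values above are set. The following is a minimal sketch and not part of the repository. It assumes `pulumi config --json` prints a JSON object keyed by fully qualified config name, and that keys set without a namespace (such as `cloudflare_account_id`) are stored as `<project>:<key>`. The helper names `missing_keys` and `load_stack_config` are hypothetical.

```python
import json
import subprocess

# Config keys set by the setup commands above.
REQUIRED_KEYS = (
    "cloudflare:apiToken",
    "cloudflare_account_id",
    "hcloud:token",
    "ssh_public_key",
)


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent from a Pulumi config mapping.

    A key without a namespace (no ':') also matches '<project>:<key>',
    on the assumption that Pulumi stores it under the project name.
    """
    missing = []
    for key in required:
        if ":" in key:
            found = key in config
        else:
            found = any(
                name == key or name.endswith(":" + key) for name in config
            )
        if not found:
            missing.append(key)
    return missing


def load_stack_config():
    """Read the current stack's config through the Pulumi CLI.

    Assumes the output of `pulumi config --json` is a JSON object.
    """
    result = subprocess.run(
        ["pulumi", "config", "--json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

From the `infra` directory, `missing_keys(load_stack_config())` should return an empty list once every value has been configured.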

## What Gets Provisioned

### Cloudflare R2 Buckets

1. **materia-raw** - Raw data from extraction (immutable archives)
2. **materia-lakehouse** - Iceberg tables for SQLMesh (ACID transactions)

### Hetzner Cloud Servers and Firewall

1. **materia-scheduler** (CCX12: 2 vCPU, 8GB RAM)
   - Runs cron scheduler
   - Lightweight orchestration tasks
   - Always-on, low cost (~€6/mo)

2. **materia-worker-01** (CCX22: 4 vCPU, 16GB RAM)
   - Heavy SQLMesh transformations
   - Can be stopped when not in use
   - Scale up to CCX32/CCX42 for larger workloads (~€24-90/mo)

3. **materia-firewall**
   - SSH access (port 22)
   - All outbound traffic allowed
   - No inbound HTTP/HTTPS (we're not running web services yet)

## Enabling R2 Data Catalog (Iceberg)

As of October 2025, R2 Data Catalog is in public beta. Enable it manually:

1. Go to Cloudflare Dashboard → R2
2. Select the `materia-lakehouse` bucket
3. Navigate to Settings → Data Catalog
4. Click "Enable Data Catalog"

Once enabled, you can connect DuckDB to the Iceberg REST catalog:

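```python
import duckdb

# Get the catalog URI and warehouse name from Pulumi outputs:
#   pulumi stack output duckdb_r2_config

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg;")

# Store the R2 API token as a DuckDB secret, then attach the REST catalog.
# Replace the <...> placeholders with your own values. Option names can
# differ between versions of the DuckDB iceberg extension, so check them
# against the version you install.
conn.execute("""
    CREATE SECRET r2_secret (
        TYPE ICEBERG,
        TOKEN '<r2_api_token>'
    );
""")
conn.execute("""
    ATTACH '<warehouse_name>' AS lakehouse (
        TYPE ICEBERG,
        ENDPOINT '<catalog_uri>'
    );
""")
```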

## Server Access

Get server IPs from Pulumi outputs:

```bash
pulumi stack output scheduler_ip
pulumi stack output worker_ip
```

SSH into servers:

```bash
ssh root@<scheduler_ip>
ssh root@<worker_ip>
```
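The two steps above can also be combined in a small script. This is a sketch under the assumption that the `pulumi` CLI is on the `PATH` and is run from the `infra` directory; `stack_output` and `ssh_command` are hypothetical helper names, not part of the repository.

```python
import subprocess


def stack_output(name):
    """Return one Pulumi stack output as a string, e.g. 'scheduler_ip'."""
    result = subprocess.run(
        ["pulumi", "stack", "output", name],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def ssh_command(ip, user="root"):
    """Build the argument list for an SSH session to a server."""
    return ["ssh", f"{user}@{ip}"]
```

For example, `subprocess.run(ssh_command(stack_output("scheduler_ip")))` opens a session on the scheduler.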

## Cost Estimates (Monthly)

| Resource | Size / Type | Cost |
|----------|------|------|
| R2 Storage | 10 GB | $0.15 |
| R2 Operations | 1M reads | $0.36 |
| R2 Egress | Unlimited | $0.00 (zero egress!) |
| Scheduler | CCX12 | €6.00 |
| Worker (on-demand) | CCX22 | €24.00 |
| **Total** | | **~€30/mo (~$33)** |

For comparison, an equivalent AWS setup (S3 + EC2 + egress fees) would cost roughly $300-500/mo.
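To re-run the estimate for other volumes, the arithmetic behind the table can be sketched as follows. The unit prices are simply those implied by the rows above ($0.15 for 10 GB, $0.36 per million reads) and should be treated as assumptions rather than a current price list; the function names are hypothetical.

```python
# Unit prices implied by the cost table (assumptions, not a live price list).
R2_STORAGE_USD_PER_GB = 0.015    # $0.15 for 10 GB
R2_READS_USD_PER_MILLION = 0.36  # $0.36 for 1M reads
SERVER_EUR_PER_MONTH = {"CCX12": 6.00, "CCX22": 24.00}


def r2_cost_usd(storage_gb, reads_millions):
    """Monthly R2 cost in USD for storage plus read operations (egress is free)."""
    cost = (
        storage_gb * R2_STORAGE_USD_PER_GB
        + reads_millions * R2_READS_USD_PER_MILLION
    )
    return round(cost, 2)


def servers_cost_eur(server_types):
    """Monthly server cost in EUR for a list of server types."""
    return round(sum(SERVER_EUR_PER_MONTH[t] for t in server_types), 2)
```

With the table's inputs, `r2_cost_usd(10, 1)` gives the $0.51 of R2 charges and `servers_cost_eur(["CCX12", "CCX22"])` gives the €30 that dominates the total.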

## Scaling Workers

To add more worker capacity or different instance sizes:

1. Edit `infra/__main__.py` to add new server resources
2. Update worker config in `src/orchestrator/workers.yaml`
3. Run `pulumi up` to provision

Example worker sizes:
- CCX12: 2 vCPU, 8GB RAM (light workloads)
- CCX22: 4 vCPU, 16GB RAM (medium workloads)
- CCX32: 8 vCPU, 32GB RAM (heavy workloads)
- CCX42: 16 vCPU, 64GB RAM (very heavy workloads)
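When deciding which size to provision, a helper like the one below can pick the smallest listed type that satisfies a workload's needs. It is a minimal sketch using only the four sizes listed above; `smallest_fit` is a hypothetical name and is not part of the orchestrator.

```python
from typing import Optional

# Sizes listed above: name -> (vCPUs, RAM in GB), ordered smallest first.
CCX_SIZES = {
    "CCX12": (2, 8),
    "CCX22": (4, 16),
    "CCX32": (8, 32),
    "CCX42": (16, 64),
}


def smallest_fit(vcpus: int, ram_gb: int) -> Optional[str]:
    """Return the smallest listed size with at least the requested vCPUs and RAM.

    Returns None when no listed size is large enough.
    """
    for name, (cpu, ram) in CCX_SIZES.items():
        if cpu >= vcpus and ram >= ram_gb:
            return name
    return None
```

For instance, a job that needs 2 vCPUs but 20 GB of RAM resolves to CCX32, because RAM rather than CPU is the binding constraint.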

## Destroying Infrastructure

```bash
cd infra
pulumi destroy
```

**Warning:** This will delete all buckets and servers. Back up your data first!

## Next Steps

1. Deploy orchestrator to scheduler server (see `src/orchestrator/README.md`)
2. Configure SQLMesh to use R2 lakehouse (see `transform/sqlmesh_materia/config.yaml`)
3. Set up CI/CD pipeline to deploy on push (see `.gitlab-ci.yml`)