# Materia Infrastructure

Pulumi-managed infrastructure for BeanFlows.coffee.

## Stack Overview

- **Storage:** Cloudflare R2 buckets with Iceberg Data Catalog
- **Compute:** Hetzner Cloud CCX dedicated vCPU instances
- **Orchestration:** Custom Python scheduler (see `src/orchestrator/`)

## Prerequisites

1. **Cloudflare Account**
   - Sign up at https://dash.cloudflare.com
   - Create an API token with R2 + Data Catalog permissions
   - Get your Account ID from the dashboard
2. **Hetzner Cloud Account**
   - Sign up at https://console.hetzner.cloud
   - Create an API token with Read & Write permissions
3. **Pulumi Account** (optional, can use local state)
   - Sign up at https://app.pulumi.com
   - Or use local state with `pulumi login --local`
4. **SSH Key**
   - Generate one if needed: `ssh-keygen -t ed25519 -C "materia-deploy"`

## Initial Setup

```bash
cd infra

# Log in to Pulumi (local or cloud)
pulumi login            # or: pulumi login --local

# Initialize the stack
pulumi stack init dev

# Configure secrets (commands run without a value will prompt for it)
pulumi config set --secret cloudflare:apiToken
pulumi config set cloudflare_account_id
pulumi config set --secret hcloud:token
pulumi config set --secret ssh_public_key "$(cat ~/.ssh/id_ed25519.pub)"

# Preview changes
pulumi preview

# Deploy infrastructure
pulumi up
```

## What Gets Provisioned

### Cloudflare R2 Buckets

1. **materia-raw** - Raw data from extraction (immutable archives)
2. **materia-lakehouse** - Iceberg tables for SQLMesh (ACID transactions)

### Hetzner Cloud Servers & Firewall

1. **materia-scheduler** (CCX12: 2 vCPU, 8 GB RAM)
   - Runs the cron scheduler
   - Lightweight orchestration tasks
   - Always-on, low cost (~€6/mo)
2. **materia-worker-01** (CCX22: 4 vCPU, 16 GB RAM)
   - Heavy SQLMesh transformations
   - Can be stopped when not in use
   - Scale up to CCX32/CCX42 for larger workloads (~€24-90/mo)
3. **materia-firewall**
   - SSH access (port 22) only
   - All outbound traffic allowed
   - No inbound HTTP/HTTPS (we're not running web services yet)

## Enabling R2 Data Catalog (Iceberg)

As of October 2025, R2 Data Catalog is in public beta. Enable it manually:

1. Go to Cloudflare Dashboard → R2
2. Select the `materia-lakehouse` bucket
3. Navigate to Settings → Data Catalog
4. Click "Enable Data Catalog"

Once enabled, you can connect DuckDB to the Iceberg REST catalog:

```python
import duckdb

# Get the catalog URI from Pulumi outputs:
#   pulumi stack output duckdb_r2_config

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg;")
# <account-id> and <secret-name> below are placeholders -- substitute your own
conn.execute("""
    ATTACH 'iceberg_rest://catalog.cloudflarestorage.com/<account-id>/r2-data-catalog'
        AS lakehouse (
        TYPE ICEBERG_REST,
        SECRET '<secret-name>'
    );
""")
```

## Server Access

Get server IPs from Pulumi outputs:

```bash
pulumi stack output scheduler_ip
pulumi stack output worker_ip
```

SSH into the servers:

```bash
ssh root@<scheduler-ip>
ssh root@<worker-ip>
```

## Cost Estimates (Monthly)

R2 line items are billed in USD, Hetzner in EUR.

| Resource | Type | Cost |
|----------|------|------|
| R2 Storage | 10 GB | $0.15 |
| R2 Operations | 1M reads | $0.36 |
| R2 Egress | Unlimited | $0.00 (zero egress!) |
| Scheduler | CCX12 | €6.00 |
| Worker (on-demand) | CCX22 | €24.00 |
| **Total** | | **~€30/mo (~$33)** |

Compare to an AWS equivalent: roughly $300-500/mo with S3 + EC2 + egress fees.

## Scaling Workers

To add more worker capacity or different instance sizes:

1. Edit `infra/__main__.py` to add new server resources (see the sketch after this list)
2. Update the worker config in `src/orchestrator/workers.yaml`
3. Run `pulumi up` to provision
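For reference, adding a worker might look like the following. This is a minimal sketch assuming the `pulumi-hcloud` provider, not a copy of `infra/__main__.py`; the image, location, and SSH key name are illustrative assumptions.

```python
"""Sketch: adding a second, larger worker (names and values illustrative)."""
import pulumi
import pulumi_hcloud as hcloud

# A heavier worker for large SQLMesh runs; mirrors materia-worker-01
worker_02 = hcloud.Server(
    "materia-worker-02",
    server_type="ccx32",          # 8 vCPU, 32 GB RAM
    image="ubuntu-24.04",         # assumed image
    location="fsn1",              # assumed region
    ssh_keys=["materia-deploy"],  # assumes this key is registered in Hetzner
)

# Expose the IP so the orchestrator config can reference it
pulumi.export("worker_02_ip", worker_02.ipv4_address)
```

After `pulumi up`, add the exported IP to `src/orchestrator/workers.yaml` so the scheduler can dispatch work to the new server.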
Example worker sizes:

- CCX12: 2 vCPU, 8 GB RAM (light workloads)
- CCX22: 4 vCPU, 16 GB RAM (medium workloads)
- CCX32: 8 vCPU, 32 GB RAM (heavy workloads)
- CCX42: 16 vCPU, 64 GB RAM (very heavy workloads)

## Destroying Infrastructure

```bash
cd infra
pulumi destroy
```

**Warning:** This deletes all buckets and servers. Back up your data first!

## Next Steps

1. Deploy the orchestrator to the scheduler server (see `src/orchestrator/README.md`)
2. Configure SQLMesh to use the R2 lakehouse (see `transform/sqlmesh_materia/config.yaml`)
3. Set up a CI/CD pipeline to deploy on push (see `.gitlab-ci.yml`)
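## Verifying Bucket Access

As a final sanity check, you can confirm both buckets are reachable over R2's S3-compatible API. A minimal sketch using `boto3`: the environment variable names here are assumptions, and the R2 access key pair is created separately in the Cloudflare dashboard (an R2 API token, not the account API token used by Pulumi).

```python
"""Smoke test: list a few objects from each bucket via R2's S3-compatible API."""
import os
import boto3

# R2 endpoint format: https://<account-id>.r2.cloudflarestorage.com
s3 = boto3.client(
    "s3",
    endpoint_url=f"https://{os.environ['R2_ACCOUNT_ID']}.r2.cloudflarestorage.com",
    aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
)

for bucket in ("materia-raw", "materia-lakehouse"):
    resp = s3.list_objects_v2(Bucket=bucket, MaxKeys=5)
    print(f"{bucket}: {resp.get('KeyCount', 0)} object(s) visible")
```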