# Materia Infrastructure

Pulumi-managed infrastructure for BeanFlows.coffee

## Stack Overview

- **Storage:** Cloudflare R2 buckets with Iceberg Data Catalog
- **Compute:** Hetzner Cloud CCX dedicated vCPU instances
- **Orchestration:** Custom Python scheduler (see `src/orchestrator/`)

## Prerequisites

1. **Cloudflare Account**
   - Sign up at https://dash.cloudflare.com
   - Create API token with R2 + Data Catalog permissions
   - Get your Account ID from dashboard

2. **Hetzner Cloud Account**
   - Sign up at https://console.hetzner.cloud
   - Create API token with Read & Write permissions

3. **Pulumi Account** (optional, can use local state)
   - Sign up at https://app.pulumi.com
   - Or use local state with `pulumi login --local`

4. **SSH Key**
   - Generate if needed: `ssh-keygen -t ed25519 -C "materia-deploy"`

## Initial Setup

```bash
cd infra

# Log in to Pulumi (local or cloud)
pulumi login  # or: pulumi login --local

# Initialize the stack
pulumi stack init dev

# Configure credentials and settings
# (placeholders are quoted so the shell does not treat <...> as redirection)
pulumi config set --secret cloudflare:apiToken "<your-cloudflare-token>"
pulumi config set cloudflare_account_id "<your-account-id>"
pulumi config set --secret hcloud:token "<your-hetzner-token>"
pulumi config set --secret ssh_public_key "$(cat ~/.ssh/id_ed25519.pub)"

# Preview changes
pulumi preview

# Deploy infrastructure
pulumi up
```

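Before running `pulumi preview`, it can help to confirm that all four config values from the setup commands are present. The following is a minimal sketch, not part of this repository: the helper names `missing_keys` and `load_stack_config` are hypothetical, and it assumes `pulumi config --json` prints a JSON object keyed by fully qualified names (`<namespace>:<key>`).

```python
import json
import subprocess

# Keys set in the setup commands above. Provider keys carry their own
# namespace; the other two are stored under the project's namespace.
REQUIRED_KEYS = [
    "cloudflare:apiToken",
    "cloudflare_account_id",
    "hcloud:token",
    "ssh_public_key",
]


def missing_keys(config: dict, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent from a Pulumi config mapping."""
    missing = []
    for key in required:
        if ":" in key:
            # Provider-scoped key: must match the full name exactly.
            found = key in config
        else:
            # Project-scoped key: assumed to appear as "<project>:<key>".
            found = any(name.split(":", 1)[-1] == key for name in config)
        if not found:
            missing.append(key)
    return missing


def load_stack_config() -> dict:
    """Read the current stack's config (assumes the pulumi CLI is on PATH)."""
    result = subprocess.run(
        ["pulumi", "config", "--json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `missing_keys(load_stack_config())` from the `infra` directory should return an empty list once every value has been set.
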
## What Gets Provisioned

### Cloudflare R2 Buckets

1. **materia-raw** - Raw data from extraction (immutable archives)
2. **materia-lakehouse** - Iceberg tables for SQLMesh (ACID transactions)

### Hetzner Cloud Servers

1. **materia-scheduler** (CCX12: 2 vCPU, 8GB RAM)
   - Runs cron scheduler
   - Lightweight orchestration tasks
   - Always-on, low cost (~€6/mo)

2. **materia-worker-01** (CCX22: 4 vCPU, 16GB RAM)
   - Heavy SQLMesh transformations
   - Can be stopped when not in use
   - Scale up to CCX32/CCX42 for larger workloads (~€24-90/mo)

3. **materia-firewall**
   - SSH access (port 22)
   - All outbound traffic allowed
   - No inbound HTTP/HTTPS (we're not running web services yet)

## Enabling R2 Data Catalog (Iceberg)

As of October 2025, R2 Data Catalog is in public beta. Enable it manually:

1. Go to Cloudflare Dashboard → R2
2. Select the `materia-lakehouse` bucket
3. Navigate to Settings → Data Catalog
4. Click "Enable Data Catalog"

Once enabled, you can connect DuckDB to the Iceberg REST catalog:

```python
import duckdb

# Look up the catalog settings from the Pulumi outputs first:
#   pulumi stack output duckdb_r2_config

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg;")

# Replace <account_id> and <r2_api_token> before running.
# ATTACH options differ between iceberg extension versions;
# check the DuckDB docs if this statement is rejected.
conn.execute("""
    ATTACH 'iceberg_rest://catalog.cloudflarestorage.com/<account_id>/r2-data-catalog'
    AS lakehouse (
        TYPE ICEBERG_REST,
        SECRET '<r2_api_token>'
    );
""")
```

## Server Access

Get server IPs from Pulumi outputs:

```bash
pulumi stack output scheduler_ip
pulumi stack output worker_ip
```

SSH into servers:

```bash
ssh root@<scheduler_ip>
ssh root@<worker_ip>
```

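The two steps above can also be scripted. This is a rough sketch under a few assumptions: the helper names are hypothetical, the `pulumi` CLI is on `PATH`, and the stack exposes `scheduler_ip` and `worker_ip` as plain string outputs, read here in one call with `pulumi stack output --json`.

```python
import json
import subprocess

# Output names used in the commands above.
IP_OUTPUTS = ("scheduler_ip", "worker_ip")


def parse_outputs(raw_json: str) -> dict:
    """Pick the server IP outputs out of the JSON printed by Pulumi."""
    outputs = json.loads(raw_json)
    return {name: outputs[name] for name in IP_OUTPUTS if name in outputs}


def ssh_command(ip: str, user: str = "root") -> list:
    """Build the argument list for an SSH session to one server."""
    return ["ssh", f"{user}@{ip}"]


def fetch_outputs() -> dict:
    """Run the Pulumi CLI in the current directory and return the IP outputs."""
    result = subprocess.run(
        ["pulumi", "stack", "output", "--json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_outputs(result.stdout)
```

For example, `subprocess.run(ssh_command(fetch_outputs()["scheduler_ip"]))` opens a session on the scheduler when run from the `infra` directory.
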
## Cost Estimates (Monthly)

| Resource | Type | Cost |
|----------|------|------|
| R2 Storage | 10 GB | $0.15 |
| R2 Operations | 1M reads | $0.36 |
| R2 Egress | Unlimited | $0.00 (zero egress!) |
| Scheduler | CCX12 | €6.00 |
| Worker (on-demand) | CCX22 | €24.00 |
| **Total** | | **~€30/mo (~$33)** |

Compare to AWS equivalent: ~$300-500/mo with S3 + EC2 + egress fees.

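The total in the table mixes two currencies, so it may help to see the arithmetic spelled out. The snippet below only re-adds the figures from the table; the variable names are illustrative, and amounts are kept in integer cents to avoid float rounding.

```python
# Monthly figures copied from the table above, in cents.
R2_COSTS_USD_CENTS = {
    "storage_10gb": 15,
    "operations_1m_reads": 36,
    "egress": 0,
}
SERVER_COSTS_EUR_CENTS = {
    "scheduler_ccx12": 600,
    "worker_ccx22": 2400,
}

# Sum each currency separately; no exchange rate is applied here.
r2_total_usd = sum(R2_COSTS_USD_CENTS.values()) / 100
servers_total_eur = sum(SERVER_COSTS_EUR_CENTS.values()) / 100

print(f"R2: ${r2_total_usd:.2f}/mo, servers: €{servers_total_eur:.2f}/mo")
```

The servers account for the €30 in the total row; R2 adds about half a dollar on top at these volumes.
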
## Scaling Workers

To add more worker capacity or different instance sizes:

1. Edit `infra/__main__.py` to add new server resources
2. Update worker config in `src/orchestrator/workers.yaml`
3. Run `pulumi up` to provision

Example worker sizes:
- CCX12: 2 vCPU, 8GB RAM (light workloads)
- CCX22: 4 vCPU, 16GB RAM (medium workloads)
- CCX32: 8 vCPU, 32GB RAM (heavy workloads)
- CCX42: 16 vCPU, 64GB RAM (very heavy workloads)

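When deciding which size to provision, the list above can be treated as a lookup table. Here is a small sketch of that idea; `ServerType` and `smallest_fit` are hypothetical names that do not exist in this repository, and the specs are copied from the list above.

```python
from typing import NamedTuple, Optional


class ServerType(NamedTuple):
    name: str
    vcpus: int
    ram_gb: int


# Sizes from the list above, ordered smallest to largest.
CCX_SIZES = [
    ServerType("CCX12", 2, 8),
    ServerType("CCX22", 4, 16),
    ServerType("CCX32", 8, 32),
    ServerType("CCX42", 16, 64),
]


def smallest_fit(vcpus_needed: int, ram_gb_needed: int) -> Optional[ServerType]:
    """Return the smallest listed size meeting both requirements, or None."""
    for size in CCX_SIZES:
        if size.vcpus >= vcpus_needed and size.ram_gb >= ram_gb_needed:
            return size
    return None
```

A job that needs 3 vCPUs and 8 GB of RAM maps to CCX22, because CCX12 falls short on vCPUs; a request larger than CCX42 returns `None`.
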
## Destroying Infrastructure

```bash
cd infra
pulumi destroy
```

**Warning:** This will delete all buckets and servers. Back up your data first!

## Next Steps

1. Deploy orchestrator to scheduler server (see `src/orchestrator/README.md`)
2. Configure SQLMesh to use R2 lakehouse (see `transform/sqlmesh_materia/config.yaml`)
3. Set up CI/CD pipeline to deploy on push (see `.gitlab-ci.yml`)