Simplify SQLMesh to use single prod gateway with virtual environments

- Remove dev gateway (local DuckDB file no longer needed)
- Single prod gateway connects to R2 Iceberg catalog
- Use virtual environments for dev isolation (e.g., dev_<username>)
- Update CLAUDE.md with new workflow and environment strategy
- Create comprehensive transform/sqlmesh_materia/README.md

Benefits:
- Simpler configuration (one gateway instead of two)
- All environments use same R2 Iceberg catalog
- SQLMesh handles environment isolation automatically
- No need to maintain local 13GB materia_dev.db file
- before_all hooks only run for prod gateway (no conditional logic needed)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: Deeman
Date: 2025-10-13 21:47:04 +02:00
Parent: 6536724e00
Commit: d2352c1876
3 changed files with 121 additions and 29 deletions

CLAUDE.md

@@ -55,8 +55,11 @@ SQLMesh project implementing a layered data architecture.
```bash
cd transform/sqlmesh_materia
-# Plan changes (no prompts, auto-apply enabled in config)
-sqlmesh plan
+# Local development (creates virtual environment)
+sqlmesh plan dev_<username>
+# Production
+sqlmesh plan prod
# Run tests
sqlmesh test
@@ -76,10 +79,17 @@ sqlmesh ui
**Configuration:**
- Config: `transform/sqlmesh_materia/config.yaml`
-- Default gateway: `dev` (uses `materia_dev.db`)
-- Production gateway: `prod` (uses `materia_prod.db`)
+- Single gateway: `prod` (connects to R2 Iceberg catalog)
+- Uses virtual environments for dev isolation (e.g., `dev_deeman`)
+- Production uses `prod` environment
- Auto-apply enabled, no interactive prompts
-- DuckDB extensions: zipfs, httpfs, iceberg
+- DuckDB extensions: httpfs, iceberg

+**Environment Strategy:**
+- All environments connect to the same R2 Iceberg catalog
+- Dev environments (e.g., `dev_deeman`) are isolated virtual environments
+- SQLMesh manages environment isolation and table versioning
+- No local DuckDB files needed
### 3. Core Package (`src/materia/`)
Currently minimal; main logic resides in workspace packages.
@@ -254,10 +264,10 @@ Supervisor: uv run materia pipeline run <pipeline>
```
#### 5. Data Storage
-- **Dev**: Local DuckDB file (`materia_dev.db`)
-- **Prod**: DuckDB in-memory + Cloudflare R2 Data Catalog (Iceberg REST API)
+- **All environments**: DuckDB in-memory + Cloudflare R2 Data Catalog (Iceberg REST API)
  - ACID transactions on object storage
  - No persistent database on workers
+- Virtual environments for dev isolation (e.g., `dev_deeman`)
**Execution Flow:**
1. Supervisor loop wakes up every 15 minutes
@@ -299,12 +309,13 @@ Supervisor: uv run materia pipeline run <pipeline>
- Leverage SQLMesh's built-in time macros (`@start_ds`, `@end_ds`)
- Keep raw layer thin, push transformations to staging+
-## Database Location
+## Data Storage

-- **Dev database:** `materia_dev.db` (13GB, in project root)
-- **Prod database:** `materia_prod.db` (not yet created)
-Note: The dev database is large and should not be committed to git (.gitignore already configured).
+All data is stored in Cloudflare R2 Data Catalog (Apache Iceberg) via REST API:
+- **Production environment:** `prod`
+- **Dev environments:** `dev_<username>` (virtual environments)
+- SQLMesh manages environment isolation and table versioning
+- No local database files needed
- We use a monorepo with uv workspaces
- The pulumi env is called beanflows/prod
- NEVER hardcode secrets in plaintext

transform/sqlmesh_materia/README.md

@@ -0,0 +1,92 @@
# Materia SQLMesh Transform Layer
Data transformation pipeline using SQLMesh and DuckDB, implementing a 4-layer architecture.
## Quick Start
```bash
cd transform/sqlmesh_materia
# Local development (virtual environment)
sqlmesh plan dev_<username>
# Production
sqlmesh plan prod
# Run tests
sqlmesh test
# Format SQL
sqlmesh format
```
## Architecture
### Gateway Configuration
**Single Gateway:** All environments connect to Cloudflare R2 Data Catalog (Apache Iceberg)
- **Production:** `sqlmesh plan prod`
- **Development:** `sqlmesh plan dev_<username>` (isolated virtual environment)
SQLMesh manages environment isolation automatically; no separate local databases are needed.
### 4-Layer Data Model
See `models/README.md` for detailed architecture documentation:
1. **Raw** - Immutable source data
2. **Staging** - Schema, types, basic cleansing
3. **Cleaned** - Business logic, integration
4. **Serving** - Analytics-ready (facts, dimensions, aggregates)
## Configuration
**Config:** `config.yaml`
- DuckDB in-memory with R2 Iceberg catalog
- Extensions: httpfs, iceberg
- Auto-apply enabled (no prompts)
- Initialization hooks for R2 secret/catalog attachment
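
The initialization hooks above might be wired up roughly as follows. This is an illustrative sketch, not the repo's actual `config.yaml`: the secret/attach statements, the `r2_catalog` and `materia` names, and the templating syntax are assumptions about how a DuckDB-over-Iceberg setup is typically expressed.

```yaml
# Illustrative sketch only: single prod gateway plus before_all statements
# that create the R2 secret and attach the Iceberg catalog on startup.
gateways:
  prod:
    connection:
      type: duckdb
      extensions:
        - name: httpfs
        - name: iceberg

default_gateway: prod

# Hypothetical initialization SQL; exact statements depend on the
# duckdb-iceberg extension version in use.
before_all:
  - CREATE OR REPLACE SECRET r2_catalog (TYPE ICEBERG, TOKEN '{{ env_var('CLOUDFLARE_API_TOKEN') }}')
  - ATTACH '{{ env_var('R2_WAREHOUSE_NAME') }}' AS materia (TYPE ICEBERG, ENDPOINT '{{ env_var('ICEBERG_REST_URI') }}')
```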
## Commands
```bash
# Plan changes for dev environment
sqlmesh plan dev_yourname
# Plan changes for prod
sqlmesh plan prod
# Run tests
sqlmesh test
# Validate models
sqlmesh validate
# Run audits
sqlmesh audit
# Format SQL files
sqlmesh format
# Start web UI
sqlmesh ui
```
## Environment Variables (Prod)
Required for production R2 Iceberg catalog:
- `CLOUDFLARE_API_TOKEN` - R2 API token
- `ICEBERG_REST_URI` - R2 catalog REST endpoint
- `R2_WAREHOUSE_NAME` - Warehouse name (default: "materia")
These are injected via Pulumi ESC (`beanflows/prod`) on the supervisor instance.
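
Since a missing variable only surfaces once DuckDB tries to attach the catalog, a small pre-flight check can fail faster. The helper below is hypothetical (not part of the repo), checking exactly the variables listed above:

```shell
# Hypothetical pre-flight check: fail fast if any required R2 variable
# is unset or empty before attempting `sqlmesh plan prod`.
require_env() {
  local missing=0 v
  for v in "$@"; do
    if [ -z "${!v:-}" ]; then
      echo "missing required variable: $v" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: require_env CLOUDFLARE_API_TOKEN ICEBERG_REST_URI R2_WAREHOUSE_NAME && sqlmesh plan prod
```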
## Development Workflow
1. Make changes to models in `models/`
2. Test locally: `sqlmesh test`
3. Plan changes: `sqlmesh plan dev_yourname`
4. Review and apply changes
5. Commit and push to trigger CI/CD
SQLMesh will handle environment isolation, table versioning, and incremental updates automatically.
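
The per-user naming in step 3 can be scripted so nobody types the wrong environment. This convenience function is a hypothetical sketch (not in the repo), following the `dev_<username>` convention above:

```shell
# Hypothetical helper: derive the SQLMesh environment name, namespacing
# dev plans per user and only returning "prod" when asked explicitly.
sqlmesh_env() {
  if [ "${1:-dev}" = "prod" ]; then
    printf 'prod\n'
  else
    printf 'dev_%s\n' "$(id -un)"
  fi
}

# Usage (assumes sqlmesh is on PATH):
#   sqlmesh plan "$(sqlmesh_env dev)"
sqlmesh_env dev
```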

transform/sqlmesh_materia/config.yaml

@@ -1,18 +1,8 @@
# --- Gateway Connection ---
+# Single gateway connecting to R2 Iceberg catalog
+# Local dev uses virtual environments (e.g., dev_<username>)
+# Production uses the 'prod' environment
gateways:
-  dev:
-    connection:
-      # For more information on configuring the connection to your execution engine, visit:
-      # https://sqlmesh.readthedocs.io/en/stable/reference/configuration/#connection
-      # https://sqlmesh.readthedocs.io/en/stable/integrations/engines/duckdb/#connection-options
-      type: duckdb
-      database: materia_dev.db
-      extensions:
-        - name: zipfs
-        - name: httpfs
-        - name: iceberg
  prod:
    connection:
      type: duckdb
@@ -21,8 +11,7 @@ gateways:
        - name: httpfs
        - name: iceberg

-default_gateway: dev
+default_gateway: prod
# --- Hooks ---
# Run initialization SQL before all plans/runs