diff --git a/CLAUDE.md b/CLAUDE.md
index 0861031..9f97d58 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -55,8 +55,11 @@ SQLMesh project implementing a layered data architecture.
 ```bash
 cd transform/sqlmesh_materia
 
-# Plan changes (no prompts, auto-apply enabled in config)
-sqlmesh plan
+# Local development (creates virtual environment)
+sqlmesh plan dev_
+
+# Production
+sqlmesh plan prod
 
 # Run tests
 sqlmesh test
@@ -76,10 +79,17 @@ sqlmesh ui
 
 **Configuration:**
 - Config: `transform/sqlmesh_materia/config.yaml`
-- Default gateway: `dev` (uses `materia_dev.db`)
-- Production gateway: `prod` (uses `materia_prod.db`)
+- Single gateway: `prod` (connects to R2 Iceberg catalog)
+- Uses virtual environments for dev isolation (e.g., `dev_deeman`)
+- Production uses `prod` environment
 - Auto-apply enabled, no interactive prompts
-- DuckDB extensions: zipfs, httpfs, iceberg
+- DuckDB extensions: httpfs, iceberg
+
+**Environment Strategy:**
+- All environments connect to the same R2 Iceberg catalog
+- Dev environments (e.g., `dev_deeman`) are isolated virtual environments
+- SQLMesh manages environment isolation and table versioning
+- No local DuckDB files needed
 
 ### 3. Core Package (`src/materia/`)
 Currently minimal; main logic resides in workspace packages.
@@ -254,10 +264,10 @@ Supervisor: uv run materia pipeline run
 ```
 
 #### 5. Data Storage
-- **Dev**: Local DuckDB file (`materia_dev.db`)
-- **Prod**: DuckDB in-memory + Cloudflare R2 Data Catalog (Iceberg REST API)
+- **All environments**: DuckDB in-memory + Cloudflare R2 Data Catalog (Iceberg REST API)
   - ACID transactions on object storage
   - No persistent database on workers
+  - Virtual environments for dev isolation (e.g., `dev_deeman`)
 
 **Execution Flow:**
 1. Supervisor loop wakes up every 15 minutes
@@ -299,14 +309,15 @@ Supervisor: uv run materia pipeline run
 - Leverage SQLMesh's built-in time macros (`@start_ds`, `@end_ds`)
 - Keep raw layer thin, push transformations to staging+
 
-## Database Location
+## Data Storage
 
-- **Dev database:** `materia_dev.db` (13GB, in project root)
-- **Prod database:** `materia_prod.db` (not yet created)
-
-Note: The dev database is large and should not be committed to git (.gitignore already configured).
+All data is stored in Cloudflare R2 Data Catalog (Apache Iceberg) via REST API:
+- **Production environment:** `prod`
+- **Dev environments:** `dev_` (virtual environments)
+- SQLMesh manages environment isolation and table versioning
+- No local database files needed
 - We use a monorepo with uv workspaces
 - The pulumi env is called beanflows/prod
-- NEVER hardcode secrets in plaintext
-- Never add ssh keys to the git repo!
+- NEVER hardcode secrets in plaintext
+- Never add ssh keys to the git repo!
 - If there is a simpler more direct solution and there is no other tradeoff, always choose the simpler solution
\ No newline at end of file
diff --git a/transform/sqlmesh_materia/README.md b/transform/sqlmesh_materia/README.md
index e69de29..a1bae2a 100644
--- a/transform/sqlmesh_materia/README.md
+++ b/transform/sqlmesh_materia/README.md
@@ -0,0 +1,92 @@
+# Materia SQLMesh Transform Layer
+
+Data transformation pipeline using SQLMesh and DuckDB, implementing a 4-layer architecture.
+
+## Quick Start
+
+```bash
+cd transform/sqlmesh_materia
+
+# Local development (virtual environment)
+sqlmesh plan dev_
+
+# Production
+sqlmesh plan prod
+
+# Run tests
+sqlmesh test
+
+# Format SQL
+sqlmesh format
+```
+
+## Architecture
+
+### Gateway Configuration
+
+**Single Gateway:** All environments connect to Cloudflare R2 Data Catalog (Apache Iceberg)
+- **Production:** `sqlmesh plan prod`
+- **Development:** `sqlmesh plan dev_` (isolated virtual environment)
+
+SQLMesh manages environment isolation automatically - no need for separate local databases.
+
+### 4-Layer Data Model
+
+See `models/README.md` for detailed architecture documentation:
+
+1. **Raw** - Immutable source data
+2. **Staging** - Schema, types, basic cleansing
+3. **Cleaned** - Business logic, integration
+4. **Serving** - Analytics-ready (facts, dimensions, aggregates)
+
+## Configuration
+
+**Config:** `config.yaml`
+- DuckDB in-memory with R2 Iceberg catalog
+- Extensions: httpfs, iceberg
+- Auto-apply enabled (no prompts)
+- Initialization hooks for R2 secret/catalog attachment
+
+## Commands
+
+```bash
+# Plan changes for dev environment
+sqlmesh plan dev_yourname
+
+# Plan changes for prod
+sqlmesh plan prod
+
+# Run tests
+sqlmesh test
+
+# Validate models
+sqlmesh validate
+
+# Run audits
+sqlmesh audit
+
+# Format SQL files
+sqlmesh format
+
+# Start web UI
+sqlmesh ui
+```
+
+## Environment Variables (Prod)
+
+Required for production R2 Iceberg catalog:
+- `CLOUDFLARE_API_TOKEN` - R2 API token
+- `ICEBERG_REST_URI` - R2 catalog REST endpoint
+- `R2_WAREHOUSE_NAME` - Warehouse name (default: "materia")
+
+These are injected via Pulumi ESC (`beanflows/prod`) on the supervisor instance.
+
+## Development Workflow
+
+1. Make changes to models in `models/`
+2. Test locally: `sqlmesh test`
+3. Plan changes: `sqlmesh plan dev_yourname`
+4. Review and apply changes
+5. Commit and push to trigger CI/CD
+
+SQLMesh will handle environment isolation, table versioning, and incremental updates automatically.
diff --git a/transform/sqlmesh_materia/config.yaml b/transform/sqlmesh_materia/config.yaml
index 870e642..7f5634d 100644
--- a/transform/sqlmesh_materia/config.yaml
+++ b/transform/sqlmesh_materia/config.yaml
@@ -1,18 +1,8 @@
 # --- Gateway Connection ---
+# Single gateway connecting to R2 Iceberg catalog
+# Local dev uses virtual environments (e.g., dev_)
+# Production uses the 'prod' environment
 gateways:
-
-  dev:
-    connection:
-      # For more information on configuring the connection to your execution engine, visit:
-      # https://sqlmesh.readthedocs.io/en/stable/reference/configuration/#connection
-      # https://sqlmesh.readthedocs.io/en/stable/integrations/engines/duckdb/#connection-options
-      type: duckdb
-      database: materia_dev.db
-      extensions:
-        - name: zipfs
-        - name: httpfs
-        - name: iceberg
-
   prod:
     connection:
       type: duckdb
@@ -21,8 +11,7 @@ gateways:
         - name: httpfs
         - name: iceberg
 
-
-default_gateway: dev
+default_gateway: prod
 
 # --- Hooks ---
 # Run initialization SQL before all plans/runs