Commit Graph

136 Commits

Author SHA1 Message Date
Deeman
320ddd5123 Add architectural plan document for PSD extraction refactoring
Documents the complete analysis, implementation, and results of the
PSD extraction refactoring from the architecture advisor's recommendations.

Includes:
- Problem statement and key insights
- Architecture analysis (data-oriented approach)
- Implementation phases and results
- Testing outcomes and metrics
- 227 files migrated, ~40 lines reduced, 220+ → 1-4 requests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-20 22:55:58 +02:00
Deeman
d30ec9b66b Add R2 upload support with landing bucket path
## Changes

1. **Support ESC environment variable names**
   - Fallback to R2_ADMIN_ACCESS_KEY_ID if R2_ACCESS_KEY not set
   - Fallback to R2_ADMIN_SECRET_ACCESS_KEY if R2_SECRET_KEY not set
   - Allows script to work with Pulumi ESC (beanflows/prod) variables

2. **Use landing bucket path**
   - Changed R2 path from `psd/{etag}.zip` to `landing/psd/{etag}.zip`
   - All extracted data goes to landing bucket for consistent organization

3. **Updated Pulumi ESC environment**
   - Added R2_BUCKET=beanflows-data-prod
   - Fixed R2_ENDPOINT to remove bucket path (now just account URL)
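
The variable fallback and the landing path above could be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the helper names `first_set`, `resolve_r2_credentials`, and `r2_object_key` are hypothetical, and only the environment variable names and the `landing/psd/{etag}.zip` path come from this commit message.

```python
import os
from typing import Mapping, Optional


def first_set(env: Mapping[str, str], *names: str) -> Optional[str]:
    """Return the value of the first variable in `names` that is set and non-empty."""
    for name in names:
        value = env.get(name)
        if value:
            return value
    return None


def resolve_r2_credentials(
    env: Mapping[str, str] = os.environ,
) -> Optional[tuple[str, str]]:
    """Prefer the script's own variable names, then fall back to the ESC names."""
    access_key = first_set(env, "R2_ACCESS_KEY", "R2_ADMIN_ACCESS_KEY_ID")
    secret_key = first_set(env, "R2_SECRET_KEY", "R2_ADMIN_SECRET_ACCESS_KEY")
    if access_key and secret_key:
        return access_key, secret_key
    return None  # no credentials found: the caller can stay in local mode


def r2_object_key(etag: str) -> str:
    """Build the landing path named in this commit (hypothetical helper)."""
    return f"landing/psd/{etag}.zip"
```

Passing the environment as a mapping keeps the lookup easy to test without touching the real process environment.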

## Testing

- R2 upload works: Uploaded to landing/psd/316039e2612edc1_0.zip
- R2 deduplication works: Skips upload if file exists
- Local mode still works without credentials

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-20 22:45:30 +02:00
Deeman
57f2909001 Update documentation: Pulumi ESC usage and CI/CD bootstrap clarification
## Changes

1. **Added Pulumi ESC section**
   - How to login and load secrets into shell
   - `esc run` command for running commands with secrets
   - List of available secrets in `beanflows/prod` environment
   - Examples for common use cases

2. **Fixed supervisor bootstrap documentation**
   - Clarified that bootstrapping happens automatically in CI/CD
   - Pipeline checks if supervisor is already bootstrapped
   - Runs bootstrap script automatically only if needed
   - Removed misleading "one-time" manual bootstrap instructions
   - Added note that it's only needed manually in exceptional cases

3. **Updated deploy:supervisor stage description**
   - More accurate description of the bootstrap check logic
   - Explains the conditional execution (bootstrap vs status check)

These updates make the documentation more accurate and helpful for both
local development (with ESC) and understanding the production deployment.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-20 22:07:24 +02:00
Deeman
38897617e7 Refactor PSD extraction: simplify to latest-only + add R2 support
## Key Changes

1. **Simplified extraction logic**
   - Changed from downloading 220+ historical archives to checking only latest available month
   - Tries current month and falls back up to 3 months (handles USDA publication lag)
   - Architecture advisor insight: ETags naturally deduplicate, so the historical year/month structure was unnecessary

2. **Flat storage structure**
   - Old: `data/{year}/{month}/{etag}.zip`
   - New: `data/{etag}.zip` (local) or `psd/{etag}.zip` (R2)
   - Migrated 226 existing files to flat structure

3. **Dual storage modes**
   - **Local mode**: Downloads to local directory (development)
   - **R2 mode**: Uploads to Cloudflare R2 (production)
   - Mode determined by presence of R2 environment variables
   - Added boto3 dependency for S3-compatible R2 API

4. **Updated raw SQLMesh model**
   - Changed pattern from `**/*.zip` to `*.zip` to match flat structure
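
The "current month, then fall back up to 3 months" rule from item 1 might look like the sketch below. It assumes the extractor only needs an ordered list of (year, month) pairs to try; the function name `candidate_months` and the `max_fallback` parameter are hypothetical, and the real script may build its attempts differently.

```python
from datetime import date


def candidate_months(today: date, max_fallback: int = 3) -> list[tuple[int, int]]:
    """Return (year, month) pairs: the current month, then up to `max_fallback` earlier ones."""
    months = []
    year, month = today.year, today.month
    for _ in range(max_fallback + 1):
        months.append((year, month))
        month -= 1
        if month == 0:  # roll over from January to the previous December
            year, month = year - 1, 12
    return months
```

A caller would walk this list in order and stop at the first month for which an archive is available, which covers the publication lag without probing the full history.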

## Benefits

- Simpler: Single file check instead of 220+ URL attempts
- Efficient: ETag-based deduplication works naturally
- Flexible: Supports both local dev and production R2 storage
- Maintainable: Removed unnecessary complexity
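
As a rough illustration of why ETag-based deduplication "works naturally" with the flat layout: if each file is named after its ETag, an existing file means the content has already been fetched. The sketch below makes that assumption explicit. The helper names and the ETag sanitization (stripping quotes, replacing `:`) are guesses for illustration, not the extractor's confirmed behavior.

```python
from pathlib import Path


def local_path_for(etag: str, data_dir: Path) -> Path:
    """Map an ETag to its file in the flat layout `data/{etag}.zip`."""
    # Assumption: drop the quotes that surround HTTP ETag values and replace
    # ':' so the result is a valid filename on common filesystems.
    safe = etag.strip('"').replace(":", "_")
    return data_dir / f"{safe}.zip"


def needs_download(etag: str, data_dir: Path) -> bool:
    """A file named after the ETag already on disk means the content is unchanged."""
    return not local_path_for(etag, data_dir).exists()
```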

## Testing

- Local extraction works and respects ETags
- Falls back correctly when current month unavailable
- Linting passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-20 22:02:15 +02:00
Hendrik Dreesmann
8729848731 Merge branch 'fix/sqlmesh-config-and-ci-deployment' into 'master'
Fix SQLMesh config and CI/CD deployment issues

See merge request deemanone/materia!8
2025-10-13 22:26:58 +02:00
Deeman
2d248a2eef Fix SQLMesh config to use correct Pulumi ESC env var names
- Update secret token: CLOUDFLARE_API_TOKEN → R2_ADMIN_API_TOKEN
- Update warehouse name: R2_WAREHOUSE_NAME → ICEBERG_WAREHOUSE_NAME
- Update endpoint: ICEBERG_REST_URI → ICEBERG_CATALOG_URI

- Remove CREATE SCHEMA and USE statements
  - DuckDB has a bug with Iceberg REST: missing Content-Type header
  - Schema creation via SQL currently not supported
  - Models will use fully-qualified table names instead

Successfully tested with real R2 credentials:
- Iceberg catalog attachment works ✓
- Plan dry-run executes ✓
- Only fails on missing source data (expected) ✓

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-13 22:21:27 +02:00
Deeman
05ef15bfdf Configure Iceberg catalog with proper secret reference
- Add catalog ATTACH statement in before_all with SECRET parameter
  - References r2_secret created by connection configuration
  - Uses proper DuckDB ATTACH syntax per Cloudflare docs
  - Single-line format to avoid Jinja parsing issues

- Remove manual CREATE SECRET from before_all hooks
  - Secret automatically created by SQLMesh from connection config
  - Cleaner separation: connection defines credentials, hooks use them

Successfully tested - config validates without warnings.
Only fails on missing env vars (expected locally).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-13 22:10:51 +02:00
Deeman
2ad344abf4 Refactor SQLMesh config to use connection-level secrets
- Move Iceberg secret from before_all hook to connection.secrets
  - Fixes SQLMesh warning about unsupported @env_var syntax
  - Uses Jinja templating {{ env_var() }} instead of @env_var()

- Remove database: ':memory:' (incompatible with catalogs)
  - DuckDB doesn't allow both database and catalogs config
  - Connection defaults to in-memory when no database specified

- Simplify before_all hooks to only handle ATTACH and schema setup
  - Secret is now created automatically by SQLMesh
  - Cleaner separation: connection config vs runtime setup

Based on:
- https://developers.cloudflare.com/r2/data-catalog/config-examples/duckdb/
- https://sqlmesh.readthedocs.io/en/latest/integrations/engines/duckdb/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-13 22:04:25 +02:00
Deeman
120fef369a Fix SQLMesh config and CI/CD deployment issues
- Fix SQLMesh config: Add semicolons to SQL statements in before_all hooks
  - Resolves "unsupported syntax" warning for CREATE SECRET and ATTACH
  - DuckDB requires semicolons to terminate statements properly

- Fix deploy:infra job: Update Pulumi authentication
  - Remove `pulumi login --token` (not supported in Docker image)
  - Use PULUMI_ACCESS_TOKEN environment variable directly
  - Chain commands with && to avoid "unknown command 'sh'" error

- Fix deploy:supervisor job: Update esc login syntax
  - Change `esc login --token` to `esc login` (--token flag doesn't exist)
  - esc CLI reads token from PULUMI_ACCESS_TOKEN env var
  - Simplify Pulumi CLI installation (remove apk fallback logic)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-13 21:58:43 +02:00
Hendrik Dreesmann
70854394c3 Merge branch 'feature/supervisor-deployment' into 'master'
Add supervisor deployment with continuous pipeline orchestration

See merge request deemanone/materia!7
2025-10-13 21:51:05 +02:00
Deeman
d2352c1876 Simplify SQLMesh to use single prod gateway with virtual environments
- Remove dev gateway (local DuckDB file no longer needed)
- Single prod gateway connects to R2 Iceberg catalog
- Use virtual environments for dev isolation (e.g., dev_<username>)
- Update CLAUDE.md with new workflow and environment strategy
- Create comprehensive transform/sqlmesh_materia/README.md

Benefits:
- Simpler configuration (one gateway instead of two)
- All environments use same R2 Iceberg catalog
- SQLMesh handles environment isolation automatically
- No need to maintain local 13GB materia_dev.db file
- before_all hooks only run for prod gateway (no conditional logic needed)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-13 21:47:04 +02:00
Deeman
6536724e00 Fix SQLMesh config: remove invalid init_script parameter
- Remove init_script from DuckDB connection config (not a valid parameter)
- Move R2 Iceberg catalog initialization to before_all hooks
- Hooks run before sqlmesh plan/run commands
- Uses SQLMesh @env_var() macro syntax for environment variables

Fixes CI/CD error: 'invalid duckdb connection config: invalid field init_script'

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-13 21:31:56 +02:00
Deeman
2fff895a73 Simplify supervisor architecture and automate bootstrap
- Simplify supervisor.sh following TigerBeetle pattern
  - Remove complex functions, use simple while loop
  - Add || sleep 600 for resilience against crashes
  - Use git switch --discard-changes for clean updates
  - Run pipelines every hour (SQLMesh handles scheduling)
  - Use POSIX sh instead of bash

- Remove /repo subdirectory nesting
  - Repository clones directly to /opt/materia
  - Simpler paths throughout

- Move systemd service to repo
  - Bootstrap copies from repo instead of hardcoding
  - Service can be updated via git pull

- Automate bootstrap in CI/CD
  - deploy:supervisor now auto-bootstraps on first deploy
  - Waits for SSH to be ready (retry loop)
  - Injects secrets via SSH environment
  - Idempotent: detects if already bootstrapped

Result: Push to master and supervisor "just works"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-13 21:17:12 +02:00
Deeman
21f99767bf Use GitLab project access token instead of SSH deploy key
More secure approach:
- Uses HTTPS with token instead of SSH keys
- Token can be rotated without touching infrastructure
- Scoped to read_repository only
- Token stored in Pulumi ESC (beanflows/prod)

Setup:
1. Create project access token in GitLab with read_repository scope
2. Add GITLAB_READ_TOKEN to Pulumi ESC
3. Bootstrap script will use it for git clone/pull
2025-10-13 20:37:28 +02:00
Deeman
f46fd53d38 Update bootstrap script with correct GitLab repo URL 2025-10-13 20:36:08 +02:00
Deeman
558829f70b Refactor to git-based deployment: simplify CI/CD and supervisor
Addresses GitLab merge request comments:
1. Remove hardcoded secrets from Pulumi.prod.yaml, use ESC environment
2. Simplify deployment by using git pull instead of R2 artifacts
3. Add bootstrap script for one-time supervisor setup

Major changes:
- **Pulumi config**: Use ESC environment (beanflows/prod) for all secrets
- **Supervisor script**: Git-based deployment (git pull every 15 min)
  * No more artifact downloads from R2
  * Runs code directly via `uv run materia`
  * Self-updating from master branch
- **Bootstrap script**: New infra/bootstrap_supervisor.sh for initial setup
  * One-time script to clone repo and setup systemd service
  * Idempotent and simple
- **CI/CD simplification**: Remove build and R2 deployment stages
  * Eliminated build:extract, build:transform, build:cli jobs
  * Eliminated deploy:r2 job
  * Simplified deploy:supervisor to just check bootstrap status
  * Reduced from 4 stages to 3 stages (Lint → Test → Deploy)
- **Documentation**: Updated CLAUDE.md with new architecture
  * Git-based deployment flow
  * Bootstrap instructions
  * Simplified execution model

Benefits:
- No hardcoded secrets in config files
- Simpler deployment (no artifact builds)
- Easy to test locally (just git clone + uv sync)
- Auto-updates every 15 minutes
- Fewer CI/CD jobs (faster pipelines)
- Cleaner separation of concerns

Inspired by TigerBeetle's CFO supervisor pattern.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-13 20:31:38 +02:00
Deeman
60989675b0 Add Pulumi prod stack config file 2025-10-12 23:19:10 +02:00
Deeman
719aa8edd9 Remove R2 bucket management from Pulumi, use cpx11 for supervisor
- R2 buckets (beanflows-artifacts, beanflows-data-prod) managed manually in Cloudflare UI
- R2 API tokens don't work with Cloudflare Pulumi provider
- Use cpx11 (€4.49/mo) instead of non-existent ccx11
- Import existing SSH key (deeman@DeemanPC)
- Successfully deployed supervisor at 49.13.231.178
2025-10-12 23:18:52 +02:00
Deeman
da17a29987 Rename Pulumi resource names to match actual R2 bucket names 2025-10-12 22:31:59 +02:00
Deeman
f207fb441d Add supervisor deployment with continuous pipeline orchestration
Implements automated supervisor instance deployment that runs scheduled
pipelines using a TigerBeetle-inspired continuous orchestration pattern.

Infrastructure changes:
- Update Pulumi to use existing R2 buckets (beanflows-artifacts, beanflows-data-prod)
- Rename scheduler → supervisor, optimize to CCX11 (€4/mo)
- Remove always-on worker (workers are now ephemeral only)
- Add artifacts bucket resource for CLI/pipeline packages

Supervisor architecture:
- supervisor.sh: Continuous loop checking schedules every 15 minutes
- Self-updating: Checks for new CLI versions hourly
- Fixed schedules: Extract at 2 AM UTC, Transform at 3 AM UTC
- systemd service for automatic restart on failure
- Logs to systemd journal for observability

CI/CD changes:
- deploy:infra now runs on every master push (not just on changes)
- New deploy:supervisor job:
  * Deploys supervisor.sh and systemd service
  * Installs latest materia CLI from R2
  * Configures environment with Pulumi ESC secrets
  * Restarts supervisor service

Future enhancements documented:
- SQLMesh-aware scheduling (check models before running)
- Model tags for worker sizing (heavy/distributed hints)
- Multi-pipeline support, distributed execution
- Cost optimization with multi-cloud spot pricing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-12 22:23:55 +02:00
Deeman
7e6ff29dea add claude memory update 2025-10-12 21:52:39 +02:00
Deeman
6c93021f2d remove stupid rules 2025-10-12 21:44:56 +02:00
Deeman
7e06eae5ac Add comprehensive ruff linting rules and migrate to uv build backend
- Configure ruff with strict linting rules (pycodestyle, pyflakes, isort, pylint, etc.)
- Exclude notebooks folder from linting
- Set line length to 88 characters and target Python 3.13
- Migrate build backend from hatchling to uv_build for better integration
- Add per-file ignores for __init__.py and scripts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-12 21:41:39 +02:00
Deeman
ce1cad4c41 fix 2025-10-12 21:36:32 +02:00
Deeman
5ce112f44d Add comprehensive E2E tests for materia CLI
- Add pytest and pytest-cov for testing
- Add niquests for modern HTTP/2 support (keep requests for hcloud compatibility)
- Create 13 E2E tests covering CLI, workers, pipelines, and secrets (71% coverage)
- Fix Pulumi ESC environment path (beanflows/prod) and secret key names
- Update GitLab CI to run CLI tests with coverage reporting

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-12 21:32:51 +02:00
Deeman
ca308a7275 delete todos 2025-10-12 21:05:21 +02:00
Deeman
55bb84f0fa implement cli/infra update cicd 2025-10-12 21:00:41 +02:00
Deeman
790e802edd updates 2025-10-12 14:26:55 +02:00
Deeman
77dd277ebf updates 2025-10-12 14:26:37 +02:00
Deeman
ac9b23af17 Add CLAUDE.md documentation for AI-assisted development
Comprehensive guide covering project architecture, SQLMesh workflow,
data layer conventions, and development commands for the Materia
commodity analytics platform.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-12 13:21:13 +02:00
Deeman
025dda16c6 update dedupe logic -> much faster now 2025-10-07 22:32:45 +02:00
Deeman
da89c2bf6e update staging pipeline 2025-10-07 22:20:48 +02:00
Deeman
0a409acbea update path 2025-09-10 18:56:32 +02:00
Deeman
85704a4bf1 Change layer naming 2025-09-10 18:46:18 +02:00
Deeman
f5f2dbc7a5 refactor 2025-08-25 20:50:25 +02:00
Hendrik Dreesmann
a2ffc96aa3 Merge branch 'CEC' into 'master'
Update file Commodity Exchange Codes.xls

See merge request deemanone/materia!6
2025-08-01 20:03:27 +02:00
Simon Dmsn
5588be152b Update 3 files
- /notebooks/03_Extraction.ipynb
- /transform/sqlmesh_materia/models/staging/stg_psd_alldata_1_filter_silver_layer.sql
- /transform/sqlmesh_materia/models/staging/stg_psd_alldata_2_filter_gold_layer.sql
2025-08-01 14:52:55 +00:00
Simon Dmsn
1c87488cc7 Update 4 files
- /transform/sqlmesh_materia/models/staging/stg_psd_alldata.sql
- /transform/sqlmesh_materia/models/staging/stg_psd_alldata_1_filter_silver_layer.sql
- /transform/sqlmesh_materia/models/staging/stg_psd_alldata_2_filter_gold_layer.sql
- /transform/sqlmesh_materia/models/staging/stg_psd_alldata_0.sql
2025-08-01 14:45:34 +00:00
Simon Dmsn
82b27e7c55 Update 2 files
- /transform/sqlmesh_materia/seeds/commodity_exchange_codes.csv
- /transform/sqlmesh_materia/seeds/psd_codes_exchange_codes_merge.csv
2025-08-01 14:41:48 +00:00
Simon Dmsn
9d7cc4e1fb Update file commodity_exchange_codes.csv 2025-08-01 14:26:19 +00:00
Simon Dmsn
4ad4386ccc Update 2 files
- /transform/sqlmesh_materia/models/staging/Commodity Exchange Codes.xls
- /transform/sqlmesh_materia/seeds/commodity_exchange_codes.csv
2025-08-01 14:24:26 +00:00
Simon Dmsn
918b0071b1 Update file Commodity Exchange Codes.xls 2025-08-01 14:22:01 +00:00
Deeman
91f8968990 remove comment 2025-07-31 19:48:18 +02:00
Deeman
641f794d61 fix seeds; update models 2025-07-27 22:49:37 +02:00
Deeman
c0d8f60d1c add reference data 2025-07-27 18:28:30 +02:00
Deeman
ff283b62ff exclude dbs 2025-07-27 15:41:34 +02:00
Deeman
8b5d05b3c2 raw ingest model 2025-07-27 15:40:41 +02:00
Deeman
f5c73e32c5 testing sqlmesh 2025-07-27 00:18:14 +02:00
Deeman
9baa0d185c testing sqlmesh 2025-07-27 00:18:03 +02:00
Deeman
f0de8a505b update projects to packages 2025-07-26 22:32:47 +02:00