CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Materia is a commodity data analytics platform built on a modern data engineering stack. The project extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB for analysis.

Tech Stack:

  • Python 3.13 with uv package manager
  • SQLMesh for SQL transformation and orchestration
  • DuckDB as the analytical database
  • Workspace structure with separate extract and transform packages

Environment Setup

Install dependencies:

uv sync

Setup pre-commit hooks:

pre-commit install

Add new dependencies:

uv add <package-name>

Project Structure

This is a uv workspace with three main components:

1. Extract Layer (extract/)

Contains extraction packages for pulling data from external sources.

  • extract/psdonline/: Extracts USDA PSD commodity data from archives dating back to 2006
    • Entry point: extract_psd CLI command (defined in extract/psdonline/src/psdonline/execute.py)
    • Downloads monthly zip archives to extract/psdonline/src/psdonline/data/
    • Uses ETags to avoid re-downloading unchanged files
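
The ETag mechanism amounts to an HTTP conditional GET. A minimal sketch in Python (the cache filename and error handling here are assumptions for illustration, not the actual psdonline implementation):

```python
import json
import urllib.error
import urllib.request
from pathlib import Path

ETAG_CACHE = Path("etags.json")  # hypothetical cache file, keyed by URL


def load_etags(cache: Path = ETAG_CACHE) -> dict:
    """Load previously seen ETags from disk, if any."""
    return json.loads(cache.read_text()) if cache.exists() else {}


def fetch_if_changed(url: str, dest: Path, etags: dict) -> bool:
    """Download url to dest unless the server says it is unchanged.

    Returns True if a fresh copy was fetched, False on HTTP 304.
    """
    req = urllib.request.Request(url)
    if url in etags:
        # Conditional GET: the server replies 304 if the archive is unchanged.
        req.add_header("If-None-Match", etags[url])
    try:
        with urllib.request.urlopen(req) as resp:
            dest.write_bytes(resp.read())
            etags[url] = resp.headers.get("ETag", "")
            return True
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the cached archive
            return False
        raise
```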

Run extraction:

extract_psd

2. Transform Layer (transform/sqlmesh_materia/)

SQLMesh project implementing a layered data architecture.

Working directory: All SQLMesh commands must be run from transform/sqlmesh_materia/

Key commands:

cd transform/sqlmesh_materia

# Plan changes (no prompts, auto-apply enabled in config)
sqlmesh plan

# Run tests
sqlmesh test

# Validate models
sqlmesh validate

# Run audits
sqlmesh audit

# Format SQL
sqlmesh format

# Start UI
sqlmesh ui

Configuration:

  • Config: transform/sqlmesh_materia/config.yaml
  • Default gateway: dev (uses materia_dev.db)
  • Production gateway: prod (uses materia_prod.db)
  • Auto-apply enabled, no interactive prompts
  • DuckDB extensions: zipfs, httpfs, iceberg

3. Core Package (src/materia/)

Currently minimal; main logic resides in workspace packages.

Data Architecture

SQLMesh models follow a strict 4-layer architecture defined in transform/sqlmesh_materia/models/README.md:

Layer 1: Raw (models/raw/)

  • Purpose: Immutable archive of source data
  • Pattern: Directly reads from extraction outputs
  • Example: raw.psd_alldata reads zip files with DuckDB's read_csv('zip://...') (via the zipfs extension)
  • Grain: Defines unique keys for each raw table

Layer 2: Staging (models/staging/)

  • Purpose: Apply schema, cast types, basic cleansing
  • Pattern: stg_[source]__[entity]
  • Example: stg_psdalldata__commodity.sql casts raw strings to proper types and joins lookup tables
  • Features:
    • Deduplication using hash keys
    • Extracts metadata (ingest_date) from file paths
    • 1:1 relationship with raw sources

Layer 3: Cleaned (models/cleaned/)

  • Purpose: Integration, business logic, unified models
  • Pattern: cln_[entity] or cln_[vault_component]_[entity]
  • Example: cln_psdalldata__commodity_pivoted.sql pivots commodity attributes into columns

Layer 4: Serving (models/serving/)

  • Purpose: Analytics-ready models (star schema, aggregates)
  • Patterns:
    • dim_[entity] for dimensions
    • fct_[process] for facts
    • agg_[description] for aggregates
    • obt_[description] for one-big-tables
  • Example: obt_commodity_metrics.sql provides wide table for analysis

Model Development

Incremental models:

  • Use INCREMENTAL_BY_TIME_RANGE kind
  • Define time_column (usually ingest_date)
  • Filter with WHERE time_column BETWEEN @start_ds AND @end_ds
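
Put together, a minimal incremental model might look like the following sketch (model name, grain columns, and query are illustrative only, not the project's actual code):

```sql
-- Hypothetical incremental staging model following the conventions above.
MODEL (
  name staging.stg_psdalldata__commodity,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column ingest_date
  ),
  start '2025-07-07',
  cron '@daily',
  grain (commodity_hkey, ingest_date)
);

SELECT *
FROM raw.psd_alldata
WHERE ingest_date BETWEEN @start_ds AND @end_ds;
```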

Full refresh models:

  • Use FULL kind for small lookup tables and raw sources

Model properties:

  • grain: Define unique key columns for data quality
  • start: Historical backfill start date (project default: 2025-07-07)
  • cron: Schedule (project default: '@daily')

Linting and Formatting

Run linting:

ruff check .

Auto-fix issues:

ruff check --fix .

Format code:

ruff format .

Pre-commit hooks automatically run ruff on commits.

Testing

Run SQLMesh tests:

cd transform/sqlmesh_materia
sqlmesh test

Run Python tests (if configured):

pytest --cov=./ --cov-report=xml

CI/CD Pipeline and Production Architecture

CI/CD Pipeline (.gitlab-ci.yml)

4 Stages: Lint → Test → Build → Deploy

1. Lint Stage

  • Runs ruff check and ruff format --check
  • Validates code quality on every commit

2. Test Stage

  • test:cli: Runs pytest on materia CLI with 71% coverage
    • Tests secrets management (Pulumi ESC integration)
    • Tests worker lifecycle (create, list, destroy)
    • Tests pipeline execution (extract, transform)
    • Exports coverage reports to GitLab
  • test:sqlmesh: Runs SQLMesh model tests in transform layer

3. Build Stage (only on master branch)

Creates separate artifacts for each workspace package:

  • build:extract: Builds materia-extract-latest.tar.gz (psdonline package)
  • build:transform: Builds materia-transform-latest.tar.gz (sqlmesh_materia package)
  • build:cli: Builds materia-cli-latest.tar.gz (materia management CLI)

Each artifact is a self-contained tarball with all dependencies.

4. Deploy Stage (only on master branch)

  • deploy:r2: Uploads artifacts to Cloudflare R2 using rclone
    • Loads secrets from Pulumi ESC (beanflows/prod)
    • Only requires PULUMI_ACCESS_TOKEN in GitLab variables
    • All other secrets (R2 credentials, SSH keys, API tokens) come from ESC
  • deploy:infra: Runs pulumi up to deploy infrastructure changes
    • Only triggers when infra/**/* files change

Production Architecture: Ephemeral Worker Model

Design Philosophy:

  • No always-on workers (cost optimization)
  • Supervisor instance dynamically creates/destroys workers on-demand
  • Language-agnostic artifacts enable future migration to C/Rust/Go
  • Multi-cloud abstraction for pricing optimization

Components:

1. Supervisor Instance (Small Hetzner VM)

  • Runs the materia management CLI
  • Small, always-on instance (cheap)
  • Pulls secrets from Pulumi ESC
  • Orchestrates worker lifecycle via cloud provider APIs

2. Ephemeral Workers (On-Demand)

  • Created for each pipeline execution
  • Download pre-built artifacts from R2 (no git, no uv on the worker)
  • Receive secrets via SSH environment-variable injection
  • Destroyed immediately after job completion
  • Different instance types per pipeline:
    • Extract: ccx12 (2 vCPU, 8GB RAM)
    • Transform: ccx22 (4 vCPU, 16GB RAM)

3. Secrets Flow

Pulumi ESC (beanflows/prod)
  ↓
Supervisor Instance (materia CLI)
  ↓
Workers (injected as env vars via SSH)
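
The injection step can be sketched as building a single SSH command whose remote side exports the secrets before launching the entrypoint (the `root@` user and command shape are assumptions, not the actual materia CLI code):

```python
import shlex


def ssh_command(host: str, secrets: dict[str, str], entrypoint: str) -> list[str]:
    """Build an ssh invocation that injects secrets as env vars.

    Secrets travel inside one SSH session and are never written to worker disk.
    """
    exports = " && ".join(
        f"export {key}={shlex.quote(value)}" for key, value in secrets.items()
    )
    return ["ssh", f"root@{host}", f"{exports} && {entrypoint}"]
```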

4. Artifact Flow

GitLab CI: uv build → tar.gz
  ↓
Cloudflare R2 (artifact storage)
  ↓
Worker: curl → extract → execute

5. Data Storage

  • Dev: Local DuckDB file (materia_dev.db)
  • Prod: DuckDB in-memory + Cloudflare R2 Data Catalog (Iceberg REST API)
    • ACID transactions on object storage
    • No persistent database on workers

Execution Flow:

  1. Supervisor receives schedule trigger (cron/manual)
  2. CLI runs: materia pipeline run extract
  3. Creates Hetzner worker with SSH key
  4. Worker downloads materia-extract-latest.tar.gz from R2
  5. CLI injects secrets via SSH: export R2_ACCESS_KEY_ID=... && ./extract_psd
  6. Pipeline executes, writes to R2 Iceberg catalog
  7. Worker destroyed (entire lifecycle ~5-10 minutes)

Multi-Cloud Provider Abstraction:

  • Protocol-based interface (data-oriented design, no OOP)
  • Providers: Hetzner (implemented), OVH, Scaleway, Oracle (stubs)
  • Allows switching providers for cost optimization
  • Each provider implements: create_instance, destroy_instance, list_instances, wait_for_ssh
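
A sketch of what that protocol-based interface could look like with `typing.Protocol` (method names come from the list above; signatures and the in-memory Hetzner stub are assumptions, not the real materia CLI):

```python
from typing import Protocol


class CloudProvider(Protocol):
    """Structural interface: any object with these methods qualifies."""

    def create_instance(self, name: str, instance_type: str) -> str: ...
    def destroy_instance(self, instance_id: str) -> None: ...
    def list_instances(self) -> list[str]: ...
    def wait_for_ssh(self, instance_id: str) -> bool: ...


class HetznerProvider:
    """Toy in-memory stand-in; the real one would call the Hetzner API."""

    def __init__(self) -> None:
        self._instances: dict[str, str] = {}

    def create_instance(self, name: str, instance_type: str) -> str:
        self._instances[name] = instance_type
        return name

    def destroy_instance(self, instance_id: str) -> None:
        self._instances.pop(instance_id, None)

    def list_instances(self) -> list[str]:
        return list(self._instances)

    def wait_for_ssh(self, instance_id: str) -> bool:
        return instance_id in self._instances


def run_pipeline(provider: CloudProvider, instance_type: str) -> None:
    """Ephemeral-worker lifecycle: create, use, always destroy."""
    worker = provider.create_instance("worker-extract", instance_type)
    try:
        provider.wait_for_ssh(worker)
        # ... download artifact from R2, inject secrets, execute ...
    finally:
        provider.destroy_instance(worker)
```

Because the interface is structural (no shared base class), adding a new provider is just writing a module with matching functions, which keeps the design data-oriented rather than OOP-heavy.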

Key Design Patterns

Raw data ingestion:

  • DuckDB reads directly from zip archives using read_csv('zip://...')
  • filename=true captures source file path for metadata
  • union_by_name=true handles schema evolution
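
A hypothetical raw-layer query combining these options (the archive path is invented for illustration; the real model's path differs):

```sql
-- Requires the zipfs extension to resolve 'zip://' paths.
SELECT *
FROM read_csv(
    'zip://data/2025/07/psd_alldata.zip/*.csv',
    filename = true,       -- keep the source path for ingest-date metadata
    union_by_name = true   -- tolerate columns added or reordered across years
);
```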

Deduplication:

  • Use hash() function to create unique keys
  • Use any_value() with GROUP BY hkey to deduplicate
  • Preserve all metadata in hash key for change detection
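
The same pattern mirrored in plain Python for illustration (the actual models do this in SQL with hash() and any_value(); this is only a semantic sketch):

```python
def dedupe(rows: list[dict]) -> list[dict]:
    """Keep one row per hash key, mirroring hash(*) + any_value() GROUP BY."""
    seen: dict[int, dict] = {}
    for row in rows:
        # Hash key covers every column, so only exact duplicates collapse.
        hkey = hash(tuple(sorted(row.items())))
        seen.setdefault(hkey, row)  # any_value(): keep the first row seen
    return list(seen.values())
```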

Date handling:

  • Extract ingest dates from file paths: make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1)
  • Calculate market dates: last_day(make_date(market_year, month, 1))
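
For clarity, the two SQL expressions above behave like these Python equivalents (an illustration of the logic, not code used by the project):

```python
import calendar
from datetime import date


def ingest_date_from_path(filename: str) -> date:
    """Mirror of make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1):
    the year and month segments sit 4th- and 3rd-from-last in the path."""
    parts = filename.split("/")
    return date(int(parts[-4]), int(parts[-3]), 1)


def market_date(market_year: int, month: int) -> date:
    """Mirror of last_day(make_date(market_year, month, 1))."""
    return date(market_year, month, calendar.monthrange(market_year, month)[1])
```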

SQLMesh best practices:

  • Always define grain for data quality validation
  • Use meaningful model names following layer conventions
  • Leverage SQLMesh's built-in time macros (@start_ds, @end_ds)
  • Keep raw layer thin, push transformations to staging+

Database Location

  • Dev database: materia_dev.db (13GB, in project root)
  • Prod database: materia_prod.db (not yet created)

Note: The dev database is large and should not be committed to git (.gitignore already configured).

Additional notes:

  • We use a monorepo with uv workspaces
  • The Pulumi environment is called beanflows/prod