CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Materia is a commodity data analytics platform built on a modern data engineering stack. The project extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB for analysis.

Tech Stack:

  • Python 3.13 with uv package manager
  • SQLMesh for SQL transformation and orchestration
  • DuckDB as the analytical database
  • Workspace structure with separate extract and transform packages

Environment Setup

Install dependencies:

uv sync

Setup pre-commit hooks:

pre-commit install

Add new dependencies:

uv add <package-name>

Project Structure

This is a uv workspace with three main components:

1. Extract Layer (extract/)

Contains extraction packages for pulling data from external sources.

  • extract/psdonline/: Extracts USDA PSD commodity data from archives dating back to 2006
    • Entry point: extract_psd CLI command (defined in extract/psdonline/src/psdonline/execute.py)
    • Downloads monthly zip archives to extract/psdonline/src/psdonline/data/
    • Uses ETags to avoid re-downloading unchanged files

Run extraction:

extract_psd

2. Transform Layer (transform/sqlmesh_materia/)

SQLMesh project implementing a layered data architecture.

Working directory: All SQLMesh commands must be run from transform/sqlmesh_materia/

Key commands:

cd transform/sqlmesh_materia

# Plan changes (no prompts, auto-apply enabled in config)
sqlmesh plan

# Run tests
sqlmesh test

# Validate models
sqlmesh validate

# Run audits
sqlmesh audit

# Format SQL
sqlmesh format

# Start UI
sqlmesh ui

Configuration:

  • Config: transform/sqlmesh_materia/config.yaml
  • Default gateway: dev (uses materia_dev.db)
  • Production gateway: prod (uses materia_prod.db)
  • Auto-apply enabled, no interactive prompts
  • DuckDB extensions: zipfs, httpfs, iceberg

3. Core Package (src/materia/)

Currently minimal; main logic resides in workspace packages.

Data Architecture

SQLMesh models follow a strict 4-layer architecture defined in transform/sqlmesh_materia/models/README.md:

Layer 1: Raw (models/raw/)

  • Purpose: Immutable archive of source data
  • Pattern: Directly reads from extraction outputs
  • Example: raw.psd_alldata reads zip files using DuckDB's read_csv('zip://...') function
  • Grain: Defines unique keys for each raw table

Layer 2: Staging (models/staging/)

  • Purpose: Apply schema, cast types, basic cleansing
  • Pattern: stg_[source]__[entity]
  • Example: stg_psdalldata__commodity.sql casts raw strings to proper types, joins lookup tables
  • Features:
    • Deduplication using hash keys
    • Extracts metadata (ingest_date) from file paths
    • 1:1 relationship with raw sources

Layer 3: Cleaned (models/cleaned/)

  • Purpose: Integration, business logic, unified models
  • Pattern: cln_[entity] or cln_[vault_component]_[entity]
  • Example: cln_psdalldata__commodity_pivoted.sql pivots commodity attributes into columns

Layer 4: Serving (models/serving/)

  • Purpose: Analytics-ready models (star schema, aggregates)
  • Patterns:
    • dim_[entity] for dimensions
    • fct_[process] for facts
    • agg_[description] for aggregates
    • obt_[description] for one-big-tables
  • Example: obt_commodity_metrics.sql provides wide table for analysis

Model Development

Incremental models:

  • Use INCREMENTAL_BY_TIME_RANGE kind
  • Define time_column (usually ingest_date)
  • Filter with WHERE time_column BETWEEN @start_ds AND @end_ds
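Each incremental run therefore only touches rows inside the [@start_ds, @end_ds] window. A Python sketch of that filter (illustrative only, not SQLMesh internals; rows_in_window is a hypothetical name):

```python
import datetime


def rows_in_window(rows: list[dict], start_ds: str, end_ds: str) -> list[dict]:
    # Equivalent of: WHERE ingest_date BETWEEN @start_ds AND @end_ds
    start = datetime.date.fromisoformat(start_ds)
    end = datetime.date.fromisoformat(end_ds)
    return [r for r in rows if start <= r["ingest_date"] <= end]
```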

Full refresh models:

  • Use FULL kind for small lookup tables and raw sources

Model properties:

  • grain: Define unique key columns for data quality
  • start: Historical backfill start date (project default: 2025-07-07)
  • cron: Schedule (project default: '@daily')

Linting and Formatting

Run linting:

ruff check .

Auto-fix issues:

ruff check --fix .

Format code:

ruff format .

Pre-commit hooks automatically run ruff on commits.

Testing

Run SQLMesh tests:

cd transform/sqlmesh_materia
sqlmesh test

Run Python tests (if configured):

pytest --cov=./ --cov-report=xml

CI/CD Pipeline

GitLab CI runs three stages (.gitlab-ci.yml):

  1. Lint: Runs ruff check and format validation, plus pip-audit
  2. Test: Runs pytest with coverage
  3. Build: Creates distribution packages (on tags only)

Key Design Patterns

Raw data ingestion:

  • DuckDB reads directly from zip archives using read_csv('zip://...')
  • filename=true captures source file path for metadata
  • union_by_name=true handles schema evolution

Deduplication:

  • Use hash() function to create unique keys
  • Use any_value() with GROUP BY on the hash key (hkey) to deduplicate
  • Preserve all metadata in hash key for change detection
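The hash-key deduplication behaves like the following sketch (a toy stand-in for DuckDB's hash() + any_value(); dedupe_by_hash is illustrative only):

```python
def dedupe_by_hash(rows: list[dict], key_columns: list[str]) -> list[dict]:
    # hash() over the key columns builds the hkey; GROUP BY hkey with
    # any_value() then keeps one arbitrary row per key
    seen: dict[int, dict] = {}
    for row in rows:
        hkey = hash(tuple(row[col] for col in key_columns))
        seen.setdefault(hkey, row)  # any_value(): first row wins here
    return list(seen.values())
```

Including all metadata columns in key_columns is what makes a changed row hash to a new key, which is how change detection falls out of the same mechanism.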

Date handling:

  • Extract ingest dates from file paths: make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1)
  • Calculate market dates: last_day(make_date(market_year, month, 1))
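Those two expressions translate as follows; the four-levels-deep path layout (e.g. .../\<year\>/\<month\>/\<archive\>.zip/\<file\>.csv) is an assumption for illustration:

```python
import calendar
import datetime


def ingest_date_from_path(filename: str) -> datetime.date:
    # make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1)
    parts = filename.split("/")
    return datetime.date(int(parts[-4]), int(parts[-3]), 1)


def market_date(market_year: int, month: int) -> datetime.date:
    # last_day(make_date(market_year, month, 1))
    last = calendar.monthrange(market_year, month)[1]
    return datetime.date(market_year, month, last)
```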

SQLMesh best practices:

  • Always define grain for data quality validation
  • Use meaningful model names following layer conventions
  • Leverage SQLMesh's built-in time macros (@start_ds, @end_ds)
  • Keep raw layer thin, push transformations to staging+

Database Location

  • Dev database: materia_dev.db (13GB, in project root)
  • Prod database: materia_prod.db (not yet created)

Note: The dev database is large and should not be committed to git (.gitignore is already configured to exclude it).

Additional Notes

  • The repository is a monorepo managed with uv workspaces
  • The Pulumi ESC environment is named beanflows/prod