CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Materia is a commodity data analytics platform built on a modern data engineering stack. The project extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB for analysis.

Tech Stack:

  • Python 3.13 with uv package manager
  • SQLMesh for SQL transformation and orchestration
  • DuckDB as the analytical database
  • Workspace structure with separate extract and transform packages

Environment Setup

Install dependencies:

uv sync

Setup pre-commit hooks:

pre-commit install

Add new dependencies:

uv add <package-name>

Project Structure

This is a uv workspace with three main components:

1. Extract Layer (extract/)

Contains extraction packages for pulling data from external sources.

  • extract/psdonline/: Extracts USDA PSD commodity data from archives dating back to 2006
    • Entry point: extract_psd CLI command (defined in extract/psdonline/src/psdonline/execute.py)
    • Downloads monthly zip archives to extract/psdonline/src/psdonline/data/
    • Uses ETags to avoid re-downloading unchanged files

Run extraction:

extract_psd

2. Transform Layer (transform/sqlmesh_materia/)

SQLMesh project implementing a layered data architecture.

Working directory: All SQLMesh commands must be run from transform/sqlmesh_materia/

Key commands:

cd transform/sqlmesh_materia

# Plan changes (no prompts, auto-apply enabled in config)
sqlmesh plan

# Run tests
sqlmesh test

# Validate models
sqlmesh validate

# Run audits
sqlmesh audit

# Format SQL
sqlmesh format

# Start UI
sqlmesh ui

Configuration:

  • Config: transform/sqlmesh_materia/config.yaml
  • Default gateway: dev (uses materia_dev.db)
  • Production gateway: prod (uses materia_prod.db)
  • Auto-apply enabled, no interactive prompts
  • DuckDB extensions: zipfs, httpfs, iceberg

3. Core Package (src/materia/)

Currently minimal; main logic resides in workspace packages.

Data Architecture

SQLMesh models follow a strict 4-layer architecture defined in transform/sqlmesh_materia/models/README.md:

Layer 1: Raw (models/raw/)

  • Purpose: Immutable archive of source data
  • Pattern: Directly reads from extraction outputs
  • Example: raw.psd_alldata reads zip files using DuckDB's read_csv('zip://...') function
  • Grain: Defines unique keys for each raw table

Layer 2: Staging (models/staging/)

  • Purpose: Apply schema, cast types, basic cleansing
  • Pattern: stg_[source]__[entity]
  • Example: stg_psdalldata__commodity.sql casts raw strings to proper types, joins lookup tables
  • Features:
    • Deduplication using hash keys
    • Extracts metadata (ingest_date) from file paths
    • 1:1 relationship with raw sources

Layer 3: Cleaned (models/cleaned/)

  • Purpose: Integration, business logic, unified models
  • Pattern: cln_[entity] or cln_[vault_component]_[entity]
  • Example: cln_psdalldata__commodity_pivoted.sql pivots commodity attributes into columns

Layer 4: Serving (models/serving/)

  • Purpose: Analytics-ready models (star schema, aggregates)
  • Patterns:
    • dim_[entity] for dimensions
    • fct_[process] for facts
    • agg_[description] for aggregates
    • obt_[description] for one-big-tables
  • Example: obt_commodity_metrics.sql provides wide table for analysis

Model Development

Incremental models:

  • Use INCREMENTAL_BY_TIME_RANGE kind
  • Define time_column (usually ingest_date)
  • Filter with WHERE time_column BETWEEN @start_ds AND @end_ds
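Each incremental run therefore only touches rows inside the [@start_ds, @end_ds] window. A Python sketch of that filter (illustrative only, not SQLMesh internals; rows_in_window is a hypothetical name):

```python
import datetime


def rows_in_window(rows: list[dict], start_ds: str, end_ds: str) -> list[dict]:
    # Equivalent of: WHERE ingest_date BETWEEN @start_ds AND @end_ds
    start = datetime.date.fromisoformat(start_ds)
    end = datetime.date.fromisoformat(end_ds)
    return [r for r in rows if start <= r["ingest_date"] <= end]
```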

Full refresh models:

  • Use FULL kind for small lookup tables and raw sources

Model properties:

  • grain: Define unique key columns for data quality
  • start: Historical backfill start date (project default: 2025-07-07)
  • cron: Schedule (project default: '@daily')

Linting and Formatting

Run linting:

ruff check .

Auto-fix issues:

ruff check --fix .

Format code:

ruff format .

Pre-commit hooks automatically run ruff on commits.

Testing

Run SQLMesh tests:

cd transform/sqlmesh_materia
sqlmesh test

Run Python tests (if configured):

pytest --cov=./ --cov-report=xml

CI/CD Pipeline

GitLab CI runs three stages (.gitlab-ci.yml):

  1. Lint: Runs ruff check and format validation, plus pip-audit
  2. Test: Runs pytest with coverage
  3. Build: Creates distribution packages (on tags only)

Key Design Patterns

Raw data ingestion:

  • DuckDB reads directly from zip archives using read_csv('zip://...')
  • filename=true captures source file path for metadata
  • union_by_name=true handles schema evolution

Deduplication:

  • Use hash() function to create unique keys
  • Use any_value() with GROUP BY on the hash key (hkey) to deduplicate
  • Preserve all metadata in hash key for change detection
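The hash-key deduplication behaves like the following sketch (a toy stand-in for DuckDB's hash() + any_value(); dedupe_by_hash is illustrative only):

```python
def dedupe_by_hash(rows: list[dict], key_columns: list[str]) -> list[dict]:
    # hash() over the key columns builds the hkey; GROUP BY hkey with
    # any_value() then keeps one arbitrary row per key
    seen: dict[int, dict] = {}
    for row in rows:
        hkey = hash(tuple(row[col] for col in key_columns))
        seen.setdefault(hkey, row)  # any_value(): first row wins here
    return list(seen.values())
```

Including all metadata columns in key_columns is what makes a changed row hash to a new key, which is how change detection falls out of the same mechanism.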

Date handling:

  • Extract ingest dates from file paths: make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1)
  • Calculate market dates: last_day(make_date(market_year, month, 1))
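Those two expressions translate as follows; the four-levels-deep path layout (e.g. .../\<year\>/\<month\>/\<archive\>.zip/\<file\>.csv) is an assumption for illustration:

```python
import calendar
import datetime


def ingest_date_from_path(filename: str) -> datetime.date:
    # make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1)
    parts = filename.split("/")
    return datetime.date(int(parts[-4]), int(parts[-3]), 1)


def market_date(market_year: int, month: int) -> datetime.date:
    # last_day(make_date(market_year, month, 1))
    last = calendar.monthrange(market_year, month)[1]
    return datetime.date(market_year, month, last)
```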

SQLMesh best practices:

  • Always define grain for data quality validation
  • Use meaningful model names following layer conventions
  • Leverage SQLMesh's built-in time macros (@start_ds, @end_ds)
  • Keep raw layer thin, push transformations to staging+

Database Location

  • Dev database: materia_dev.db (13GB, in project root)
  • Prod database: materia_prod.db (not yet created)

Note: The dev database is large and should not be committed to git (.gitignore is already configured to exclude it).

Additional Notes

  • The repository is a monorepo managed with uv workspaces
  • The Pulumi ESC environment is named beanflows/prod