# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Materia is a commodity data analytics platform built on a modern data engineering stack. The project extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB for analysis.
Tech Stack:
- Python 3.13 with the `uv` package manager
- SQLMesh for SQL transformation and orchestration
- DuckDB as the analytical database
- Workspace structure with separate extract and transform packages
## Environment Setup

Install dependencies:

```bash
uv sync
```

Set up pre-commit hooks:

```bash
pre-commit install
```

Add new dependencies:

```bash
uv add <package-name>
```
## Project Structure
This is a uv workspace with three main components:
### 1. Extract Layer (`extract/`)

Contains extraction packages for pulling data from external sources.

`extract/psdonline/`: Extracts USDA PSD commodity data from archives dating back to 2006
- Entry point: `extract_psd` CLI command (defined in `extract/psdonline/src/psdonline/execute.py`)
- Downloads monthly zip archives to `extract/psdonline/src/psdonline/data/`
- Uses ETags to avoid re-downloading unchanged files

Run extraction:

```bash
extract_psd
```
### 2. Transform Layer (`transform/sqlmesh_materia/`)
SQLMesh project implementing a layered data architecture.
Working directory: all SQLMesh commands must be run from `transform/sqlmesh_materia/`
Key commands:

```bash
cd transform/sqlmesh_materia

# Plan changes (no prompts, auto-apply enabled in config)
sqlmesh plan

# Run tests
sqlmesh test

# Validate models
sqlmesh validate

# Run audits
sqlmesh audit

# Format SQL
sqlmesh format

# Start UI
sqlmesh ui
```
Configuration:
- Config: `transform/sqlmesh_materia/config.yaml`
- Default gateway: `dev` (uses `materia_dev.db`)
- Production gateway: `prod` (uses `materia_prod.db`)
- Auto-apply enabled, no interactive prompts
- DuckDB extensions: zipfs, httpfs, iceberg
### 3. Core Package (`src/materia/`)
Currently minimal; main logic resides in workspace packages.
## Data Architecture

SQLMesh models follow a strict 4-layer architecture defined in `transform/sqlmesh_materia/models/README.md`:
### Layer 1: Raw (`models/raw/`)
- Purpose: Immutable archive of source data
- Pattern: Directly reads from extraction outputs
- Example: `raw.psd_alldata` reads zip files using DuckDB's `read_csv('zip://...')` function
- Grain: Defines unique keys for each raw table
### Layer 2: Staging (`models/staging/`)
- Purpose: Apply schema, cast types, basic cleansing
- Pattern: `stg_[source]__[entity]`
- Example: `stg_psdalldata__commodity.sql` casts raw strings to proper types and joins lookup tables
- Features:
  - Deduplication using hash keys
  - Extracts metadata (`ingest_date`) from file paths
  - 1:1 relationship with raw sources
### Layer 3: Cleaned (`models/cleaned/`)
- Purpose: Integration, business logic, unified models
- Pattern: `cln_[entity]` or `cln_[vault_component]_[entity]`
- Example: `cln_psdalldata__commodity_pivoted.sql` pivots commodity attributes into columns
### Layer 4: Serving (`models/serving/`)
- Purpose: Analytics-ready models (star schema, aggregates)
- Patterns:
  - `dim_[entity]` for dimensions
  - `fct_[process]` for facts
  - `agg_[description]` for aggregates
  - `obt_[description]` for one-big-tables
- Example: `obt_commodity_metrics.sql` provides a wide table for analysis
## Model Development
Incremental models:
- Use the `INCREMENTAL_BY_TIME_RANGE` kind
- Define `time_column` (usually `ingest_date`)
- Filter with `WHERE time_column BETWEEN @start_ds AND @end_ds`

Full refresh models:
- Use the `FULL` kind for small lookup tables and raw sources

Model properties:
- `grain`: Define unique key columns for data quality
- `start`: Historical backfill start date (project default: 2025-07-07)
- `cron`: Schedule (project default: `'@daily'`)
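Putting these properties together, a minimal sketch of what an incremental model could look like (the model, schema, and column names here are hypothetical, not actual project models):

```sql
-- Hypothetical incremental model sketch; names are illustrative only.
MODEL (
  name staging.stg_example__entity,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column ingest_date
  ),
  grain (entity_id, ingest_date),
  start '2025-07-07',
  cron '@daily'
);

SELECT
  entity_id,
  value,
  ingest_date
FROM raw.example
-- SQLMesh substitutes @start_ds / @end_ds with the interval being processed
WHERE ingest_date BETWEEN @start_ds AND @end_ds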
## Linting and Formatting

Run linting:

```bash
ruff check .
```

Auto-fix issues:

```bash
ruff check --fix .
```

Format code:

```bash
ruff format .
```

Pre-commit hooks automatically run ruff on commits.
## Testing

Run SQLMesh tests:

```bash
cd transform/sqlmesh_materia
sqlmesh test
```

Run Python tests (if configured):

```bash
pytest --cov=./ --cov-report=xml
```
## CI/CD Pipeline

GitLab CI runs three stages (`.gitlab-ci.yml`):
- Lint: Runs ruff check and format validation, plus pip-audit
- Test: Runs pytest with coverage
- Build: Creates distribution packages (on tags only)
## Key Design Patterns
Raw data ingestion:
- DuckDB reads directly from zip archives using `read_csv('zip://...')`
- `filename=true` captures the source file path for metadata
- `union_by_name=true` handles schema evolution
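As a sketch, a raw-layer read combining these options might look like the following (the glob pattern and paths are assumptions, not the project's actual model):

```sql
-- Illustrative raw-layer read; the zip path/glob is an assumption.
SELECT *
FROM read_csv(
  'zip://extract/psdonline/src/psdonline/data/**/*.zip/*.csv',
  filename = true,      -- adds a 'filename' column with the source path
  union_by_name = true  -- matches columns by name across files with differing schemas
)
```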
Deduplication:
- Use the `hash()` function to create unique keys
- Use `any_value()` with `GROUP BY hkey` to deduplicate
- Preserve all metadata in the hash key for change detection
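Combined, this pattern might be sketched as follows (the column list is an assumption; the real raw table may differ):

```sql
-- Illustrative dedup pattern; raw.psd_alldata's actual columns may differ.
SELECT
  hash(country_code, commodity_code, attribute_id, value, filename) AS hkey,
  any_value(country_code)   AS country_code,
  any_value(commodity_code) AS commodity_code,
  any_value(attribute_id)   AS attribute_id,
  any_value(value)          AS value,
  any_value(filename)       AS filename
FROM raw.psd_alldata
GROUP BY hkey
```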
Date handling:
- Extract ingest dates from file paths: `make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1)`
- Calculate market dates: `last_day(make_date(market_year, month, 1))`
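In context, these expressions might appear in a staging query like this sketch (the path layout, e.g. `.../2024/03/archive.zip/file.csv`, and the source table are assumptions):

```sql
-- Illustrative date derivation; assumes the year and month are the
-- 4th- and 3rd-from-last segments of the source file path.
SELECT
  make_date(
    split(filename, '/')[-4]::int,  -- year directory
    split(filename, '/')[-3]::int,  -- month directory
    1
  ) AS ingest_date,
  last_day(make_date(market_year, month, 1)) AS market_date
FROM raw.psd_alldata
```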
SQLMesh best practices:
- Always define `grain` for data quality validation
- Use meaningful model names following layer conventions
- Leverage SQLMesh's built-in time macros (`@start_ds`, `@end_ds`)
- Keep the raw layer thin; push transformations to staging and beyond
## Database Location

- Dev database: `materia_dev.db` (13 GB, in the project root)
- Prod database: `materia_prod.db` (not yet created)
Note: The dev database is large and should not be committed to git (.gitignore already configured).
- We use a monorepo with uv workspaces
- The Pulumi environment is called `beanflows/prod`