# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Materia is a commodity data analytics platform built on a modern data engineering stack. The project extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB for analysis.

**Tech Stack:**

- Python 3.13 with the `uv` package manager
- SQLMesh for SQL transformation and orchestration
- DuckDB as the analytical database
- Workspace structure with separate extract and transform packages

## Environment Setup

**Install dependencies:**

```bash
uv sync
```

**Set up pre-commit hooks:**

```bash
pre-commit install
```

**Add new dependencies:**

```bash
uv add <package>
```

## Project Structure

This is a uv workspace with three main components:

### 1. Extract Layer (`extract/`)

Contains extraction packages for pulling data from external sources.

- **`extract/psdonline/`**: Extracts USDA PSD commodity data from archives dating back to 2006
  - Entry point: `extract_psd` CLI command (defined in `extract/psdonline/src/psdonline/execute.py`)
  - Downloads monthly zip archives to `extract/psdonline/src/psdonline/data/`
  - Uses ETags to avoid re-downloading unchanged files

**Run extraction:**

```bash
extract_psd
```

### 2. Transform Layer (`transform/sqlmesh_materia/`)

SQLMesh project implementing a layered data architecture.
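For orientation, here is a minimal sketch of what a model in this project might look like. The model name, columns, and source table are hypothetical; the conventions it illustrates (kind, `grain`, time filtering) are the ones described under Data Architecture and Model Development below.

```sql
-- Hypothetical staging model illustrating the project conventions;
-- not an actual model in this repository.
MODEL (
    name staging.stg_example__entity,   -- follows the stg_[source]__[entity] pattern
    kind INCREMENTAL_BY_TIME_RANGE (
        time_column ingest_date         -- incremental models filter on this column
    ),
    grain (entity_id, ingest_date),     -- unique key columns for data quality
    cron '@daily'
);

SELECT
    entity_id,
    ingest_date
FROM raw.example_source                 -- hypothetical raw-layer model
WHERE ingest_date BETWEEN @start_ds AND @end_ds;
```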
**Working directory:** All SQLMesh commands must be run from `transform/sqlmesh_materia/`.

**Key commands:**

```bash
cd transform/sqlmesh_materia

# Plan changes (no prompts, auto-apply enabled in config)
sqlmesh plan

# Run tests
sqlmesh test

# Validate models
sqlmesh validate

# Run audits
sqlmesh audit

# Format SQL
sqlmesh format

# Start UI
sqlmesh ui
```

**Configuration:**

- Config: `transform/sqlmesh_materia/config.yaml`
- Default gateway: `dev` (uses `materia_dev.db`)
- Production gateway: `prod` (uses `materia_prod.db`)
- Auto-apply enabled, no interactive prompts
- DuckDB extensions: zipfs, httpfs, iceberg

### 3. Core Package (`src/materia/`)

Currently minimal; the main logic resides in the workspace packages.

## Data Architecture

SQLMesh models follow a strict 4-layer architecture defined in `transform/sqlmesh_materia/models/README.md`:

### Layer 1: Raw (`models/raw/`)

- **Purpose:** Immutable archive of source data
- **Pattern:** Directly reads from extraction outputs
- **Example:** `raw.psd_alldata` reads zip files using DuckDB's `read_csv('zip://...')` function
- **Grain:** Defines unique keys for each raw table

### Layer 2: Staging (`models/staging/`)

- **Purpose:** Apply schema, cast types, basic cleansing
- **Pattern:** `stg_[source]__[entity]`
- **Example:** `stg_psdalldata__commodity.sql` casts raw strings to proper types and joins lookup tables
- **Features:**
  - Deduplication using hash keys
  - Extracts metadata (ingest_date) from file paths
  - 1:1 relationship with raw sources

### Layer 3: Cleaned (`models/cleaned/`)

- **Purpose:** Integration, business logic, unified models
- **Pattern:** `cln_[entity]` or `cln_[vault_component]_[entity]`
- **Example:** `cln_psdalldata__commodity_pivoted.sql` pivots commodity attributes into columns

### Layer 4: Serving (`models/serving/`)

- **Purpose:** Analytics-ready models (star schema, aggregates)
- **Patterns:**
  - `dim_[entity]` for dimensions
  - `fct_[process]` for facts
  - `agg_[description]` for aggregates
  - `obt_[description]` for one-big-tables
- **Example:** `obt_commodity_metrics.sql` provides a wide table for analysis

## Model Development

**Incremental models:**

- Use the `INCREMENTAL_BY_TIME_RANGE` kind
- Define a `time_column` (usually `ingest_date`)
- Filter with `WHERE time_column BETWEEN @start_ds AND @end_ds`

**Full refresh models:**

- Use the `FULL` kind for small lookup tables and raw sources

**Model properties:**

- `grain`: Define unique key columns for data quality
- `start`: Historical backfill start date (project default: 2025-07-07)
- `cron`: Schedule (project default: `'@daily'`)

## Linting and Formatting

**Run linting:**

```bash
ruff check .
```

**Auto-fix issues:**

```bash
ruff check --fix .
```

**Format code:**

```bash
ruff format .
```

Pre-commit hooks automatically run ruff on commits.

## Testing

**Run SQLMesh tests:**

```bash
cd transform/sqlmesh_materia
sqlmesh test
```

**Run Python tests (if configured):**

```bash
pytest --cov=./ --cov-report=xml
```

## CI/CD Pipeline

GitLab CI runs three stages (`.gitlab-ci.yml`):

1. **Lint:** Runs ruff check and format validation, plus pip-audit
2. **Test:** Runs pytest with coverage
3. **Build:** Creates distribution packages (on tags only)

## Key Design Patterns

**Raw data ingestion:**

- DuckDB reads directly from zip archives using `read_csv('zip://...')`
- `filename=true` captures the source file path for metadata
- `union_by_name=true` handles schema evolution

**Deduplication:**

- Use the `hash()` function to create unique keys
- Use `any_value()` with `GROUP BY hkey` to deduplicate
- Preserve all metadata in the hash key for change detection

**Date handling:**

- Extract ingest dates from file paths: `make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1)`
- Calculate market dates: `last_day(make_date(market_year, month, 1))`

**SQLMesh best practices:**

- Always define `grain` for data quality validation
- Use meaningful model names following the layer conventions
- Leverage SQLMesh's built-in time macros (`@start_ds`, `@end_ds`)
- Keep the raw layer thin; push transformations to staging and beyond

## Database Location

- **Dev database:** `materia_dev.db` (13 GB, in the project root)
- **Prod database:** `materia_prod.db` (not yet created)

Note: The dev database is large and should not be committed to git (`.gitignore` is already configured).
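The deduplication pattern described under Key Design Patterns can be sketched as a single query. This is an illustrative sketch, not code from the repository: `raw.psd_alldata` is the raw model named above, but the column names are assumptions.

```sql
-- Hypothetical deduplication query following the hash() + any_value() pattern;
-- column names are assumptions, not the actual raw.psd_alldata schema.
WITH keyed AS (
    SELECT
        *,
        -- Hash all meaningful columns (including the source filename)
        -- so any change in the data yields a new key
        hash(commodity_code, country_code, market_year, value, filename) AS hkey
    FROM raw.psd_alldata
)
SELECT
    hkey,
    any_value(commodity_code) AS commodity_code,
    any_value(country_code)   AS country_code,
    any_value(market_year)    AS market_year,
    any_value(value)          AS value
FROM keyed
GROUP BY hkey;  -- one row per distinct hash key
```

Because every hashed column feeds the key, rows in the same group are exact duplicates, so `any_value()` safely picks a representative and the hash key doubles as a change-detection signal downstream.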