From ac9b23af17fd90a5eef71ec24faa41bbbea5e45a Mon Sep 17 00:00:00 2001
From: Deeman
Date: Sun, 12 Oct 2025 13:21:13 +0200
Subject: [PATCH] Add CLAUDE.md documentation for AI-assisted development
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Comprehensive guide covering project architecture, SQLMesh workflow, data
layer conventions, and development commands for the Materia commodity
analytics platform.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude
---
 CLAUDE.md | 202 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 202 insertions(+)
 create mode 100644 CLAUDE.md

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..024445e
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,202 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+Materia is a commodity data analytics platform built on a modern data engineering stack. The project extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB for analysis.
+
+**Tech Stack:**
+- Python 3.13 with the `uv` package manager
+- SQLMesh for SQL transformation and orchestration
+- DuckDB as the analytical database
+- Workspace structure with separate extract and transform packages
+
+## Environment Setup
+
+**Install dependencies:**
+```bash
+uv sync
+```
+
+**Set up pre-commit hooks:**
+```bash
+pre-commit install
+```
+
+**Add new dependencies:**
+```bash
+uv add <package>
+```
+
+## Project Structure
+
+This is a uv workspace with three main components:
+
+### 1. Extract Layer (`extract/`)
+Contains extraction packages for pulling data from external sources.
+
+- **`extract/psdonline/`**: Extracts USDA PSD commodity data from archives dating back to 2006
+  - Entry point: `extract_psd` CLI command (defined in `extract/psdonline/src/psdonline/execute.py`)
+  - Downloads monthly zip archives to `extract/psdonline/src/psdonline/data/`
+  - Uses ETags to avoid re-downloading unchanged files
+
+**Run extraction:**
+```bash
+extract_psd
+```
+
+### 2. Transform Layer (`transform/sqlmesh_materia/`)
+SQLMesh project implementing a layered data architecture.
+
+**Working directory:** All SQLMesh commands must be run from `transform/sqlmesh_materia/`
+
+**Key commands:**
+```bash
+cd transform/sqlmesh_materia
+
+# Plan changes (no prompts, auto-apply enabled in config)
+sqlmesh plan
+
+# Run tests
+sqlmesh test
+
+# Validate models
+sqlmesh validate
+
+# Run audits
+sqlmesh audit
+
+# Format SQL
+sqlmesh format
+
+# Start UI
+sqlmesh ui
+```
+
+**Configuration:**
+- Config: `transform/sqlmesh_materia/config.yaml`
+- Default gateway: `dev` (uses `materia_dev.db`)
+- Production gateway: `prod` (uses `materia_prod.db`)
+- Auto-apply enabled, no interactive prompts
+- DuckDB extensions: zipfs, httpfs, iceberg
+
+### 3. Core Package (`src/materia/`)
+Currently minimal; main logic resides in workspace packages.
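+
+The gateway setup described above might look roughly like this in `config.yaml` (an illustrative sketch; the actual file may differ, and the `model_defaults` values simply mirror the project defaults noted elsewhere in this document):
+
+```yaml
+gateways:
+  dev:
+    connection:
+      type: duckdb
+      database: materia_dev.db
+  prod:
+    connection:
+      type: duckdb
+      database: materia_prod.db
+default_gateway: dev
+model_defaults:
+  dialect: duckdb
+  start: 2025-07-07
+  cron: '@daily'
+```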
+
+## Data Architecture
+
+SQLMesh models follow a strict 4-layer architecture defined in `transform/sqlmesh_materia/models/README.md`:
+
+### Layer 1: Raw (`models/raw/`)
+- **Purpose:** Immutable archive of source data
+- **Pattern:** Directly reads from extraction outputs
+- **Example:** `raw.psd_alldata` reads zip files using DuckDB's `read_csv('zip://...')` function
+- **Grain:** Defines unique keys for each raw table
+
+### Layer 2: Staging (`models/staging/`)
+- **Purpose:** Apply schema, cast types, basic cleansing
+- **Pattern:** `stg_[source]__[entity]`
+- **Example:** `stg_psdalldata__commodity.sql` casts raw strings to proper types, joins lookup tables
+- **Features:**
+  - Deduplication using hash keys
+  - Extracts metadata (ingest_date) from file paths
+  - 1:1 relationship with raw sources
+
+### Layer 3: Cleaned (`models/cleaned/`)
+- **Purpose:** Integration, business logic, unified models
+- **Pattern:** `cln_[entity]` or `cln_[vault_component]_[entity]`
+- **Example:** `cln_psdalldata__commodity_pivoted.sql` pivots commodity attributes into columns
+
+### Layer 4: Serving (`models/serving/`)
+- **Purpose:** Analytics-ready models (star schema, aggregates)
+- **Patterns:**
+  - `dim_[entity]` for dimensions
+  - `fct_[process]` for facts
+  - `agg_[description]` for aggregates
+  - `obt_[description]` for one-big-tables
+- **Example:** `obt_commodity_metrics.sql` provides a wide table for analysis
+
+## Model Development
+
+**Incremental models:**
+- Use the `INCREMENTAL_BY_TIME_RANGE` kind
+- Define a `time_column` (usually `ingest_date`)
+- Filter with `WHERE time_column BETWEEN @start_ds AND @end_ds`
+
+**Full refresh models:**
+- Use the `FULL` kind for small lookup tables and raw sources
+
+**Model properties:**
+- `grain`: Define unique key columns for data quality
+- `start`: Historical backfill start date (project default: 2025-07-07)
+- `cron`: Schedule (project default: '@daily')
+
+## Linting and Formatting
+
+**Run linting:**
+```bash
+ruff check .
+```
+
+**Auto-fix issues:**
+```bash
+ruff check --fix .
+```
+
+**Format code:**
+```bash
+ruff format .
+```
+
+Pre-commit hooks automatically run ruff on commits.
+
+## Testing
+
+**Run SQLMesh tests:**
+```bash
+cd transform/sqlmesh_materia
+sqlmesh test
+```
+
+**Run Python tests (if configured):**
+```bash
+pytest --cov=./ --cov-report=xml
+```
+
+## CI/CD Pipeline
+
+GitLab CI runs three stages (`.gitlab-ci.yml`):
+
+1. **Lint:** Runs ruff check and format validation, plus pip-audit
+2. **Test:** Runs pytest with coverage
+3. **Build:** Creates distribution packages (on tags only)
+
+## Key Design Patterns
+
+**Raw data ingestion:**
+- DuckDB reads directly from zip archives using `read_csv('zip://...')`
+- `filename=true` captures the source file path for metadata
+- `union_by_name=true` handles schema evolution
+
+**Deduplication:**
+- Use the `hash()` function to create unique keys
+- Use `any_value()` with `GROUP BY hkey` to deduplicate
+- Preserve all metadata in the hash key for change detection
+
+**Date handling:**
+- Extract ingest dates from file paths: `make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1)`
+- Calculate market dates: `last_day(make_date(market_year, month, 1))`
+
+**SQLMesh best practices:**
+- Always define `grain` for data quality validation
+- Use meaningful model names following layer conventions
+- Leverage SQLMesh's built-in time macros (`@start_ds`, `@end_ds`)
+- Keep the raw layer thin; push transformations to staging and above
+
+## Database Location
+
+- **Dev database:** `materia_dev.db` (13 GB, in project root)
+- **Prod database:** `materia_prod.db` (not yet created)
+
+Note: The dev database is large and should not be committed to git (`.gitignore` is already configured).
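+
+## Example: Staging Model Pattern
+
+Putting several of the patterns above together (incremental kind, path-derived ingest dates, hash-key deduplication), a staging-layer model might look roughly like this. This is an illustrative sketch only; the model name, source table, and columns are hypothetical, not taken from the repository:
+
+```sql
+MODEL (
+  name staging.stg_example__entity,  -- hypothetical, follows stg_[source]__[entity]
+  kind INCREMENTAL_BY_TIME_RANGE (
+    time_column ingest_date
+  ),
+  grain (hkey),
+  cron '@daily'
+);
+
+WITH source AS (
+  SELECT
+    *,
+    -- derive the ingest date from the archive file path
+    make_date(split(filename, '/')[-4]::int,
+              split(filename, '/')[-3]::int, 1) AS ingest_date
+  FROM raw.example_source  -- hypothetical raw model
+)
+SELECT
+  -- hash key over the meaningful columns, used for change detection
+  hash(commodity_code, market_year, ingest_date) AS hkey,
+  any_value(commodity_code) AS commodity_code,
+  any_value(market_year)    AS market_year,
+  any_value(ingest_date)    AS ingest_date
+FROM source
+WHERE ingest_date BETWEEN @start_ds AND @end_ds
+GROUP BY hkey
+```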