Add CLAUDE.md documentation for AI-assisted development

Comprehensive guide covering project architecture, SQLMesh workflow, data layer conventions, and development commands for the Materia commodity analytics platform.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Materia is a commodity data analytics platform built on a modern data engineering stack. The project extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB for analysis.

**Tech Stack:**

- Python 3.13 with `uv` package manager
- SQLMesh for SQL transformation and orchestration
- DuckDB as the analytical database
- Workspace structure with separate extract and transform packages

## Environment Setup

**Install dependencies:**

```bash
uv sync
```

**Set up pre-commit hooks:**

```bash
pre-commit install
```

**Add new dependencies:**

```bash
uv add <package-name>
```

## Project Structure

This is a uv workspace with three main components:

### 1. Extract Layer (`extract/`)

Contains extraction packages for pulling data from external sources.

- **`extract/psdonline/`**: Extracts USDA PSD commodity data from archives dating back to 2006
  - Entry point: `extract_psd` CLI command (defined in `extract/psdonline/src/psdonline/execute.py`)
  - Downloads monthly zip archives to `extract/psdonline/src/psdonline/data/`
  - Uses ETags to avoid re-downloading unchanged files

**Run extraction:**

```bash
extract_psd
```

### 2. Transform Layer (`transform/sqlmesh_materia/`)

SQLMesh project implementing a layered data architecture.

**Working directory:** All SQLMesh commands must be run from `transform/sqlmesh_materia/`.

**Key commands:**

```bash
cd transform/sqlmesh_materia

# Plan changes (no prompts, auto-apply enabled in config)
sqlmesh plan

# Run tests
sqlmesh test

# Validate models
sqlmesh validate

# Run audits
sqlmesh audit

# Format SQL
sqlmesh format

# Start UI
sqlmesh ui
```

**Configuration:**

- Config: `transform/sqlmesh_materia/config.yaml`
- Default gateway: `dev` (uses `materia_dev.db`)
- Production gateway: `prod` (uses `materia_prod.db`)
- Auto-apply enabled, no interactive prompts
- DuckDB extensions: zipfs, httpfs, iceberg

### 3. Core Package (`src/materia/`)

Currently minimal; main logic resides in workspace packages.

## Data Architecture

SQLMesh models follow a strict 4-layer architecture defined in `transform/sqlmesh_materia/models/README.md`:

### Layer 1: Raw (`models/raw/`)

- **Purpose:** Immutable archive of source data
- **Pattern:** Directly reads from extraction outputs
- **Example:** `raw.psd_alldata` reads zip files using DuckDB's `read_csv('zip://...')` function
- **Grain:** Defines unique keys for each raw table

### Layer 2: Staging (`models/staging/`)

- **Purpose:** Apply schema, cast types, basic cleansing
- **Pattern:** `stg_[source]__[entity]`
- **Example:** `stg_psdalldata__commodity.sql` casts raw strings to proper types and joins lookup tables
- **Features:**
  - Deduplication using hash keys
  - Extracts metadata (`ingest_date`) from file paths
  - 1:1 relationship with raw sources

### Layer 3: Cleaned (`models/cleaned/`)

- **Purpose:** Integration, business logic, unified models
- **Pattern:** `cln_[entity]` or `cln_[vault_component]_[entity]`
- **Example:** `cln_psdalldata__commodity_pivoted.sql` pivots commodity attributes into columns

### Layer 4: Serving (`models/serving/`)

- **Purpose:** Analytics-ready models (star schema, aggregates)
- **Patterns:**
  - `dim_[entity]` for dimensions
  - `fct_[process]` for facts
  - `agg_[description]` for aggregates
  - `obt_[description]` for one-big-tables
- **Example:** `obt_commodity_metrics.sql` provides a wide table for analysis

## Model Development

**Incremental models:**

- Use `INCREMENTAL_BY_TIME_RANGE` kind
- Define `time_column` (usually `ingest_date`)
- Filter with `WHERE time_column BETWEEN @start_ds AND @end_ds`

**Full refresh models:**

- Use `FULL` kind for small lookup tables and raw sources

**Model properties:**

- `grain`: Define unique key columns for data quality
- `start`: Historical backfill start date (project default: 2025-07-07)
- `cron`: Schedule (project default: `@daily`)

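As a sketch, these pieces combine into a model definition like the following. The model name, columns, and source table are hypothetical; the `kind`, `grain`, `start`, `cron`, and macro usage follow the conventions above.

```sql
MODEL (
  name staging.stg_example__entity,  -- hypothetical name following the stg_[source]__[entity] pattern
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column ingest_date          -- column SQLMesh uses to slice incremental runs
  ),
  grain (entity_id, ingest_date),    -- unique key columns, used for data quality validation
  start '2025-07-07',                -- project default backfill start
  cron '@daily'                      -- project default schedule
);

SELECT
  entity_id,
  ingest_date
FROM raw.example_source              -- hypothetical raw-layer source
WHERE ingest_date BETWEEN @start_ds AND @end_ds  -- SQLMesh substitutes the interval being processed
```
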
## Linting and Formatting

**Run linting:**

```bash
ruff check .
```

**Auto-fix issues:**

```bash
ruff check --fix .
```

**Format code:**

```bash
ruff format .
```

Pre-commit hooks automatically run ruff on commits.

## Testing

**Run SQLMesh tests:**

```bash
cd transform/sqlmesh_materia
sqlmesh test
```

**Run Python tests (if configured):**

```bash
pytest --cov=./ --cov-report=xml
```

## CI/CD Pipeline

GitLab CI runs three stages (`.gitlab-ci.yml`):

1. **Lint:** Runs ruff check and format validation, plus pip-audit
2. **Test:** Runs pytest with coverage
3. **Build:** Creates distribution packages (on tags only)

## Key Design Patterns

**Raw data ingestion:**

- DuckDB reads directly from zip archives using `read_csv('zip://...')`
- `filename=true` captures the source file path for metadata
- `union_by_name=true` handles schema evolution

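A minimal sketch of this read pattern (the archive path is a placeholder, not the real repo path):

```sql
SELECT *
FROM read_csv(
  'zip://path/to/archive.zip/*.csv',  -- zipfs extension lets DuckDB read CSVs inside the zip
  filename = true,                    -- adds a `filename` column holding each row's source path
  union_by_name = true                -- matches columns by name, tolerating schema drift across files
)
```
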
**Deduplication:**

- Use `hash()` function to create unique keys
- Use `any_value()` with `GROUP BY hkey` to deduplicate
- Preserve all metadata in hash key for change detection

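A minimal sketch of the pattern, with illustrative column and table names:

```sql
SELECT
  hash(concat_ws('|', col_a, col_b)) AS hkey,  -- hash over all meaningful columns for change detection
  any_value(col_a) AS col_a,                   -- one representative value per duplicate group
  any_value(col_b) AS col_b
FROM source_table
GROUP BY hkey
```
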
**Date handling:**

- Extract ingest dates from file paths: `make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1)`
- Calculate market dates: `last_day(make_date(market_year, month, 1))`

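For example (the file path below is made up to show the shape these expressions expect):

```sql
SELECT
  -- [-4] and [-3] grab the year and month directories from the path
  make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1) AS ingest_date,
  -- last day of the month for market year 2024, month 7, i.e. 2024-07-31
  last_day(make_date(2024, 7, 1)) AS market_date
FROM (VALUES ('data/2024/07/archive.zip/file.csv')) AS t(filename)
```
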
**SQLMesh best practices:**

- Always define `grain` for data quality validation
- Use meaningful model names following layer conventions
- Leverage SQLMesh's built-in time macros (`@start_ds`, `@end_ds`)
- Keep the raw layer thin; push transformations to staging and later layers

## Database Location

- **Dev database:** `materia_dev.db` (13 GB, in project root)
- **Prod database:** `materia_prod.db` (not yet created)

Note: The dev database is large and should not be committed to git (`.gitignore` is already configured).