Add CLAUDE.md documentation for AI-assisted development

Comprehensive guide covering project architecture, SQLMesh workflow, data layer conventions, and development commands for the Materia commodity analytics platform.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Materia is a commodity data analytics platform built on a modern data engineering stack. The project extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB for analysis.

**Tech Stack:**

- Python 3.13 with `uv` package manager
- SQLMesh for SQL transformation and orchestration
- DuckDB as the analytical database
- Workspace structure with separate extract and transform packages

## Environment Setup

**Install dependencies:**

```bash
uv sync
```

**Set up pre-commit hooks:**

```bash
pre-commit install
```

**Add new dependencies:**

```bash
uv add <package-name>
```

## Project Structure

This is a uv workspace with three main components:

### 1. Extract Layer (`extract/`)

Contains extraction packages for pulling data from external sources.

- **`extract/psdonline/`**: Extracts USDA PSD commodity data from archives dating back to 2006
  - Entry point: `extract_psd` CLI command (defined in `extract/psdonline/src/psdonline/execute.py`)
  - Downloads monthly zip archives to `extract/psdonline/src/psdonline/data/`
  - Uses ETags to avoid re-downloading unchanged files

**Run extraction:**

```bash
extract_psd
```

### 2. Transform Layer (`transform/sqlmesh_materia/`)

SQLMesh project implementing a layered data architecture.

**Working directory:** All SQLMesh commands must be run from `transform/sqlmesh_materia/`.

**Key commands:**

```bash
cd transform/sqlmesh_materia

# Plan changes (no prompts, auto-apply enabled in config)
sqlmesh plan

# Run tests
sqlmesh test

# Validate models
sqlmesh validate

# Run audits
sqlmesh audit

# Format SQL
sqlmesh format

# Start UI
sqlmesh ui
```

**Configuration:**

- Config: `transform/sqlmesh_materia/config.yaml`
- Default gateway: `dev` (uses `materia_dev.db`)
- Production gateway: `prod` (uses `materia_prod.db`)
- Auto-apply enabled, no interactive prompts
- DuckDB extensions: zipfs, httpfs, iceberg

### 3. Core Package (`src/materia/`)

Currently minimal; main logic resides in workspace packages.

## Data Architecture

SQLMesh models follow a strict 4-layer architecture defined in `transform/sqlmesh_materia/models/README.md`:

### Layer 1: Raw (`models/raw/`)

- **Purpose:** Immutable archive of source data
- **Pattern:** Directly reads from extraction outputs
- **Example:** `raw.psd_alldata` reads zip files using DuckDB's `read_csv('zip://...')` function
- **Grain:** Defines unique keys for each raw table

### Layer 2: Staging (`models/staging/`)

- **Purpose:** Apply schema, cast types, basic cleansing
- **Pattern:** `stg_[source]__[entity]`
- **Example:** `stg_psdalldata__commodity.sql` casts raw strings to proper types and joins lookup tables
- **Features:**
  - Deduplication using hash keys
  - Extracts metadata (`ingest_date`) from file paths
  - 1:1 relationship with raw sources

### Layer 3: Cleaned (`models/cleaned/`)

- **Purpose:** Integration, business logic, unified models
- **Pattern:** `cln_[entity]` or `cln_[vault_component]_[entity]`
- **Example:** `cln_psdalldata__commodity_pivoted.sql` pivots commodity attributes into columns

### Layer 4: Serving (`models/serving/`)

- **Purpose:** Analytics-ready models (star schema, aggregates)
- **Patterns:**
  - `dim_[entity]` for dimensions
  - `fct_[process]` for facts
  - `agg_[description]` for aggregates
  - `obt_[description]` for one-big-tables
- **Example:** `obt_commodity_metrics.sql` provides a wide table for analysis

## Model Development

**Incremental models:**

- Use `INCREMENTAL_BY_TIME_RANGE` kind
- Define `time_column` (usually `ingest_date`)
- Filter with `WHERE time_column BETWEEN @start_ds AND @end_ds`

**Full refresh models:**

- Use `FULL` kind for small lookup tables and raw sources

**Model properties:**

- `grain`: Define unique key columns for data quality
- `start`: Historical backfill start date (project default: 2025-07-07)
- `cron`: Schedule (project default: `@daily`)

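As a sketch, these pieces combine into a model definition like the following. The model name, columns, and source table are hypothetical; the `kind`, `grain`, `start`, `cron`, and macro usage follow the conventions above.

```sql
MODEL (
  name staging.stg_example__entity,  -- hypothetical name following the stg_[source]__[entity] pattern
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column ingest_date          -- column SQLMesh uses to slice incremental runs
  ),
  grain (entity_id, ingest_date),    -- unique key columns, used for data quality validation
  start '2025-07-07',                -- project default backfill start
  cron '@daily'                      -- project default schedule
);

SELECT
  entity_id,
  ingest_date
FROM raw.example_source              -- hypothetical raw-layer source
WHERE ingest_date BETWEEN @start_ds AND @end_ds  -- SQLMesh substitutes the interval being processed
```
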
## Linting and Formatting

**Run linting:**

```bash
ruff check .
```

**Auto-fix issues:**

```bash
ruff check --fix .
```

**Format code:**

```bash
ruff format .
```

Pre-commit hooks automatically run ruff on commits.

## Testing

**Run SQLMesh tests:**

```bash
cd transform/sqlmesh_materia
sqlmesh test
```

**Run Python tests (if configured):**

```bash
pytest --cov=./ --cov-report=xml
```

## CI/CD Pipeline

GitLab CI runs three stages (`.gitlab-ci.yml`):

1. **Lint:** Runs ruff check and format validation, plus pip-audit
2. **Test:** Runs pytest with coverage
3. **Build:** Creates distribution packages (on tags only)

## Key Design Patterns

**Raw data ingestion:**

- DuckDB reads directly from zip archives using `read_csv('zip://...')`
- `filename=true` captures the source file path for metadata
- `union_by_name=true` handles schema evolution

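A minimal sketch of this read pattern (the archive path is a placeholder, not the real repo path):

```sql
SELECT *
FROM read_csv(
  'zip://path/to/archive.zip/*.csv',  -- zipfs extension lets DuckDB read CSVs inside the zip
  filename = true,                    -- adds a `filename` column holding each row's source path
  union_by_name = true                -- matches columns by name, tolerating schema drift across files
)
```
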
**Deduplication:**

- Use `hash()` function to create unique keys
- Use `any_value()` with `GROUP BY hkey` to deduplicate
- Preserve all metadata in hash key for change detection

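A minimal sketch of the pattern, with illustrative column and table names:

```sql
SELECT
  hash(concat_ws('|', col_a, col_b)) AS hkey,  -- hash over all meaningful columns for change detection
  any_value(col_a) AS col_a,                   -- one representative value per duplicate group
  any_value(col_b) AS col_b
FROM source_table
GROUP BY hkey
```
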
**Date handling:**

- Extract ingest dates from file paths: `make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1)`
- Calculate market dates: `last_day(make_date(market_year, month, 1))`

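For example (the file path below is made up to show the shape these expressions expect):

```sql
SELECT
  -- [-4] and [-3] grab the year and month directories from the path
  make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1) AS ingest_date,
  -- last day of the month for market year 2024, month 7, i.e. 2024-07-31
  last_day(make_date(2024, 7, 1)) AS market_date
FROM (VALUES ('data/2024/07/archive.zip/file.csv')) AS t(filename)
```
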
**SQLMesh best practices:**

- Always define `grain` for data quality validation
- Use meaningful model names following layer conventions
- Leverage SQLMesh's built-in time macros (`@start_ds`, `@end_ds`)
- Keep the raw layer thin; push transformations to staging and later layers

## Database Location

- **Dev database:** `materia_dev.db` (13 GB, in project root)
- **Prod database:** `materia_prod.db` (not yet created)

Note: The dev database is large and should not be committed to git (`.gitignore` is already configured).