Adds extract/openweathermap package with daily weather extraction for 8
coffee-growing regions (Brazil, Vietnam, Colombia, Ethiopia, Honduras,
Guatemala, Indonesia). Feeds crop stress signal for commodity sentiment score.
Extractor:
- OWM One Call API 3.0 / Day Summary — one JSON.gz per (location, date)
- extract_weather: daily, fetches yesterday + today (16 calls max)
- extract_weather_backfill: fills 2020-01-01 to yesterday, capped at 500
calls/run with resume cursor '{location_id}:{date}' for crash safety
- Full idempotency via file existence check; state tracking via extract_core
SQLMesh:
- seeds.weather_locations (8 regions with lat/lon/variety)
- foundation.fct_weather_daily: INCREMENTAL_BY_TIME_RANGE, grain
(location_id, observation_date), dedup via hash key, crop stress flags:
is_frost (<2°C), is_heat_stress (>35°C), is_drought (<1mm), in_growing_season
Landing path: LANDING_DIR/weather/{location_id}/{year}/{date}.json.gz
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
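
The landing-path and idempotency scheme described above can be sketched as follows. This is a minimal illustration, not the package's actual API: `landing_path` and `write_day` are invented names, and `"brazil"` is a placeholder `location_id`.

```python
import gzip
import json
from pathlib import Path


def landing_path(landing_dir: str, location_id: str, date: str) -> Path:
    """Build LANDING_DIR/weather/{location_id}/{year}/{date}.json.gz."""
    year = date[:4]  # dates are ISO-formatted (YYYY-MM-DD)
    return Path(landing_dir) / "weather" / location_id / year / f"{date}.json.gz"


def write_day(landing_dir: str, location_id: str, date: str, payload: dict) -> bool:
    """Write one (location, date) day summary; skip if already landed.

    Returns True if a new file was written, False if it already existed.
    """
    path = landing_path(landing_dir, location_id, date)
    if path.exists():  # idempotency: file existence is the completeness check
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(payload, f)
    return True
```

Re-running the extractor over an already-landed range then costs zero API calls, which is what makes the capped backfill with a resume cursor safe to restart.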
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Materia is a commodity data analytics platform (product: **BeanFlows.coffee**) for coffee traders. It's a uv workspace monorepo with extraction packages (USDA PSD and OpenWeatherMap data), a SQL transformation package (SQLMesh + DuckDB), and a CLI for worker management and local pipeline execution.

## Commands

```bash
# Install dependencies
uv sync

# Lint & format
ruff check .         # Check
ruff check --fix .   # Auto-fix
ruff format .        # Format

# Tests
uv run pytest tests/ -v --cov=src/materia            # CLI/Python tests
cd transform/sqlmesh_materia && uv run sqlmesh test  # SQLMesh model tests

# Run a single test
uv run pytest tests/test_cli.py::test_name -v

# Extract data
LANDING_DIR=data/landing uv run extract_psd

# SQLMesh (from repo root)
uv run sqlmesh -p transform/sqlmesh_materia plan       # Plans to dev_<username> by default
uv run sqlmesh -p transform/sqlmesh_materia plan prod  # Production
uv run sqlmesh -p transform/sqlmesh_materia test       # Run model tests
uv run sqlmesh -p transform/sqlmesh_materia format     # Format SQL

# CLI
uv run materia pipeline run extract|transform
uv run materia pipeline list
uv run materia worker create|destroy|list
uv run materia secrets get
```

## Architecture

**Workspace packages** (`pyproject.toml` → `tool.uv.workspace`):

- `extract/psdonline/` — Downloads USDA PSD Online data, normalizes ZIP→gzip CSV, writes to local landing directory
- `extract/openweathermap/` — Daily weather for 8 coffee-growing regions (OWM One Call API 3.0)
- `transform/sqlmesh_materia/` — 3-layer SQL transformation pipeline (local DuckDB)
- `src/materia/` — CLI (Typer) for pipeline execution, worker management, secrets
- `web/` — Future web frontend

**Data flow:**

```
USDA API → extract → /data/materia/landing/psd/{year}/{month}/{etag}.csv.gzip
OWM API  → extract → /data/materia/landing/weather/{location_id}/{year}/{date}.json.gz
  → rclone cron syncs landing/ to R2
  → SQLMesh staging → foundation → serving → /data/materia/lakehouse.duckdb
  → Web app reads lakehouse.duckdb (read-only)
```

**SQLMesh 3-layer model structure** (`transform/sqlmesh_materia/models/`):

1. `staging/` — Type casting, lookup joins, basic cleansing (reads landing directly)
2. `foundation/` — Business logic, pivoting, dimensions, facts (also reads landing directly)
3. `serving/` — Analytics-ready aggregates for the web app

**CLI modules** (`src/materia/`):

- `cli.py` — Typer app with subcommands: worker, pipeline, secrets, version
- `workers.py` — Hetzner cloud instance management (for ad-hoc compute)
- `pipelines.py` — Local subprocess pipeline execution with bounded timeouts
- `secrets.py` — Pulumi ESC integration for environment secrets

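
The bounded-timeout subprocess pattern used for pipeline execution can be sketched like this. It is an illustration only: the `PIPELINES` table and `run_pipeline` name are invented here, and the real command definitions live in `pipelines.py`.

```python
import subprocess

# Hypothetical pipeline table; the real definitions live in pipelines.py.
PIPELINES: dict[str, list[str]] = {
    "extract": ["uv", "run", "extract_psd"],
    "transform": ["uv", "run", "sqlmesh", "-p", "transform/sqlmesh_materia", "run"],
}


def run_pipeline(name: str, timeout_s: int = 3600) -> int:
    """Run one pipeline as a subprocess with a hard timeout; return its exit code."""
    cmd = PIPELINES[name]
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
        return result.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child on timeout; report failure.
        return 1
```

The bounded timeout keeps a hung extractor or SQLMesh run from wedging the always-on supervisor, matching the simple procedural style the repo favors.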
**Infrastructure** (`infra/`):

- Pulumi IaC for Cloudflare R2 buckets and Hetzner compute
- Supervisor systemd service for always-on orchestration (pulls git, runs pipelines)
- rclone systemd timer for landing data backup to R2

## Coding Philosophy

Read `coding_philosophy.md` for the full guide. Key points:

- **Simple, procedural code** — Functions over classes, no inheritance hierarchies, no "Manager" patterns
- **Data-oriented** — Use dicts/lists/tuples, not objects hiding data behind getters
- **Keep logic in SQL** — Let DuckDB do the heavy lifting, don't pull data into Python to transform it
- **Build minimum that works** — No premature abstraction, three examples before generalizing
- **Explicit over implicit** — No framework magic, no metaprogramming, no hidden behavior
- **Question every dependency** — Can you write it simply yourself? Are you using 5% of a large framework?

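
As a tiny illustration of the data-oriented point, observations stay plain dicts and the logic is a free function (the function name is invented for this example; the 2°C frost threshold comes from the weather fact model):

```python
def frost_days(observations: list[dict]) -> int:
    """Count daily observations below the 2°C frost threshold.

    Plain dicts in, an int out — no Observation class, no getters.
    """
    return sum(1 for obs in observations if obs["min_temp_c"] < 2.0)
```

In practice a count like this would live in SQL (see "Keep logic in SQL"); the Python form is only shown here for the style contrast.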
## Key Configuration

- **Python 3.13** (`.python-version`)
- **Ruff**: double quotes, spaces, E501 ignored (formatter handles line length)
- **SQLMesh**: DuckDB dialect, `@daily` cron, start date `2025-07-07`, default env `dev_{{ user() }}`
- **Storage**: Local NVMe (`LANDING_DIR`, `DUCKDB_PATH`), R2 for backup via rclone
- **Secrets**: Pulumi ESC (`esc run beanflows/prod -- <cmd>`)
- **CI**: GitLab CI (`.gitlab/.gitlab-ci.yml`) — runs pytest and sqlmesh test on push/MR
- **Pre-commit hooks**: installed via `pre-commit install`

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root directory for extracted landing data |
| `DUCKDB_PATH` | `local.duckdb` | Path to the DuckDB lakehouse database |
| `OPENWEATHERMAP_API_KEY` | — | OWM One Call API 3.0 key (required for weather extraction) |
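
Resolving these variables with the documented defaults follows the usual `os.environ` pattern. A minimal sketch (the `get_config` name is invented; only the variable names and defaults come from the table above):

```python
import os
from typing import Mapping


def get_config(env: Mapping[str, str] = os.environ) -> dict:
    """Resolve runtime configuration from environment variables.

    Defaults match the Environment Variables table; the API key has no
    default because weather extraction cannot run without it.
    """
    return {
        "landing_dir": env.get("LANDING_DIR", "data/landing"),
        "duckdb_path": env.get("DUCKDB_PATH", "local.duckdb"),
        "owm_api_key": env.get("OPENWEATHERMAP_API_KEY"),  # None if unset
    }
```

Passing the environment in as a plain mapping keeps the function testable without monkeypatching, in line with the data-oriented style above.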