# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Materia is a commodity data analytics platform (product: BeanFlows.coffee) for coffee traders. It's a uv workspace monorepo with packages for data extraction (USDA PSD and Open-Meteo weather), SQL transformation (SQLMesh + DuckDB), and a CLI for worker management and local pipeline execution.
## Commands
```shell
# Install dependencies
uv sync

# Lint & format
ruff check .        # Check
ruff check --fix .  # Auto-fix
ruff format .       # Format

# Tests
uv run pytest tests/ -v --cov=src/materia            # CLI/Python tests
cd transform/sqlmesh_materia && uv run sqlmesh test  # SQLMesh model tests

# Run a single test
uv run pytest tests/test_cli.py::test_name -v

# Extract data
LANDING_DIR=data/landing uv run extract_psd

# SQLMesh (from repo root)
uv run sqlmesh -p transform/sqlmesh_materia plan       # Plans to dev_<username> by default
uv run sqlmesh -p transform/sqlmesh_materia plan prod  # Production
uv run sqlmesh -p transform/sqlmesh_materia test       # Run model tests
uv run sqlmesh -p transform/sqlmesh_materia format     # Format SQL

# CLI
uv run materia pipeline run extract|transform
uv run materia pipeline list
uv run materia worker create|destroy|list
uv run materia secrets get
```
## Architecture
**Workspace packages** (`pyproject.toml` → `tool.uv.workspace`):

- `extract/psdonline/` — Downloads USDA PSD Online data, normalizes ZIP→gzip CSV, writes to local landing directory
- `extract/openmeteo/` — Daily weather for 12 coffee-growing regions (Open-Meteo, ERA5 reanalysis, no API key)
- `transform/sqlmesh_materia/` — 3-layer SQL transformation pipeline (local DuckDB)
- `src/materia/` — CLI (Typer) for pipeline execution, worker management, secrets
- `web/` — Future web frontend
**Data flow:**

```
USDA API   → extract → /data/materia/landing/psd/{year}/{month}/{etag}.csv.gzip
Open-Meteo → extract → /data/materia/landing/weather/{location_id}/{year}/{date}.json.gz
           → rclone cron syncs landing/ to R2
           → SQLMesh staging → foundation → serving → /data/materia/lakehouse.duckdb
           → Web app reads lakehouse.duckdb (read-only)
```
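The landing-path conventions above can be sketched as plain path builders. This is a hedged illustration: `psd_landing_path` and `weather_landing_path` are hypothetical names, not the extractors' actual functions.

```python
from datetime import date
from pathlib import Path


def psd_landing_path(root: Path, etag: str, day: date) -> Path:
    # Mirrors landing/psd/{year}/{month}/{etag}.csv.gzip
    return root / "psd" / f"{day.year}" / f"{day.month:02d}" / f"{etag}.csv.gzip"


def weather_landing_path(root: Path, location_id: str, day: date) -> Path:
    # Mirrors landing/weather/{location_id}/{year}/{date}.json.gz
    return root / "weather" / location_id / f"{day.year}" / f"{day.isoformat()}.json.gz"


root = Path("data/landing")
print(psd_landing_path(root, "abc123", date(2025, 7, 7)).as_posix())
# → data/landing/psd/2025/07/abc123.csv.gzip
```

Keeping these as free functions over `pathlib.Path` (rather than a class) matches the repo's procedural style.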
**SQLMesh 3-layer model structure** (`transform/sqlmesh_materia/models/`):

- `staging/` — Type casting, lookup joins, basic cleansing (reads landing directly)
- `foundation/` — Business logic, pivoting, dimensions, facts (also reads landing directly)
- `serving/` — Analytics-ready aggregates for the web app
**CLI modules** (`src/materia/`):

- `cli.py` — Typer app with subcommands: worker, pipeline, secrets, version
- `workers.py` — Hetzner cloud instance management (for ad-hoc compute)
- `pipelines.py` — Local subprocess pipeline execution with bounded timeouts
- `secrets.py` — Pulumi ESC integration for environment secrets
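The bounded-timeout behavior in `pipelines.py` boils down to `subprocess.run` with a `timeout`. A minimal sketch, assuming a hypothetical `run_pipeline` helper (the name and return convention are not the repo's actual code):

```python
import subprocess
import sys


def run_pipeline(cmd: list[str], timeout_s: float = 120.0) -> int:
    """Run a pipeline command as a subprocess with a hard time bound."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout; report it as failure.
        return -1


# e.g. run_pipeline([sys.executable, "-m", "some_extractor"])
```

A bounded timeout keeps a hung extractor from blocking the orchestrator indefinitely; the caller just sees a non-zero exit.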
**Infrastructure** (`infra/`):
- Pulumi IaC for Cloudflare R2 buckets and Hetzner compute
- Supervisor systemd service for always-on orchestration (pulls git, runs pipelines)
- rclone systemd timer for landing data backup to R2
## Coding Philosophy
Read `coding_philosophy.md` for the full guide. Key points:
- **Simple, procedural code** — Functions over classes, no inheritance hierarchies, no "Manager" patterns
- **Data-oriented** — Use dicts/lists/tuples, not objects hiding data behind getters
- **Keep logic in SQL** — Let DuckDB do the heavy lifting, don't pull data into Python to transform it
- **Build minimum that works** — No premature abstraction, three examples before generalizing
- **Explicit over implicit** — No framework magic, no metaprogramming, no hidden behavior
- **Question every dependency** — Can you write it simply yourself? Are you using 5% of a large framework?
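As a small, hypothetical illustration of the first two points (not code from this repo): a free function over plain dicts, rather than a `LocationManager` class with getters.

```python
def locations_in_country(locations: list[dict], country: str) -> list[dict]:
    """Filter location records by country code — plain data in, plain data out."""
    return [loc for loc in locations if loc["country"] == country]


locations = [
    {"id": "minas_gerais", "country": "BR"},
    {"id": "huila", "country": "CO"},
]
brazil = locations_in_country(locations, "BR")
# brazil == [{"id": "minas_gerais", "country": "BR"}]
```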
## Key Configuration
- Python 3.13 (`.python-version`)
- Ruff: double quotes, spaces, E501 ignored (formatter handles line length)
- SQLMesh: DuckDB dialect, `@daily` cron, start date `2025-07-07`, default env `dev_{{ user() }}`
- Storage: local NVMe (`LANDING_DIR`, `DUCKDB_PATH`), R2 for backup via rclone
- Secrets: Pulumi ESC (`esc run beanflows/prod -- <cmd>`)
- CI: GitLab CI (`.gitlab/.gitlab-ci.yml`) — runs pytest and sqlmesh test on push/MR
- Pre-commit hooks: installed via `pre-commit install`
## Environment Variables
| Variable | Default | Description |
|---|---|---|
| `LANDING_DIR` | `data/landing` | Root directory for extracted landing data |
| `DUCKDB_PATH` | `local.duckdb` | Path to the DuckDB lakehouse database |
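A sketch of how code typically resolves these variables with their documented defaults (the `env_path` helper is hypothetical, not the repo's actual lookup):

```python
import os
from pathlib import Path


def env_path(name: str, default: str) -> Path:
    """Resolve an environment variable to a Path, using the documented default when unset."""
    return Path(os.environ.get(name, default))


landing_dir = env_path("LANDING_DIR", "data/landing")
duckdb_path = env_path("DUCKDB_PATH", "local.duckdb")
```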