
Deeman 08e74665bb feat(extract): add OpenWeatherMap daily weather extractor

Adds extract/openweathermap package with daily weather extraction for 8
coffee-growing regions (Brazil, Vietnam, Colombia, Ethiopia, Honduras,
Guatemala, Indonesia). Feeds crop stress signal for commodity sentiment score.

Extractor:
- OWM One Call API 3.0 / Day Summary — one JSON.gz per (location, date)
- extract_weather: daily, fetches yesterday + today (16 calls max)
- extract_weather_backfill: fills 2020-01-01 to yesterday, capped at 500
  calls/run with resume cursor '{location_id}:{date}' for crash safety
- Full idempotency via file existence check; state tracking via extract_core

SQLMesh:
- seeds.weather_locations (8 regions with lat/lon/variety)
- foundation.fct_weather_daily: INCREMENTAL_BY_TIME_RANGE, grain
  (location_id, observation_date), dedup via hash key, crop stress flags:
  is_frost (<2°C), is_heat_stress (>35°C), is_drought (<1mm), in_growing_season

Landing path: LANDING_DIR/weather/{location_id}/{year}/{date}.json.gz

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-25 22:40:27 +01:00

4.7 KiB


CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Materia is a commodity data analytics platform (product: BeanFlows.coffee) for coffee traders. It's a uv workspace monorepo covering three areas: extraction (USDA PSD data and OpenWeatherMap daily weather), SQL transformation (SQLMesh + DuckDB), and a CLI for worker management and local pipeline execution.

Commands

# Install dependencies
uv sync

# Lint & format
ruff check .            # Check
ruff check --fix .      # Auto-fix
ruff format .           # Format

# Tests
uv run pytest tests/ -v --cov=src/materia         # CLI/Python tests
cd transform/sqlmesh_materia && uv run sqlmesh test  # SQLMesh model tests

# Run a single test
uv run pytest tests/test_cli.py::test_name -v

# Extract data
LANDING_DIR=data/landing uv run extract_psd

# SQLMesh (from repo root)
uv run sqlmesh -p transform/sqlmesh_materia plan              # Plans to dev_<username> by default
uv run sqlmesh -p transform/sqlmesh_materia plan prod          # Production
uv run sqlmesh -p transform/sqlmesh_materia test               # Run model tests
uv run sqlmesh -p transform/sqlmesh_materia format             # Format SQL

# CLI
uv run materia pipeline run extract|transform
uv run materia pipeline list
uv run materia worker create|destroy|list
uv run materia secrets get

Architecture

Workspace packages (pyproject.toml → tool.uv.workspace):

extract/psdonline/ — Downloads USDA PSD Online data, normalizes ZIP→gzip CSV, writes to local landing directory
extract/openweathermap/ — Daily weather for 8 coffee-growing regions (OWM One Call API 3.0)
transform/sqlmesh_materia/ — 3-layer SQL transformation pipeline (local DuckDB)
src/materia/ — CLI (Typer) for pipeline execution, worker management, secrets
web/ — Future web frontend

Data flow:

USDA API → extract → /data/materia/landing/psd/{year}/{month}/{etag}.csv.gzip
OWM API  → extract → /data/materia/landing/weather/{location_id}/{year}/{date}.json.gz
         → rclone cron syncs landing/ to R2
         → SQLMesh staging → foundation → serving → /data/materia/lakehouse.duckdb
         → Web app reads lakehouse.duckdb (read-only)
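
The weather landing layout in the data flow above can be sketched as a path builder plus an existence check, which is how the commit notes describe idempotency. This is an illustrative sketch rather than the extractor's real code. The function names are hypothetical, {date} is assumed to be an ISO YYYY-MM-DD string, and the write-to-temp-then-rename step is an added assumption to avoid leaving partial files.

```python
import gzip
import json
import os
from datetime import date
from pathlib import Path


def weather_landing_path(landing_dir: Path, location_id: str, day: date) -> Path:
    # Mirrors LANDING_DIR/weather/{location_id}/{year}/{date}.json.gz,
    # assuming {date} is formatted as YYYY-MM-DD.
    return landing_dir / "weather" / location_id / str(day.year) / f"{day.isoformat()}.json.gz"


def write_if_missing(path: Path, payload: dict) -> bool:
    """Write payload as gzipped JSON unless the file already exists.

    Returns True if a file was written, False if it was skipped.
    """
    if path.exists():
        return False  # Already extracted: re-running is a no-op.
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    with gzip.open(tmp, "wt", encoding="utf-8") as fh:
        json.dump(payload, fh)
    os.replace(tmp, path)  # Rename so readers never see a half-written file.
    return True
```

Because the check is on the final file name only, a run that is interrupted mid-write leaves a .tmp file behind and the (location, date) pair is fetched again on the next run.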

SQLMesh 3-layer model structure (transform/sqlmesh_materia/models/):

staging/ — Type casting, lookup joins, basic cleansing (reads landing directly)
foundation/ — Business logic, pivoting, dimensions, facts (also reads landing directly)
serving/ — Analytics-ready aggregates for the web app

CLI modules (src/materia/):

cli.py — Typer app with subcommands: worker, pipeline, secrets, version
workers.py — Hetzner cloud instance management (for ad-hoc compute)
pipelines.py — Local subprocess pipeline execution with bounded timeouts
secrets.py — Pulumi ESC integration for environment secrets
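
As a rough illustration of what "local subprocess pipeline execution with bounded timeouts" can look like, the sketch below wraps subprocess.run with a timeout and returns a plain dict, in keeping with the data-oriented style described later in this file. The function name run_step and the shape of the returned dict are assumptions; pipelines.py may be organized differently.

```python
import subprocess


def run_step(cmd: list[str], timeout_seconds: float) -> dict:
    """Run one pipeline command and report the outcome as a plain dict."""
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # The child is killed if it runs longer than this.
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "returncode": None, "timed_out": True, "stdout": ""}
    return {
        "ok": proc.returncode == 0,
        "returncode": proc.returncode,
        "timed_out": False,
        "stdout": proc.stdout,
    }
```

For example, run_step(["uv", "run", "extract_psd"], timeout_seconds=1800) would bound an extraction run to 30 minutes; the timeout value here is purely illustrative.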

Infrastructure (infra/):

Pulumi IaC for Cloudflare R2 buckets and Hetzner compute
Supervisor systemd service for always-on orchestration (pulls git, runs pipelines)
rclone systemd timer for landing data backup to R2

Coding Philosophy

Read coding_philosophy.md for the full guide. Key points:

Simple, procedural code — Functions over classes, no inheritance hierarchies, no "Manager" patterns
Data-oriented — Use dicts/lists/tuples, not objects hiding data behind getters
Keep logic in SQL — Let DuckDB do the heavy lifting, don't pull data into Python to transform it
Build minimum that works — No premature abstraction, three examples before generalizing
Explicit over implicit — No framework magic, no metaprogramming, no hidden behavior
Question every dependency — Can you write it simply yourself? Are you using 5% of a large framework?

Key Configuration

Python 3.13 (.python-version)
Ruff: double quotes, spaces, E501 ignored (formatter handles line length)
SQLMesh: DuckDB dialect, @daily cron, start date 2025-07-07, default env dev_{{ user() }}
Storage: Local NVMe (LANDING_DIR, DUCKDB_PATH), R2 for backup via rclone
Secrets: Pulumi ESC (esc run beanflows/prod -- <cmd>)
CI: GitLab CI (.gitlab/.gitlab-ci.yml) — runs pytest and sqlmesh test on push/MR
Pre-commit hooks: installed via pre-commit install

Environment Variables

Variable	Default	Description
`LANDING_DIR`	`data/landing`	Root directory for extracted landing data
`DUCKDB_PATH`	`local.duckdb`	Path to the DuckDB lakehouse database
`OPENWEATHERMAP_API_KEY`	—	OWM One Call API 3.0 key (required for weather extraction)
