Adds extract/openweathermap package with daily weather extraction for 8
coffee-growing regions (Brazil, Vietnam, Colombia, Ethiopia, Honduras,
Guatemala, Indonesia). Feeds crop stress signal for commodity sentiment score.
Extractor:
- OWM One Call API 3.0 / Day Summary — one JSON.gz per (location, date)
- extract_weather: daily, fetches yesterday + today (16 calls max)
- extract_weather_backfill: fills 2020-01-01 to yesterday, capped at 500
calls/run with resume cursor '{location_id}:{date}' for crash safety
- Full idempotency via file existence check; state tracking via extract_core
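The backfill's `'{location_id}:{date}'` resume cursor could be sketched as below. These helper names and signatures are hypothetical, for illustration only; the actual extractor API is not shown here.

```python
from datetime import date

def make_cursor(location_id: str, day: date) -> str:
    """Encode backfill progress as 'location_id:YYYY-MM-DD'."""
    return f"{location_id}:{day.isoformat()}"

def parse_cursor(cursor: str) -> tuple[str, date]:
    """Split on the last ':' so a location id could itself contain ':'."""
    location_id, day_str = cursor.rsplit(":", 1)
    return location_id, date.fromisoformat(day_str)
```

Persisting such a cursor after each successful fetch is what lets a crashed backfill run resume from where it stopped instead of re-spending the capped call budget.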
SQLMesh:
- seeds.weather_locations (8 regions with lat/lon/variety)
- foundation.fct_weather_daily: INCREMENTAL_BY_TIME_RANGE, grain
(location_id, observation_date), dedup via hash key, crop stress flags:
is_frost (<2°C), is_heat_stress (>35°C), is_drought (<1mm), in_growing_season
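As a sketch of the dedup hash key and the crop stress thresholds listed above (the real logic lives in the SQLMesh model's SQL; these Python function names are hypothetical):

```python
import hashlib

def row_key(location_id: str, observation_date: str) -> str:
    """Deterministic hash key over the grain (location_id, observation_date)."""
    return hashlib.sha256(f"{location_id}|{observation_date}".encode()).hexdigest()

def crop_stress_flags(temp_min_c: float, temp_max_c: float, precip_mm: float) -> dict:
    """Apply the documented thresholds: frost < 2°C, heat > 35°C, drought < 1 mm."""
    return {
        "is_frost": temp_min_c < 2.0,
        "is_heat_stress": temp_max_c > 35.0,
        "is_drought": precip_mm < 1.0,
    }
```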
Landing path: LANDING_DIR/weather/{location_id}/{year}/{date}.json.gz
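The landing path layout and the file-existence idempotency check might look like this minimal sketch (helper names are illustrative, not the extractor's actual API):

```python
from datetime import date
from pathlib import Path

def landing_path(landing_dir: str, location_id: str, day: date) -> Path:
    """Mirror LANDING_DIR/weather/{location_id}/{year}/{date}.json.gz."""
    return (Path(landing_dir) / "weather" / location_id
            / str(day.year) / f"{day.isoformat()}.json.gz")

def already_extracted(landing_dir: str, location_id: str, day: date) -> bool:
    """Existence check that makes re-running the extractor a no-op."""
    return landing_path(landing_dir, location_id, day).exists()
```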
Materia
A commodity data analytics platform built on a modern data engineering stack. Extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB + Cloudflare R2 for analysis.
Tech Stack
- Python 3.13 with uv package manager
- SQLMesh for SQL transformation and orchestration
- DuckDB as the analytical database
- Cloudflare R2 (Iceberg) for data storage
- Pulumi ESC for secrets management
- Hetzner Cloud for infrastructure
Quick Start
1. Install UV
UV is our Python package manager, chosen for fast, reliable dependency management.
curl -LsSf https://astral.sh/uv/install.sh | sh
2. Install Dependencies
uv sync
This installs Python and all dependencies declared in pyproject.toml.
3. Setup Pre-commit Hooks
pre-commit install
This enables automatic linting with ruff on every commit.
4. Install Pulumi ESC (for running with secrets)
# Install ESC CLI
curl -fsSL https://get.pulumi.com/esc/install.sh | sh
# Login
esc login
Project Structure
This is a uv workspace with three main packages:
Extract Layer (extract/)
psdonline - Extracts USDA PSD commodity data
# Local development (downloads to local directory)
uv run extract_psd
# Production (uploads to R2)
esc run beanflows/prod -- uv run extract_psd
Transform Layer (transform/sqlmesh_materia/)
SQLMesh project implementing a 4-layer data architecture (raw → staging → cleaned → serving).
All commands run from project root with -p transform/sqlmesh_materia:
# Local development
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan dev_<username>
# Production
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan prod
# Run tests (no secrets needed)
uv run sqlmesh -p transform/sqlmesh_materia test
# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format
Core Package (src/materia/)
CLI for managing infrastructure and pipelines (currently minimal).
Development Workflow
Adding Dependencies
For workspace root:
uv add <package-name>
For specific package:
uv add --package psdonline <package-name>
Linting and Formatting
# Check for issues
ruff check .
# Auto-fix issues
ruff check --fix .
# Format code
ruff format .
Running Tests
# Python tests
uv run pytest tests/ -v --cov=src/materia
# SQLMesh tests
uv run sqlmesh -p transform/sqlmesh_materia test
Secrets Management
All secrets are managed via Pulumi ESC environment beanflows/prod.
Load secrets into shell:
eval $(esc env open beanflows/prod --format shell)
Run commands with secrets:
# Single command
esc run beanflows/prod -- uv run extract_psd
# Multiple commands
esc run beanflows/prod -- bash -c "
uv run extract_psd
uv run sqlmesh -p transform/sqlmesh_materia plan prod
"
Production Architecture
Git-Based Deployment
- Supervisor (Hetzner CPX11): Always-on orchestrator that pulls latest code every 15 minutes
- Workers (Ephemeral): Created on-demand for each pipeline run, destroyed after completion
- Storage: Cloudflare R2 Data Catalog (Apache Iceberg REST API)
CI/CD Pipeline
GitLab CI runs on every push to master:
- Lint - ruff check
- Test - pytest + SQLMesh tests
- Deploy - Updates supervisor infrastructure and bootstraps if needed
No build artifacts - supervisor pulls code directly from git!
Architecture Principles
- Simplicity First - Avoid unnecessary abstractions
- Data-Oriented Design - Identify data by content, not metadata
- Cost Optimization - Ephemeral workers, minimal always-on infrastructure
- Inspectable - Easy to understand, test locally, and debug
Resources
- Architecture Plans: See .claude/plans/ for design decisions
- UV Docs: https://docs.astral.sh/uv/
- SQLMesh Docs: https://sqlmesh.readthedocs.io/