# Materia
A commodity data analytics platform built on a modern data engineering stack. It extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB + Cloudflare R2 for analysis.
## Tech Stack
- Python 3.13 with the `uv` package manager
- SQLMesh for SQL transformation and orchestration
- DuckDB as the analytical database
- Cloudflare R2 (Iceberg) for data storage
- Pulumi ESC for secrets management
- Hetzner Cloud for infrastructure
## Quick Start
### 1. Install UV

UV is our Python package manager for faster, more reliable dependency management.

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
### 2. Install Dependencies

```bash
uv sync
```

This installs Python and all dependencies declared in `pyproject.toml`.
### 3. Set Up Pre-commit Hooks

```bash
pre-commit install
```

This enables automatic linting with ruff on every commit.
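The hooks are declared in `.pre-commit-config.yaml`; a minimal sketch using the official ruff pre-commit hooks (the pinned `rev` below is an assumption — pin to the latest release):

```yaml
# .pre-commit-config.yaml (sketch)
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.8.0  # assumed version - update to the latest tag
    hooks:
      - id: ruff          # lint (ruff check)
        args: [--fix]
      - id: ruff-format   # format (ruff format)
```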
### 4. Install Pulumi ESC (for running with secrets)

```bash
# Install ESC CLI
curl -fsSL https://get.pulumi.com/esc/install.sh | sh

# Login
esc login
```
## Project Structure
This is a uv workspace with three main packages:
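A uv workspace is declared in the root `pyproject.toml`; a minimal sketch, assuming member paths that match the package layout described below (the exact paths are assumptions):

```toml
# Root pyproject.toml (sketch - member paths assumed from the repo layout)
[tool.uv.workspace]
members = [
    "extract/psdonline",
    "transform/sqlmesh_materia",
]
```

`uv sync` then resolves one shared lockfile for the root project and all members.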
### Extract Layer (`extract/`)

`psdonline` - Extracts USDA PSD commodity data

```bash
# Local development (downloads to local directory)
uv run extract_psd

# Production (uploads to R2)
esc run beanflows/prod -- uv run extract_psd
```
### Transform Layer (`transform/sqlmesh_materia/`)

SQLMesh project implementing a 4-layer data architecture (raw → staging → cleaned → serving).
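Each layer is a set of SQLMesh models. A hypothetical staging model sketching the convention (the model, table, and column names are illustrative assumptions, not the project's actual schema):

```sql
-- models/staging/stg_psd_exports.sql (hypothetical example)
MODEL (
  name staging.stg_psd_exports,
  kind FULL
);

-- Select and rename columns from the raw layer for downstream cleaning
SELECT
  commodity_code,
  country_code,
  market_year,
  value
FROM raw.psd_exports
```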
All commands run from the project root with `-p transform/sqlmesh_materia`:
```bash
# Local development
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan dev_<username>

# Production
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan prod

# Run tests (no secrets needed)
uv run sqlmesh -p transform/sqlmesh_materia test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format
```
### Core Package (`src/materia/`)
CLI for managing infrastructure and pipelines (currently minimal).
## Development Workflow
### Adding Dependencies
For the workspace root:

```bash
uv add <package-name>
```

For a specific package:

```bash
uv add --package psdonline <package-name>
```
### Linting and Formatting
```bash
# Check for issues
ruff check .

# Auto-fix issues
ruff check --fix .

# Format code
ruff format .
```
### Running Tests

```bash
# Python tests
uv run pytest tests/ -v --cov=src/materia

# SQLMesh tests
uv run sqlmesh -p transform/sqlmesh_materia test
```
## Secrets Management
All secrets are managed via the Pulumi ESC environment `beanflows/prod`.
Load secrets into the current shell:

```bash
eval $(esc env open beanflows/prod --format shell)
```

Run commands with secrets:

```bash
# Single command
esc run beanflows/prod -- uv run extract_psd

# Multiple commands
esc run beanflows/prod -- bash -c "
  uv run extract_psd
  uv run sqlmesh -p transform/sqlmesh_materia plan prod
"
```
## Production Architecture
### Git-Based Deployment
- Supervisor (Hetzner CPX11): Always-on orchestrator that pulls latest code every 15 minutes
- Workers (Ephemeral): Created on-demand for each pipeline run, destroyed after completion
- Storage: Cloudflare R2 Data Catalog (Apache Iceberg REST API)
### CI/CD Pipeline
GitLab CI runs on every push to `master`:

- Lint - `ruff check`
- Test - `pytest` + SQLMesh tests
- Deploy - Updates supervisor infrastructure and bootstraps if needed

No build artifacts - the supervisor pulls code directly from git!
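The three stages above might look roughly like this in `.gitlab-ci.yml` (a sketch: the image, job names, branch rule, and deploy script are assumptions, not the project's actual pipeline):

```yaml
# .gitlab-ci.yml (sketch - image and deploy script are assumptions)
stages: [lint, test, deploy]

lint:
  stage: lint
  script:
    - uv run ruff check .

test:
  stage: test
  script:
    - uv run pytest tests/ -v --cov=src/materia
    - uv run sqlmesh -p transform/sqlmesh_materia test

deploy:
  stage: deploy
  rules:
    - if: $CI_COMMIT_BRANCH == "master"
  script:
    - ./deploy/update_supervisor.sh  # hypothetical script name
```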
## Architecture Principles
- Simplicity First - Avoid unnecessary abstractions
- Data-Oriented Design - Identify data by content, not metadata
- Cost Optimization - Ephemeral workers, minimal always-on infrastructure
- Inspectable - Easy to understand, test locally, and debug
## Resources

- Architecture Plans: See `.claude/plans/` for design decisions
- UV Docs: https://docs.astral.sh/uv/
- SQLMesh Docs: https://sqlmesh.readthedocs.io/