Deeman ff7301d6a8 ICE extraction overhaul: API discovery + aging report + historical backfill
- Replace brittle ICE_STOCKS_URL env var with API-based URL discovery via
  the private ICE Report Center JSON API (no auth required)
- Add rolling CSV → XLS fallback in extract_ice_stocks() using
  find_latest_report() from ice_api.py
- Add ice_api.py: fetch_report_listings(), find_latest_report() with
  pagination up to MAX_API_PAGES
- Add xls_parse.py: detect_file_format() (magic bytes), xls_to_rows()
  using xlrd for OLE2/BIFF XLS files
- Add extract_ice_aging(): monthly certified stock aging report by
  age bucket × port → ice_aging/ landing dir
- Add extract_ice_historical(): 30-year EOM by-port stocks from static
  ICE URL → ice_stocks_by_port/ landing dir
- Add xlrd>=2.0.1 (parse XLS), xlwt>=1.3.0 (dev, test fixtures)
- Add SQLMesh raw + foundation models for both new datasets
- Add ice_aging_glob(), ice_stocks_by_port_glob() macros
- Add extract_ice_aging + extract_ice_historical pipeline entries
- Add 12 unit tests (format detection, XLS roundtrip, API mock, CSV output)

Seed files (data/landing/ice_aging/seed/ and ice_stocks_by_port/seed/)
must be created locally — data/ is gitignored.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Materia

A commodity data analytics platform built on a modern data engineering stack. Extracts agricultural commodity data from USDA PSD Online, transforms it through a layered SQL pipeline using SQLMesh, and stores it in DuckDB + Cloudflare R2 for analysis.

Tech Stack

  • Python 3.13 with uv package manager
  • SQLMesh for SQL transformation and orchestration
  • DuckDB as the analytical database
  • Cloudflare R2 (Iceberg) for data storage
  • Pulumi ESC for secrets management
  • Hetzner Cloud for infrastructure

Quick Start

1. Install uv

uv is our Python package manager; it makes dependency management faster and more reliable.

curl -LsSf https://astral.sh/uv/install.sh | sh

📚 uv documentation

2. Install Dependencies

uv sync

This installs Python and all dependencies declared in pyproject.toml.
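Since this is a uv workspace (see Project Structure below), the root pyproject.toml also declares the workspace members. A minimal sketch of what that might look like — member paths are taken from the project layout described in this README, and the exact contents may differ:

```toml
# Root pyproject.toml (sketch; exact contents may differ)
[project]
name = "materia"
requires-python = ">=3.13"

[tool.uv.workspace]
members = ["extract/psdonline", "transform/sqlmesh_materia"]
```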

3. Setup Pre-commit Hooks

pre-commit install

This enables automatic linting with ruff on every commit.
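The hooks themselves are declared in .pre-commit-config.yaml. A minimal sketch consistent with ruff-based linting — the revision pin and hook selection here are illustrative, not the repository's actual config:

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.8.0        # illustrative pin; check the repo for the real one
    hooks:
      - id: ruff        # lint (ruff check)
      - id: ruff-format # format (ruff format)
```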

4. Install Pulumi ESC (for running with secrets)

# Install ESC CLI
curl -fsSL https://get.pulumi.com/esc/install.sh | sh

# Login
esc login

Project Structure

This is a uv workspace with three main packages:

Extract Layer (extract/)

psdonline - Extracts USDA PSD commodity data

# Local development (downloads to local directory)
uv run extract_psd

# Production (uploads to R2)
esc run beanflows/prod -- uv run extract_psd
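The local-vs-production split above suggests the extractor chooses its output sink from the environment: when R2 credentials are injected (e.g. by esc run) it uploads, otherwise it writes to a local directory. A hedged sketch of that pattern — the function and variable names here are hypothetical, not the package's actual API:

```python
import os
from pathlib import Path


def choose_landing_dir(default: str = "data/landing/psd") -> str:
    """Pick where extracted files go: an R2 location when credentials
    are present in the environment (e.g. injected by `esc run`),
    otherwise a local landing directory.  R2_BUCKET is a hypothetical
    variable name used only for this sketch."""
    if os.environ.get("R2_BUCKET"):
        return f"r2://{os.environ['R2_BUCKET']}/psd"
    Path(default).mkdir(parents=True, exist_ok=True)
    return default
```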

Transform Layer (transform/sqlmesh_materia/)

SQLMesh project implementing a 4-layer data architecture (raw → staging → cleaned → serving).

All commands run from project root with -p transform/sqlmesh_materia:

# Local development
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan dev_<username>

# Production
esc run beanflows/prod -- uv run sqlmesh -p transform/sqlmesh_materia plan prod

# Run tests (no secrets needed)
uv run sqlmesh -p transform/sqlmesh_materia test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_materia format
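A model in one of these layers might look like the following sketch. The model name, kind, and columns are invented for illustration; the project's real models will differ:

```sql
MODEL (
  name staging.stg_psd_commodities,  -- illustrative name
  kind FULL
);

SELECT
  commodity_code,
  country_code,
  market_year,
  CAST(value AS DOUBLE) AS value
FROM raw.psd_commodities
```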

Core Package (src/materia/)

CLI for managing infrastructure and pipelines (currently minimal).

Development Workflow

Adding Dependencies

For workspace root:

uv add <package-name>

For specific package:

uv add --package psdonline <package-name>

Linting and Formatting

# Check for issues
ruff check .

# Auto-fix issues
ruff check --fix .

# Format code
ruff format .

Running Tests

# Python tests
uv run pytest tests/ -v --cov=src/materia

# SQLMesh tests
uv run sqlmesh -p transform/sqlmesh_materia test

Secrets Management

All secrets are managed via Pulumi ESC environment beanflows/prod.

Load secrets into shell:

eval $(esc env open beanflows/prod --format shell)

Run commands with secrets:

# Single command
esc run beanflows/prod -- uv run extract_psd

# Multiple commands
esc run beanflows/prod -- bash -c "
  uv run extract_psd
  uv run sqlmesh -p transform/sqlmesh_materia plan prod
"
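Code launched under esc run sees the secrets as ordinary environment variables. A sketch of reading them defensively — the variable names here are illustrative; the real names are defined by the beanflows/prod environment:

```python
import os


def r2_credentials() -> dict[str, str]:
    """Collect R2 credentials from the environment, failing loudly if
    a required variable is missing.  Variable names are illustrative;
    the actual names come from the Pulumi ESC environment."""
    required = ("R2_ACCESS_KEY_ID", "R2_SECRET_ACCESS_KEY")
    missing = [name for name in required if name not in os.environ]
    if missing:
        raise RuntimeError(f"missing secrets: {', '.join(missing)}")
    return {name.lower(): os.environ[name] for name in required}
```

Failing at startup with an explicit list of missing variables is easier to debug than a vague auth error deep inside an upload.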

Production Architecture

Git-Based Deployment

  • Supervisor (Hetzner CPX11): Always-on orchestrator that pulls the latest code every 15 minutes
  • Workers (Ephemeral): Created on-demand for each pipeline run, destroyed after completion
  • Storage: Cloudflare R2 Data Catalog (Apache Iceberg REST API)

CI/CD Pipeline

GitLab CI runs on every push to master:

  1. Lint - ruff check
  2. Test - pytest + SQLMesh tests
  3. Deploy - Updates supervisor infrastructure and bootstraps if needed

No build artifacts - supervisor pulls code directly from git!

Architecture Principles

  • Simplicity First - Avoid unnecessary abstractions
  • Data-Oriented Design - Identify data by content, not metadata
  • Cost Optimization - Ephemeral workers, minimal always-on infrastructure
  • Inspectable - Easy to understand, test locally, and debug
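"Identify data by content, not metadata" is classic content addressing: derive an identifier from the bytes themselves, so identical payloads get identical IDs regardless of filename, path, or timestamp. A minimal sketch of the idea (not the platform's actual implementation):

```python
import hashlib


def content_id(payload: bytes) -> str:
    """Content-addressed identifier: the SHA-256 hex digest of the
    payload.  Same bytes -> same ID, no matter what the file is
    named or when it was written."""
    return hashlib.sha256(payload).hexdigest()
```

Among other things, this makes re-downloads of unchanged source files trivially detectable: the ID is the same, so the pipeline can skip them.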
