Refactor to local-first architecture on Hetzner NVMe

Remove distributed R2/Iceberg/SSH pipeline architecture in favor of
local subprocess execution with NVMe storage. Landing data backed up
to R2 via rclone timer.

- Strip Iceberg catalog, httpfs, boto3, paramiko, prefect, pyarrow
- Pipelines run via subprocess.run() with bounded timeouts
- Extract writes to {LANDING_DIR}/psd/{year}/{month}/{etag}.csv.gzip
- SQLMesh reads LANDING_DIR variable, writes to DUCKDB_PATH
- Delete unused provider stubs (ovh, scaleway, oracle)
- Add rclone systemd timer for R2 backup every 6h
- Update supervisor to run pipelines with env vars

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: Deeman
Date: 2026-02-18 18:05:41 +01:00
parent 910424c956
commit c1d00dcdc4
25 changed files with 231 additions and 1807 deletions


@@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Project Overview
-Materia is a commodity data analytics platform (product: **BeanFlows.coffee**) for coffee traders. It's a uv workspace monorepo with three packages: extraction (USDA PSD data), SQL transformation (SQLMesh + DuckDB), and a CLI for orchestrating cloud workers and pipelines.
+Materia is a commodity data analytics platform (product: **BeanFlows.coffee**) for coffee traders. It's a uv workspace monorepo with three packages: extraction (USDA PSD data), SQL transformation (SQLMesh + DuckDB), and a CLI for worker management and local pipeline execution.
 ## Commands
@@ -25,7 +25,7 @@ cd transform/sqlmesh_materia && uv run sqlmesh test  # SQLMesh model tests
 uv run pytest tests/test_cli.py::test_name -v
 # Extract data
-uv run extract_psd
+LANDING_DIR=data/landing uv run extract_psd
 # SQLMesh (from repo root)
 uv run sqlmesh -p transform/sqlmesh_materia plan  # Plans to dev_<username> by default
@@ -33,43 +33,45 @@ uv run sqlmesh -p transform/sqlmesh_materia plan prod  # Production
 uv run sqlmesh -p transform/sqlmesh_materia test    # Run model tests
 uv run sqlmesh -p transform/sqlmesh_materia format  # Format SQL
+# With production secrets
+esc run beanflows/prod -- <command>
 # CLI
+uv run materia pipeline run extract|transform
+uv run materia pipeline list
 uv run materia worker create|destroy|list
-uv run materia pipeline run
 uv run materia secrets get
 ```
 ## Architecture
 **Workspace packages** (`pyproject.toml` → `tool.uv.workspace`):
-- `extract/psdonline/` — Downloads USDA PSD Online data, normalizes ZIP→gzip CSV, uploads to R2
+- `extract/psdonline/` — Downloads USDA PSD Online data, normalizes ZIP→gzip CSV, writes to local landing directory
-- `transform/sqlmesh_materia/` — 4-layer SQL transformation pipeline (DuckDB + Iceberg)
+- `transform/sqlmesh_materia/` — 4-layer SQL transformation pipeline (local DuckDB)
-- `src/materia/` — CLI (Typer) for worker management, pipeline orchestration, secrets
+- `src/materia/` — CLI (Typer) for pipeline execution, worker management, secrets
 - `web/` — Future web frontend
 **Data flow:**
 ```
-USDA API → extract (psdonline) → R2/local CSV → SQLMesh transforms → DuckDB/Iceberg
+USDA API → extract → /data/materia/landing/psd/{year}/{month}/{etag}.csv.gzip
+         → rclone cron syncs landing/ to R2
+         → SQLMesh raw → staging → cleaned → serving → /data/materia/lakehouse.duckdb
+         → Web app reads lakehouse.duckdb (read-only)
 ```
 **SQLMesh 4-layer model structure** (`transform/sqlmesh_materia/models/`):
-1. `raw/` — Immutable source reads (read_csv from extracted files)
+1. `raw/` — Immutable source reads (read_csv from landing directory)
 2. `staging/` — Type casting, lookup joins, basic cleansing
 3. `cleaned/` — Business logic, pivoting, integration
 4. `serving/` — Analytics-ready facts, dimensions, aggregates
 **CLI modules** (`src/materia/`):
 - `cli.py` — Typer app with subcommands: worker, pipeline, secrets, version
-- `workers.py` — Ephemeral cloud instance management (Hetzner, with planned OVH/Scaleway/Oracle)
+- `workers.py` — Hetzner cloud instance management (for ad-hoc compute)
-- `pipelines.py` — SSH-based pipeline execution on workers (download artifact, run, destroy)
+- `pipelines.py` — Local subprocess pipeline execution with bounded timeouts
 - `secrets.py` — Pulumi ESC integration for environment secrets
 **Infrastructure** (`infra/`):
 - Pulumi IaC for Cloudflare R2 buckets and Hetzner compute
-- Supervisor systemd service for always-on orchestration (pulls git every 15 min)
+- Supervisor systemd service for always-on orchestration (pulls git, runs pipelines)
+- rclone systemd timer for landing data backup to R2
 ## Coding Philosophy
@@ -87,7 +89,14 @@ Read `coding_philosophy.md` for the full guide. Key points:
 - **Python 3.13** (`.python-version`)
 - **Ruff**: double quotes, spaces, E501 ignored (formatter handles line length)
 - **SQLMesh**: DuckDB dialect, `@daily` cron, start date `2025-07-07`, default env `dev_{{ user() }}`
-- **Storage**: Cloudflare R2 with Iceberg catalog (zero egress cost)
+- **Storage**: Local NVMe (`LANDING_DIR`, `DUCKDB_PATH`), R2 for backup via rclone
 - **Secrets**: Pulumi ESC (`esc run beanflows/prod -- <cmd>`)
 - **CI**: GitLab CI (`.gitlab/.gitlab-ci.yml`) — runs pytest and sqlmesh test on push/MR
 - **Pre-commit hooks**: installed via `pre-commit install`
+## Environment Variables
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `LANDING_DIR` | `data/landing` | Root directory for extracted landing data |
+| `DUCKDB_PATH` | `local.duckdb` | Path to the DuckDB lakehouse database |
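A minimal sketch of how a consumer resolves these variables — environment value if set, the table's default otherwise (the `landing_dir`/`duckdb_path` names here are illustrative, not part of the codebase):

```python
import os
import pathlib

# Environment variable if set, repo-relative default otherwise
# (defaults mirror the table above).
landing_dir = pathlib.Path(os.getenv("LANDING_DIR", "data/landing"))
duckdb_path = pathlib.Path(os.getenv("DUCKDB_PATH", "local.duckdb"))

print(landing_dir / "psd")  # e.g. data/landing/psd when LANDING_DIR is unset
```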


@@ -2,16 +2,13 @@
 name = "psdonline"
 version = "0.1.0"
 description = "Add your description here"
-readme = "README.md"
 authors = [
     { name = "Deeman", email = "hendriknote@gmail.com" }
 ]
 requires-python = ">=3.13"
 dependencies = [
-    "boto3>=1.40.55",
     "niquests>=3.14.1",
-    "pendulum>=3.1.0",
 ]
 [project.scripts]
 extract_psd = "psdonline.execute:extract_psd_dataset"


@@ -5,63 +5,32 @@ import pathlib
 import sys
 from datetime import datetime
-import boto3
 import niquests
-from botocore.exceptions import ClientError
 logging.basicConfig(
     level=logging.INFO,
-    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
-    datefmt='%Y-%m-%d %H:%M:%S',
-    handlers=[
-        logging.StreamHandler(sys.stdout)
-    ]
+    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+    datefmt="%Y-%m-%d %H:%M:%S",
+    handlers=[logging.StreamHandler(sys.stdout)],
 )
 logger = logging.getLogger("PSDOnline Extractor")
-OUTPUT_DIR = pathlib.Path(__file__).parent / "data"
-OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
-logger.info(f"Output dir: {OUTPUT_DIR}")
-# R2 configuration from environment
-R2_ENDPOINT = os.getenv('R2_ENDPOINT')
-R2_BUCKET = os.getenv('R2_BUCKET')
-R2_ACCESS_KEY = os.getenv('R2_ACCESS_KEY') or os.getenv('R2_ADMIN_ACCESS_KEY_ID')
-R2_SECRET_KEY = os.getenv('R2_SECRET_KEY') or os.getenv('R2_ADMIN_SECRET_ACCESS_KEY')
+LANDING_DIR = pathlib.Path(os.getenv("LANDING_DIR", "data/landing"))
+LANDING_DIR.mkdir(parents=True, exist_ok=True)
+logger.info(f"Landing dir: {LANDING_DIR}")
 PSD_HISTORICAL_URL = "https://apps.fas.usda.gov/psdonline/downloads/archives/{year}/{month:02d}/psd_alldata_csv.zip"
 FIRST_YEAR = 2006
 FIRST_MONTH = 8
+HTTP_TIMEOUT_SECONDS = 60
-def check_r2_file_exists(etag: str, s3_client) -> bool:
-    """Check if file exists in R2."""
-    r2_key = f"landing/psd/{etag}.csv.gzip"
-    try:
-        s3_client.head_object(Bucket=R2_BUCKET, Key=r2_key)
-        logger.info(f"File {r2_key} already exists in R2, skipping")
-        return True
-    except ClientError as e:
-        if e.response['Error']['Code'] == '404':
-            return False
-        raise
-def upload_to_r2(content: bytes, etag: str, s3_client):
-    """Upload file content to R2."""
-    r2_key = f"landing/psd/{etag}.csv.gzip"
-    logger.info(f"Uploading to R2: {r2_key}")
-    s3_client.put_object(Bucket=R2_BUCKET, Key=r2_key, Body=content)
-    logger.info("Upload complete")
-def extract_psd_file(url: str, extract_to_path: pathlib.Path, http_session: niquests.Session, s3_client=None):
-    """
-    Extract PSD file either to local storage or R2.
-    If s3_client is provided, uploads to R2 only (no local storage).
-    If s3_client is None, downloads to local storage.
-    """
+def extract_psd_file(url: str, year: int, month: int, http_session: niquests.Session):
+    """Extract PSD file to local year/month subdirectory."""
     logger.info(f"Requesting file {url} ...")
-    response = http_session.head(url)
+    response = http_session.head(url, timeout=HTTP_TIMEOUT_SECONDS)
     if response.status_code == 404:
         logger.error("File doesn't exist on server, received status code 404 Not Found")
         return
@@ -69,55 +38,31 @@ def extract_psd_file(url: str, extract_to_path: pathlib.Path, http_session: niqu
         logger.error(f"Status code not ok, STATUS={response.status_code}")
         return
-    etag = response.headers.get("etag").replace('"',"").replace(":","_")
+    etag = response.headers.get("etag", "").replace('"', "").replace(":", "_")
+    assert etag, "USDA response missing etag header"
-    # R2 mode: check R2 and upload if needed
-    if s3_client:
-        if check_r2_file_exists(etag, s3_client):
-            return
-        response = http_session.get(url)
-        normalized_content = normalize_zipped_csv(response.content)
-        upload_to_r2(normalized_content, etag, s3_client)
-        return
-    # Local mode: check local and download if needed
+    extract_to_path = LANDING_DIR / "psd" / str(year) / f"{month:02d}"
     local_file = extract_to_path / f"{etag}.csv.gzip"
     if local_file.exists():
-        logger.info(f"File {etag}.zip already exists locally, skipping")
+        logger.info(f"File {etag}.csv.gzip already exists locally, skipping")
         return
-    response = http_session.get(url)
+    response = http_session.get(url, timeout=HTTP_TIMEOUT_SECONDS)
     logger.info(f"Storing file to {local_file}")
     extract_to_path.mkdir(parents=True, exist_ok=True)
     normalized_content = normalize_zipped_csv(response.content)
     local_file.write_bytes(normalized_content)
+    assert local_file.exists(), f"File was not written: {local_file}"
     logger.info("Download complete")
 def extract_psd_dataset():
     today = datetime.now()
-    # Check if R2 credentials are configured
-    use_r2 = all([R2_ENDPOINT, R2_BUCKET, R2_ACCESS_KEY, R2_SECRET_KEY])
-    if use_r2:
-        logger.info("R2 credentials found, uploading to R2")
-        s3_client = boto3.client(
-            's3',
-            endpoint_url=R2_ENDPOINT,
-            aws_access_key_id=R2_ACCESS_KEY,
-            aws_secret_access_key=R2_SECRET_KEY
-        )
-    else:
-        logger.info("R2 credentials not found, downloading to local storage")
-        s3_client = None
+    # Try current month and previous 3 months (USDA data is published with lag)
     with niquests.Session() as session:
         for months_back in range(4):
             year = today.year
             month = today.month - months_back
+            # Handle year rollover
             while month < 1:
                 month += 12
                 year -= 1
@@ -125,11 +70,10 @@ def extract_psd_dataset():
             url = PSD_HISTORICAL_URL.format(year=year, month=month)
             logger.info(f"Trying {year}-{month:02d}...")
-            # Check if URL exists
-            response = session.head(url)
+            response = session.head(url, timeout=HTTP_TIMEOUT_SECONDS)
             if response.status_code == 200:
                 logger.info(f"Found latest data at {year}-{month:02d}")
-                extract_psd_file(url=url, http_session=session, extract_to_path=OUTPUT_DIR, s3_client=s3_client)
+                extract_psd_file(url=url, year=year, month=month, http_session=session)
                 return
             elif response.status_code == 404:
                 logger.info(f"Month {year}-{month:02d} not found, trying earlier...")
@@ -141,5 +85,3 @@ def extract_psd_dataset():
 if __name__ == "__main__":
     extract_psd_dataset()

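The landing layout and etag sanitation introduced above can be sketched standalone (the `landing_path` helper is hypothetical, not part of the module — it just mirrors the path construction in the diff):

```python
import pathlib

def landing_path(landing_dir: pathlib.Path, year: int, month: int, etag: str) -> pathlib.Path:
    # Same sanitation the extractor applies to the USDA etag header
    safe_etag = etag.replace('"', "").replace(":", "_")
    # {LANDING_DIR}/psd/{year}/{month}/{etag}.csv.gzip
    return landing_dir / "psd" / str(year) / f"{month:02d}" / f"{safe_etag}.csv.gzip"

p = landing_path(pathlib.Path("/data/materia/landing"), 2026, 2, '"abc:123"')
print(p)  # /data/materia/landing/psd/2026/02/abc_123.csv.gzip
```

Because the filename is the server's content etag, re-running the extractor is idempotent: an existing file for the same etag is skipped.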

@@ -0,0 +1,9 @@
+[Unit]
+Description=Materia Landing Data Backup to R2
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=oneshot
+ExecStart=/usr/bin/rclone sync /data/materia/landing/ r2:materia-raw/landing/ --log-level INFO
+TimeoutStartSec=1800


@@ -0,0 +1,10 @@
+[Unit]
+Description=Materia Landing Data Backup Timer
+
+[Timer]
+OnCalendar=*-*-* 00/6:00:00
+RandomizedDelaySec=300
+Persistent=true
+
+[Install]
+WantedBy=timers.target


@@ -0,0 +1,14 @@
+# Cloudflare R2 remote for landing data backup
+# Copy to /root/.config/rclone/rclone.conf and fill in credentials
+#
+# Get credentials from: Cloudflare Dashboard → R2 → Manage R2 API Tokens
+# Or from Pulumi ESC: esc env open beanflows/prod --format shell
+
+[r2]
+type = s3
+provider = Cloudflare
+access_key_id = <R2_ACCESS_KEY_ID>
+secret_access_key = <R2_SECRET_ACCESS_KEY>
+endpoint = https://<CLOUDFLARE_ACCOUNT_ID>.r2.cloudflarestorage.com
+acl = private
+no_check_bucket = true


@@ -79,6 +79,9 @@ else
     cd "$REPO_DIR"
 fi
+echo "--- Creating data directories ---"
+mkdir -p /data/materia/landing/psd
 echo "--- Installing Python dependencies ---"
 uv sync
@@ -88,6 +91,8 @@ cat > "$REPO_DIR/.env" <<EOF
 # Loaded from Pulumi ESC: beanflows/prod
 PULUMI_ACCESS_TOKEN=${PULUMI_ACCESS_TOKEN}
 PATH=/root/.cargo/bin:/root/.pulumi/bin:/usr/local/bin:/usr/bin:/bin
+LANDING_DIR=/data/materia/landing
+DUCKDB_PATH=/data/materia/lakehouse.duckdb
 EOF
 echo "--- Setting up systemd service ---"


@@ -1,80 +0,0 @@
services:
postgres:
image: postgres:14
environment:
POSTGRES_USER: prefect
POSTGRES_PASSWORD: prefect
POSTGRES_DB: prefect
volumes:
- postgres_data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U prefect"]
interval: 5s
timeout: 5s
retries: 5
dragonfly:
image: 'docker.dragonflydb.io/dragonflydb/dragonfly'
ulimits:
memlock: -1
volumes:
- dragonflydata:/data
healthcheck:
test: ["CMD-SHELL", "redis-cli ping"]
interval: 5s
timeout: 5s
retries: 5
prefect-server:
image: prefecthq/prefect:3-latest
depends_on:
postgres:
condition: service_healthy
dragonfly:
condition: service_healthy
environment:
PREFECT_API_DATABASE_CONNECTION_URL: postgresql+asyncpg://prefect:prefect@postgres:5432/prefect
PREFECT_SERVER_API_HOST: 0.0.0.0
PREFECT_UI_API_URL: http://localhost:4200/api
PREFECT_MESSAGING_BROKER: prefect_redis.messaging
PREFECT_MESSAGING_CACHE: prefect_redis.messaging
PREFECT_REDIS_MESSAGING_HOST: dragonfly
PREFECT_REDIS_MESSAGING_PORT: 6379
PREFECT_REDIS_MESSAGING_DB: 0
command: prefect server start --no-services
ports:
- "4200:4200"
healthcheck:
test: ["CMD", "python", "-c", "import urllib.request as u; u.urlopen('http://localhost:4200/api/health', timeout=1)"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
prefect-services:
image: prefecthq/prefect:3-latest
depends_on:
prefect-server:
condition: service_healthy
environment:
PREFECT_API_DATABASE_CONNECTION_URL: postgresql+asyncpg://prefect:prefect@postgres:5432/prefect
PREFECT_MESSAGING_BROKER: prefect_redis.messaging
PREFECT_MESSAGING_CACHE: prefect_redis.messaging
PREFECT_REDIS_MESSAGING_HOST: dragonfly
PREFECT_REDIS_MESSAGING_PORT: 6379
PREFECT_REDIS_MESSAGING_DB: 0
command: prefect server services start
prefect-worker:
image: prefecthq/prefect:3-latest
depends_on:
prefect-server:
condition: service_healthy
environment:
PREFECT_API_URL: http://prefect-server:4200/api
command: prefect worker start --pool local-pool
restart: on-failure
volumes:
postgres_data:
dragonflydata:


@@ -1,161 +1,85 @@
 # Materia Infrastructure
-Pulumi-managed infrastructure for BeanFlows.coffee
+Single-server local-first setup for BeanFlows.coffee on Hetzner NVMe.
-## Stack Overview
-- **Storage:** Cloudflare R2 buckets with Iceberg Data Catalog
-- **Compute:** Hetzner Cloud CCX dedicated vCPU instances
-- **Orchestration:** Custom Python scheduler (see `src/orchestrator/`)
+## Architecture
+```
+Hetzner Server (NVMe)
+├── /opt/materia/                  # Git repo, code, uv environment
+├── /data/materia/landing/         # Extracted USDA data (year/month subdirs)
+├── /data/materia/lakehouse.duckdb # SQLMesh output database
+└── systemd services:
+    ├── materia-supervisor         # Pulls git, runs extract + transform daily
+    └── materia-backup.timer       # Syncs landing/ to R2 every 6 hours
+```
-## Prerequisites
-1. **Cloudflare Account**
-   - Sign up at https://dash.cloudflare.com
-   - Create API token with R2 + Data Catalog permissions
-   - Get your Account ID from dashboard
-2. **Hetzner Cloud Account**
-   - Sign up at https://console.hetzner.cloud
-   - Create API token with Read & Write permissions
-3. **Pulumi Account** (optional, can use local state)
-   - Sign up at https://app.pulumi.com
-   - Or use local state with `pulumi login --local`
-4. **SSH Key**
-   - Generate if needed: `ssh-keygen -t ed25519 -C "materia-deploy"`
+## Data Flow
+1. **Extract**: USDA API → `/data/materia/landing/psd/{year}/{month}/{etag}.csv.gzip`
+2. **Transform**: SQLMesh reads landing CSVs → writes to `/data/materia/lakehouse.duckdb`
+3. **Backup**: rclone syncs `/data/materia/landing/` → R2 `materia-raw/landing/`
+4. **Web**: Reads `lakehouse.duckdb` (read-only)
+## Setup
+### Prerequisites
+- Hetzner server with NVMe storage
+- Pulumi ESC configured (`beanflows/prod` environment)
+- `GITLAB_READ_TOKEN` and `PULUMI_ACCESS_TOKEN` set
-## Initial Setup
+### Bootstrap
+```bash
+# From local machine or CI:
+ssh root@<server_ip> 'bash -s' < infra/bootstrap_supervisor.sh
+```
+This installs dependencies, clones the repo, creates data directories, and starts the supervisor service.
+### R2 Backup
+1. Install rclone: `apt install rclone`
+2. Copy and configure: `cp infra/backup/rclone.conf.example /root/.config/rclone/rclone.conf`
+3. Fill in R2 credentials from Pulumi ESC
+4. Install systemd units:
+```bash
+cp infra/backup/materia-backup.service /etc/systemd/system/
+cp infra/backup/materia-backup.timer /etc/systemd/system/
+systemctl daemon-reload
+systemctl enable --now materia-backup.timer
+```
+## Pulumi IaC
+Still manages Cloudflare R2 buckets and can provision Hetzner instances:
 ```bash
 cd infra
-# Login to Pulumi (local or cloud)
-pulumi login  # or: pulumi login --local
-# Initialize the stack
-pulumi stack init dev
-# Configure secrets
-pulumi config set --secret cloudflare:apiToken <your-cloudflare-token>
-pulumi config set cloudflare_account_id <your-account-id>
-pulumi config set --secret hcloud:token <your-hetzner-token>
-pulumi config set --secret ssh_public_key "$(cat ~/.ssh/id_ed25519.pub)"
-# Preview changes
-pulumi preview
-# Deploy infrastructure
+pulumi login
+pulumi stack select prod
 pulumi up
 ```
-## What Gets Provisioned
-### Cloudflare R2 Buckets
-1. **materia-raw** - Raw data from extraction (immutable archives)
-2. **materia-lakehouse** - Iceberg tables for SQLMesh (ACID transactions)
-### Hetzner Cloud Servers
-1. **materia-scheduler** (CCX12: 2 vCPU, 8GB RAM)
-   - Runs cron scheduler
-   - Lightweight orchestration tasks
-   - Always-on, low cost (~€6/mo)
-2. **materia-worker-01** (CCX22: 4 vCPU, 16GB RAM)
-   - Heavy SQLMesh transformations
-   - Can be stopped when not in use
-   - Scale up to CCX32/CCX42 for larger workloads (~€24-90/mo)
-3. **materia-firewall**
-   - SSH access (port 22)
-   - All outbound traffic allowed
-   - No inbound HTTP/HTTPS (we're not running web services yet)
-## Enabling R2 Data Catalog (Iceberg)
-As of October 2025, R2 Data Catalog is in public beta. Enable it manually:
-1. Go to Cloudflare Dashboard → R2
-2. Select the `materia-lakehouse` bucket
-3. Navigate to Settings → Data Catalog
-4. Click "Enable Data Catalog"
-Once enabled, you can connect DuckDB to the Iceberg REST catalog:
-```python
-import duckdb
-# Get catalog URI from Pulumi outputs
-# pulumi stack output duckdb_r2_config
-conn = duckdb.connect()
-conn.execute("INSTALL iceberg; LOAD iceberg;")
-conn.execute(f"""
-    ATTACH 'iceberg_rest://catalog.cloudflarestorage.com/<account_id>/r2-data-catalog'
-    AS lakehouse (
-        TYPE ICEBERG_REST,
-        SECRET '<r2_api_token>'
-    );
-""")
-```
-## Server Access
-Get server IPs from Pulumi outputs:
-```bash
-pulumi stack output scheduler_ip
-pulumi stack output worker_ip
-```
-SSH into servers:
-```bash
-ssh root@<scheduler_ip>
-ssh root@<worker_ip>
-```
+## Monitoring
+```bash
+# Supervisor status and logs
+systemctl status materia-supervisor
+journalctl -u materia-supervisor -f
+# Backup timer status
+systemctl list-timers materia-backup.timer
+journalctl -u materia-backup -f
+```
-## Cost Estimates (Monthly)
+## Cost
 | Resource | Type | Cost |
 |----------|------|------|
-| R2 Storage | 10 GB | $0.15 |
-| R2 Operations | 1M reads | $0.36 |
-| R2 Egress | Unlimited | $0.00 (zero egress!) |
-| Scheduler | CCX12 | €6.00 |
-| Worker (on-demand) | CCX22 | €24.00 |
-| **Total** | | **~€30/mo (~$33)** |
+| Hetzner Server | CCX22 (4 vCPU, 16GB) | ~€24/mo |
+| R2 Storage | Backup (~10 GB) | $0.15/mo |
+| R2 Egress | Zero | $0.00 |
+| **Total** | | **~€24/mo (~$26)** |
-Compare to AWS equivalent: ~$300-500/mo with S3 + EC2 + egress fees.
-## Scaling Workers
-To add more worker capacity or different instance sizes:
-1. Edit `infra/__main__.py` to add new server resources
-2. Update worker config in `src/orchestrator/workers.yaml`
-3. Run `pulumi up` to provision
-Example worker sizes:
-- CCX12: 2 vCPU, 8GB RAM (light workloads)
-- CCX22: 4 vCPU, 16GB RAM (medium workloads)
-- CCX32: 8 vCPU, 32GB RAM (heavy workloads)
-- CCX42: 16 vCPU, 64GB RAM (very heavy workloads)
-## Destroying Infrastructure
-```bash
-cd infra
-pulumi destroy
-```
-**Warning:** This will delete all buckets and servers. Backup data first!
-## Next Steps
-1. Deploy orchestrator to scheduler server (see `src/orchestrator/README.md`)
-2. Configure SQLMesh to use R2 lakehouse (see `transform/sqlmesh_materia/config.yaml`)
-3. Set up CI/CD pipeline to deploy on push (see `.gitlab-ci.yml`)


@@ -11,6 +11,8 @@ ExecStart=/opt/materia/infra/supervisor/supervisor.sh
 Restart=always
 RestartSec=10
 EnvironmentFile=/opt/materia/.env
+Environment=LANDING_DIR=/data/materia/landing
+Environment=DUCKDB_PATH=/data/materia/lakehouse.duckdb
 # Resource limits
 LimitNOFILE=65536


@@ -24,9 +24,14 @@ do
     git switch --discard-changes --detach origin/master
     uv sync
-    # Run pipelines (SQLMesh handles scheduling)
-    #uv run materia pipeline run extract
-    #uv run materia pipeline run transform
+    # Run pipelines
+    LANDING_DIR="${LANDING_DIR:-/data/materia/landing}" \
+    DUCKDB_PATH="${DUCKDB_PATH:-/data/materia/lakehouse.duckdb}" \
+    uv run materia pipeline run extract
+    LANDING_DIR="${LANDING_DIR:-/data/materia/landing}" \
+    DUCKDB_PATH="${DUCKDB_PATH:-/data/materia/lakehouse.duckdb}" \
+    uv run materia pipeline run transform
 ) || sleep 600  # Sleep 10 min on failure to avoid busy-loop retries
 done


@@ -9,14 +9,11 @@ authors = [
 ]
 requires-python = ">=3.13"
 dependencies = [
-    "pyarrow>=20.0.0",
     "python-dotenv>=1.1.0",
     "typer>=0.15.0",
-    "paramiko>=3.5.0",
     "pyyaml>=6.0.2",
     "niquests>=3.15.2",
     "hcloud>=2.8.0",
-    "prefect>=3.6.15",
 ]
 [project.scripts]
@@ -130,4 +127,5 @@ force-single-line = false
 # Allow print statements and other rules in scripts
 "scripts/*" = ["T201"]
+[tool.pytest.ini_options]
+testpaths = ["tests"]


@@ -80,15 +80,12 @@ app.add_typer(pipeline_app, name="pipeline")
 @pipeline_app.command("run")
 def pipeline_run(
     name: Annotated[str, typer.Argument(help="Pipeline name (extract, transform)")],
-    worker_type: Annotated[str | None, typer.Option("--worker", "-w")] = None,
-    provider: Annotated[str, typer.Option("--provider", "-p")] = "hetzner",
-    keep: Annotated[bool, typer.Option("--keep", help="Keep worker after completion")] = False,
 ):
-    """Run a pipeline on an ephemeral worker."""
+    """Run a pipeline locally."""
     from materia.pipelines import run_pipeline
     typer.echo(f"Running pipeline '{name}'...")
-    result = run_pipeline(name, worker_type, auto_destroy=not keep, provider=provider)
+    result = run_pipeline(name)
     if result.success:
         typer.echo(result.output)
@@ -105,7 +102,8 @@ def pipeline_list():
     typer.echo("Available pipelines:")
     for name, config in PIPELINES.items():
-        typer.echo(f"{name:<15} (worker: {config.worker_type}, artifact: {config.artifact})")
+        cmd = " ".join(config["command"])
+        typer.echo(f"{name:<15} (command: {cmd}, timeout: {config['timeout_seconds']}s)")
 secrets_app = typer.Typer(help="Manage secrets via Pulumi ESC")


@@ -1,21 +1,8 @@
-"""Pipeline execution on ephemeral workers."""
+"""Pipeline execution via local subprocess."""
-import contextlib
+import subprocess
 from dataclasses import dataclass
-import paramiko
-from materia.secrets import get_secret
-from materia.workers import create_worker, destroy_worker
-@dataclass
-class PipelineConfig:
-    worker_type: str
-    artifact: str
-    command: str
-    secrets: list[str]
 @dataclass
 class PipelineResult:
@@ -25,56 +12,20 @@ class PipelineResult:
 PIPELINES = {
-    "extract": PipelineConfig(
-        worker_type="ccx12",
-        artifact="materia-extract-latest.tar.gz",
-        command="./extract_psd",
-        secrets=["R2_ACCESS_KEY_ID", "R2_SECRET_ACCESS_KEY", "R2_ENDPOINT", "R2_ARTIFACTS_BUCKET"],
-    ),
-    "transform": PipelineConfig(
-        worker_type="ccx22",
-        artifact="materia-transform-latest.tar.gz",
-        command="cd sqlmesh_materia && ./sqlmesh plan prod",
-        secrets=[
-            "CLOUDFLARE_API_TOKEN",
-            "ICEBERG_REST_URI",
-            "R2_WAREHOUSE_NAME",
-        ],
-    ),
+    "extract": {
+        "command": ["uv", "run", "--package", "psdonline", "extract_psd"],
+        "timeout_seconds": 1800,
+    },
+    "transform": {
+        "command": ["uv", "run", "--package", "sqlmesh_materia", "sqlmesh", "-p", "transform/sqlmesh_materia", "plan", "prod", "--no-prompts", "--auto-apply"],
+        "timeout_seconds": 3600,
+    },
 }
-def _execute_ssh_command(ip: str, command: str, env_vars: dict[str, str]) -> tuple[str, str, int]:
-    ssh_key_path = get_secret("SSH_PRIVATE_KEY_PATH")
-    if not ssh_key_path:
-        raise ValueError("SSH_PRIVATE_KEY_PATH not found in secrets")
-    client = paramiko.SSHClient()
-    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
-    pkey = paramiko.RSAKey.from_private_key_file(ssh_key_path)
-    client.connect(ip, username="root", pkey=pkey)
-    env_string = " ".join([f"export {k}='{v}' &&" for k, v in env_vars.items()])
-    full_command = f"{env_string} {command}" if env_vars else command
-    stdin, stdout, stderr = client.exec_command(full_command)
-    exit_code = stdout.channel.recv_exit_status()
-    output = stdout.read().decode()
-    error = stderr.read().decode()
-    client.close()
-    return output, error, exit_code
-def run_pipeline(
-    pipeline_name: str,
-    worker_type: str | None = None,
-    auto_destroy: bool = True,
-    provider: str = "hetzner",
-) -> PipelineResult:
+def run_pipeline(pipeline_name: str) -> PipelineResult:
+    assert pipeline_name, "pipeline_name must not be empty"
     if pipeline_name not in PIPELINES:
         return PipelineResult(
             success=False,
@@ -82,58 +33,24 @@ def run_pipeline(
             error=f"Unknown pipeline: {pipeline_name}. Available: {', '.join(PIPELINES.keys())}",
         )
-    pipeline_config = PIPELINES[pipeline_name]
-    worker_type = worker_type or pipeline_config.worker_type
-    worker_name = f"materia-{pipeline_name}-worker"
-    r2_bucket = get_secret("R2_ARTIFACTS_BUCKET") or "materia-artifacts"
-    r2_endpoint = get_secret("R2_ENDPOINT")
-    if not r2_endpoint:
-        return PipelineResult(
-            success=False,
-            output="",
-            error="R2_ENDPOINT not configured in secrets",
-        )
-    try:
-        worker = create_worker(worker_name, worker_type, provider)
-        artifact_url = f"https://{r2_endpoint}/{r2_bucket}/{pipeline_config.artifact}"
-        bootstrap_commands = [
-            f"curl -fsSL -o artifact.tar.gz {artifact_url}",
-            "tar -xzf artifact.tar.gz",
-            "chmod +x -R .",
-        ]
-        for cmd in bootstrap_commands:
-            _, error, exit_code = _execute_ssh_command(worker.ip, cmd, {})
-            if exit_code != 0:
-                return PipelineResult(
-                    success=False,
-                    output="",
-                    error=f"Bootstrap failed: {error}",
-                )
-        env_vars = {}
-        for secret_key in pipeline_config.secrets:
-            value = get_secret(secret_key)
-            if value:
-                env_vars[secret_key] = value
-        command = pipeline_config.command
-        output, error, exit_code = _execute_ssh_command(worker.ip, command, env_vars)
-        success = exit_code == 0
-        return PipelineResult(
-            success=success,
-            output=output,
-            error=error if not success else None,
-        )
-    finally:
-        if auto_destroy:
-            with contextlib.suppress(Exception):
-                destroy_worker(worker_name, provider)
+    pipeline = PIPELINES[pipeline_name]
+    timeout_seconds = pipeline["timeout_seconds"]
+    try:
+        result = subprocess.run(
+            pipeline["command"],
+            capture_output=True,
+            text=True,
+            timeout=timeout_seconds,
+        )
+        return PipelineResult(
+            success=result.returncode == 0,
+            output=result.stdout,
+            error=result.stderr if result.returncode != 0 else None,
+        )
+    except subprocess.TimeoutExpired:
+        return PipelineResult(
+            success=False,
+            output="",
+            error=f"Pipeline '{pipeline_name}' timed out after {timeout_seconds} seconds",
+        )


@@ -1,7 +1,6 @@
-"""Cloud provider abstraction for worker management."""
+"""Cloud provider for worker management."""
 from dataclasses import dataclass
-from typing import Protocol
 @dataclass
@@ -14,35 +13,10 @@ class Instance:
     type: str
-class ProviderModule(Protocol):
-    def create_instance(
-        self: str,
-        instance_type: str,
-        ssh_key: str,
-        location: str | None = None,
-    ) -> Instance: ...
-    def destroy_instance(self: str) -> None: ...
-    def list_instances(self: str | None = None) -> list[Instance]: ...
-    def get_instance(self: str) -> Instance | None: ...
-    def wait_for_ssh(self: str, timeout: int = 300) -> bool: ...
-def get_provider(provider_name: str) -> ProviderModule:
+def get_provider(provider_name: str):
     if provider_name == "hetzner":
         from materia.providers import hetzner
         return hetzner
-    elif provider_name == "ovh":
-        from materia.providers import ovh
-        return ovh
-    elif provider_name == "scaleway":
-        from materia.providers import scaleway
-        return scaleway
-    elif provider_name == "oracle":
-        from materia.providers import oracle
-        return oracle
     else:
         raise ValueError(f"Unknown provider: {provider_name}")


@@ -1,28 +0,0 @@
-"""Oracle Cloud provider implementation."""
-from materia.providers import Instance
-def create_instance(
-    name: str,
-    instance_type: str,
-    ssh_key: str,
-    location: str | None = None,
-) -> Instance:
-    raise NotImplementedError("Oracle Cloud provider not yet implemented")
-def destroy_instance(instance_id: str) -> None:
-    raise NotImplementedError("Oracle Cloud provider not yet implemented")
-def list_instances(label: str | None = None) -> list[Instance]:
-    raise NotImplementedError("Oracle Cloud provider not yet implemented")
-def get_instance(name: str) -> Instance | None:
-    raise NotImplementedError("Oracle Cloud provider not yet implemented")
-def wait_for_ssh(ip: str, timeout: int = 300) -> bool:
-    raise NotImplementedError("Oracle Cloud provider not yet implemented")


@@ -1,28 +0,0 @@
"""OVH Cloud provider implementation."""
from materia.providers import Instance
def create_instance(
name: str,
instance_type: str,
ssh_key: str,
location: str | None = None,
) -> Instance:
raise NotImplementedError("OVH provider not yet implemented")
def destroy_instance(instance_id: str) -> None:
raise NotImplementedError("OVH provider not yet implemented")
def list_instances(label: str | None = None) -> list[Instance]:
raise NotImplementedError("OVH provider not yet implemented")
def get_instance(name: str) -> Instance | None:
raise NotImplementedError("OVH provider not yet implemented")
def wait_for_ssh(ip: str, timeout: int = 300) -> bool:
raise NotImplementedError("OVH provider not yet implemented")


@@ -1,28 +0,0 @@
"""Scaleway provider implementation."""
from materia.providers import Instance
def create_instance(
name: str,
instance_type: str,
ssh_key: str,
location: str | None = None,
) -> Instance:
raise NotImplementedError("Scaleway provider not yet implemented")
def destroy_instance(instance_id: str) -> None:
raise NotImplementedError("Scaleway provider not yet implemented")
def list_instances(label: str | None = None) -> list[Instance]:
raise NotImplementedError("Scaleway provider not yet implemented")
def get_instance(name: str) -> Instance | None:
raise NotImplementedError("Scaleway provider not yet implemented")
def wait_for_ssh(ip: str, timeout: int = 300) -> bool:
raise NotImplementedError("Scaleway provider not yet implemented")


@@ -13,16 +13,9 @@ def mock_esc_env(tmp_path):
     return {
         "HETZNER_API_TOKEN": "test-hetzner-token",
-        "R2_ACCESS_KEY_ID": "test-r2-key",
-        "R2_SECRET_ACCESS_KEY": "test-r2-secret",
-        "R2_ENDPOINT": "test.r2.cloudflarestorage.com",
-        "R2_ARTIFACTS_BUCKET": "test-artifacts",
         "SSH_PUBLIC_KEY": "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAITest",
         "SSH_PRIVATE_KEY": "-----BEGIN OPENSSH PRIVATE KEY-----\ntest\n-----END OPENSSH PRIVATE KEY-----",
         "SSH_PRIVATE_KEY_PATH": str(ssh_key_path),
-        "CLOUDFLARE_API_TOKEN": "test-cf-token",
-        "ICEBERG_REST_URI": "https://api.cloudflare.com/test",
-        "R2_WAREHOUSE_NAME": "test-warehouse",
     }
@@ -67,33 +60,3 @@ def mock_ssh_wait():
     """Mock SSH wait function to return immediately."""
     with patch("materia.providers.hetzner.wait_for_ssh", return_value=True):
         yield
-@pytest.fixture
-def mock_ssh_connection():
-    """Mock paramiko SSH connection."""
-    with patch("materia.pipelines.paramiko.SSHClient") as mock_ssh_class, \
-         patch("materia.pipelines.paramiko.RSAKey.from_private_key_file") as mock_key:
-        ssh_instance = Mock()
-        mock_ssh_class.return_value = ssh_instance
-        mock_key.return_value = Mock()
-        ssh_instance.connect = Mock()
-        ssh_instance.set_missing_host_key_policy = Mock()
-        mock_channel = Mock()
-        mock_channel.recv_exit_status.return_value = 0
-        mock_stdout = Mock()
-        mock_stdout.read.return_value = b"Success\n"
-        mock_stdout.channel = mock_channel
-        mock_stderr = Mock()
-        mock_stderr.read.return_value = b""
-        ssh_instance.exec_command = Mock(
-            return_value=(Mock(), mock_stdout, mock_stderr)
-        )
-        ssh_instance.close = Mock()
-        yield ssh_instance


@@ -1,5 +1,7 @@
 """End-to-end tests for the materia CLI."""
+from unittest.mock import patch
 from typer.testing import CliRunner
 from materia.cli import app
@@ -33,7 +35,6 @@ def test_secrets_list_command(mock_secrets):
     result = runner.invoke(app, ["secrets", "list"])
     assert result.exit_code == 0
     assert "HETZNER_API_TOKEN" in result.stdout
-    assert "R2_ACCESS_KEY_ID" in result.stdout
 def test_worker_list_empty(mock_secrets, mock_hcloud_client):
@@ -98,46 +99,55 @@ def test_worker_destroy(mock_secrets, mock_hcloud_client):
     assert "Worker destroyed" in result.stdout
-def test_pipeline_list(mock_secrets):
+def test_pipeline_list():
     """Test pipeline list command."""
     result = runner.invoke(app, ["pipeline", "list"])
     assert result.exit_code == 0
     assert "extract" in result.stdout
     assert "transform" in result.stdout
-    assert "ccx12" in result.stdout
-    assert "ccx22" in result.stdout
+    assert "1800" in result.stdout
+    assert "3600" in result.stdout
-def test_pipeline_run_extract(
-    mock_secrets, mock_hcloud_client, mock_ssh_wait, mock_ssh_connection
-):
+def test_pipeline_run_extract():
     """Test running extract pipeline end-to-end."""
-    result = runner.invoke(app, ["pipeline", "run", "extract"])
-    assert result.exit_code == 0
-    assert "Running pipeline" in result.stdout
-    assert "Pipeline completed successfully" in result.stdout
-    mock_hcloud_client.servers.create.assert_called_once()
-    mock_ssh_connection.connect.assert_called()
-    mock_ssh_connection.exec_command.assert_called()
+    with patch("materia.pipelines.subprocess.run") as mock_run:
+        mock_run.return_value.returncode = 0
+        mock_run.return_value.stdout = "Extracted successfully\n"
+        mock_run.return_value.stderr = ""
+        result = runner.invoke(app, ["pipeline", "run", "extract"])
+        assert result.exit_code == 0
+        assert "Running pipeline" in result.stdout
+        assert "Pipeline completed successfully" in result.stdout
+        mock_run.assert_called_once()
+        call_args = mock_run.call_args
+        assert call_args[0][0] == ["uv", "run", "--package", "psdonline", "extract_psd"]
+        assert call_args[1]["timeout"] == 1800
-def test_pipeline_run_transform(
-    mock_secrets, mock_hcloud_client, mock_ssh_wait, mock_ssh_connection
-):
+def test_pipeline_run_transform():
     """Test running transform pipeline end-to-end."""
-    result = runner.invoke(app, ["pipeline", "run", "transform"])
-    assert result.exit_code == 0
-    assert "Running pipeline" in result.stdout
-    assert "Pipeline completed successfully" in result.stdout
-    mock_hcloud_client.servers.create.assert_called_once()
-    mock_ssh_connection.connect.assert_called()
+    with patch("materia.pipelines.subprocess.run") as mock_run:
+        mock_run.return_value.returncode = 0
+        mock_run.return_value.stdout = "Transform complete\n"
+        mock_run.return_value.stderr = ""
+        result = runner.invoke(app, ["pipeline", "run", "transform"])
+        assert result.exit_code == 0
+        assert "Running pipeline" in result.stdout
+        assert "Pipeline completed successfully" in result.stdout
+        mock_run.assert_called_once()
+        call_args = mock_run.call_args
+        assert "sqlmesh" in call_args[0][0]
+        assert call_args[1]["timeout"] == 3600
-def test_pipeline_run_invalid(mock_secrets):
+def test_pipeline_run_invalid():
     """Test running an invalid pipeline."""
     result = runner.invoke(app, ["pipeline", "run", "invalid-pipeline"])
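The updated tests patch `subprocess.run` at the module where the pipeline code looks it up, then assert on `call_args`. The same technique in a self-contained form (the `launch` wrapper is hypothetical, standing in for the pipeline runner):

```python
import subprocess
from unittest.mock import patch

def launch(cmd: list[str], timeout: int) -> str:
    """Hypothetical wrapper in the style of the pipeline runner."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return proc.stdout

# Patch where the name is looked up (this module's subprocess attribute),
# not where it is defined -- mirroring patch("materia.pipelines.subprocess.run").
with patch(f"{__name__}.subprocess.run") as mock_run:
    mock_run.return_value.returncode = 0
    mock_run.return_value.stdout = "ok\n"
    out = launch(["uv", "run", "extract_psd"], timeout=1800)
    # call_args[0] holds positional args, call_args[1] holds keyword args.
    assert mock_run.call_args[0][0] == ["uv", "run", "extract_psd"]
    assert mock_run.call_args[1]["timeout"] == 1800
```

Because the mock intercepts the call, no real process is spawned, which keeps these CLI tests fast and hermetic.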


@@ -1,5 +1,5 @@
 # --- Gateway Connection ---
-# Single gateway connecting to R2 Iceberg catalog
+# Single local DuckDB gateway
 # Local dev uses virtual environments (e.g., dev_<username>)
 # Production uses the 'prod' environment
 gateways:
@@ -7,48 +7,13 @@ gateways:
     connection:
       type: duckdb
       catalogs:
-        local: 'local.duckdb'
+        local: '{{ env_var("DUCKDB_PATH", "local.duckdb") }}'
-        cloudflare:
-          type: iceberg
-          path: '{{ env_var("ICEBERG_WAREHOUSE_NAME") }}'
-          connector_config:
-            endpoint: '{{ env_var("ICEBERG_CATALOG_URI") }}'
-      extensions:
-        - name: httpfs
-        - name: iceberg
-      secrets:
-        r2_secret:
-          type: iceberg
-          token: "{{ env_var('R2_ADMIN_API_TOKEN') }}"
-        r2_data_secret:
-          type: r2
-          key_id: "{{ env_var('R2_ADMIN_ACCESS_KEY_ID') }}"
-          secret: "{{ env_var('R2_ADMIN_SECRET_ACCESS_KEY') }}"
-          account_id: "{{ var('CLOUDFLARE_ACCOUNT_ID') }}"
-          region: 'eeur'
 default_gateway: duckdb
 # --- Variables ---
-# Make environment variables available to models
 variables:
-  R2_BUCKET: beanflows-data-prod
-  CLOUDFLARE_ACCOUNT_ID: "{{ env_var('CLOUDFLARE_ACCOUNT_ID') }}"
+  LANDING_DIR: '{{ env_var("LANDING_DIR", "data/landing") }}'
-# --- Catalog Configuration ---
-# Attach R2 Iceberg catalog and configure default schema
-# https://sqlmesh.readthedocs.io/en/stable/reference/configuration/#execution-hooks
-# https://developers.cloudflare.com/r2/data-catalog/config-examples/duckdb/
-#before_all:
-#  - "ATTACH '{{ env_var('ICEBERG_WAREHOUSE_NAME') }}' AS catalog (TYPE ICEBERG, ENDPOINT '{{ env_var('ICEBERG_CATALOG_URI') }}', SECRET r2_secret);"
-# Note: R2 data access is configured via r2_data_secret (TYPE R2)
-# Models can use r2://bucket/path to read landing data
-# Note: CREATE SCHEMA has a DuckDB/Iceberg bug (missing Content-Type header)
-# Schema must be pre-created in R2 Data Catalog via Cloudflare dashboard or API
-# For now, skip USE statement and rely on fully-qualified table names in models
 # --- Model Defaults ---
 # https://sqlmesh.readthedocs.io/en/stable/reference/model_configuration/#model-defaults
@@ -59,7 +24,6 @@ model_defaults:
   cron: '@daily'  # Run models daily at 12am UTC (can override per model)
 # --- Linting Rules ---
-# Enforce standards for your team
 # https://sqlmesh.readthedocs.io/en/stable/guides/linter/
 linter:
@@ -68,22 +32,8 @@ linter:
     - ambiguousorinvalidcolumn
     - invalidselectstarexpansion
-# FLOW: Minimal prompts, automatic changes, summary output
-# https://sqlmesh.readthedocs.io/en/stable/reference/configuration/#plan
-#plan:
-#  no_diff: true      # Hide detailed text differences for changed models
-#  no_prompts: true   # No interactive prompts
-#  auto_apply: true   # Apply changes automatically
-# --- Optional: Set a default target environment ---
-# This is intended for local development to prevent users from accidentally applying plans to the prod environment.
-# It is a development only config and should NOT be committed to your git repo.
+# --- Default Target Environment ---
+# Prevents accidentally applying plans to prod during local development.
 # https://sqlmesh.readthedocs.io/en/stable/guides/configuration/#default-target-environment
-# Uncomment the following line to use a default target environment derived from the logged in user's name.
 default_target_environment: dev_{{ user() }}
-# Example usage:
-#   sqlmesh plan        # Automatically resolves to: sqlmesh plan dev_yourname
-#   sqlmesh plan prod   # Specify `prod` to apply changes to production


@@ -21,4 +21,4 @@ MODEL (
   )
 );
 select *
-FROM read_csv('extract/psdonline/src/psdonline/data/*.csv.gzip', delim=',', encoding='utf-8', compression='gzip', max_line_size=10000000, header=true, union_by_name=true, filename=true, names = ['commodity_code', 'commodity_description', 'country_code', 'country_name', 'market_year', 'calendar_year', 'month', 'attribute_id', 'attribute_description', 'unit_id', 'unit_description', 'value'], all_varchar=true)
+FROM read_csv('{{ var("LANDING_DIR") }}/psd/**/*.csv.gzip', delim=',', encoding='utf-8', compression='gzip', max_line_size=10000000, header=true, union_by_name=true, filename=true, names = ['commodity_code', 'commodity_description', 'country_code', 'country_name', 'market_year', 'calendar_year', 'month', 'attribute_id', 'attribute_description', 'unit_id', 'unit_description', 'value'], all_varchar=true)


@@ -41,7 +41,7 @@ select
     any_value(unit_name) as unit_name,
     any_value(value) as value,
     hash(commodity_code, commodity_name, country_code, country_name, market_year, calendar_year, month, attribute_id, attribute_name, unit_id, unit_name, value) as hkey,
-    any_value(make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1)) as ingest_date,
+    any_value(make_date(split(filename, '/')[-3]::int, split(filename, '/')[-2]::int, 1)) as ingest_date,
     any_value(if(month!=0,last_day(make_date(market_year, month, 1)),null)) as market_date_month_end,
 from cast_dtypes
 group by hkey
View File

@@ -2,7 +2,6 @@
 name = "sqlmesh_materia"
 version = "0.1.0"
 description = "Add your description here"
-readme = "README.md"
 authors = [
     { name = "Deeman", email = "hendriknote@gmail.com" }
 ]

uv.lock (generated, 1140 changed lines): diff suppressed because it is too large.