Refactor to local-first architecture on Hetzner NVMe
Remove distributed R2/Iceberg/SSH pipeline architecture in favor of
local subprocess execution with NVMe storage. Landing data backed up
to R2 via rclone timer.
- Strip Iceberg catalog, httpfs, boto3, paramiko, prefect, pyarrow
- Pipelines run via subprocess.run() with bounded timeouts
- Extract writes to {LANDING_DIR}/psd/{year}/{month}/{etag}.csv.gzip
- SQLMesh reads LANDING_DIR variable, writes to DUCKDB_PATH
- Delete unused provider stubs (ovh, scaleway, oracle)
- Add rclone systemd timer for R2 backup every 6h
- Update supervisor to run pipelines with env vars
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

CLAUDE.md
@@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 ## Project Overview
 
-Materia is a commodity data analytics platform (product: **BeanFlows.coffee**) for coffee traders. It's a uv workspace monorepo with three packages: extraction (USDA PSD data), SQL transformation (SQLMesh + DuckDB), and a CLI for orchestrating cloud workers and pipelines.
+Materia is a commodity data analytics platform (product: **BeanFlows.coffee**) for coffee traders. It's a uv workspace monorepo with three packages: extraction (USDA PSD data), SQL transformation (SQLMesh + DuckDB), and a CLI for worker management and local pipeline execution.
 
 ## Commands
 
@@ -25,7 +25,7 @@ cd transform/sqlmesh_materia && uv run sqlmesh test # SQLMesh model tests
 uv run pytest tests/test_cli.py::test_name -v
 
 # Extract data
-uv run extract_psd
+LANDING_DIR=data/landing uv run extract_psd
 
 # SQLMesh (from repo root)
 uv run sqlmesh -p transform/sqlmesh_materia plan  # Plans to dev_<username> by default
@@ -33,43 +33,45 @@ uv run sqlmesh -p transform/sqlmesh_materia plan prod # Production
 uv run sqlmesh -p transform/sqlmesh_materia test    # Run model tests
 uv run sqlmesh -p transform/sqlmesh_materia format  # Format SQL
 
-# With production secrets
-esc run beanflows/prod -- <command>
-
 # CLI
+uv run materia pipeline run extract|transform
+uv run materia pipeline list
 uv run materia worker create|destroy|list
-uv run materia pipeline run
 uv run materia secrets get
 ```
 
 ## Architecture
 
 **Workspace packages** (`pyproject.toml` → `tool.uv.workspace`):
-- `extract/psdonline/` — Downloads USDA PSD Online data, normalizes ZIP→gzip CSV, uploads to R2
-- `transform/sqlmesh_materia/` — 4-layer SQL transformation pipeline (DuckDB + Iceberg)
-- `src/materia/` — CLI (Typer) for worker management, pipeline orchestration, secrets
+- `extract/psdonline/` — Downloads USDA PSD Online data, normalizes ZIP→gzip CSV, writes to local landing directory
+- `transform/sqlmesh_materia/` — 4-layer SQL transformation pipeline (local DuckDB)
+- `src/materia/` — CLI (Typer) for pipeline execution, worker management, secrets
 - `web/` — Future web frontend
 
 **Data flow:**
 ```
-USDA API → extract (psdonline) → R2/local CSV → SQLMesh transforms → DuckDB/Iceberg
+USDA API → extract → /data/materia/landing/psd/{year}/{month}/{etag}.csv.gzip
+         → rclone cron syncs landing/ to R2
+         → SQLMesh raw → staging → cleaned → serving → /data/materia/lakehouse.duckdb
+         → Web app reads lakehouse.duckdb (read-only)
 ```
 
 **SQLMesh 4-layer model structure** (`transform/sqlmesh_materia/models/`):
-1. `raw/` — Immutable source reads (read_csv from extracted files)
+1. `raw/` — Immutable source reads (read_csv from landing directory)
 2. `staging/` — Type casting, lookup joins, basic cleansing
 3. `cleaned/` — Business logic, pivoting, integration
 4. `serving/` — Analytics-ready facts, dimensions, aggregates
 
 **CLI modules** (`src/materia/`):
 - `cli.py` — Typer app with subcommands: worker, pipeline, secrets, version
-- `workers.py` — Ephemeral cloud instance management (Hetzner, with planned OVH/Scaleway/Oracle)
-- `pipelines.py` — SSH-based pipeline execution on workers (download artifact, run, destroy)
+- `workers.py` — Hetzner cloud instance management (for ad-hoc compute)
+- `pipelines.py` — Local subprocess pipeline execution with bounded timeouts
 - `secrets.py` — Pulumi ESC integration for environment secrets
 
 **Infrastructure** (`infra/`):
 - Pulumi IaC for Cloudflare R2 buckets and Hetzner compute
-- Supervisor systemd service for always-on orchestration (pulls git every 15 min)
+- Supervisor systemd service for always-on orchestration (pulls git, runs pipelines)
+- rclone systemd timer for landing data backup to R2
 
 ## Coding Philosophy
 
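The landing-path scheme in the data flow above can be expressed as a small helper — a sketch only; `landing_path` is not a function in the repo:

```python
import pathlib

def landing_path(landing_dir: pathlib.Path, year: int, month: int, etag: str) -> pathlib.Path:
    # psd/{year}/{month}/{etag}.csv.gzip, with the month zero-padded
    return landing_dir / "psd" / str(year) / f"{month:02d}" / f"{etag}.csv.gzip"
```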
@@ -87,7 +89,14 @@ Read `coding_philosophy.md` for the full guide. Key points:
 - **Python 3.13** (`.python-version`)
 - **Ruff**: double quotes, spaces, E501 ignored (formatter handles line length)
 - **SQLMesh**: DuckDB dialect, `@daily` cron, start date `2025-07-07`, default env `dev_{{ user() }}`
-- **Storage**: Cloudflare R2 with Iceberg catalog (zero egress cost)
+- **Storage**: Local NVMe (`LANDING_DIR`, `DUCKDB_PATH`), R2 for backup via rclone
 - **Secrets**: Pulumi ESC (`esc run beanflows/prod -- <cmd>`)
 - **CI**: GitLab CI (`.gitlab/.gitlab-ci.yml`) — runs pytest and sqlmesh test on push/MR
 - **Pre-commit hooks**: installed via `pre-commit install`
+
+## Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `LANDING_DIR` | `data/landing` | Root directory for extracted landing data |
+| `DUCKDB_PATH` | `local.duckdb` | Path to the DuckDB lakehouse database |
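The two variables in the table above resolve with the defaults shown; a minimal sketch of how a package might read them (the helper name `resolve_paths` is illustrative, not repo code):

```python
import pathlib

def resolve_paths(env: dict[str, str]) -> tuple[pathlib.Path, pathlib.Path]:
    # Defaults mirror the environment-variable table above.
    landing = pathlib.Path(env.get("LANDING_DIR", "data/landing"))
    duckdb = pathlib.Path(env.get("DUCKDB_PATH", "local.duckdb"))
    return landing, duckdb
```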
extract/psdonline/pyproject.toml
@@ -2,16 +2,13 @@
 name = "psdonline"
 version = "0.1.0"
 description = "Add your description here"
-readme = "README.md"
 authors = [
     { name = "Deeman", email = "hendriknote@gmail.com" }
 ]
 requires-python = ">=3.13"
 
 dependencies = [
-    "boto3>=1.40.55",
     "niquests>=3.14.1",
-    "pendulum>=3.1.0",
 ]
 [project.scripts]
 extract_psd = "psdonline.execute:extract_psd_dataset"
@@ -5,63 +5,32 @@ import pathlib
 import sys
 from datetime import datetime
 
-import boto3
 import niquests
-from botocore.exceptions import ClientError
 
 logging.basicConfig(
     level=logging.INFO,
-    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
-    datefmt='%Y-%m-%d %H:%M:%S',
-    handlers=[
-        logging.StreamHandler(sys.stdout)
-    ]
+    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+    datefmt="%Y-%m-%d %H:%M:%S",
+    handlers=[logging.StreamHandler(sys.stdout)],
 )
 logger = logging.getLogger("PSDOnline Extractor")
-OUTPUT_DIR = pathlib.Path(__file__).parent / "data"
-OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
-logger.info(f"Output dir: {OUTPUT_DIR}")
 
-# R2 configuration from environment
-R2_ENDPOINT = os.getenv('R2_ENDPOINT')
-R2_BUCKET = os.getenv('R2_BUCKET')
-R2_ACCESS_KEY = os.getenv('R2_ACCESS_KEY') or os.getenv('R2_ADMIN_ACCESS_KEY_ID')
-R2_SECRET_KEY = os.getenv('R2_SECRET_KEY') or os.getenv('R2_ADMIN_SECRET_ACCESS_KEY')
+LANDING_DIR = pathlib.Path(os.getenv("LANDING_DIR", "data/landing"))
+LANDING_DIR.mkdir(parents=True, exist_ok=True)
+logger.info(f"Landing dir: {LANDING_DIR}")
 
 PSD_HISTORICAL_URL = "https://apps.fas.usda.gov/psdonline/downloads/archives/{year}/{month:02d}/psd_alldata_csv.zip"
 FIRST_YEAR = 2006
 FIRST_MONTH = 8
 
-def check_r2_file_exists(etag: str, s3_client) -> bool:
-    """Check if file exists in R2."""
-    r2_key = f"landing/psd/{etag}.csv.gzip"
-    try:
-        s3_client.head_object(Bucket=R2_BUCKET, Key=r2_key)
-        logger.info(f"File {r2_key} already exists in R2, skipping")
-        return True
-    except ClientError as e:
-        if e.response['Error']['Code'] == '404':
-            return False
-        raise
-
-
-def upload_to_r2(content: bytes, etag: str, s3_client):
-    """Upload file content to R2."""
-    r2_key = f"landing/psd/{etag}.csv.gzip"
-    logger.info(f"Uploading to R2: {r2_key}")
-    s3_client.put_object(Bucket=R2_BUCKET, Key=r2_key, Body=content)
-    logger.info("Upload complete")
-
-
-def extract_psd_file(url: str, extract_to_path: pathlib.Path, http_session: niquests.Session, s3_client=None):
-    """
-    Extract PSD file either to local storage or R2.
-    If s3_client is provided, uploads to R2 only (no local storage).
-    If s3_client is None, downloads to local storage.
-    """
+HTTP_TIMEOUT_SECONDS = 60
+
+
+def extract_psd_file(url: str, year: int, month: int, http_session: niquests.Session):
+    """Extract PSD file to local year/month subdirectory."""
     logger.info(f"Requesting file {url} ...")
 
-    response = http_session.head(url)
+    response = http_session.head(url, timeout=HTTP_TIMEOUT_SECONDS)
     if response.status_code == 404:
         logger.error("File doesn't exist on server, received status code 404 Not Found")
         return
@@ -69,55 +38,31 @@ def extract_psd_file(url: str, extract_to_path: pathlib.Path, http_session: niqu
         logger.error(f"Status code not ok, STATUS={response.status_code}")
         return
 
-    etag = response.headers.get("etag").replace('"',"").replace(":","_")
+    etag = response.headers.get("etag", "").replace('"', "").replace(":", "_")
+    assert etag, "USDA response missing etag header"
 
-    # R2 mode: check R2 and upload if needed
-    if s3_client:
-        if check_r2_file_exists(etag, s3_client):
-            return
-        response = http_session.get(url)
-        normalized_content = normalize_zipped_csv(response.content)
-        upload_to_r2(normalized_content, etag, s3_client)
-        return
-
-    # Local mode: check local and download if needed
+    extract_to_path = LANDING_DIR / "psd" / str(year) / f"{month:02d}"
     local_file = extract_to_path / f"{etag}.csv.gzip"
     if local_file.exists():
-        logger.info(f"File {etag}.zip already exists locally, skipping")
+        logger.info(f"File {etag}.csv.gzip already exists locally, skipping")
         return
 
-    response = http_session.get(url)
+    response = http_session.get(url, timeout=HTTP_TIMEOUT_SECONDS)
     logger.info(f"Storing file to {local_file}")
     extract_to_path.mkdir(parents=True, exist_ok=True)
     normalized_content = normalize_zipped_csv(response.content)
     local_file.write_bytes(normalized_content)
+    assert local_file.exists(), f"File was not written: {local_file}"
     logger.info("Download complete")
 
 
 def extract_psd_dataset():
     today = datetime.now()
 
-    # Check if R2 credentials are configured
-    use_r2 = all([R2_ENDPOINT, R2_BUCKET, R2_ACCESS_KEY, R2_SECRET_KEY])
-
-    if use_r2:
-        logger.info("R2 credentials found, uploading to R2")
-        s3_client = boto3.client(
-            's3',
-            endpoint_url=R2_ENDPOINT,
-            aws_access_key_id=R2_ACCESS_KEY,
-            aws_secret_access_key=R2_SECRET_KEY
-        )
-    else:
-        logger.info("R2 credentials not found, downloading to local storage")
-        s3_client = None
-
-    # Try current month and previous 3 months (USDA data is published with lag)
     with niquests.Session() as session:
         for months_back in range(4):
             year = today.year
             month = today.month - months_back
-            # Handle year rollover
             while month < 1:
                 month += 12
                 year -= 1
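The etag cleanup in the hunk above (drop quotes, make `:` path-safe) as a standalone sketch — `sanitize_etag` is an illustrative name, not a repo function:

```python
def sanitize_etag(raw_etag: str) -> str:
    # Same transformation as in the diff: strip quote characters and
    # replace ':' so the value can be used as a filename.
    return raw_etag.replace('"', "").replace(":", "_")
```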
@@ -125,11 +70,10 @@ def extract_psd_dataset():
             url = PSD_HISTORICAL_URL.format(year=year, month=month)
             logger.info(f"Trying {year}-{month:02d}...")
 
-            # Check if URL exists
-            response = session.head(url)
+            response = session.head(url, timeout=HTTP_TIMEOUT_SECONDS)
             if response.status_code == 200:
                 logger.info(f"Found latest data at {year}-{month:02d}")
-                extract_psd_file(url=url, http_session=session, extract_to_path=OUTPUT_DIR, s3_client=s3_client)
+                extract_psd_file(url=url, year=year, month=month, http_session=session)
                 return
             elif response.status_code == 404:
                 logger.info(f"Month {year}-{month:02d} not found, trying earlier...")
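The probing loop above (HEAD each month, walk back on 404, handling year rollover) condenses into a testable sketch — `probe_months` and its `head` callable are illustrative, not repo code:

```python
def probe_months(head, year: int, month: int, attempts: int = 4):
    """Return the first (year, month) for which head(year, month) reports 200.

    `head` is any callable returning an HTTP status code; walks back one
    month per attempt, borrowing from the year when the month underflows.
    """
    for months_back in range(attempts):
        y, m = year, month - months_back
        while m < 1:  # same rollover logic as extract_psd_dataset
            m += 12
            y -= 1
        if head(y, m) == 200:
            return (y, m)
    return None
```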
@@ -141,5 +85,3 @@ def extract_psd_dataset():
 
 if __name__ == "__main__":
     extract_psd_dataset()
-
-
infra/backup/materia-backup.service (new file)
@@ -0,0 +1,9 @@
+[Unit]
+Description=Materia Landing Data Backup to R2
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=oneshot
+ExecStart=/usr/bin/rclone sync /data/materia/landing/ r2:materia-raw/landing/ --log-level INFO
+TimeoutStartSec=1800
infra/backup/materia-backup.timer (new file)
@@ -0,0 +1,10 @@
+[Unit]
+Description=Materia Landing Data Backup Timer
+
+[Timer]
+OnCalendar=*-*-* 00/6:00:00
+RandomizedDelaySec=300
+Persistent=true
+
+[Install]
+WantedBy=timers.target
infra/backup/rclone.conf.example (new file)
@@ -0,0 +1,14 @@
+# Cloudflare R2 remote for landing data backup
+# Copy to /root/.config/rclone/rclone.conf and fill in credentials
+#
+# Get credentials from: Cloudflare Dashboard → R2 → Manage R2 API Tokens
+# Or from Pulumi ESC: esc env open beanflows/prod --format shell
+
+[r2]
+type = s3
+provider = Cloudflare
+access_key_id = <R2_ACCESS_KEY_ID>
+secret_access_key = <R2_SECRET_ACCESS_KEY>
+endpoint = https://<CLOUDFLARE_ACCOUNT_ID>.r2.cloudflarestorage.com
+acl = private
+no_check_bucket = true

infra/bootstrap_supervisor.sh
@@ -79,6 +79,9 @@ else
 cd "$REPO_DIR"
 fi
 
+echo "--- Creating data directories ---"
+mkdir -p /data/materia/landing/psd
+
 echo "--- Installing Python dependencies ---"
 uv sync
 
@@ -88,6 +91,8 @@ cat > "$REPO_DIR/.env" <<EOF
 # Loaded from Pulumi ESC: beanflows/prod
 PULUMI_ACCESS_TOKEN=${PULUMI_ACCESS_TOKEN}
 PATH=/root/.cargo/bin:/root/.pulumi/bin:/usr/local/bin:/usr/bin:/bin
+LANDING_DIR=/data/materia/landing
+DUCKDB_PATH=/data/materia/lakehouse.duckdb
 EOF
 
 echo "--- Setting up systemd service ---"
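The heredoc writes plain `KEY=VALUE` lines; a minimal parser sketch for that shape (illustrative only — the CLI actually depends on python-dotenv):

```python
def parse_env_file(text: str) -> dict[str, str]:
    # Minimal KEY=VALUE parser for the .env shape written by the bootstrap script.
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env
```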
@@ -1,80 +0,0 @@
-services:
-  postgres:
-    image: postgres:14
-    environment:
-      POSTGRES_USER: prefect
-      POSTGRES_PASSWORD: prefect
-      POSTGRES_DB: prefect
-    volumes:
-      - postgres_data:/var/lib/postgresql/data
-    healthcheck:
-      test: ["CMD-SHELL", "pg_isready -U prefect"]
-      interval: 5s
-      timeout: 5s
-      retries: 5
-
-  dragonfly:
-    image: 'docker.dragonflydb.io/dragonflydb/dragonfly'
-    ulimits:
-      memlock: -1
-    volumes:
-      - dragonflydata:/data
-    healthcheck:
-      test: ["CMD-SHELL", "redis-cli ping"]
-      interval: 5s
-      timeout: 5s
-      retries: 5
-
-  prefect-server:
-    image: prefecthq/prefect:3-latest
-    depends_on:
-      postgres:
-        condition: service_healthy
-      dragonfly:
-        condition: service_healthy
-    environment:
-      PREFECT_API_DATABASE_CONNECTION_URL: postgresql+asyncpg://prefect:prefect@postgres:5432/prefect
-      PREFECT_SERVER_API_HOST: 0.0.0.0
-      PREFECT_UI_API_URL: http://localhost:4200/api
-      PREFECT_MESSAGING_BROKER: prefect_redis.messaging
-      PREFECT_MESSAGING_CACHE: prefect_redis.messaging
-      PREFECT_REDIS_MESSAGING_HOST: dragonfly
-      PREFECT_REDIS_MESSAGING_PORT: 6379
-      PREFECT_REDIS_MESSAGING_DB: 0
-    command: prefect server start --no-services
-    ports:
-      - "4200:4200"
-    healthcheck:
-      test: ["CMD", "python", "-c", "import urllib.request as u; u.urlopen('http://localhost:4200/api/health', timeout=1)"]
-      interval: 30s
-      timeout: 10s
-      retries: 3
-      start_period: 60s
-
-  prefect-services:
-    image: prefecthq/prefect:3-latest
-    depends_on:
-      prefect-server:
-        condition: service_healthy
-    environment:
-      PREFECT_API_DATABASE_CONNECTION_URL: postgresql+asyncpg://prefect:prefect@postgres:5432/prefect
-      PREFECT_MESSAGING_BROKER: prefect_redis.messaging
-      PREFECT_MESSAGING_CACHE: prefect_redis.messaging
-      PREFECT_REDIS_MESSAGING_HOST: dragonfly
-      PREFECT_REDIS_MESSAGING_PORT: 6379
-      PREFECT_REDIS_MESSAGING_DB: 0
-    command: prefect server services start
-
-  prefect-worker:
-    image: prefecthq/prefect:3-latest
-    depends_on:
-      prefect-server:
-        condition: service_healthy
-    environment:
-      PREFECT_API_URL: http://prefect-server:4200/api
-    command: prefect worker start --pool local-pool
-    restart: on-failure
-
-volumes:
-  postgres_data:
-  dragonflydata:
infra/readme.md
@@ -1,161 +1,85 @@
 # Materia Infrastructure
 
-Pulumi-managed infrastructure for BeanFlows.coffee
+Single-server local-first setup for BeanFlows.coffee on Hetzner NVMe.
 
-## Stack Overview
+## Architecture
 
-- **Storage:** Cloudflare R2 buckets with Iceberg Data Catalog
-- **Compute:** Hetzner Cloud CCX dedicated vCPU instances
-- **Orchestration:** Custom Python scheduler (see `src/orchestrator/`)
+```
+Hetzner Server (NVMe)
+├── /opt/materia/                    # Git repo, code, uv environment
+├── /data/materia/landing/           # Extracted USDA data (year/month subdirs)
+├── /data/materia/lakehouse.duckdb   # SQLMesh output database
+└── systemd services:
+    ├── materia-supervisor           # Pulls git, runs extract + transform daily
+    └── materia-backup.timer         # Syncs landing/ to R2 every 6 hours
+```
 
-## Prerequisites
-
-1. **Cloudflare Account**
-   - Sign up at https://dash.cloudflare.com
-   - Create API token with R2 + Data Catalog permissions
-   - Get your Account ID from dashboard
-
-2. **Hetzner Cloud Account**
-   - Sign up at https://console.hetzner.cloud
-   - Create API token with Read & Write permissions
-
-3. **Pulumi Account** (optional, can use local state)
-   - Sign up at https://app.pulumi.com
-   - Or use local state with `pulumi login --local`
-
-4. **SSH Key**
-   - Generate if needed: `ssh-keygen -t ed25519 -C "materia-deploy"`
-
-## Initial Setup
+## Data Flow
+
+1. **Extract**: USDA API → `/data/materia/landing/psd/{year}/{month}/{etag}.csv.gzip`
+2. **Transform**: SQLMesh reads landing CSVs → writes to `/data/materia/lakehouse.duckdb`
+3. **Backup**: rclone syncs `/data/materia/landing/` → R2 `materia-raw/landing/`
+4. **Web**: Reads `lakehouse.duckdb` (read-only)
+
+## Setup
+
+### Prerequisites
+
+- Hetzner server with NVMe storage
+- Pulumi ESC configured (`beanflows/prod` environment)
+- `GITLAB_READ_TOKEN` and `PULUMI_ACCESS_TOKEN` set
+
+### Bootstrap
+
+```bash
+# From local machine or CI:
+ssh root@<server_ip> 'bash -s' < infra/bootstrap_supervisor.sh
+```
+
+This installs dependencies, clones the repo, creates data directories, and starts the supervisor service.
+
+### R2 Backup
+
+1. Install rclone: `apt install rclone`
+2. Copy and configure: `cp infra/backup/rclone.conf.example /root/.config/rclone/rclone.conf`
+3. Fill in R2 credentials from Pulumi ESC
+4. Install systemd units:
+
+```bash
+cp infra/backup/materia-backup.service /etc/systemd/system/
+cp infra/backup/materia-backup.timer /etc/systemd/system/
+systemctl daemon-reload
+systemctl enable --now materia-backup.timer
+```
+
+## Pulumi IaC
+
+Still manages Cloudflare R2 buckets and can provision Hetzner instances:
 
 ```bash
 cd infra
-# Login to Pulumi (local or cloud)
-pulumi login  # or: pulumi login --local
-
-# Initialize the stack
-pulumi stack init dev
-
-# Configure secrets
-pulumi config set --secret cloudflare:apiToken <your-cloudflare-token>
-pulumi config set cloudflare_account_id <your-account-id>
-pulumi config set --secret hcloud:token <your-hetzner-token>
-pulumi config set --secret ssh_public_key "$(cat ~/.ssh/id_ed25519.pub)"
-
-# Preview changes
-pulumi preview
-
-# Deploy infrastructure
+pulumi login
+pulumi stack select prod
 pulumi up
 ```
 
-## What Gets Provisioned
-
-### Cloudflare R2 Buckets
-
-1. **materia-raw** - Raw data from extraction (immutable archives)
-2. **materia-lakehouse** - Iceberg tables for SQLMesh (ACID transactions)
-
-### Hetzner Cloud Servers
-
-1. **materia-scheduler** (CCX12: 2 vCPU, 8GB RAM)
-   - Runs cron scheduler
-   - Lightweight orchestration tasks
-   - Always-on, low cost (~€6/mo)
-
-2. **materia-worker-01** (CCX22: 4 vCPU, 16GB RAM)
-   - Heavy SQLMesh transformations
-   - Can be stopped when not in use
-   - Scale up to CCX32/CCX42 for larger workloads (~€24-90/mo)
-
-3. **materia-firewall**
-   - SSH access (port 22)
-   - All outbound traffic allowed
-   - No inbound HTTP/HTTPS (we're not running web services yet)
-
-## Enabling R2 Data Catalog (Iceberg)
-
-As of October 2025, R2 Data Catalog is in public beta. Enable it manually:
-
-1. Go to Cloudflare Dashboard → R2
-2. Select the `materia-lakehouse` bucket
-3. Navigate to Settings → Data Catalog
-4. Click "Enable Data Catalog"
-
-Once enabled, you can connect DuckDB to the Iceberg REST catalog:
-
-```python
-import duckdb
-
-# Get catalog URI from Pulumi outputs
-# pulumi stack output duckdb_r2_config
-
-conn = duckdb.connect()
-conn.execute("INSTALL iceberg; LOAD iceberg;")
-conn.execute(f"""
-    ATTACH 'iceberg_rest://catalog.cloudflarestorage.com/<account_id>/r2-data-catalog'
-    AS lakehouse (
-        TYPE ICEBERG_REST,
-        SECRET '<r2_api_token>'
-    );
-""")
-```
-
-## Server Access
-
-Get server IPs from Pulumi outputs:
+## Monitoring
 
 ```bash
-pulumi stack output scheduler_ip
-pulumi stack output worker_ip
-```
-
-SSH into servers:
-
-```bash
-ssh root@<scheduler_ip>
-ssh root@<worker_ip>
-```
-
-## Cost Estimates (Monthly)
+# Supervisor status and logs
+systemctl status materia-supervisor
+journalctl -u materia-supervisor -f
+
+# Backup timer status
+systemctl list-timers materia-backup.timer
+journalctl -u materia-backup -f
+```
+
+## Cost
 
 | Resource | Type | Cost |
 |----------|------|------|
-| R2 Storage | 10 GB | $0.15 |
-| R2 Operations | 1M reads | $0.36 |
-| R2 Egress | Unlimited | $0.00 (zero egress!) |
-| Scheduler | CCX12 | €6.00 |
-| Worker (on-demand) | CCX22 | €24.00 |
-| **Total** | | **~€30/mo (~$33)** |
-
-Compare to AWS equivalent: ~$300-500/mo with S3 + EC2 + egress fees.
-
-## Scaling Workers
-
-To add more worker capacity or different instance sizes:
-
-1. Edit `infra/__main__.py` to add new server resources
-2. Update worker config in `src/orchestrator/workers.yaml`
-3. Run `pulumi up` to provision
-
-Example worker sizes:
-- CCX12: 2 vCPU, 8GB RAM (light workloads)
-- CCX22: 4 vCPU, 16GB RAM (medium workloads)
-- CCX32: 8 vCPU, 32GB RAM (heavy workloads)
-- CCX42: 16 vCPU, 64GB RAM (very heavy workloads)
-
-## Destroying Infrastructure
-
-```bash
-cd infra
-pulumi destroy
-```
-
-**Warning:** This will delete all buckets and servers. Backup data first!
-
-## Next Steps
-
-1. Deploy orchestrator to scheduler server (see `src/orchestrator/README.md`)
-2. Configure SQLMesh to use R2 lakehouse (see `transform/sqlmesh_materia/config.yaml`)
-3. Set up CI/CD pipeline to deploy on push (see `.gitlab-ci.yml`)
+| Hetzner Server | CCX22 (4 vCPU, 16GB) | ~€24/mo |
+| R2 Storage | Backup (~10 GB) | $0.15/mo |
+| R2 Egress | Zero | $0.00 |
+| **Total** | | **~€24/mo (~$26)** |

infra/supervisor/materia-supervisor.service
@@ -11,6 +11,8 @@ ExecStart=/opt/materia/infra/supervisor/supervisor.sh
 Restart=always
 RestartSec=10
 EnvironmentFile=/opt/materia/.env
+Environment=LANDING_DIR=/data/materia/landing
+Environment=DUCKDB_PATH=/data/materia/lakehouse.duckdb
 
 # Resource limits
 LimitNOFILE=65536

infra/supervisor/supervisor.sh
@@ -24,9 +24,14 @@ do
     git switch --discard-changes --detach origin/master
     uv sync
 
-    # Run pipelines (SQLMesh handles scheduling)
-    #uv run materia pipeline run extract
-    #uv run materia pipeline run transform
+    # Run pipelines
+    LANDING_DIR="${LANDING_DIR:-/data/materia/landing}" \
+    DUCKDB_PATH="${DUCKDB_PATH:-/data/materia/lakehouse.duckdb}" \
+    uv run materia pipeline run extract
+
+    LANDING_DIR="${LANDING_DIR:-/data/materia/landing}" \
+    DUCKDB_PATH="${DUCKDB_PATH:-/data/materia/lakehouse.duckdb}" \
+    uv run materia pipeline run transform
 
 ) || sleep 600  # Sleep 10 min on failure to avoid busy-loop retries
 done
|||||||
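The supervisor's `${VAR:-default}` expansions fall back to the NVMe paths whenever the environment leaves a variable unset or empty. A minimal Python sketch of the same resolution rule (the `resolve_env` helper is illustrative, not part of the codebase):

```python
# Defaults matching the supervisor's ${LANDING_DIR:-...} / ${DUCKDB_PATH:-...}
# expansions; resolve_env is a hypothetical helper for illustration.
DEFAULTS = {
    "LANDING_DIR": "/data/materia/landing",
    "DUCKDB_PATH": "/data/materia/lakehouse.duckdb",
}


def resolve_env(env: dict) -> dict:
    """Fall back to the default when a variable is unset or empty,
    mirroring shell ${VAR:-default} semantics."""
    return {key: env.get(key) or default for key, default in DEFAULTS.items()}


print(resolve_env({"LANDING_DIR": "/tmp/landing", "DUCKDB_PATH": ""}))
```

Note that `env.get(key) or default` treats an empty string like an unset variable, which is exactly what `:-` (as opposed to `-`) does in POSIX shells.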
@@ -9,14 +9,11 @@ authors = [
 ]
 requires-python = ">=3.13"
 dependencies = [
-    "pyarrow>=20.0.0",
     "python-dotenv>=1.1.0",
     "typer>=0.15.0",
-    "paramiko>=3.5.0",
     "pyyaml>=6.0.2",
     "niquests>=3.15.2",
     "hcloud>=2.8.0",
-    "prefect>=3.6.15",
 ]
 
 [project.scripts]

@@ -130,4 +127,5 @@ force-single-line = false
 # Allow print statements and other rules in scripts
 "scripts/*" = ["T201"]
 
+[tool.pytest.ini_options]
+testpaths = ["tests"]

@@ -80,15 +80,12 @@ app.add_typer(pipeline_app, name="pipeline")
 @pipeline_app.command("run")
 def pipeline_run(
     name: Annotated[str, typer.Argument(help="Pipeline name (extract, transform)")],
-    worker_type: Annotated[str | None, typer.Option("--worker", "-w")] = None,
-    provider: Annotated[str, typer.Option("--provider", "-p")] = "hetzner",
-    keep: Annotated[bool, typer.Option("--keep", help="Keep worker after completion")] = False,
 ):
-    """Run a pipeline on an ephemeral worker."""
+    """Run a pipeline locally."""
     from materia.pipelines import run_pipeline
 
     typer.echo(f"Running pipeline '{name}'...")
-    result = run_pipeline(name, worker_type, auto_destroy=not keep, provider=provider)
+    result = run_pipeline(name)
 
     if result.success:
         typer.echo(result.output)

@@ -105,7 +102,8 @@ def pipeline_list():
 
     typer.echo("Available pipelines:")
     for name, config in PIPELINES.items():
-        typer.echo(f"  • {name:<15} (worker: {config.worker_type}, artifact: {config.artifact})")
+        cmd = " ".join(config["command"])
+        typer.echo(f"  • {name:<15} (command: {cmd}, timeout: {config['timeout_seconds']}s)")
 
 
 secrets_app = typer.Typer(help="Manage secrets via Pulumi ESC")

@@ -1,21 +1,8 @@
-"""Pipeline execution on ephemeral workers."""
+"""Pipeline execution via local subprocess."""
 
-import contextlib
+import subprocess
 from dataclasses import dataclass
 
-import paramiko
-
-from materia.secrets import get_secret
-from materia.workers import create_worker, destroy_worker
-
-
-@dataclass
-class PipelineConfig:
-    worker_type: str
-    artifact: str
-    command: str
-    secrets: list[str]
-
 
 @dataclass
 class PipelineResult:

@@ -25,56 +12,20 @@ class PipelineResult:
 
 
 PIPELINES = {
-    "extract": PipelineConfig(
-        worker_type="ccx12",
-        artifact="materia-extract-latest.tar.gz",
-        command="./extract_psd",
-        secrets=["R2_ACCESS_KEY_ID", "R2_SECRET_ACCESS_KEY", "R2_ENDPOINT", "R2_ARTIFACTS_BUCKET"],
-    ),
-    "transform": PipelineConfig(
-        worker_type="ccx22",
-        artifact="materia-transform-latest.tar.gz",
-        command="cd sqlmesh_materia && ./sqlmesh plan prod",
-        secrets=[
-            "CLOUDFLARE_API_TOKEN",
-            "ICEBERG_REST_URI",
-            "R2_WAREHOUSE_NAME",
-        ],
-    ),
+    "extract": {
+        "command": ["uv", "run", "--package", "psdonline", "extract_psd"],
+        "timeout_seconds": 1800,
+    },
+    "transform": {
+        "command": ["uv", "run", "--package", "sqlmesh_materia", "sqlmesh", "-p", "transform/sqlmesh_materia", "plan", "prod", "--no-prompts", "--auto-apply"],
+        "timeout_seconds": 3600,
+    },
 }
 
 
-def _execute_ssh_command(ip: str, command: str, env_vars: dict[str, str]) -> tuple[str, str, int]:
-    ssh_key_path = get_secret("SSH_PRIVATE_KEY_PATH")
-    if not ssh_key_path:
-        raise ValueError("SSH_PRIVATE_KEY_PATH not found in secrets")
-
-    client = paramiko.SSHClient()
-    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
-
-    pkey = paramiko.RSAKey.from_private_key_file(ssh_key_path)
-    client.connect(ip, username="root", pkey=pkey)
-
-    env_string = " ".join([f"export {k}='{v}' &&" for k, v in env_vars.items()])
-    full_command = f"{env_string} {command}" if env_vars else command
-
-    stdin, stdout, stderr = client.exec_command(full_command)
-    exit_code = stdout.channel.recv_exit_status()
-
-    output = stdout.read().decode()
-    error = stderr.read().decode()
-
-    client.close()
-
-    return output, error, exit_code
-
-
-def run_pipeline(
-    pipeline_name: str,
-    worker_type: str | None = None,
-    auto_destroy: bool = True,
-    provider: str = "hetzner",
-) -> PipelineResult:
+def run_pipeline(pipeline_name: str) -> PipelineResult:
+    assert pipeline_name, "pipeline_name must not be empty"
     if pipeline_name not in PIPELINES:
         return PipelineResult(
             success=False,

@@ -82,58 +33,24 @@ def run_pipeline(
             error=f"Unknown pipeline: {pipeline_name}. Available: {', '.join(PIPELINES.keys())}",
         )
 
-    pipeline_config = PIPELINES[pipeline_name]
-    worker_type = worker_type or pipeline_config.worker_type
-    worker_name = f"materia-{pipeline_name}-worker"
-
-    r2_bucket = get_secret("R2_ARTIFACTS_BUCKET") or "materia-artifacts"
-    r2_endpoint = get_secret("R2_ENDPOINT")
-
-    if not r2_endpoint:
+    pipeline = PIPELINES[pipeline_name]
+    timeout_seconds = pipeline["timeout_seconds"]
+
+    try:
+        result = subprocess.run(
+            pipeline["command"],
+            capture_output=True,
+            text=True,
+            timeout=timeout_seconds,
+        )
+        return PipelineResult(
+            success=result.returncode == 0,
+            output=result.stdout,
+            error=result.stderr if result.returncode != 0 else None,
+        )
+    except subprocess.TimeoutExpired:
         return PipelineResult(
             success=False,
             output="",
-            error="R2_ENDPOINT not configured in secrets",
+            error=f"Pipeline '{pipeline_name}' timed out after {timeout_seconds} seconds",
         )
-
-    try:
-        worker = create_worker(worker_name, worker_type, provider)
-
-        artifact_url = f"https://{r2_endpoint}/{r2_bucket}/{pipeline_config.artifact}"
-
-        bootstrap_commands = [
-            f"curl -fsSL -o artifact.tar.gz {artifact_url}",
-            "tar -xzf artifact.tar.gz",
-            "chmod +x -R .",
-        ]
-
-        for cmd in bootstrap_commands:
-            _, error, exit_code = _execute_ssh_command(worker.ip, cmd, {})
-            if exit_code != 0:
-                return PipelineResult(
-                    success=False,
-                    output="",
-                    error=f"Bootstrap failed: {error}",
-                )
-
-        env_vars = {}
-        for secret_key in pipeline_config.secrets:
-            value = get_secret(secret_key)
-            if value:
-                env_vars[secret_key] = value
-
-        command = pipeline_config.command
-        output, error, exit_code = _execute_ssh_command(worker.ip, command, env_vars)
-
-        success = exit_code == 0
-
-        return PipelineResult(
-            success=success,
-            output=output,
-            error=error if not success else None,
-        )
-
-    finally:
-        if auto_destroy:
-            with contextlib.suppress(Exception):
-                destroy_worker(worker_name, provider)

@@ -1,7 +1,6 @@
-"""Cloud provider abstraction for worker management."""
+"""Cloud provider for worker management."""
 
 from dataclasses import dataclass
-from typing import Protocol
 
 
 @dataclass

@@ -14,35 +13,10 @@ class Instance:
     type: str
 
 
-class ProviderModule(Protocol):
-    def create_instance(
-        self: str,
-        instance_type: str,
-        ssh_key: str,
-        location: str | None = None,
-    ) -> Instance: ...
-
-    def destroy_instance(self: str) -> None: ...
-
-    def list_instances(self: str | None = None) -> list[Instance]: ...
-
-    def get_instance(self: str) -> Instance | None: ...
-
-    def wait_for_ssh(self: str, timeout: int = 300) -> bool: ...
-
-
-def get_provider(provider_name: str) -> ProviderModule:
+def get_provider(provider_name: str):
     if provider_name == "hetzner":
        from materia.providers import hetzner
 
         return hetzner
-    elif provider_name == "ovh":
-        from materia.providers import ovh
-        return ovh
-    elif provider_name == "scaleway":
-        from materia.providers import scaleway
-        return scaleway
-    elif provider_name == "oracle":
-        from materia.providers import oracle
-        return oracle
     else:
         raise ValueError(f"Unknown provider: {provider_name}")

@@ -1,28 +0,0 @@
-"""Oracle Cloud provider implementation."""
-
-from materia.providers import Instance
-
-
-def create_instance(
-    name: str,
-    instance_type: str,
-    ssh_key: str,
-    location: str | None = None,
-) -> Instance:
-    raise NotImplementedError("Oracle Cloud provider not yet implemented")
-
-
-def destroy_instance(instance_id: str) -> None:
-    raise NotImplementedError("Oracle Cloud provider not yet implemented")
-
-
-def list_instances(label: str | None = None) -> list[Instance]:
-    raise NotImplementedError("Oracle Cloud provider not yet implemented")
-
-
-def get_instance(name: str) -> Instance | None:
-    raise NotImplementedError("Oracle Cloud provider not yet implemented")
-
-
-def wait_for_ssh(ip: str, timeout: int = 300) -> bool:
-    raise NotImplementedError("Oracle Cloud provider not yet implemented")

@@ -1,28 +0,0 @@
-"""OVH Cloud provider implementation."""
-
-from materia.providers import Instance
-
-
-def create_instance(
-    name: str,
-    instance_type: str,
-    ssh_key: str,
-    location: str | None = None,
-) -> Instance:
-    raise NotImplementedError("OVH provider not yet implemented")
-
-
-def destroy_instance(instance_id: str) -> None:
-    raise NotImplementedError("OVH provider not yet implemented")
-
-
-def list_instances(label: str | None = None) -> list[Instance]:
-    raise NotImplementedError("OVH provider not yet implemented")
-
-
-def get_instance(name: str) -> Instance | None:
-    raise NotImplementedError("OVH provider not yet implemented")
-
-
-def wait_for_ssh(ip: str, timeout: int = 300) -> bool:
-    raise NotImplementedError("OVH provider not yet implemented")

@@ -1,28 +0,0 @@
-"""Scaleway provider implementation."""
-
-from materia.providers import Instance
-
-
-def create_instance(
-    name: str,
-    instance_type: str,
-    ssh_key: str,
-    location: str | None = None,
-) -> Instance:
-    raise NotImplementedError("Scaleway provider not yet implemented")
-
-
-def destroy_instance(instance_id: str) -> None:
-    raise NotImplementedError("Scaleway provider not yet implemented")
-
-
-def list_instances(label: str | None = None) -> list[Instance]:
-    raise NotImplementedError("Scaleway provider not yet implemented")
-
-
-def get_instance(name: str) -> Instance | None:
-    raise NotImplementedError("Scaleway provider not yet implemented")
-
-
-def wait_for_ssh(ip: str, timeout: int = 300) -> bool:
-    raise NotImplementedError("Scaleway provider not yet implemented")

@@ -13,16 +13,9 @@ def mock_esc_env(tmp_path):
 
     return {
         "HETZNER_API_TOKEN": "test-hetzner-token",
-        "R2_ACCESS_KEY_ID": "test-r2-key",
-        "R2_SECRET_ACCESS_KEY": "test-r2-secret",
-        "R2_ENDPOINT": "test.r2.cloudflarestorage.com",
-        "R2_ARTIFACTS_BUCKET": "test-artifacts",
         "SSH_PUBLIC_KEY": "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAITest",
         "SSH_PRIVATE_KEY": "-----BEGIN OPENSSH PRIVATE KEY-----\ntest\n-----END OPENSSH PRIVATE KEY-----",
         "SSH_PRIVATE_KEY_PATH": str(ssh_key_path),
-        "CLOUDFLARE_API_TOKEN": "test-cf-token",
-        "ICEBERG_REST_URI": "https://api.cloudflare.com/test",
-        "R2_WAREHOUSE_NAME": "test-warehouse",
     }
 
 

@@ -67,33 +60,3 @@ def mock_ssh_wait():
     """Mock SSH wait function to return immediately."""
     with patch("materia.providers.hetzner.wait_for_ssh", return_value=True):
         yield
-
-
-@pytest.fixture
-def mock_ssh_connection():
-    """Mock paramiko SSH connection."""
-    with patch("materia.pipelines.paramiko.SSHClient") as mock_ssh_class, \
-         patch("materia.pipelines.paramiko.RSAKey.from_private_key_file") as mock_key:
-        ssh_instance = Mock()
-        mock_ssh_class.return_value = ssh_instance
-        mock_key.return_value = Mock()
-
-        ssh_instance.connect = Mock()
-        ssh_instance.set_missing_host_key_policy = Mock()
-
-        mock_channel = Mock()
-        mock_channel.recv_exit_status.return_value = 0
-
-        mock_stdout = Mock()
-        mock_stdout.read.return_value = b"Success\n"
-        mock_stdout.channel = mock_channel
-
-        mock_stderr = Mock()
-        mock_stderr.read.return_value = b""
-
-        ssh_instance.exec_command = Mock(
-            return_value=(Mock(), mock_stdout, mock_stderr)
-        )
-        ssh_instance.close = Mock()
-
-        yield ssh_instance

@@ -1,5 +1,7 @@
 """End-to-end tests for the materia CLI."""
 
+from unittest.mock import patch
+
 from typer.testing import CliRunner
 
 from materia.cli import app

@@ -33,7 +35,6 @@ def test_secrets_list_command(mock_secrets):
     result = runner.invoke(app, ["secrets", "list"])
     assert result.exit_code == 0
     assert "HETZNER_API_TOKEN" in result.stdout
-    assert "R2_ACCESS_KEY_ID" in result.stdout
 
 
 def test_worker_list_empty(mock_secrets, mock_hcloud_client):

@@ -98,46 +99,55 @@ def test_worker_destroy(mock_secrets, mock_hcloud_client):
     assert "Worker destroyed" in result.stdout
 
 
-def test_pipeline_list(mock_secrets):
+def test_pipeline_list():
     """Test pipeline list command."""
     result = runner.invoke(app, ["pipeline", "list"])
     assert result.exit_code == 0
     assert "extract" in result.stdout
     assert "transform" in result.stdout
-    assert "ccx12" in result.stdout
-    assert "ccx22" in result.stdout
+    assert "1800" in result.stdout
+    assert "3600" in result.stdout
 
 
-def test_pipeline_run_extract(
-    mock_secrets, mock_hcloud_client, mock_ssh_wait, mock_ssh_connection
-):
+def test_pipeline_run_extract():
     """Test running extract pipeline end-to-end."""
-    result = runner.invoke(app, ["pipeline", "run", "extract"])
+    with patch("materia.pipelines.subprocess.run") as mock_run:
+        mock_run.return_value.returncode = 0
+        mock_run.return_value.stdout = "Extracted successfully\n"
+        mock_run.return_value.stderr = ""
 
-    assert result.exit_code == 0
-    assert "Running pipeline" in result.stdout
-    assert "Pipeline completed successfully" in result.stdout
+        result = runner.invoke(app, ["pipeline", "run", "extract"])
 
-    mock_hcloud_client.servers.create.assert_called_once()
-    mock_ssh_connection.connect.assert_called()
-    mock_ssh_connection.exec_command.assert_called()
+    assert result.exit_code == 0
+    assert "Running pipeline" in result.stdout
+    assert "Pipeline completed successfully" in result.stdout
+
+    mock_run.assert_called_once()
+    call_args = mock_run.call_args
+    assert call_args[0][0] == ["uv", "run", "--package", "psdonline", "extract_psd"]
+    assert call_args[1]["timeout"] == 1800
 
 
-def test_pipeline_run_transform(
-    mock_secrets, mock_hcloud_client, mock_ssh_wait, mock_ssh_connection
-):
+def test_pipeline_run_transform():
     """Test running transform pipeline end-to-end."""
-    result = runner.invoke(app, ["pipeline", "run", "transform"])
+    with patch("materia.pipelines.subprocess.run") as mock_run:
+        mock_run.return_value.returncode = 0
+        mock_run.return_value.stdout = "Transform complete\n"
+        mock_run.return_value.stderr = ""
 
-    assert result.exit_code == 0
-    assert "Running pipeline" in result.stdout
-    assert "Pipeline completed successfully" in result.stdout
+        result = runner.invoke(app, ["pipeline", "run", "transform"])
 
-    mock_hcloud_client.servers.create.assert_called_once()
-    mock_ssh_connection.connect.assert_called()
+    assert result.exit_code == 0
+    assert "Running pipeline" in result.stdout
+    assert "Pipeline completed successfully" in result.stdout
+
+    mock_run.assert_called_once()
+    call_args = mock_run.call_args
+    assert "sqlmesh" in call_args[0][0]
+    assert call_args[1]["timeout"] == 3600
 
 
-def test_pipeline_run_invalid(mock_secrets):
+def test_pipeline_run_invalid():
     """Test running an invalid pipeline."""
     result = runner.invoke(app, ["pipeline", "run", "invalid-pipeline"])
 

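The rewritten tests replace the SSH-connection mocks with a single patch on `subprocess.run`, then inspect the recorded call. The same idea in isolation, with a toy runner standing in for the CLI (the `run_extract` function is illustrative only):

```python
import subprocess
from unittest.mock import patch


def run_extract() -> str:
    """Toy stand-in for a subprocess-backed pipeline call (illustrative only)."""
    proc = subprocess.run(
        ["uv", "run", "extract_psd"], capture_output=True, text=True, timeout=1800
    )
    return proc.stdout


# Patch subprocess.run so no real process is spawned, then inspect the call.
with patch("subprocess.run") as mock_run:
    mock_run.return_value.returncode = 0
    mock_run.return_value.stdout = "Extracted successfully\n"
    output = run_extract()

mock_run.assert_called_once()
assert output == "Extracted successfully\n"
assert mock_run.call_args.kwargs["timeout"] == 1800
```

`call_args[0]` holds the positional arguments (the command list) and `call_args.kwargs` the keyword arguments, which is why the tests can assert on both the command and the timeout without running anything.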
@@ -1,5 +1,5 @@
 # --- Gateway Connection ---
-# Single gateway connecting to R2 Iceberg catalog
+# Single local DuckDB gateway
 # Local dev uses virtual environments (e.g., dev_<username>)
 # Production uses the 'prod' environment
 gateways:

@@ -7,48 +7,13 @@ gateways:
     connection:
       type: duckdb
       catalogs:
-        local: 'local.duckdb'
-        cloudflare:
-          type: iceberg
-          path: '{{ env_var("ICEBERG_WAREHOUSE_NAME") }}'
-          connector_config:
-            endpoint: '{{ env_var("ICEBERG_CATALOG_URI") }}'
-      extensions:
-        - name: httpfs
-        - name: iceberg
-      secrets:
-        r2_secret:
-          type: iceberg
-          token: "{{ env_var('R2_ADMIN_API_TOKEN') }}"
-        r2_data_secret:
-          type: r2
-          key_id: "{{ env_var('R2_ADMIN_ACCESS_KEY_ID') }}"
-          secret: "{{ env_var('R2_ADMIN_SECRET_ACCESS_KEY') }}"
-          account_id: "{{ var('CLOUDFLARE_ACCOUNT_ID') }}"
-          region: 'eeur'
-
+        local: '{{ env_var("DUCKDB_PATH", "local.duckdb") }}'
 
 default_gateway: duckdb
 
 # --- Variables ---
-# Make environment variables available to models
 variables:
-  R2_BUCKET: beanflows-data-prod
-  CLOUDFLARE_ACCOUNT_ID: "{{ env_var('CLOUDFLARE_ACCOUNT_ID') }}"
+  LANDING_DIR: '{{ env_var("LANDING_DIR", "data/landing") }}'
 
-# --- Catalog Configuration ---
-# Attach R2 Iceberg catalog and configure default schema
-# https://sqlmesh.readthedocs.io/en/stable/reference/configuration/#execution-hooks
-# https://developers.cloudflare.com/r2/data-catalog/config-examples/duckdb/
-
-#before_all:
-#  - "ATTACH '{{ env_var('ICEBERG_WAREHOUSE_NAME') }}' AS catalog (TYPE ICEBERG, ENDPOINT '{{ env_var('ICEBERG_CATALOG_URI') }}', SECRET r2_secret);"
-# Note: R2 data access is configured via r2_data_secret (TYPE R2)
-# Models can use r2://bucket/path to read landing data
-# Note: CREATE SCHEMA has a DuckDB/Iceberg bug (missing Content-Type header)
-# Schema must be pre-created in R2 Data Catalog via Cloudflare dashboard or API
-# For now, skip USE statement and rely on fully-qualified table names in models
-
 # --- Model Defaults ---
 # https://sqlmesh.readthedocs.io/en/stable/reference/model_configuration/#model-defaults

@@ -59,7 +24,6 @@ model_defaults:
   cron: '@daily'  # Run models daily at 12am UTC (can override per model)
 
 # --- Linting Rules ---
-# Enforce standards for your team
 # https://sqlmesh.readthedocs.io/en/stable/guides/linter/
 
 linter:

@@ -68,22 +32,8 @@ linter:
     - ambiguousorinvalidcolumn
     - invalidselectstarexpansion
 
-# FLOW: Minimal prompts, automatic changes, summary output
-# https://sqlmesh.readthedocs.io/en/stable/reference/configuration/#plan
-
-#plan:
-#  no_diff: true  # Hide detailed text differences for changed models
-#  no_prompts: true  # No interactive prompts
-#  auto_apply: true  # Apply changes automatically
-
-# --- Optional: Set a default target environment ---
-# This is intended for local development to prevent users from accidentally applying plans to the prod environment.
-# It is a development only config and should NOT be committed to your git repo.
+# --- Default Target Environment ---
+# Prevents accidentally applying plans to prod during local development.
 # https://sqlmesh.readthedocs.io/en/stable/guides/configuration/#default-target-environment
-
-# Uncomment the following line to use a default target environment derived from the logged in user's name.
 default_target_environment: dev_{{ user() }}
-
-# Example usage:
-#   sqlmesh plan       # Automatically resolves to: sqlmesh plan dev_yourname
-#   sqlmesh plan prod  # Specify `prod` to apply changes to production

@@ -21,4 +21,4 @@ MODEL (
   )
 );
 select *
-FROM read_csv('extract/psdonline/src/psdonline/data/*.csv.gzip', delim=',', encoding='utf-8', compression='gzip', max_line_size=10000000, header=true, union_by_name=true, filename=true, names = ['commodity_code', 'commodity_description', 'country_code', 'country_name', 'market_year', 'calendar_year', 'month', 'attribute_id', 'attribute_description', 'unit_id', 'unit_description', 'value'], all_varchar=true)
+FROM read_csv('{{ var("LANDING_DIR") }}/psd/**/*.csv.gzip', delim=',', encoding='utf-8', compression='gzip', max_line_size=10000000, header=true, union_by_name=true, filename=true, names = ['commodity_code', 'commodity_description', 'country_code', 'country_name', 'market_year', 'calendar_year', 'month', 'attribute_id', 'attribute_description', 'unit_id', 'unit_description', 'value'], all_varchar=true)

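The model now globs every gzip CSV under the landing tree rather than reading the extractor's package directory. A stdlib sketch of that layout and the recursive glob the `read_csv` pattern relies on (paths and sample values are illustrative):

```python
import csv
import gzip
import tempfile
from pathlib import Path

landing = Path(tempfile.mkdtemp())  # stand-in for LANDING_DIR
month_dir = landing / "psd" / "2025" / "06"
month_dir.mkdir(parents=True)

# One gzip-compressed CSV in the extractor's layout:
# {LANDING_DIR}/psd/{year}/{month}/{etag}.csv.gzip
with gzip.open(month_dir / "etag123.csv.gzip", "wt", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["commodity_code", "value"])
    writer.writerow(["0711100", "42"])

# Recursive glob mirroring read_csv('.../psd/**/*.csv.gzip', ...)
files = sorted(landing.glob("psd/**/*.csv.gzip"))
print([f.name for f in files])
```

With `filename=true`, DuckDB also exposes each matched path as a `filename` column, which is what lets downstream models recover the year and month partitions from the path itself.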
@@ -41,7 +41,7 @@ select
     any_value(unit_name) as unit_name,
     any_value(value) as value,
     hash(commodity_code, commodity_name, country_code, country_name, market_year, calendar_year, month, attribute_id, attribute_name, unit_id, unit_name, value) as hkey,
-    any_value(make_date(split(filename, '/')[-4]::int, split(filename, '/')[-3]::int, 1)) as ingest_date,
+    any_value(make_date(split(filename, '/')[-3]::int, split(filename, '/')[-2]::int, 1)) as ingest_date,
     any_value(if(month!=0,last_day(make_date(market_year, month, 1)),null)) as market_date_month_end,
 from cast_dtypes
 group by hkey

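The index shift (`[-4]`/`[-3]` to `[-3]`/`[-2]`) follows from the new path shape, `{LANDING_DIR}/psd/{year}/{month}/{etag}.csv.gzip`: counting from the end, the year is now the third-from-last path segment and the month the second-from-last. Python's negative indexing behaves the same way as DuckDB's here, so the derivation can be checked directly (the helper name is illustrative):

```python
import datetime


def ingest_date(filename: str) -> datetime.date:
    """Rebuild make_date(split(filename, '/')[-3], split(filename, '/')[-2], 1)
    for paths shaped like <LANDING_DIR>/psd/<year>/<month>/<etag>.csv.gzip."""
    parts = filename.split("/")
    return datetime.date(int(parts[-3]), int(parts[-2]), 1)


print(ingest_date("/data/materia/landing/psd/2025/06/etag123.csv.gzip"))
```

Counting from the end keeps the expression independent of how deep `LANDING_DIR` itself is nested, which is why the model survives the move from the old in-repo data directory to `/data/materia/landing`.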
@@ -2,7 +2,6 @@
 name = "sqlmesh_materia"
 version = "0.1.0"
 description = "Add your description here"
-readme = "README.md"
 authors = [
     { name = "Deeman", email = "hendriknote@gmail.com" }
 ]