beanflows

Author	SHA1	Message	Date
Deeman	80c1163a7f	feat: extraction framework overhaul — extract_core shared package + SQLite state tracking - Add extract/extract_core/ workspace package with three modules: - state.py: SQLite run tracking (open_state_db, start_run, end_run, get_last_cursor) - http.py: niquests session factory + etag normalization helpers - files.py: landing_path, content_hash, write_bytes_atomic (atomic gzip writes) - State lives at {LANDING_DIR}/.state.sqlite — no extra env var needed - SQLite chosen over DuckDB: state tracking is OLTP (row inserts/updates), not analytical - Refactor all 4 extractors (psdonline, cftc_cot, coffee_prices, ice_stocks): - Replace inline boilerplate with extract_core helpers - Add start_run/end_run tracking to every extraction entry point - extract_cot_year returns int (bytes_written) instead of bool - Update tests: assert result == 0 (not `is False`) for the return type change Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-22 14:37:50 +01:00
Deeman	c1d00dcdc4	Refactor to local-first architecture on Hetzner NVMe Remove distributed R2/Iceberg/SSH pipeline architecture in favor of local subprocess execution with NVMe storage. Landing data backed up to R2 via rclone timer. - Strip Iceberg catalog, httpfs, boto3, paramiko, prefect, pyarrow - Pipelines run via subprocess.run() with bounded timeouts - Extract writes to {LANDING_DIR}/psd/{year}/{month}/{etag}.csv.gzip - SQLMesh reads LANDING_DIR variable, writes to DUCKDB_PATH - Delete unused provider stubs (ovh, scaleway, oracle) - Add rclone systemd timer for R2 backup every 6h - Update supervisor to run pipelines with env vars Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 19:50:19 +01:00
Deeman	38897617e7	Refactor PSD extraction: simplify to latest-only + add R2 support ## Key Changes 1. Simplified extraction logic - Changed from downloading 220+ historical archives to checking only latest available month - Tries current month and falls back up to 3 months (handles USDA publication lag) - Architecture advisor insight: ETags naturally deduplicate, historical year/month structure was unnecessary 2. Flat storage structure - Old: `data/{year}/{month}/{etag}.zip` - New: `data/{etag}.zip` (local) or `psd/{etag}.zip` (R2) - Migrated 226 existing files to flat structure 3. Dual storage modes - Local mode: Downloads to local directory (development) - R2 mode: Uploads to Cloudflare R2 (production) - Mode determined by presence of R2 environment variables - Added boto3 dependency for S3-compatible R2 API 4. Updated raw SQLMesh model - Changed pattern from `*/.zip` to `*.zip` to match flat structure ## Benefits - Simpler: Single file check instead of 220+ URL attempts - Efficient: ETag-based deduplication works naturally - Flexible: Supports both local dev and production R2 storage - Maintainable: Removed unnecessary complexity ## Testing - ✅ Local extraction works and respects ETags - ✅ Falls back correctly when current month unavailable - ✅ Linting passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-20 22:02:15 +02:00
Deeman	9baa0d185c	testing sqlmesh	2025-07-27 00:18:03 +02:00
Deeman	0bbbd25b68	update projects to packages	2025-07-26 22:32:37 +02:00
Deeman	4fd1b96114	simplify using etags	2025-07-26 22:08:35 +02:00
Deeman	8143c6ed8e	async is requesting stuff too fast	2025-07-13 18:08:19 +02:00
Deeman	c3c281fcd8	update structure	2025-07-08 22:41:59 +02:00

8 Commits