cleanup and prefect service setup

This commit is contained in:
Deeman
2026-02-04 22:24:55 +01:00
parent fc27d5f887
commit 6d4377ccf9
41 changed files with 15888 additions and 2591 deletions

vision.md Normal file

@@ -0,0 +1,261 @@
# VISION.md
## Mission
Build the fastest, most accurate, and most affordable commodity analytics platform for independent traders and small firms, without the enterprise sales bullshit.
## Product: BeanFlows.coffee
**Tagline:** Real-time commodity intelligence for traders who think for themselves.
**Beachhead Market:** Coffee commodities
**Long-term Vision:** Expand to all major commodity markets (~35-40 global contracts)
## Why We Exist
Platforms like Kpler dominate the commodity analytics space but are:
- Slow and complex
- Prohibitively expensive
- Designed for enterprise buyers with bloated sales processes
- Built on legacy infrastructure that prioritizes features over performance
We're building the anti-Kpler: **better, faster, cheaper**.
## Who We Are
A two-person indie hacker startup:
- **Data Engineer:** Building the platform
- **Commodity Trader:** Domain expertise and product direction
We move fast, ship incrementally, and prioritize value over vanity metrics.
## Technical Philosophy
### Core Principles
1. **Simplicity over complexity**
- Minimal dependencies
- Clear, readable code
- Avoid premature abstraction
2. **Performance over features**
- DuckDB over Spark
- Hetzner/Cloudflare over AWS
- SQL/Python/C over heavyweight frameworks
3. **Accuracy over speed-to-market**
- Data quality is non-negotiable
- Rigorous validation at every layer
- Build trust through reliability
4. **Build over buy**
- We're not afraid to write code from scratch
- Third-party tools must earn their place
- Control our destiny, minimize vendor lock-in
### Technology Stack
**Languages:**
- SQL (primary transformation language)
- Python (orchestration, extraction, APIs)
- C (performance-critical extensions)
**Infrastructure:**
- **Storage:** Cloudflare R2 (not S3)
- **Compute:** Hetzner bare metal (not AWS/GCP)
- **Database:** DuckDB (not Spark/Snowflake)
- **Orchestration:** SQLMesh + custom Python (not Airflow)
**Development:**
- **Monorepo:** uv workspace
- **Package Manager:** uv (not pip/poetry)
- **Version Control:** Git (GitLab)
- **CI/CD:** GitLab CI
### Architectural Philosophy
**Data-Oriented Design:**
- No OOP spaghetti
- Data flows are explicit and traceable
- Functions transform data, not objects with hidden state
**Layered Architecture:**
- Raw → Staging → Cleaned → Serving
- Each layer has a single, clear purpose
- Immutable raw data, reproducible transformations
**Incremental Everything:**
- Models update incrementally by time ranges
- Avoid full table scans
- Pay only for what changed
## Current State (October 2025)
### What's Working
- USDA PSD Online extraction (2006-present, monthly archives)
- 4-layer SQLMesh pipeline (raw → staging → cleaned → serving)
- DuckDB backend with 13GB dev database
- Incremental-by-time-range models with deduplication
- Development environment with pre-commit hooks, linting, formatting
### What We Have
- Comprehensive commodity supply/demand data (USDA PSD)
- Established naming conventions and data quality patterns
- GitLab CI pipeline (lint, test, build)
- Documentation (CLAUDE.md, layer conventions)
## Roadmap
### Phase 1: Coffee Market Foundation (Current)
**Goal:** Build complete coffee analytics from supply to price
**Data Sources to Integrate:**
- ✅ USDA PSD Online (production, stocks, consumption)
- ⬜ ICO (International Coffee Organization) data
- ⬜ Yahoo Finance / Alpha Vantage (coffee futures prices - KC=F)
- ⬜ Weather data for coffee-growing regions (OpenWeatherMap, NOAA)
- ⬜ CFTC COT data (trader positioning)
- ⬜ ICE warehouse stocks (web scraping)
**Features to Build:**
- ⬜ Historical price correlation analysis
- ⬜ Supply/demand balance modeling
- ⬜ Weather impact scoring
- ⬜ Trader sentiment indicators (COT)
- ⬜ Simple web dashboard (read-only analytics)
- ⬜ Data export APIs (JSON, CSV, Parquet)
**Infrastructure:**
- ⬜ Move to Cloudflare R2 for raw data storage
- ⬜ Deploy SQLMesh to Hetzner production environment
- ⬜ Set up automated daily extraction + transformation pipeline
- ⬜ Implement monitoring and alerting
### Phase 2: Product Market Fit
**Goal:** Validate with real traders, iterate on feedback
- ⬜ Beta access for small group of coffee traders
- ⬜ Usage analytics (what queries matter?)
- ⬜ Performance optimization based on real workloads
- ⬜ Pricing model experimentation ($X/month, pay-as-you-go?)
### Phase 3: Expand Commodity Coverage
**Goal:** Prove architecture scales across commodities
**Priority Markets:**
1. Other softs (cocoa, sugar, cotton, orange juice)
2. Grains (corn, wheat, soybeans)
3. Energy (crude oil, natural gas)
4. Metals (gold, silver, copper)
**Reusable Patterns:**
- Abstract extraction logic (API connectors, scrapers)
- Standardized staging layer for price/volume data
- Common serving models (time series, correlations, anomalies)
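As an illustration of how small such a common serving model can stay, here is a rolling z-score anomaly flag in plain Python; the window, threshold, and price series are arbitrary examples, not calibrated for any real market:

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Flag points whose z-score against the preceding window exceeds threshold."""
    flags = []
    for i, v in enumerate(values):
        past = values[max(0, i - window):i]  # trailing window only, no look-ahead
        if len(past) < 2:
            flags.append(False)  # not enough history to estimate spread
            continue
        mu, sigma = mean(past), stdev(past)
        flags.append(sigma > 0 and abs(v - mu) / sigma > threshold)
    return flags

# A quiet price series with one supply-shock spike at the end.
prices = [3.00, 3.01, 2.99, 3.02, 3.00, 3.01, 4.50]
flags = rolling_zscore_anomalies(prices)
```

The same function works unchanged for coffee, cocoa, or crude, which is the point of keeping serving models commodity-agnostic.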
### Phase 4: Advanced Analytics
**Goal:** Differentiation through unique insights
- ⬜ Satellite imagery integration (NASA, Planet) for crop monitoring
- ⬜ Custom yield forecasting models
- ⬜ Real-time alert system (price thresholds, supply shocks)
- ⬜ Historical backtesting framework for trading strategies
- ⬜ Sentiment analysis from news/reports (USDA GAIN, FAO)
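A backtesting framework ultimately replays signals over historical bars. As a toy sketch under stated assumptions (a hypothetical moving-average crossover strategy, synthetic prices, no costs or slippage):

```python
def sma(values, n):
    """Trailing simple moving average; None until n points exist."""
    return [None if i + 1 < n else sum(values[i + 1 - n:i + 1]) / n
            for i in range(len(values))]

def backtest_crossover(prices, fast=2, slow=4):
    """Long 1 unit while fast SMA > slow SMA; returns total P&L in price points."""
    f, s = sma(prices, fast), sma(prices, slow)
    position, pnl = 0, 0.0
    for i in range(1, len(prices)):
        pnl += position * (prices[i] - prices[i - 1])  # settle yesterday's position
        if f[i] is not None and s[i] is not None:
            position = 1 if f[i] > s[i] else 0         # then update for tomorrow
    return pnl

prices = [3.0, 3.0, 3.0, 3.0, 3.1, 3.2, 3.3, 3.2]
result = backtest_crossover(prices)
```

Settling P&L before updating the position avoids look-ahead bias, the classic bug that makes naive backtests look better than the strategy ever was.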
### Phase 5: Scale & Polish
**Goal:** Handle growth, maintain performance advantage
- ⬜ Multi-region deployment (low latency globally)
- ⬜ Advanced caching strategies
- ⬜ Self-service onboarding (no sales calls)
- ⬜ Public documentation and API reference
- ⬜ Community/forum for traders
## Key Decisions & Trade-offs
### Why DuckDB over Spark?
- **Speed:** In-process OLAP is faster for our workloads
- **Simplicity:** No cluster management, no JVM
- **Cost:** Runs on a single beefy server, not 100 nodes
- **Developer experience:** SQL-first, Python-friendly
### Why SQLMesh over dbt/Airflow?
- **Unified:** Orchestration + transformation in one tool
- **Performance:** Built for incremental execution
- **Virtual environments:** Test changes without breaking prod
- **Python-native:** Extend with custom macros
### Why Cloudflare R2 over S3?
- **Cost:** No egress fees (huge for data-heavy platform)
- **Performance:** Global edge network
- **Simplicity:** S3-compatible API, easy migration path
### Why Hetzner over AWS?
- **Cost:** Roughly 10x cheaper for equivalent compute
- **Performance:** Bare metal = no noisy neighbors
- **Simplicity:** Less surface area, fewer services to manage
### Why Monorepo?
- **Atomic changes:** Update extraction + transformation together
- **Shared code:** Reusable utilities across packages
- **Simplified CI:** One pipeline, consistent tooling
## Anti-Goals
Things we explicitly do NOT want:
- ❌ Enterprise sales team
- ❌ Complex onboarding processes
- ❌ Vendor lock-in (AWS, Snowflake, etc.)
- ❌ OOP frameworks (Django ORM, SQLAlchemy magic)
- ❌ Microservices (until we need them, which is not now)
- ❌ Kubernetes (overkill for our scale)
- ❌ Feature bloat (every feature has a performance cost)
## Success Metrics
**Phase 1 (Foundation):**
- All coffee data sources integrated
- Daily pipeline runs reliably (<5% failure rate)
- Query latency <500ms for common analytics
**Phase 2 (PMF):**
- 10+ paying beta users
- 90%+ data accuracy (validated against spot checks)
- Monthly churn <10%
**Phase 3 (Expansion):**
- 5+ commodity markets covered
- 100+ active users
- Break-even on infrastructure costs
**Long-term (Scale):**
- Cover all ~35-40 major commodity contracts
- 1000+ traders using the platform
- Recognized as the go-to alternative to Kpler for indie traders
## Guiding Questions
When making decisions, ask:
1. **Does this make us faster?** (Performance)
2. **Does this make us more accurate?** (Data quality)
3. **Does this make us simpler?** (Maintainability)
4. **Does this help traders make better decisions?** (Value)
5. **Can we afford to run this at scale?** (Unit economics)
If the answer to any of these is "no," reconsider.
## Current Priorities (Q4 2025)
1. Integrate coffee futures price data (Yahoo Finance)
2. Build time-series serving models for price/supply correlation
3. Deploy production pipeline to Hetzner
4. Set up Cloudflare R2 for raw data storage
5. Create simple read-only dashboard for coffee analytics
6. Document API for beta testers
---
**Last Updated:** October 2025
**Next Review:** End of Q4 2025 (adjust based on Phase 1 progress)