# VISION.md

## Mission

Build the fastest, most accurate, and most affordable commodity analytics platform for independent traders and small firms—without enterprise sales bullshit.

## Product: BeanFlows.coffee

**Tagline:** Real-time commodity intelligence for traders who think for themselves.

**Beachhead Market:** Coffee commodities

**Long-term Vision:** Expand to all major commodity markets (~35-40 global contracts)

## Why We Exist

Platforms like Kpler dominate the commodity analytics space but are:

- Slow and complex
- Prohibitively expensive
- Designed for enterprise buyers with bloated sales processes
- Built on legacy infrastructure that prioritizes features over performance

We're building the anti-Kpler: **better, faster, cheaper**.

## Who We Are

A two-person indie hacker startup:

- **Data Engineer:** Building the platform
- **Commodity Trader:** Domain expertise and product direction

We move fast, ship incrementally, and prioritize value over vanity metrics.

## Technical Philosophy

### Core Principles

1. **Simplicity over complexity**
   - Minimal dependencies
   - Clear, readable code
   - Avoid premature abstraction
2. **Performance over features**
   - DuckDB over Spark
   - Hetzner/Cloudflare over AWS
   - SQL/Python/C over heavyweight frameworks
3. **Accuracy over speed-to-market**
   - Data quality is non-negotiable
   - Rigorous validation at every layer
   - Build trust through reliability
4. **Build over buy**
   - We're not afraid to write code from scratch
   - Third-party tools must earn their place
   - Control our destiny, minimize vendor lock-in

### Technology Stack

**Languages:**

- SQL (primary transformation language)
- Python (orchestration, extraction, APIs)
- C (performance-critical extensions)

**Infrastructure:**

- **Storage:** Cloudflare R2 (not S3)
- **Compute:** Hetzner bare metal (not AWS/GCP)
- **Database:** DuckDB (not Spark/Snowflake)
- **Orchestration:** SQLMesh + custom Python (not Airflow)

**Development:**

- **Monorepo:** uv workspace
- **Package Manager:** uv (not pip/poetry)
- **Version Control:** Git (GitLab)
- **CI/CD:** GitLab CI

### Architectural Philosophy

**Data-Oriented Design:**

- No OOP spaghetti
- Data flows are explicit and traceable
- Functions transform data, not objects with hidden state

**Layered Architecture:**

- Raw → Staging → Cleaned → Serving
- Each layer has a single, clear purpose
- Immutable raw data, reproducible transformations

**Incremental Everything:**

- Models update incrementally by time ranges
- Avoid full table scans
- Pay only for what changed

## Current State (February 2026)

### What's Working

- USDA PSD Online extraction (2006–present, monthly archives)
- 4-layer SQLMesh pipeline (raw → staging → cleaned → serving)
- DuckDB backend (local dev + production lakehouse)
- Incremental-by-time-range models with deduplication
- Development environment with pre-commit hooks, linting, formatting
- **Web app (BeanFlows.coffee)** — Quart + HTMX, deployed via Docker
- Magic-link auth + signup with waitlist flow
- Coffee analytics dashboard: time series, top producers, stock-to-use trend, supply/demand balance, YoY change
- Country comparison view
- User settings + account management
- API key management (create, revoke, prefix display)
- Plan-based access control (free / starter / pro) with 5-year history cap on free tier
- Billing via Paddle (subscriptions + webhooks)
- Admin panel (users, waitlist, feedback, tasks)
- REST API with Bearer token auth, rate limiting (1000 req/hr), CSV export
- Feedback + waitlist capture
- GitLab CI pipeline (lint, test, build), regression tests for billing/auth/API

### What We Have

- Comprehensive commodity supply/demand data (USDA PSD, 2006–present)
- Established naming conventions and data quality patterns
- Full product pipeline: data → DB → API → web dashboard
- Paddle billing integration (Starter + Pro tiers)
- Working waitlist to capture early interest

## Roadmap

### Phase 1: Coffee Market Foundation (In Progress → ~70% done)

**Goal:** Build complete coffee analytics from supply to price

**Data Sources to Integrate:**

- ✅ USDA PSD Online (production, stocks, consumption)
- ⬜ CFTC COT data (trader positioning — weekly, Coffee C futures code 083731)
- ⬜ Coffee futures prices — KC=F via Yahoo Finance / yfinance, or Databento for tick-level
- ⬜ ICO (International Coffee Organization) data — trade volumes, consumption stats
- ⬜ ICE certified warehouse stocks (daily CSV from ICE Report Center — free)
- ⬜ Weather data for growing regions — ECMWF/Open-Meteo (free), Brazil frost alerts

**Features to Build:**

- ✅ Web dashboard (supply/demand, stock-to-use trend, YoY, country comparison)
- ✅ REST API with key auth, plan-based access, rate limiting
- ✅ CSV export
- ⬜ CFTC COT integration → trader sentiment indicators
- ⬜ Historical price data → price/supply correlation analysis
- ⬜ Python SDK (`pip install beanflows`) — critical for the quant analyst beachhead
- ⬜ Data methodology documentation page — P0 for trust (see strategy doc)
- ⬜ Parquet export endpoint
- ⬜ Example Jupyter notebooks (show how to pipe data into common models)

**Infrastructure:**

- ⬜ Cloudflare R2 for raw data storage (rclone sync is partly planned)
- ⬜ Automated daily pipeline on Hetzner (SQLMesh prod + cron)
- ⬜ Pipeline monitoring + alerting (failure notifications)
- ⬜ Published SLA for data freshness

### Phase 2: Product Market Fit

**Goal:** Validate with real
traders, iterate on feedback

- ⬜ Beta access for small group of coffee traders
- ⬜ Usage analytics (what queries matter?)
- ⬜ Performance optimization based on real workloads
- ⬜ Pricing model experimentation ($X/month, pay-as-you-go?)

### Phase 3: Expand Commodity Coverage

**Goal:** Prove architecture scales across commodities

**Priority Markets:**

1. Other softs (cocoa, sugar, cotton, OJ)
2. Grains (corn, wheat, soybeans)
3. Energy (crude oil, natural gas)
4. Metals (gold, silver, copper)

**Reusable Patterns:**

- Abstract extraction logic (API connectors, scrapers)
- Standardized staging layer for price/volume data
- Common serving models (time series, correlations, anomalies)

### Phase 4: Advanced Analytics

**Goal:** Differentiation through unique insights

- ⬜ Satellite imagery integration (NASA, Planet) for crop monitoring
- ⬜ Custom yield forecasting models
- ⬜ Real-time alert system (price thresholds, supply shocks)
- ⬜ Historical backtesting framework for trading strategies
- ⬜ Sentiment analysis from news/reports (USDA GAIN, FAO)

### Phase 5: Scale & Polish

**Goal:** Handle growth, maintain performance advantage

- ⬜ Multi-region deployment (low latency globally)
- ⬜ Advanced caching strategies
- ⬜ Self-service onboarding (no sales calls)
- ⬜ Public documentation and API reference
- ⬜ Community/forum for traders

## Key Decisions & Trade-offs

### Why DuckDB over Spark?

- **Speed:** In-process OLAP is faster for our workloads
- **Simplicity:** No cluster management, no JVM
- **Cost:** Runs on a single beefy server, not 100 nodes
- **Developer experience:** SQL-first, Python-friendly

### Why SQLMesh over dbt/Airflow?

- **Unified:** Orchestration + transformation in one tool
- **Performance:** Built for incremental execution
- **Virtual environments:** Test changes without breaking prod
- **Python-native:** Extend with custom macros

### Why Cloudflare R2 over S3?

- **Cost:** No egress fees (huge for data-heavy platform)
- **Performance:** Global edge network
- **Simplicity:** S3-compatible API, easy migration path

### Why Hetzner over AWS?

- **Cost:** 10x cheaper for equivalent compute
- **Performance:** Bare metal = no noisy neighbors
- **Simplicity:** Less surface area, fewer services to manage

### Why Monorepo?

- **Atomic changes:** Update extraction + transformation together
- **Shared code:** Reusable utilities across packages
- **Simplified CI:** One pipeline, consistent tooling

## Anti-Goals

Things we explicitly do NOT want:

- ❌ Enterprise sales team
- ❌ Complex onboarding processes
- ❌ Vendor lock-in (AWS, Snowflake, etc.)
- ❌ OOP frameworks (Django ORM, SQLAlchemy magic)
- ❌ Microservices (until we need them, which is not now)
- ❌ Kubernetes (overkill for our scale)
- ❌ Feature bloat (every feature has a performance cost)

## Success Metrics

**Phase 1 (Foundation):**

- All coffee data sources integrated
- Daily pipeline runs reliably (<5% failure rate)
- Query latency <500ms for common analytics

**Phase 2 (PMF):**

- 10+ paying beta users
- 90%+ data accuracy (validated against spot checks)
- Monthly churn <10%

**Phase 3 (Expansion):**

- 5+ commodity markets covered
- 100+ active users
- Break-even on infrastructure costs

**Long-term (Scale):**

- Cover all ~35-40 major commodity contracts
- 1000+ traders using the platform
- Recognized as the go-to alternative to Kpler for indie traders

## Guiding Questions

When making decisions, ask:

1. **Does this make us faster?** (Performance)
2. **Does this make us more accurate?** (Data quality)
3. **Does this make us simpler?** (Maintainability)
4. **Does this help traders make better decisions?** (Value)
5. **Can we afford to run this at scale?** (Unit economics)

If the answer to any of these is "no," reconsider.

## Current Priorities (Q1 2026)

**Goal: Complete Phase 1 "whole product" and start beachhead outreach**

### Immediate (ship first):

1. **CFTC COT data** — extract weekly positioning data (CFTC code 083731), add to SQLMesh pipeline, expose via API. Completes the "USDA + CFTC" V1 promise from the strategy doc.
2. **Coffee futures price (KC=F)** — daily close via yfinance or Databento. Enables price/supply correlation in the dashboard. Core hook for trader interest.
3. **Data methodology page** — transparent docs for every field, every source, lineage. The #1 trust driver per the strategy doc. Required before outreach.
4. **Python SDK** (`pip install beanflows`) — one-line data access for quant analysts. The beachhead segment runs Python; this removes their biggest switching friction.

### Then (before Series A of customers):

5. **Automated daily pipeline** on Hetzner — cron + SQLMesh prod, with failure alerting
6. **Cloudflare R2** raw data backup + pipeline source
7. **Example Jupyter notebooks** — show before/after vs. manual WASDE workflow
8. **ICE warehouse stocks** — daily certified Arabica/Robusta inventory data (free from ICE Report Center)

### Business (parallel, not blocking):

- Start direct outreach to 20–30 named analysts at mid-size commodity funds
- Weekly "BeanFlows Coffee Data Brief" newsletter (content marketing + credibility signal)
- Identify 1–2 early beta users willing to give feedback

---

**Last Updated:** February 2026

**Next Review:** End of Q1 2026
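
## Appendix: Incremental Model Sketch

The "incremental-by-time-range with deduplication" pattern described under Technical Philosophy can be sketched as a SQLMesh model. This is a minimal illustration, not the production schema; the table, column, and key names here are hypothetical:

```sql
-- Sketch: incremental-by-time-range model with dedup (names illustrative).
-- SQLMesh substitutes @start_ds / @end_ds with the interval being
-- (re)computed, so each run scans only the changed time range,
-- never the full table.
MODEL (
  name cleaned.psd_observations,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column report_date
  ),
  cron '@daily'
);

SELECT
  commodity_code,
  country_code,
  attribute,
  value,
  report_date
FROM staging.psd_observations
WHERE report_date BETWEEN @start_ds AND @end_ds
-- Keep only the latest load per key when a source re-publishes a month
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY commodity_code, country_code, attribute, report_date
  ORDER BY _loaded_at DESC
) = 1
```

Because DuckDB supports `QUALIFY`, deduplication lives inside the model itself, and re-running any interval is idempotent. That idempotence is what makes "immutable raw data, reproducible transformations" cheap to operate.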