# VISION.md

## Mission

Build the fastest, most accurate, and most affordable commodity analytics platform for independent traders and small firms—without enterprise sales bullshit.

## Product: BeanFlows.coffee

**Tagline:** Real-time commodity intelligence for traders who think for themselves.

**Beachhead Market:** Coffee commodities
**Long-term Vision:** Expand to all major commodity markets (~35-40 global contracts)

## Why We Exist

Platforms like Kpler dominate the commodity analytics space but are:
- Slow and complex
- Prohibitively expensive
- Designed for enterprise buyers with bloated sales processes
- Built on legacy infrastructure that prioritizes features over performance

We're building the anti-Kpler: **better, faster, cheaper**.

## Who We Are

A two-person indie hacker startup:
- **Data Engineer:** Building the platform
- **Commodity Trader:** Domain expertise and product direction

We move fast, ship incrementally, and prioritize value over vanity metrics.

## Technical Philosophy

### Core Principles

1. **Simplicity over complexity**
   - Minimal dependencies
   - Clear, readable code
   - Avoid premature abstraction

2. **Performance over features**
   - DuckDB over Spark
   - Hetzner/Cloudflare over AWS
   - SQL/Python/C over heavyweight frameworks

3. **Accuracy over speed-to-market**
   - Data quality is non-negotiable
   - Rigorous validation at every layer
   - Build trust through reliability

4. **Build over buy**
   - We're not afraid to write code from scratch
   - Third-party tools must earn their place
   - Control our destiny, minimize vendor lock-in

### Technology Stack

**Languages:**
- SQL (primary transformation language)
- Python (orchestration, extraction, APIs)
- C (performance-critical extensions)

**Infrastructure:**
- **Storage:** Cloudflare R2 (not S3)
- **Compute:** Hetzner bare metal (not AWS/GCP)
- **Database:** DuckDB (not Spark/Snowflake)
- **Orchestration:** SQLMesh + custom Python (not Airflow)

**Development:**
- **Monorepo:** uv workspace
- **Package Manager:** uv (not pip/poetry)
- **Version Control:** Git (GitLab)
- **CI/CD:** GitLab CI

### Architectural Philosophy

**Data-Oriented Design:**
- No OOP spaghetti
- Data flows are explicit and traceable
- Functions transform data, not objects with hidden state

**Layered Architecture:**
- Raw → Staging → Cleaned → Serving
- Each layer has a single, clear purpose
- Immutable raw data, reproducible transformations
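
A minimal sketch of the four layers as plain functions over data. It is written in Python for brevity, whereas the real transformations are SQL models. The field names (`country`, `year`, `production_bags`) and all values are made up for illustration, and the "last row wins" de-duplication rule is an assumption.

```python
from typing import NamedTuple


class StagedRow(NamedTuple):
    country: str
    year: int
    production_bags: float


def to_staging(raw_rows: list[dict[str, str]]) -> list[StagedRow]:
    """Staging: cast raw strings to typed columns, nothing else."""
    return [
        StagedRow(r["country"].strip(), int(r["year"]), float(r["production_bags"]))
        for r in raw_rows
    ]


def to_cleaned(staged: list[StagedRow]) -> list[StagedRow]:
    """Cleaned: drop invalid rows, de-duplicate on (country, year)."""
    latest: dict[tuple[str, int], StagedRow] = {}
    for row in staged:
        if row.production_bags >= 0:
            latest[(row.country, row.year)] = row  # later rows replace earlier ones
    return sorted(latest.values(), key=lambda r: (r.country, r.year))


def to_serving(cleaned: list[StagedRow]) -> dict[int, float]:
    """Serving: aggregate into the shape a dashboard or API would read."""
    totals: dict[int, float] = {}
    for row in cleaned:
        totals[row.year] = totals.get(row.year, 0.0) + row.production_bags
    return totals


# Raw layer: an immutable tuple of untyped records (illustrative values only).
RAW = (
    {"country": " Brazil", "year": "2024", "production_bags": "10.0"},
    {"country": "Vietnam", "year": "2024", "production_bags": "5.0"},
    {"country": "Vietnam", "year": "2024", "production_bags": "6.0"},  # revised row
    {"country": "Brazil", "year": "2025", "production_bags": "-1"},    # invalid
)

serving = to_serving(to_cleaned(to_staging(list(RAW))))
print(serving)
```

Each function takes data in and returns new data, so any layer can be rebuilt from the one before it without touching `RAW`.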

**Incremental Everything:**
- Models update incrementally by time ranges
- Avoid full table scans
- Pay only for what changed
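
To make "incremental by time ranges" concrete, here is a small sketch of the bookkeeping involved: given the days already processed, work out which date ranges still need a run. SQLMesh tracks processed intervals itself, so `missing_ranges` is a hypothetical helper that only illustrates the idea.

```python
from datetime import date, timedelta
from typing import Optional


def missing_ranges(
    processed: set[date], start: date, end: date
) -> list[tuple[date, date]]:
    """Return inclusive (first_day, last_day) ranges in [start, end] not yet processed."""
    ranges: list[tuple[date, date]] = []
    run_start: Optional[date] = None
    day = start
    while day <= end:
        if day not in processed:
            if run_start is None:
                run_start = day  # a new gap begins here
        elif run_start is not None:
            ranges.append((run_start, day - timedelta(days=1)))  # gap ended yesterday
            run_start = None
        day += timedelta(days=1)
    if run_start is not None:
        ranges.append((run_start, end))  # gap runs through the end of the window
    return ranges


done = {date(2026, 1, 2), date(2026, 1, 3)}
todo = missing_ranges(done, date(2026, 1, 1), date(2026, 1, 5))
print(todo)
```

Only the two gaps (1 January, and 4 to 5 January) would be recomputed. The days already processed are left alone.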

## Current State (February 2026)

### What's Shipped
- USDA PSD Online extraction + full SQLMesh pipeline (raw→staging→cleaned→serving)
- CFTC COT disaggregated futures: weekly positioning, COT index, managed money net
- KC=F Coffee C futures prices: daily OHLCV, 20d/50d SMA, 52-week range (1971–present)
- ICE certified warehouse stocks: extractor ready, awaiting URL confirmation
- Web app (Quart + HTMX): dashboard with supply/demand + COT + price + ICE charts
- REST API with key auth + rate limiting: `/metrics`, `/positioning`, `/prices`, `/stocks`
- Paddle billing (Starter/Pro plans), magic-link auth, admin panel
- /methodology page with full data source documentation
- Automated supervisor: all extractors + webhook alerting on failure
- 23 passing tests, GitLab CI pipeline
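
For readers new to the positioning metrics above, a sketch of the common definitions: managed money net is longs minus shorts, and the COT index places the latest net position within its min-max range over a lookback window, scaled to 0-100. This document does not state the lookback or the exact formula used in the pipeline, so the 26-week default and the function names here are assumptions.

```python
from typing import Optional


def managed_money_net(long_positions: int, short_positions: int) -> int:
    """Net managed-money position: longs minus shorts."""
    return long_positions - short_positions


def cot_index(net_history: list[int], lookback: int = 26) -> Optional[float]:
    """Latest net position as a 0-100 percentile of its lookback range.

    Returns None when there is too little history or the range is flat.
    """
    window = net_history[-lookback:]
    if len(window) < 2:
        return None
    low, high = min(window), max(window)
    if high == low:
        return None  # flat window: the index is undefined
    return 100.0 * (window[-1] - low) / (high - low)


# Illustrative weekly (long, short) pairs, not real CFTC figures.
weekly = [(100, 50), (120, 40), (90, 60)]
nets = [managed_money_net(long_, short_) for long_, short_ in weekly]
print(nets, cot_index(nets))
```

A reading near 0 means the latest net position is at the low end of its recent range, and a reading near 100 means it is at the high end.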

### What's Missing
- ICE stocks live data: the report URL still needs manual discovery at theice.com/report-center, and the backfill has yet to run
- Python SDK
- Public API documentation

## Roadmap

### Phase 1: Coffee Market Foundation (core complete, ready for outreach)
**Goal:** Build complete coffee analytics from supply to price

**Data Sources:**
- ✅ USDA PSD Online (production, stocks, consumption)
- ✅ CFTC COT data (trader positioning, COT index)
- ✅ KC=F Coffee futures prices (daily OHLCV, moving averages)
- ✅ ICE warehouse stocks (extractor built, seed models deployed; live data pending URL confirmation)
- ⬜ ICO (International Coffee Organization) — future
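
The price-derived series are simple to state precisely. Below is a sketch of a simple moving average and of where the latest close sits in its 52-week range. It assumes 252 trading days per year and uses closes only, whereas a production version would more likely use daily highs and lows. The function names are hypothetical.

```python
from typing import Optional


def sma(closes: list[float], window: int) -> list[Optional[float]]:
    """Simple moving average; None until a full window of closes is available."""
    out: list[Optional[float]] = []
    running = 0.0
    for i, close in enumerate(closes):
        running += close
        if i >= window:
            running -= closes[i - window]  # drop the close that left the window
        out.append(running / window if i >= window - 1 else None)
    return out


def range_position(
    closes: list[float], lookback: int = 252
) -> tuple[float, float, Optional[float]]:
    """Return (low, high, position), where position is the latest close in 0-1."""
    window = closes[-lookback:]
    if not window:
        raise ValueError("need at least one close")
    low, high = min(window), max(window)
    if high == low:
        return low, high, None  # flat range: position is undefined
    return low, high, (window[-1] - low) / (high - low)


# Illustrative closes only, not real KC=F prices.
print(sma([1.0, 2.0, 3.0, 4.0, 5.0], 3))
print(range_position([10.0, 20.0, 15.0]))
```

The 20-day and 50-day averages are the same function called with `window=20` and `window=50`.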

**Features:**
- ✅ Dashboard: supply/demand + COT + price + ICE warehouse charts
- ✅ REST API: all 4 data sources
- ✅ Data methodology page
- ✅ Automated daily pipeline with alerting
- ⬜ Python SDK
- ⬜ Historical correlation analysis

**Infrastructure:**
- ✅ Supervisor loop with all extractors
- ⬜ Move to Cloudflare R2 for raw data backup
- ⬜ Deploy to Hetzner production

### Phase 2: Product Market Fit
**Goal:** Validate with real traders, iterate on feedback

- ⬜ Beta access for small group of coffee traders
- ⬜ Usage analytics (what queries matter?)
- ⬜ Performance optimization based on real workloads
- ⬜ Pricing model experimentation ($X/month, pay-as-you-go?)

### Phase 3: Expand Commodity Coverage
**Goal:** Prove architecture scales across commodities

**Priority Markets:**
1. Other softs (cocoa, sugar, cotton, OJ)
2. Grains (corn, wheat, soybeans)
3. Energy (crude oil, natural gas)
4. Metals (gold, silver, copper)

**Reusable Patterns:**
- Abstract extraction logic (API connectors, scrapers)
- Standardized staging layer for price/volume data
- Common serving models (time series, correlations, anomalies)

### Phase 4: Advanced Analytics
**Goal:** Differentiation through unique insights

- ⬜ Satellite imagery integration (NASA, Planet) for crop monitoring
- ⬜ Custom yield forecasting models
- ⬜ Real-time alert system (price thresholds, supply shocks)
- ⬜ Historical backtesting framework for trading strategies
- ⬜ Sentiment analysis from news/reports (USDA GAIN, FAO)

### Phase 5: Scale & Polish
**Goal:** Handle growth, maintain performance advantage

- ⬜ Multi-region deployment (low latency globally)
- ⬜ Advanced caching strategies
- ⬜ Self-service onboarding (no sales calls)
- ⬜ Public documentation and API reference
- ⬜ Community/forum for traders

## Key Decisions & Trade-offs

### Why DuckDB over Spark?
- **Speed:** In-process OLAP is faster for our workloads
- **Simplicity:** No cluster management, no JVM
- **Cost:** Runs on a single beefy server, not 100 nodes
- **Developer experience:** SQL-first, Python-friendly

### Why SQLMesh over dbt/Airflow?
- **Unified:** Orchestration + transformation in one tool
- **Performance:** Built for incremental execution
- **Virtual environments:** Test changes without breaking prod
- **Python-native:** Extend with custom macros

### Why Cloudflare R2 over S3?
- **Cost:** No egress fees (huge for data-heavy platform)
- **Performance:** Global edge network
- **Simplicity:** S3-compatible API, easy migration path

### Why Hetzner over AWS?
- **Cost:** 10x cheaper for equivalent compute
- **Performance:** Bare metal = no noisy neighbors
- **Simplicity:** Less surface area, fewer services to manage

### Why Monorepo?
- **Atomic changes:** Update extraction + transformation together
- **Shared code:** Reusable utilities across packages
- **Simplified CI:** One pipeline, consistent tooling

## Anti-Goals

Things we explicitly do NOT want:

- ❌ Enterprise sales team
- ❌ Complex onboarding processes
- ❌ Vendor lock-in (AWS, Snowflake, etc.)
- ❌ OOP frameworks (Django ORM, SQLAlchemy magic)
- ❌ Microservices (until we need them, which is not now)
- ❌ Kubernetes (overkill for our scale)
- ❌ Feature bloat (every feature has a performance cost)

## Success Metrics

**Phase 1 (Foundation):**
- All coffee data sources integrated
- Daily pipeline runs reliably (<5% failure rate)
- Query latency <500ms for common analytics

**Phase 2 (PMF):**
- 10+ paying beta users
- 90%+ data accuracy (validated against spot checks)
- Monthly churn <10%

**Phase 3 (Expansion):**
- 5+ commodity markets covered
- 100+ active users
- Break-even on infrastructure costs

**Long-term (Scale):**
- Cover all ~35-40 major commodity contracts
- 1000+ traders using the platform
- Recognized as the go-to alternative to Kpler for indie traders

## Guiding Questions

When making decisions, ask:

1. **Does this make us faster?** (Performance)
2. **Does this make us more accurate?** (Data quality)
3. **Does this make us simpler?** (Maintainability)
4. **Does this help traders make better decisions?** (Value)
5. **Can we afford to run this at scale?** (Unit economics)

If the answer to any of these is "no," reconsider.

## Current Priorities (Q1 2026)

**Goal: Complete Phase 1 "whole product" and start beachhead outreach**

### Immediate (ship first):
1. **CFTC COT data** — extract weekly positioning data (CFTC code 083731), add to SQLMesh pipeline, expose via API. Completes the "USDA + CFTC" V1 promise from the strategy doc.
2. **Coffee futures price (KC=F)** — daily close via yfinance or Databento. Enables price/supply correlation in the dashboard. Core hook for trader interest.
3. **Data methodology page** — transparent docs for every field, every source, lineage. The #1 trust driver per the strategy doc. Required before outreach.
4. **Python SDK** (`pip install beanflows`) — one-line data access for quant analysts. The beachhead segment runs Python; this removes their biggest switching friction.

### Then (before Series A of customers):
5. **Automated daily pipeline** on Hetzner — cron + SQLMesh prod, with failure alerting
6. **Cloudflare R2** raw data backup + pipeline source
7. **Example Jupyter notebooks** — show before/after vs. manual WASDE workflow
8. **ICE warehouse stocks** — daily certified Arabica/Robusta inventory data (free from ICE Report Center)

### Business (parallel, not blocking):
- Start direct outreach to 20–30 named analysts at mid-size commodity funds
- Weekly "BeanFlows Coffee Data Brief" newsletter (content marketing + credibility signal)
- Identify 1–2 early beta users willing to give feedback

---

**Last Updated:** February 2026
**Next Review:** End of Q1 2026