Deeman 3d3f375e01 Merge worktree-cot-integration: Phase 1 + scout MCP server
- Phase 1A-C: KC=F price extraction, SQLMesh models, dashboard charts, API endpoints
- ICE warehouse stocks: extraction package, SQLMesh models, dashboard + API
- Methodology page (/methodology) with all data sources documented
- Supervisor pipeline automation with webhook alerting
- Scout MCP server (tools/scout/) for browser recon via Pydoll
- msgspec added as workspace dependency for typed boundary structs
- vision.md updated to reflect Phase 1 completion (Feb 2026)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-21 15:57:49 +01:00

# VISION.md
## Mission
Build the fastest, most accurate, and most affordable commodity analytics platform for independent traders and small firms—without enterprise sales bullshit.
## Product: BeanFlows.coffee
**Tagline:** Real-time commodity intelligence for traders who think for themselves.
**Beachhead Market:** Coffee commodities
**Long-term Vision:** Expand to all major commodity markets (~35-40 global contracts)
## Why We Exist
Platforms like Kpler dominate the commodity analytics space but are:
- Slow and complex
- Prohibitively expensive
- Designed for enterprise buyers with bloated sales processes
- Built on legacy infrastructure that prioritizes features over performance
We're building the anti-Kpler: **better, faster, cheaper**.
## Who We Are
A two-person indie hacker startup:
- **Data Engineer:** Building the platform
- **Commodity Trader:** Domain expertise and product direction
We move fast, ship incrementally, and prioritize value over vanity metrics.
## Technical Philosophy
### Core Principles
1. **Simplicity over complexity**
- Minimal dependencies
- Clear, readable code
- Avoid premature abstraction
2. **Performance over features**
- DuckDB over Spark
- Hetzner/Cloudflare over AWS
- SQL/Python/C over heavyweight frameworks
3. **Accuracy over speed-to-market**
- Data quality is non-negotiable
- Rigorous validation at every layer
- Build trust through reliability
4. **Build over buy**
- We're not afraid to write code from scratch
- Third-party tools must earn their place
- Control our destiny, minimize vendor lock-in
### Technology Stack
**Languages:**
- SQL (primary transformation language)
- Python (orchestration, extraction, APIs)
- C (performance-critical extensions)
**Infrastructure:**
- **Storage:** Cloudflare R2 (not S3)
- **Compute:** Hetzner bare metal (not AWS/GCP)
- **Database:** DuckDB (not Spark/Snowflake)
- **Orchestration:** SQLMesh + custom Python (not Airflow)
**Development:**
- **Monorepo:** uv workspace
- **Package Manager:** uv (not pip/poetry)
- **Version Control:** Git (GitLab)
- **CI/CD:** GitLab CI
### Architectural Philosophy
**Data-Oriented Design:**
- No OOP spaghetti
- Data flows are explicit and traceable
- Functions transform data, not objects with hidden state
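As a rough illustration of the data-oriented style described above, the sketch below keeps the data in a plain immutable record and puts the logic in a pure function. `PriceBar` and `daily_returns` are hypothetical names chosen for this example, not identifiers from the codebase.

```python
from typing import NamedTuple


class PriceBar(NamedTuple):
    """Plain, immutable record: data only, no behavior."""
    day: str      # ISO date, e.g. "2026-02-20"
    close: float  # closing price


def daily_returns(bars: list[PriceBar]) -> list[float]:
    """Pure function: bars in, simple returns out, nothing mutated."""
    return [
        (curr.close - prev.close) / prev.close
        for prev, curr in zip(bars, bars[1:])
    ]


# Illustrative values only, not real market data.
bars = [
    PriceBar("2026-02-18", 200.0),
    PriceBar("2026-02-19", 210.0),
    PriceBar("2026-02-20", 199.5),
]
returns = daily_returns(bars)
```

Because the function depends only on its arguments, it can be tested with a handful of literal rows and no setup.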
**Layered Architecture:**
- Raw → Staging → Cleaned → Serving
- Each layer has a single, clear purpose
- Immutable raw data, reproducible transformations
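The layering can be sketched in a few lines of Python. The real pipeline expresses these layers as SQLMesh models, so the snippet below only illustrates the idea; the row shape, function names, and cleaning rules are assumptions made for the example.

```python
# Hypothetical raw rows, kept exactly as extracted (strings, duplicates and all).
RAW = (
    {"date": "2026-02-19", "close": "210.0"},
    {"date": "2026-02-18", "close": "200.0"},
    {"date": "2026-02-19", "close": "210.0"},  # duplicate delivery
    {"date": "2026-02-20", "close": ""},       # missing value
)


def to_staging(raw):
    """Staging: cast types only; empty strings become None."""
    return [
        {"date": r["date"], "close": float(r["close"]) if r["close"] else None}
        for r in raw
    ]


def to_cleaned(staged):
    """Cleaned: drop rows without a price, de-duplicate by date, sort by date."""
    by_date = {r["date"]: r for r in staged if r["close"] is not None}
    return [by_date[d] for d in sorted(by_date)]


def to_serving(cleaned):
    """Serving: shape the data for one consumer, here the latest close."""
    latest = cleaned[-1]
    return {"latest_date": latest["date"], "latest_close": latest["close"]}


serving = to_serving(to_cleaned(to_staging(RAW)))
```

Each function reads the previous layer and returns a new structure, so `RAW` is never modified and any layer can be rebuilt from the one before it.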
**Incremental Everything:**
- Models update incrementally by time ranges
- Avoid full table scans
- Pay only for what changed
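To make "incrementally by time ranges" concrete, here is a minimal sketch of the bookkeeping involved: given the days already processed, work out which contiguous ranges still need a run. SQLMesh tracks intervals itself, so this is not how the tool is implemented; `missing_days` and `to_ranges` are hypothetical helpers.

```python
from datetime import date, timedelta


def missing_days(start: date, end: date, processed: set[date]) -> list[date]:
    """Days in the inclusive range [start, end] that are not yet processed."""
    all_days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    return [d for d in all_days if d not in processed]


def to_ranges(days: list[date]) -> list[tuple[date, date]]:
    """Collapse ascending days into contiguous (first, last) ranges."""
    ranges: list[tuple[date, date]] = []
    for d in days:
        if ranges and d - ranges[-1][1] == timedelta(days=1):
            ranges[-1] = (ranges[-1][0], d)  # extend the current range
        else:
            ranges.append((d, d))            # start a new range
    return ranges


processed = {date(2026, 2, 1), date(2026, 2, 2), date(2026, 2, 4)}
todo = to_ranges(missing_days(date(2026, 2, 1), date(2026, 2, 6), processed))
```

Here `todo` holds two ranges, 3 February alone and 5 to 6 February, so only those slices would be recomputed.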
## Current State (February 2026)
### What's Shipped
- USDA PSD Online extraction + full SQLMesh pipeline (raw→staging→cleaned→serving)
- CFTC COT disaggregated futures: weekly positioning, COT index, managed money net
- KC=F Coffee C futures prices: daily OHLCV, 20d/50d SMA, 52-week range (1971–present)
- ICE certified warehouse stocks: extractor ready, awaiting URL confirmation
- Web app (Quart + HTMX): dashboard with supply/demand + COT + price + ICE charts
- REST API with key auth + rate limiting: /metrics, /positioning, /prices, /stocks
- Paddle billing (Starter/Pro plans), magic-link auth, admin panel
- /methodology page with full data source documentation
- Automated supervisor: all extractors + webhook alerting on failure
- 23 passing tests, GitLab CI pipeline
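Two of the derived series listed above, the moving averages and the COT index, are simple enough to sketch. The shipped versions are SQLMesh models, and this document does not spell out their formulas, so the code below uses one common definition of each. The lookback length and the choice of net position (for example managed money long minus short) are assumptions.

```python
from typing import Optional


def sma(values: list[float], window: int) -> list[Optional[float]]:
    """Simple moving average; None until a full window is available."""
    out: list[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def cot_index(net_positions: list[float], lookback: int) -> Optional[float]:
    """Latest net position scaled to 0-100 within its lookback min/max range."""
    if len(net_positions) < lookback:
        return None  # not enough history yet
    window = net_positions[-lookback:]
    low, high = min(window), max(window)
    if high == low:
        return None  # flat range: the index is undefined
    return 100.0 * (window[-1] - low) / (high - low)
```

A reading near 100 means positioning sits at the top of its lookback range, and near 0 at the bottom. With `lookback=4` and net positions `[10, -5, 20, 5]`, the latest value 5 lies 10 units above the low of -5 in a range of 25, giving an index of 40.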
### What's Missing
- ICE stocks: confirm the report URL and start the backfill (the URL needs manual discovery at theice.com/report-center)
- Python SDK
- Public API documentation
## Roadmap
### Phase 1: Coffee Market Foundation (COMPLETE — ready for outreach)
**Goal:** Build complete coffee analytics from supply to price
**Data Sources:**
- ✅ USDA PSD Online (production, stocks, consumption)
- ✅ CFTC COT data (trader positioning, COT index)
- ✅ KC=F Coffee futures prices (daily OHLCV, moving averages)
- ✅ ICE warehouse stocks (extractor built, seed models deployed)
- ⬜ ICO (International Coffee Organization) — future
**Features:**
- ✅ Dashboard: supply/demand + COT + price + ICE warehouse charts
- ✅ REST API: all 4 data sources
- ✅ Data methodology page
- ✅ Automated daily pipeline with alerting
- ⬜ Python SDK
- ⬜ Historical correlation analysis
**Infrastructure:**
- ✅ Supervisor loop with all extractors
- ⬜ Move to Cloudflare R2 for raw data backup
- ⬜ Deploy to Hetzner production
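One way to picture the supervisor loop with failure alerting is the sketch below. It is not the project's supervisor: the function names, the extractor names in the usage lines, and the webhook payload schema are all hypothetical, and the `send` callable is injected so the example runs without a network.

```python
import json
from typing import Callable


def run_extractors(extractors: dict[str, Callable[[], None]]) -> dict[str, str]:
    """Run each extractor; return {name: error message} for those that raised."""
    failures: dict[str, str] = {}
    for name, extract in extractors.items():
        try:
            extract()
        except Exception as exc:  # one bad source must not stop the rest
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


def alert_payload(failures: dict[str, str]) -> bytes:
    """Build a JSON body for a webhook; this schema is an assumption."""
    lines = [f"{name}: {err}" for name, err in sorted(failures.items())]
    return json.dumps({"text": "Extractor failures\n" + "\n".join(lines)}).encode()


def supervise(
    extractors: dict[str, Callable[[], None]],
    send: Callable[[bytes], None],
) -> dict[str, str]:
    """One supervisor pass: run everything, alert only if something failed."""
    failures = run_extractors(extractors)
    if failures:
        send(alert_payload(failures))
    return failures


def ok() -> None:
    pass


def broken() -> None:
    raise ValueError("bad URL")


sent: list[bytes] = []
failures = supervise({"usda_psd": ok, "ice_stocks": broken}, sent.append)
```

In a real deployment `send` would POST the body to the webhook URL. Passing it in as a parameter keeps the supervisor logic testable on its own.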
### Phase 2: Product Market Fit
**Goal:** Validate with real traders, iterate on feedback
- ⬜ Beta access for small group of coffee traders
- ⬜ Usage analytics (what queries matter?)
- ⬜ Performance optimization based on real workloads
- ⬜ Pricing model experimentation ($X/month, pay-as-you-go?)
### Phase 3: Expand Commodity Coverage
**Goal:** Prove architecture scales across commodities
**Priority Markets:**
1. Other softs (cocoa, sugar, cotton, OJ)
2. Grains (corn, wheat, soybeans)
3. Energy (crude oil, natural gas)
4. Metals (gold, silver, copper)
**Reusable Patterns:**
- Abstract extraction logic (API connectors, scrapers)
- Standardized staging layer for price/volume data
- Common serving models (time series, correlations, anomalies)
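A standardized staging layer could start from something as small as a per-source field mapping onto one shared schema. The sketch below assumes two made-up sources and a three-column price/volume schema. None of these names come from the codebase.

```python
# Hypothetical per-source field names mapped onto one shared schema.
FIELD_MAPS = {
    "source_a": {"Date": "date", "Close": "close", "Volume": "volume"},
    "source_b": {"trade_date": "date", "settle": "close", "vol": "volume"},
}


def standardize(source: str, row: dict) -> dict:
    """Rename source-specific fields and cast them to the shared types."""
    mapping = FIELD_MAPS[source]
    renamed = {target: row[src] for src, target in mapping.items()}
    return {
        "source": source,
        "date": str(renamed["date"]),
        "close": float(renamed["close"]),
        "volume": int(renamed["volume"]),
    }


row = standardize(
    "source_b",
    {"trade_date": "2026-02-20", "settle": "199.5", "vol": "1200"},
)
```

Adding a commodity or data vendor then means adding one mapping entry rather than a new staging code path.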
### Phase 4: Advanced Analytics
**Goal:** Differentiation through unique insights
- ⬜ Satellite imagery integration (NASA, Planet) for crop monitoring
- ⬜ Custom yield forecasting models
- ⬜ Real-time alert system (price thresholds, supply shocks)
- ⬜ Historical backtesting framework for trading strategies
- ⬜ Sentiment analysis from news/reports (USDA GAIN, FAO)
### Phase 5: Scale & Polish
**Goal:** Handle growth, maintain performance advantage
- ⬜ Multi-region deployment (low latency globally)
- ⬜ Advanced caching strategies
- ⬜ Self-service onboarding (no sales calls)
- ⬜ Public documentation and API reference
- ⬜ Community/forum for traders
## Key Decisions & Trade-offs
### Why DuckDB over Spark?
- **Speed:** In-process OLAP is faster for our workloads
- **Simplicity:** No cluster management, no JVM
- **Cost:** Runs on a single beefy server, not 100 nodes
- **Developer experience:** SQL-first, Python-friendly
### Why SQLMesh over dbt/Airflow?
- **Unified:** Orchestration + transformation in one tool
- **Performance:** Built for incremental execution
- **Virtual environments:** Test changes without breaking prod
- **Python-native:** Extend with custom macros
### Why Cloudflare R2 over S3?
- **Cost:** No egress fees (huge for data-heavy platform)
- **Performance:** Global edge network
- **Simplicity:** S3-compatible API, easy migration path
### Why Hetzner over AWS?
- **Cost:** 10x cheaper for equivalent compute
- **Performance:** Bare metal = no noisy neighbors
- **Simplicity:** Less surface area, fewer services to manage
### Why Monorepo?
- **Atomic changes:** Update extraction + transformation together
- **Shared code:** Reusable utilities across packages
- **Simplified CI:** One pipeline, consistent tooling
## Anti-Goals
Things we explicitly do NOT want:
- ❌ Enterprise sales team
- ❌ Complex onboarding processes
- ❌ Vendor lock-in (AWS, Snowflake, etc.)
- ❌ OOP frameworks (Django ORM, SQLAlchemy magic)
- ❌ Microservices (until we need them, which is not now)
- ❌ Kubernetes (overkill for our scale)
- ❌ Feature bloat (every feature has a performance cost)
## Success Metrics
**Phase 1 (Foundation):**
- All coffee data sources integrated
- Daily pipeline runs reliably (<5% failure rate)
- Query latency <500ms for common analytics
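The two numeric Phase 1 targets can be checked with a few lines of arithmetic. The sketch below assumes the latency target is read as a 95th percentile (nearest-rank method), which this document does not specify, and the sample values are invented for illustration.

```python
import math


def failure_rate(outcomes: list[bool]) -> float:
    """Fraction of failed runs; each entry is True for a successful run."""
    if not outcomes:
        raise ValueError("no runs recorded")
    return outcomes.count(False) / len(outcomes)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative data: 30 daily runs with one failure, five query timings in ms.
runs = [True] * 29 + [False]
latencies_ms = [120.0, 80.0, 450.0, 300.0, 95.0]

pipeline_ok = failure_rate(runs) < 0.05
latency_ok = percentile(latencies_ms, 95) < 500
```

One failure in 30 runs is about 3.3%, inside the 5% budget, and the slowest sampled query at 450 ms stays under the 500 ms target.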
**Phase 2 (PMF):**
- 10+ paying beta users
- 90%+ data accuracy (validated against spot checks)
- Monthly churn <10%
**Phase 3 (Expansion):**
- 5+ commodity markets covered
- 100+ active users
- Break-even on infrastructure costs
**Long-term (Scale):**
- Cover all ~35-40 major commodity contracts
- 1000+ traders using the platform
- Recognized as the go-to alternative to Kpler for indie traders
## Guiding Questions
When making decisions, ask:
1. **Does this make us faster?** (Performance)
2. **Does this make us more accurate?** (Data quality)
3. **Does this make us simpler?** (Maintainability)
4. **Does this help traders make better decisions?** (Value)
5. **Can we afford to run this at scale?** (Unit economics)
If the answer to any of these is "no," reconsider.
## Current Priorities (Q1 2026)
**Goal: Complete Phase 1 "whole product" and start beachhead outreach**
### Immediate (ship first):
1. **CFTC COT data** — extract weekly positioning data (CFTC code 083731), add to SQLMesh pipeline, expose via API. Completes the "USDA + CFTC" V1 promise from the strategy doc.
2. **Coffee futures price (KC=F)** — daily close via yfinance or Databento. Enables price/supply correlation in the dashboard. Core hook for trader interest.
3. **Data methodology page** — transparent docs for every field, every source, lineage. The #1 trust driver per the strategy doc. Required before outreach.
4. **Python SDK** (`pip install beanflows`) — one-line data access for quant analysts. The beachhead segment runs Python; this removes their biggest switching friction.
### Then (before Series A of customers):
5. **Automated daily pipeline** on Hetzner — cron + SQLMesh prod, with failure alerting
6. **Cloudflare R2** raw data backup + pipeline source
7. **Example Jupyter notebooks** — show before/after vs. manual WASDE workflow
8. **ICE warehouse stocks** — daily certified Arabica/Robusta inventory data (free from ICE Report Center)
### Business (parallel, not blocking):
- Start direct outreach to 20–30 named analysts at mid-size commodity funds
- Weekly "BeanFlows Coffee Data Brief" newsletter (content marketing + credibility signal)
- Identify 12 early beta users willing to give feedback
---
**Last Updated:** February 2026
**Next Review:** End of Q1 2026