VISION.md
Mission
Build the fastest, most accurate, and most affordable commodity analytics platform for independent traders and small firms—without enterprise sales bullshit.
Product: BeanFlows.coffee
Tagline: Real-time commodity intelligence for traders who think for themselves.
Beachhead Market: Coffee commodities
Long-term Vision: Expand to all major commodity markets (~35-40 global contracts)
Why We Exist
Platforms like Kpler dominate the commodity analytics space but are:
- Slow and complex
- Prohibitively expensive
- Designed for enterprise buyers with bloated sales processes
- Built on legacy infrastructure that prioritizes features over performance
We're building the anti-Kpler: better, faster, cheaper.
Who We Are
A two-person indie hacker startup:
- Data Engineer: Building the platform
- Commodity Trader: Domain expertise and product direction
We move fast, ship incrementally, and prioritize value over vanity metrics.
Technical Philosophy
Core Principles
- Simplicity over complexity
- Minimal dependencies
- Clear, readable code
- Avoid premature abstraction
- Performance over features
- DuckDB over Spark
- Hetzner/Cloudflare over AWS
- SQL/Python/C over heavyweight frameworks
- Accuracy over speed-to-market
- Data quality is non-negotiable
- Rigorous validation at every layer
- Build trust through reliability
- Build over buy
- We're not afraid to write code from scratch
- Third-party tools must earn their place
- Control our destiny, minimize vendor lock-in
Technology Stack
Languages:
- SQL (primary transformation language)
- Python (orchestration, extraction, APIs)
- C (performance-critical extensions)
Infrastructure:
- Storage: Cloudflare R2 (not S3)
- Compute: Hetzner bare metal (not AWS/GCP)
- Database: DuckDB (not Spark/Snowflake)
- Orchestration: SQLMesh + custom Python (not Airflow)
Development:
- Monorepo: uv workspace
- Package Manager: uv (not pip/poetry)
- Version Control: Git (GitLab)
- CI/CD: GitLab CI
Architectural Philosophy
Data-Oriented Design:
- No OOP spaghetti
- Data flows are explicit and traceable
- Functions transform data, not objects with hidden state
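In practice this means pure functions over plain records. A minimal sketch of the style (the `PriceRow` shape is illustrative, not the real schema):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PriceRow:
    """One daily price observation (illustrative, not the real schema)."""
    symbol: str
    close: float


def to_usd_cents(row: PriceRow) -> PriceRow:
    """Pure transformation: returns a new row, never mutates the input."""
    return replace(row, close=row.close * 100.0)


row = PriceRow(symbol="KC=F", close=3.5)
converted = to_usd_cents(row)
assert row.close == 3.5           # input untouched
assert converted.close == 350.0   # new value in a new row
```

No hidden state, no mutation: the data flow is visible at the call site.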
Layered Architecture:
- Raw → Staging → Cleaned → Serving
- Each layer has a single, clear purpose
- Immutable raw data, reproducible transformations
Incremental Everything:
- Models update incrementally by time ranges
- Avoid full table scans
- Pay only for what changed
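In SQLMesh terms this is an incremental-by-time-range model; stripped of the framework, the idea reduces to filtering on a watermark. A toy sketch with made-up data:

```python
from datetime import date

# Toy table of daily rows, keyed by date (illustrative data).
rows = {
    date(2026, 2, 1): 100,
    date(2026, 2, 2): 105,
    date(2026, 2, 3): 103,
}


def incremental_load(rows, watermark):
    """Return only rows strictly newer than the last processed date."""
    return {d: v for d, v in rows.items() if watermark is None or d > watermark}


# First run: no watermark, everything is "new".
assert len(incremental_load(rows, None)) == 3
# Second run: only the row after Feb 2 is reprocessed.
assert incremental_load(rows, date(2026, 2, 2)) == {date(2026, 2, 3): 103}
```

Each run touches only the changed time range, never the full table.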
Current State (February 2026)
What's Shipped
- USDA PSD Online extraction + full SQLMesh pipeline (raw→staging→cleaned→serving)
- CFTC COT disaggregated futures: weekly positioning, COT index, managed money net
- KC=F Coffee C futures prices: daily OHLCV, 20d/50d SMA, 52-week range (1971–present)
- ICE certified warehouse stocks: extractor ready, awaiting URL confirmation
- Web app (Quart + HTMX): dashboard with supply/demand + COT + price + ICE charts
- REST API with key auth + rate limiting: /metrics, /positioning, /prices, /stocks
- Paddle billing (Starter/Pro plans), magic-link auth, admin panel
- /methodology page with full data source documentation
- Automated supervisor: all extractors + webhook alerting on failure
- 23 passing tests, GitLab CI pipeline
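For reference, the 20d/50d SMAs above are plain rolling means. A dependency-free sketch (window of 3 for brevity; the real models use 20 and 50):

```python
def sma(closes, window):
    """Simple moving average; None until the window is full."""
    out = []
    for i in range(len(closes)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(closes[i + 1 - window : i + 1]) / window)
    return out


closes = [10.0, 11.0, 12.0, 13.0, 14.0]
assert sma(closes, 3) == [None, None, 11.0, 12.0, 13.0]
```

In production this lives in SQL as a window function; the Python version is just the smallest statement of the math.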
What's Missing
- ICE stocks backfill — blocked until the report URL is confirmed (needs manual discovery at theice.com/report-center)
- Python SDK
- Public API documentation
Roadmap
Phase 1: Coffee Market Foundation (COMPLETE — ready for outreach)
Goal: Build complete coffee analytics from supply to price
Data Sources:
- ✅ USDA PSD Online (production, stocks, consumption)
- ✅ CFTC COT data (trader positioning, COT index)
- ✅ KC=F Coffee futures prices (daily OHLCV, moving averages)
- ✅ ICE warehouse stocks (extractor built, seed models deployed)
- ⬜ ICO (International Coffee Organization) — future
Features:
- ✅ Dashboard: supply/demand + COT + price + ICE warehouse charts
- ✅ REST API: all 4 data sources
- ✅ Data methodology page
- ✅ Automated daily pipeline with alerting
- ⬜ Python SDK
- ⬜ Historical correlation analysis
Infrastructure:
- ✅ Supervisor loop with all extractors
- ⬜ Move to Cloudflare R2 for raw data backup
- ⬜ Deploy to Hetzner production
Phase 2: Product Market Fit
Goal: Validate with real traders, iterate on feedback
- ⬜ Beta access for small group of coffee traders
- ⬜ Usage analytics (what queries matter?)
- ⬜ Performance optimization based on real workloads
- ⬜ Pricing model experimentation ($X/month, pay-as-you-go?)
Phase 3: Expand Commodity Coverage
Goal: Prove architecture scales across commodities
Priority Markets:
- Other softs (cocoa, sugar, cotton, OJ)
- Grains (corn, wheat, soybeans)
- Energy (crude oil, natural gas)
- Metals (gold, silver, copper)
Reusable Patterns:
- Abstract extraction logic (API connectors, scrapers)
- Standardized staging layer for price/volume data
- Common serving models (time series, correlations, anomalies)
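One way to express that shared extraction contract is a Python Protocol; the names here are hypothetical, not the actual package layout:

```python
from typing import Iterator, Protocol


class Extractor(Protocol):
    """Common contract every source connector satisfies (illustrative)."""
    source: str

    def extract(self) -> Iterator[dict]: ...


class UsdaPsdExtractor:
    """One concrete connector; real ones would hit an API or scrape."""
    source = "usda_psd"

    def extract(self):
        yield {"commodity": "coffee", "metric": "production", "value": 170.0}


def run(extractors):
    """The supervisor loop only knows the shared contract, not the sources."""
    return {e.source: list(e.extract()) for e in extractors}


result = run([UsdaPsdExtractor()])
assert result["usda_psd"][0]["commodity"] == "coffee"
```

Adding a commodity then means writing one new connector, not touching the pipeline.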
Phase 4: Advanced Analytics
Goal: Differentiation through unique insights
- ⬜ Satellite imagery integration (NASA, Planet) for crop monitoring
- ⬜ Custom yield forecasting models
- ⬜ Real-time alert system (price thresholds, supply shocks)
- ⬜ Historical backtesting framework for trading strategies
- ⬜ Sentiment analysis from news/reports (USDA GAIN, FAO)
Phase 5: Scale & Polish
Goal: Handle growth, maintain performance advantage
- ⬜ Multi-region deployment (low latency globally)
- ⬜ Advanced caching strategies
- ⬜ Self-service onboarding (no sales calls)
- ⬜ Public documentation and API reference
- ⬜ Community/forum for traders
Key Decisions & Trade-offs
Why DuckDB over Spark?
- Speed: In-process OLAP is faster for our workloads
- Simplicity: No cluster management, no JVM
- Cost: Runs on a single beefy server, not 100 nodes
- Developer experience: SQL-first, Python-friendly
Why SQLMesh over dbt/Airflow?
- Unified: Orchestration + transformation in one tool
- Performance: Built for incremental execution
- Virtual environments: Test changes without breaking prod
- Python-native: Extend with custom macros
Why Cloudflare R2 over S3?
- Cost: No egress fees (huge for data-heavy platform)
- Performance: Global edge network
- Simplicity: S3-compatible API, easy migration path
Why Hetzner over AWS?
- Cost: 10x cheaper for equivalent compute
- Performance: Bare metal = no noisy neighbors
- Simplicity: Less surface area, fewer services to manage
Why Monorepo?
- Atomic changes: Update extraction + transformation together
- Shared code: Reusable utilities across packages
- Simplified CI: One pipeline, consistent tooling
Anti-Goals
Things we explicitly do NOT want:
- ❌ Enterprise sales team
- ❌ Complex onboarding processes
- ❌ Vendor lock-in (AWS, Snowflake, etc.)
- ❌ OOP frameworks (Django ORM, SQLAlchemy magic)
- ❌ Microservices (until we need them, which is not now)
- ❌ Kubernetes (overkill for our scale)
- ❌ Feature bloat (every feature has a performance cost)
Success Metrics
Phase 1 (Foundation):
- All coffee data sources integrated
- Daily pipeline runs reliably (<5% failure rate)
- Query latency <500ms for common analytics
Phase 2 (PMF):
- 10+ paying beta users
- 90%+ data accuracy (validated against spot checks)
- Monthly churn <10%
Phase 3 (Expansion):
- 5+ commodity markets covered
- 100+ active users
- Break-even on infrastructure costs
Long-term (Scale):
- Cover all ~35-40 major commodity contracts
- 1000+ traders using the platform
- Recognized as the go-to alternative to Kpler for indie traders
Guiding Questions
When making decisions, ask:
- Does this make us faster? (Performance)
- Does this make us more accurate? (Data quality)
- Does this make us simpler? (Maintainability)
- Does this help traders make better decisions? (Value)
- Can we afford to run this at scale? (Unit economics)
If the answer to any of these is "no," reconsider.
Current Priorities (Q1 2026)
Goal: Complete Phase 1 "whole product" and start beachhead outreach
Immediate (ship first):
- CFTC COT data — extract weekly positioning data (CFTC code 083731), add to SQLMesh pipeline, expose via API. Completes the "USDA + CFTC" V1 promise from the strategy doc.
- Coffee futures price (KC=F) — daily close via yfinance or Databento. Enables price/supply correlation in the dashboard. Core hook for trader interest.
- Data methodology page — transparent docs for every field, every source, lineage. The #1 trust driver per the strategy doc. Required before outreach.
- Python SDK (pip install beanflows) — one-line data access for quant analysts. The beachhead segment runs Python; this removes their biggest switching friction.
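The SDK doesn't exist yet, but the envisioned one-line access could look something like this (hypothetical client and base URL; only URL construction shown, no network call):

```python
from urllib.parse import urlencode


class BeanFlows:
    """Hypothetical SDK client sketch: builds authenticated REST calls."""

    BASE = "https://api.beanflows.coffee"  # assumed host, not confirmed

    def __init__(self, api_key: str):
        self.api_key = api_key  # sent as a header in the real request

    def prices_url(self, symbol: str, start: str) -> str:
        """Return the request URL for the /prices endpoint."""
        query = urlencode({"symbol": symbol, "start": start})
        return f"{self.BASE}/prices?{query}"


client = BeanFlows(api_key="test-key")
url = client.prices_url("KC=F", "2026-01-01")
assert url == "https://api.beanflows.coffee/prices?symbol=KC%3DF&start=2026-01-01"
```

The point is ergonomics: one import, one constructor, one call per endpoint.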
Then (before Series A of customers):
- Automated daily pipeline on Hetzner — cron + SQLMesh prod, with failure alerting
- Cloudflare R2 raw data backup + pipeline source
- Example Jupyter notebooks — show before/after vs. manual WASDE workflow
- ICE warehouse stocks — daily certified Arabica/Robusta inventory data (free from ICE Report Center)
Business (parallel, not blocking):
- Start direct outreach to 20–30 named analysts at mid-size commodity funds
- Weekly "BeanFlows Coffee Data Brief" newsletter (content marketing + credibility signal)
- Identify 1–2 early beta users willing to give feedback
Last Updated: February 2026
Next Review: End of Q1 2026