# VISION.md

## Mission

Build the fastest, most accurate, and most affordable commodity analytics platform for independent traders and small firms, without enterprise sales bullshit.

## Product: BeanFlows.coffee

**Tagline:** Real-time commodity intelligence for traders who think for themselves.

**Beachhead Market:** Coffee commodities
**Long-term Vision:** Expand to all major commodity markets (~35-40 global contracts)

## Why We Exist

Platforms like Kpler dominate the commodity analytics space but are:
- Slow and complex
- Prohibitively expensive
- Designed for enterprise buyers and sold through bloated sales processes
- Built on legacy infrastructure that prioritizes features over performance

We're building the anti-Kpler: **better, faster, cheaper**.

## Who We Are

A two-person indie hacker startup:
- **Data Engineer:** Building the platform
- **Commodity Trader:** Domain expertise and product direction

We move fast, ship incrementally, and prioritize value over vanity metrics.

## Technical Philosophy

### Core Principles

1. **Simplicity over complexity**
   - Minimal dependencies
   - Clear, readable code
   - Avoid premature abstraction

2. **Performance over features**
   - DuckDB over Spark
   - Hetzner/Cloudflare over AWS
   - SQL/Python/C over heavyweight frameworks

3. **Accuracy over speed-to-market**
   - Data quality is non-negotiable
   - Rigorous validation at every layer
   - Build trust through reliability

4. **Build over buy**
   - We're not afraid to write code from scratch
   - Third-party tools must earn their place
   - Control our destiny, minimize vendor lock-in

### Technology Stack

**Languages:**
- SQL (primary transformation language)
- Python (orchestration, extraction, APIs)
- C (performance-critical extensions)

**Infrastructure:**
- **Storage:** Cloudflare R2 (not S3)
- **Compute:** Hetzner bare metal (not AWS/GCP)
- **Database:** DuckDB (not Spark/Snowflake)
- **Orchestration:** SQLMesh + custom Python (not Airflow)

**Development:**
- **Monorepo:** uv workspace
- **Package Manager:** uv (not pip/poetry)
- **Version Control:** Git (GitLab)
- **CI/CD:** GitLab CI

### Architectural Philosophy

**Data-Oriented Design:**
- No OOP spaghetti
- Data flows are explicit and traceable
- Functions transform data, not objects with hidden state

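As a rough illustration of the data-oriented style, the sketch below cleans a batch of supply records with plain functions over plain rows. The field names (`commodity`, `year`, `production`) and the helper names are hypothetical, chosen only for this example; they are not taken from the actual codebase.

```python
from typing import Iterable

# Hypothetical row shape for illustration: plain dicts, no classes with hidden state.
Row = dict[str, object]


def drop_incomplete(rows: Iterable[Row]) -> list[Row]:
    """Keep only rows where every expected field is present and not None."""
    required = ("commodity", "year", "production")
    return [r for r in rows if all(r.get(k) is not None for k in required)]


def normalize_commodity(rows: Iterable[Row]) -> list[Row]:
    """Return new rows with a trimmed, lower-cased commodity name."""
    return [{**r, "commodity": str(r["commodity"]).strip().lower()} for r in rows]


def clean(rows: Iterable[Row]) -> list[Row]:
    """Explicit data flow: each step takes rows in and returns new rows out."""
    return normalize_commodity(drop_incomplete(rows))


# Made-up sample rows, not real USDA data.
raw_rows = [
    {"commodity": " Coffee ", "year": 2024, "production": 100},
    {"commodity": "Coffee", "year": 2025, "production": None},
]
cleaned_rows = clean(raw_rows)
```

Each function returns new rows instead of mutating its input, so `raw_rows` is unchanged after `clean` runs and every intermediate result can be inspected on its own.
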
**Layered Architecture:**
- Raw → Staging → Cleaned → Serving
- Each layer has a single, clear purpose
- Immutable raw data, reproducible transformations

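One way to picture the four layers is as a chain of functions, each with a single job, where the raw input is never modified. This is a toy sketch, not the project's SQLMesh models: the record fields, the function names, and the sample values are all assumptions made for illustration.

```python
from types import MappingProxyType

# Raw layer: stored exactly as received. A tuple of read-only mappings keeps it immutable.
# The values are made up for this example.
RAW = (
    MappingProxyType({"country": "BR ", "year": "2024", "bags": "66.3"}),
    MappingProxyType({"country": "VN", "year": "2024", "bags": "n/a"}),
)


def to_staging(raw):
    """Staging: rename and cast types, nothing else. Unparseable numbers become None."""
    def as_float(text):
        try:
            return float(text)
        except ValueError:
            return None
    return [
        {"country": r["country"], "year": int(r["year"]), "bags_m": as_float(r["bags"])}
        for r in raw
    ]


def to_cleaned(staged):
    """Cleaned: apply data-quality rules (trim codes, drop rows without a value)."""
    return [
        {**r, "country": r["country"].strip()}
        for r in staged
        if r["bags_m"] is not None
    ]


def to_serving(cleaned):
    """Serving: shape data for consumers, here total bags per year."""
    totals = {}
    for r in cleaned:
        totals[r["year"]] = totals.get(r["year"], 0.0) + r["bags_m"]
    return totals


serving = to_serving(to_cleaned(to_staging(RAW)))
```

Because `RAW` is never changed, rerunning the chain always rebuilds the same serving output, which is the sense in which the transformations are reproducible.
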
**Incremental Everything:**
- Models update incrementally by time ranges
- Avoid full table scans
- Pay only for what changed

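A common way to update by time range is to delete the target rows for one window and re-insert that window from the source, keeping only the latest version of each key. The sketch below shows that pattern under stated assumptions: `sqlite3` stands in for DuckDB only so the example runs with the standard library, and the table names, columns, and values are invented. It is not how SQLMesh implements its incremental models internally.

```python
import sqlite3

# sqlite3 stands in for DuckDB here only so the sketch runs with the standard library.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE staging (report_month TEXT, country TEXT, value REAL, loaded_at TEXT);
    CREATE TABLE cleaned (report_month TEXT, country TEXT, value REAL);
    INSERT INTO staging VALUES
        ('2025-08', 'BR', 10.0, '2025-08-05'),
        ('2025-09', 'BR', 11.0, '2025-09-05'),
        ('2025-09', 'BR', 12.0, '2025-09-20');  -- later revision of the same key
""")


def refresh_window(con, start_month, end_month):
    """Rebuild only rows with start_month <= report_month < end_month."""
    with con:  # one transaction: delete the window, then re-insert it
        con.execute(
            "DELETE FROM cleaned WHERE report_month >= ? AND report_month < ?",
            (start_month, end_month),
        )
        con.execute(
            """
            INSERT INTO cleaned
            SELECT report_month, country, value
            FROM staging AS s
            WHERE report_month >= ? AND report_month < ?
              AND loaded_at = (
                  SELECT MAX(loaded_at) FROM staging
                  WHERE report_month = s.report_month AND country = s.country
              )
            """,
            (start_month, end_month),
        )


refresh_window(con, "2025-09", "2025-10")
refresh_window(con, "2025-09", "2025-10")  # re-running the same window adds no duplicates
rows = con.execute("SELECT * FROM cleaned ORDER BY report_month").fetchall()
```

Only the September window is touched: the August row is never read into `cleaned`, and running the same window twice leaves a single deduplicated row.
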
## Current State (October 2025)

### What's Working
- USDA PSD Online extraction (2006-present, monthly archives)
- 4-layer SQLMesh pipeline (raw → staging → cleaned → serving)
- DuckDB backend with 13 GB dev database
- Incremental-by-time-range models with deduplication
- Development environment with pre-commit hooks, linting, formatting

### What We Have
- Comprehensive commodity supply/demand data (USDA PSD)
- Established naming conventions and data quality patterns
- GitLab CI pipeline (lint, test, build)
- Documentation (CLAUDE.md, layer conventions)

## Roadmap

### Phase 1: Coffee Market Foundation (Current)
**Goal:** Build complete coffee analytics from supply to price

**Data Sources to Integrate:**
- ✅ USDA PSD Online (production, stocks, consumption)
- ⬜ ICO (International Coffee Organization) data
- ⬜ Yahoo Finance / Alpha Vantage (coffee futures prices, ticker KC=F)
- ⬜ Weather data for coffee-growing regions (OpenWeatherMap, NOAA)
- ⬜ CFTC COT data (trader positioning)
- ⬜ ICE warehouse stocks (web scraping)

**Features to Build:**
- ⬜ Historical price correlation analysis
- ⬜ Supply/demand balance modeling
- ⬜ Weather impact scoring
- ⬜ Trader sentiment indicators (COT)
- ⬜ Simple web dashboard (read-only analytics)
- ⬜ Data export APIs (JSON, CSV, Parquet)

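For the price correlation feature, the simplest starting point is probably a Pearson correlation between a supply series and a price series. The sketch below is one possible shape for that calculation, not the planned implementation; the series names and all numbers are made up for illustration and are not real market data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up numbers purely for illustration, not real market data.
ending_stocks = [30.0, 28.0, 25.0, 27.0, 22.0]
futures_price = [120.0, 135.0, 160.0, 150.0, 190.0]
r = pearson(ending_stocks, futures_price)
```

With these toy inputs `r` comes out strongly negative (stocks fall as price rises). In the real pipeline the same calculation would more likely live in SQL in a serving model, since SQL is the primary transformation language.
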
**Infrastructure:**
- ⬜ Move to Cloudflare R2 for raw data storage
- ⬜ Deploy SQLMesh to Hetzner production environment
- ⬜ Set up automated daily extraction + transformation pipeline
- ⬜ Implement monitoring and alerting

### Phase 2: Product Market Fit
**Goal:** Validate with real traders, iterate on feedback

- ⬜ Beta access for small group of coffee traders
- ⬜ Usage analytics (what queries matter?)
- ⬜ Performance optimization based on real workloads
- ⬜ Pricing model experimentation ($X/month, pay-as-you-go?)

### Phase 3: Expand Commodity Coverage
**Goal:** Prove architecture scales across commodities

**Priority Markets:**
1. Other softs (cocoa, sugar, cotton, OJ)
2. Grains (corn, wheat, soybeans)
3. Energy (crude oil, natural gas)
4. Metals (gold, silver, copper)

**Reusable Patterns:**
- Abstract extraction logic (API connectors, scrapers)
- Standardized staging layer for price/volume data
- Common serving models (time series, correlations, anomalies)

### Phase 4: Advanced Analytics
**Goal:** Differentiation through unique insights

- ⬜ Satellite imagery integration (NASA, Planet) for crop monitoring
- ⬜ Custom yield forecasting models
- ⬜ Real-time alert system (price thresholds, supply shocks)
- ⬜ Historical backtesting framework for trading strategies
- ⬜ Sentiment analysis from news/reports (USDA GAIN, FAO)

### Phase 5: Scale & Polish
**Goal:** Handle growth, maintain performance advantage

- ⬜ Multi-region deployment (low latency globally)
- ⬜ Advanced caching strategies
- ⬜ Self-service onboarding (no sales calls)
- ⬜ Public documentation and API reference
- ⬜ Community/forum for traders

## Key Decisions & Trade-offs

### Why DuckDB over Spark?
- **Speed:** In-process OLAP is faster for our workloads
- **Simplicity:** No cluster management, no JVM
- **Cost:** Runs on a single beefy server, not 100 nodes
- **Developer experience:** SQL-first, Python-friendly

### Why SQLMesh over dbt/Airflow?
- **Unified:** Orchestration + transformation in one tool
- **Performance:** Built for incremental execution
- **Virtual data environments:** Test changes without breaking prod
- **Python-native:** Extend with custom Python macros

### Why Cloudflare R2 over S3?
- **Cost:** No egress fees (huge for data-heavy platform)
- **Performance:** Global edge network
- **Simplicity:** S3-compatible API, easy migration path

### Why Hetzner over AWS?
- **Cost:** 10x cheaper for equivalent compute
- **Performance:** Bare metal = no noisy neighbors
- **Simplicity:** Less surface area, fewer services to manage

### Why Monorepo?
- **Atomic changes:** Update extraction + transformation together
- **Shared code:** Reusable utilities across packages
- **Simplified CI:** One pipeline, consistent tooling

## Anti-Goals

Things we explicitly do NOT want:

- ❌ Enterprise sales team
- ❌ Complex onboarding processes
- ❌ Vendor lock-in (AWS, Snowflake, etc.)
- ❌ OOP frameworks (Django ORM, SQLAlchemy magic)
- ❌ Microservices (until we need them, which is not now)
- ❌ Kubernetes (overkill for our scale)
- ❌ Feature bloat (every feature has a performance cost)

## Success Metrics

**Phase 1 (Foundation):**
- All coffee data sources integrated
- Daily pipeline runs reliably (<5% failure rate)
- Query latency <500 ms for common analytics

**Phase 2 (PMF):**
- 10+ paying beta users
- 90%+ data accuracy (validated against spot checks)
- Monthly churn <10%

**Phase 3 (Expansion):**
- 5+ commodity markets covered
- 100+ active users
- Break-even on infrastructure costs

**Long-term (Scale):**
- Cover all ~35-40 major commodity contracts
- 1000+ traders using the platform
- Recognized as the go-to alternative to Kpler for indie traders

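Two of these thresholds are simple ratios, and it helps to pin down the denominators early. The sketch below assumes that failure rate means failed runs divided by total runs, and that monthly churn means cancellations divided by customers at the start of the month; both definitions, the function names, and the counts are assumptions for illustration.

```python
def failure_rate(failed_runs: int, total_runs: int) -> float:
    """Share of pipeline runs that failed, as a fraction between 0 and 1."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    return failed_runs / total_runs


def monthly_churn(customers_at_start: int, cancelled_in_month: int) -> float:
    """Share of customers at the start of the month who cancelled during it."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return cancelled_in_month / customers_at_start


# Hypothetical counts for illustration only.
pipeline_ok = failure_rate(failed_runs=1, total_runs=30) < 0.05  # 1 failed daily run in 30
churn_ok = monthly_churn(customers_at_start=12, cancelled_in_month=1) < 0.10  # 1 of 12 left
```

With a daily pipeline, the <5% target allows one failed run in a 30-day month but not two, since 2/30 is about 6.7%.
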
## Guiding Questions

When making decisions, ask:

1. **Does this make us faster?** (Performance)
2. **Does this make us more accurate?** (Data quality)
3. **Does this make us simpler?** (Maintainability)
4. **Does this help traders make better decisions?** (Value)
5. **Can we afford to run this at scale?** (Unit economics)

If the answer to any of these is "no," reconsider.

## Current Priorities (Q4 2025)

1. Integrate coffee futures price data (Yahoo Finance)
2. Build time-series serving models for price/supply correlation
3. Deploy production pipeline to Hetzner
4. Set up Cloudflare R2 for raw data storage
5. Create simple read-only dashboard for coffee analytics
6. Document API for beta testers

---

**Last Updated:** October 2025
**Next Review:** End of Q4 2025 (adjust based on Phase 1 progress)