# SaaS Frontend Architecture Plan: beanflows.coffee

**Date**: 2025-10-21
**Status**: Planning
**Product**: beanflows.coffee - Coffee market analytics platform

## Project Vision

**beanflows.coffee** - A specialized coffee market analytics platform built on USDA PSD data, providing traders, roasters, and market analysts with actionable insights into global coffee production, trade flows, and supply chain dynamics.

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│             Robyn Web App (beanflows.coffee)                │
│                                                             │
│  Landing Page (Jinja2 + htmx) ─┬─> Auth (JWT + SQLite)      │
│                                └─> /dashboards/* routes     │
│                                          │                  │
│                                          ▼                  │
│                                Serve Evidence /build/       │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
             ┌──────────────────────────┐
             │ Evidence.dev Dashboards  │
             │ (coffee market focus)    │
             │                          │
             │ Queries: Local DuckDB ←──┼─── Export from Iceberg
             │ Builds: On data updates  │
             └──────────────────────────┘
```

## Technical Decisions

### Data Flow
- **Source:** Iceberg catalog (R2)
- **Export:** Local DuckDB file for Evidence dashboards
- **Trigger:** Rebuild Evidence after SQLMesh updates data
- **Serving:** Robyn serves Evidence static build output

### Auth System
- **User data:** SQLite database
- **Auth method:** JWT tokens (Robyn built-in support)
- **Consideration:** Evaluate hosted auth services (Clerk, Auth0)
- **POC approach:** Simple email/password with JWT

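The email/password POC implies a password hashing scheme for the `password_hash` column. A minimal sketch using the stdlib's `hashlib.pbkdf2_hmac` (the module name, helper names, and iteration count are illustrative assumptions, not part of the plan):

```python
# web/security.py (hypothetical module) - password hashing for the POC auth flow
import hashlib
import hmac
import os

ITERATIONS = 600_000  # assumption: a common PBKDF2-SHA256 work factor


def hash_password(password: str) -> str:
    """Return 'salt_hex$hash_hex' for storage in users.password_hash."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return f"{salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    salt_hex, digest_hex = stored.split("$")
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), bytes.fromhex(salt_hex), ITERATIONS
    )
    return hmac.compare_digest(candidate.hex(), digest_hex)
```

A dedicated password hashing library (argon2, bcrypt) would also fit, but PBKDF2 keeps the POC dependency-free.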
### Payments
- **Provider:** Stripe
- **Integration:** Webhook-based (Stripe.js on client, webhooks to Robyn)
- **Rationale:** Simplest integration, no need for complex server-side API calls

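The webhook endpoint must verify Stripe's signature before trusting a payload. In Python this is normally one call to the SDK's `stripe.Webhook.construct_event`; as a sketch of what that check does under the hood, per Stripe's documented `Stripe-Signature: t=<ts>,v1=<hex>` scheme (the function name is ours):

```python
# Sketch of Stripe's webhook signature check: HMAC-SHA256 over "<timestamp>.<body>"
import hashlib
import hmac


def verify_stripe_signature(payload: bytes, sig_header: str, secret: str) -> bool:
    """Return True if the v1 signature in sig_header matches the payload."""
    parts = dict(item.split("=", 1) for item in sig_header.split(","))
    signed_payload = f"{parts['t']}.".encode() + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # Stripe may send multiple v1 signatures during secret rotation; one is checked here
    return hmac.compare_digest(expected, parts["v1"])
```

A production handler should also reject timestamps outside a tolerance window (the SDK defaults to 5 minutes) to limit replay attacks.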
### Project Structure
```
materia/
├── web/                      # NEW: Robyn web application
│   ├── app.py                # Robyn entry point
│   ├── routes/
│   │   ├── landing.py        # Marketing page
│   │   ├── auth.py           # Login/signup (JWT)
│   │   └── dashboards.py     # Serve Evidence /build/
│   ├── templates/            # Jinja2 + htmx
│   │   ├── base.html
│   │   ├── landing.html
│   │   └── login.html
│   ├── middleware/
│   │   └── auth.py           # JWT verification
│   ├── models.py             # SQLite schema (users table)
│   └── static/               # CSS, htmx.js
├── dashboards/               # NEW: Evidence.dev project
│   ├── pages/                # Dashboard markdown files
│   │   ├── index.md          # Global coffee overview
│   │   ├── production.md     # Production trends
│   │   ├── trade.md          # Trade flows
│   │   └── supply.md         # Supply/demand balance
│   ├── sources/              # Data source configs
│   ├── data/                 # Local DuckDB exports
│   │   └── coffee_data.duckdb
│   └── package.json
```

## How It Works: Robyn + Evidence Integration

### 1. Evidence Build Process
```bash
cd dashboards
npm run build
# Outputs static HTML/JS/CSS to dashboards/build/
```

### 2. Robyn Serves Evidence Output
```python
# web/routes/dashboards.py
from pathlib import Path


@app.get("/dashboards/*")
@requires_jwt  # Custom middleware (web/middleware/auth.py); redirects to /login on failure
def serve_dashboard(request):
    # Strip /dashboards/ prefix
    path = request.path.removeprefix("/dashboards/") or "index.html"

    # Serve from Evidence build directory
    file_path = Path("dashboards/build") / path

    # Fall back to the SPA entry point for unknown paths
    if not file_path.exists():
        file_path = Path("dashboards/build/index.html")

    return FileResponse(file_path)
```

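One caveat with a wildcard route like the one above: joining the request path onto the build directory will happily follow `../` segments. A hedged sketch of a containment check (the helper name is ours, not a Robyn API):

```python
# Reject path traversal before serving files from the Evidence build directory
from pathlib import Path
from typing import Optional

BUILD_DIR = Path("dashboards/build").resolve()


def resolve_safe(requested: str) -> Optional[Path]:
    """Resolve a request path inside BUILD_DIR; None if it escapes the directory."""
    candidate = (BUILD_DIR / requested).resolve()
    # is_relative_to (Python 3.9+) confirms containment after ".." resolution
    if candidate.is_relative_to(BUILD_DIR):
        return candidate
    return None
```

The route handler would call this first and return 404 (or the index fallback) when it yields `None`.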
### 3. User Flow
1. User visits `beanflows.coffee` (landing page)
2. User signs up / logs in (Robyn auth system)
3. Stripe checkout for subscription (using Stripe.js)
4. User navigates to `beanflows.coffee/dashboards/`
5. Robyn checks JWT authentication
6. If authenticated: serves Evidence static files
7. If not: redirects to login

## Phase 1: Evidence.dev POC

**Goal:** Get Evidence working with coffee data

### Tasks

1. Create Evidence project in `dashboards/`
   ```bash
   mkdir dashboards && cd dashboards
   npm init evidence@latest .
   ```

2. Create SQLMesh export model for coffee data
   ```sql
   -- models/exports/export_coffee_analytics.sql
   -- Note: DuckDB's COPY ... TO writes CSV/Parquet/JSON, not .duckdb files;
   -- attach the target database and create a table inside it instead.
   ATTACH 'dashboards/data/coffee_data.duckdb' AS coffee_db;
   CREATE OR REPLACE TABLE coffee_db.coffee_metrics AS
   SELECT * FROM serving.obt_commodity_metrics
   WHERE commodity_name ILIKE '%coffee%';
   ```

3. Build simple coffee production dashboard
   - Single dashboard showing coffee production trends
   - Test Evidence build process
   - Validate DuckDB query performance

4. Test local Evidence dev server
   ```bash
   npm run dev
   ```

**Deliverable:** Working Evidence dashboard querying local DuckDB

## Phase 2: Robyn Web App

### Tasks

1. Set up Robyn project in `web/`
   ```bash
   mkdir web && cd web
   uv add robyn jinja2
   ```

2. Implement SQLite user database
   ```python
   # web/models.py
   import sqlite3

   def init_db():
       conn = sqlite3.connect('users.db')
       conn.execute('''
           CREATE TABLE IF NOT EXISTS users (
               id INTEGER PRIMARY KEY,
               email TEXT UNIQUE NOT NULL,
               password_hash TEXT NOT NULL,
               stripe_customer_id TEXT,
               subscription_status TEXT,
               created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
           )
       ''')
       conn.commit()
       conn.close()
   ```

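The Stripe webhook handler will need to flip `subscription_status` on this table when a checkout or cancellation event arrives. A sketch against the schema above (the helper name is ours):

```python
# Update a user's subscription status from a webhook event
import sqlite3


def update_subscription(db_path: str, email: str, status: str) -> bool:
    """Set subscription_status for a user; returns False if the email is unknown."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "UPDATE users SET subscription_status = ? WHERE email = ?",
            (status, email),
        )
        conn.commit()
        return cur.rowcount == 1
    finally:
        conn.close()
```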
3. Add JWT authentication
   ```python
   # web/middleware/auth.py
   import os
   from functools import wraps

   import jwt  # PyJWT
   from robyn import Request

   SECRET_KEY = os.environ["JWT_SECRET"]

   def requires_jwt(func):
       @wraps(func)
       def wrapper(request: Request):
           token = request.headers.get("Authorization")
           if not token:
               return redirect("/login")

           try:
               # Header is typically "Bearer <token>"
               payload = jwt.decode(
                   token.removeprefix("Bearer "), SECRET_KEY, algorithms=["HS256"]
               )
               request.user = payload
               return func(request)
           except jwt.InvalidTokenError:
               return redirect("/login")

       return wrapper
   ```

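In practice tokens would be issued with PyJWT's `jwt.encode`. Purely to illustrate what the middleware above is validating, here is the HS256 token anatomy in stdlib terms (claim names are illustrative):

```python
# Anatomy of an HS256 JWT: base64url(header).base64url(payload).base64url(signature)
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def make_hs256_token(claims: dict, secret: str) -> str:
    """Build the same three-part token jwt.encode(..., algorithm="HS256") produces."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}, separators=(",", ":")).encode())
    payload = b64url(json.dumps(claims, separators=(",", ":")).encode())
    signing_input = f"{header}.{payload}".encode()
    signature = b64url(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{signature}"
```

The login route would call this with a claims dict like `{"sub": email, "exp": ...}` and hand the token to the client.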
4. Create landing page (Jinja2 + htmx)
   - Marketing copy
   - Feature highlights
   - Pricing section
   - Sign up CTA

5. Add dashboard serving route
   - Protected by JWT middleware
   - Serves Evidence `build/` directory

**Deliverable:** Authenticated web app serving Evidence dashboards

## Phase 3: Coffee Market Dashboards

### Dashboard Ideas

1. **Global Coffee Production Overview**
   - Top producing countries (Brazil, Vietnam, Colombia, Ethiopia, Honduras)
   - Arabica vs Robusta production split
   - Year-over-year production changes
   - Production volatility trends

2. **Supply & Demand Balance**
   - Stock-to-use ratios by country
   - Export/import flows (trade network visualization)
   - Consumption trends by region
   - Inventory levels (ending stocks)

3. **Market Volatility**
   - Production volatility (weather impacts, climate change signals)
   - Trade flow disruptions (sudden changes in export patterns)
   - Stock drawdown alerts (countries depleting reserves)

4. **Historical Trends**
   - 10-year production trends by country
   - Market share shifts (which countries are gaining/losing)
   - Climate impact signals (correlation with weather events)
   - Long-term supply/demand balance

5. **Trade Flow Analysis**
   - Top exporters → top importers (Sankey diagram if possible)
   - Net trade position by country
   - Import dependency ratios
   - Trade balance trends

### Data Requirements

- Filter PSD data for coffee commodity codes
- May need new serving layer models:
  - `fct_coffee_trade_flows` - Origin/destination trade flows
  - `dim_coffee_varieties` - Arabica vs Robusta (if data available)
  - `agg_coffee_regional_summary` - Regional aggregates

**Deliverable:** Production-ready coffee analytics dashboards

## Phase 4: Deployment & Automation

### Evidence Build Trigger

Rebuild Evidence dashboards after SQLMesh updates data:

```python
# In SQLMesh post-hook or separate script
import subprocess


def rebuild_dashboards():
    # Export fresh data from Iceberg to a local DuckDB file.
    # Note: COPY ... TO cannot write a .duckdb file, so attach the
    # target database and create the table inside it.
    subprocess.run([
        "duckdb", "-c",
        "ATTACH 'iceberg_catalog' AS iceberg; "
        "ATTACH 'dashboards/data/coffee_data.duckdb' AS coffee_db; "
        "CREATE OR REPLACE TABLE coffee_db.coffee_metrics AS "
        "SELECT * FROM iceberg.serving.obt_commodity_metrics "
        "WHERE commodity_name ILIKE '%coffee%';"
    ], check=True)

    # Rebuild Evidence
    subprocess.run(["npm", "run", "build"], cwd="dashboards", check=True)

    # Optional: Restart Robyn to pick up new files
    # (or use file watching in development)
```

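To avoid rebuilding Evidence when nothing changed, the trigger could compare the export's mtime against a marker touched on the last successful build. A minimal sketch (paths and the function name are assumptions):

```python
# Skip the Evidence rebuild when the exported data hasn't changed
import os


def needs_rebuild(data_file: str, build_marker: str) -> bool:
    """True if the exported DuckDB file is newer than the last build marker."""
    if not os.path.exists(build_marker):
        return True  # never built
    if not os.path.exists(data_file):
        return False  # nothing to build from
    return os.path.getmtime(data_file) > os.path.getmtime(build_marker)
```

`rebuild_dashboards()` would then touch `build_marker` after `npm run build` succeeds.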
**Trigger:** Run after SQLMesh `plan prod` completes successfully

### Deployment Strategy

- **Robyn app:** Deploy to supervisor instance or dedicated worker
- **Evidence builds:** Built on deploy (run `npm run build` in CI/CD)
- **DuckDB file:** Exported from Iceberg during deployment

**Deployment flow:**
```
GitLab master push
        ↓
CI/CD: Export coffee data from Iceberg → DuckDB
        ↓
CI/CD: Build Evidence dashboards (npm run build)
        ↓
Deploy Robyn app + Evidence build/ to supervisor/worker
        ↓
Robyn serves landing page + authenticated dashboards
```

**Deliverable:** Automated pipeline: SQLMesh → Export → Evidence Rebuild → Deployment

## Alternative Architecture: nginx + FastCGI C

### Evaluation

**Current plan:** Robyn (Python web framework)
**Alternative:** nginx + FastCGI C + kcgi library

### How It Would Work

```
nginx (static files + Evidence dashboards)
        ↓
FastCGI C programs (auth, user management, Stripe webhooks)
        ↓
SQLite (user database)
```

### Authentication Options

**Option 1: nginx JWT Module**
- Use open-source JWT module (`kjdev/nginx-auth-jwt`)
- nginx validates JWT before passing to FastCGI
- FastCGI receives `REMOTE_USER` variable
- **Complexity:** Medium (compile nginx with the module)

**Option 2: FastCGI C Auth Service**
- Separate FastCGI program validates JWT
- nginx uses `auth_request` directive
- Auth service returns 200 (valid) or 401 (invalid)
- **Complexity:** Medium (needs the `libjwt` library)

**Option 3: FastCGI Handles Everything**
- Main FastCGI program validates JWT inline
- Uses `libjwt` for token parsing
- **Complexity:** Medium (the simplest of the three architectures)

### Required C Libraries

- **FastCGI:** `kcgi` (modern, secure CGI/FastCGI library)
- **JWT:** `libjwt` (JWT creation/validation)
- **HTTP client:** `libcurl` (for Stripe API calls)
- **JSON:** `json-c` or `cJSON` (parsing Stripe webhook payloads)
- **Database:** `libsqlite3` (user storage)
- **Templating:** Manual string building (no C equivalent to Jinja2)

### Payment Integration

**Challenge:** No official Stripe C library

**Solutions:**

1. **Webhook-based approach (RECOMMENDED)**
   - Frontend uses Stripe.js (client-side checkout)
   - Stripe sends webhook to FastCGI endpoint
   - C program verifies webhook signature (HMAC-SHA256)
   - Updates user database (subscription status)
   - **Complexity:** Medium (simpler than full API integration)

2. **Direct API calls with libcurl**
   - Make HTTP POST to Stripe API
   - Build JSON payloads manually
   - Parse JSON responses with `json-c`
   - **Complexity:** High (manual HTTP/JSON handling)

### Development Time Estimate

| Task | Robyn (Python) | FastCGI (C) |
|------|----------------|-------------|
| Basic auth | 2-3 days | 5-7 days |
| Payment integration | 3-5 days | 7-10 days |
| Template rendering | 1-2 days | 5-7 days |
| Debugging/testing | 1-2 days | 3-5 days |
| **Total POC** | **1-2 weeks** | **3-4 weeks** |

### Performance Comparison

**Robyn (Python):** ~1,000-5,000 req/sec
**nginx + FastCGI C:** ~10,000-50,000 req/sec

**Reality check:** For beanflows.coffee with <1,000 users, even 100 req/sec is plenty.

### Pros & Cons

**Pros of C approach:**
- 10-50x faster than Python
- Lower memory footprint (~5-10MB vs 50-100MB)
- Simpler deployment (compiled binary + nginx config)
- More direct, no framework magic
- Data-oriented, performance-first design

**Cons of C approach:**
- 2-3x longer development time
- More complex debugging (no interactive REPL)
- Manual memory management (potential for leaks/bugs)
- No templating library (build HTML with sprintf/snprintf)
- Stripe integration requires manual HTTP/JSON handling
- Steeper learning curve for team members

### Recommendation

**Start with Robyn, plan a migration path to C:**

**Phase 1 (Now):** Build with Robyn
- Fast development (1-2 weeks to POC)
- Prove product-market fit
- Get paying customers
- Measure actual performance needs

**Phase 2 (After launch):** Evaluate performance
- Monitor Robyn performance under real load
- If Robyn handles <1,000 users easily → stay with it
- If hitting bottlenecks → profile to find hot paths

**Phase 3 (Optional, if needed):** Incremental C migration
- Rewrite hot paths only (e.g., auth service)
- Keep Evidence dashboards static (nginx serves them directly)
- Hybrid architecture: nginx → C (auth) → Robyn (business logic)

### Hybrid Architecture (Best of Both Worlds)

```
nginx
  ↓
  ├─> Static files (Evidence dashboards)   [nginx serves directly]
  ├─> Auth endpoints (/login, /signup)     [FastCGI C - future optimization]
  └─> Business logic (/api/*, /webhooks)   [Robyn - for flexibility]
```

**When to migrate:**
- When Robyn becomes a measurable bottleneck (>80% CPU under normal load)
- When response times exceed targets (>100ms p95)
- When memory usage becomes a concern (>500MB for a simple app)

**Philosophy:** Measure first, optimize second. A data-oriented approach means we don't guess about performance; we measure and optimize only when needed.

## Implementation Order

1. **Week 1:** Evidence POC + local DuckDB export
   - Create Evidence project
   - Export coffee data from Iceberg
   - Build simple production dashboard
   - Validate local dev workflow

2. **Week 2:** Robyn app + basic auth + Evidence embedding
   - Set up Robyn project
   - SQLite user database
   - JWT authentication
   - Landing page (Jinja2 + htmx)
   - Serve Evidence dashboards at `/dashboards/*`

3. **Week 3:** Coffee-specific dashboards + Stripe
   - Build 3-4 core coffee dashboards
   - Integrate Stripe checkout
   - Webhook handling for subscriptions
   - Basic user account page

4. **Week 4:** Automated rebuild pipeline + deployment
   - Automate Evidence rebuild after SQLMesh runs
   - CI/CD pipeline for deployment
   - Deploy to supervisor or dedicated worker
   - Monitoring and analytics

## Open Questions

1. **Hosted auth:** Evaluate Clerk vs Auth0 vs rolling our own
   - Clerk: $25/mo for 1,000 MAU, nice DX
   - Auth0: Free tier up to 7,500 MAU, more enterprise-oriented
   - Roll our own: $0, full control, more code
   - **Decision:** Start with roll-our-own JWT (simplest), migrate to a hosted service if auth becomes complex

2. **DuckDB sync:** How often to export from Iceberg?
   - Option A: Daily (after SQLMesh runs)
   - Option B: After every SQLMesh plan
   - **Decision:** Daily for now, automate after SQLMesh completion in production

3. **Evidence build time:** If builds are slow, we need a caching strategy
   - Monitor build times in Phase 1
   - If >60s, investigate Evidence cache options
   - May need incremental builds

4. **Multi-commodity future:** How to expand beyond coffee?
   - Code structure should be generic (parameterize the commodity filter)
   - Could launch cocoa.flows, wheat.supply, etc.
   - Evidence supports parameterized pages (easy to expand)

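Parameterizing the commodity filter could be as small as generating the export statements from a commodity name. A sketch, reusing the table names from the Phase 1 export model (the builder function is ours; the ATTACH/CREATE TABLE form is how DuckDB writes into a `.duckdb` file):

```python
# Commodity-generic variant of the Phase 1 export model
def build_export_sql(commodity: str, target_db: str) -> str:
    """Generate attach-and-export SQL for one commodity."""
    safe = commodity.replace("'", "''")  # naive escaping for the SQL literal
    return (
        f"ATTACH '{target_db}' AS export_db; "
        f"CREATE OR REPLACE TABLE export_db.commodity_metrics AS "
        f"SELECT * FROM serving.obt_commodity_metrics "
        f"WHERE commodity_name ILIKE '%{safe}%';"
    )
```

Each vertical (coffee, cocoa, wheat) would then differ only in the commodity string and target database path passed to the export step.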
5. **C migration decision point:** What metrics trigger a rewrite?
   - CPU >80% sustained under normal load
   - Response times >100ms p95
   - Memory >500MB for a simple app
   - User complaints about slowness

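The >100ms p95 threshold implies collecting response-time samples and computing percentiles. A minimal sketch using the nearest-rank method (helper name is ours):

```python
# Nearest-rank percentile over collected response-time samples
def percentile(samples: list, pct: float) -> float:
    """Return the pct-th percentile (0 < pct <= 100) of samples, nearest-rank."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # nearest-rank: ceil(pct/100 * n), converted to a 0-based index
    rank = max(1, -(-len(ordered) * pct // 100))  # ceiling division
    return ordered[int(rank) - 1]
```

In production, `statistics.quantiles` or the monitoring stack's own percentile aggregation would do this; the point is only that the migration triggers above are cheap to compute from raw samples.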
## Success Metrics

**Phase 1 (POC):**
- Evidence site builds successfully
- Coffee data loads from DuckDB (<2s)
- One dashboard renders with real data
- Local dev server runs without errors

**Phase 2 (MVP):**
- Robyn app runs and serves Evidence dashboards
- JWT auth works (login/signup flow)
- Landing page loads in <2s
- Dashboard access restricted to authenticated users

**Phase 3 (Launch):**
- Stripe integration works (test payment succeeds)
- 3-4 coffee dashboards functional
- Automated deployment pipeline working
- Monitoring in place (uptime, errors, performance)

**Phase 4 (Growth):**
- User signups (track conversion rate)
- Active subscribers (MRR growth)
- Dashboard usage (which insights are most valuable)
- Performance metrics (response times, error rates)

## Cost Analysis

**Current costs (data pipeline):**
- Supervisor: €4.49/mo (Hetzner CPX11)
- Workers: €0.01-0.05/day (ephemeral)
- R2 storage: ~€0.10/mo (Iceberg catalog)
- **Total: ~€5/mo**

**Additional costs (SaaS frontend):**
- Domain: €10/year (beanflows.coffee)
- Robyn hosting: €0 (runs on supervisor) or a dedicated worker at €4.49/mo
- Stripe fees: 2.9% + €0.30 per transaction
- **Total: ~€5-10/mo base cost**

**Scaling costs:**
- If a dedicated worker is needed for Robyn: +€4.49/mo
- If migrating to C: no additional cost (same infrastructure)
- Stripe fees scale with revenue (a good problem to have)

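The Stripe fee line implies a net payout of price × (1 − 0.029) − €0.30 per charge, which matters when picking a subscription price. A quick sketch (fee rates copied from the cost table above; actual rates vary by card type and region):

```python
# Net payout per charge after Stripe's percentage + fixed fee
def stripe_net(amount_eur: float, pct: float = 0.029, fixed_eur: float = 0.30) -> float:
    """Amount received after Stripe's 2.9% + €0.30 fee, rounded to cents."""
    return round(amount_eur * (1 - pct) - fixed_eur, 2)
```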
## Next Steps (When Ready)

1. Create `dashboards/` directory and initialize Evidence.dev
2. Create SQLMesh export model for coffee data
3. Build simple coffee production dashboard
4. Set up Robyn project structure
5. Implement basic JWT auth
6. Integrate Evidence dashboards into Robyn

**Decision point:** After the Phase 1 POC, re-evaluate C migration based on Evidence.dev capabilities and development experience.

## References

- Evidence.dev: https://docs.evidence.dev/
- Robyn: https://github.com/sparckles/robyn
- kcgi (C CGI library): https://kristaps.bsd.lv/kcgi/
- libjwt: https://github.com/benmcollins/libjwt
- nginx auth_request: https://nginx.org/en/docs/http/ngx_http_auth_request_module.html
- Stripe webhooks: https://stripe.com/docs/webhooks