Files
beanflows/docs/data-sources-inventory.md
Deeman b25b8780a7 docs: update inventory with ICE options research findings
- yfinance confirmed not viable (OPRA only, KC=F not covered)
- CFTC COT combined report is the free immediate path (URL change only)
- ICE Report Center settlement data viable with WebICE login automation
- Barchart OnDemand has correct coverage but requires paid subscription
- All OpenBB providers, Polygon.io, Nasdaq Data Link confirmed no KC=F coverage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-26 10:16:50 +01:00

280 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# BeanFlows — Data Sources Inventory
Compiled: 2026-02-26
Purpose: Identify and track data sources feeding the BeanFlows DuckDB analytics pipeline.
---
## Pipeline Status Tracker
**Status:** ✅ Ingested — extractor + model live in `master` | 🔲 Planned — worth building | ⏸ On hold — blocked on cost/access | — Not targeted
**Score (15):** Overall ingestion priority. Weighs data value to BeanFlows (price analytics, COT positioning, crop weather, PSD fundamentals) against implementation effort and access barriers. 5 = core infrastructure already ingested, 1 = marginal or inaccessible.
| Source | Category | Status | Score | Credentials | Pipeline refs |
|--------|----------|--------|-------|-------------|---------------|
| CFTC COT Disaggregated Futures | Positioning | ✅ Ingested | 5 | None | `extract_cot``fct_cot_positioning``serving.cot_positioning` |
| Yahoo Finance — KC=F | Price | ✅ Ingested | 5 | None | `extract_coffee_prices``fct_coffee_prices``serving.coffee_prices` |
| ICE Report Center — warehouse stocks | Warehouse / Inventory | ✅ Ingested | 5 | None | `extract_ice_stocks``fct_ice_warehouse_stocks``serving.ice_warehouse_stocks` |
| ICE Report Center — stocks by port | Warehouse / Inventory | ✅ Ingested | 4 | None | `extract_ice_stocks``fct_ice_warehouse_stocks_by_port``serving.ice_warehouse_stocks_by_port` |
| ICE Report Center — aging stocks | Warehouse / Inventory | ✅ Ingested | 4 | None | `extract_ice_stocks``fct_ice_aging_stocks``serving.ice_aging_stocks` |
| USDA PSD Online | Fundamentals (supply/demand) | ✅ Ingested | 5 | None | `extract_psd``stg_psdalldata__commodity``serving.commodity_metrics` |
| Open-Meteo ERA5 — weather | Crop weather | ✅ Ingested | 5 | None | `extract_openmeteo``fct_weather_daily``serving.weather_daily` |
| CFTC COT Combined (futures+options) | Positioning | 🔲 Planned | 4 | None (same ZIP pattern) | Extend `extract_cot``fct_cot_combined` variant |
| ICE Report Center — options settlement | Derivatives / Volatility | 🔲 Planned | 3 | WebICE account (free) | `extract_ice_options``fct_ice_options_settlement` |
| Barchart OnDemand — options EOD | Derivatives / Volatility | ⏸ On hold | 3 | Paid subscription (contact sales) | `extract_barchart_options``fct_options_chain` |
| World Bank Commodity Prices (Pink Sheet) | Benchmark prices | 🔲 Planned | 3 | None | `extract_wb_prices``fct_wb_prices` |
| FAO Crop Calendar | Seasonality | 🔲 Planned | 3 | None (CSV) | Seed table |
| Freight / C4 route rates | Supply chain | 🔲 Planned | 2 | None (scrape) | `fct_freight_rates` |
| ICE Data Services — tick data | Price (granular) | ⏸ On hold | 2 | Paid subscription | Commercial; not needed for daily analytics |
| Refinitiv / LSEG | Price / Fundamentals | — | 1 | Enterprise subscription | Superseded by free ICE + CFTC + USDA sources |
| Bloomberg Terminal | Price / News | — | 1 | Terminal license | Not cost-effective for current scope |
---
## 1. Positioning Data
### 1.1 CFTC COT Disaggregated Futures
| Field | Value |
|-------|-------|
| URL | `https://www.cftc.gov/files/dea/history/fut_disagg_txt_{year}.zip` |
| Data Type | Weekly futures-only positioning by trader category (Producer/Merchant, Swap Dealer, Managed Money, Other Reportable, Non-Reportable) |
| Access Method | Public download — no auth, no API key |
| Update Frequency | Weekly (Friday 3:30 PM ET); current-year file updated in-place |
| History | 2006-06-13 to present |
| License / TOS | US government data — public domain |
| Priority | **Core** |
The CFTC publishes the Disaggregated Futures-Only report as one ZIP per year, containing a single CSV with all commodity codes. Each file is ~330 MB. The current-year file is overwritten each Friday; prior years are static.
Column quirk: `Swap__Positions_Short_All` and `Swap__Positions_Spread_All` use double underscores — this is a CFTC data quality issue (not a typo). All other swap columns use single underscores. DuckDB `all_varchar = TRUE` preserves exact header names; these columns must be quoted in SQL.
**Pipeline implementation:** ✅ Ingested
- Extractor: `extract/cftc_cot/` — backfills all years from 2006, idempotent via etag (synthetic etag = year + content-length + last-modified hash when CFTC omits etag header)
- Landing: `data/landing/cot/{year}/{etag}.csv.gzip`
- Foundation: `foundation.fct_cot_positioning` — casts types, cleans names, computes net positions (long short), deduplicates via HASH key
- Grain: `(cftc_commodity_code, report_date, cftc_contract_market_code, ingest_date)`
- Serving: `serving.cot_positioning` — adds COT index (normalized percentile rank over 26w / 52w rolling window), managed money net % of OI
- Covers all commodity codes in the report — filtering to coffee (`073642`) happens in the serving layer
**Related: CFTC COT Options-and-Futures Combined**
The same URL pattern with `com_disagg_txt_{year}.zip` gives the combined futures+options report. We currently use the futures-only report. Adding the combined variant would enable options-specific positioning analysis (see Section 5).
---
## 2. Price Data
### 2.1 Yahoo Finance — Coffee C (KC=F)
| Field | Value |
|-------|-------|
| URL | Yahoo Finance via `yfinance` Python library; ticker `KC=F` |
| Data Type | Daily OHLCV + adjusted close for Coffee C continuous front-month futures |
| Access Method | `yfinance.Ticker("KC=F").history(period="max")` — free, no auth |
| Update Frequency | Daily (post-settlement, typically ~30 min after session close) |
| History | 1971-08-16 to present |
| License / TOS | Yahoo Finance ToS — data for personal/non-commercial use; not for redistribution |
| Priority | **Core** |
Yahoo Finance is the only free source for Coffee C daily OHLCV with full history back to 1971. Data quality is generally good for daily analytics; occasional gaps on non-US holidays. Adjusted close (`Adj Close`, note space in header) accounts for contract rolls.
Column quirk: `Adj Close` has a space in the CSV header. DuckDB `all_varchar = TRUE` preserves this; must be quoted as `"Adj Close"` in SQL.
**Pipeline implementation:** ✅ Ingested
- Extractor: `extract/coffee_prices/` — downloads full history via `ticker.history(period="max")`, idempotent via SHA256 of CSV bytes
- Landing: `data/landing/prices/coffee_kc/{hash8}.csv.gzip` (single file; hash changes when new trading days are appended)
- Foundation: `foundation.fct_coffee_prices` — casts, deduplicates via `HASH(Date, Close)`
- Grain: `trade_date`
- Serving: `serving.coffee_prices` — adds daily return, SMA 20/50/200, EMA 9/21, Bollinger Bands (20d, ±2σ), RSI 14, 52-week high/low/range
---
## 3. Warehouse & Inventory Data
### 3.1 ICE Report Center — Warehouse Stocks
| Field | Value |
|-------|-------|
| URL | `https://www.theice.com/publicdocs/futures_us/exchange_notices/coffee_certifiedstocks.csv` (rolling) + `https://www.ice.com/marketdata/api/reports/293/results` (API, product_id=2) |
| Data Type | Daily ICE-certified and pending-grading coffee bags (total, by port, by age bucket) |
| Access Method | Public — no auth required. Static CSV for rolling data; private JSON API for historical report catalogue |
| Update Frequency | Daily (trading days) for stocks; monthly for aging report |
| History | Full archive available via report API (~2010 to present); static CSV is rolling |
| License / TOS | ICE — public market data |
| Priority | **Core** |
Three distinct datasets served by one extractor:
1. **Daily warehouse stocks** (`ice_stocks`) — total certified bags + pending grading. Key supply constraint indicator.
2. **Stocks by port** (`ice_stocks_by_port`) — breakdown across NY, New Orleans, Houston, Miami, Antwerp, Hamburg/Bremen, Barcelona, Virginia. Port-level flow analysis.
3. **Aging stocks** (`ice_aging`) — bags grouped by age bucket (e.g., "0 to 30", "31 to 60" days). Older stocks command quality discounts; aging ratio is a quality/supply stress signal.
The report API is undocumented but stable. Reports are discovered via `POST /api/reports/293/results` with `productId=2`, paginated. XLS/XLSX files are parsed with `xlrd`; the extractor handles both OLE2 `.xls` and modern `.xlsx` formats via magic-byte detection.
**Pipeline implementation:** ✅ Ingested
- Extractor: `extract/ice_stocks/` — idempotent via SHA256 of content
- Landing:
- `data/landing/ice_stocks/{year}/{date}_{hash8}.csv.gzip`
- `data/landing/ice_aging/{year}/{date}_{hash8}.csv.gzip`
- `data/landing/ice_stocks_by_port/{year}/{date}_{hash8}.csv.gzip`
- Foundation: `fct_ice_warehouse_stocks`, `fct_ice_aging_stocks`, `fct_ice_warehouse_stocks_by_port`
- Serving: corresponding `serving.*` models with WoW change, 30d/52w rolling averages, drawdown from 52w high, age-bucket share
---
## 4. Fundamentals Data
### 4.1 USDA PSD Online — Production, Supply, and Distribution
| Field | Value |
|-------|-------|
| URL | `https://apps.fas.usda.gov/psdonline/downloads/archives/{year}/{month:02d}/psd_alldata_csv.zip` |
| Data Type | Monthly supply/demand balances by commodity × country × market year: production, imports, exports, consumption, ending stocks |
| Access Method | Public download — no auth |
| Update Frequency | Monthly (WASDE report release dates, ~11th of each month) |
| History | 2006-08 to present (archive); current year is always available |
| License / TOS | USDA FAS — US government open data |
| Priority | **Core** |
PSD is the primary source for global coffee supply/demand fundamentals. Each monthly file contains all commodities (not just coffee) and all reporting countries for all market years. The coffee commodity code is `0721100` (green bean equivalent). Market year for coffee runs OctoberSeptember.
Key attributes tracked:
- `AREA_HARVESTED` (ha), `PRODUCTION` (1000 MT or 60kg bags), `DOMESTIC_CONSUMPTION`, `EXPORTS`, `ENDING_STOCKS`, `STOCKS_TO_USE_RATIO_`
**Pipeline implementation:** ✅ Ingested
- Extractor: `extract/psdonline/` — backfills from 2006-08, idempotent via etag
- Landing: `data/landing/psd/{year}/{month:02d}/{etag}.csv.gzip`
- Staging: `staging.stg_psdalldata__commodity` — joins with seed tables for commodity/attribute/unit metadata; `cleaned.psdalldata__commodity_pivoted` — pivots attributes to wide format
- Seeds: `psd_commodity_codes.csv`, `psd_attribute_codes.csv`, `psd_unit_of_measure_codes.csv`
- Serving: `serving.commodity_metrics` — coffee and cocoa supply/demand balances, production growth YoY, stock-to-use ratio
---
## 5. Weather & Climate Data
### 5.1 Open-Meteo — ERA5 Reanalysis + Forecast Blend
| Field | Value |
|-------|-------|
| URL | Archive: `https://archive-api.open-meteo.com/v1/archive` · Forecast: `https://api.open-meteo.com/v1/forecast` |
| Data Type | Daily weather for 12 coffee-growing regions: temperature (min/max/mean), precipitation, wind, humidity, cloud cover, ET₀, VPD |
| Access Method | Free API — no key, no registration |
| Update Frequency | Daily; ERA5 reanalysis available to ~5 days ago, gap filled by forecast API |
| History | ERA5 archive from 1940; pipeline backfilled from 2020-01-01 |
| License / TOS | CC BY 4.0 — attribution required |
| Rate Limiting | No published rate limit; community API. Sleep 0.5s between location calls. Pre-check file existence to skip API calls on re-runs. |
| Priority | **Core** |
Open-Meteo wraps ECMWF ERA5 reanalysis data, which is the scientific standard for historical weather. The API requires no key and has no formal rate limit for reasonable usage (~12 calls/day for daily updates).
Variables fetched:
- `temperature_2m_max/min/mean` — frost detection (`<5°C`), heat stress (`>30°C`)
- `precipitation_sum` — drought and flood signals
- `wind_speed_10m_max` — wind damage proxy
- `relative_humidity_2m_max` — disease pressure (coffee leaf rust, CBD)
- `cloud_cover_mean` — solar radiation proxy
- `et0_fao_evapotranspiration` — crop water demand (Penman-Monteith)
- `vapour_pressure_deficit_max` — transpiration stress (`>1.5 kPa` = significant stress)
**12 locations** covering the world's primary Arabica and Robusta growing zones (BR ×3, VN, CO, ET, HN, GT, ID, PE, UG, CI). See `extract/openmeteo/src/openmeteo/locations.py`.
**Pipeline implementation:** ✅ Ingested
- Extractor: `extract/openmeteo/` — daily run uses forecast API (10-day window); backfill uses archive API (2020present)
- Idempotent: file-existence check per day per location before API call
- Landing: `data/landing/weather/{location_id}/{year}/{date}.json.gz` (one file per location per day)
- Foundation: `foundation.fct_weather_daily` — reads JSON glob, joins with `seeds.weather_locations`, derives boolean crop stress flags (`is_drought`, `is_heat_stress`, `is_high_vpd`, `is_frost`)
- Serving: `serving.weather_daily` — adds rolling aggregates (7d, 30d), temperature anomaly, water balance, drought/heat/VPD streak counters (gaps-and-islands), composite `crop_stress_index` (0100)
---
## 6. Planned Sources
### 6.1 ICE Coffee C — Options Chain
| Field | Value |
|-------|-------|
| Data Type | Per-strike open interest, volume, implied volatility, greeks for KC=F options |
| Use Case | IV term structure, put/call skew, options positioning — leading indicator for futures moves |
| Priority | **High** |
**Key constraint:** KC=F options trade on ICE Futures U.S. (MIC: IFUS), not on OPRA. This eliminates every equity-centric data provider from consideration (yfinance, Intrinio, Polygon.io, ORATS, OpenBB's six providers — all source from OPRA and have zero KC=F options data).
**Option A — CFTC COT Combined report (free, immediate):**
Change the existing extractor URL from `fut_disagg_txt_{year}.zip` to `com_disagg_txt_{year}.zip`. The combined report adds delta-weighted options exposure by trader category alongside the futures positions. Same ZIP format, same CSV schema, same 2006-present history. Does not give per-strike breakdown or implied volatility — it gives the aggregate "what does managed money's total directional exposure look like including their option book." Highest ROI change in the codebase.
**Option B — ICE Report Center settlement data (free, requires login automation):**
`https://www.theice.com/reports/end-of-day` publishes daily options settlement prices by strike and expiry, including the Black-76 implied volatility ICE uses for settlement. Free with a WebICE account (free to register). The site uses cookie-based authentication after T&C acceptance — automatable via an authenticated HTTP session. Gives EOD per-strike IV + OI + settlement price. No bulk historical archive on the free tier; CSV subscriptions are paid.
**Option C — Barchart OnDemand `getFuturesOptionsEOD` (paid, best self-service):**
`https://www.barchart.com/ondemand` provides per-strike agricultural futures options data (IV, OI, volume) via `getFuturesOptions` and `getFuturesOptionsEOD` REST endpoints. Correct instrument coverage; REST-based; integrates well with the existing extraction pattern. Paid commercial subscription required — no self-service pricing published, contact sales.
**Option D — Interactive Brokers TWS API (free with account, live only):**
`ibapi` Python package returns live KC options chain (bid/ask, IV, delta, gamma, theta, vega) for an active IB futures-enabled account. No historical bulk download — live/recent snapshots only. Free if account is active.
| Source | Per-Strike | IV | History | Cost |
|--------|-----------|-----|---------|------|
| CFTC COT Combined | No (aggregate) | No | 2006-present | Free |
| ICE Report Center (EOD settlement) | Yes | Settlement IV | Recent (web) | Free w/ login |
| Barchart OnDemand EOD | Yes | Yes | Deep (unspecified) | Paid |
| ICE Data Services | Yes | Yes (full surface) | Full archive | Institutional |
| yfinance / OpenBB / Polygon.io | None | — | — | N/A |
---
### 6.2 CFTC COT — Options-and-Futures Combined
| Field | Value |
|-------|-------|
| URL | `https://www.cftc.gov/files/dea/history/com_disagg_txt_{year}.zip` |
| Data Type | Same as 1.1 but positions include options delta-equivalent; captures net exposure of option writers |
| Access Method | Public — no auth |
| Priority | **Medium** |
Currently we ingest the futures-only (`fut_disagg`) report. The combined report (`com_disagg`) adjusts for options delta and shows total directional exposure. Adding it would be a minor extractor change: same URL pattern, same CSV schema, different CFTC internal identifier. Could run as a second extractor or a variant flag in the existing one.
---
### 6.3 World Bank Commodity Prices (Pink Sheet)
| Field | Value |
|-------|-------|
| URL | `https://thedocs.worldbank.org/en/doc/18675f1d1639c7a34d463f59255d3f88-0050012023/related/CMO-Pink-Sheet.xlsx` |
| Data Type | Monthly benchmark prices for 70+ commodities including Arabica (Other Milds, NY) and Robusta (ICE London) |
| Access Method | Public Excel download — no auth |
| Update Frequency | Monthly |
| History | 1960 to present |
| Priority | **Medium** |
The Pink Sheet provides monthly Arabica and Robusta price benchmarks alongside other agricultural commodities. Useful for macro context and relative value analysis. Single XLSX, easy to parse.
---
### 6.4 FAO Crop Calendar
| Field | Value |
|-------|-------|
| URL | `https://cropcalendar.apps.fao.org/` |
| Data Type | Coffee planting, flowering, and harvest windows by country |
| Access Method | Public — no auth (manual download or scrape) |
| Priority | **Medium** |
FAO crop calendar provides the seasonal context needed to interpret weather anomalies correctly (e.g., drought during flowering is more damaging than drought post-harvest). Suitable as a one-time seed table per growing region, updated annually if needed.
---
## 7. Reference / Seed Data
All maintained as CSV files in `transform/sqlmesh_materia/seeds/`:
| File | Purpose |
|------|---------|
| `dim_commodity.csv` | Commodity master — code, name, exchange, unit |
| `psd_commodity_codes.csv` | USDA PSD commodity code lookup |
| `psd_attribute_codes.csv` | USDA PSD attribute code lookup (production, stocks, etc.) |
| `psd_unit_of_measure_codes.csv` | USDA PSD unit code lookup |
| `commodity_exchange_codes.csv` | Exchange code mapping |
| `psd_codes_exchange_codes_merge.csv` | Join table linking PSD codes to exchange codes |
| `weather_locations.csv` | Open-Meteo location metadata (id, name, country, lat, lon, variety) |