Files
beanflows/docs/data-sources-inventory.md
Deeman b25b8780a7 docs: update inventory with ICE options research findings
- yfinance confirmed not viable (OPRA only, KC=F not covered)
- CFTC COT combined report is the free immediate path (URL change only)
- ICE Report Center settlement data viable with WebICE login automation
- Barchart OnDemand has correct coverage but requires paid subscription
- All OpenBB providers, Polygon.io, Nasdaq Data Link confirmed no KC=F coverage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-26 10:16:50 +01:00

18 KiB
Raw Permalink Blame History

BeanFlows — Data Sources Inventory

Compiled: 2026-02-26 Purpose: Identify and track data sources feeding the BeanFlows DuckDB analytics pipeline.


Pipeline Status Tracker

Status: Ingested — extractor + model live in master | 🔲 Planned — worth building | ⏸ On hold — blocked on cost/access | — Not targeted

Score (15): Overall ingestion priority. Weighs data value to BeanFlows (price analytics, COT positioning, crop weather, PSD fundamentals) against implementation effort and access barriers. 5 = core infrastructure already ingested, 1 = marginal or inaccessible.

Source Category Status Score Credentials Pipeline refs
CFTC COT Disaggregated Futures Positioning Ingested 5 None extract_cotfct_cot_positioningserving.cot_positioning
Yahoo Finance — KC=F Price Ingested 5 None extract_coffee_pricesfct_coffee_pricesserving.coffee_prices
ICE Report Center — warehouse stocks Warehouse / Inventory Ingested 5 None extract_ice_stocksfct_ice_warehouse_stocksserving.ice_warehouse_stocks
ICE Report Center — stocks by port Warehouse / Inventory Ingested 4 None extract_ice_stocksfct_ice_warehouse_stocks_by_portserving.ice_warehouse_stocks_by_port
ICE Report Center — aging stocks Warehouse / Inventory Ingested 4 None extract_ice_stocksfct_ice_aging_stocksserving.ice_aging_stocks
USDA PSD Online Fundamentals (supply/demand) Ingested 5 None extract_psdstg_psdalldata__commodityserving.commodity_metrics
Open-Meteo ERA5 — weather Crop weather Ingested 5 None extract_openmeteofct_weather_dailyserving.weather_daily
CFTC COT Combined (futures+options) Positioning 🔲 Planned 4 None (same ZIP pattern) Extend extract_cotfct_cot_combined variant
ICE Report Center — options settlement Derivatives / Volatility 🔲 Planned 3 WebICE account (free) extract_ice_optionsfct_ice_options_settlement
Barchart OnDemand — options EOD Derivatives / Volatility ⏸ On hold 3 Paid subscription (contact sales) extract_barchart_optionsfct_options_chain
World Bank Commodity Prices (Pink Sheet) Benchmark prices 🔲 Planned 3 None extract_wb_pricesfct_wb_prices
FAO Crop Calendar Seasonality 🔲 Planned 3 None (CSV) Seed table
Freight / C4 route rates Supply chain 🔲 Planned 2 None (scrape) fct_freight_rates
ICE Data Services — tick data Price (granular) ⏸ On hold 2 Paid subscription Commercial; not needed for daily analytics
Refinitiv / LSEG Price / Fundamentals 1 Enterprise subscription Superseded by free ICE + CFTC + USDA sources
Bloomberg Terminal Price / News 1 Terminal license Not cost-effective for current scope

1. Positioning Data

1.1 CFTC COT Disaggregated Futures

Field Value
URL https://www.cftc.gov/files/dea/history/fut_disagg_txt_{year}.zip
Data Type Weekly futures-only positioning by trader category (Producer/Merchant, Swap Dealer, Managed Money, Other Reportable, Non-Reportable)
Access Method Public download — no auth, no API key
Update Frequency Weekly (Friday 3:30 PM ET); current-year file updated in-place
History 2006-06-13 to present
License / TOS US government data — public domain
Priority Core

The CFTC publishes the Disaggregated Futures-Only report as one ZIP per year, containing a single CSV with all commodity codes. Each file is ~330 MB. The current-year file is overwritten each Friday; prior years are static.

Column quirk: Swap__Positions_Short_All and Swap__Positions_Spread_All use double underscores — this is a CFTC data quality issue (not a typo). All other swap columns use single underscores. DuckDB all_varchar = TRUE preserves exact header names; these columns must be quoted in SQL.

Pipeline implementation: Ingested

  • Extractor: extract/cftc_cot/ — backfills all years from 2006, idempotent via etag (synthetic etag = year + content-length + last-modified hash when CFTC omits etag header)
  • Landing: data/landing/cot/{year}/{etag}.csv.gzip
  • Foundation: foundation.fct_cot_positioning — casts types, cleans names, computes net positions (long short), deduplicates via HASH key
  • Grain: (cftc_commodity_code, report_date, cftc_contract_market_code, ingest_date)
  • Serving: serving.cot_positioning — adds COT index (normalized percentile rank over 26w / 52w rolling window), managed money net % of OI
  • Covers all commodity codes in the report — filtering to coffee (073642) happens in the serving layer

Related: CFTC COT Options-and-Futures Combined

The same URL pattern with com_disagg_txt_{year}.zip gives the combined futures+options report. We currently use the futures-only report. Adding the combined variant would enable options-specific positioning analysis (see Section 5).


2. Price Data

2.1 Yahoo Finance — Coffee C (KC=F)

Field Value
URL Yahoo Finance via yfinance Python library; ticker KC=F
Data Type Daily OHLCV + adjusted close for Coffee C continuous front-month futures
Access Method yfinance.Ticker("KC=F").history(period="max") — free, no auth
Update Frequency Daily (post-settlement, typically ~30 min after session close)
History 1971-08-16 to present
License / TOS Yahoo Finance ToS — data for personal/non-commercial use; not for redistribution
Priority Core

Yahoo Finance is the only free source for Coffee C daily OHLCV with full history back to 1971. Data quality is generally good for daily analytics; occasional gaps on non-US holidays. Adjusted close (Adj Close, note space in header) accounts for contract rolls.

Column quirk: Adj Close has a space in the CSV header. DuckDB all_varchar = TRUE preserves this; must be quoted as "Adj Close" in SQL.

Pipeline implementation: Ingested

  • Extractor: extract/coffee_prices/ — downloads full history via ticker.history(period="max"), idempotent via SHA256 of CSV bytes
  • Landing: data/landing/prices/coffee_kc/{hash8}.csv.gzip (single file; hash changes when new trading days are appended)
  • Foundation: foundation.fct_coffee_prices — casts, deduplicates via HASH(Date, Close)
  • Grain: trade_date
  • Serving: serving.coffee_prices — adds daily return, SMA 20/50/200, EMA 9/21, Bollinger Bands (20d, ±2σ), RSI 14, 52-week high/low/range

3. Warehouse & Inventory Data

3.1 ICE Report Center — Warehouse Stocks

Field Value
URL https://www.theice.com/publicdocs/futures_us/exchange_notices/coffee_certifiedstocks.csv (rolling) + https://www.ice.com/marketdata/api/reports/293/results (API, product_id=2)
Data Type Daily ICE-certified and pending-grading coffee bags (total, by port, by age bucket)
Access Method Public — no auth required. Static CSV for rolling data; private JSON API for historical report catalogue
Update Frequency Daily (trading days) for stocks; monthly for aging report
History Full archive available via report API (~2010 to present); static CSV is rolling
License / TOS ICE — public market data
Priority Core

Three distinct datasets served by one extractor:

  1. Daily warehouse stocks (ice_stocks) — total certified bags + pending grading. Key supply constraint indicator.
  2. Stocks by port (ice_stocks_by_port) — breakdown across NY, New Orleans, Houston, Miami, Antwerp, Hamburg/Bremen, Barcelona, Virginia. Port-level flow analysis.
  3. Aging stocks (ice_aging) — bags grouped by age bucket (e.g., "0 to 30", "31 to 60" days). Older stocks command quality discounts; aging ratio is a quality/supply stress signal.

The report API is undocumented but stable. Reports are discovered via POST /api/reports/293/results with productId=2, paginated. XLS/XLSX files are parsed with xlrd; the extractor handles both OLE2 .xls and modern .xlsx formats via magic-byte detection.

Pipeline implementation: Ingested

  • Extractor: extract/ice_stocks/ — idempotent via SHA256 of content
  • Landing:
    • data/landing/ice_stocks/{year}/{date}_{hash8}.csv.gzip
    • data/landing/ice_aging/{year}/{date}_{hash8}.csv.gzip
    • data/landing/ice_stocks_by_port/{year}/{date}_{hash8}.csv.gzip
  • Foundation: fct_ice_warehouse_stocks, fct_ice_aging_stocks, fct_ice_warehouse_stocks_by_port
  • Serving: corresponding serving.* models with WoW change, 30d/52w rolling averages, drawdown from 52w high, age-bucket share

4. Fundamentals Data

4.1 USDA PSD Online — Production, Supply, and Distribution

Field Value
URL https://apps.fas.usda.gov/psdonline/downloads/archives/{year}/{month:02d}/psd_alldata_csv.zip
Data Type Monthly supply/demand balances by commodity × country × market year: production, imports, exports, consumption, ending stocks
Access Method Public download — no auth
Update Frequency Monthly (WASDE report release dates, ~11th of each month)
History 2006-08 to present (archive); current year is always available
License / TOS USDA FAS — US government open data
Priority Core

PSD is the primary source for global coffee supply/demand fundamentals. Each monthly file contains all commodities (not just coffee) and all reporting countries for all market years. The coffee commodity code is 0721100 (green bean equivalent). Market year for coffee runs OctoberSeptember.

Key attributes tracked:

  • AREA_HARVESTED (ha), PRODUCTION (1000 MT or 60kg bags), DOMESTIC_CONSUMPTION, EXPORTS, ENDING_STOCKS, STOCKS_TO_USE_RATIO_

Pipeline implementation: Ingested

  • Extractor: extract/psdonline/ — backfills from 2006-08, idempotent via etag
  • Landing: data/landing/psd/{year}/{month:02d}/{etag}.csv.gzip
  • Staging: staging.stg_psdalldata__commodity — joins with seed tables for commodity/attribute/unit metadata; cleaned.psdalldata__commodity_pivoted — pivots attributes to wide format
  • Seeds: psd_commodity_codes.csv, psd_attribute_codes.csv, psd_unit_of_measure_codes.csv
  • Serving: serving.commodity_metrics — coffee and cocoa supply/demand balances, production growth YoY, stock-to-use ratio

5. Weather & Climate Data

5.1 Open-Meteo — ERA5 Reanalysis + Forecast Blend

Field Value
URL Archive: https://archive-api.open-meteo.com/v1/archive · Forecast: https://api.open-meteo.com/v1/forecast
Data Type Daily weather for 12 coffee-growing regions: temperature (min/max/mean), precipitation, wind, humidity, cloud cover, ET₀, VPD
Access Method Free API — no key, no registration
Update Frequency Daily; ERA5 reanalysis available to ~5 days ago, gap filled by forecast API
History ERA5 archive from 1940; pipeline backfilled from 2020-01-01
License / TOS CC BY 4.0 — attribution required
Rate Limiting No published rate limit; community API. Sleep 0.5s between location calls. Pre-check file existence to skip API calls on re-runs.
Priority Core

Open-Meteo wraps ECMWF ERA5 reanalysis data, which is the scientific standard for historical weather. The API requires no key and has no formal rate limit for reasonable usage (~12 calls/day for daily updates).

Variables fetched:

  • temperature_2m_max/min/mean — frost detection (<5°C), heat stress (>30°C)
  • precipitation_sum — drought and flood signals
  • wind_speed_10m_max — wind damage proxy
  • relative_humidity_2m_max — disease pressure (coffee leaf rust, CBD)
  • cloud_cover_mean — solar radiation proxy
  • et0_fao_evapotranspiration — crop water demand (Penman-Monteith)
  • vapour_pressure_deficit_max — transpiration stress (>1.5 kPa = significant stress)

12 locations covering the world's primary Arabica and Robusta growing zones (BR ×3, VN, CO, ET, HN, GT, ID, PE, UG, CI). See extract/openmeteo/src/openmeteo/locations.py.

Pipeline implementation: Ingested

  • Extractor: extract/openmeteo/ — daily run uses forecast API (10-day window); backfill uses archive API (2020present)
  • Idempotent: file-existence check per day per location before API call
  • Landing: data/landing/weather/{location_id}/{year}/{date}.json.gz (one file per location per day)
  • Foundation: foundation.fct_weather_daily — reads JSON glob, joins with seeds.weather_locations, derives boolean crop stress flags (is_drought, is_heat_stress, is_high_vpd, is_frost)
  • Serving: serving.weather_daily — adds rolling aggregates (7d, 30d), temperature anomaly, water balance, drought/heat/VPD streak counters (gaps-and-islands), composite crop_stress_index (0100)

6. Planned Sources

6.1 ICE Coffee C — Options Chain

Field Value
Data Type Per-strike open interest, volume, implied volatility, greeks for KC=F options
Use Case IV term structure, put/call skew, options positioning — leading indicator for futures moves
Priority High

Key constraint: KC=F options trade on ICE Futures U.S. (MIC: IFUS), not on OPRA. This eliminates every equity-centric data provider from consideration (yfinance, Intrinio, Polygon.io, ORATS, OpenBB's six providers — all source from OPRA and have zero KC=F options data).

Option A — CFTC COT Combined report (free, immediate): Change the existing extractor URL from fut_disagg_txt_{year}.zip to com_disagg_txt_{year}.zip. The combined report adds delta-weighted options exposure by trader category alongside the futures positions. Same ZIP format, same CSV schema, same 2006-present history. Does not give per-strike breakdown or implied volatility — it gives the aggregate "what does managed money's total directional exposure look like including their option book." Highest ROI change in the codebase.

Option B — ICE Report Center settlement data (free, requires login automation): https://www.theice.com/reports/end-of-day publishes daily options settlement prices by strike and expiry, including the Black-76 implied volatility ICE uses for settlement. Free with a WebICE account (free to register). The site uses cookie-based authentication after T&C acceptance — automatable via an authenticated HTTP session. Gives EOD per-strike IV + OI + settlement price. No bulk historical archive on the free tier; CSV subscriptions are paid.

Option C — Barchart OnDemand getFuturesOptionsEOD (paid, best self-service): https://www.barchart.com/ondemand provides per-strike agricultural futures options data (IV, OI, volume) via getFuturesOptions and getFuturesOptionsEOD REST endpoints. Correct instrument coverage; REST-based; integrates well with the existing extraction pattern. Paid commercial subscription required — no self-service pricing published, contact sales.

Option D — Interactive Brokers TWS API (free with account, live only): ibapi Python package returns live KC options chain (bid/ask, IV, delta, gamma, theta, vega) for an active IB futures-enabled account. No historical bulk download — live/recent snapshots only. Free if account is active.

Source Per-Strike IV History Cost
CFTC COT Combined No (aggregate) No 2006-present Free
ICE Report Center (EOD settlement) Yes Settlement IV Recent (web) Free w/ login
Barchart OnDemand EOD Yes Yes Deep (unspecified) Paid
ICE Data Services Yes Yes (full surface) Full archive Institutional
yfinance / OpenBB / Polygon.io None N/A

6.2 CFTC COT — Options-and-Futures Combined

Field Value
URL https://www.cftc.gov/files/dea/history/com_disagg_txt_{year}.zip
Data Type Same as 1.1 but positions include options delta-equivalent; captures net exposure of option writers
Access Method Public — no auth
Priority Medium

Currently we ingest the futures-only (fut_disagg) report. The combined report (com_disagg) adjusts for options delta and shows total directional exposure. Adding it would be a minor extractor change: same URL pattern, same CSV schema, different CFTC internal identifier. Could run as a second extractor or a variant flag in the existing one.


6.3 World Bank Commodity Prices (Pink Sheet)

Field Value
URL https://thedocs.worldbank.org/en/doc/18675f1d1639c7a34d463f59255d3f88-0050012023/related/CMO-Pink-Sheet.xlsx
Data Type Monthly benchmark prices for 70+ commodities including Arabica (Other Milds, NY) and Robusta (ICE London)
Access Method Public Excel download — no auth
Update Frequency Monthly
History 1960 to present
Priority Medium

The Pink Sheet provides monthly Arabica and Robusta price benchmarks alongside other agricultural commodities. Useful for macro context and relative value analysis. Single XLSX, easy to parse.


6.4 FAO Crop Calendar

Field Value
URL https://cropcalendar.apps.fao.org/
Data Type Coffee planting, flowering, and harvest windows by country
Access Method Public — no auth (manual download or scrape)
Priority Medium

FAO crop calendar provides the seasonal context needed to interpret weather anomalies correctly (e.g., drought during flowering is more damaging than drought post-harvest). Suitable as a one-time seed table per growing region, updated annually if needed.


7. Reference / Seed Data

All maintained as CSV files in transform/sqlmesh_materia/seeds/:

File Purpose
dim_commodity.csv Commodity master — code, name, exchange, unit
psd_commodity_codes.csv USDA PSD commodity code lookup
psd_attribute_codes.csv USDA PSD attribute code lookup (production, stocks, etc.)
psd_unit_of_measure_codes.csv USDA PSD unit code lookup
commodity_exchange_codes.csv Exchange code mapping
psd_codes_exchange_codes_merge.csv Join table linking PSD codes to exchange codes
weather_locations.csv Open-Meteo location metadata (id, name, country, lat, lon, variety)