BeanFlows — Data Sources Inventory
Compiled: 2026-02-26

Purpose: Identify and track data sources feeding the BeanFlows DuckDB analytics pipeline.
Pipeline Status Tracker
Status: ✅ Ingested — extractor + model live in master | 🔲 Planned — worth building | ⏸ On hold — blocked on cost/access | — Not targeted
Score (1–5): Overall ingestion priority. Weighs data value to BeanFlows (price analytics, COT positioning, crop weather, PSD fundamentals) against implementation effort and access barriers. 5 = core infrastructure already ingested, 1 = marginal or inaccessible.
| Source | Category | Status | Score | Credentials | Pipeline refs |
|---|---|---|---|---|---|
| CFTC COT Disaggregated Futures | Positioning | ✅ Ingested | 5 | None | extract_cot → fct_cot_positioning → serving.cot_positioning |
| Yahoo Finance — KC=F | Price | ✅ Ingested | 5 | None | extract_coffee_prices → fct_coffee_prices → serving.coffee_prices |
| ICE Report Center — warehouse stocks | Warehouse / Inventory | ✅ Ingested | 5 | None | extract_ice_stocks → fct_ice_warehouse_stocks → serving.ice_warehouse_stocks |
| ICE Report Center — stocks by port | Warehouse / Inventory | ✅ Ingested | 4 | None | extract_ice_stocks → fct_ice_warehouse_stocks_by_port → serving.ice_warehouse_stocks_by_port |
| ICE Report Center — aging stocks | Warehouse / Inventory | ✅ Ingested | 4 | None | extract_ice_stocks → fct_ice_aging_stocks → serving.ice_aging_stocks |
| USDA PSD Online | Fundamentals (supply/demand) | ✅ Ingested | 5 | None | extract_psd → stg_psdalldata__commodity → serving.commodity_metrics |
| Open-Meteo ERA5 — weather | Crop weather | ✅ Ingested | 5 | None | extract_openmeteo → fct_weather_daily → serving.weather_daily |
| ICE Coffee C — options chain | Derivatives / Volatility | 🔲 Planned | 4 | None (yfinance) or paid | TBD |
| CFTC COT — options-and-futures combined | Positioning | 🔲 Planned | 3 | None (same ZIP) | fct_cot_positioning variant |
| World Bank Commodity Prices (Pink Sheet) | Benchmark prices | 🔲 Planned | 3 | None | extract_wb_prices → fct_wb_prices |
| FAO Crop Calendar | Seasonality | 🔲 Planned | 3 | None (CSV) | Seed table |
| Freight / C4 route rates | Supply chain | 🔲 Planned | 2 | None (scrape) | fct_freight_rates |
| ICE Data Services — tick data | Price (granular) | ⏸ On hold | 2 | Paid subscription | Commercial; not needed for daily analytics |
| Refinitiv / LSEG | Price / Fundamentals | — | 1 | Enterprise subscription | Superseded by free ICE + CFTC + USDA sources |
| Bloomberg Terminal | Price / News | — | 1 | Terminal license | Not cost-effective for current scope |
1. Positioning Data
1.1 CFTC COT Disaggregated Futures
| Field | Value |
|---|---|
| URL | https://www.cftc.gov/files/dea/history/fut_disagg_txt_{year}.zip |
| Data Type | Weekly futures-only positioning by trader category (Producer/Merchant, Swap Dealer, Managed Money, Other Reportable, Non-Reportable) |
| Access Method | Public download — no auth, no API key |
| Update Frequency | Weekly (Friday 3:30 PM ET); current-year file updated in-place |
| History | 2006-06-13 to present |
| License / TOS | US government data — public domain |
| Priority | Core |
The CFTC publishes the Disaggregated Futures-Only report as one ZIP per year, containing a single CSV with all commodity codes. Each file is ~3–30 MB. The current-year file is overwritten each Friday; prior years are static.
Column quirk: Swap__Positions_Short_All and Swap__Positions_Spread_All use double underscores — this is a CFTC data quality issue (not a typo). All other swap columns use single underscores. DuckDB all_varchar = TRUE preserves exact header names; these columns must be quoted in SQL.
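The quoting requirement can be handled generically rather than column by column. The following is a minimal sketch, assuming standard SQL identifier quoting (wrap in double quotes, double any embedded quote). `quote_ident`, `build_select`, and the table name `raw_cot` are hypothetical and not part of the pipeline.

```python
def quote_ident(name: str) -> str:
    """Quote a name as a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select(columns, table):
    """Build a SELECT that references every column exactly as it appears in the header."""
    cols = ", ".join(quote_ident(c) for c in columns)
    return f"SELECT {cols} FROM {quote_ident(table)}"


# Column names are the two quirky headers named above; "raw_cot" is a placeholder.
sql = build_select(
    ["Swap__Positions_Short_All", "Swap__Positions_Spread_All"],
    "raw_cot",
)
print(sql)
```

Quoting every header the same way avoids special-casing the two double-underscore columns.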
Pipeline implementation: ✅ Ingested
- Extractor: `extract/cftc_cot/` backfills all years from 2006; idempotent via etag (synthetic etag = year + content-length + last-modified hash when CFTC omits the etag header)
- Landing: `data/landing/cot/{year}/{etag}.csv.gzip`
- Foundation: `foundation.fct_cot_positioning` casts types, cleans names, computes net positions (long − short), and deduplicates via HASH key
- Grain: `(cftc_commodity_code, report_date, cftc_contract_market_code, ingest_date)`
- Serving: `serving.cot_positioning` adds the COT index (normalized percentile rank over 26w / 52w rolling window) and managed money net % of OI
- Coverage: all commodity codes in the report are kept; filtering to coffee (`073642`) happens in the serving layer
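The net-position and COT-index steps can be illustrated as follows. This is a sketch under a stated assumption: it uses the conventional min-max COT index over a trailing window, which may differ from the "normalized percentile rank" the serving model actually computes. The function names and sample values are hypothetical.

```python
from __future__ import annotations


def net_position(long_pos: float, short_pos: float) -> float:
    """Net position as described above: long minus short."""
    return long_pos - short_pos


def cot_index(net: list[float], window: int = 26) -> list[float | None]:
    """Trailing min-max index in [0, 100]; None until a full window is available."""
    out: list[float | None] = []
    for i in range(len(net)):
        if i + 1 < window:
            out.append(None)
            continue
        w = net[i + 1 - window : i + 1]
        lo, hi = min(w), max(w)
        # A flat window has no range; report the midpoint instead of dividing by zero.
        out.append(50.0 if hi == lo else 100.0 * (net[i] - lo) / (hi - lo))
    return out


# Made-up weekly net positions with a 3-week window for readability.
print(cot_index([10, 20, 30, 20], window=3))  # [None, None, 100.0, 0.0]
```

A reading near 100 means the current net position sits at the top of its recent range, and near 0 at the bottom.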
Related: CFTC COT Options-and-Futures Combined
The same URL pattern with com_disagg_txt_{year}.zip gives the combined futures+options report. We currently use the futures-only report. Adding the combined variant would enable options-specific positioning analysis (see Section 6.2).
2. Price Data
2.1 Yahoo Finance — Coffee C (KC=F)
| Field | Value |
|---|---|
| URL | Yahoo Finance via yfinance Python library; ticker KC=F |
| Data Type | Daily OHLCV + adjusted close for Coffee C continuous front-month futures |
| Access Method | yfinance.Ticker("KC=F").history(period="max") — free, no auth |
| Update Frequency | Daily (post-settlement, typically ~30 min after session close) |
| History | 1971-08-16 to present |
| License / TOS | Yahoo Finance ToS — data for personal/non-commercial use; not for redistribution |
| Priority | Core |
Yahoo Finance is the only free source for Coffee C daily OHLCV with full history back to 1971. Data quality is generally good for daily analytics, with occasional gaps on non-US holidays. Adjusted close (Adj Close) accounts for contract rolls.
Column quirk: Adj Close has a space in the CSV header. DuckDB all_varchar = TRUE preserves this; must be quoted as "Adj Close" in SQL.
Pipeline implementation: ✅ Ingested
- Extractor: `extract/coffee_prices/` downloads full history via `ticker.history(period="max")`; idempotent via SHA256 of CSV bytes
- Landing: `data/landing/prices/coffee_kc/{hash8}.csv.gzip` (single file; hash changes when new trading days are appended)
- Foundation: `foundation.fct_coffee_prices` casts types and deduplicates via `HASH(Date, Close)`
- Grain: `trade_date`
- Serving: `serving.coffee_prices` adds daily return, SMA 20/50/200, EMA 9/21, Bollinger Bands (20d, ±2σ), RSI 14, 52-week high/low/range
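Content-hash idempotency of the kind described for the landing step might look like the sketch below. It assumes `{hash8}` means the first eight hex characters of the SHA256 digest, which the text does not state explicitly. `land_csv` is a hypothetical helper, not the extractor's actual code.

```python
from __future__ import annotations

import gzip
import hashlib
from pathlib import Path


def land_csv(csv_bytes: bytes, landing_dir: Path) -> tuple[Path, bool]:
    """Write gzip-compressed CSV under a content-derived name.

    Returns (path, written); written is False when identical content
    was already landed, so re-runs on unchanged data are no-ops.
    """
    hash8 = hashlib.sha256(csv_bytes).hexdigest()[:8]  # assumed meaning of {hash8}
    path = landing_dir / f"{hash8}.csv.gzip"
    if path.exists():
        return path, False
    landing_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(gzip.compress(csv_bytes))
    return path, True
```

Because the file name is derived from the bytes, a download that adds a new trading day produces a new file, while an unchanged download is skipped.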
3. Warehouse & Inventory Data
3.1 ICE Report Center — Warehouse Stocks
| Field | Value |
|---|---|
| URL | https://www.theice.com/publicdocs/futures_us/exchange_notices/coffee_certifiedstocks.csv (rolling) + https://www.ice.com/marketdata/api/reports/293/results (API, product_id=2) |
| Data Type | Daily ICE-certified and pending-grading coffee bags (total, by port, by age bucket) |
| Access Method | Public, no auth required. Static CSV for rolling data; undocumented JSON API for historical report catalogue |
| Update Frequency | Daily (trading days) for stocks; monthly for aging report |
| History | Full archive available via report API (~2010 to present); static CSV is rolling |
| License / TOS | ICE — public market data |
| Priority | Core |
Three distinct datasets served by one extractor:
- Daily warehouse stocks (`ice_stocks`): total certified bags + pending grading. Key supply constraint indicator.
- Stocks by port (`ice_stocks_by_port`): breakdown across NY, New Orleans, Houston, Miami, Antwerp, Hamburg/Bremen, Barcelona, Virginia. Port-level flow analysis.
- Aging stocks (`ice_aging`): bags grouped by age bucket (e.g., "0 to 30", "31 to 60" days). Older stocks command quality discounts; aging ratio is a quality/supply stress signal.
The report API is undocumented but stable. Reports are discovered via POST /api/reports/293/results with productId=2, paginated. XLS/XLSX files are parsed with xlrd; the extractor handles both OLE2 .xls and modern .xlsx formats via magic-byte detection.
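Magic-byte detection of the two spreadsheet formats can be sketched as below. The signatures are the standard ones (OLE2 compound file for `.xls`, ZIP local-file header for `.xlsx`). `sniff_excel_format` is a hypothetical name and not necessarily how the extractor is structured. Note that xlrd 2.0 and later read only `.xls`, so a separate reader would be needed for `.xlsx` on recent versions.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"  # .xlsx is a ZIP container


def sniff_excel_format(content: bytes) -> str:
    """Classify a downloaded report by its leading bytes, ignoring the file extension."""
    if content.startswith(OLE2_MAGIC):
        return "xls"
    if content.startswith(ZIP_MAGIC):
        return "xlsx"
    return "unknown"
```

Sniffing the bytes is more robust than trusting the extension, since a report may be served with a name that does not match its actual format.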
Pipeline implementation: ✅ Ingested
- Extractor: `extract/ice_stocks/`, idempotent via SHA256 of content
- Landing:
  - `data/landing/ice_stocks/{year}/{date}_{hash8}.csv.gzip`
  - `data/landing/ice_aging/{year}/{date}_{hash8}.csv.gzip`
  - `data/landing/ice_stocks_by_port/{year}/{date}_{hash8}.csv.gzip`
- Foundation: `fct_ice_warehouse_stocks`, `fct_ice_aging_stocks`, `fct_ice_warehouse_stocks_by_port`
- Serving: corresponding `serving.*` models with WoW change, 30d/52w rolling averages, drawdown from 52w high, age-bucket share
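Two of the serving metrics, week-over-week change and drawdown from the 52-week high, can be sketched in plain Python. The window sizes are assumptions (5 rows per week, 252 trading days per year), and the serving models may define them differently, for example by calendar date.

```python
from __future__ import annotations


def wow_change(series: list[float], lag: int = 5) -> list[float | None]:
    """Change versus the observation `lag` rows earlier (5 trading days assumed per week)."""
    return [None if i < lag else series[i] - series[i - lag] for i in range(len(series))]


def drawdown_from_high(series: list[float], window: int = 252) -> list[float]:
    """Fractional distance below the trailing-window maximum; 0.0 at a new high."""
    out = []
    for i, value in enumerate(series):
        high = max(series[max(0, i + 1 - window) : i + 1])
        out.append((value - high) / high if high else 0.0)
    return out


# Made-up bag counts with short windows for readability.
print(wow_change([100, 90, 95], lag=1))               # [None, -10, 5]
print(drawdown_from_high([100, 80, 120], window=3))   # [0.0, -0.2, 0.0]
```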
4. Fundamentals Data
4.1 USDA PSD Online — Production, Supply, and Distribution
| Field | Value |
|---|---|
| URL | https://apps.fas.usda.gov/psdonline/downloads/archives/{year}/{month:02d}/psd_alldata_csv.zip |
| Data Type | Monthly supply/demand balances by commodity × country × market year: production, imports, exports, consumption, ending stocks |
| Access Method | Public download — no auth |
| Update Frequency | Monthly (WASDE report release dates, ~11th of each month) |
| History | 2006-08 to present (archive); current year is always available |
| License / TOS | USDA FAS — US government open data |
| Priority | Core |
PSD is the primary source for global coffee supply/demand fundamentals. Each monthly file contains all commodities (not just coffee) and all reporting countries for all market years. The coffee commodity code is 0721100 (green bean equivalent). Market year for coffee runs October–September.
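Because each file mixes all commodities, coffee rows have to be filtered by code, and the code must be compared as a string so its leading zero survives. A minimal sketch, assuming the CSV header names the code column `Commodity_Code` (an assumption, not confirmed by this document); the sample rows are made up.

```python
from __future__ import annotations

import csv
import io

COFFEE_CODE = "0721100"  # from the text above; kept as a string to preserve the leading zero


def coffee_rows(csv_text: str, code_column: str = "Commodity_Code") -> list[dict[str, str]]:
    """Return only the rows whose commodity code matches coffee."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[code_column] == COFFEE_CODE]


# Placeholder data: the second code is not a real PSD code.
sample = "Commodity_Code,Value\n0721100,1\n9999999,2\n"
print(coffee_rows(sample))

# Casting to an integer silently drops the leading zero.
print(str(int(COFFEE_CODE)))  # 721100
```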
Key attributes tracked:
`AREA_HARVESTED` (ha), `PRODUCTION` (1000 MT or 60kg bags), `DOMESTIC_CONSUMPTION`, `EXPORTS`, `ENDING_STOCKS`, `STOCKS_TO_USE_RATIO_`
Pipeline implementation: ✅ Ingested
- Extractor: `extract/psdonline/` backfills from 2006-08; idempotent via etag
- Landing: `data/landing/psd/{year}/{month:02d}/{etag}.csv.gzip`
- Staging: `staging.stg_psdalldata__commodity` joins with seed tables for commodity/attribute/unit metadata; `cleaned.psdalldata__commodity_pivoted` pivots attributes to wide format
- Seeds: `psd_commodity_codes.csv`, `psd_attribute_codes.csv`, `psd_unit_of_measure_codes.csv`
- Serving: `serving.commodity_metrics` provides coffee and cocoa supply/demand balances, production growth YoY, stock-to-use ratio
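The two derived serving metrics can be sketched as follows. The denominator is an assumption: total use is taken as domestic consumption plus exports, a common definition, but the exact formula in `serving.commodity_metrics` is not stated here. Function names and inputs are hypothetical.

```python
from __future__ import annotations


def stocks_to_use(ending_stocks: float, domestic_consumption: float, exports: float) -> float | None:
    """Ending stocks divided by total use (assumed: consumption + exports)."""
    total_use = domestic_consumption + exports
    return ending_stocks / total_use if total_use else None


def yoy_growth(values: list[float]) -> list[float | None]:
    """Fractional change versus the previous market year; None where undefined."""
    out: list[float | None] = [None] if values else []
    for prev, cur in zip(values, values[1:]):
        out.append((cur - prev) / prev if prev else None)
    return out


# Made-up figures, in any consistent unit.
print(stocks_to_use(30, 100, 50))
print(yoy_growth([100, 110, 99]))
```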
5. Weather & Climate Data
5.1 Open-Meteo — ERA5 Reanalysis + Forecast Blend
| Field | Value |
|---|---|
| URL | Archive: https://archive-api.open-meteo.com/v1/archive · Forecast: https://api.open-meteo.com/v1/forecast |
| Data Type | Daily weather for 12 coffee-growing regions: temperature (min/max/mean), precipitation, wind, humidity, cloud cover, ET₀, VPD |
| Access Method | Free API — no key, no registration |
| Update Frequency | Daily; ERA5 reanalysis available to ~5 days ago, gap filled by forecast API |
| History | ERA5 archive from 1940; pipeline backfilled from 2020-01-01 |
| License / TOS | CC BY 4.0 — attribution required |
| Rate Limiting | No published rate limit; community API. Sleep 0.5s between location calls. Pre-check file existence to skip API calls on re-runs. |
| Priority | Core |
Open-Meteo wraps ECMWF ERA5 reanalysis data, which is the scientific standard for historical weather. The API requires no key and has no formal rate limit for reasonable usage (~12 calls/day for daily updates).
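The polite-client behaviour (skip files that already exist, pause between calls) can be sketched like this. The path layout follows the landing pattern documented for this source. `fetch` is a hypothetical callable standing in for the real API request and is assumed to return already-compressed bytes.

```python
import time
from datetime import date
from pathlib import Path


def landing_path(root: Path, location_id: str, day: date) -> Path:
    """Build {root}/{location_id}/{year}/{date}.json.gz for one location and day."""
    return root / location_id / str(day.year) / f"{day.isoformat()}.json.gz"


def fetch_missing(root, location_ids, day, fetch, pause_s=0.5):
    """Call fetch(location_id, day) only for files not yet landed; return the ids fetched."""
    fetched = []
    for location_id in location_ids:
        path = landing_path(root, location_id, day)
        if path.exists():
            continue  # already landed: no API call and no sleep
        payload = fetch(location_id, day)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(payload)
        fetched.append(location_id)
        time.sleep(pause_s)  # spacing between location calls
    return fetched
```

Checking for the file before calling the API means a re-run after a partial failure only requests the locations that are still missing.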
Variables fetched:
- `temperature_2m_max/min/mean`: frost detection (<5°C), heat stress (>30°C)
- `precipitation_sum`: drought and flood signals
- `wind_speed_10m_max`: wind damage proxy
- `relative_humidity_2m_max`: disease pressure (coffee leaf rust, CBD)
- `cloud_cover_mean`: solar radiation proxy
- `et0_fao_evapotranspiration`: crop water demand (Penman-Monteith)
- `vapour_pressure_deficit_max`: transpiration stress (>1.5 kPa = significant stress)
12 locations covering the world's primary Arabica and Robusta growing zones (BR ×3, VN, CO, ET, HN, GT, ID, PE, UG, CI). See extract/openmeteo/src/openmeteo/locations.py.
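A request for one location against the archive endpoint can be assembled as in the sketch below. The query parameters (`latitude`, `longitude`, `start_date`, `end_date`, `daily`) are the Open-Meteo ones, and the variable names are the ones listed above. The coordinates are placeholders, not values from `locations.py`, and no network call is made.

```python
from urllib.parse import urlencode

ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"

DAILY_VARIABLES = [
    "temperature_2m_max", "temperature_2m_min", "temperature_2m_mean",
    "precipitation_sum", "wind_speed_10m_max", "relative_humidity_2m_max",
    "cloud_cover_mean", "et0_fao_evapotranspiration", "vapour_pressure_deficit_max",
]


def archive_url(lat: float, lon: float, start: str, end: str) -> str:
    """Build the archive request URL for one location and an inclusive ISO date range."""
    params = {
        "latitude": lat,
        "longitude": lon,
        "start_date": start,
        "end_date": end,
        "daily": ",".join(DAILY_VARIABLES),
    }
    return f"{ARCHIVE_URL}?{urlencode(params)}"


# Placeholder coordinates for illustration only.
url = archive_url(-21.0, -45.0, "2020-01-01", "2020-01-31")
print(url)
```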
Pipeline implementation: ✅ Ingested
- Extractor: `extract/openmeteo/`; daily run uses forecast API (10-day window), backfill uses archive API (2020–present)
- Idempotent: file-existence check per day per location before API call
- Landing: `data/landing/weather/{location_id}/{year}/{date}.json.gz` (one file per location per day)
- Foundation: `foundation.fct_weather_daily` reads JSON glob, joins with `seeds.weather_locations`, derives boolean crop stress flags (`is_drought`, `is_heat_stress`, `is_high_vpd`, `is_frost`)
- Serving: `serving.weather_daily` adds rolling aggregates (7d, 30d), temperature anomaly, water balance, drought/heat/VPD streak counters (gaps-and-islands), composite `crop_stress_index` (0–100)
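The stress flags and streak counters can be illustrated procedurally. This sketch uses the thresholds given for the fetched variables and assumes frost is tested on the daily minimum and heat stress on the daily maximum, which the text does not specify. The drought flag is omitted because no threshold is documented. The streak loop is a Python analogue of the SQL gaps-and-islands pattern, not the model's actual query.

```python
from __future__ import annotations


def stress_flags(t_min_c: float, t_max_c: float, vpd_max_kpa: float) -> dict[str, bool]:
    """Derive boolean flags from one day's values using the documented thresholds."""
    return {
        "is_frost": t_min_c < 5.0,        # assumed: applied to the daily minimum
        "is_heat_stress": t_max_c > 30.0,  # assumed: applied to the daily maximum
        "is_high_vpd": vpd_max_kpa > 1.5,
    }


def streak_lengths(flags: list[bool]) -> list[int]:
    """Running count of consecutive True days, reset to 0 on each False day."""
    out, run = [], 0
    for flag in flags:
        run = run + 1 if flag else 0
        out.append(run)
    return out


print(stress_flags(3.0, 24.0, 0.9))
print(streak_lengths([True, True, False, True]))  # [1, 2, 0, 1]
```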
6. Planned Sources
6.1 ICE Coffee C — Options Chain
| Field | Value |
|---|---|
| Data Type | Per-strike open interest, volume, implied volatility for KC=F options |
| Use Case | IV term structure, put/call skew, options positioning — leading indicator for futures moves |
| Access Options | yfinance (free, limited history); barchart OnDemand API (paid); ICE Data Services (enterprise) |
| Priority | High |
See research note in docs/ice-options-research.md (to be added when research completes).
6.2 CFTC COT — Options-and-Futures Combined
| Field | Value |
|---|---|
| URL | https://www.cftc.gov/files/dea/history/com_disagg_txt_{year}.zip |
| Data Type | Same as 1.1 but positions include options delta-equivalent; captures net exposure of option writers |
| Access Method | Public — no auth |
| Priority | Medium |
Currently we ingest the futures-only (fut_disagg) report. The combined report (com_disagg) adjusts for options delta and shows total directional exposure. Adding it would be a minor extractor change: same URL pattern, same CSV schema, different CFTC internal identifier. Could run as a second extractor or a variant flag in the existing one.
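The variant-flag option could be as small as the sketch below, which builds the download URL for either report from the two documented file prefixes. `cot_zip_url`, `backfill_urls`, and the variant labels are hypothetical names, not the existing extractor's interface.

```python
BASE_URL = "https://www.cftc.gov/files/dea/history"

# File-name prefixes taken from the two URL patterns documented here.
PREFIXES = {
    "futures_only": "fut_disagg_txt",
    "combined": "com_disagg_txt",
}


def cot_zip_url(year: int, variant: str = "futures_only") -> str:
    """Return the yearly ZIP URL for the chosen report variant."""
    if variant not in PREFIXES:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"{BASE_URL}/{PREFIXES[variant]}_{year}.zip"


def backfill_urls(first_year: int, last_year: int, variant: str = "futures_only") -> list:
    """List one URL per year, inclusive of both ends."""
    return [cot_zip_url(y, variant) for y in range(first_year, last_year + 1)]


print(cot_zip_url(2024, "combined"))
```

Keeping the variant as a parameter lets both reports share the download, etag, and landing logic while writing to separate landing paths.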
6.3 World Bank Commodity Prices (Pink Sheet)
| Field | Value |
|---|---|
| URL | https://thedocs.worldbank.org/en/doc/18675f1d1639c7a34d463f59255d3f88-0050012023/related/CMO-Pink-Sheet.xlsx |
| Data Type | Monthly benchmark prices for 70+ commodities including Arabica (Other Milds, NY) and Robusta (ICE London) |
| Access Method | Public Excel download — no auth |
| Update Frequency | Monthly |
| History | 1960 to present |
| Priority | Medium |
The Pink Sheet provides monthly Arabica and Robusta price benchmarks alongside other agricultural commodities. Useful for macro context and relative value analysis. Single XLSX, easy to parse.
6.4 FAO Crop Calendar
| Field | Value |
|---|---|
| URL | https://cropcalendar.apps.fao.org/ |
| Data Type | Coffee planting, flowering, and harvest windows by country |
| Access Method | Public — no auth (manual download or scrape) |
| Priority | Medium |
FAO crop calendar provides the seasonal context needed to interpret weather anomalies correctly (e.g., drought during flowering is more damaging than drought post-harvest). Suitable as a one-time seed table per growing region, updated annually if needed.
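Once the calendar is loaded as a seed table, the core lookup is whether a given month falls inside a phase window, including windows that wrap past December. A minimal sketch follows. `in_window` is a hypothetical helper and the months used are placeholders, not actual FAO calendar values.

```python
def in_window(month: int, start_month: int, end_month: int) -> bool:
    """True if month lies in [start_month, end_month], allowing wrap-around past December."""
    if start_month <= end_month:
        return start_month <= month <= end_month
    # Wrapping window, e.g. November through February.
    return month >= start_month or month <= end_month


# Placeholder window running from November to February.
print(in_window(1, 11, 2))  # True
print(in_window(6, 11, 2))  # False
```

Joining weather anomalies to such a flag is what lets a drought during flowering be weighted differently from one after harvest.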
7. Reference / Seed Data
All maintained as CSV files in transform/sqlmesh_materia/seeds/:
| File | Purpose |
|---|---|
| `dim_commodity.csv` | Commodity master: code, name, exchange, unit |
| `psd_commodity_codes.csv` | USDA PSD commodity code lookup |
| `psd_attribute_codes.csv` | USDA PSD attribute code lookup (production, stocks, etc.) |
| `psd_unit_of_measure_codes.csv` | USDA PSD unit code lookup |
| `commodity_exchange_codes.csv` | Exchange code mapping |
| `psd_codes_exchange_codes_merge.csv` | Join table linking PSD codes to exchange codes |
| `weather_locations.csv` | Open-Meteo location metadata (id, name, country, lat, lon, variety) |