# BeanFlows – Data Sources Inventory

Compiled: 2026-02-26

Purpose: Identify and track data sources feeding the BeanFlows DuckDB analytics pipeline.

---

## Pipeline Status Tracker

**Status:** ✅ Ingested – extractor + model live in `master` | 🔲 Planned – worth building | ⏸ On hold – blocked on cost/access | – Not targeted

**Score (1–5):** Overall ingestion priority. Weighs data value to BeanFlows (price analytics, COT positioning, crop weather, PSD fundamentals) against implementation effort and access barriers. 5 = core infrastructure already ingested, 1 = marginal or inaccessible.

| Source | Category | Status | Score | Credentials | Pipeline refs |
|--------|----------|--------|-------|-------------|---------------|
| CFTC COT Disaggregated Futures | Positioning | ✅ Ingested | 5 | None | `extract_cot` → `fct_cot_positioning` → `serving.cot_positioning` |
| Yahoo Finance – KC=F | Price | ✅ Ingested | 5 | None | `extract_coffee_prices` → `fct_coffee_prices` → `serving.coffee_prices` |
| ICE Report Center – warehouse stocks | Warehouse / Inventory | ✅ Ingested | 5 | None | `extract_ice_stocks` → `fct_ice_warehouse_stocks` → `serving.ice_warehouse_stocks` |
| ICE Report Center – stocks by port | Warehouse / Inventory | ✅ Ingested | 4 | None | `extract_ice_stocks` → `fct_ice_warehouse_stocks_by_port` → `serving.ice_warehouse_stocks_by_port` |
| ICE Report Center – aging stocks | Warehouse / Inventory | ✅ Ingested | 4 | None | `extract_ice_stocks` → `fct_ice_aging_stocks` → `serving.ice_aging_stocks` |
| USDA PSD Online | Fundamentals (supply/demand) | ✅ Ingested | 5 | None | `extract_psd` → `stg_psdalldata__commodity` → `serving.commodity_metrics` |
| Open-Meteo ERA5 – weather | Crop weather | ✅ Ingested | 5 | None | `extract_openmeteo` → `fct_weather_daily` → `serving.weather_daily` |
| ICE Coffee C – options chain | Derivatives / Volatility | 🔲 Planned | 4 | None (yfinance) or paid | TBD |
| CFTC COT – options-and-futures combined | Positioning | 🔲 Planned | 3 | None (same ZIP) | `fct_cot_positioning` variant |
| World Bank Commodity Prices (Pink Sheet) | Benchmark prices | 🔲 Planned | 3 | None | `extract_wb_prices` → `fct_wb_prices` |
| FAO Crop Calendar | Seasonality | 🔲 Planned | 3 | None (CSV) | Seed table |
| Freight / C4 route rates | Supply chain | 🔲 Planned | 2 | None (scrape) | `fct_freight_rates` |
| ICE Data Services – tick data | Price (granular) | ⏸ On hold | 2 | Paid subscription | Commercial; not needed for daily analytics |
| Refinitiv / LSEG | Price / Fundamentals | – | 1 | Enterprise subscription | Superseded by free ICE + CFTC + USDA sources |
| Bloomberg Terminal | Price / News | – | 1 | Terminal license | Not cost-effective for current scope |

---

## 1. Positioning Data

### 1.1 CFTC COT Disaggregated Futures

| Field | Value |
|-------|-------|
| URL | `https://www.cftc.gov/files/dea/history/fut_disagg_txt_{year}.zip` |
| Data Type | Weekly futures-only positioning by trader category (Producer/Merchant, Swap Dealer, Managed Money, Other Reportable, Non-Reportable) |
| Access Method | Public download – no auth, no API key |
| Update Frequency | Weekly (Friday 3:30 PM ET); current-year file updated in-place |
| History | 2006-06-13 to present |
| License / TOS | US government data – public domain |
| Priority | **Core** |

The CFTC publishes the Disaggregated Futures-Only report as one ZIP per year, containing a single CSV with all commodity codes. Each file is ~3–30 MB. The current-year file is overwritten each Friday; prior years are static.

Column quirk: `Swap__Positions_Short_All` and `Swap__Positions_Spread_All` use double underscores. This is a quirk of the CFTC source files, not a typo in this document. All other swap columns use single underscores. DuckDB `all_varchar = TRUE` preserves exact header names; these columns must be quoted in SQL.
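
The URL pattern and the quoting rule above can be sketched in a few lines of Python. This is an illustration only, not the extractor's actual code: the helper names `cot_zip_url`, `quote_ident`, and `build_swap_select` are hypothetical, and the explicit `compression = 'gzip'` option is an assumption made because the landing files use a `.csv.gzip` suffix.

```python
def cot_zip_url(year: int) -> str:
    """URL of the yearly Disaggregated Futures-Only ZIP (pattern from the table above)."""
    return f"https://www.cftc.gov/files/dea/history/fut_disagg_txt_{year}.zip"


def quote_ident(name: str) -> str:
    """Double-quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


# The two swap columns that the CFTC header spells with a double underscore.
SWAP_QUIRK_COLUMNS = ["Swap__Positions_Short_All", "Swap__Positions_Spread_All"]


def build_swap_select(csv_path: str) -> str:
    """Build a DuckDB query that reads one landed CSV with every column as VARCHAR.

    Hypothetical helper: csv_path is interpolated as-is, so it must not
    contain single quotes.
    """
    cols = ", ".join(quote_ident(c) for c in SWAP_QUIRK_COLUMNS)
    return (
        f"SELECT {cols} "
        f"FROM read_csv('{csv_path}', all_varchar = TRUE, compression = 'gzip')"
    )


if __name__ == "__main__":
    print(cot_zip_url(2024))
    print(build_swap_select("data/landing/cot/2024/example.csv.gzip"))
```

Quoting every header-derived identifier, rather than only the two known quirks, keeps the query stable if the CFTC header changes again.
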

**Pipeline implementation:** ✅ Ingested

- Extractor: `extract/cftc_cot/` – backfills all years from 2006, idempotent via etag (synthetic etag = year + content-length + last-modified hash when CFTC omits the etag header)
- Landing: `data/landing/cot/{year}/{etag}.csv.gzip`
- Foundation: `foundation.fct_cot_positioning` – casts types, cleans names, computes net positions (long − short), deduplicates via HASH key
- Grain: `(cftc_commodity_code, report_date, cftc_contract_market_code, ingest_date)`
- Serving: `serving.cot_positioning` – adds COT index (normalized percentile rank over 26w / 52w rolling window), managed money net % of OI
- Covers all commodity codes in the report – filtering to coffee (`073642`) happens in the serving layer

**Related: CFTC COT Options-and-Futures Combined**

The same URL pattern with `com_disagg_txt_{year}.zip` gives the combined futures+options report. We currently use the futures-only report. Adding the combined variant would enable options-specific positioning analysis (see Section 6.2).

---

## 2. Price Data

### 2.1 Yahoo Finance – Coffee C (KC=F)

| Field | Value |
|-------|-------|
| URL | Yahoo Finance via `yfinance` Python library; ticker `KC=F` |
| Data Type | Daily OHLCV + adjusted close for Coffee C continuous front-month futures |
| Access Method | `yfinance.Ticker("KC=F").history(period="max")` – free, no auth |
| Update Frequency | Daily (post-settlement, typically ~30 min after session close) |
| History | 1971-08-16 to present |
| License / TOS | Yahoo Finance ToS – data for personal/non-commercial use; not for redistribution |
| Priority | **Core** |

Yahoo Finance is the only free source for Coffee C daily OHLCV with full history back to 1971. Data quality is generally good for daily analytics, with occasional gaps on non-US holidays. Adjusted close (`Adj Close`) accounts for contract rolls.

Column quirk: `Adj Close` has a space in the CSV header.
DuckDB `all_varchar = TRUE` preserves this; it must be quoted as `"Adj Close"` in SQL.

**Pipeline implementation:** ✅ Ingested

- Extractor: `extract/coffee_prices/` – downloads full history via `ticker.history(period="max")`, idempotent via SHA256 of CSV bytes
- Landing: `data/landing/prices/coffee_kc/{hash8}.csv.gzip` (single file; hash changes when new trading days are appended)
- Foundation: `foundation.fct_coffee_prices` – casts, deduplicates via `HASH(Date, Close)`
- Grain: `trade_date`
- Serving: `serving.coffee_prices` – adds daily return, SMA 20/50/200, EMA 9/21, Bollinger Bands (20d, ±2σ), RSI 14, 52-week high/low/range

---

## 3. Warehouse & Inventory Data

### 3.1 ICE Report Center – Warehouse Stocks

| Field | Value |
|-------|-------|
| URL | `https://www.theice.com/publicdocs/futures_us/exchange_notices/coffee_certifiedstocks.csv` (rolling) + `https://www.ice.com/marketdata/api/reports/293/results` (API, product_id=2) |
| Data Type | Daily ICE-certified and pending-grading coffee bags (total, by port, by age bucket) |
| Access Method | Public – no auth required. Static CSV for rolling data; private JSON API for historical report catalogue |
| Update Frequency | Daily (trading days) for stocks; monthly for aging report |
| History | Full archive available via report API (~2010 to present); static CSV is rolling |
| License / TOS | ICE – public market data |
| Priority | **Core** |

Three distinct datasets served by one extractor:

1. **Daily warehouse stocks** (`ice_stocks`) – total certified bags + pending grading. Key supply constraint indicator.
2. **Stocks by port** (`ice_stocks_by_port`) – breakdown across NY, New Orleans, Houston, Miami, Antwerp, Hamburg/Bremen, Barcelona, Virginia. Port-level flow analysis.
3. **Aging stocks** (`ice_aging`) – bags grouped by age bucket (e.g., "0 to 30", "31 to 60" days). Older stocks command quality discounts; aging ratio is a quality/supply stress signal.

The report API is undocumented but stable.
Reports are discovered via `POST /api/reports/293/results` with `productId=2`, paginated. XLS/XLSX files are parsed with `xlrd`; the extractor handles both OLE2 `.xls` and modern `.xlsx` formats via magic-byte detection.

**Pipeline implementation:** ✅ Ingested

- Extractor: `extract/ice_stocks/` – idempotent via SHA256 of content
- Landing:
  - `data/landing/ice_stocks/{year}/{date}_{hash8}.csv.gzip`
  - `data/landing/ice_aging/{year}/{date}_{hash8}.csv.gzip`
  - `data/landing/ice_stocks_by_port/{year}/{date}_{hash8}.csv.gzip`
- Foundation: `fct_ice_warehouse_stocks`, `fct_ice_aging_stocks`, `fct_ice_warehouse_stocks_by_port`
- Serving: corresponding `serving.*` models with WoW change, 30d/52w rolling averages, drawdown from 52w high, age-bucket share

---

## 4. Fundamentals Data

### 4.1 USDA PSD Online – Production, Supply, and Distribution

| Field | Value |
|-------|-------|
| URL | `https://apps.fas.usda.gov/psdonline/downloads/archives/{year}/{month:02d}/psd_alldata_csv.zip` |
| Data Type | Monthly supply/demand balances by commodity × country × market year: production, imports, exports, consumption, ending stocks |
| Access Method | Public download – no auth |
| Update Frequency | Monthly (WASDE report release dates, ~11th of each month) |
| History | 2006-08 to present (archive); current year is always available |
| License / TOS | USDA FAS – US government open data |
| Priority | **Core** |

PSD is the primary source for global coffee supply/demand fundamentals. Each monthly file contains all commodities (not just coffee) and all reporting countries for all market years. The coffee commodity code is `0721100` (green bean equivalent). Market year for coffee runs October–September.
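
As a rough illustration of how the archive URL template expands during a backfill, the sketch below yields one URL per calendar month. The function name `psd_archive_urls` is hypothetical and is not taken from `extract/psdonline/`; only the URL template comes from the table above.

```python
from datetime import date
from typing import Iterator

# URL template copied from the table above.
PSD_URL = (
    "https://apps.fas.usda.gov/psdonline/downloads/archives/"
    "{year}/{month:02d}/psd_alldata_csv.zip"
)


def psd_archive_urls(start: date, end: date) -> Iterator[str]:
    """Yield one archive URL per calendar month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield PSD_URL.format(year=year, month=month)
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1


if __name__ == "__main__":
    # Archive history starts at 2006-08 according to the table above.
    for url in psd_archive_urls(date(2006, 8, 1), date(2006, 10, 1)):
        print(url)
```

A real backfill would still need to tolerate months for which no archive file exists; that handling is left out here.
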

Key attributes tracked:

- `AREA_HARVESTED` (ha), `PRODUCTION` (1000 MT or 60kg bags), `DOMESTIC_CONSUMPTION`, `EXPORTS`, `ENDING_STOCKS`, `STOCKS_TO_USE_RATIO_`

**Pipeline implementation:** ✅ Ingested

- Extractor: `extract/psdonline/` – backfills from 2006-08, idempotent via etag
- Landing: `data/landing/psd/{year}/{month:02d}/{etag}.csv.gzip`
- Staging: `staging.stg_psdalldata__commodity` – joins with seed tables for commodity/attribute/unit metadata; `cleaned.psdalldata__commodity_pivoted` – pivots attributes to wide format
- Seeds: `psd_commodity_codes.csv`, `psd_attribute_codes.csv`, `psd_unit_of_measure_codes.csv`
- Serving: `serving.commodity_metrics` – coffee and cocoa supply/demand balances, production growth YoY, stock-to-use ratio

---

## 5. Weather & Climate Data

### 5.1 Open-Meteo – ERA5 Reanalysis + Forecast Blend

| Field | Value |
|-------|-------|
| URL | Archive: `https://archive-api.open-meteo.com/v1/archive` · Forecast: `https://api.open-meteo.com/v1/forecast` |
| Data Type | Daily weather for 12 coffee-growing regions: temperature (min/max/mean), precipitation, wind, humidity, cloud cover, ET₀, VPD |
| Access Method | Free API – no key, no registration |
| Update Frequency | Daily; ERA5 reanalysis available to ~5 days ago, gap filled by forecast API |
| History | ERA5 archive from 1940; pipeline backfilled from 2020-01-01 |
| License / TOS | CC BY 4.0 – attribution required |
| Rate Limiting | No published rate limit; community API. Sleep 0.5s between location calls. Pre-check file existence to skip API calls on re-runs. |
| Priority | **Core** |

Open-Meteo wraps ECMWF ERA5 reanalysis data, which is the scientific standard for historical weather. The API requires no key and has no formal rate limit for reasonable usage (~12 calls/day for daily updates).
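
The rate-limiting approach in the table (sleep between location calls, skip days that are already on disk) might look roughly like the sketch below. It is not the code in `extract/openmeteo/`: the function `fetch_missing` and its injected `fetch` callable are hypothetical, and the file layout follows the landing path `{location_id}/{year}/{date}.json.gz` documented for this source.

```python
import gzip
import json
import time
from pathlib import Path
from typing import Callable, Iterable


def fetch_missing(
    location_ids: Iterable[str],
    day: str,
    landing_root: Path,
    fetch: Callable[[str, str], dict],
    pause_s: float = 0.5,
) -> list[Path]:
    """Land one day of weather per location, skipping files that already exist.

    `day` is an ISO date string (YYYY-MM-DD); `fetch` is a caller-supplied
    function that performs the actual API request and returns parsed JSON.
    """
    written = []
    for location_id in location_ids:
        target = landing_root / location_id / day[:4] / f"{day}.json.gz"
        if target.exists():
            continue  # already landed: no API call and no sleep
        payload = fetch(location_id, day)
        target.parent.mkdir(parents=True, exist_ok=True)
        with gzip.open(target, "wt", encoding="utf-8") as fh:
            json.dump(payload, fh)
        written.append(target)
        time.sleep(pause_s)  # pause between calls to the community API
    return written
```

Passing the request function in as a parameter keeps the skip logic testable without network access.
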

Variables fetched:

- `temperature_2m_max/min/mean` – frost detection (`<5°C`), heat stress (`>30°C`)
- `precipitation_sum` – drought and flood signals
- `wind_speed_10m_max` – wind damage proxy
- `relative_humidity_2m_max` – disease pressure (coffee leaf rust, CBD)
- `cloud_cover_mean` – solar radiation proxy
- `et0_fao_evapotranspiration` – crop water demand (Penman-Monteith)
- `vapour_pressure_deficit_max` – transpiration stress (`>1.5 kPa` = significant stress)

**12 locations** covering the world's primary Arabica and Robusta growing zones (BR ×3, VN, CO, ET, HN, GT, ID, PE, UG, CI). See `extract/openmeteo/src/openmeteo/locations.py`.

**Pipeline implementation:** ✅ Ingested

- Extractor: `extract/openmeteo/` – daily run uses forecast API (10-day window); backfill uses archive API (2020–present)
- Idempotent: file-existence check per day per location before API call
- Landing: `data/landing/weather/{location_id}/{year}/{date}.json.gz` (one file per location per day)
- Foundation: `foundation.fct_weather_daily` – reads JSON glob, joins with `seeds.weather_locations`, derives boolean crop stress flags (`is_drought`, `is_heat_stress`, `is_high_vpd`, `is_frost`)
- Serving: `serving.weather_daily` – adds rolling aggregates (7d, 30d), temperature anomaly, water balance, drought/heat/VPD streak counters (gaps-and-islands), composite `crop_stress_index` (0–100)

---

## 6. Planned Sources

### 6.1 ICE Coffee C – Options Chain

| Field | Value |
|-------|-------|
| Data Type | Per-strike open interest, volume, implied volatility for KC=F options |
| Use Case | IV term structure, put/call skew, options positioning – leading indicator for futures moves |
| Access Options | `yfinance` (free, limited history); Barchart OnDemand API (paid); ICE Data Services (enterprise) |
| Priority | **High** |

See research note in `docs/ice-options-research.md` (to be added when research completes).

---

### 6.2 CFTC COT – Options-and-Futures Combined

| Field | Value |
|-------|-------|
| URL | `https://www.cftc.gov/files/dea/history/com_disagg_txt_{year}.zip` |
| Data Type | Same as 1.1 but positions include options delta-equivalent; captures net exposure of option writers |
| Access Method | Public – no auth |
| Priority | **Medium** |

Currently we ingest the futures-only (`fut_disagg`) report. The combined report (`com_disagg`) adjusts for options delta and shows total directional exposure. Adding it would be a minor extractor change: same URL pattern, same CSV schema, different CFTC internal identifier. Could run as a second extractor or a variant flag in the existing one.

---

### 6.3 World Bank Commodity Prices (Pink Sheet)

| Field | Value |
|-------|-------|
| URL | `https://thedocs.worldbank.org/en/doc/18675f1d1639c7a34d463f59255d3f88-0050012023/related/CMO-Pink-Sheet.xlsx` |
| Data Type | Monthly benchmark prices for 70+ commodities including Arabica (Other Milds, NY) and Robusta (ICE London) |
| Access Method | Public Excel download – no auth |
| Update Frequency | Monthly |
| History | 1960 to present |
| Priority | **Medium** |

The Pink Sheet provides monthly Arabica and Robusta price benchmarks alongside other agricultural commodities. Useful for macro context and relative value analysis. Single XLSX, easy to parse.

---

### 6.4 FAO Crop Calendar

| Field | Value |
|-------|-------|
| URL | `https://cropcalendar.apps.fao.org/` |
| Data Type | Coffee planting, flowering, and harvest windows by country |
| Access Method | Public – no auth (manual download or scrape) |
| Priority | **Medium** |

FAO crop calendar provides the seasonal context needed to interpret weather anomalies correctly (e.g., drought during flowering is more damaging than drought post-harvest). Suitable as a one-time seed table per growing region, updated annually if needed.

---

## 7. Reference / Seed Data

All maintained as CSV files in `transform/sqlmesh_materia/seeds/`:

| File | Purpose |
|------|---------|
| `dim_commodity.csv` | Commodity master – code, name, exchange, unit |
| `psd_commodity_codes.csv` | USDA PSD commodity code lookup |
| `psd_attribute_codes.csv` | USDA PSD attribute code lookup (production, stocks, etc.) |
| `psd_unit_of_measure_codes.csv` | USDA PSD unit code lookup |
| `commodity_exchange_codes.csv` | Exchange code mapping |
| `psd_codes_exchange_codes_merge.csv` | Join table linking PSD codes to exchange codes |
| `weather_locations.csv` | Open-Meteo location metadata (id, name, country, lat, lon, variety) |