# BeanFlows — Data Sources Inventory

Compiled: 2026-02-26
Purpose: Identify and track data sources feeding the BeanFlows DuckDB analytics pipeline.

---

## Pipeline Status Tracker

**Status:** ✅ Ingested — extractor + model live in `master` | 🔲 Planned — worth building | ⏸ On hold — blocked on cost/access | — Not targeted

**Score (1–5):** Overall ingestion priority. Weighs data value to BeanFlows (price analytics, COT positioning, crop weather, PSD fundamentals) against implementation effort and access barriers. 5 = core infrastructure already ingested, 1 = marginal or inaccessible.

| Source | Category | Status | Score | Credentials | Pipeline refs |
|--------|----------|--------|-------|-------------|---------------|
| CFTC COT Disaggregated Futures | Positioning | ✅ Ingested | 5 | None | `extract_cot` → `fct_cot_positioning` → `serving.cot_positioning` |
| Yahoo Finance — KC=F | Price | ✅ Ingested | 5 | None | `extract_coffee_prices` → `fct_coffee_prices` → `serving.coffee_prices` |
| ICE Report Center — warehouse stocks | Warehouse / Inventory | ✅ Ingested | 5 | None | `extract_ice_stocks` → `fct_ice_warehouse_stocks` → `serving.ice_warehouse_stocks` |
| ICE Report Center — stocks by port | Warehouse / Inventory | ✅ Ingested | 4 | None | `extract_ice_stocks` → `fct_ice_warehouse_stocks_by_port` → `serving.ice_warehouse_stocks_by_port` |
| ICE Report Center — aging stocks | Warehouse / Inventory | ✅ Ingested | 4 | None | `extract_ice_stocks` → `fct_ice_aging_stocks` → `serving.ice_aging_stocks` |
| USDA PSD Online | Fundamentals (supply/demand) | ✅ Ingested | 5 | None | `extract_psd` → `stg_psdalldata__commodity` → `serving.commodity_metrics` |
| Open-Meteo ERA5 — weather | Crop weather | ✅ Ingested | 5 | None | `extract_openmeteo` → `fct_weather_daily` → `serving.weather_daily` |
| ICE Coffee C — options chain | Derivatives / Volatility | 🔲 Planned | 4 | None (yfinance) or paid | TBD |
| CFTC COT — options-and-futures combined | Positioning | 🔲 Planned | 3 | None (same ZIP) | `fct_cot_positioning` variant |
| World Bank Commodity Prices (Pink Sheet) | Benchmark prices | 🔲 Planned | 3 | None | `extract_wb_prices` → `fct_wb_prices` |
| FAO Crop Calendar | Seasonality | 🔲 Planned | 3 | None (CSV) | Seed table |
| Freight / C4 route rates | Supply chain | 🔲 Planned | 2 | None (scrape) | `fct_freight_rates` |
| ICE Data Services — tick data | Price (granular) | ⏸ On hold | 2 | Paid subscription | Commercial; not needed for daily analytics |
| Refinitiv / LSEG | Price / Fundamentals | — | 1 | Enterprise subscription | Superseded by free ICE + CFTC + USDA sources |
| Bloomberg Terminal | Price / News | — | 1 | Terminal license | Not cost-effective for current scope |

---

## 1. Positioning Data

### 1.1 CFTC COT Disaggregated Futures

| Field | Value |
|-------|-------|
| URL | `https://www.cftc.gov/files/dea/history/fut_disagg_txt_{year}.zip` |
| Data Type | Weekly futures-only positioning by trader category (Producer/Merchant, Swap Dealer, Managed Money, Other Reportable, Non-Reportable) |
| Access Method | Public download — no auth, no API key |
| Update Frequency | Weekly (Friday 3:30 PM ET); current-year file updated in-place |
| History | 2006-06-13 to present |
| License / TOS | US government data — public domain |
| Priority | **Core** |

The CFTC publishes the Disaggregated Futures-Only report as one ZIP per year, containing a single CSV with all commodity codes. Each file is ~3–30 MB. The current-year file is overwritten each Friday; prior years are static.

Column quirk: `Swap__Positions_Short_All` and `Swap__Positions_Spread_All` use double underscores. This is a quirk of the CFTC source file, not a typo in this document; all other swap columns use single underscores. DuckDB `all_varchar = TRUE` preserves exact header names; these columns must be quoted in SQL.

**Pipeline implementation:** ✅ Ingested

- Extractor: `extract/cftc_cot/` — backfills all years from 2006, idempotent via etag (synthetic etag = year + content-length + last-modified hash when CFTC omits etag header)
- Landing: `data/landing/cot/{year}/{etag}.csv.gzip`
- Foundation: `foundation.fct_cot_positioning` — casts types, cleans names, computes net positions (long − short), deduplicates via HASH key
- Grain: `(cftc_commodity_code, report_date, cftc_contract_market_code, ingest_date)`
- Serving: `serving.cot_positioning` — adds COT index (normalized percentile rank over 26w / 52w rolling window), managed money net % of OI
- Covers all commodity codes in the report — filtering to coffee (`073642`) happens in the serving layer

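The synthetic-etag fallback mentioned in the extractor bullet could look roughly like the sketch below. The helper names (`synthetic_etag`, `resolve_etag`), the separator, and the 16-character truncation are assumptions made for illustration; the actual extractor may combine the three fields differently.

```python
import hashlib
from typing import Mapping, Optional


def synthetic_etag(year: int, content_length: Optional[str], last_modified: Optional[str]) -> str:
    """Hypothetical helper: derive a stable stand-in etag from response metadata."""
    # Assumption: the three fields are joined with "|" and hashed with SHA-256.
    raw = f"{year}|{content_length or ''}|{last_modified or ''}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def resolve_etag(year: int, headers: Mapping[str, str]) -> str:
    """Use the server etag when present, otherwise fall back to the synthetic one."""
    etag = headers.get("ETag")
    if etag:
        return etag.strip('"')  # drop the surrounding quotes of a strong etag
    return synthetic_etag(year, headers.get("Content-Length"), headers.get("Last-Modified"))
```

Because the fallback depends only on the year and the two headers, a re-run against an unchanged file resolves to the same landing path, while a Friday overwrite of the current-year file changes `Content-Length` or `Last-Modified` and therefore produces a new one.
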
**Related: CFTC COT Options-and-Futures Combined**

The same URL pattern with `com_disagg_txt_{year}.zip` gives the combined futures+options report. We currently use the futures-only report. Adding the combined variant would enable options-specific positioning analysis (see Section 6.2).

---

## 2. Price Data

### 2.1 Yahoo Finance — Coffee C (KC=F)

| Field | Value |
|-------|-------|
| URL | Yahoo Finance via `yfinance` Python library; ticker `KC=F` |
| Data Type | Daily OHLCV + adjusted close for Coffee C continuous front-month futures |
| Access Method | `yfinance.Ticker("KC=F").history(period="max")` — free, no auth |
| Update Frequency | Daily (post-settlement, typically ~30 min after session close) |
| History | 1971-08-16 to present |
| License / TOS | Yahoo Finance ToS — data for personal/non-commercial use; not for redistribution |
| Priority | **Core** |

Yahoo Finance is the only free source for Coffee C daily OHLCV with full history back to 1971. Data quality is generally good for daily analytics; occasional gaps on non-US holidays. Adjusted close (`Adj Close`, note space in header) accounts for contract rolls.

Column quirk: `Adj Close` has a space in the CSV header. DuckDB `all_varchar = TRUE` preserves this; must be quoted as `"Adj Close"` in SQL.

**Pipeline implementation:** ✅ Ingested

- Extractor: `extract/coffee_prices/` — downloads full history via `ticker.history(period="max")`, idempotent via SHA256 of CSV bytes
- Landing: `data/landing/prices/coffee_kc/{hash8}.csv.gzip` (single file; hash changes when new trading days are appended)
- Foundation: `foundation.fct_coffee_prices` — casts, deduplicates via `HASH(Date, Close)`
- Grain: `trade_date`
- Serving: `serving.coffee_prices` — adds daily return, SMA 20/50/200, EMA 9/21, Bollinger Bands (20d, ±2σ), RSI 14, 52-week high/low/range

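As a rough illustration of the Bollinger Bands listed for the serving model (20d, ±2σ), the sketch below computes the bands over a plain list of closes. The function name `bollinger` is hypothetical, and the use of the population standard deviation is an assumption; the serving model is written in SQL and may use the sample standard deviation instead.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple


def bollinger(closes: List[float], window: int = 20, k: float = 2.0) -> List[Optional[Tuple[float, float, float]]]:
    """Return (lower, middle, upper) per observation, or None until the window fills."""
    bands: List[Optional[Tuple[float, float, float]]] = []
    for i in range(len(closes)):
        if i + 1 < window:
            bands.append(None)  # not enough history yet
            continue
        win = closes[i + 1 - window : i + 1]
        mid = mean(win)   # the middle band is the simple moving average
        sd = pstdev(win)  # assumption: population standard deviation (ddof = 0)
        bands.append((mid - k * sd, mid, mid + k * sd))
    return bands
```

The middle band is the same 20-day simple moving average listed as SMA 20, so both indicators can share one rolling window.
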
---

## 3. Warehouse & Inventory Data

### 3.1 ICE Report Center — Warehouse Stocks

| Field | Value |
|-------|-------|
| URL | `https://www.theice.com/publicdocs/futures_us/exchange_notices/coffee_certifiedstocks.csv` (rolling) + `https://www.ice.com/marketdata/api/reports/293/results` (API, product_id=2) |
| Data Type | Daily ICE-certified and pending-grading coffee bags (total, by port, by age bucket) |
| Access Method | Public — no auth required. Static CSV for rolling data; private JSON API for historical report catalogue |
| Update Frequency | Daily (trading days) for stocks; monthly for aging report |
| History | Full archive available via report API (~2010 to present); static CSV is rolling |
| License / TOS | ICE — public market data |
| Priority | **Core** |

Three distinct datasets served by one extractor:

1. **Daily warehouse stocks** (`ice_stocks`) — total certified bags + pending grading. Key supply constraint indicator.
2. **Stocks by port** (`ice_stocks_by_port`) — breakdown across NY, New Orleans, Houston, Miami, Antwerp, Hamburg/Bremen, Barcelona, Virginia. Port-level flow analysis.
3. **Aging stocks** (`ice_aging`) — bags grouped by age bucket (e.g., "0 to 30", "31 to 60" days). Older stocks command quality discounts; aging ratio is a quality/supply stress signal.

The report API is undocumented but stable. Reports are discovered via `POST /api/reports/293/results` with `productId=2`, paginated. XLS/XLSX files are parsed with `xlrd`; the extractor handles both OLE2 `.xls` and modern `.xlsx` formats via magic-byte detection.

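The magic-byte detection mentioned above can be sketched as follows. The function name `detect_excel_format` is hypothetical and the real extractor may structure this differently; the two signatures themselves are the standard ones for OLE2 compound files and ZIP archives (an `.xlsx` file is a ZIP container).

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                       # .xlsx (ZIP local file header)


def detect_excel_format(payload: bytes) -> str:
    """Classify a downloaded report by its leading bytes, ignoring the file name."""
    if payload.startswith(OLE2_MAGIC):
        return "xls"
    if payload.startswith(ZIP_MAGIC):
        return "xlsx"
    raise ValueError("payload is neither an OLE2 .xls nor a ZIP-based .xlsx file")
```

Checking the bytes rather than the extension guards against reports whose file name or content type does not match the actual format.
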
**Pipeline implementation:** ✅ Ingested

- Extractor: `extract/ice_stocks/` — idempotent via SHA256 of content
- Landing:
  - `data/landing/ice_stocks/{year}/{date}_{hash8}.csv.gzip`
  - `data/landing/ice_aging/{year}/{date}_{hash8}.csv.gzip`
  - `data/landing/ice_stocks_by_port/{year}/{date}_{hash8}.csv.gzip`
- Foundation: `fct_ice_warehouse_stocks`, `fct_ice_aging_stocks`, `fct_ice_warehouse_stocks_by_port`
- Serving: corresponding `serving.*` models with WoW change, 30d/52w rolling averages, drawdown from 52w high, age-bucket share

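Two of the serving metrics above, WoW change and drawdown from the 52w high, can be illustrated with a minimal sketch over a list of daily stock totals. The function names are hypothetical, and the defaults (`lag=5` observations per week, `window=252` observations per 52 weeks) are assumptions about a trading-day series, not values taken from the models.

```python
from typing import List, Optional


def wow_change(values: List[float], lag: int = 5) -> List[Optional[float]]:
    """Difference versus the observation `lag` rows earlier (None until available)."""
    return [None if i < lag else values[i] - values[i - lag] for i in range(len(values))]


def drawdown_from_high(values: List[float], window: int = 252) -> List[Optional[float]]:
    """Fractional distance below the rolling maximum; 0.0 means a new high."""
    out: List[Optional[float]] = []
    for i, value in enumerate(values):
        high = max(values[max(0, i + 1 - window) : i + 1])
        out.append(value / high - 1.0 if high else None)  # None when the high is zero
    return out
```

For example, a series of 100, 120, 90 bags gives a drawdown of -0.25 on the last day, since 90 is 25% below the rolling high of 120.
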
---

## 4. Fundamentals Data

### 4.1 USDA PSD Online — Production, Supply, and Distribution

| Field | Value |
|-------|-------|
| URL | `https://apps.fas.usda.gov/psdonline/downloads/archives/{year}/{month:02d}/psd_alldata_csv.zip` |
| Data Type | Monthly supply/demand balances by commodity × country × market year: production, imports, exports, consumption, ending stocks |
| Access Method | Public download — no auth |
| Update Frequency | Monthly (WASDE report release dates, ~11th of each month) |
| History | 2006-08 to present (archive); current year is always available |
| License / TOS | USDA FAS — US government open data |
| Priority | **Core** |

PSD is the primary source for global coffee supply/demand fundamentals. Each monthly file contains all commodities (not just coffee) and all reporting countries for all market years. The coffee commodity code is `0721100` (green bean equivalent). Market year for coffee runs October–September.

Key attributes tracked:

- `AREA_HARVESTED` (ha), `PRODUCTION` (1000 MT or 60kg bags), `DOMESTIC_CONSUMPTION`, `EXPORTS`, `ENDING_STOCKS`, `STOCKS_TO_USE_RATIO_`

**Pipeline implementation:** ✅ Ingested

- Extractor: `extract/psdonline/` — backfills from 2006-08, idempotent via etag
- Landing: `data/landing/psd/{year}/{month:02d}/{etag}.csv.gzip`
- Staging: `staging.stg_psdalldata__commodity` — joins with seed tables for commodity/attribute/unit metadata; `cleaned.psdalldata__commodity_pivoted` — pivots attributes to wide format
- Seeds: `psd_commodity_codes.csv`, `psd_attribute_codes.csv`, `psd_unit_of_measure_codes.csv`
- Serving: `serving.commodity_metrics` — coffee and cocoa supply/demand balances, production growth YoY, stock-to-use ratio

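The backfill from 2006-08 amounts to walking the archive URL template month by month. The sketch below shows one way to enumerate those URLs; the function name `psd_archive_urls` and the tuple-based `(year, month)` arguments are assumptions for illustration, not the extractor's actual interface.

```python
from datetime import date
from typing import List, Optional, Tuple

PSD_URL = "https://apps.fas.usda.gov/psdonline/downloads/archives/{year}/{month:02d}/psd_alldata_csv.zip"


def psd_archive_urls(start: Tuple[int, int] = (2006, 8), end: Optional[Tuple[int, int]] = None) -> List[str]:
    """List monthly archive URLs from `start` to `end` inclusive, as (year, month) pairs."""
    if end is None:
        today = date.today()
        end = (today.year, today.month)  # default: up to the current month
    year, month = start
    urls: List[str] = []
    while (year, month) <= end:
        urls.append(PSD_URL.format(year=year, month=month))
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return urls
```

Each URL would then go through the etag check described above, so months that are already landed cost one conditional request rather than a full download.
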
---

## 5. Weather & Climate Data

### 5.1 Open-Meteo — ERA5 Reanalysis + Forecast Blend

| Field | Value |
|-------|-------|
| URL | Archive: `https://archive-api.open-meteo.com/v1/archive` · Forecast: `https://api.open-meteo.com/v1/forecast` |
| Data Type | Daily weather for 12 coffee-growing regions: temperature (min/max/mean), precipitation, wind, humidity, cloud cover, ET₀, VPD |
| Access Method | Free API — no key, no registration |
| Update Frequency | Daily; ERA5 reanalysis lags real time by ~5 days, and the forecast API fills that gap |
| History | ERA5 archive from 1940; pipeline backfilled from 2020-01-01 |
| License / TOS | CC BY 4.0 — attribution required |
| Rate Limiting | No published rate limit; community API. Sleep 0.5s between location calls. Pre-check file existence to skip API calls on re-runs. |
| Priority | **Core** |

Open-Meteo wraps ECMWF ERA5 reanalysis data, which is the scientific standard for historical weather. The API requires no key and has no formal rate limit for reasonable usage (~12 calls/day for daily updates).

Variables fetched:

- `temperature_2m_max/min/mean` — frost detection (`<5°C`), heat stress (`>30°C`)
- `precipitation_sum` — drought and flood signals
- `wind_speed_10m_max` — wind damage proxy
- `relative_humidity_2m_max` — disease pressure (coffee leaf rust, coffee berry disease (CBD))
- `cloud_cover_mean` — solar radiation proxy
- `et0_fao_evapotranspiration` — crop water demand (Penman-Monteith)
- `vapour_pressure_deficit_max` — transpiration stress (`>1.5 kPa` = significant stress)

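The thresholds in the variable list map directly onto boolean stress flags. The sketch below is a minimal illustration, assuming frost is tested against the daily minimum temperature and heat stress against the daily maximum; the function name `stress_flags` is hypothetical, and the drought flag is left out because its threshold is not stated here.

```python
from typing import Dict

FROST_MAX_C = 5.0   # frost detection: temperature below 5 °C
HEAT_MIN_C = 30.0   # heat stress: temperature above 30 °C
VPD_MIN_KPA = 1.5   # significant transpiration stress: VPD above 1.5 kPa


def stress_flags(tmin_c: float, tmax_c: float, vpd_max_kpa: float) -> Dict[str, bool]:
    """Derive per-day crop stress flags from the daily weather values."""
    return {
        "is_frost": tmin_c < FROST_MAX_C,       # assumption: compared with the daily minimum
        "is_heat_stress": tmax_c > HEAT_MIN_C,  # assumption: compared with the daily maximum
        "is_high_vpd": vpd_max_kpa > VPD_MIN_KPA,
    }
```

Keeping the thresholds as named constants makes it easy to tune them per variety later, for instance if Robusta regions need a different heat limit.
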
**12 locations** covering the world's primary Arabica and Robusta growing zones (BR ×3, VN, CO, ET, HN, GT, ID, PE, UG, CI). See `extract/openmeteo/src/openmeteo/locations.py`.

**Pipeline implementation:** ✅ Ingested

- Extractor: `extract/openmeteo/` — daily run uses forecast API (10-day window); backfill uses archive API (2020–present)
- Idempotent: file-existence check per day per location before API call
- Landing: `data/landing/weather/{location_id}/{year}/{date}.json.gz` (one file per location per day)
- Foundation: `foundation.fct_weather_daily` — reads JSON glob, joins with `seeds.weather_locations`, derives boolean crop stress flags (`is_drought`, `is_heat_stress`, `is_high_vpd`, `is_frost`)
- Serving: `serving.weather_daily` — adds rolling aggregates (7d, 30d), temperature anomaly, water balance, drought/heat/VPD streak counters (gaps-and-islands), composite `crop_stress_index` (0–100)

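The streak counters in the serving model are built with a gaps-and-islands query in SQL. The sketch below is not that query; it is a plain running counter that yields the same per-day streak length for one location, assuming the flags are ordered by date with no missing days. The function name `streak_lengths` is hypothetical.

```python
from typing import Iterable, List


def streak_lengths(flags: Iterable[bool]) -> List[int]:
    """Running count of consecutive True flags; resets to 0 on every False."""
    out: List[int] = []
    run = 0
    for flag in flags:
        run = run + 1 if flag else 0  # extend the current island or start a gap
        out.append(run)
    return out
```

For example, drought flags of True, True, False, True give streaks of 1, 2, 0, 1, so a dashboard can highlight any location whose current streak exceeds a chosen number of days.
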
---

## 6. Planned Sources

### 6.1 ICE Coffee C — Options Chain

| Field | Value |
|-------|-------|
| Data Type | Per-strike open interest, volume, implied volatility for KC=F options |
| Use Case | IV term structure, put/call skew, options positioning — leading indicator for futures moves |
| Access Options | `yfinance` (free, limited history); barchart OnDemand API (paid); ICE Data Services (enterprise) |
| Priority | **High** |

See research note in `docs/ice-options-research.md` (to be added when research completes).

---

### 6.2 CFTC COT — Options-and-Futures Combined

| Field | Value |
|-------|-------|
| URL | `https://www.cftc.gov/files/dea/history/com_disagg_txt_{year}.zip` |
| Data Type | Same as 1.1 but positions include options delta-equivalent; captures net exposure of option writers |
| Access Method | Public — no auth |
| Priority | **Medium** |

Currently we ingest the futures-only (`fut_disagg`) report. The combined report (`com_disagg`) adjusts for options delta and shows total directional exposure. Adding it would be a minor extractor change: same URL pattern, same CSV schema, different CFTC internal identifier. Could run as a second extractor or a variant flag in the existing one.

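The "variant flag" option could be as small as selecting the file prefix when building the download URL. The sketch below assumes a hypothetical `cot_url` helper and variant names (`futures_only`, `combined`); only the two URL patterns themselves come from this document.

```python
COT_URL = "https://www.cftc.gov/files/dea/history/{prefix}_disagg_txt_{year}.zip"

# Hypothetical variant names mapped to the file prefixes used by the CFTC URLs.
COT_PREFIXES = {"futures_only": "fut", "combined": "com"}


def cot_url(year: int, variant: str = "futures_only") -> str:
    """Build the yearly disaggregated COT download URL for the chosen report variant."""
    try:
        prefix = COT_PREFIXES[variant]
    except KeyError:
        raise ValueError(f"unknown COT variant: {variant!r}") from None
    return COT_URL.format(prefix=prefix, year=year)
```

If the variant is also included in the landing path, both reports can be backfilled side by side without their files colliding.
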
---

### 6.3 World Bank Commodity Prices (Pink Sheet)

| Field | Value |
|-------|-------|
| URL | `https://thedocs.worldbank.org/en/doc/18675f1d1639c7a34d463f59255d3f88-0050012023/related/CMO-Pink-Sheet.xlsx` |
| Data Type | Monthly benchmark prices for 70+ commodities including Arabica (Other Milds, NY) and Robusta (ICE London) |
| Access Method | Public Excel download — no auth |
| Update Frequency | Monthly |
| History | 1960 to present |
| Priority | **Medium** |

The Pink Sheet provides monthly Arabica and Robusta price benchmarks alongside other agricultural commodities. Useful for macro context and relative value analysis. Single XLSX, easy to parse.

---

### 6.4 FAO Crop Calendar

| Field | Value |
|-------|-------|
| URL | `https://cropcalendar.apps.fao.org/` |
| Data Type | Coffee planting, flowering, and harvest windows by country |
| Access Method | Public — no auth (manual download or scrape) |
| Priority | **Medium** |

FAO crop calendar provides the seasonal context needed to interpret weather anomalies correctly (e.g., drought during flowering is more damaging than drought post-harvest). Suitable as a one-time seed table per growing region, updated annually if needed.

---

## 7. Reference / Seed Data

All maintained as CSV files in `transform/sqlmesh_materia/seeds/`:

| File | Purpose |
|------|---------|
| `dim_commodity.csv` | Commodity master — code, name, exchange, unit |
| `psd_commodity_codes.csv` | USDA PSD commodity code lookup |
| `psd_attribute_codes.csv` | USDA PSD attribute code lookup (production, stocks, etc.) |
| `psd_unit_of_measure_codes.csv` | USDA PSD unit code lookup |
| `commodity_exchange_codes.csv` | Exchange code mapping |
| `psd_codes_exchange_codes_merge.csv` | Join table linking PSD codes to exchange codes |
| `weather_locations.csv` | Open-Meteo location metadata (id, name, country, lat, lon, variety) |