beanflows/docs/data-sources-inventory.md
Deeman 70415e23b8 docs: add data sources inventory
Documents all 7 ingested sources (CFTC COT, Yahoo Finance KC=F, ICE stocks×3,
USDA PSD, Open-Meteo ERA5) plus planned sources (ICE options, COT combined,
World Bank Pink Sheet, FAO crop calendar). Matches padelnomics inventory format.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-26 09:57:46 +01:00

# BeanFlows — Data Sources Inventory
Compiled: 2026-02-26
Purpose: Identify and track data sources feeding the BeanFlows DuckDB analytics pipeline.
---
## Pipeline Status Tracker
**Status:** ✅ Ingested — extractor + model live in `master` | 🔲 Planned — worth building | ⏸ On hold — blocked on cost/access | — Not targeted
**Score (1–5):** Overall ingestion priority. Weighs data value to BeanFlows (price analytics, COT positioning, crop weather, PSD fundamentals) against implementation effort and access barriers. 5 = core infrastructure already ingested, 1 = marginal or inaccessible.
| Source | Category | Status | Score | Credentials | Pipeline refs |
|--------|----------|--------|-------|-------------|---------------|
| CFTC COT Disaggregated Futures | Positioning | ✅ Ingested | 5 | None | `extract_cot` → `fct_cot_positioning` → `serving.cot_positioning` |
| Yahoo Finance — KC=F | Price | ✅ Ingested | 5 | None | `extract_coffee_prices` → `fct_coffee_prices` → `serving.coffee_prices` |
| ICE Report Center — warehouse stocks | Warehouse / Inventory | ✅ Ingested | 5 | None | `extract_ice_stocks` → `fct_ice_warehouse_stocks` → `serving.ice_warehouse_stocks` |
| ICE Report Center — stocks by port | Warehouse / Inventory | ✅ Ingested | 4 | None | `extract_ice_stocks` → `fct_ice_warehouse_stocks_by_port` → `serving.ice_warehouse_stocks_by_port` |
| ICE Report Center — aging stocks | Warehouse / Inventory | ✅ Ingested | 4 | None | `extract_ice_stocks` → `fct_ice_aging_stocks` → `serving.ice_aging_stocks` |
| USDA PSD Online | Fundamentals (supply/demand) | ✅ Ingested | 5 | None | `extract_psd` → `stg_psdalldata__commodity` → `serving.commodity_metrics` |
| Open-Meteo ERA5 — weather | Crop weather | ✅ Ingested | 5 | None | `extract_openmeteo` → `fct_weather_daily` → `serving.weather_daily` |
| ICE Coffee C — options chain | Derivatives / Volatility | 🔲 Planned | 4 | None (yfinance) or paid | TBD |
| CFTC COT — options-and-futures combined | Positioning | 🔲 Planned | 3 | None (same ZIP) | `fct_cot_positioning` variant |
| World Bank Commodity Prices (Pink Sheet) | Benchmark prices | 🔲 Planned | 3 | None | `extract_wb_prices` → `fct_wb_prices` |
| FAO Crop Calendar | Seasonality | 🔲 Planned | 3 | None (CSV) | Seed table |
| Freight / C4 route rates | Supply chain | 🔲 Planned | 2 | None (scrape) | `fct_freight_rates` |
| ICE Data Services — tick data | Price (granular) | ⏸ On hold | 2 | Paid subscription | Commercial; not needed for daily analytics |
| Refinitiv / LSEG | Price / Fundamentals | — | 1 | Enterprise subscription | Superseded by free ICE + CFTC + USDA sources |
| Bloomberg Terminal | Price / News | — | 1 | Terminal license | Not cost-effective for current scope |
---
## 1. Positioning Data
### 1.1 CFTC COT Disaggregated Futures
| Field | Value |
|-------|-------|
| URL | `https://www.cftc.gov/files/dea/history/fut_disagg_txt_{year}.zip` |
| Data Type | Weekly futures-only positioning by trader category (Producer/Merchant, Swap Dealer, Managed Money, Other Reportable, Non-Reportable) |
| Access Method | Public download — no auth, no API key |
| Update Frequency | Weekly (Friday 3:30 PM ET); current-year file updated in-place |
| History | 2006-06-13 to present |
| License / TOS | US government data — public domain |
| Priority | **Core** |
The CFTC publishes the Disaggregated Futures-Only report as one ZIP per year, containing a single CSV with all commodity codes. Each file is ~330 MB. The current-year file is overwritten each Friday; prior years are static.
Column quirk: `Swap__Positions_Short_All` and `Swap__Positions_Spread_All` contain a double underscore. The inconsistency comes from the CFTC source file and is not a typo in this document; all other swap columns use single underscores. DuckDB `all_varchar = TRUE` preserves exact header names; these columns must be quoted in SQL.
**Pipeline implementation:** ✅ Ingested
- Extractor: `extract/cftc_cot/` — backfills all years from 2006, idempotent via etag (synthetic etag = year + content-length + last-modified hash when CFTC omits etag header)
- Landing: `data/landing/cot/{year}/{etag}.csv.gzip`
- Foundation: `foundation.fct_cot_positioning` — casts types, cleans names, computes net positions (long − short), deduplicates via HASH key
- Grain: `(cftc_commodity_code, report_date, cftc_contract_market_code, ingest_date)`
- Serving: `serving.cot_positioning` — adds COT index (normalized percentile rank over 26w / 52w rolling window), managed money net % of OI
- Covers all commodity codes in the report — filtering to coffee (`073642`) happens in the serving layer
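
The synthetic etag fallback mentioned in the extractor bullet can be sketched roughly as follows. This is a minimal illustration, not the extractor's actual code: the helper names `synthetic_etag` and `resolve_etag`, the `:` separator, and the 16-character truncation are all assumptions.

```python
import hashlib


def synthetic_etag(year: int, content_length: str, last_modified: str) -> str:
    """Build a stand-in etag from response metadata (hypothetical helper)."""
    # Assumption: the three fields are joined with ':' and hashed together;
    # the real extractor may combine them differently.
    raw = f"{year}:{content_length}:{last_modified}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def resolve_etag(year: int, headers: dict[str, str]) -> str:
    """Prefer the server's etag; fall back to the synthetic one when absent."""
    etag = headers.get("ETag", "").strip('"')
    if etag:
        return etag
    return synthetic_etag(
        year,
        headers.get("Content-Length", ""),
        headers.get("Last-Modified", ""),
    )
```

Because the fallback depends only on the year and two response headers, an unchanged file should map to the same landing filename on every run, which is what makes the re-download skippable.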
**Related: CFTC COT Options-and-Futures Combined**
The same URL pattern with `com_disagg_txt_{year}.zip` gives the combined futures+options report. We currently use the futures-only report. Adding the combined variant would enable options-specific positioning analysis (see Section 6.2).
---
## 2. Price Data
### 2.1 Yahoo Finance — Coffee C (KC=F)
| Field | Value |
|-------|-------|
| URL | Yahoo Finance via `yfinance` Python library; ticker `KC=F` |
| Data Type | Daily OHLCV + adjusted close for Coffee C continuous front-month futures |
| Access Method | `yfinance.Ticker("KC=F").history(period="max")` — free, no auth |
| Update Frequency | Daily (post-settlement, typically ~30 min after session close) |
| History | 1971-08-16 to present |
| License / TOS | Yahoo Finance ToS — data for personal/non-commercial use; not for redistribution |
| Priority | **Core** |
Yahoo Finance is the only free source for Coffee C daily OHLCV with full history back to 1971. Data quality is generally good for daily analytics; occasional gaps on non-US holidays. Adjusted close (`Adj Close`, note space in header) accounts for contract rolls.
Column quirk: `Adj Close` has a space in the CSV header. DuckDB `all_varchar = TRUE` preserves this; must be quoted as `"Adj Close"` in SQL.
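
One way to avoid hand-quoting such headers is a small helper that applies standard SQL identifier quoting. The sketch below is illustrative only: `quote_ident` is a hypothetical helper and `raw_prices` is a placeholder table name, neither taken from the pipeline.

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


# Header names as they appear in the CSV; "Adj Close" contains a space.
columns = ["Date", "Close", "Adj Close"]
select_list = ", ".join(quote_ident(c) for c in columns)

# raw_prices is a placeholder table name for this sketch.
query = f"SELECT {select_list} FROM raw_prices"
print(query)
```

The same helper would also cover the double-underscore CFTC columns described in Section 1.1.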
**Pipeline implementation:** ✅ Ingested
- Extractor: `extract/coffee_prices/` — downloads full history via `ticker.history(period="max")`, idempotent via SHA256 of CSV bytes
- Landing: `data/landing/prices/coffee_kc/{hash8}.csv.gzip` (single file; hash changes when new trading days are appended)
- Foundation: `foundation.fct_coffee_prices` — casts, deduplicates via `HASH(Date, Close)`
- Grain: `trade_date`
- Serving: `serving.coffee_prices` — adds daily return, SMA 20/50/200, EMA 9/21, Bollinger Bands (20d, ±2σ), RSI 14, 52-week high/low/range
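
The content-hash idempotency described in the extractor and landing bullets might look roughly like this. It is a sketch under assumptions: `land_if_new` is a hypothetical helper, and `{hash8}` is taken to mean the first 8 hex characters of the SHA256 digest of the uncompressed CSV bytes.

```python
import gzip
import hashlib
from pathlib import Path


def land_if_new(csv_bytes: bytes, landing_dir: Path) -> tuple[Path, bool]:
    """Write gzip-compressed bytes under a content-derived filename.

    Returns (path, written); written is False when the file already exists.
    """
    # Assumption: hash8 = first 8 hex chars of SHA256 over the raw CSV bytes.
    hash8 = hashlib.sha256(csv_bytes).hexdigest()[:8]
    target = landing_dir / f"{hash8}.csv.gzip"
    if target.exists():
        return target, False  # identical content has already been landed
    landing_dir.mkdir(parents=True, exist_ok=True)
    target.write_bytes(gzip.compress(csv_bytes))
    return target, True
```

Re-running on a day with no new trading data produces the same hash, so nothing is rewritten; a new trading day changes the bytes and therefore the filename.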
---
## 3. Warehouse & Inventory Data
### 3.1 ICE Report Center — Warehouse Stocks
| Field | Value |
|-------|-------|
| URL | `https://www.theice.com/publicdocs/futures_us/exchange_notices/coffee_certifiedstocks.csv` (rolling) + `https://www.ice.com/marketdata/api/reports/293/results` (API, product_id=2) |
| Data Type | Daily ICE-certified and pending-grading coffee bags (total, by port, by age bucket) |
| Access Method | Public — no auth required. Static CSV for rolling data; undocumented JSON API for historical report catalogue |
| Update Frequency | Daily (trading days) for stocks; monthly for aging report |
| History | Full archive available via report API (~2010 to present); static CSV is rolling |
| License / TOS | ICE — public market data |
| Priority | **Core** |
Three distinct datasets served by one extractor:
1. **Daily warehouse stocks** (`ice_stocks`) — total certified bags + pending grading. Key supply constraint indicator.
2. **Stocks by port** (`ice_stocks_by_port`) — breakdown across NY, New Orleans, Houston, Miami, Antwerp, Hamburg/Bremen, Barcelona, Virginia. Port-level flow analysis.
3. **Aging stocks** (`ice_aging`) — bags grouped by age bucket (e.g., "0 to 30", "31 to 60" days). Older stocks command quality discounts; aging ratio is a quality/supply stress signal.
The report API is undocumented but stable. Reports are discovered via `POST /api/reports/293/results` with `productId=2`, paginated. XLS/XLSX files are parsed with `xlrd`; the extractor handles both OLE2 `.xls` and modern `.xlsx` formats via magic-byte detection.
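
Magic-byte detection of the two spreadsheet formats can be sketched as below. The signatures themselves are the standard ones (OLE2 compound files start with `D0 CF 11 E0 A1 B1 1A E1`; `.xlsx` files are ZIP archives starting with `PK\x03\x04`), but the function name and error handling are assumptions rather than the extractor's actual code.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"  # .xlsx is a ZIP container


def sniff_excel_format(payload: bytes) -> str:
    """Return 'xls' or 'xlsx' based on the leading bytes of the download."""
    if payload.startswith(OLE2_MAGIC):
        return "xls"
    if payload.startswith(ZIP_MAGIC):
        return "xlsx"
    # Anything else (for example an HTML error page) is rejected outright.
    raise ValueError("unrecognised spreadsheet signature")
```

Sniffing the bytes instead of trusting the file extension or `Content-Type` guards against reports that are served with a misleading name.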
**Pipeline implementation:** ✅ Ingested
- Extractor: `extract/ice_stocks/` — idempotent via SHA256 of content
- Landing:
  - `data/landing/ice_stocks/{year}/{date}_{hash8}.csv.gzip`
  - `data/landing/ice_aging/{year}/{date}_{hash8}.csv.gzip`
  - `data/landing/ice_stocks_by_port/{year}/{date}_{hash8}.csv.gzip`
- Foundation: `fct_ice_warehouse_stocks`, `fct_ice_aging_stocks`, `fct_ice_warehouse_stocks_by_port`
- Serving: corresponding `serving.*` models with WoW change, 30d/52w rolling averages, drawdown from 52w high, age-bucket share
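
Two of the serving metrics, week-over-week change and drawdown from the 52-week high, can be illustrated in plain Python. The serving models are SQL, so this only shows the arithmetic. The lag of 5 trading days per week and the window of 252 trading days per year are assumptions, as are the function names.

```python
from typing import Optional


def wow_change(values: list[float], lag: int = 5) -> list[Optional[float]]:
    """Difference versus the value `lag` observations earlier (None if unavailable)."""
    return [
        None if i < lag else values[i] - values[i - lag]
        for i in range(len(values))
    ]


def drawdown_from_high(values: list[float], window: int = 252) -> list[float]:
    """Fractional distance below the rolling maximum of the trailing window."""
    out = []
    for i, v in enumerate(values):
        high = max(values[max(0, i - window + 1) : i + 1])
        out.append((v - high) / high if high else 0.0)
    return out


stocks = [100.0, 110.0, 120.0, 90.0, 80.0, 100.0]
print(wow_change(stocks))
print(drawdown_from_high(stocks, window=3))
```

A drawdown of 0.0 means the series is at its rolling high; increasingly negative values indicate certified stocks draining away from that peak.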
---
## 4. Fundamentals Data
### 4.1 USDA PSD Online — Production, Supply, and Distribution
| Field | Value |
|-------|-------|
| URL | `https://apps.fas.usda.gov/psdonline/downloads/archives/{year}/{month:02d}/psd_alldata_csv.zip` |
| Data Type | Monthly supply/demand balances by commodity × country × market year: production, imports, exports, consumption, ending stocks |
| Access Method | Public download — no auth |
| Update Frequency | Monthly (WASDE report release dates, ~11th of each month) |
| History | 2006-08 to present (archive); current year is always available |
| License / TOS | USDA FAS — US government open data |
| Priority | **Core** |
PSD is the primary source for global coffee supply/demand fundamentals. Each monthly file contains all commodities (not just coffee) and all reporting countries for all market years. The coffee commodity code is `0721100` (green bean equivalent). Market year for coffee runs October–September.
Key attributes tracked:
- `AREA_HARVESTED` (ha), `PRODUCTION` (1000 MT or 60kg bags), `DOMESTIC_CONSUMPTION`, `EXPORTS`, `ENDING_STOCKS`, `STOCKS_TO_USE_RATIO_`
**Pipeline implementation:** ✅ Ingested
- Extractor: `extract/psdonline/` — backfills from 2006-08, idempotent via etag
- Landing: `data/landing/psd/{year}/{month:02d}/{etag}.csv.gzip`
- Staging: `staging.stg_psdalldata__commodity` — joins with seed tables for commodity/attribute/unit metadata; `cleaned.psdalldata__commodity_pivoted` — pivots attributes to wide format
- Seeds: `psd_commodity_codes.csv`, `psd_attribute_codes.csv`, `psd_unit_of_measure_codes.csv`
- Serving: `serving.commodity_metrics` — coffee and cocoa supply/demand balances, production growth YoY, stock-to-use ratio
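
A backfill over the monthly archive amounts to enumerating one URL per calendar month. The sketch below uses the URL template from the table above; the function name `psd_archive_urls` is hypothetical, and whether every month actually has an archive file is not something this code checks.

```python
from datetime import date

PSD_URL = (
    "https://apps.fas.usda.gov/psdonline/downloads/archives/"
    "{year}/{month:02d}/psd_alldata_csv.zip"
)


def psd_archive_urls(start: date, end: date) -> list[str]:
    """One archive URL per calendar month from start to end, inclusive."""
    urls = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        urls.append(PSD_URL.format(year=year, month=month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls


for url in psd_archive_urls(date(2006, 8, 1), date(2006, 10, 1)):
    print(url)
```

Combined with etag-based idempotency, re-running the full range should only download months whose file has changed or is missing from the landing zone.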
---
## 5. Weather & Climate Data
### 5.1 Open-Meteo — ERA5 Reanalysis + Forecast Blend
| Field | Value |
|-------|-------|
| URL | Archive: `https://archive-api.open-meteo.com/v1/archive` · Forecast: `https://api.open-meteo.com/v1/forecast` |
| Data Type | Daily weather for 12 coffee-growing regions: temperature (min/max/mean), precipitation, wind, humidity, cloud cover, ET₀, VPD |
| Access Method | Free API — no key, no registration |
| Update Frequency | Daily; ERA5 reanalysis available to ~5 days ago, gap filled by forecast API |
| History | ERA5 archive from 1940; pipeline backfilled from 2020-01-01 |
| License / TOS | CC BY 4.0 — attribution required |
| Rate Limiting | No published rate limit; community API. Sleep 0.5s between location calls. Pre-check file existence to skip API calls on re-runs. |
| Priority | **Core** |
Open-Meteo wraps ECMWF ERA5 reanalysis data, which is the scientific standard for historical weather. The API requires no key and has no formal rate limit for reasonable usage (~12 calls/day for daily updates).
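
The rate-limiting approach from the table (pre-check file existence, then pause 0.5 s between location calls) could be sketched like this. The function `fetch_missing` and its injected `fetch` and `sleep` callables are hypothetical; `fetch` stands in for the real API call and is assumed to return bytes that are already gzip-compressed. The path layout follows the landing pattern documented in the pipeline notes.

```python
import time
from pathlib import Path
from typing import Callable, Iterable


def fetch_missing(
    location_ids: Iterable[str],
    day: str,  # ISO date, e.g. "2026-02-25"
    landing_root: Path,
    fetch: Callable[[str, str], bytes],
    pause_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Fetch only locations whose file for `day` is not on disk yet."""
    fetched = []
    for loc in location_ids:
        target = landing_root / loc / day[:4] / f"{day}.json.gz"
        if target.exists():
            continue  # already landed: skip the API call entirely
        payload = fetch(loc, day)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload)
        fetched.append(loc)
        sleep(pause_s)  # stay polite towards the community API
    return fetched
```

Passing `sleep` as a parameter keeps the pause out of unit tests while leaving the production default at `time.sleep`.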
Variables fetched:
- `temperature_2m_max/min/mean` — frost detection (`<5°C`), heat stress (`>30°C`)
- `precipitation_sum` — drought and flood signals
- `wind_speed_10m_max` — wind damage proxy
- `relative_humidity_2m_max` — disease pressure (coffee leaf rust, CBD)
- `cloud_cover_mean` — solar radiation proxy
- `et0_fao_evapotranspiration` — crop water demand (Penman-Monteith)
- `vapour_pressure_deficit_max` — transpiration stress (`>1.5 kPa` = significant stress)
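
The thresholds listed above translate into simple boolean flags. The sketch below assumes that frost is evaluated on the daily minimum and heat stress on the daily maximum, which the list does not state explicitly; the `DailyWeather` class and `stress_flags` function are hypothetical, and the drought flag is omitted because no threshold is given for it here.

```python
from dataclasses import dataclass


@dataclass
class DailyWeather:
    temperature_2m_min: float  # °C
    temperature_2m_max: float  # °C
    vapour_pressure_deficit_max: float  # kPa


def stress_flags(day: DailyWeather) -> dict[str, bool]:
    """Apply the documented thresholds to one location-day."""
    return {
        # Assumption: frost threshold applies to the daily minimum.
        "is_frost": day.temperature_2m_min < 5.0,
        # Assumption: heat-stress threshold applies to the daily maximum.
        "is_heat_stress": day.temperature_2m_max > 30.0,
        "is_high_vpd": day.vapour_pressure_deficit_max > 1.5,
    }


print(stress_flags(DailyWeather(3.2, 24.0, 0.9)))
```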
**12 locations** covering the world's primary Arabica and Robusta growing zones (BR ×3, VN, CO, ET, HN, GT, ID, PE, UG, CI). See `extract/openmeteo/src/openmeteo/locations.py`.
**Pipeline implementation:** ✅ Ingested
- Extractor: `extract/openmeteo/` — daily run uses forecast API (10-day window); backfill uses archive API (2020–present)
- Idempotent: file-existence check per day per location before API call
- Landing: `data/landing/weather/{location_id}/{year}/{date}.json.gz` (one file per location per day)
- Foundation: `foundation.fct_weather_daily` — reads JSON glob, joins with `seeds.weather_locations`, derives boolean crop stress flags (`is_drought`, `is_heat_stress`, `is_high_vpd`, `is_frost`)
- Serving: `serving.weather_daily` — adds rolling aggregates (7d, 30d), temperature anomaly, water balance, drought/heat/VPD streak counters (gaps-and-islands), composite `crop_stress_index` (0–100)
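
The streak counters are computed in SQL with a gaps-and-islands pattern; the result they aim for can be shown with a plain Python equivalent. This is only an illustration of the intended output for a single location's date-ordered flags, and `streak_lengths` is a hypothetical name.

```python
def streak_lengths(flags: list[bool]) -> list[int]:
    """Running count of consecutive True values; resets to 0 on each False."""
    out = []
    run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        out.append(run)
    return out


# Seven consecutive days of an is_drought-style flag for one location.
is_drought = [False, True, True, False, True, True, True]
print(streak_lengths(is_drought))  # [0, 1, 2, 0, 1, 2, 3]
```

Each run of consecutive True days is one "island"; the counter is the position of the day within its island.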
---
## 6. Planned Sources
### 6.1 ICE Coffee C — Options Chain
| Field | Value |
|-------|-------|
| Data Type | Per-strike open interest, volume, implied volatility for KC=F options |
| Use Case | IV term structure, put/call skew, options positioning — leading indicator for futures moves |
| Access Options | `yfinance` (free, limited history); barchart OnDemand API (paid); ICE Data Services (enterprise) |
| Priority | **High** |
See research note in `docs/ice-options-research.md` (to be added when research completes).
---
### 6.2 CFTC COT — Options-and-Futures Combined
| Field | Value |
|-------|-------|
| URL | `https://www.cftc.gov/files/dea/history/com_disagg_txt_{year}.zip` |
| Data Type | Same as 1.1 but positions include options delta-equivalent; captures net exposure of option writers |
| Access Method | Public — no auth |
| Priority | **Medium** |
Currently we ingest the futures-only (`fut_disagg`) report. The combined report (`com_disagg`) adjusts for options delta and shows total directional exposure. Adding it would be a minor extractor change: same URL pattern, same CSV schema, different CFTC internal identifier. Could run as a second extractor or a variant flag in the existing one.
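
The "variant flag" option could be as small as a URL builder keyed on the report type. In this sketch the two URL prefixes come from the patterns documented above, while the function name `cot_url` and the variant labels `futures_only` and `combined` are assumptions.

```python
COT_URL = "https://www.cftc.gov/files/dea/history/{prefix}_disagg_txt_{year}.zip"

# Variant labels are hypothetical; the prefixes are the documented ones.
PREFIXES = {"futures_only": "fut", "combined": "com"}


def cot_url(year: int, variant: str = "futures_only") -> str:
    """Build the yearly ZIP URL for the requested disaggregated report."""
    try:
        prefix = PREFIXES[variant]
    except KeyError:
        raise ValueError(f"unknown COT variant: {variant!r}") from None
    return COT_URL.format(prefix=prefix, year=year)


print(cot_url(2024))
print(cot_url(2024, variant="combined"))
```

Keeping the default at `futures_only` would leave the existing extractor's behaviour unchanged while letting a second schedule request the combined report.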
---
### 6.3 World Bank Commodity Prices (Pink Sheet)
| Field | Value |
|-------|-------|
| URL | `https://thedocs.worldbank.org/en/doc/18675f1d1639c7a34d463f59255d3f88-0050012023/related/CMO-Pink-Sheet.xlsx` |
| Data Type | Monthly benchmark prices for 70+ commodities including Arabica (Other Milds, NY) and Robusta (ICE London) |
| Access Method | Public Excel download — no auth |
| Update Frequency | Monthly |
| History | 1960 to present |
| Priority | **Medium** |
The Pink Sheet provides monthly Arabica and Robusta price benchmarks alongside other agricultural commodities. Useful for macro context and relative value analysis. Single XLSX, easy to parse.
---
### 6.4 FAO Crop Calendar
| Field | Value |
|-------|-------|
| URL | `https://cropcalendar.apps.fao.org/` |
| Data Type | Coffee planting, flowering, and harvest windows by country |
| Access Method | Public — no auth (manual download or scrape) |
| Priority | **Medium** |
FAO crop calendar provides the seasonal context needed to interpret weather anomalies correctly (e.g., drought during flowering is more damaging than drought post-harvest). Suitable as a one-time seed table per growing region, updated annually if needed.
---
## 7. Reference / Seed Data
All maintained as CSV files in `transform/sqlmesh_materia/seeds/`:
| File | Purpose |
|------|---------|
| `dim_commodity.csv` | Commodity master — code, name, exchange, unit |
| `psd_commodity_codes.csv` | USDA PSD commodity code lookup |
| `psd_attribute_codes.csv` | USDA PSD attribute code lookup (production, stocks, etc.) |
| `psd_unit_of_measure_codes.csv` | USDA PSD unit code lookup |
| `commodity_exchange_codes.csv` | Exchange code mapping |
| `psd_codes_exchange_codes_merge.csv` | Join table linking PSD codes to exchange codes |
| `weather_locations.csv` | Open-Meteo location metadata (id, name, country, lat, lon, variety) |