diff --git a/docs/data-sources-inventory.md b/docs/data-sources-inventory.md
new file mode 100644
index 0000000..2ef7d98
--- /dev/null
+++ b/docs/data-sources-inventory.md
@@ -0,0 +1,259 @@
+# BeanFlows — Data Sources Inventory
+
+Compiled: 2026-02-26
+Purpose: Identify and track data sources feeding the BeanFlows DuckDB analytics pipeline.
+
+---
+
+## Pipeline Status Tracker
+
+**Status:** ✅ Ingested — extractor + model live in `master` | 🔲 Planned — worth building | ⏸ On hold — blocked on cost/access | — Not targeted
+
+**Score (1–5):** Overall ingestion priority. Weighs data value to BeanFlows (price analytics, COT positioning, crop weather, PSD fundamentals) against implementation effort and access barriers. 5 = core infrastructure already ingested, 1 = marginal or inaccessible.
+
+| Source | Category | Status | Score | Credentials | Pipeline refs |
+|--------|----------|--------|-------|-------------|---------------|
+| CFTC COT Disaggregated Futures | Positioning | ✅ Ingested | 5 | None | `extract_cot` → `fct_cot_positioning` → `serving.cot_positioning` |
+| Yahoo Finance — KC=F | Price | ✅ Ingested | 5 | None | `extract_coffee_prices` → `fct_coffee_prices` → `serving.coffee_prices` |
+| ICE Report Center — warehouse stocks | Warehouse / Inventory | ✅ Ingested | 5 | None | `extract_ice_stocks` → `fct_ice_warehouse_stocks` → `serving.ice_warehouse_stocks` |
+| ICE Report Center — stocks by port | Warehouse / Inventory | ✅ Ingested | 4 | None | `extract_ice_stocks` → `fct_ice_warehouse_stocks_by_port` → `serving.ice_warehouse_stocks_by_port` |
+| ICE Report Center — aging stocks | Warehouse / Inventory | ✅ Ingested | 4 | None | `extract_ice_stocks` → `fct_ice_aging_stocks` → `serving.ice_aging_stocks` |
+| USDA PSD Online | Fundamentals (supply/demand) | ✅ Ingested | 5 | None | `extract_psd` → `stg_psdalldata__commodity` → `serving.commodity_metrics` |
+| Open-Meteo ERA5 — weather | Crop weather | ✅ Ingested | 5 | None | `extract_openmeteo` → `fct_weather_daily` → `serving.weather_daily` |
+| ICE Coffee C — options chain | Derivatives / Volatility | 🔲 Planned | 4 | None (yfinance) or paid | TBD |
+| CFTC COT — options-and-futures combined | Positioning | 🔲 Planned | 3 | None (same URL pattern) | `fct_cot_positioning` variant |
+| World Bank Commodity Prices (Pink Sheet) | Benchmark prices | 🔲 Planned | 3 | None | `extract_wb_prices` → `fct_wb_prices` |
+| FAO Crop Calendar | Seasonality | 🔲 Planned | 3 | None (CSV) | Seed table |
+| Freight / C4 route rates | Supply chain | 🔲 Planned | 2 | None (scrape) | `fct_freight_rates` |
+| ICE Data Services — tick data | Price (granular) | ⏸ On hold | 2 | Paid subscription | Commercial; not needed for daily analytics |
+| Refinitiv / LSEG | Price / Fundamentals | — | 1 | Enterprise subscription | Superseded by free ICE + CFTC + USDA sources |
+| Bloomberg Terminal | Price / News | — | 1 | Terminal license | Not cost-effective for current scope |
+
+---
+
+## 1. Positioning Data
+
+### 1.1 CFTC COT Disaggregated Futures
+
+| Field | Value |
+|-------|-------|
+| URL | `https://www.cftc.gov/files/dea/history/fut_disagg_txt_{year}.zip` |
+| Data Type | Weekly futures-only positioning by trader category (Producer/Merchant, Swap Dealer, Managed Money, Other Reportable, Non-Reportable) |
+| Access Method | Public download — no auth, no API key |
+| Update Frequency | Weekly (Friday 3:30 PM ET); current-year file updated in place |
+| History | 2006-06-13 to present |
+| License / TOS | US government data — public domain |
+| Priority | **Core** |
+
+The CFTC publishes the Disaggregated Futures-Only report as one ZIP per year, containing a single CSV with all commodity codes. Each file is ~3–30 MB. The current-year file is overwritten each Friday; prior years are static.
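The one-ZIP-per-year layout described above can be sketched as a small URL planner. This is a minimal illustration, not the code in `extract/cftc_cot/`: `cot_urls` and `needs_refresh` are hypothetical helper names, and the first year is taken from the History row of the table.

```python
from datetime import date

# URL pattern from the table above; helper names are illustrative only.
COT_URL = "https://www.cftc.gov/files/dea/history/fut_disagg_txt_{year}.zip"
FIRST_YEAR = 2006  # history starts 2006-06-13


def cot_urls(today: date) -> dict[int, str]:
    """Map each report year from FIRST_YEAR to the current year to its ZIP URL."""
    return {y: COT_URL.format(year=y) for y in range(FIRST_YEAR, today.year + 1)}


def needs_refresh(year: int, today: date) -> bool:
    """Only the current-year file is overwritten weekly; prior years are static."""
    return year == today.year


urls = cot_urls(date(2026, 2, 26))
print(len(urls), urls[2026])
```

A backfill would download every URL once and then re-request only the years for which `needs_refresh` is true. The sketch ignores the first days of January, when the previous year's file may still receive its final update.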
+
+Column quirk: `Swap__Positions_Short_All` and `Swap__Positions_Spread_All` use double underscores — this is a quirk of the CFTC source data, not a typo. All other swap columns use single underscores. DuckDB `all_varchar = TRUE` preserves exact header names; these columns must be quoted in SQL.
+
+**Pipeline implementation:** ✅ Ingested
+- Extractor: `extract/cftc_cot/` — backfills all years from 2006, idempotent via etag (synthetic etag = year + content-length + last-modified hash when CFTC omits etag header)
+- Landing: `data/landing/cot/{year}/{etag}.csv.gzip`
+- Foundation: `foundation.fct_cot_positioning` — casts types, cleans names, computes net positions (long − short), deduplicates via HASH key
+- Grain: `(cftc_commodity_code, report_date, cftc_contract_market_code, ingest_date)`
+- Serving: `serving.cot_positioning` — adds COT index (normalized percentile rank over 26w / 52w rolling window), managed money net % of OI
+- Covers all commodity codes in the report — filtering to coffee (`073642`) happens in the serving layer
+
+**Related: CFTC COT Options-and-Futures Combined**
+
+The same URL pattern with `com_disagg_txt_{year}.zip` gives the combined futures+options report. We currently use the futures-only report. Adding the combined variant would enable options-specific positioning analysis (see Section 6.2).
+
+---
+
+## 2. Price Data
+
+### 2.1 Yahoo Finance — Coffee C (KC=F)
+
+| Field | Value |
+|-------|-------|
+| URL | Yahoo Finance via `yfinance` Python library; ticker `KC=F` |
+| Data Type | Daily OHLCV + adjusted close for Coffee C continuous front-month futures |
+| Access Method | `yfinance.Ticker("KC=F").history(period="max")` — free, no auth |
+| Update Frequency | Daily (post-settlement, typically ~30 min after session close) |
+| History | 1971-08-16 to present |
+| License / TOS | Yahoo Finance ToS — data for personal/non-commercial use; not for redistribution |
+| Priority | **Core** |
+
+Yahoo Finance is the only free source for Coffee C daily OHLCV with full history back to 1971. Data quality is generally good for daily analytics, with occasional gaps on non-US holidays. Adjusted close (`Adj Close`, note space in header) accounts for contract rolls.
+
+Column quirk: `Adj Close` has a space in the CSV header. DuckDB `all_varchar = TRUE` preserves this; must be quoted as `"Adj Close"` in SQL.
+
+**Pipeline implementation:** ✅ Ingested
+- Extractor: `extract/coffee_prices/` — downloads full history via `ticker.history(period="max")`, idempotent via SHA256 of CSV bytes
+- Landing: `data/landing/prices/coffee_kc/{hash8}.csv.gzip` (single file; hash changes when new trading days are appended)
+- Foundation: `foundation.fct_coffee_prices` — casts, deduplicates via `HASH(Date, Close)`
+- Grain: `trade_date`
+- Serving: `serving.coffee_prices` — adds daily return, SMA 20/50/200, EMA 9/21, Bollinger Bands (20d, ±2σ), RSI 14, 52-week high/low/range
+
+---
+
+## 3. Warehouse & Inventory Data
+
+### 3.1 ICE Report Center — Warehouse Stocks
+
+| Field | Value |
+|-------|-------|
+| URL | `https://www.theice.com/publicdocs/futures_us/exchange_notices/coffee_certifiedstocks.csv` (rolling) + `https://www.ice.com/marketdata/api/reports/293/results` (API, product_id=2) |
+| Data Type | Daily ICE-certified and pending-grading coffee bags (total, by port, by age bucket) |
+| Access Method | Public — no auth required. Static CSV for rolling data; undocumented JSON API for historical report catalogue |
+| Update Frequency | Daily (trading days) for stocks; monthly for aging report |
+| History | Full archive available via report API (~2010 to present); static CSV is rolling |
+| License / TOS | ICE — public market data |
+| Priority | **Core** |
+
+Three distinct datasets served by one extractor:
+
+1. **Daily warehouse stocks** (`ice_stocks`) — total certified bags + pending grading. Key supply constraint indicator.
+2. **Stocks by port** (`ice_stocks_by_port`) — breakdown across NY, New Orleans, Houston, Miami, Antwerp, Hamburg/Bremen, Barcelona, Virginia. Port-level flow analysis.
+3. **Aging stocks** (`ice_aging`) — bags grouped by age bucket (e.g., "0 to 30", "31 to 60" days). Older stocks command quality discounts; aging ratio is a quality/supply stress signal.
+
+The report API is undocumented but stable. Reports are discovered via `POST /api/reports/293/results` with `productId=2`, paginated. XLS/XLSX files are parsed with `xlrd`; the extractor handles both OLE2 `.xls` and modern `.xlsx` formats via magic-byte detection.
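The magic-byte detection mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the code in `extract/ice_stocks/`: `detect_excel_format` is a hypothetical name, and the two signatures are the standard OLE2 compound-file header and the ZIP local-file header (an `.xlsx` file is a ZIP archive).

```python
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # legacy .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"  # .xlsx is a ZIP container


def detect_excel_format(payload: bytes) -> str:
    """Return 'xls', 'xlsx', or 'unknown' based on the leading bytes."""
    if payload.startswith(OLE2_MAGIC):
        return "xls"
    if payload.startswith(ZIP_MAGIC):
        return "xlsx"
    return "unknown"


print(detect_excel_format(OLE2_MAGIC + b"\x00" * 8))
print(detect_excel_format(b"PK\x03\x04rest-of-archive"))
```

Sniffing the content instead of trusting the file extension helps when downloads are labelled inconsistently. Note that xlrd 2.x reads only `.xls`, so the `xlsx` branch may need a different reader depending on the pinned version.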
+
+**Pipeline implementation:** ✅ Ingested
+- Extractor: `extract/ice_stocks/` — idempotent via SHA256 of content
+- Landing:
+  - `data/landing/ice_stocks/{year}/{date}_{hash8}.csv.gzip`
+  - `data/landing/ice_aging/{year}/{date}_{hash8}.csv.gzip`
+  - `data/landing/ice_stocks_by_port/{year}/{date}_{hash8}.csv.gzip`
+- Foundation: `fct_ice_warehouse_stocks`, `fct_ice_aging_stocks`, `fct_ice_warehouse_stocks_by_port`
+- Serving: corresponding `serving.*` models with WoW change, 30d/52w rolling averages, drawdown from 52w high, age-bucket share
+
+---
+
+## 4. Fundamentals Data
+
+### 4.1 USDA PSD Online — Production, Supply, and Distribution
+
+| Field | Value |
+|-------|-------|
+| URL | `https://apps.fas.usda.gov/psdonline/downloads/archives/{year}/{month:02d}/psd_alldata_csv.zip` |
+| Data Type | Monthly supply/demand balances by commodity × country × market year: production, imports, exports, consumption, ending stocks |
+| Access Method | Public download — no auth |
+| Update Frequency | Monthly (WASDE report release dates, ~11th of each month) |
+| History | 2006-08 to present (archive); current year is always available |
+| License / TOS | USDA FAS — US government open data |
+| Priority | **Core** |
+
+PSD is the primary source for global coffee supply/demand fundamentals. Each monthly file contains all commodities (not just coffee) and all reporting countries for all market years. The coffee commodity code is `0721100` (green bean equivalent). Market year for coffee runs October–September.
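The archive URL template in the table above is already a valid Python format string, so a backfill plan can be sketched by enumerating calendar months. This is a minimal sketch, not the code in `extract/psdonline/`; `psd_archive_urls` is a hypothetical helper name, and whether every month actually has an archive file is not something the sketch checks.

```python
from datetime import date

PSD_URL = (
    "https://apps.fas.usda.gov/psdonline/downloads/archives/"
    "{year}/{month:02d}/psd_alldata_csv.zip"
)


def psd_archive_urls(start: date, end: date) -> list[str]:
    """List one archive URL per calendar month from start to end, inclusive."""
    urls = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        urls.append(PSD_URL.format(year=year, month=month))
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return urls


urls = psd_archive_urls(date(2006, 8, 1), date(2007, 1, 31))
print(urls[0])
print(len(urls))
```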
+
+Key attributes tracked:
+- `AREA_HARVESTED` (ha), `PRODUCTION` (1000 MT or 60 kg bags), `DOMESTIC_CONSUMPTION`, `EXPORTS`, `ENDING_STOCKS`, `STOCKS_TO_USE_RATIO_`
+
+**Pipeline implementation:** ✅ Ingested
+- Extractor: `extract/psdonline/` — backfills from 2006-08, idempotent via etag
+- Landing: `data/landing/psd/{year}/{month:02d}/{etag}.csv.gzip`
+- Staging: `staging.stg_psdalldata__commodity` — joins with seed tables for commodity/attribute/unit metadata; `cleaned.psdalldata__commodity_pivoted` — pivots attributes to wide format
+- Seeds: `psd_commodity_codes.csv`, `psd_attribute_codes.csv`, `psd_unit_of_measure_codes.csv`
+- Serving: `serving.commodity_metrics` — coffee and cocoa supply/demand balances, production growth YoY, stock-to-use ratio
+
+---
+
+## 5. Weather & Climate Data
+
+### 5.1 Open-Meteo — ERA5 Reanalysis + Forecast Blend
+
+| Field | Value |
+|-------|-------|
+| URL | Archive: `https://archive-api.open-meteo.com/v1/archive` · Forecast: `https://api.open-meteo.com/v1/forecast` |
+| Data Type | Daily weather for 12 coffee-growing regions: temperature (min/max/mean), precipitation, wind, humidity, cloud cover, ET₀, VPD |
+| Access Method | Free API — no key, no registration |
+| Update Frequency | Daily; ERA5 reanalysis available to ~5 days ago, gap filled by forecast API |
+| History | ERA5 archive from 1940; pipeline backfilled from 2020-01-01 |
+| License / TOS | CC BY 4.0 — attribution required |
+| Rate Limiting | No published rate limit; community API. Sleep 0.5s between location calls. Pre-check file existence to skip API calls on re-runs. |
+| Priority | **Core** |
+
+Open-Meteo wraps ECMWF ERA5 reanalysis data, which is the scientific standard for historical weather. The API requires no key and has no formal rate limit for reasonable usage (~12 calls/day for daily updates).
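The archive request and the pre-check described in the Rate Limiting row can be sketched without touching the network. This is a sketch under stated assumptions, not the code in `extract/openmeteo/`: the helper names, the `br_minas` location id, the shortened variable list, and the `timezone` value are all illustrative, and the path layout is assumed to be one file per location per day.

```python
from pathlib import Path
from urllib.parse import urlencode

ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"
# Shortened, illustrative subset of daily variables.
DAILY_VARS = ["temperature_2m_max", "temperature_2m_min", "precipitation_sum"]


def archive_request_url(lat: float, lon: float, start: str, end: str) -> str:
    """Build an archive API URL for one location and an ISO date range."""
    params = {
        "latitude": lat,
        "longitude": lon,
        "start_date": start,
        "end_date": end,
        "daily": ",".join(DAILY_VARS),
        "timezone": "UTC",  # assumption: daily aggregates on UTC day boundaries
    }
    return f"{ARCHIVE_URL}?{urlencode(params)}"


def landing_path(root: Path, location_id: str, day: str) -> Path:
    """Assumed layout: {root}/{location_id}/{year}/{date}.json.gz."""
    return root / location_id / day[:4] / f"{day}.json.gz"


def should_fetch(root: Path, location_id: str, day: str) -> bool:
    """Skip the API call when the day's file is already on disk."""
    return not landing_path(root, location_id, day).exists()


print(archive_request_url(-21.5, -45.4, "2020-01-01", "2020-01-31"))
```

A caller would loop over locations, call `should_fetch` first, and pause with `time.sleep(0.5)` between requests that are actually sent, so re-runs cost no API calls.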
+
+Variables fetched:
+- `temperature_2m_max/min/mean` — frost detection (`<5°C`), heat stress (`>30°C`)
+- `precipitation_sum` — drought and flood signals
+- `wind_speed_10m_max` — wind damage proxy
+- `relative_humidity_2m_max` — disease pressure (coffee leaf rust, CBD)
+- `cloud_cover_mean` — solar radiation proxy
+- `et0_fao_evapotranspiration` — crop water demand (Penman-Monteith)
+- `vapour_pressure_deficit_max` — transpiration stress (`>1.5 kPa` = significant stress)
+
+**12 locations** covering the world's primary Arabica and Robusta growing zones (BR ×3, VN, CO, ET, HN, GT, ID, PE, UG, CI). See `extract/openmeteo/src/openmeteo/locations.py`.
+
+**Pipeline implementation:** ✅ Ingested
+- Extractor: `extract/openmeteo/` — daily run uses forecast API (10-day window); backfill uses archive API (2020–present)
+- Idempotent: file-existence check per day per location before API call
+- Landing: `data/landing/weather/{location_id}/{year}/{date}.json.gz` (one file per location per day)
+- Foundation: `foundation.fct_weather_daily` — reads JSON glob, joins with `seeds.weather_locations`, derives boolean crop stress flags (`is_drought`, `is_heat_stress`, `is_high_vpd`, `is_frost`)
+- Serving: `serving.weather_daily` — adds rolling aggregates (7d, 30d), temperature anomaly, water balance, drought/heat/VPD streak counters (gaps-and-islands), composite `crop_stress_index` (0–100)
+
+---
+
+## 6. Planned Sources
+
+### 6.1 ICE Coffee C — Options Chain
+
+| Field | Value |
+|-------|-------|
+| Data Type | Per-strike open interest, volume, implied volatility for KC=F options |
+| Use Case | IV term structure, put/call skew, options positioning — leading indicator for futures moves |
+| Access Options | `yfinance` (free, limited history); barchart OnDemand API (paid); ICE Data Services (enterprise) |
+| Priority | **High** |
+
+See research note in `docs/ice-options-research.md` (to be added when research completes).
+
+---
+
+### 6.2 CFTC COT — Options-and-Futures Combined
+
+| Field | Value |
+|-------|-------|
+| URL | `https://www.cftc.gov/files/dea/history/com_disagg_txt_{year}.zip` |
+| Data Type | Same as 1.1 but positions include options delta-equivalent; captures net exposure of option writers |
+| Access Method | Public — no auth |
+| Priority | **Medium** |
+
+Currently we ingest the futures-only (`fut_disagg`) report. The combined report (`com_disagg`) adjusts for options delta and shows total directional exposure. Adding it would be a minor extractor change: same URL pattern, same CSV schema, different CFTC internal identifier. Could run as a second extractor or a variant flag in the existing one.
+
+---
+
+### 6.3 World Bank Commodity Prices (Pink Sheet)
+
+| Field | Value |
+|-------|-------|
+| URL | `https://thedocs.worldbank.org/en/doc/18675f1d1639c7a34d463f59255d3f88-0050012023/related/CMO-Pink-Sheet.xlsx` |
+| Data Type | Monthly benchmark prices for 70+ commodities including Arabica (Other Milds, NY) and Robusta (ICE London) |
+| Access Method | Public Excel download — no auth |
+| Update Frequency | Monthly |
+| History | 1960 to present |
+| Priority | **Medium** |
+
+The Pink Sheet provides monthly Arabica and Robusta price benchmarks alongside other agricultural commodities. Useful for macro context and relative value analysis. Single XLSX, easy to parse.
+
+---
+
+### 6.4 FAO Crop Calendar
+
+| Field | Value |
+|-------|-------|
+| URL | `https://cropcalendar.apps.fao.org/` |
+| Data Type | Coffee planting, flowering, and harvest windows by country |
+| Access Method | Public — no auth (manual download or scrape) |
+| Priority | **Medium** |
+
+FAO crop calendar provides the seasonal context needed to interpret weather anomalies correctly (e.g., drought during flowering is more damaging than drought post-harvest). Suitable as a one-time seed table per growing region, updated annually if needed.
+
+---
+
+## 7. Reference / Seed Data
+
+All maintained as CSV files in `transform/sqlmesh_materia/seeds/`:
+
+| File | Purpose |
+|------|---------|
+| `dim_commodity.csv` | Commodity master — code, name, exchange, unit |
+| `psd_commodity_codes.csv` | USDA PSD commodity code lookup |
+| `psd_attribute_codes.csv` | USDA PSD attribute code lookup (production, stocks, etc.) |
+| `psd_unit_of_measure_codes.csv` | USDA PSD unit code lookup |
+| `commodity_exchange_codes.csv` | Exchange code mapping |
+| `psd_codes_exchange_codes_merge.csv` | Join table linking PSD codes to exchange codes |
+| `weather_locations.csv` | Open-Meteo location metadata (id, name, country, lat, lon, variety) |