Data Engineering Pipeline Layers & Naming Conventions
This document outlines the standard layered architecture and model naming conventions for our data platform. Adhering to these standards is crucial for maintaining a clean, scalable, and understandable project.
Data Pipeline Layers
Each layer has a distinct purpose, transforming data from its raw state into a curated, analysis-ready format.
1. Raw Layer
The initial landing zone for all data ingested from source systems.
- Purpose: To create a permanent, immutable archive of source data.
- Key Activities:
- Data is ingested and stored in its original, unaltered format.
- Serves as the definitive source of truth, enabling reprocessing of the entire pipeline if needed.
- No transformations or schema enforcement occur at this stage.
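To make the "immutable archive" idea concrete, here is a minimal Python sketch of landing one source payload. The function and field names (`land_raw_record`, `payload_sha256`, `loaded_at`) are illustrative assumptions, not part of the standard: the point is that the payload is stored exactly as received, with only bookkeeping metadata added around it.

```python
import hashlib
from datetime import datetime, timezone

def land_raw_record(payload: str, source: str) -> dict:
    """Wrap a source payload verbatim with ingestion metadata.

    The payload itself is never parsed or modified here; the raw
    layer only adds bookkeeping fields around it.
    """
    return {
        "source_system": source,
        "loaded_at": datetime.now(timezone.utc).isoformat(),
        # A content hash supports idempotent re-loads: the same payload
        # always produces the same key.
        "payload_sha256": hashlib.sha256(payload.encode()).hexdigest(),
        "payload": payload,  # stored exactly as received
    }
```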
2. Staging Layer
A workspace for initial data preparation and technical validation.
- Purpose: To convert raw data into a structured, technically sound format.
- Key Activities:
- Schema Application: A schema is applied to the raw data.
- Data Typing: Columns are cast to their correct data types (e.g., string to timestamp, integer to decimal).
- Basic Cleansing: Handles technical errors like malformed records and standardizes null values.
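The three staging activities can be sketched in a few lines of Python. This is a hedged example, not our actual implementation: the record shape, the `NULL_TOKENS` set, and the `stage_charge` name are assumptions chosen to show schema application, type casting, and null standardization in one place.

```python
from datetime import datetime
from decimal import Decimal

# Sentinel strings that source systems commonly use for "no value".
NULL_TOKENS = {"", "null", "NULL", "N/A", "n/a", "-"}

def stage_charge(raw: dict) -> dict:
    """Apply a schema, cast types, and standardize nulls for one record."""
    def clean(value):
        # Null standardization: map all sentinel tokens to a real None.
        return None if value is None or str(value).strip() in NULL_TOKENS else value

    amount = clean(raw.get("amount"))
    created = clean(raw.get("created_at"))
    return {
        "charge_id": clean(raw.get("id")),
        # string -> decimal, avoiding float rounding on money
        "amount": Decimal(amount) if amount is not None else None,
        # string -> timestamp
        "created_at": datetime.fromisoformat(created) if created is not None else None,
    }
```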
3. Cleaned Layer
The integrated core of the data platform, designed to create a "single version of the facts."
- Purpose: To integrate data from various sources into a unified, consistent, and historically accurate model.
- Key Activities:
- Business Logic: Complex business rules are applied to conform and validate the data.
- Integration: Data from different sources is combined using business keys.
- Core Modeling: Data is structured into a robust, integrated model (e.g., a Data Vault) that represents core business processes.
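As a small sketch of integration on a business key, the example below merges customer records from two hypothetical sources keyed by email. The matching rule and precedence (CRM supplies the name, billing supplies spend) are assumptions for illustration; real conforming logic lives in the Cleaned-layer models themselves.

```python
def integrate_customers(crm_rows, billing_rows):
    """Combine two sources into one entity keyed by a business key (email).

    Illustrative only: the precedence rules here (CRM name wins, billing
    contributes lifetime spend) are assumptions, not a fixed standard.
    """
    by_key = {}
    for row in crm_rows:
        key = row["email"].lower()  # normalize the business key
        by_key[key] = {"email": key, "name": row["name"], "lifetime_spend": 0}
    for row in billing_rows:
        key = row["email"].lower()
        entity = by_key.setdefault(key, {"email": key, "name": None, "lifetime_spend": 0})
        entity["lifetime_spend"] += row["amount"]
    return list(by_key.values())
```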
4. Serving Layer
The final, presentation-ready layer optimized for analytics, reporting, and business intelligence.
- Purpose: To provide high-performance, easy-to-query data for end-users.
- Key Activities:
- Analytics Modeling: Data from the Cleaned Layer is transformed into user-friendly models, such as Fact and Dimension tables (star schemas).
- Aggregation: Key business metrics and KPIs are pre-calculated to accelerate queries.
- Consumption: This layer feeds dashboards, reports, and analytical tools. It is often loaded into a dedicated Data Warehouse for optimal performance.
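The pre-aggregation step can be sketched as follows: rolling fact rows up to one row per (month, region) so dashboards read a small table instead of scanning every order. The function name and the fact-row shape are assumptions matching the `agg_monthly_revenue_by_region` example used later in this document.

```python
from collections import defaultdict

def agg_monthly_revenue_by_region(fct_orders):
    """Pre-compute revenue per (month, region) so consumers avoid
    scanning the full fact table on every query."""
    totals = defaultdict(float)
    for order in fct_orders:
        month = order["order_date"][:7]  # "YYYY-MM" from an ISO date string
        totals[(month, order["region"])] += order["amount"]
    # Emit one row per grain, sorted for stable output.
    return [
        {"month": m, "region": r, "revenue": v}
        for (m, r), v in sorted(totals.items())
    ]
```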
Model Naming Conventions
A consistent naming convention helps us understand a model's purpose at a glance.
Guiding Principles
- Be Explicit: Names should clearly state the layer, source, and entity.
- Be Consistent: Use the same patterns and abbreviations everywhere.
- Use Prefixes: Start filenames and model names with the layer to group them logically.
Layer-by-Layer Naming Scheme
1. Raw / Sources Layer
This layer is for defining sources, not models. The convention is to name the source after the system it comes from.
- Source Name: `[source_system]` (e.g., `salesforce`, `google_ads`)
- Table Name: `[original_table_name]` (e.g., `account`, `ads_performance`)
2. Staging Layer
Staging models have a 1:1 relationship with a source table.
- Pattern: `stg_[source_system]__[entity_name]`
- Examples: `stg_stripe__charges.sql`, `stg_google_ads__campaigns.sql`
3. Cleaned Layer
This is the integration layer for building unified business entities or a Data Vault.
- Pattern (Integrated Entity):
cln_[entity_name] - Pattern (Data Vault):
cln_[vault_component]_[entity_name] - Examples:
cln_customers.sqlcln_hub_customers.sqlcln_sat_customer_details.sql
4. Serving Layer
This layer contains business-friendly models for consumption.
- Pattern (Dimension): `dim_[entity_name]`
- Pattern (Fact): `fct_[business_process]`
- Pattern (Aggregate): `agg_[aggregation_description]`
- Examples: `dim_customers.sql`, `fct_orders.sql`, `agg_monthly_revenue_by_region.sql`
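The naming patterns above are regular enough to check mechanically. Here is a minimal sketch of a validator; the exact character classes (lowercase snake_case) and the `layer_of` helper name are assumptions layered on top of the documented prefixes, not an official tool.

```python
import re

# One regex per layer, derived from the patterns above. Lowercase
# snake_case names are an assumption of this sketch.
LAYER_PATTERNS = {
    "staging": re.compile(r"^stg_[a-z0-9_]+__[a-z0-9_]+$"),
    "cleaned": re.compile(r"^cln_[a-z0-9_]+$"),
    "serving": re.compile(r"^(dim|fct|agg)_[a-z0-9_]+$"),
}

def layer_of(model_name: str):
    """Return the layer a model name belongs to, or None if it fits no convention."""
    for layer, pattern in LAYER_PATTERNS.items():
        if pattern.match(model_name):
            return layer
    return None
```

A check like this fits naturally into CI, so a misnamed model fails the build before it ever reaches the warehouse.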
Summary Table
| Layer | Purpose | Filename / Model Name Example | Notes |
|---|---|---|---|
| Raw | Source Declaration | `sources.yml` (for `stripe`, `charges`) | No models, just declarations. |
| Staging | Basic Cleansing & Typing | `stg_stripe__charges.sql` | 1:1 with source tables. |
| Cleaned | Integration & Core Models | `cln_customers.sql` or `cln_hub_customers.sql` | Integrates sources. Your Data Vault lives here. |
| Serving | Analytics & BI | `dim_customers.sql` or `fct_orders.sql` | Business-facing, optimized for queries. |