# Data Engineering Pipeline Layers & Naming Conventions

This document outlines the standard layered architecture and model naming conventions for our data platform. Adhering to these standards is crucial for maintaining a clean, scalable, and understandable project.

---

## Data Pipeline Layers

Each layer has a distinct purpose, transforming data from its raw state into a curated, analysis-ready format.

### 1. Raw Layer

The initial landing zone for all data ingested from source systems.

* **Purpose:** To create a permanent, immutable archive of source data.
* **Key Activities:**
    * Data is ingested and stored in its original, unaltered format.
    * The archive serves as the definitive source of truth, so the entire pipeline can be reprocessed if needed.
    * No transformations or schema enforcement occur at this stage.

### 2. Staging Layer

A workspace for initial data preparation and technical validation.

* **Purpose:** To convert raw data into a structured, technically sound format.
* **Key Activities:**
    * **Schema Application:** A schema is applied to the raw data.
    * **Data Typing:** Columns are cast to their correct data types (e.g., string to timestamp, integer to decimal).
    * **Basic Cleansing:** Technical errors such as malformed records are handled, and null values are standardized.

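
The staging activities above can be sketched in Python. This is only an illustration, not how the platform implements staging: the field names (`id`, `created`, `amount`, `currency`), the list of null placeholders, and the assumption that `amount` arrives as an integer number of cents are all hypothetical.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical placeholder strings that should be treated as real nulls.
NULL_TOKENS = {"", "null", "n/a", "none"}


def standardize_null(value):
    """Map placeholder strings such as '' or 'N/A' to None."""
    if isinstance(value, str) and value.strip().lower() in NULL_TOKENS:
        return None
    return value


def stage_charge(raw):
    """Apply a schema to one raw record: select columns, standardize nulls, cast types."""
    cleaned = {key: standardize_null(value) for key, value in raw.items()}
    created = cleaned.get("created")
    amount = cleaned.get("amount")
    return {
        "charge_id": cleaned.get("id"),
        # String to timestamp.
        "created_at": datetime.fromisoformat(created) if created is not None else None,
        # Integer (assumed cents) to decimal.
        "amount": Decimal(str(amount)) / 100 if amount is not None else None,
        "currency": cleaned.get("currency"),
    }


def stage_charges(raw_records):
    """Stage all records, setting aside malformed ones instead of failing the batch."""
    staged, rejected = [], []
    for raw in raw_records:
        try:
            staged.append(stage_charge(raw))
        except (ValueError, TypeError, InvalidOperation):
            rejected.append(raw)
    return staged, rejected
```

Keeping rejected records separate, rather than dropping them silently, is one possible design choice; it leaves the malformed input available for inspection.
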
### 3. Cleaned Layer

The integrated core of the data platform, designed to create a "single version of the facts."

* **Purpose:** To integrate data from various sources into a unified, consistent, and historically accurate model.
* **Key Activities:**
    * **Business Logic:** Complex business rules are applied to conform and validate the data.
    * **Integration:** Data from different sources is combined using business keys.
    * **Core Modeling:** Data is structured into a robust, integrated model (e.g., a Data Vault) that represents core business processes.

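
As a rough sketch of integration on a business key, the Python below merges customer records from two sources. Everything specific here is an assumption made for illustration: the source names (`crm`, `billing`), the choice of `email` as the business key, and the rule that earlier sources win when attributes conflict.

```python
def integrate_customers(crm_rows, billing_rows):
    """Merge customer records from two sources on a shared business key (email)."""
    customers = {}
    for source_name, rows in (("crm", crm_rows), ("billing", billing_rows)):
        for row in rows:
            # Normalize the business key so the same customer matches across sources.
            key = row["email"].strip().lower()
            entity = customers.setdefault(key, {"customer_key": key, "sources": []})
            entity["sources"].append(source_name)
            # Assumed rule: earlier sources win; later ones only fill missing attributes.
            for column, value in row.items():
                if column != "email" and value is not None:
                    entity.setdefault(column, value)
    return customers


crm = [{"email": "Ada@Example.com ", "name": "Ada"}]
billing = [{"email": "ada@example.com", "name": "A. Lovelace", "plan": "pro"}]
unified = integrate_customers(crm, billing)
```

Both input rows collapse into a single entity keyed by the normalized email, with `name` taken from the first source and `plan` filled in from the second.
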
### 4. Serving Layer

The final, presentation-ready layer optimized for analytics, reporting, and business intelligence.

* **Purpose:** To provide high-performance, easy-to-query data for end-users.
* **Key Activities:**
    * **Analytics Modeling:** Data from the Cleaned Layer is transformed into user-friendly models, such as **Fact and Dimension tables** (star schemas).
    * **Aggregation:** Key business metrics and KPIs are pre-calculated to accelerate queries.
    * **Consumption:** This layer feeds dashboards, reports, and analytical tools. It is often loaded into a dedicated Data Warehouse for optimal performance.

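
The aggregation step can be illustrated with a small Python sketch that pre-calculates monthly revenue per region from fact rows. The row shape (`order_date` as an ISO `YYYY-MM-DD` string, `region`, `amount` as a `Decimal`) is an assumption for this example, not the platform's actual fact table.

```python
from collections import defaultdict
from decimal import Decimal


def monthly_revenue_by_region(fct_orders):
    """Pre-aggregate order amounts per (month, region)."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for order in fct_orders:
        month = order["order_date"][:7]  # 'YYYY-MM' prefix of the ISO date
        totals[(month, order["region"])] += order["amount"]
    return [
        {"month": month, "region": region, "revenue": revenue}
        for (month, region), revenue in sorted(totals.items())
    ]


orders = [
    {"order_date": "2024-01-05", "region": "EU", "amount": Decimal("10.00")},
    {"order_date": "2024-01-20", "region": "EU", "amount": Decimal("5.50")},
    {"order_date": "2024-02-01", "region": "US", "amount": Decimal("7.00")},
]
aggregate_rows = monthly_revenue_by_region(orders)
```
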
---

## Model Naming Conventions

A consistent naming convention helps us understand a model's purpose at a glance.

### Guiding Principles

1. **Be Explicit:** Names should clearly state the layer and entity, plus the source system where the model is source-specific (staging).
2. **Be Consistent:** Use the same patterns and abbreviations everywhere.
3. **Use Prefixes:** Start filenames and model names with a layer prefix (or, in the Serving Layer, a model-type prefix such as `dim_` or `fct_`) to group them logically.

### Layer-by-Layer Naming Scheme

#### 1. Raw / Sources Layer
This layer is for defining sources, not models. The convention is to name the source after the system it comes from.
* **Source Name:** `[source_system]` (e.g., `salesforce`, `google_ads`)
* **Table Name:** `[original_table_name]` (e.g., `account`, `ads_performance`)

#### 2. Staging Layer
Staging models have a 1:1 relationship with a source table.
* **Pattern:** `stg_[source_system]__[entity_name]` (the double underscore separates the source system from the entity, so multi-word source names such as `google_ads` stay unambiguous)
* **Examples:**
    * `stg_stripe__charges.sql`
    * `stg_google_ads__campaigns.sql`

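
A small Python helper can build and split staging names according to the pattern above. The function names are hypothetical, and the snake_case rule for each part (lowercase letters and digits joined by single underscores) is an assumption inferred from the examples rather than a stated requirement.

```python
import re

# Assumed rule: each part is snake_case with single underscores only,
# so "__" can only ever appear as the source/entity separator.
_SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def staging_model_name(source_system, entity_name):
    """Build a model name following stg_[source_system]__[entity_name]."""
    for part in (source_system, entity_name):
        if not _SNAKE_CASE.fullmatch(part):
            raise ValueError(f"not a snake_case identifier: {part!r}")
    return f"stg_{source_system}__{entity_name}"


def split_staging_model_name(model_name):
    """Recover (source_system, entity_name) from a staging model name."""
    if not model_name.startswith("stg_") or "__" not in model_name:
        raise ValueError(f"not a staging model name: {model_name!r}")
    source_system, entity_name = model_name[len("stg_"):].split("__", 1)
    return source_system, entity_name
```

For example, `staging_model_name("google_ads", "campaigns")` yields `stg_google_ads__campaigns`, and splitting that name gives back the two parts unchanged.
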
#### 3. Cleaned Layer
This is the integration layer for building unified business entities or a Data Vault.
* **Pattern (Integrated Entity):** `cln_[entity_name]`
* **Pattern (Data Vault):** `cln_[vault_component]_[entity_name]`, where `[vault_component]` identifies the vault object type (e.g., `hub`, `sat`)
* **Examples:**
    * `cln_customers.sql`
    * `cln_hub_customers.sql`
    * `cln_sat_customer_details.sql`

#### 4. Serving Layer
This layer contains business-friendly models for consumption.
* **Pattern (Dimension):** `dim_[entity_name]`
* **Pattern (Fact):** `fct_[business_process]`
* **Pattern (Aggregate):** `agg_[aggregation_description]`
* **Examples:**
    * `dim_customers.sql`
    * `fct_orders.sql`
    * `agg_monthly_revenue_by_region.sql`

### Summary Table

| Layer   | Purpose                   | Filename / Model Name Example                  | Notes                                           |
| :------ | :------------------------ | :--------------------------------------------- | :---------------------------------------------- |
| Raw     | Source Declaration        | `sources.yml` (for `stripe`, `charges`)        | No models, just declarations.                   |
| Staging | Basic Cleansing & Typing  | `stg_stripe__charges.sql`                      | 1:1 with source tables.                         |
| Cleaned | Integration & Core Models | `cln_customers.sql` or `cln_hub_customers.sql` | Integrates sources. Your Data Vault lives here. |
| Serving | Analytics & BI            | `dim_customers.sql` or `fct_orders.sql`        | Business-facing, optimized for queries.         |

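
The prefixes in the table lend themselves to an automated check. The Python sketch below classifies a model filename by layer using regular expressions derived from the patterns in this document; the function name and the strict snake_case assumption are illustrative, and a real project might enforce the convention differently.

```python
import re

# One snake_case segment sequence: lowercase letters/digits joined by single underscores.
_WORDS = r"[a-z0-9]+(?:_[a-z0-9]+)*"

# Patterns mirror the naming scheme: stg_<source>__<entity>, cln_<...>, dim_/fct_/agg_<...>.
LAYER_PATTERNS = [
    ("staging", re.compile(rf"^stg_{_WORDS}__{_WORDS}$")),
    ("cleaned", re.compile(rf"^cln_{_WORDS}$")),
    ("serving", re.compile(rf"^(?:dim|fct|agg)_{_WORDS}$")),
]


def classify_model(filename):
    """Return the layer a model file belongs to, or None if it breaks the convention."""
    model_name = filename[:-4] if filename.endswith(".sql") else filename
    for layer, pattern in LAYER_PATTERNS:
        if pattern.match(model_name):
            return layer
    return None
```

A check like this could run in CI over the models directory and flag any file for which `classify_model` returns `None`.
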