
Introduction to Modern Data Engineering

The Evolution of Data Transformation

Data engineering has undergone a dramatic transformation in the past decade. Understanding this evolution helps explain why dbt has become so essential.

The Traditional ETL Era (2000-2015)

How It Worked

┌─────────────┐      ┌──────────┐      ┌───────────────┐
│  Source DB  │ ───> │ ETL Tool │ ───> │ Data Warehouse│
└─────────────┘      └──────────┘      └───────────────┘
                     Transform Data
                     (in the tool)

Tools: Informatica, Talend, SSIS, Pentaho

Process:

  1. Extract data from source systems
  2. Transform data in an ETL tool
  3. Load transformed data into warehouse

The Problems

❌ Expensive: Enterprise ETL tools cost hundreds of thousands of dollars
❌ Complex: Point-and-click interfaces for complex logic
❌ Proprietary: Logic locked in vendor tools
❌ Slow: Data processed on ETL servers, not in the warehouse
❌ No Version Control: Hard to track changes
❌ No Testing: Manual QA processes
❌ Black Box: Hard to debug and maintain

The Big Data Revolution (2010-2018)

How It Changed

┌─────────────┐      ┌──────────────┐      ┌──────────┐
│  Source DB  │ ───> │ Hadoop/Spark │ ───> │ Data Lake│
└─────────────┘      └──────────────┘      └──────────┘
                    Complex Code (Java/Scala)

Tools: Hadoop, Spark, Hive

Improvements:

✅ Handle massive data volumes
✅ Process data in parallel
✅ Open source

New Problems:

❌ High Complexity: Requires deep engineering skills
❌ Java/Scala Required: Not accessible to analysts
❌ Infrastructure Heavy: Managing clusters
❌ Still No Testing: Data quality issues persist

The Cloud Warehouse Era (2015-2020)

The Game Changer

Cloud data warehouses like Snowflake, BigQuery, and Redshift changed everything:

✅ Scalable: Handle petabytes of data
✅ Fast: Columnar storage and MPP architecture
✅ Affordable: Pay for what you use
✅ Managed: No infrastructure to maintain
✅ SQL-Based: Analysts can write transformations

This enabled the ELT pattern:

┌─────────────┐      ┌───────────────┐
│  Source DB  │ ───> │ Data Warehouse│
└─────────────┘      └───────────────┘
                      Transform Here!
                         (with SQL)

Extract → Load → Transform

Instead of transforming before loading, load raw data first, then transform it inside the warehouse using its computational power.
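
For example, the EL tool lands raw rows untouched, and the cleanup happens as a SQL statement run by the warehouse itself. A minimal sketch, assuming a hypothetical raw.orders table loaded by an EL tool (all table and column names are illustrative):

-- Transform inside the warehouse: raw data stays exactly as loaded,
-- and the cleaned view is derived from it with plain SQL.
-- raw.orders and its columns are illustrative assumptions.
create or replace view analytics.orders_cleaned as
select
    order_id,
    customer_id,
    cast(order_date as date) as order_date,
    amount_cents / 100.0 as amount_usd
from raw.orders
where status <> 'cancelled';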

The Modern Data Stack (2020-Present)

The Complete Picture

graph TB
    A[SaaS Apps] -->|Fivetran/Airbyte| B[Data Warehouse]
    C[Databases] -->|Fivetran/Airbyte| B
    D[APIs] -->|Custom Scripts| B

    B -->|dbt| E[Transformed Data]

    E --> F[BI Tools]
    E --> G[ML Models]
    E --> H[Reverse ETL]

    I[Version Control] -.-> J[dbt]
    J -.-> K[CI/CD]
    K -.-> E

    style J fill:#ff6b35,stroke:#333,stroke-width:3px

Components:

  1. Extract & Load: Fivetran, Airbyte, Stitch (automated data pipelines)
  2. Data Warehouse: Snowflake, BigQuery, Redshift, Databricks
  3. Transform: dbt (SQL-based transformations)
  4. Orchestrate: Airflow, Dagster, Prefect
  5. Visualize: Looker, Tableau, Metabase
  6. Observe: Monte Carlo, Datafold, Lightup

Why This Stack Wins

✅ Best-in-class tools for each function
✅ Cloud-native and scalable
✅ SQL-first for accessibility
✅ Modular and composable
✅ Open standards and APIs

Enter dbt: The Missing Piece

The Problem dbt Solves

Before dbt, even with cloud warehouses, data teams faced:

Organizational Challenges:

- SQL scripts scattered everywhere
- No clear ownership
- Tribal knowledge
- Difficult onboarding

Technical Challenges:

- No dependency management
- Manual execution order
- No automated testing
- Undocumented data
- No version control

Operational Challenges:

- Broken pipelines
- Data quality issues
- Slow development cycles
- Hard to debug

The dbt Solution

dbt brings software engineering practices to analytics:

# Traditional approach (pseudo-code)
run_script("01_staging.sql")
run_script("02_intermediate.sql")
run_script("03_marts.sql")
# Hope nothing breaks!

# dbt approach
dbt run   # Automatically determines execution order
dbt test  # Validates data quality
dbt docs generate  # Creates documentation
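
What lets dbt determine execution order on its own is the ref() function: every cross-model reference declares a dependency, and dbt builds the resulting graph in the right sequence. A minimal two-model sketch (model and table names are illustrative):

-- models/stg_orders.sql
select * from raw.orders

-- models/fct_customer_orders.sql
-- ref() declares a dependency on stg_orders,
-- so dbt always builds stg_orders first.
select
    customer_id,
    sum(amount) as total_amount
from {{ ref('stg_orders') }}
group by customer_id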

How dbt Works

The Workflow

┌──────────────────────────────────────────────────┐
│  Development (Local or dbt Cloud)                │
│                                                  │
│  1. Write SQL models                             │
│  2. Define tests in YAML                         │
│  3. Add documentation                            │
│  4. Run locally: dbt run / dbt test              │
└──────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────┐
│  Version Control (Git)                           │
│                                                  │
│  1. Commit changes                               │
│  2. Create pull request                          │
│  3. Review with teammates                        │
└──────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────┐
│  CI/CD (GitHub Actions, GitLab, etc.)            │
│                                                  │
│  1. Run dbt tests                                │
│  2. Check for errors                             │
│  3. Merge if passing                             │
└──────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────┐
│  Production (dbt Cloud or Airflow)               │
│                                                  │
│  1. Schedule: dbt run                            │
│  2. Test: dbt test                               │
│  3. Alert if failures                            │
└──────────────────────────────────────────────────┘
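
In the CI/CD stage, "run dbt tests" typically means invoking the same commands shown above against a disposable target. A minimal sketch of a CI job's steps (the ci target name is an assumption about your profiles setup; runner configuration is omitted):

dbt deps               # install package dependencies
dbt build --target ci  # run models and tests together against the CI target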

What Makes dbt Different

Not an ETL Tool:

- Doesn't extract data
- Doesn't load data
- Only transforms data already in the warehouse

It's a Development Framework:

- Compiler (Jinja + SQL → SQL; see the sketch below)
- Orchestrator (manages dependencies)
- Test runner (validates data)
- Documentation generator
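
The compiler point is worth seeing concretely: dbt renders Jinja into plain SQL before anything runs on the warehouse. A sketch of the before and after (the quoting style and the database, schema, and model names depend on your adapter and project config; these are illustrative):

-- What you write (models/customers.sql):
select * from {{ ref('stg_customers') }}

-- What dbt compiles and sends to the warehouse:
select * from "analytics"."dbt_prod"."stg_customers"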

Real-World Examples

Before dbt

Acme Corp's Data Team:

- 50+ SQL scripts in various folders
- Run manually in a specific order
- No tests; issues discovered in dashboards
- A new analyst takes 3 months to onboard
- Pipeline breaks 2-3 times per week

After dbt

Acme Corp's Data Team:

- 200+ dbt models in an organized structure
- dbt run executes everything in the correct order
- 500+ automated tests catch issues early
- A new analyst is productive in 2 weeks
- Pipeline runs reliably every night

The Analytics Engineering Role

dbt has helped establish Analytics Engineering as a distinct role:

Data Analyst              Analytics Engineer         Data Engineer
    │                           │                         │
    ├─ SQL                      ├─ Advanced SQL           ├─ Python/Scala
    ├─ BI Tools                 ├─ dbt                    ├─ Airflow/Spark
    ├─ Excel                    ├─ Git                    ├─ Infrastructure
    ├─ Stakeholder Mgmt         ├─ Data Modeling          ├─ Pipelines
    └─ Ad-hoc Analysis          ├─ Testing                └─ Architecture
                                ├─ Documentation
                                └─ CI/CD

Analytics Engineers:

- Own the transformation layer
- Apply software engineering practices
- Partner with analysts and engineers
- Ensure data quality and reliability

Key Concepts Preview

Before diving in, here are core concepts we'll cover:

1. Models

SQL files that define transformations:

-- models/customers.sql
select * from {{ ref('stg_customers') }}
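
Conceptually, dbt wraps the model's select in the DDL for its materialization and names the relation after the file. A simplified sketch of what it might execute for a view (exact DDL varies by adapter and config; the schema names are illustrative):

-- Roughly what dbt runs for a view materialization (simplified):
create or replace view analytics.customers as
select * from analytics.stg_customers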

2. Sources

Raw data from your warehouse:

sources:
  - name: raw_data
    tables:
      - name: customers
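
Models select from declared sources with the source() function rather than hard-coding table names, which gives dbt lineage from the raw data onward. A minimal sketch:

-- models/stg_customers.sql
-- source('raw_data', 'customers') resolves to the table declared above.
select * from {{ source('raw_data', 'customers') }}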

3. Tests

Data quality checks:

columns:
  - name: customer_id
    tests:
      - unique
      - not_null
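
These generic tests are declared in YAML; dbt also supports singular tests, plain SQL files that fail if they return any rows. A sketch (the file name and the assertion are illustrative):

-- tests/assert_no_duplicate_customers.sql
-- The test fails if this query returns one or more rows.
select
    customer_id,
    count(*) as n
from {{ ref('customers') }}
group by customer_id
having count(*) > 1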

4. Documentation

Metadata about your data:

models:
  - name: customers
    description: One record per customer

5. Materializations

How dbt builds models (table, view, incremental, ephemeral):

{{ config(materialized='table') }}
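
As one illustration of why the choice matters, an incremental model processes only new rows on each run instead of rebuilding the whole table. A sketch assuming the model has an event_time timestamp column (names are illustrative):

{{ config(materialized='incremental') }}

select * from {{ ref('stg_events') }}

{% if is_incremental() %}
-- On incremental runs, only rows newer than what is already built
-- are processed; {{ this }} resolves to the existing target table.
where event_time > (select max(event_time) from {{ this }})
{% endif %}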

The Transformation Landscape

dbt vs Alternatives

Feature          | dbt            | Dataform           | SQLMesh       | Legacy ETL
-----------------|----------------|--------------------|---------------|----------------
Language         | SQL + Jinja    | SQL                | SQL + Python  | Proprietary
Open Source      | ✅ (dbt Core)  | ❌ (Google owned)  | ✅            | ❌
Testing          | Built-in       | Built-in           | Built-in      | Manual
Version Control  | Git            | Git                | Git           | Sometimes
Cloud Offering   | dbt Cloud      | Part of BigQuery   | Coming        | Vendor-specific
Community        | Huge           | Small              | Growing       | Varies
Learning Curve   | Moderate       | Moderate           | Moderate      | Steep

Why dbt Wins:

- Largest community and ecosystem
- Works with all major warehouses
- Rich package ecosystem
- Strong open source + enterprise options
- Industry standard for analytics

What's Next?

Now that you understand the landscape, let's dive into dbt:

  1. What is dbt? - Deep dive into dbt's features
  2. Installation - Get dbt running locally
  3. First Project - Build your first data pipeline
  4. Core Concepts - Master dbt fundamentals

Next: What is dbt? →

Further Reading