Database vs Data Warehouse vs Data Lake

In today’s data pipeline, data often travels through different repositories as it is processed for various purposes. This blog explores the primary types of data repositories: databases, data warehouses, and data lakes.

1. What is a Database?

A database is a structured collection of data used for storing, managing, and retrieving information efficiently. Databases are optimized for everyday transactional operations.

Databases are usually defined by two features:

Databases are often normalized. Normalized data reduces redundancy by organizing data into multiple tables.
Databases typically support Online Transaction Processing (OLTP) systems. OLTP systems prioritize write operations and handling small, real-time transactions.

Customer ID	Name
1	Bob
2	John

Order ID	Sales ($)	Customer ID
1	100	1
2	200	2

In this example, the database has separate tables for customers and orders. With data split into multiple tables this way, it makes it easy to insert, update, and delete information.

We should note that an OLTP system isn’t always normalized, although it often is. OLTP refers to the purpose of the database, while normalization refers to its structure.

Examples:
- MySQL
- PostgreSQL
- Microsoft SQL Server
- Oracle Database

2. What is a Data Warehouse?

A data warehouse is a centralized system designed for reporting and data analysis. It stores historical data from various sources to support business decisions.

Data warehouses are usually defined by two features:

Data warehouses are often denormalized. Denormalized data reduces redundancy by organizing data into multiple tables.
Data warehouses typically support Online Transaction Analytical Processing (OLAP) systems. OLTP systems prioritize read operations and enabling complex queries.

Examples:
- Amazon Redshift
- Snowflake
- Google BigQuery
- Microsoft Azure Synapse

3. What is a Data Lake?

A data lake is a large repository that stores raw, unprocessed data in various formats including structured, semi-structured, and unstructured.

Because of the flexibility of data lakes, they are often used as a pre-processing staging layer in modern data pipelines.

Examples:
- Amazon S3
- Azure Data Lake Storage
- Google Cloud Storage
- Hadoop Distributed File System (HDFS)

Summary

Feature	Database	Data Warehouse	Data Lake
Purpose	Day-to-day transactions (OLTP)	Business analysis & reporting (OLAP)	Store raw data for analytics & ML
Data Type	Structured	Structured & processed	Structured, semi-structured, unstructured
Data State	Current, real-time	Historical, cleaned	Raw and unprocessed
Users	Developers, operations teams	Analysts, BI teams	Data scientists, engineers
Examples	MySQL, PostgreSQL	Snowflake, Redshift	S3, Azure Data Lake

Author:

Charles Yi

View Profile