In today’s data pipeline, data often travels through different repositories as it is processed for various purposes. This blog explores the primary types of data repositories: databases, data warehouses, and data lakes.
1. What is a Database?
A database is a structured collection of data used for storing, managing, and retrieving information efficiently. Databases are optimized for everyday transactional operations.
Databases are usually defined by two features:
- Databases are often normalized. Normalized data reduces redundancy by organizing data into multiple tables.
- Databases typically support Online Transaction Processing (OLTP) systems. OLTP systems prioritize write operations and handling small, real-time transactions.
|
Customer ID |
Name |
|
1 |
Bob |
|
2 |
John |
|
Order ID |
Sales ($) |
Customer ID |
|
1 |
100 |
1 |
|
2 |
200 |
2 |
In this example, the database has separate tables for customers and orders. With data split into multiple tables this way, it makes it easy to insert, update, and delete information.
We should note that an OLTP system isn’t always normalized, although it often is. OLTP refers to the purpose of the database, while normalization refers to its structure.
Examples:
- MySQL
- PostgreSQL
- Microsoft SQL Server
- Oracle Database
2. What is a Data Warehouse?
A data warehouse is a centralized system designed for reporting and data analysis. It stores historical data from various sources to support business decisions.
Data warehouses are usually defined by two features:
- Data warehouses are often denormalized. Denormalized data reduces redundancy by organizing data into multiple tables.
- Data warehouses typically support Online Transaction Analytical Processing (OLAP) systems. OLTP systems prioritize read operations and enabling complex queries.
Examples:
- Amazon Redshift
- Snowflake
- Google BigQuery
- Microsoft Azure Synapse
3. What is a Data Lake?
A data lake is a large repository that stores raw, unprocessed data in various formats including structured, semi-structured, and unstructured.
Because of the flexibility of data lakes, they are often used as a pre-processing staging layer in modern data pipelines.
Examples:
- Amazon S3
- Azure Data Lake Storage
- Google Cloud Storage
- Hadoop Distributed File System (HDFS)
Summary
|
Feature |
Database |
Data Warehouse |
Data Lake |
|
Purpose |
Day-to-day transactions (OLTP) |
Business analysis & reporting (OLAP) |
Store raw data for analytics & ML |
|
Data Type |
Structured |
Structured & processed |
Structured, semi-structured, unstructured |
|
Data State |
Current, real-time |
Historical, cleaned |
Raw and unprocessed |
|
Users |
Developers, operations teams |
Analysts, BI teams |
Data scientists, engineers |
|
Examples |
MySQL, PostgreSQL |
Snowflake, Redshift |
S3, Azure Data Lake |
