In this blog post I will go over the basics of data engineering and why each step is done.
TL;DR: Data engineering is the process of moving, cleaning, and organizing data so that it's ready for people to use.
How does data engineering work?
Think of data engineering like baking a cake. The data starts out in its raw form, messy and unorganized, much like the ingredients you start with when baking.
The data then has to be processed by a data pipeline. This is where we use tools like SQL or Python to begin cleaning the data.
You could think of the pipeline as the kitchen where all your ingredients are mixed together and baked.
The transformations need to be done in a certain order to achieve the results you're looking for, so they need to be planned out in advance, much like a recipe you might follow.
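To make the "recipe" idea concrete, here is a minimal sketch of a pipeline in plain Python. The sample rows and the three step functions (`strip_and_lowercase`, `drop_invalid_amounts`, `remove_duplicates`) are made up for illustration, and a real pipeline would more likely use SQL or a library such as pandas, but the idea of running steps in a planned order is the same.

```python
# Hypothetical raw input: messy names, one unusable amount, one hidden duplicate.
raw_rows = [
    {"name": "  Alice ", "amount": "10.50"},
    {"name": "BOB", "amount": "n/a"},
    {"name": "alice", "amount": "10.50"},
    {"name": "Carol", "amount": "7"},
]


def strip_and_lowercase(rows):
    """Normalize text so that '  Alice ' and 'alice' look the same."""
    return [
        {"name": r["name"].strip().lower(), "amount": r["amount"].strip()}
        for r in rows
    ]


def drop_invalid_amounts(rows):
    """Keep only rows whose amount can be read as a number."""
    valid = []
    for r in rows:
        try:
            float(r["amount"])
        except ValueError:
            continue
        valid.append(r)
    return valid


def remove_duplicates(rows):
    """Drop exact repeats, keeping the first occurrence."""
    seen = set()
    unique = []
    for r in rows:
        key = (r["name"], r["amount"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


# The "recipe": the order of the steps is part of the design.
PIPELINE = [strip_and_lowercase, drop_invalid_amounts, remove_duplicates]


def run_pipeline(rows, steps):
    """Apply each step in order, feeding its output into the next step."""
    for step in steps:
        rows = step(rows)
    return rows


clean_rows = run_pipeline(raw_rows, PIPELINE)
print(clean_rows)
```

The order matters here: if `remove_duplicates` ran before `strip_and_lowercase`, the two Alice rows would not look identical yet, so the duplicate would slip through.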
There are a variety of tools you can use to extract and clean the data. These include:
Programming Languages: Python, SQL, Java, Scala.
ETL Tools: Apache NiFi, Apache Airflow, Talend, Informatica.
Big Data Frameworks: Apache Hadoop, Apache Spark, Apache Flink.
Databases:
- Relational Databases: MySQL, PostgreSQL, Oracle, SQL Server.
- NoSQL Databases: MongoDB, Cassandra, Redis.
- Data Warehouses: Snowflake, BigQuery, Redshift.
Cloud Platforms: AWS (S3, Redshift, Glue), Google Cloud (BigQuery, Dataflow), Azure (Data Lake, Synapse).
Stream Processing: Apache Kafka, RabbitMQ, AWS Kinesis.
You can think of these tools as the oven, taking your ingredients and turning them into a cake.
After processing, the data needs to be stored somewhere so that people can access and use it. This will usually be in one of these forms:
Data Lakes: Store vast amounts of raw data in its original format, much of it unstructured.
Data Warehouses: Store structured, processed data optimized for querying.
Data Lakehouses: Combine the flexibility of data lakes with the structure of data warehouses (e.g., Delta Lake).
Storing the data is the equivalent of putting the cake in a box for people to eat later.
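As a rough illustration of the difference between raw and structured storage, here is a small sketch using only Python's standard library. The events and the `orders` table are invented for the example, and an in-memory SQLite database stands in for a real data warehouse. In practice the raw copy would usually sit in something like S3 and the structured copy in something like Snowflake or BigQuery.

```python
import json
import sqlite3

# "Data lake" side: hypothetical raw events kept exactly as they arrived.
# Note that the records do not all have the same fields.
raw_events = [
    '{"customer": "alice", "amount": 10.5, "note": "first order"}',
    '{"customer": "carol", "amount": 7}',
    '{"customer": "alice", "amount": 4.5}',
]

# "Data warehouse" side: a table with a fixed structure, ready for querying.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT NOT NULL, amount REAL NOT NULL)")

# Parse each raw event and load only the fields the table expects.
parsed = [json.loads(event) for event in raw_events]
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [(p["customer"], p["amount"]) for p in parsed],
)

# Because the data is now structured, a question like
# "how much did each customer spend?" is a single query.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(totals)
```

The raw events keep everything, including the optional `note` field, while the table keeps only the agreed columns. That trade-off between flexibility and easy querying is the gap a lakehouse tries to close.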