An awareness of database structures is important contextual knowledge for data engineering. One of the key principles when thinking about database design is normalization, an approach to organizing data. This blog will introduce normalization and denormalization, discussing their strengths and weaknesses along with an approach to balancing them.
Normalization
Reduce Data Redundancy
The primary motivation for normalizing data is to reduce data redundancy: by splitting data into smaller, more focused tables, each piece of data is stored in only one place, avoiding the same data being stored across multiple tables.
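As a minimal sketch (the tables and columns here are hypothetical), compare a flat orders table that repeats customer details on every row with a normalized design that stores each customer exactly once:

```sql
-- Denormalized: customer details repeated on every order row
CREATE TABLE orders_flat (
    order_id      INT PRIMARY KEY,
    order_date    DATE,
    customer_name TEXT,
    customer_city TEXT   -- duplicated for every order the customer places
);

-- Normalized: each customer stored exactly once, referenced by key
CREATE TABLE customers (
    customer_id   INT PRIMARY KEY,
    customer_name TEXT,
    customer_city TEXT
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    order_date  DATE,
    customer_id INT REFERENCES customers (customer_id)
);
```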
Simplify Data Updates & Protect Data Integrity
Other benefits of this arrangement include protecting data integrity. Pipelines often include update, insert, and deletion processes to keep data current, for example tracking slowly changing dimensions (SCDs) such as a customer address. Without a normalized database structure these processes are often more complicated, and the risk of inconsistencies increases. Storing dimensions in separate tables allows the transactional tables to be updated simply by adding rows for new data, while any changes to the dimension tables over time can be tracked and monitored with a feature like snapshotting in dbt.
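Continuing the hypothetical schema above, when a customer moves address the normalized design needs a single row changed, while the flat design forces an update across every order row, with a real risk of missing some:

```sql
-- Normalized: one row to update, transactional tables untouched
UPDATE customers
SET    customer_city = 'Leeds'
WHERE  customer_id = 42;

-- Flat table: every historical order row for the customer must be
-- rewritten, and missing any of them leaves the data inconsistent
UPDATE orders_flat
SET    customer_city = 'Leeds'
WHERE  customer_name = 'A. Smith';
```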
Structured Database with a Schema to Navigate
The final benefit worth mentioning is that, with the data organized logically and the relationships between tables mapped, navigating the database to identify and query the data you need can be easier. This relies on the documentation and design being of a high standard, but storing data in clearly separated, well-documented areas effectively turns the schema itself into a map of the data.
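As a rough illustration, most relational databases expose these mapped relationships through the standard information_schema views (view and column names vary by engine), so the schema itself can be queried as documentation:

```sql
-- List foreign-key relationships to see how tables connect
-- (information_schema support and column names vary by database)
SELECT tc.table_name,
       kcu.column_name,
       ccu.table_name AS references_table
FROM   information_schema.table_constraints tc
JOIN   information_schema.key_column_usage kcu
       ON tc.constraint_name = kcu.constraint_name
JOIN   information_schema.constraint_column_usage ccu
       ON tc.constraint_name = ccu.constraint_name
WHERE  tc.constraint_type = 'FOREIGN KEY';
```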
OLTP and Normalization
As such, a normalized structure is often used in an Online Transaction Processing (OLTP) system. These systems are transaction oriented (think till data for a store, or logging website traffic): large numbers of transactions need to be recorded in real time, updating or inserting into existing tables. Write speed and storing quality data are the priorities here.
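For example, recording a sale against the hypothetical normalized schema sketched earlier is one small insert, which is what keeps OLTP writes fast:

```sql
-- A new transaction is just one narrow row; the customer
-- dimension is referenced by key, not rewritten
INSERT INTO orders (order_id, order_date, customer_id)
VALUES (1001, CURRENT_DATE, 42);
```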
Denormalization
Having explored the normalized structure, it makes sense to contrast it with the alternative: a denormalized structure.
Reduce Joins, Increase Query Speed
This structure's primary benefit is query speed for analysis and exploration. By trading away the write-speed benefits for updates and inserts, we get a table where the information required for analysis is readily usable.
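A minimal sketch of such a table (again, the names are hypothetical) might flatten orders, customers, and products into one wide row per sale:

```sql
-- One wide row per sale: everything an analyst needs, no joins required
CREATE TABLE sales_reporting (
    order_id      INT,
    order_date    DATE,
    customer_name TEXT,
    customer_city TEXT,
    product_name  TEXT,
    unit_price    NUMERIC,
    quantity      INT,
    line_total    NUMERIC   -- pre-computed at load time
);
```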
Simpler Queries for Analysts
Furthermore, reducing the need for multiple join clauses simplifies queries, making the tables more accessible to analysts with limited SQL knowledge.
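For instance, a revenue-by-city question against the hypothetical sales_reporting table above needs no joins at all:

```sql
-- Denormalized: a join-free query an analyst can write quickly
SELECT customer_city,
       SUM(line_total) AS revenue
FROM   sales_reporting
GROUP  BY customer_city;

-- The normalized equivalent would need something like:
-- orders JOIN customers JOIN order_lines JOIN products ...
```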
Denormalization and OLAP
Given these benefits, denormalization makes sense in an Online Analytical Processing (OLAP) system. This type of system prioritizes search and analysis, which is faster on denormalized tables, and those tables may also include pre-aggregated data. Moreover, OLAP systems typically receive data updates less frequently than OLTP systems, so denormalization's drawbacks around updates are less problematic.
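Pre-aggregation can be sketched with a summary table or materialized view, refreshed on the OLAP system's slower load cadence (the syntax below is Postgres-style and illustrative):

```sql
-- Pre-aggregate daily revenue once at load time so every
-- dashboard query reads a handful of rows instead of scanning sales
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT order_date,
       SUM(line_total) AS revenue
FROM   sales_reporting
GROUP  BY order_date;

-- Re-run after each batch load
REFRESH MATERIALIZED VIEW daily_revenue;
```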
The old adage 'normalize until it hurts, denormalize until it works'
This adage recommends a balanced approach to database design: start by fully normalizing the database to give a clean and consistent foundation with minimal redundancy. From this foundation, observe and monitor database usage; bottlenecks and repeatedly queried information can then be identified, and those aspects selectively denormalized to optimize querying of the database.
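In practice this might mean keeping the normalized tables as the source of truth and materializing only the hot query paths; a hypothetical sketch using the tables from earlier:

```sql
-- Normalized tables remain the source of truth; only the
-- frequently queried slice is denormalized for read speed
CREATE TABLE orders_reporting AS
SELECT o.order_id,
       o.order_date,
       c.customer_name,
       c.customer_city
FROM   orders o
JOIN   customers c
       ON c.customer_id = o.customer_id;
```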