Why do we want to clean our data?
To enable efficient analysis we need to clean our data. With Tableau Prep we can enhance the quality and usability of data before analysis or visualisation. The cleaning steps help address common data challenges, such as inconsistencies, errors, missing values, and formatting issues.
Importance of Cleaning Data:
- Ensuring data quality: Cleaning data helps maintain data integrity and accuracy. It allows you to identify and correct errors, inconsistencies, and outliers, resulting in more reliable and trustworthy analysis.
- Enhancing data usability: By cleaning and standardising data, you can ensure consistency in data formats, naming conventions, and values. This improves data usability and facilitates seamless analysis and visualisation in Tableau.
- Resolving missing values: Handling missing values appropriately, either by removing them or imputing values, helps avoid potential biases and ensures a complete dataset for analysis.
The ways we can clean data:
- Split Fields
- Filter (Remove Rows or Columns)
- Trim Whitespace
- Remove or Edit NULL values
- Change Data Types
- Change Case and Strings to be the same
- Group common values together
- Remove Letters/Numbers
Example
The data below shows an example of splitting fields, renaming and changing the order of the columns. The data is from customer feedback on a product sold online.
The first row of images below shows the original and output data sets after it has been cleaned. On the second row of screenshots from left to right, we have the Changes tab in Tableau Prep and the steps to split the "Customer Name" Column into "First Name" and "Second Name".
- First I selected the column I wanted to split and clicked on the ellipsis, from the drop down I clicked on Split Values then Custom Split. (Step 1 screenshot)
- In the box that appeared I used " " as the separator and chose "All" fields for the split off dropdown as we want both text either side of the " ". (Step 2 screenshot)
- From there I removed the original column as it is no longer needed.
- Finally, renamed the two new columns to "First Name" and "Second Name".
Other steps I can do to this data set, is to trim white space from the Feedback column, exclude the row with a NULL value in the Ratings column and change the NULL values in the feedback column to "No Comment".
The resulting data would then look like the table below.