A simple Introduction to Alteryx: The importance of Data Cleaning and Understanding your Data

In this article, I want to convey how important it is that data be cleaned and how any kind of data cleaning comes first from understanding data. In tandem with this, I will be introducing how Alteryx can be used to perform common data cleaning tasks.

To set the stage, we will be looking through some mock data that is meant to simulate some generic employee data.

You are responsible for obtaining, maintaining and analyzing employee data so that your company can hire talented candidates and identify current employees who are struggling. For you to be able to make these kinds of decisions, you need to have clean consolidated data and a robust understanding of what information may be necessary to make informed decisions.  

With that said, let’s take a look at our mock data. The mock data was created manually. This can be done by dragging the Text Input tool into your canvas and typing out the data. In this case, we have two tables of data.


Table 1:


Table 2:


Pay attention to how the tables appear. Are there any potential issues with how the tables are currently formatted?

One of the first things to consider when working with data is whether or not your data is clean. Clean data just means that your data is formatted in a reasonable and consistent sort of way. In other words, it means that if you gave your data to someone else, they would be able to understand what the data is about.

Now, how do you make sure that your data is clean?

Before you can clean your data, it is necessary to intimately understand what your informational needs are. For example, if you are tracking your spending and earning, you can develop a solid idea of your spending habits. You may want to do this because you’re thinking of saving up for a vacation or because you have some loans to pay off. By understanding what your informational needs are, you can ensure that you come up with relevant and useful data that is helpful to you when planning and making decisions.

With that said, let’s take a deeper look at our tables and determine what we actually need. First, let’s make sure that all the data is standard. That means, we need to fix any spelling mistakes, use consistent case sensitivity, and make sure that numbers are in the correct units. Remember, because we used the Text Input Tool to create our tables all of our data is expressed as strings.

For this preliminary cleaning, I used the Data Cleansing Tool and the Formula Tool. The former was used to remove any words, punctuation and change case sensitivity on columns where necessary. It can also be used to handle null or missing values. As for the latter, I used that tool to create expressions that checked if a given record had a certain text match and to change such records where matches occurred. For example, we can check if the “work_time” column equals “600 minutes”, and if so, we can change that to just be “10”, as 10 hours is the same as 600 minutes. Strictly speaking, it is not always ideal to create such specific expressions when doing data cleaning work. It would be better to do data cleaning in a more systematic and automated sort of way, but given the simplicity of the dataset I decided to write custom expressions (if-else statements) as they are simple to understand and implement.

In any case, while we are not completely done with our data cleaning, let’s move on. For now, let’s ignore the “Total_Compensation” column and put aside converting our columns to be in the correct data type. These issues will be addressed later. Just always keep in mind that datatypes matter tremendously.

Table 1:


Table 2:

After our preliminary data cleaning, we need to join our tables together. This is because, as of now, the tables are separate from each other. By joining, we can make a consolidated table which will be ready for future use. There are a few options to join data together via Alteryx. For now, we will use the Join tool.

In order to join two tables together in Alteryx, both tables need to contain one or more columns which contain the same information. That is, joins are done based on primary keys. In our case, both tables contain an “ID” column. This will act as the primary key and allow us to join the two tables together to form a consolidated table.

A critically important consideration to make when joining tables together is how exactly are tables joined. To learn more about joining, please read through this article. In it, the concept of joining is explained succinctly (to be clear, even though that article focuses on a different tool — both Alteryx and Tableau Prep handles joins in nearly the exact same way). That said, we are going to completely join both tables together, which is known as a full join.

Consolidated table:

With that said, our two tables are now one. Let’s take one more look at our table and make sure everything is in order. We can take a holistic look at our data by using either the Select Tool or the Browse Tool. In this case, we’ll just stick with the Select Tool.

The major reason why we’re bringing in this tool is so that we can see what different data types make up our data and make any necessary changes.

This is crucial because what type of data we are working with will influence what sorts of calculations and analyses we are able to conduct. This is especially relevant given our “Total_Compensation” column as it represents numerical data and must comprise prior numerical data.

As we can see in the prior image, the “Total_Compensation” column only has question marks in it. Little analysis can be done with question marks. So, let’s try to think about what that column could mean. To do this, we need to revisit what our informational needs are. Since this data is employee information, we can reasonably conclude that total compensation should involve both an employee’s wage and their bonus. It would be important to track how much each employee earns, especially by department, when coming up with budgetary plans. For example, a company may expand during economic growth but contract during economic decline. By tracking how much each employee earns, and situating that with other factors and considerations, companies can make informed decisions regarding hiring, firing and other compensatory options that employees have access to.

So, how do we calculate total compensation? With the Formula Tool, we can create calculations and place those values into an existing or newly created column. While there are several ways of coming up with total compensation, for the purposes of this article, let’s just add each of the records in the “Wage” and “Bonus” columns.

While this calculation was straightforward, doing calculations with data is an incredibly involved process. It is necessary to have a key understanding of your data otherwise you may generate incorrect or non-useful calculations. More specifically, in our case, we are only able to create this calculation because we had changed the data types of the “Wage” and “Bonus” columns to be numeric instead of as strings.

And with that, we are now finally done with data cleaning. In reality, the process of cleaning data isn’t ever really done. For instance, would you actually be able to set out to meet your objectives given the data as is? Likely not. That has to do with how the dataset is incomplete (in its current form). That is, there are probably several other data elements that are missing, such as performance scores, absences/PTOs, and so on. In line with ascertaining what data is relevant and irrelevant, as you may have noticed, we removed the “Favorite_Color'' and “Least_Favorite_Food” columns. This is because they aren’t remotely useful for the purposes of your task of making informed decisions based on employee information. Determining what data elements are necessary and unnecessary goes back to the wider concept of understanding your informational needs. Ultimately, whatever data we include or exclude from our consolidated datasets, data must be cleaned and prepared properly for the sake of beneficial analysis.

With all of that said, here is a diagram of steps taken in Alteryx to go from the two original, uncleaned tables into the clean and consolidated table. To be clear, there are many different ways to come up with the same results. Alteryx is a versatile tool that can do many important tasks related to data analytics besides data cleaning. I encourage you to take the time to play with Alteryx!

Author:
Lyon Abido
Powered by The Information Lab
1st Floor, 25 Watling Street, London, EC4M 9BR
Subscribe
to our Newsletter
Get the lastest news about The Data School and application tips
Subscribe now
© 2024 The Information Lab