In this blog, I will be working through the Preppin’ Data challenge called The Data Source Bank. You can find it here. I’ll be using SQL! I highly recommend that you give the challenge a shot (whether with SQL, Tableau Prep or Alteryx).
This blog will consist of screenshots of my SQL queries, outputs and descriptions of what those queries are doing.
The primary task for this challenge is to clean the data. Specifically, there are three cleaned outputs which will involve string manipulation, conditional logic and some date functions!
To begin, I will show each of the cleaned outputs and the SQL queries I used to replicate said outputs. Just to set the stage, this is what the input data looks like, which will need to be cleaned.
The first cleaned output looks like this.
As you can see, a new column called Bank is created which is based on the Transaction Code column. In order to create this new column, we need to extract the “DTB”, “DS” and “DSB” characters from the Transaction Code column. There are a handful of ways of doing this.
One approach would be to combine the REPLACE and LEFT functions. The reason for this is two-fold: first, I need a way of getting the characters before the first (-) dash character, which is what the LEFT function does. However, as you can see in the query below I can’t just use this function. The reason for this is that the length of the banks are not the same. In the case of DS, it still has its dash.
This brings me to the second reason which is that we can use the REPLACE function to remove the dash character. The query below shows both functions combined to get the correct output.
With all of that said, it’s important to keep in mind that combining functions may not always be the best way of creating your queries. This is because this can make reading queries more difficult and, by extension, make debugging queries trickier because they are more complex. So, an alternative query to get the aforementioned output would be to use Regular Expressions. For example, the query below would work.
What the query above is doing is capturing the part of the Transaction Code that matches with the pattern of ([A-Z]{2,3}) — This pattern is looking for any combination of capital letters with a length of either 2 or 3. In this case, this perfectly matches up with DTB, DS and DSB. However, it’s important to consider that this query would also capture things like “NO” or “IF” or “XD” or "LOL" if such characters existed in the column. Moreover, if the banks were lowercase, such as "dtb", then the above Regular Expression pattern would not work.
Ultimately, whenever we are doing any kind of string manipulation or pattern matching, we have to be very careful and deliberate about what characters we are trying to capture or change. If not, any changes to the underlying data may cause our queries to output inaccurate results.
Moving on, let’s take a look at the second cleaned output.
In order to replicate this output, we will need to use the query we made to create the Bank column as well as clean the Online or In-Person column and the Transaction Date column. To fix the Online or In-Person column, we can create conditional logic that replaces the text of 1 and 2 based on their respective intended values. In that case, 1 should read as “Online” and 2 should read as “In-Person”. In my case, I used a CASE statement for this but a simple IF statement would also work. The query and output below shows this in action.
Let’s see how we can handle the Transaction Date column. In the input data, this column shows the complete date of a transaction and has a string data type. In the output, we want to just show the name of the day. Because this column is actually a string instead of date, we need to convert the column to a date using the TO_DATE function and then we can modify it to just show us the name of the day using the DAYNAME function. Just a quick note. Unfortunately, there doesn’t seem to be a way to get the complete day name (or month name) in Snowflake.
Something to keep in mind is that in order for this conversion to take place, we need to specify the date format. As you will see in the query, the format is DD/MM/YYYY HH:MI:SS which matches exactly with the original column. If this format is not specified, the query will return an error. The proper output is as follows.
With all of that, let’s look at the final result for the second cleaned output which is based on consolidating all of the previous queries together.
Another note: I wasn’t sure how to match the sorting so the above output differs from the aforementioned second cleaned output image however both outputs contain the same information.
Finally, we can start working on recreating the third cleaned output, pictured below.
Making this output is by far the simplest since all we need to do is use the query that we made in order to create the Bank column and then we need to group the query by Customer Code. In other words, there are no new data cleaning steps that need to be taken. Again, as mentioned previously, my output looks different from the image above. This is because I couldn’t figure out how the image was sorted. Here is my query and its output.
And there we go! We have replicated all of the cleaned outputs. We did this by using string manipulation, conditional logic and date functions. This was an awesome challenge!
If you want to see another example of how SQL can be used to solve Preppin’ Data challenges, please feel free to read through this article! If you get around to solving this challenge, please let me know what SQL functions you used!