On the second day of the dashboard week, we were assigned to work with cheese on https://www.cheese.com/. It means that we have to do web-scraping to extract the data from that web page. This is an interesting assignment because I really love using web-scraping and API to extract the data. In this blog, I am going to share what I have done on this assignment. The process is mostly the same as what I did on the dashboard week's first day. It includes:
1/ Do web-scraping data in Alteryx
2/ Clean/ prepare data in Alteryx
3/ Sketch a draft on Excalidraw
4/ Build visualization on Tableau Desktop
5/ Test and publish the dashboard on Tableau Public
1/ Web-scraping in Alteryx
Web scraping and API is one of my favorite topics. During training at The Data School, Valerija taught us how to do web scraping to get the data from the web page. We used to extract the data from the web page and then build the dashboard from that data. The assignment for the second day of the dashboard week is a similar one but a different page.
When I scrolled down to the end of the page, I recognized that there are 1871 kinds of cheese. They are displayed in 4 columns and 5 rows (20 types of cheese for each page). There are 94 total pages. Therefore, my goal was to get information on all 1871 different cheeses in Alteryx.
a) Getting the name of all cheeses in 94 pages
Image 3 is the workflow in which I got the names of 1871 different cheeses in 94 pages. I already knew there are 94 pages, so I created a page field from 1 to 94 and then assigned it to the original URL (www.cheese.com) as a parameter. Then, I downloaded the data and prepared the data to get only the name of each cheese.
b) Download data from each cheese
After I got the name of each cheese, I created a new URL by using the original URL with the name of each cheese. For example: for the data about Gouda cheese, I need to extract the data from www.cheese.com/gouda. Now, I got all the fields from the page.
2/ Clean/ Prepare data in Alteryx
After I got the data from web-scraping the page, I needed to clean and prepare the data to keep only the necessary fields.
For example, image 5 shows the data on Pecorino Romano cheese. The fields include: made from, Country of origin, region, family, type,... However, I only keep some necessary fields: made from which milk, country, type, texture, rind, color, flavour, aroma, and vegetarian.
This is a difficult challenge because there are some hyperlinks in Image 5 and each dimension could have multiple values.
For example: In image 6, for one sentence there are many hyperlinks for cow, goat, and sheep milk. Therefore, my goal is to get only the word without a hyperlink tag (<a href=>).
To do that, I used many RegEx tools to extract the data from the paragraph tag <p> and remove the hyperlink tag. In the result window, each cheese has multiple values and is separated by a comma. (Image 7)
I also did the same for other dimensions for each cheese. For the country of origin field, one cheese could come from many countries. Therefore, I also need to use RegEx tools to clean and get rid of the hyperlink tag. Finally, I use the Union tool and Cross Tab tool to combine all data in each branch together and transform the data. (Image 9)
After I got the prepared data, it was time to sketch my idea on Excalidraw.
3/ Sketch a draft on Excalidraw
After I cleaned and prepared the data in Alteryx, I got all the fields that I needed, including Cheese Name, Countries, Region, Flavor, Color, Made from kind of milk, Aroma, Type (soft, hard,...), Texture, .... I was planning to tell a story to let the user explore each type of cheese.
At the beginning of the sketch (Image 10), I show how many kinds of cheese are in the dataset, how many cheeses are vegetarian, the number of flavors, and the total countries the cheeses are made from. Then show the top countries that most cheeses come from. In the next section, I show what kind of milk cheeses are made from by showing the number of cheeses for each type. If the user clicks on one of the BANs, it will filter the rest of the dashboard.
In the next section, I show the top type (soft, semi-soft, hard, semi-hard, ...) and texture (smooth, creamy, ...) for a specific milk cheese that the user selected above. Then, show the detailed table including cheese name, flavor, type, texture, producers, and country,... (Image 12)
4/ Build Visualization in Tableau Desktop
After I got the idea from the sketch, now I can build the visualization and dashboard in Tableau Desktop.
From the top of the dashboard, I show some information about the cheese dataset. Then compare the number of cheeses by country. (Image 13)
In the next section, I compare different kinds of milk that cheeses are made from. Most cheeses are made from Cow's milk. However, some cheeses are a combination of many kinds of milk. Below the BANs is the symbol of each animal. The user can click on each symbol to filter the next sections.
After choosing the milk type, in the next section, I can compare the type and texture of that milk. The user can click on the bar of the type and texture to filter the detailed table.
In the last section, the detailed table will show all the information about the cheese including Cheese Name, Flavour, Aroma, Type, Texture, Colour, Producers, and Country, and distinguish whether cheese for vegetarian or not. There is a button on the top right of this section (Image 16). The user can choose the aroma, color, and country from that settings pane.
5/ Test and publish the dashboard on Tableau Public
Before publishing the dashboard on Tableau Public, I have to check the formatting of the dashboard to make sure it's consistent. Then, I also check the interactivity of each visualization. If I apply the filter action in visualization, I need to check the interactivity works for that viz. If everything is ok, I will post the dashboard on Tableau Public.
My Tableau Public Dashboard link is here: https://public.tableau.com/views/cheese_dashboard_DB2/Cheese_DB2?:language=en-US&publish=yes&:display_count=n&:origin=viz_share_link
Thank you for reading my blog about what I have done on the second day of the dashboard week. I hope to meet you again in the next blog.