Web Scrapping with Alteryx

When dealing with dynamic webpages which have JavaScript or Ajax request, doing web scrapping with Alteryx become a bit more tricky. But if we want to just scrap a static page from a website, it can be easily done with the RegEx tool in Alteryx.

Before I delve into the tool itself, let's go through some basics of Regular Expressions. (If you are already familiar with RegEx and just want to know how to build the workflow in Alteryx, go straight to Part 2)

PART 1: What are Regular Expressions?

Commonly referred to as RegEx (pronounced: Rej-Ex), are a sequence of encoded characters used as patterns for matching set of strings. It helps us to find, match and manage any string data. The fields of application range from validation to parsing/replacing strings or in this blog, web scraping.

In regular search toolbar, let's say a web browser, when we want to find a particular word - "everything", we can do CRTL + F and then type that word to search for its occurrence within the webpage. That is a very specific search, but what if we want to retrieve any word that begins with "every", so words like everyone, everybody, etc will be considered as a match?

That's when Regular Expressions come into the picture: When we want to be vague enough to capture everything we want, and in the meantime, when we want to be specific enough to capture only certain things we want.

So we use a set of encoded characters to represent a group of characters: digits, letter from a to z, etc. And then the quantifiers to define how many characters we are expecting.

For example (never mind the logic of the sentence), we want to match "everyone" and "everything" but not "every day", "every hour" and "every second", we need to find a pattern that match what we want but mismatch what we don't want. "\w" represent any alphanumeric and the characters: & and _. Next, we need to set how many times we are expecting the character to appear. "one" has three characters and "thing" have 5 characters. If we specify 3 characters after "every", we get two matches: "everyone" and "everythi"

We can use an OR operator to specify one logic for everyone and one logic for everything but this logic won't apply if we have an "everybody" in the sentence so we want to be vague enough to cover all other words that start with "every-"

That's when we want to use a quantifier instead of specifying the exact number of character we are anticipating.

The following is a snapshot of all the basic regular expressions:

Back to the main topic: Web Scrapping using RegEx tool in Alteryx. In this example I will be scraping the Alteryx Weekly challenge page to retrieve the information about the weekly challenges, specifically I will scrap the information in this table:

PART 2: Web Scrapping in Alteryx

Understand the html structure of the website

Using google chrome, on the webpage, right click and select inspect, the DevTools page will appear on the right. We can inspect any element of the webpage using the tool highlighted in red. All texts appear on the website will be embedded within these HTML tags. We will then need to parse these relevant tags in Alteryx to retrieve the info we want.

Alternatively, you can hover over the html script, the corresponding segment in the webpage will be then highlighted as in the following snapshot.

In this case, all the information I want is in <TR></TR> tag.

2. Building a workflow in Alteryx

Firstly, we need to put the URL of the website to the text input tool and then connect it to the download tool and use the default setting.

Click Run and make sure your DownloadHeaders show "HTTP/1.1 200 OK", the HTML code is download and loaded to the DownloadData field.

Then, we connect the RegEx tool after the download tool, choose the DownloadData to parse the HTML code that we download from previous step and write a regular expression to match what we want to keep. As we want to extract every TR tag, the regular expression we need to write is "<TR.*?</TR>

Let's decipher what .*? means:
. is a wildcard that matches any character except the line terminator
*? matches the previous character (in this case a wildcard) between zero and unlimited times but as few times as possible because of the use of the lazy quantifier, ?
So .*? will just give us everything in between <TR and </TR>

In this case we want to split all the HTML code with the TR tags into multiple rows so we will use the tokenize output method and select split to rows option.

Next, we filter out the rows that doesn't contain the website URL (there are some other TR tag that doesn't belong to the tag we want to scrap). Then we can use the RegEx tool to parse out the info we need into to columns which can be done in one go. It looks intimidating at first glance if you are not familiar with RegEx but the logic is pretty simple, so bear with me for awhile.

The Unprocessed HTML code in DownloadData looks like this:
<TR bgcolor="#FFFFFF"><TD>1</TD><TD><A href="https://community.alteryx.com/t5/Weekly-Challenge/Challenge-1-Join-to-Range/td-p/36621" target="_blank">Join to Range</A></TD><TD>Intermediate</TD><TD>Parse,Preparation,Join,Data Preparation,Transform</TD></TR>

Regular Expression used to grab the corresponding info we want (highlighted in bold):
<TD>(\d+)</TD><A href="(.*?)"\starget"_blank">(.*?)</A></TD><TD>(.*?)</TD><TD>(.*?)</TD><TR>

We need to wrap the regular expression in () brackets and each of them will return a column of strings that match the defined expression, in this case we get any characters in between those tags with .*? e.g. "Join to Range" and a single digits or more with \d+ e.g. "1"
NOTE: We will get 5 columns from this parse as we have 5 sets of () bracket.

After this step, we basically have the exact same table as the one presented on the website. It involved only 5 steps with 4 different tools in Alteryx! Pretty neat isn't it? :)

I carry on to add few more tools to split the concatenated Main Subject into rows so that I can easily plot something in Tableau using the table. The following is my full workflow and a cleaned table scrapped from Alteryx Weekly Challenge Webpage.

Then I make a simple Viz in Tableau which allows me to easily filter the challenges according to the topic!

Author:

Jia Yan Ng

View Profile