Scraping a Web Page Part 1: Inspecting the HTML

by Conrad Wilson

Intro to the Blog Series

A lot of information is stored on a web page. If there is a pinned map, you can obtain spatial data. If there is an image, you can obtain its URL and download it. If there is a table, you can extract the data stored within it. Web scraping allows us to obtain this data. In this blog series, Ellen and I will show you how to use Alteryx to scrape data from the web. We will use a project we completed at the Data School, which looked at the demographics of the areas surrounding Connells branches. The final visualisation can be found here.
This blog will focus on finding geographical data within the HTML of a website.

Inspecting the HTML of a Website

 

First, load the web page you want to scrape. Right-click on the page and select Inspect. This opens an inspection pane showing the HTML that makes up the page.
Figure 1. The HTML code of the website
Select the tool at the top left of the pane to highlight the code responsible for each part of the web page.
Figure 2. How to inspect elements
This allows you to see the code for any element of the web page you hover over, as seen in Figure 3. Here the map is highlighted, and the corresponding HTML is highlighted in the inspection pane.
Figure 3. Inspecting Elements of the Web page
However, if we are looking for spatial data, we can simply use the search bar at the bottom of the inspection pane. Searching for "latitude" returns the point in the HTML where the spatial data is stored, as shown in Figure 4.
Figure 4. Searching the HTML
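The same kind of search can be sketched outside the browser. The Python snippet below is only an illustration and is not part of the Alteryx workflow: it scans a string of HTML for the word "latitude" and returns a little surrounding context for each hit. The sample HTML, its attribute names, and the helper find_keyword are all made up for the example; a real page will likely store its coordinates differently.

```python
import re

# Hypothetical snippet standing in for a page's HTML; real pages will differ.
SAMPLE_HTML = """
<div class="branch" data-latitude="51.5074" data-longitude="-0.1278">Branch A</div>
<div class="branch" data-latitude="52.4862" data-longitude="-1.8904">Branch B</div>
"""


def find_keyword(html, keyword="latitude", context=20):
    """Return each occurrence of keyword with `context` characters either side."""
    hits = []
    for match in re.finditer(re.escape(keyword), html, flags=re.IGNORECASE):
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(html))
        hits.append(html[start:end].strip())
    return hits


hits = find_keyword(SAMPLE_HTML)
for hit in hits:
    print(hit)
```

Each printed line shows where the keyword sits in the markup, which is the same information the search bar gives you in the inspection pane.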
Check whether all of the data points you wish to scrape have the same format. If they do, we’re ready to use Alteryx to scrape this data from the website. Check out the next part of the blog, written by fellow DSX’er Ellen, to find out how!
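For a quick way to run the format check described above, here is a minimal Python sketch. It assumes the coordinates appear as data-latitude and data-longitude attributes, which is an invented format for illustration only; the fragments list, PATTERN, and split_by_format are hypothetical names, and the pattern would need adjusting to whatever you actually see in the inspection pane.

```python
import re

# Hypothetical scraped fragments, one per data point; real pages will differ.
fragments = [
    'data-latitude="51.5074" data-longitude="-0.1278"',
    'data-latitude="52.4862" data-longitude="-1.8904"',
    'lat: 53.4808, lng: -2.2426',  # deliberately in a different format
]

# Assumed format: attribute-style latitude and longitude with numeric values.
PATTERN = re.compile(
    r'data-latitude="(-?\d+(?:\.\d+)?)"\s+data-longitude="(-?\d+(?:\.\d+)?)"'
)


def split_by_format(items):
    """Separate fragments that match PATTERN from those that do not."""
    matching = [item for item in items if PATTERN.search(item)]
    odd_ones = [item for item in items if not PATTERN.search(item)]
    return matching, odd_ones


matching, odd_ones = split_by_format(fragments)
print(f"{len(matching)} match, {len(odd_ones)} need a closer look")
```

Any fragment that lands in odd_ones is a data point whose format differs from the rest and would need handling separately before scraping.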