Webscraping made easy?

Let's start by understanding what webscraping is and when you are actually allowed to do it. Simply put, scraping a web page involves fetching the page and extracting data from it. However, it can be illegal. So when should you do it? There is no specific law against scraping or against using public information obtained through scraping. However, if scraping the information violates the intellectual property rights of the website owner or breaches the website's terms and conditions of use, the owner may have a claim against the user.

So my first recommendation in this blog is to make sure you read the website's copyright notice and terms of use, which can sit on the FAQ page, the About Us page or at the bottom of the homepage.

  1. Start by opening Alteryx and entering the URL in a Text Input tool.
  2. Put a Download tool after it, along with a Select tool, then right-click the Select tool to cache and run. This way you don't keep downloading and scraping again and again - you could run into trouble!
  3. You will end up with 3 fields and 1 record - URL, DownloadData, DownloadHeader.
  4. The downloaded data will usually be in HTML format, and now comes the hardest task of webscraping - all the parsing.
  5. Inspect the structure of your data - look mainly for patterns that mark where a new row of data begins.
  6. Once you have identified the pattern, use it to split the data into rows. In this example, each new hotel begins where the text reads </tr><tr><td>.
  7. Hint: if you are scraping a table from the web, there will be HTML syntax defining a table row (<tr>), a header cell (<th>) or a data cell (<td>). This could help you split to rows/columns.
  8. Keep using the RegEx, Formula, Multi-Row Formula and Filter tools to parse out the data and extract exactly what you want.
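
For readers who want to see the same idea outside Alteryx, here is a minimal Python sketch of the steps above: download once and cache, split on <tr> for rows, then on <td>/<th> for cells. It is only an illustration, not the Alteryx workflow itself. The function names, the cache file name, the placeholder URL and the sample hotel table are all assumptions made up for this example.

```python
import re
import urllib.request
from pathlib import Path


def fetch_cached(url, cache_path):
    """Download the page once and reuse the saved copy on later runs."""
    cache = Path(cache_path)
    if cache.exists():
        return cache.read_text(encoding="utf-8")
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    cache.write_text(html, encoding="utf-8")
    return html


def parse_rows(html):
    """Split HTML into rows (<tr>) and then into cells (<td> or <th>)."""
    rows = []
    for row_html in re.findall(r"<tr\b[^>]*>(.*?)</tr>", html, flags=re.S | re.I):
        cells = re.findall(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", row_html, flags=re.S | re.I)
        # Drop any tags nested inside a cell and trim whitespace.
        cleaned = [re.sub(r"<[^>]+>", "", cell).strip() for cell in cells]
        if cleaned:
            rows.append(cleaned)
    return rows


# Hypothetical sample standing in for the downloaded data.
SAMPLE_HTML = """<table>
<tr><th>Hotel</th><th>City</th></tr>
<tr><td>Hotel A</td><td><b>London</b></td></tr>
<tr><td>Hotel B</td><td>Paris</td></tr>
</table>"""

rows = parse_rows(SAMPLE_HTML)
header, records = rows[0], rows[1:]
print(header)   # ['Hotel', 'City']
print(records)  # [['Hotel A', 'London'], ['Hotel B', 'Paris']]

# With a real page (placeholder URL, not from the original post):
# html = fetch_cached("https://example.com/hotels", "hotels_cache.html")
# rows = parse_rows(html)
```

Regular expressions mirror the RegEx-tool approach described above, but they are fragile on messy or nested HTML, so expect to adjust the patterns for each site.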

My thoughts on webscraping:

  • It is not easy.
  • Alteryx makes bringing in the data quite simple; however, everything after that can be tedious.

Neat trick/hack:

  • Google Sheets has made web scraping super easy for extracting lists and tables from the web.
  • Simply open a new sheet in Google Sheets.
  • Type =IMPORTHTML followed by its arguments: the page URL, either "table" or "list", and the position of that table or list on the page (counting from 1).
  • Magically, the data has been extracted, parsed and organized for you!
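
To give a feel for what a function like this does for you, here is a rough Python analogue using only the standard library. How Google Sheets implements IMPORTHTML is not something this post covers, so treat this purely as a sketch: the class name, the function name import_html_table and the sample page are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index):
    """Return the index-th table on the page, counting from 1."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[index - 1]


# Hypothetical page with two tables.
PAGE = """<html><body>
<table><tr><td>ignored</td></tr></table>
<table>
<tr><th>Hotel</th><th>Rating</th></tr>
<tr><td>Hotel A</td><td>4.5</td></tr>
</table></body></html>"""

print(import_html_table(PAGE, 2))  # [['Hotel', 'Rating'], ['Hotel A', '4.5']]
```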
Author:
Sherina Mahtani