Dashboard Week Day 2: Scraping PDFs with Alteryx

by Sarah Jellenc

On day 2 of Dashboard Week, our task was to scrape data from a PDF using Alteryx and build a viz with the resulting data set. We each chose a different PDF of results tables from a meet of the IAAF Diamond League (I chose the Monaco 2019 results). According to Andy, this is one of the hardest data prep exercises!

Today’s task was particularly challenging because the tables in the PDF had alternating structures. We needed a bit of R, a lot of regex, a blog post, and the help of several colleagues to get the job done.

The original PDF to be scraped.

Kicked it off with a bit of R…

Once I had imported the PDF into Alteryx, I got to work restructuring the data. This wasn’t entirely straightforward, but I learned quite a bit.

My initial workflow.
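For anyone curious what the regex part of this kind of restructuring looks like outside Alteryx, here is a minimal Python sketch. It is not the workflow pictured above: the two line layouts, the placeholder names, and the `parse_line` helper are all assumptions made up for illustration, not taken from the Monaco 2019 PDF. The idea is simply to try one pattern per table structure and keep whichever one matches.

```python
import re

# Hypothetical layouts for illustration only (not the real PDF's layout).
# A result is either seconds like 9.99 or minutes:seconds like 3:29.50.
RESULT = r"\d+(?::\d+)?\.\d+"

# Layout A: rank, name, three-letter country code, result
PATTERN_A = re.compile(
    rf"^(?P<rank>\d+)\s+(?P<name>.+?)\s+(?P<country>[A-Z]{{3}})\s+(?P<result>{RESULT})$"
)

# Layout B: same columns plus a trailing signed wind reading, e.g. +0.3
PATTERN_B = re.compile(
    rf"^(?P<rank>\d+)\s+(?P<name>.+?)\s+(?P<country>[A-Z]{{3}})\s+(?P<result>{RESULT})"
    rf"\s+(?P<wind>[+-]\d+\.\d+)$"
)


def parse_line(line):
    """Return a dict of fields for a result row, or None for any other line."""
    line = line.strip()
    # Try the more specific layout first, then fall back to the simpler one.
    for layout, pattern in (("B", PATTERN_B), ("A", PATTERN_A)):
        match = pattern.match(line)
        if match:
            row = match.groupdict()
            row["layout"] = layout
            return row
    return None  # headers, blank lines, and anything unrecognised


if __name__ == "__main__":
    # Made-up sample lines with placeholder names and country codes.
    sample = [
        "RANK NAME NAT RESULT",
        "1 RUNNER One AAA 9.99",
        "2 JUMPER Two BBB 8.10 +0.3",
    ]
    for text in sample:
        print(parse_line(text))
```

Lines that match neither pattern come back as `None`, which makes it easy to filter out headers and page furniture before building the final table.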

Next, it was time to build the viz. I sketched out a few ideas first, then decided to focus on the top competitors in each event.

My fancy sketching.

And finally, the (almost) finished product:

In the future, I’d like to add both the competitors’ countries of origin and result times to the tooltips.