#TipWeek: Bulk Downloads with Alteryx

by Nai Louza

2016-02-15_17-40-42

This dashboard is based on data from the U.S. Bureau of Transportation Statistics that contains all flight data for U.S. airlines from 1987 to 2015 (link). ALL. Flight. Data. I’ll give you a moment to wrap your brain around that.

Suffice to say, it’s a lot of data. The challenge we were given that week was to narrow it down to a maximum of 10 million records and create a dashboard from those records that could be published to Tableau Public. This is because until May of 2015, Tableau Public could only process a maximum of 1M records for display. This changed to 10M in May 2015, which gives people a lot more flexibility in publishing large data sets, like the ones you get from the Bureau of Transport Stats.

The first thing you’ll notice when you go to the site to download the data is that you cannot do a bulk download; you can only filter to the year and month and then download a singular zip file. Although I could have just downloaded the 2000-2001 data I used to create this viz, part of our challenge that week was to figure out how to download the full dataset from the website and join it all in Alteryx. Did I mention this is ALL U.S. flight data from 1987 to 2015? Fudgies.

Well, if you download a zip file for one month/year (making sure to check on the Prezipped File box at the top),  look at your downloads list (ctrl+j in Chrome) and you’ll notice that you get a URL of the zip file:

2016-02-15_16-35-15

This URL has a pattern where it is named based on the selected year and month. We can use this to create a list of all zip file links for all years and months (HEADS UP! I know this was created in Excel using a formula, but I’ll be showing you how to create this same list in Alteryx for tomorrow’s #tipweek tip :)). You will also need a column specifying the location where you want to save all your files. You’ll see why in a minute:

2016-02-15_16-36-18

Once you’ve got your list of download URLs and locations, open up Alteryx and input the file. Then drag in the “Download” tool and connect it to your input tool. Configure your download tool so that it takes the URL from the “URL” field in the list you created (make sure to check on “encode URL text”) and output “to a file” > “filename from field” > set this to the field that contains the download location information previously inputted.

ezgif.com-video-to-gif

Run the workflow (you do NOT need to attach anything as an output) and Alteryx will go through and download all the zip files into the locations you’ve specified. Given that the data set is so large, it does take a couple of hours to run through everything. However, I would much rather run through a workflow like this rather than make my knuckles bleed trying to individually click and download each file off the site! Tomorrow, I will go into more details about how to create the Excel list inputted using a (somewhat) simple Alteryx workflow.