Combining specific excel sheets from multiple files

by Joe Phillpott-Clark

Last week DS19 carried out an internal project with data from the National Student Survey. This data was of survey responses collected from 2005-2019. We were taking the data from files on the NSS site, where each year had at least one file for responses. Each excel file had multiple sheets with different levels of granularity, and a notes sheet.

List of sheet names in each file

We decided on a level of granularity, and so I needed to take the NSS3 sheets (or its equivalent name) and union them together to create a full dataset for analysis. To do so, I used a method with two batch macros.

I’ll be using the NSS data to walk through the batch macro which can be downloaded here – https://www.officeforstudents.org.uk/advice-and-guidance/student-information-and-data/national-student-survey-nss/get-the-nss-data/

FIRST BATCH MACRO – Creating a list of all file and sheet name combinations

STEP 1

To start with, download all the files you wish to bring together. For this example, we will use the data from years 2012 to 2019 (Earlier years are xls files and would need to be converted to xlsx first).

Macro 1
Step 2

To create the batch macro, connect any single file using an input data tool, and then attach a control parameter as done in the image above.

Input tool configuration:

Change the Table or Query to the option <List of Sheet Names>. We’ll use this output later to create a full file + sheet path.

Action tool configuration:

Click on ‘file value’ and then ‘replace a specific string’. We want to remove the ‘|||<List of Sheet Names>’ part of the string as we are just looking to change the file path for each year.

Step 3

Next we just want to take the first part of the file path without this ‘<List of Sheet Names>’ and so we can use a simple text to columns tool.

Our final output for one year looks something like this. With two columns that we’ll add together to get a full path for the specific file and sheet.

Macro 2

SECOND BATCH MACRO – Union sheets from different files

To configure the second batch macro we simply add a control parameter connected to the input tool and a macro output.

Input tool configuration:

This time we choose a specific sheet under ‘Table or Query Name’.

Also select ‘File Name Only’ for the ‘Output file name as field’ option. I renamed each file to its year when downloading so that now when we output the file name as field, we have created a Year column to use in Tableau.

A batch macro will automatically union the iteration outputs together, creating the whole dataset we want. (Make sure your data structure across files is the same!)

FINAL WORKFLOW – Combining both macros

Final workflow

The first tool to use is a directory tool. To configure this tool, simply point it to the folder that contains your files. It’ll then output file paths and attributes. I then used a simple filter tool to get rid of the xls files that I also had in the same folder.

The first macro now outputs a list of file paths and sheet names for every file located in the directory we specified.

Next I used a filter tool, to only keep the rows of data that were pointing towards a sheet I wanted to keep. In this example it was just any sheet named ‘NSS3’. However, if your sheet names are varying across files you may have to add some more conditions in your filter.

After that, a formula tool allows us to create the full path that also includes the specific sheet name we’ve been looking for.

This is creating a list of full paths to feed into a second macro used to union all the chosen data sets together. In the second macro tool, choose this ‘Full path’ field in the configuration so the macro will repeat for every different ‘Full path’ file.

Finally, a data cleansing tool is used to remove some leading and trailing whitespaces, and that’s it!

Thanks for reading, and if you have any questions, or even an alternative method, be sure to comment or you can message me on twitter @Joe_DS19. 👍

Avatar

Joe Phillpott-Clark

Sat 09 May 2020

Sat 09 May 2020

Thu 23 Apr 2020

Wed 22 Apr 2020