In week 1, Carl had fun asking us about hobbies. We naively answered him.
A few hours later, the afternoon started with Carl coupling us and assigning us five different messy datasets created by carefully putting stuff from the internet on a spreadsheet to make our lives difficult.
Thanks, Carl!
Paulien and I were given a dataset on music hits.
One of the issues we wanted to solve was extracting embedded hyperlinks from a column.
It was Day 2 at the Data School, and I stayed at the office a few more hours to bother Andy, the DS10 guys, people on Convo, and Robbin (Thank you so much for the advice!).
It took some time, but the answer was there waiting for us.
The Alteryx online community: a great place to find answers
Here’s what I found.
How Alteryx reads files: Lesson #1
In Input Excel file containing Hyperlinks, someone explained the core concept as something like this:
Excel doesn’t store the hyperlinks as values for the field but rather as an external relationship inside the XLSX package. Therefore, when bringing in Excel files, Alteryx brings in the values; it doesn’t look for external relationships.
In another discussion, Extract Hyperlinks, Ben Moss (DS2 cohort) explained:
.xlsx files are actually a collection of different files which Excel puts together upon opening. In contrast, tools such as python, R and Alteryx just fetch the data layer from the file.
I had to find a way to solve this puzzle.
Ben’s solution to extract hyperlinks
I downloaded one Alteryx workflow Package that a contributor had shared in the first discussion.
Robbin then came to the rescue, sharing what Ben Moss had done.
It was exactly what I was looking for.
As I shared in my other article, Dealing with datasets on Alteryx . . . in Arabic!, last week I tried to analyse a dataset in Arabic. I tried to find one online with some tricky stuff to solve, and I found it on Forbes Middle East.
There, too, I faced the very same problem: an embedded hyperlink.
I seized the chance to try Ben’s solution.
- I downloaded his package,
- I slavishly followed all the steps that he had made in it, using my data source, and there it was!
- Solved!
A step-by-step guide to extracting hyperlinks from an Excel file using Alteryx
a. Start from the folder where you previously saved your Excel file.
You will have to drag the same Excel file 3 times onto the Alteryx pane.
First time
b. Drag the file from your folder onto the pane in Alteryx. It will automatically add an “Input Data” icon.
Otherwise, go to In/Out and drag the icon.
Select the appropriate sheet. I had only one sheet in my Excel file, and it was called “Sheet”.
c. Look at the Configuration Pane:
2. File Format: Change it to Zip Archive (*.zip).
3. File in Archive: Open it. In the new window Extract File:
- Set Select file type to extract to Other Files.
- Check “worksheets”. It will automatically select the appropriate folders and files.
5. Output File Name as Field: No.
6. Parse Selected Files as: XML Files (*.xml).
10. Check Return Child Values.
Second time
d. Again, drag the file from your folder onto the pane in Alteryx. Select the appropriate sheet.
e. Look at the Configuration Pane:
2. File Format: Change it to Zip Archive (*.zip).
3. File in Archive: Open it. In the new window Extract File:
- Set Select file type to extract to Other Files.
- (*This step differs from the first time*) Check “sheet1.xml”. It will automatically select the appropriate folders and files.
- Set Parse other files as to XML Files (*xml).
5. Output File Name as Field: No.
6. Parse Selected Files as: XML Files (*.xml).
9. (*This step differs from the first time*) Type “hyperlink”.
10. Check Return Child Values.
f. Join the two files by 1 specific field: Id <-> r:id.
Third time
g. Again, drag the file from your folder onto the pane in Alteryx. Select the appropriate sheet.
h. Leave the Configuration Pane as is.
i. Add a Record ID tool with starting value = 2.
j. Add the Formula tool. Look at the configuration pane:
- In Output column, open the dropdown, click on + Add Column.
- Type this formula: “A”+tostring([RecordID]).
k. Add the Join tool. Then:
- Join the previous Join tool with the Formula.
- Join by Specific Fields: Left (Column) <-> Right (ref).
- Rename the Target column as “URL”.
The previously embedded URLs now appear as String values in the URL column.
You are ready to clean the dataset even more and select the Output Data tool to create an updated .xlsx file.
I explain the steps I took to edit the dataset in another post.
Read it here: How to clean a messy dataset . . . in Arabic.
Extracting embedded hyperlinks: Mission accomplished!
A big thank you to Robbin and to Ben Moss for sharing his solution on the Alteryx Community!