Friendly Neighbourhood Web-Scraper

For a personal project I wanted to learn how to web-scrape and perform sentiment analysis on text, so I scraped some data from the Letterboxd reviews of the latest movie I’d watched – Spider-Man: No Way Home. This was before joining the Data School, so I had no knowledge of data prep software such as Alteryx, endeavoring to do the cleaning and analysis in Excel and R Studio. Recently I revisited this project to see how I would clean it in Alteryx instead.

Data: https://letterboxd.com/film/spider-man-no-way-home/reviews/by/activity/

This was my raw data. The scores provided a challenge because ideally you would want them as a number so you can perform calculations and aggregations, but the score on the original website was represented by star icons. These came through as a certain number of characters for each score. Additionally, some reviews had no score as this was a rewatch, and users can only submit a score once.

There were also some issues with the text of the reviews – it turns out that UTF-8 encoded characters sometimes don’t show up properly after scraping the data, with the blue-highlighted characters showing up instead. This would pose a problem when analysing the sentiments of words in the reviews – removing punctuation would not work, as some of the replacement characters are letters.

To fix the score in Excel, I first used a rather long formula that was extremely tedious to write. Essentially, it matches every possible string in the score to a number.

In Alteryx, it was a lot more simple. This was my workflow and the initial data:

After some initial cleansing and converting the date to a DateTime format, I used a formula to get the score. This contained a RegEx function that counts the number of times a “â” letter is matched in the score string, and that count then becomes the numeric score. The field was previously made lowercase in the data cleansing tool, since non-whole numbers ended in a capitalized character. If the string contains a half fraction symbol, the formula adds 0.5 to the score. This is only carried out if the “â” letter is present in the score field – otherwise, there is a null value since no score was given.

To fix the incorrect characters in the review text, I copied and pasted a table I found online of UTF-8 encoded characters and the most common characters that appear in their place. The “Expected” field shows how the characters should appear, and the “Actual” field shows the incorrect characters that appear in their place.

Data source: https://www.i18nqa.com/debug/utf8-debug.html

I used the Find Replace tool and configured it so that in every review, it finds all instances of a character from the “Actual” field in the text input table, then replaces each of those characters (or sets of characters) with the corresponding character in the “Expected” field.

There is however, one character I did not replace – in the data, an apostrophe often appeared as a “â” character. However, several of the strings in the “Actual” field contain this character, so if the “â” was included in the lookup table, the tool might have replaced the “â” within these strings instead of the whole set. For example, “â€š” would be changed to “’€š” rather than a comma.

To change the leftover "â" characters, I used another Find Replace tool, this time using a text input with just that letter. Of course this would mess up some reviews in other languages, as the "â" would sometimes actually be a part of the word, but my analysis was only going to focus on the English reviews, so this isn’t much of a problem

All of these steps result in the output below. Some square symbols still seem to appear in the output, but when you view the actual data in the cell, the boxes disappear.

Overall, data cleansing in Alteryx was much easier than in Excel, with the additional option to see all the steps I took and change my workflow if needed – as I did when I added the second Find Replace tool. I did not fix the encoding problem in the review before, because when I used certain packages in R Studio to parse out the words in the reviews, the problem was gone.

Next I would have to parse out the words and assign sentiments to each one, but that sounds like a part 2 post.

Author:

Dorinna Pentchev

View Profile