Sentiment Analysis using a Dictionary

by James Fox

Paul Houghton has a great blog post and webinar on creating a sentiment analysis workflow in Alteryx using either an API, Dictionary, or Custom model. Check them out for a really good guide on how to use each of them!

In this blog I will take a deeper dive specifically into using the Dictionary method for Sentiment analysis, breaking down each step for cleaning and preparing the dictionary, followed by the sentiment scoring.

In this example I am trying to understand how the sentiment of Pitchfork reviews compares to the score given to an artist’s album.

This can help us answer questions such as: Is the sentiment of a review reflective of its score? Do different reviewers score higher than they feel about an album? How has this all changed over time?

Step 1 – Find your Dictionary

First of all, it is important to find a good lexicon to score your words for you. I am using the MPQA Subjectivity Lexicon.

Step 2 – Extract data from the Dictionary

Each row of the MPQA Subjectivity Lexicon comes in this format:

type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative

Luckily it comes with a README file that can help us understand what all that means. For this exercise I will be extracting the word, the polarity, and the type.

To extract this information we will use 3 regex tools.

  1. type=(.*)subj
  2. word1=(.+)\spos
  3. priorpolarity=(.*)
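For readers who want to try the same extraction outside Alteryx, here is a minimal Python sketch that applies the three expressions above to the sample row. The function name `parse_lexicon_line` is made up for illustration and is not part of the lexicon or of Alteryx.

```python
import re

# The same three expressions as the RegEx tools above.
TYPE_RE = re.compile(r"type=(.*)subj")
WORD_RE = re.compile(r"word1=(.+)\spos")
POLARITY_RE = re.compile(r"priorpolarity=(.*)")


def parse_lexicon_line(line):
    """Return (word, strength, polarity) for one lexicon row, or None if any field is missing."""
    type_match = TYPE_RE.search(line)
    word_match = WORD_RE.search(line)
    polarity_match = POLARITY_RE.search(line)
    if not (type_match and word_match and polarity_match):
        return None
    # strip() removes any trailing whitespace or newline picked up by (.*)
    return word_match.group(1), type_match.group(1), polarity_match.group(1).strip()


row = "type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative"
print(parse_lexicon_line(row))  # ('abandoned', 'weak', 'negative')
```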

Then we have to assign values to find out the sentiment score. I have given weak words a value of 0.5 and strong words 1 using a formula tool.

The lexicon also contains words with a polarity of ‘neutral’ or ‘both’, but I will choose to ignore them (as I would give them a score of 0 anyway).

Next, I use a formula to give the negative words a negative value.
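The two formula steps can be sketched in Python as a pair of lookups. The weights 0.5 and 1 come from the text above. The dictionary and function names are illustrative, and scoring ‘neutral’ and ‘both’ as 0 follows the choice to ignore them.

```python
# Weights from the post: weak words count 0.5, strong words count 1.
STRENGTH_WEIGHT = {"weak": 0.5, "strong": 1.0}

# Negative words get a negative sign; anything else (neutral, both) falls back to 0.
POLARITY_SIGN = {"positive": 1, "negative": -1}


def word_score(strength, polarity):
    """Combine the strength weight and the polarity sign into one signed score."""
    sign = POLARITY_SIGN.get(polarity, 0)
    return sign * STRENGTH_WEIGHT.get(strength, 0.0)


print(word_score("weak", "negative"))    # -0.5
print(word_score("strong", "positive"))  # 1.0
```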

Step 3 – Join the sentiment scores with the words in your data

These are 3 rows of the Pitchfork data:

We need to extract each word from those rows into new rows.

We can do this with a Text To Columns tool set to split to ‘Rows’, followed by a Data Cleansing tool to make the words lower case (as joins in Alteryx are case sensitive).

After joining the data, we have each review with each word and its sentiment score:

Following that step, we need to sum the sentiment scores of all of the words for each review.

This results in each review having a total sentiment score.
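As a rough Python equivalent of Step 3, the sketch below splits each review into lower-case words, looks each word up in the lexicon, and totals the scores per review. The tiny lexicon and the two reviews are invented placeholders, not real MPQA entries or Pitchfork text. Stripping punctuation with a regular expression is also an assumption that goes slightly beyond a plain split on spaces.

```python
import re

# Placeholder lexicon: word -> signed score (not real MPQA values).
lexicon = {"great": 1.0, "dull": -0.5, "abandoned": -0.5}

# Placeholder reviews keyed by a review ID.
reviews = {
    1: "A great record, Great hooks",
    2: "Dull and abandoned ideas",
}


def score_reviews(reviews, lexicon):
    """Return {review_id: total sentiment score} by summing the scores of matched words."""
    totals = {}
    for review_id, text in reviews.items():
        # Lower-case first so the lookup matches regardless of the original casing.
        words = re.findall(r"[a-z']+", text.lower())
        # Words missing from the lexicon contribute 0, like rows dropped by the join.
        totals[review_id] = sum(lexicon.get(word, 0.0) for word in words)
    return totals


print(score_reviews(reviews, lexicon))  # {1: 2.0, 2: -1.0}
```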

Step 4 – Additional Sentiment adjustment

After reaching this point you can make additional calculations to make the score more valid.

For example, the reviews had different numbers of words, so those with more words tended to have a higher score.

I divided the score by the number of words, then rescaled the result to run from 0 to 10, so I could easily compare the sentiment with the review score, with both on a 0 to 10 scale.
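One possible way to do this adjustment is sketched below. The post does not say how the 0 to 10 rescaling was done, so the min-max scaling here is an assumption rather than the method actually used, and the totals and word counts are invented numbers.

```python
def per_word_scores(totals, word_counts):
    """Divide each review's total score by its word count (reviews with no words are skipped)."""
    return {
        rid: totals[rid] / word_counts[rid]
        for rid in totals
        if word_counts.get(rid, 0) > 0
    }


def rescale_0_10(values):
    """Min-max scale the values so the lowest maps to 0 and the highest to 10 (assumed method)."""
    if not values:
        return {}
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All reviews score the same, so place them at the midpoint.
        return {rid: 5.0 for rid in values}
    return {rid: 10 * (v - lo) / (hi - lo) for rid, v in values.items()}


# Invented example: total scores and word counts for three reviews.
totals = {1: 2.0, 2: -1.0, 3: 0.0}
word_counts = {1: 5, 2: 4, 3: 10}
scaled = rescale_0_10(per_word_scores(totals, word_counts))
print(scaled)
```

Min-max scaling depends on the most extreme reviews in the data set, so a single outlier will compress everything else. That is one reason the choice is fairly arbitrary.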

As the calculations can be fairly arbitrary, I didn’t want to go into this section too much. I want to open the floor to ideas and would love to hear your thoughts on how to validate the sentiment score in the best way possible!

Additionally, this scoring can be improved by using the stemmed part of the dictionary, and perhaps by factoring in other ways to clean the data, such as removing common words like ‘the’, ‘and’, ‘in’, etc.
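As a small illustration of the stop-word idea, the sketch below filters common words out of a word list before counting. The stop-word set is a short made-up sample, not a standard list. If these words are absent from the lexicon they already score 0, so removing them would mainly change the word count used for the per-word adjustment.

```python
# Illustrative sample only; a real workflow would use a fuller stop-word list.
STOP_WORDS = {"the", "and", "in", "a", "of"}


def remove_stop_words(words):
    """Return the words that are not in the stop-word set, keeping their order."""
    return [word for word in words if word not in STOP_WORDS]


words = ["the", "album", "and", "the", "band", "in", "tune"]
print(remove_stop_words(words))  # ['album', 'band', 'tune']
```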