Analyzing the Game of Survivor -- Analyzing Sentiment of Reddit Posts (3)¶
Welcome back to the third installment of the Survivor analysis series! My last post began to investigate the [relationship between Reddit mentions and particular contestants]. The next step is to dive a bit deeper into the Reddit comments themselves, using some basic NLP techniques.
In this installment, I will continue digging into the Reddit data that I collected via the Pushshift.io API. For more information on how this data was collected, please check out the [first article in this series], where I describe the ETL process for the Pushshift (as well as other!) data.
Introduction to NLP and Sentiment Analysis¶
We will be using a query similar to the one in the first post on this topic, so I will skip over the explanation here.
A quick explanation of NLP -- and a disclaimer: I have done some work in this field, but am by no means an expert. The next few paragraphs give a brief overview of some of the topics in this area; you are encouraged to dive deeper yourself. The following isn't necessary to understand this analysis, but I wanted to outline some of the challenges in this field, and some of the potential shortcomings of this analysis, before I dive in.
Sentiment analysis is a branch of Natural Language Processing which investigates the sentiment of particular words and sentences. There is quite a lot of research on this topic, but essentially: an annotated corpus of sentences is used to train a model that predicts the sentiment of a sentence from the attributes of the text itself.
Extracting these attributes is an article in and of itself, and can be done in quite a few ways. The simplest, and most widely known, is the bag-of-words approach. In this approach, each document is first tokenized into words, and then, across a collection of examples, each document is vectorized based on the counts of each word. This assumes that, essentially, everything interesting about a sentence can be broken down into its components (words). That is, the whole is the sum of the parts.
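To make the bag-of-words idea concrete, here is a minimal sketch using only the standard library. The tokenization is deliberately naive (lowercase, strip periods, split on whitespace) -- real pipelines use much smarter tokenizers:

```python
from collections import Counter

def bag_of_words(sentences):
    """Return (shared vocabulary, one count vector per sentence)."""
    tokenized = [s.lower().replace('.', '').split() for s in sentences]
    # The vocabulary is the set of all tokens across every example.
    vocab = sorted({tok for sent in tokenized for tok in sent})
    # Each sentence becomes a vector of counts over that vocabulary.
    vectors = [[Counter(sent)[word] for word in vocab] for sent in tokenized]
    return vocab, vectors

vocab, vectors = bag_of_words(['I hate Russell.', 'I hate Frosted Flakes.'])
print(vocab)    # ['flakes', 'frosted', 'hate', 'i', 'russell']
print(vectors)  # [[0, 0, 1, 1, 1], [1, 1, 1, 1, 0]]
```

Notice that word order is discarded entirely -- which is exactly the "whole is the sum of the parts" assumption described above.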
However, we know this isn't the case. Words have contextual meaning as well, and there are correlations between the words in a sentence. Contextual encoders like ELMo and BERT try to handle this issue, but that is a topic for another time.
NLP is a rich field with many topics of interest. For this analysis, we will mostly gloss over those details and use pretrained models for sentiment analysis.
Often, in these kinds of tasks, if the problem is generalizable enough, a model pretrained on a general corpus (like Wikipedia, or reviews across industries, for example) is sufficient for your use case. Of course, there may be domain-specific meaning to particular words or phrases. For instance, the sentence:
I hate Russell.
is semantically very similar to:
I hate Frosted Flakes.
but
Tony got the third immunity idol.
has a Survivor-specific meaning in terms of the word "immunity". In other contexts, like:
I am immune to the Chicken Pox, since I've already had it.
the word "immunity" has an essentially different meaning. In fact, outside of a Survivor context, the word "immunity" will generally never mean exactly the same thing as it does on the show. This is important to recognize, as it shows us the limitations of using a generalized model.
For this reason, there is often a desire to fine-tune these generalized models to a specific use-case.
However, the word fine-tune is quite intentional. The generalized model (usually a neural network, in the case of embeddings) handles the bulk of the work; the fine-tuning adds a smaller, incremental benefit on top. The general model gets you most of the way there.
While the topics above (contextual embeddings and fine-tuning models for domain-specific applications) are certainly of interest, and could potentially improve the models, I will leave them (and all model fitting) for a later date. For now, we will just use publicly available, out-of-the-box solutions to get a quick view of the sentiment. Then, if the results seem worth digging deeper into, I will investigate building contextual models in future articles to gain more insight into the semantics and context of the sentences.
import os
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
import statsmodels.api as sm
from IPython.display import HTML, display
from plotly.express import scatter
pg_un, pg_pw, pg_ip, pg_port = [os.getenv(x) for x in ['PG_UN', 'PG_PW', 'PG_IP', 'PG_PORT']]
def pg_uri(un, pw, ip, port):
    return f'postgresql://{un}:{pw}@{ip}:{port}'
eng = create_engine(pg_uri(pg_un, pg_pw, pg_ip, pg_port))
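One small caveat with building the URI by hand: if the username or password contains characters like `@`, `/`, or `:`, the URI will parse incorrectly. A sketch of a safer variant using the standard library's `quote_plus` (the function name `pg_uri_safe` and the sample password are my own, for illustration):

```python
from urllib.parse import quote_plus

def pg_uri_safe(un, pw, ip, port):
    # Percent-encode the credentials so reserved URI characters
    # (@, /, :) in a password don't break parsing.
    return f'postgresql://{quote_plus(un)}:{quote_plus(pw)}@{ip}:{port}'

print(pg_uri_safe('user', 'p@ss/word', 'localhost', 5432))
# postgresql://user:p%40ss%2Fword@localhost:5432
```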
sql = '''
WITH contestants_to_seasons AS (
SELECT c.contestant_id, c.first_name,
c.last_name, cs.contestant_season_id, c.sex,
cs.season_id, occupation, location, age, placement,
days_lasted, votes_against,
med_evac, quit, individual_wins, attempt_number,
tribe_0, tribe_1, tribe_2, tribe_3, alliance_0,
alliance_1, alliance_2,
challenge_wins, challenge_appearances, sitout,
voted_for_bootee, votes_against_player, character_id,
r.role, r.description,
total_number_of_votes_in_episode, tribal_council_appearances,
votes_at_council, number_of_jury_votes, total_number_of_jury_votes,
number_of_days_spent_in_episode, days_in_exile,
individual_reward_challenge_appearances, individual_reward_challenge_wins,
individual_immunity_challenge_appearances, individual_immunity_challenge_wins,
tribal_reward_challenge_appearances, tribal_reward_challenge_wins,
tribal_immunity_challenge_appearances, tribal_immunity_challenge_wins,
tribal_reward_challenge_second_of_three_place, tribal_immunity_challenge_second_of_three_place,
fire_immunity_challenge, tribal_immunity_challenge_third_place, episode_id
FROM survivor.contestant c
RIGHT JOIN survivor.contestant_season cs
ON c.contestant_id = cs.contestant_id
JOIN survivor.episode_performance_stats eps
ON eps.contestant_id = cs.contestant_season_id
JOIN survivor.role r
ON cs.character_id = r.role_id
), matched_exact AS
(
SELECT reddit.*, c.*
FROM survivor.reddit_comments reddit
JOIN contestants_to_seasons c
ON (POSITION(c.first_name IN reddit.body) > 0
OR POSITION(c.last_name IN reddit.body) > 0)
AND c.season_id = reddit.within_season
AND c.episode_id = reddit.most_recent_episode
WHERE within_season IS NOT NULL
)
SELECT *
FROM matched_exact m
'''
reddit_df = pd.read_sql(sql, eng)
ep_df = pd.read_sql('SELECT * FROM survivor.episode', eng)
season_to_name = pd.read_sql('SELECT season_id, name AS season_name FROM survivor.season', eng)
reddit_df = reddit_df.merge(season_to_name, on='season_id')
reddit_df.rename(columns={'name': 'season_name'}, inplace=True)
reddit_df = reddit_df.merge(ep_df.drop(columns=['season_id']), on='episode_id')
reddit_df['created_dt'] = pd.to_datetime(reddit_df['created_dt'])
pd.options.display.max_columns = 100
TextBlob¶
To analyze the sentiment of the comments, we will be using textblob, an NLP package that builds on top of NLTK and pattern, two very popular NLP libraries. It has a very simple API which will allow us to do a lot of interesting analysis right out of the box!
We will be using its pre-trained sentiment analyzer to extract the sentiment from a piece of text.
To give a sense of what this looks like, let's look at a dummy example:
from textblob import TextBlob
text = 'I love Katy Perry.'
blob = TextBlob(text)
blob.sentiment
blob.noun_phrases
In each of these cases, we get a polarity, which represents the actual sentiment (1 being the most positive, -1 the most negative), and a subjectivity, which tries to measure how subjective the sentence itself is (0 being "objective", 1 being "subjective").
We will take a look at both in this analysis.
Let's look at a particular comment and see what we will want to look at.
idx_two_example = reddit_df[reddit_df['body'].str.contains('Tyler made the absolutely best strategic move he could have. Will... is an idiot.')].index[0]
example_comment = reddit_df['body'].iloc[idx_two_example]
example_comment
The first thing we can do is just look at the sentiment of this entire paragraph. That's straightforward enough:
b = TextBlob(example_comment)
TextBlob(example_comment).sentiment
But there are a few things to note here. First off, we notice there are two different names in the above comment, with different sentiments towards each. How do we associate the sentiment with the correct person?
This is a hard problem. Essentially, we would have to find a way to match each noun to a verb or adjective. This isn't a trivial task, and it assumes a lot -- that we can identify the part of speech (not a given at all), and that we can then unambiguously match them. In short, this is not easy to do and, as far as I can see, there's no way to handle it out of the box with TextBlob.
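To make the difficulty concrete, here is a deliberately naive sketch of subject-sentiment attribution: assign each sentiment-bearing word to the nearest name by token distance. Both the tiny hand-made lexicon and the "nearest name wins" rule are my own illustrative assumptions -- this is not how TextBlob works, and the approach breaks quickly on less tidy word orders:

```python
# Toy sentiment lexicon -- purely illustrative, not TextBlob's.
LEXICON = {'best': 1.0, 'idiot': -1.0}

def attribute_sentiment(tokens, names):
    """Assign each lexicon word's score to the nearest name by token distance."""
    scores = {name: [] for name in names}
    name_positions = [(i, t) for i, t in enumerate(tokens) if t in names]
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            nearest = min(name_positions, key=lambda p: abs(p[0] - i))[1]
            scores[nearest].append(LEXICON[tok])
    return {name: sum(v) / len(v) if v else 0.0 for name, v in scores.items()}

tokens = 'The best player is Tyler but Will is an idiot'.split()
scores = attribute_sentiment(tokens, {'Tyler', 'Will'})
print(scores)  # {'Tyler': 1.0, 'Will': -1.0}
```

Here the proximity rule happens to get it right, but reordering the sentence (e.g. putting "best" closer to the wrong name) flips the attribution -- which is exactly why a proper solution needs part-of-speech tagging or dependency parsing.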
There is, however, a somewhat "close enough" way to handle this problem. For now, since we're just looking at things on aggregate, this is what we will do.
First, we can extract all of the relevant contestants from the string. Remember, each comment is replicated once for each possible subject if there are multiple names. In this case, we have:
example_subject = reddit_df['first_name'].iloc[idx_two_example]
example_subject
It's Will. Looking at the sentences above, it looks like this commenter doesn't like Will too much, and thinks Tyler made the "absolutely best strategic move". So let's first extract only the sentences that have Will's name in them, and look at the average sentiment of those sentences.
[x.sentiment for x in TextBlob(example_comment).sentences]
np.mean([x.sentiment.polarity for x in TextBlob(example_comment).sentences if example_subject in x])
Let's compare this with the other potential subject, on the next row:
example_subject = reddit_df['first_name'].iloc[idx_two_example + 1]
print(example_subject)
np.mean([x.sentiment.polarity for x in TextBlob(example_comment).sentences if example_subject in x])
We see that this matches our expectations much better than just using the overall polarity.
Note that this method is far from perfect, but short of being able to associate each potential subject-adjective pair, it will perform well enough for our purposes.
To give an example of how difficult this problem can be, take a look at this comment:
idx_many_example = reddit_df[reddit_df['body'].str.contains('I am really wishing for a David win, partly because')].index[0]
reddit_df['body'].iloc[idx_many_example]
Look how many names we have there! By my count there are:
- David
- Tony
- Mike
- Kristie
- Zeke
- Yul
- Jonathan
- Jessica
- Jay
- Jeff (Probst!)
- Cochran
- Michelle
Somehow, this person managed to jam that many names into so few words! (Impressive!)
In this case, using the above method doesn't work so well.
We'll stop here for now, but there could be some interesting avenues to explore in associating subjects/objects with verbs and adjectives.
def extract_overall_and_sentence_sentiment(row):
    # Overall sentiment of the full comment, plus the average sentiment
    # of only the sentences that mention the contestant by name.
    body = row['body']
    first = row['first_name']
    last = row['last_name']
    b = TextBlob(body)
    overall_sentiment = b.sentiment
    overall_polarity = overall_sentiment.polarity
    overall_subj = overall_sentiment.subjectivity
    rel_sentences_sentiment = [x.sentiment
                               for x in b.sentences
                               if (first in x) or (last in x)]
    sentence_polarity = np.mean([y.polarity for y in rel_sentences_sentiment])
    sentence_subj = np.mean([y.subjectivity for y in rel_sentences_sentiment])
    metrics = [overall_polarity, overall_subj, sentence_polarity, sentence_subj]
    metric_names = ['overall_polarity', 'overall_subj', 'sentence_polarity', 'sentence_subj']
    return pd.Series(metrics, index=metric_names)
def add_sentiment_cols(df):
    sentiment_df = df.apply(extract_overall_and_sentence_sentiment, axis=1)
    return pd.concat([df, sentiment_df], axis=1)
reddit_df = add_sentiment_cols(reddit_df)
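One edge case worth noting: if no single sentence happens to contain the contestant's name (sentence tokenization can occasionally split things oddly), the per-sentence list is empty, and `np.mean` of an empty list returns `nan` (with a RuntimeWarning). A quick check, plus one way to filter such rows before aggregating:

```python
import numpy as np

# The mean of an empty list is nan, so rows where no sentence matched
# the contestant's name end up with NaN sentence-level sentiment.
empty_mean = np.mean([])
print(np.isnan(empty_mean))  # True

# One option before aggregating (column names from the function above):
# reddit_df = reddit_df.dropna(subset=['sentence_polarity', 'sentence_subj'])
```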
Let's take a look at one of the comments to get a sense of what this looks like!
reddit_df.sample(1)[['body', 'first_name', 'last_name',
'overall_subj', 'sentence_subj',
'overall_polarity', 'sentence_polarity']].to_dict()
It's important to remember that a lot of the discussion is about which contestants could win or should win, or whether they made the right decision. This is, of course, a separate question from sentiment -- there are plenty of people who make good strategic moves but are not the most well liked, by fans or by other contestants. Strong words in either direction ("hate", "love", etc.) will dominate the polarity scores either way.
Out of curiosity, let's take a look at the overall distribution of sentiment in these comments.
from plotly.express import bar, histogram, box
histogram(data_frame=reddit_df, x='sentence_polarity', nbins=50, title='Sentence Sentiment Polarity')