Analyzing the Game of Survivor -- Analyzing Sentiment of Reddit Posts (3)¶
Welcome back to the third installment of the Survivor analysis series! My last post began to investigate the [relationship between Reddit mentions and particular contestants]. The next step is to dive a bit deeper into the Reddit comments themselves, using some basic NLP techniques.
In this installment, I will continue digging into the Reddit data that I collected via the Pushshift.io API. For more information on how this data was collected, please check out the [first article in this series], where I describe the ETL process for the Pushshift (as well as other!) data.
Introduction to NLP and Sentiment Analysis¶
We will be using a query similar to the one in the first post on this topic, so I will skip over the explanation here.
A quick explanation of NLP -- and a disclaimer: I have done some work in this field, but am by no means an expert. The next few paragraphs give a brief overview of some of the topics in this area; you are encouraged to dive deeper yourself. The following isn't necessary to understand this analysis, but I wanted to outline some of the challenges in this field, and some of the potential shortcomings of this analysis, before I dive in.
Sentiment analysis is a branch of Natural Language Processing which investigates the sentiment of particular words and sentences. There is quite a lot of research on this topic, but essentially: an annotated corpus of sentences is used to train a model that predicts the sentiment of a sentence from the attributes of the text itself.
Extracting these attributes is an article in and of itself, and can be done in quite a few ways. The simplest, and most widely known, is the bag-of-words approach. In this approach, each document is first tokenized into words, and then, across a collection of examples, each document is vectorized based on the counts of each word. This assumes that, essentially, everything interesting about a sentence can be broken down into its components (words). That is, the whole is the sum of the parts.
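To make the bag-of-words idea concrete, here is a minimal sketch using only the standard library. The tokenization is deliberately naive (lowercase, strip periods, split on whitespace) -- real pipelines use much smarter tokenizers:

```python
from collections import Counter

def bag_of_words(sentences):
    """Return (shared vocabulary, one count vector per sentence)."""
    tokenized = [s.lower().replace('.', '').split() for s in sentences]
    # The vocabulary is the set of all tokens across every example.
    vocab = sorted({tok for sent in tokenized for tok in sent})
    # Each sentence becomes a vector of counts over that vocabulary.
    vectors = [[Counter(sent)[word] for word in vocab] for sent in tokenized]
    return vocab, vectors

vocab, vectors = bag_of_words(['I hate Russell.', 'I hate Frosted Flakes.'])
print(vocab)    # ['flakes', 'frosted', 'hate', 'i', 'russell']
print(vectors)  # [[0, 0, 1, 1, 1], [1, 1, 1, 1, 0]]
```

Notice that word order is discarded entirely -- which is exactly the "whole is the sum of the parts" assumption described above.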
However, we know this isn't the case. Words have contextual meaning as well, and there are correlations between the words in a sentence. Contextual encoders like ELMo and BERT try to handle this issue, but that is a topic for another time.
NLP is a rich field with many topics of interest. For this analysis, we will mostly gloss over those details and use pretrained models for sentiment analysis.
Often, in these kinds of tasks, if the problem is generalizable enough, a model pretrained on a general corpus (like Wikipedia, or reviews across industries, for example) is sufficient for your use case. Of course, there may be domain-specific meaning to particular words or phrases. For instance, the sentence:
I hate Russell.
is semantically very similar to:
I hate Frosted Flakes.
but
Tony got the third immunity idol.
has a Survivor-specific meaning in terms of the word "immunity". In other contexts, like:
I am immune to the Chicken Pox, since I've already had it.
the word "immunity" has an essentially different meaning. In fact, outside of a Survivor context, the word "immunity" will generally never mean exactly the same thing as it does on the show. This is important to recognize, as it shows us the limitations of using a generalized model.
For this reason, there is often a desire to fine-tune these generalized models to a specific use-case.
However, the word fine-tune is quite intentional. The generalized model (usually a neural network, in the case of embeddings) handles the bulk of the work; the fine-tuning adds a smaller, incremental benefit on top. The general model gets you most of the way there.
While the topics above (contextual embeddings and fine-tuning models for domain-specific applications) are certainly of interest, and could potentially improve the models, I will leave them (and all model fitting) for a later date. For now, we will just use publicly available, out-of-the-box solutions to get a quick view of the sentiment. Then, if the results seem worth digging deeper into, I will investigate building contextual models in future articles to gain more insight into the semantics and context of the sentences.
import os
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
import statsmodels.api as sm
from IPython.display import HTML, display
from plotly.express import scatter
pg_un, pg_pw, pg_ip, pg_port = [os.getenv(x) for x in ['PG_UN', 'PG_PW', 'PG_IP', 'PG_PORT']]
def pg_uri(un, pw, ip, port):
    return f'postgresql://{un}:{pw}@{ip}:{port}'
eng = create_engine(pg_uri(pg_un, pg_pw, pg_ip, pg_port))
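One small caveat with building the URI by hand: if the username or password contains characters like `@`, `/`, or `:`, the URI will parse incorrectly. A sketch of a safer variant using the standard library's `quote_plus` (the function name `pg_uri_safe` and the sample password are my own, for illustration):

```python
from urllib.parse import quote_plus

def pg_uri_safe(un, pw, ip, port):
    # Percent-encode the credentials so reserved URI characters
    # (@, /, :) in a password don't break parsing.
    return f'postgresql://{quote_plus(un)}:{quote_plus(pw)}@{ip}:{port}'

print(pg_uri_safe('user', 'p@ss/word', 'localhost', 5432))
# postgresql://user:p%40ss%2Fword@localhost:5432
```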
sql = '''
WITH contestants_to_seasons AS (
SELECT c.contestant_id, c.first_name,
c.last_name, cs.contestant_season_id, c.sex,
cs.season_id, occupation, location, age, placement,
days_lasted, votes_against,
med_evac, quit, individual_wins, attempt_number,
tribe_0, tribe_1, tribe_2, tribe_3, alliance_0,
alliance_1, alliance_2,
challenge_wins, challenge_appearances, sitout,
voted_for_bootee, votes_against_player, character_id,
r.role, r.description,
total_number_of_votes_in_episode, tribal_council_appearances,
votes_at_council, number_of_jury_votes, total_number_of_jury_votes,
number_of_days_spent_in_episode, days_in_exile,
individual_reward_challenge_appearances, individual_reward_challenge_wins,
individual_immunity_challenge_appearances, individual_immunity_challenge_wins,
tribal_reward_challenge_appearances, tribal_reward_challenge_wins,
tribal_immunity_challenge_appearances, tribal_immunity_challenge_wins,
tribal_reward_challenge_second_of_three_place, tribal_immunity_challenge_second_of_three_place,
fire_immunity_challenge, tribal_immunity_challenge_third_place, episode_id
FROM survivor.contestant c
RIGHT JOIN survivor.contestant_season cs
ON c.contestant_id = cs.contestant_id
JOIN survivor.episode_performance_stats eps
ON eps.contestant_id = cs.contestant_season_id
JOIN survivor.role r
ON cs.character_id = r.role_id
), matched_exact AS
(
SELECT reddit.*, c.*
FROM survivor.reddit_comments reddit
JOIN contestants_to_seasons c
ON (POSITION(c.first_name IN reddit.body) > 0
OR POSITION(c.last_name IN reddit.body) > 0)
AND c.season_id = reddit.within_season
AND c.episode_id = reddit.most_recent_episode
WHERE within_season IS NOT NULL
)
SELECT *
FROM matched_exact m
'''
reddit_df = pd.read_sql(sql, eng)
ep_df = pd.read_sql('SELECT * FROM survivor.episode', eng)
season_to_name = pd.read_sql('SELECT season_id, name AS season_name FROM survivor.season', eng)
reddit_df = reddit_df.merge(season_to_name, on='season_id')
reddit_df.rename(columns={'name': 'season_name'}, inplace=True)
reddit_df = reddit_df.merge(ep_df.drop(columns=['season_id']), on='episode_id')
reddit_df['created_dt'] = pd.to_datetime(reddit_df['created_dt'])
pd.options.display.max_columns = 100
TextBlob¶
To analyze the sentiment of the comments, we will be using textblob, an NLP package that builds on top of NLTK and pattern, two very popular NLP libraries. It has a very simple API which will allow us to do a lot of interesting analysis right out of the box!
We will be using its pre-trained sentiment analyzer to extract the sentiment from a piece of text.
To give a sense of what this looks like, let's look at a dummy example:
from textblob import TextBlob
text = 'I love Katy Perry.'
blob = TextBlob(text)
blob.sentiment
blob.noun_phrases
In each of these cases, we get a polarity, which represents the actual sentiment (1 being the most positive, -1 the most negative), and a subjectivity, which tries to measure how subjective the sentence itself is (0 being "objective", 1 being "subjective").
We will take a look at both in this analysis.
Let's look at a particular comment and see what we will want to look at.
idx_two_example = reddit_df[reddit_df['body'].str.contains('Tyler made the absolutely best strategic move he could have. Will... is an idiot.')].index[0]
example_comment = reddit_df['body'].iloc[idx_two_example]
example_comment
The first thing we can do is just look at the sentiment of this entire paragraph. That's straightforward enough:
b = TextBlob(example_comment)
TextBlob(example_comment).sentiment
But there are a few things to note here. First off, we notice there are two different names in the above comment, with different sentiments towards each. How do we associate the sentiment with the correct person?
This is a hard problem. Essentially, we would have to find a way to match each noun to a verb or adjective. This isn't a trivial task, and it assumes a lot -- that we can identify the part of speech (not a given at all), and that we can then unambiguously match them. In short, this is not easy to do and, as far as I can see, there's no way to handle it out of the box with TextBlob.
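To make the difficulty concrete, here is a deliberately naive sketch of subject-sentiment attribution: assign each sentiment-bearing word to the nearest name by token distance. Both the tiny hand-made lexicon and the "nearest name wins" rule are my own illustrative assumptions -- this is not how TextBlob works, and the approach breaks quickly on less tidy word orders:

```python
# Toy sentiment lexicon -- purely illustrative, not TextBlob's.
LEXICON = {'best': 1.0, 'idiot': -1.0}

def attribute_sentiment(tokens, names):
    """Assign each lexicon word's score to the nearest name by token distance."""
    scores = {name: [] for name in names}
    name_positions = [(i, t) for i, t in enumerate(tokens) if t in names]
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            nearest = min(name_positions, key=lambda p: abs(p[0] - i))[1]
            scores[nearest].append(LEXICON[tok])
    return {name: sum(v) / len(v) if v else 0.0 for name, v in scores.items()}

tokens = 'The best player is Tyler but Will is an idiot'.split()
scores = attribute_sentiment(tokens, {'Tyler', 'Will'})
print(scores)  # {'Tyler': 1.0, 'Will': -1.0}
```

Here the proximity rule happens to get it right, but reordering the sentence (e.g. putting "best" closer to the wrong name) flips the attribution -- which is exactly why a proper solution needs part-of-speech tagging or dependency parsing.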
There is, however, a somewhat "close enough" way to handle this problem. For now, since we're just looking at things on aggregate, this is what we will do.
First, we can extract all of the relevant contestants from the string. Remember, each comment is replicated once for each possible subject if there are multiple names. In this case, we have:
example_subject = reddit_df['first_name'].iloc[idx_two_example]
example_subject
It's Will. Looking at the sentences above, it looks like this commenter doesn't like Will too much, and thinks Tyler made the "absolutely best strategic move". So let's first extract only the sentences that have Will's name in them, and look at the average sentiment of those sentences.
[x.sentiment for x in TextBlob(example_comment).sentences]
np.mean([x.sentiment.polarity for x in TextBlob(example_comment).sentences if example_subject in x])
Let's compare this with the other potential subject, on the next row:
example_subject = reddit_df['first_name'].iloc[idx_two_example + 1]
print(example_subject)
np.mean([x.sentiment.polarity for x in TextBlob(example_comment).sentences if example_subject in x])
We see that this matches our expectations much better than just using the overall polarity.
Note that this method is far from perfect, but short of being able to associate each potential subject-adjective pair, it will perform well enough for our purposes.
To give an example of how difficult this problem can be, take a look at this comment:
idx_many_example = reddit_df[reddit_df['body'].str.contains('I am really wishing for a David win, partly because')].index[0]
reddit_df['body'].iloc[idx_many_example]
Look how many names we have there! By my count there are:
- David
- Tony
- Mike
- Kristie
- Zeke
- Yul
- Jonathan
- Jessica
- Jay
- Jeff (Probst!)
- Cochran
- Michelle
Somehow, this person managed to jam that many names into so few words! (Impressive!)
In this case, using the above method doesn't work so well.
We'll stop here for now, but there could be some interesting avenues to explore in associating subjects/objects with verbs and adjectives.
def extract_overall_and_sentence_sentiment(row):
    # Overall sentiment of the full comment, plus the average sentiment
    # of only the sentences that mention the contestant by name.
    body = row['body']
    first = row['first_name']
    last = row['last_name']
    b = TextBlob(body)
    overall_sentiment = b.sentiment
    overall_polarity = overall_sentiment.polarity
    overall_subj = overall_sentiment.subjectivity
    rel_sentences_sentiment = [x.sentiment
                               for x in b.sentences
                               if (first in x) or (last in x)]
    sentence_polarity = np.mean([y.polarity for y in rel_sentences_sentiment])
    sentence_subj = np.mean([y.subjectivity for y in rel_sentences_sentiment])
    metrics = [overall_polarity, overall_subj, sentence_polarity, sentence_subj]
    metric_names = ['overall_polarity', 'overall_subj', 'sentence_polarity', 'sentence_subj']
    return pd.Series(metrics, index=metric_names)
def add_sentiment_cols(df):
    sentiment_df = df.apply(extract_overall_and_sentence_sentiment, axis=1)
    return pd.concat([df, sentiment_df], axis=1)
reddit_df = add_sentiment_cols(reddit_df)
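One edge case worth noting: if no single sentence happens to contain the contestant's name (sentence tokenization can occasionally split things oddly), the per-sentence list is empty, and `np.mean` of an empty list returns `nan` (with a RuntimeWarning). A quick check, plus one way to filter such rows before aggregating:

```python
import numpy as np

# The mean of an empty list is nan, so rows where no sentence matched
# the contestant's name end up with NaN sentence-level sentiment.
empty_mean = np.mean([])
print(np.isnan(empty_mean))  # True

# One option before aggregating (column names from the function above):
# reddit_df = reddit_df.dropna(subset=['sentence_polarity', 'sentence_subj'])
```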
Let's take a look at one of the comments to get a sense of what this looks like!
reddit_df.sample(1)[['body', 'first_name', 'last_name',
'overall_subj', 'sentence_subj',
'overall_polarity', 'sentence_polarity']].to_dict()
It's important to remember that a lot of the discussion is about which contestants could win or should win, or whether they made the right decision. This is, of course, a separate question from sentiment -- there are plenty of people who make good strategic moves but are not the most well liked, by fans or by other contestants. Strong words in either direction ("hate", "love", etc.) will dominate the polarity scores either way.
Out of curiosity, let's take a look at the overall distribution of sentiment in these comments.
from plotly.express import bar, histogram, box
histogram(data_frame=reddit_df, x='sentence_polarity', nbins=50, title='Sentence Sentiment Polarity')