Analyzing the Game of Survivor -- Looking at Fan Favorites via Reddit (2)¶
After reading through [my last post] on the ETL process, you may be interested in some of the data I collected. Fear not, my dear data science enthusiast, for I have come bearing gifts of analysis!
In this second installment of the series, I will begin the analysis by digging into some of the Reddit data that I collected via the Pushshift.io API. For more information on how this data was collected, please check out the [first article in this series], where I describe the ETL process for the Pushshift data (as well as other data!).
Looking at Contestants Mentioned in Comments¶
The first query uses a few different tables. First, I use a CTE (Common Table Expression) that combines information from the contestant and episode performance stats tables. I also load the episode table into a separate dataframe, which may come in handy later.
Then, we look at instances where the first or last name is contained in the body of a comment made within a particular season. While there are cases where this will not work (for instance, when a contestant is best known by their nickname, or when a shortened or misspelled version of the name is used), it should give us a sense of the comments pertaining to particular players.
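As a tiny illustration of that caveat (this snippet is my own, not part of the notebook's query), plain substring matching on the stored first and last names will miss contestants who are referred to almost exclusively by a nickname:
# Hypothetical example: Benjamin "Coach" Wade is almost always just called "Coach" in comments.
body = "Coach was robbed this week"
print('Benjamin' in body or 'Wade' in body)  # False -- neither stored name appears in the comment
print('Coach' in body)                       # True, but nicknames aren't in the contestant table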
It's worth noting that the comments only go back to 2011 -- and the number of comments has greatly increased over time. We try a few ways of normalizing based on this information, but some players (particularly older players) will not be considered in this analysis.
import os
from sqlalchemy import create_engine
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from copy import deepcopy
from plotly.express import line, bar
# Read Postgres connection details from environment variables
pg_un, pg_pw, pg_ip, pg_port = [os.getenv(x) for x in ['PG_UN', 'PG_PW', 'PG_IP', 'PG_PORT']]

def pg_uri(un, pw, ip, port):
    return f'postgresql://{un}:{pw}@{ip}:{port}'

eng = create_engine(pg_uri(pg_un, pg_pw, pg_ip, pg_port))
sql = '''
WITH contestants_to_seasons AS (
SELECT c.contestant_id, c.first_name,
c.last_name, cs.contestant_season_id,
cs.season_id, occupation, location, age, placement,
days_lasted, votes_against,
med_evac, quit, individual_wins, attempt_number,
tribe_0, tribe_1, tribe_2, tribe_3, alliance_0,
alliance_1, alliance_2,
challenge_wins, challenge_appearances, sitout,
voted_for_bootee, votes_against_player,
total_number_of_votes_in_episode, tribal_council_appearances,
votes_at_council, number_of_jury_votes, total_number_of_jury_votes,
number_of_days_spent_in_episode, days_in_exile,
individual_reward_challenge_appearances, individual_reward_challenge_wins,
individual_immunity_challenge_appearances, individual_immunity_challenge_wins,
tribal_reward_challenge_appearances, tribal_reward_challenge_wins,
tribal_immunity_challenge_appearances, tribal_immunity_challenge_wins,
tribal_reward_challenge_second_of_three_place, tribal_immunity_challenge_second_of_three_place,
fire_immunity_challenge, tribal_immunity_challenge_third_place, episode_id
FROM survivor.contestant c
RIGHT JOIN survivor.contestant_season cs
ON c.contestant_id = cs.contestant_id
JOIN survivor.episode_performance_stats eps
ON eps.contestant_id = cs.contestant_season_id
), matched_exact AS
(
SELECT reddit.*, c.*
FROM survivor.reddit_comments reddit
JOIN contestants_to_seasons c
ON (POSITION(c.first_name IN reddit.body) > 0
OR POSITION(c.last_name IN reddit.body) > 0)
AND c.season_id = reddit.within_season
AND c.episode_id = reddit.most_recent_episode
WHERE within_season IS NOT NULL
)
SELECT *
FROM matched_exact m
'''
reddit_df = pd.read_sql(sql, eng)
ep_df = pd.read_sql('SELECT * FROM survivor.episode', eng)
season_to_name = pd.read_sql('SELECT season_id, name AS season_name FROM survivor.season', eng)
reddit_df = reddit_df.merge(season_to_name, on='season_id')
reddit_df.rename(columns={'name': 'season_name'}, inplace=True)
reddit_df = reddit_df.merge(ep_df.drop(columns=['season_id']), on='episode_id')
reddit_df['created_dt'] = pd.to_datetime(reddit_df['created_dt'])
pd.options.display.max_columns = 100
Taking a Look At The Data¶
reddit_df.head()
reddit_df.shape
There is a wealth of data here -- the actual content of the message, plus other Reddit information (like the user, upvotes, flairs, etc.). For this part of the analysis, we will just be looking at the occurrences of the names in the body of the comment. In the next installment, we will look a bit deeper at the text inside the body and how it relates to the contestants. We will also take a look at the users and other information in later installments.
One thing to note is that the above query did the heavy lifting for us -- finding the first and last names in the comment bodies. So this dataframe (at roughly 1.1M rows) represents only the comments that contained at least one of those names, and a comment can appear multiple times if it mentions multiple contestants.
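To make the row-versus-comment distinction concrete, here is a quick check I would run (a sketch of mine; it assumes the Reddit table carries a unique comment identifier column called id, which may well be named differently in the actual schema):
# Rows vs. distinct comments: a comment mentioning two contestants appears in two rows.
# 'id' is assumed to be the Reddit comment identifier column -- adjust if named differently.
print(reddit_df.shape[0])          # total rows (one per comment-contestant match)
print(reddit_df['id'].nunique())   # distinct comments matching at least one name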
Comments Per Season¶
The first -- and most obvious -- question we can answer: how many comments are there each season? And which seasons are represented on the subreddit?
from plotly.express import bar, line
def plot_season_comments(df):
comments_per_season = df.groupby('season_name').size().reset_index()
comments_per_season.rename(columns={'season_name': 'Season', 0: 'Number of Comments (with names)'}, inplace=True)
return bar(data_frame=comments_per_season.sort_values(by='Number of Comments (with names)'),
x='Season', y='Number of Comments (with names)')
plot_season_comments(reddit_df)
We can see that the seasons represented are those since 2011. We also see that certain seasons, particularly the more recent ones, have many more comments than others. This makes sense intuitively, as Reddit usage has increased substantially over the years. Winners at War, the most recent season, has the most Reddit comments; it also drew strong TV viewership, as it was an "all-star" type of season.
To see this increase a bit more clearly, we can look at this over time based on the broadcast date of the episodes:
def plot_season_comments_time(df):
comments_per_season = df.sort_values(by='firstbroadcast').groupby([df['firstbroadcast'].sort_values().dt.year, 'season_name']).size().reset_index()
comments_per_season['Season, year'] = comments_per_season['season_name'] + ', ' + comments_per_season['firstbroadcast'].astype(str)
comments_per_season.rename(columns={0: 'Number of Comments (with names)'}, inplace=True)
return line(data_frame=comments_per_season, x='Season, year', y='Number of Comments (with names)', )
plot_season_comments_time(reddit_df)
Not too different from the sorted chart above, aside from a few dips and peaks in certain years. Interestingly, two confounding factors are at play here -- the increasing popularity of Reddit over time, and the popularity of some seasons relative to others. Something to keep in mind throughout this analysis!
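One rough way to account for the first confounder is to normalize each season by the overall comment volume in its broadcast year. A quick sketch of my own (it treats the comments in this dataframe as a stand-in for overall subreddit activity, and uses the firstbroadcast column merged in from the episode table):
# Sketch (not in the original notebook): express each season's comment count as a share of
# all matched comments from seasons first broadcast in the same calendar year. Crude, since
# this dataframe only contains comments that mention a contestant by name.
tmp = reddit_df.assign(year=pd.to_datetime(reddit_df['firstbroadcast']).dt.year)
per_season = tmp.groupby(['year', 'season_name']).size().rename('comments').reset_index()
per_season['share_of_year'] = per_season['comments'] / per_season.groupby('year')['comments'].transform('sum')
print(per_season.sort_values(['year', 'share_of_year']))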
Comments Per Contestant¶
Now, on to the real reason for this query -- looking at the number of comments about particular contestants.
First, we look at the contestants with the highest absolute count of comments on Reddit. Since different seasons may have more (or fewer) comments for reasons unrelated to the popularity of the season itself, this will not necessarily give us an unbiased answer. However, it is still interesting to consider this in both absolute and relative terms.
For the absolute chart, we count the number of mentions of each contestant and plot a bar chart of the top 20. For the relative chart, we consider how many mentions they got relative to the total number of mentions that season.
def plot_sorted_n_alltime(df, total=True, n=20, top=True):
grper = ['contestant_season_id', 'first_name', 'last_name']
measured = 'Number of Mentions' if total else 'Percent of Season Mentions'
abs_rel = 'Absolute' if total else 'Relative'
top_bot = 'Top' if top else 'Bottom'
title = f'{top_bot} {n} by {abs_rel} Number of Mentions'
totals = df.groupby('season_name').apply(lambda x: x.groupby(grper).size() / (x.shape[0] if not total else 1)).sort_values()
totals_clipped = totals.tail(n) if top else totals.head(n)
totals_clipped = totals_clipped.reset_index()
if not total:
totals_clipped[0] *= 100
totals_clipped['full_name'] = totals_clipped['first_name'] + \
' ' + totals_clipped['last_name'] + \
', ' + totals_clipped['season_name'].astype(str)
totals_clipped.rename(columns={0: measured, 'full_name': 'Contestant'}, inplace=True)
return bar(data_frame=totals_clipped, x=measured, y='Contestant', title=title)
plot_sorted_n_alltime(reddit_df)
plot_sorted_n_alltime(reddit_df, False)
We see some interesting results. First, some of the big names are at the top of this list -- Tony is a very popular player, as well as a two-time winner of the game. The first chart is dominated by Winners at War contestants, which makes sense since that season has many more comments than the others. Even Adam Klein makes this list, although he may have been one of the least popular contestants on that season.
The relative chart gives a bit more of a holistic view -- the top 8 contestants are immediately recognizable to me as interesting contestants from past seasons. Rob Mariano, or "Boston Rob," is one of the most popular players the show has ever had, as is the resident nerd, John Cochran.
We could of course look at this a few different ways -- if we looked at the bottom of this list, I'm sure we'd see a lot of people we've never heard of who were voted out in the first episode. Out of curiosity, let's check it out!
plot_sorted_n_alltime(reddit_df, False, top=False)
Hm, the results here are somewhat interesting! On one end, there are some names I have certainly never heard of. On the other, there are some that are quite popular -- J.T., for instance. In his case, I suspect it's more a matching artifact than a true lack of popularity: "J.T." with the periods is probably rarely written out in comments (people likely just type "JT"), and, unfortunately for J.T., it's not worth diving into here. Other entries are contestants who were popular overall, but were probably unpopular, or voted out quickly, during a particular season (like Rupert).
Then, you have people who are notoriously unlucky, like Francesca Hogi, who was voted out in the first episode of both seasons she played. Hate to say it, but she's exactly who you'd expect to see at the bottom of this list!
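If we ever wanted to patch up cases like J.T.'s, one option (a sketch of my own, not something this notebook does) would be to compare punctuation-stripped, lower-cased versions of the names and the comment bodies -- though very short names would then need word-boundary handling to avoid false matches:
import re

def normalize_name(text):
    # drop punctuation and lowercase, so "J.T.", "JT", and "jt" all compare equal
    return re.sub(r'[^a-z0-9\s]', '', text.lower())

# hypothetical comment body; note that very short names could still over-match inside other words
print(normalize_name('J.T.') in normalize_name('I think JT played that terribly'))  # True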
Season Breakdown¶
While there were definitely some interesting takeaways from looking at this in the aggregate, the next step is to drill down into individual seasons -- when did people get the most comments? Were some people popular early on (and then voted out)?
The next plots look into this.
from plotly.express import bar
def create_count_from_episode_df(ep_df, unique_idxs):
    # Count mentions per contestant within a single episode, reindexing against the full
    # contestant list so contestants with no mentions that episode show up with a zero.
    grper = ['season_id', 'contestant_season_id', 'first_name', 'last_name']
    reindexed = ep_df.groupby(grper).size().reindex(pd.MultiIndex.from_frame(unique_idxs))
    reindexed.name = 'count'
    reindexed = reindexed.fillna(0)
    return reindexed.reset_index()
def create_episode_counts(df):
    # Per-episode mention counts for every contestant, plus running and overall totals
    grper = ['season_id', 'contestant_season_id', 'first_name', 'last_name']
    idx = df[grper].drop_duplicates()
    ep_counts = df.groupby(['episode_id', 'episode_name']).apply(lambda x: create_count_from_episode_df(x, idx))
    ep_counts = ep_counts.reset_index()
    # running total of mentions per contestant as the season progresses
    ep_counts['cumulative_player_counts'] = ep_counts.groupby('contestant_season_id')['count'].transform(lambda x: x.cumsum())
    ep_counts.sort_values(by=['episode_id', 'cumulative_player_counts'], inplace=True)
    ep_counts['total_counts'] = ep_counts.groupby('contestant_season_id')['count'].transform('sum')
    return ep_counts
def create_racetrack_by_episode(df, *args, **kwargs):
ep_counts = create_episode_counts(df)
fig = bar(y='first_name', x='cumulative_player_counts',
data_frame=ep_counts, animation_frame='episode_name',
range_x=[0, ep_counts['cumulative_player_counts'].max() * 1.05],
*args, **kwargs)
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 2000
return fig
def create_racetrack_by_episode_for_season(df, season_id, *args, **kwargs):
subset_df = df[df['season_id'] == season_id]
season_name = subset_df['season_name'].iloc[0]
set_kwargs = dict(title = f'Cumulative Comment Counts for Season: <b>{season_name}</b>',
labels=
{'first_name': 'First Name',
'cumulative_player_counts': 'Number of Comments',
'episode_name': 'Episode Name'})
set_kwargs.update(kwargs)
return create_racetrack_by_episode(subset_df, *args, **set_kwargs)
The plots below are animated plots for each of the seasons covered by the Reddit data. These are what I'm calling racetrack plots (hence the function names above) -- really just animated bar plots where the categories (in this case, the contestants) race against one another toward the right side of the chart.
In this case, to make things interesting, I only show a contestant's bar while they are still picking up comments on Reddit -- once they are voted out of the game, their total drops back to zero. You may notice that a bar dropping to zero lines up very closely with whoever was most recently voted out. This makes sense -- for most players, once they are voted out they are rarely, if ever, mentioned again on Reddit in the episode threads that follow. People have short memories.
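To put a rough number on that observation, here is a quick check of my own (not part of the original plots). It reuses create_episode_counts from above and the placement column pulled in by the SQL query, which I'm assuming is the finishing position with 1 for the winner:
# Sketch: for one season, see whether contestants who last longer (smaller placement) also
# keep picking up mentions later into the season. Assumes episode_id increases in airing
# order and placement is 1 for the winner -- both assumptions on my part.
season_id = reddit_df['season_id'].iloc[0]
season_df = reddit_df[reddit_df['season_id'] == season_id]
season_counts = create_episode_counts(season_df)

# last episode in which each contestant is mentioned at all
last_mention = (season_counts[season_counts['count'] > 0]
                .groupby('contestant_season_id')['episode_id'].max()
                .rename('last_mention_episode'))
placement = season_df.groupby('contestant_season_id')['placement'].first()

print(pd.concat([last_mention, placement], axis=1).corr(method='spearman'))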
Plotly Express makes this plot very easy to build, at the cost of some customization. The quirk mentioned above was essentially unavoidable for this plot without some big workarounds. Still, the animation capabilities of Plotly Express are worth showing off. As you can see below, you can press the play button to start the animation, and it will play episode by episode until the season is "over" (the last episode of the season). After each episode, the remaining contestants are re-sorted. Keep an eye on the players -- they can be a bit hard to track!
for season in reddit_df['season_id'].unique():
    fig = create_racetrack_by_episode_for_season(reddit_df, season, height=1000, width=1000)
    fig.show()
Episode by Episode Breakdown¶
While the last plots did show us how the cumulative comment counts grew over time for each contestant, we'd like to visualize the difference between episodes a bit more clearly. Additionally, it would be nice to have the bars stick around even after a player is no longer getting new comments. For this, we will use our own custom Plotly animation function below to generate similar plots for each of these seasons.
def plot_episode_by_episode_breakdown_by_season(df, season_id, *args, **kwargs):
reduced = df[df['season_id'] == season_id]
season_name = reduced['season_name'].iloc[0]
set_kwargs = dict(title = f'Cumulative Comment Counts by Episode for Season <b>{season_name}</b>',
xaxis=dict(title='Number of Comments', autorange=False),
yaxis=dict(title='First Name'))
set_kwargs.update(kwargs)
return plot_episode_by_episode_breakdown(reduced, *args, **set_kwargs)
def plot_episode_by_episode_breakdown(df, *args, **kwargs):
ep_counts = create_episode_counts(df)
ep_counts.sort_values(by='total_counts', inplace=True)
episodes = ep_counts['episode_id'].unique()
episodes.sort()
empty = dict(type='bar', orientation='h',
y=ep_counts['first_name'].unique(),
x=[None] * ep_counts['first_name'].nunique())
traces = [empty.copy() for i in range(len(episodes))]
frames = []
sliders_dict = {
"active": 0,
"yanchor": "top",
"xanchor": "left",
"currentvalue": {
"font": {"size": 20},
"prefix": "Episode: ",
"visible": True,
"xanchor": "right"
},
"transition": {"duration": 300, "easing": "cubic-in-out"},
"pad": {"b": 10, "t": 50},
"len": 0.9,
"x": 0.1,
"y": 0,
"steps": []
}
for i, ep in enumerate(episodes):
fr_dict = dict(type='bar', orientation='h')
new_bool = ep_counts['episode_id'] == ep
ep_name = ep_counts.loc[new_bool, 'episode_name'].iloc[0]
traces[i]['name'] = ep_name
fr_dict.update(dict(y = ep_counts.loc[new_bool, 'first_name'].reset_index(drop=True),
x = ep_counts.loc[new_bool, 'count'].reset_index(drop=True)))
        if i > 0:
            # carry forward the previous episode's traces so the bars stack cumulatively
            last_frame = deepcopy(frames[-1])
            last_frame['data'].append(fr_dict)
            last_frame['traces'] += [i]
        else:
            last_frame = dict(data=[fr_dict], traces=[0])
        # give the frame a name so the slider steps (which animate by frame name) can find it
        last_frame['name'] = ep_name
        frames.append(last_frame)
slider_step = {"args": [
[ep_name],
{"frame": {"duration": 300, "redraw": False},
"mode": "immediate",
"transition": {"duration": 300}}
],
"label": ep_name,
"method": "animate"}
sliders_dict["steps"].append(slider_step)
layout = go.Layout(width=1000,
height=1000,
showlegend=True,
hovermode='closest')
layout["sliders"] = [sliders_dict]
layout["updatemenus"] = [
{
"buttons": [
{
"args": [None, {"frame": {"duration": 500, "redraw": False},
"fromcurrent": True, "transition": {"duration": 300,
"easing": "quadratic-in-out"}}],
"label": "Play",
"method": "animate"
},
{
"args": [[None], {"frame": {"duration": 0, "redraw": False},
"mode": "immediate",
"transition": {"duration": 0}}],
"label": "Pause",
"method": "animate"
}
],
"direction": "left",
"pad": {"r": 10, "t": 87},
"showactive": False,
"type": "buttons",
"x": 0.1,
"xanchor": "right",
"y": 0,
"yanchor": "top"
}
]
try:
kwargs['xaxis'].update(range=[0, ep_counts['total_counts'].max() * 1.05])
except KeyError:
kwargs['xaxis'] = dict(range=[0, ep_counts['total_counts'].max() * 1.05])
layout.update(barmode='stack',
*args, **kwargs)
fig = go.Figure(data=traces, frames=frames, layout=layout)
return fig
for season in reddit_df['season_id'].unique():
plot_episode_by_episode_breakdown_by_season(reddit_df, season).show()
Conclusions¶
This first jump into the analysis definitely gave us some interesting results! Some high-level takeaways:
r/survivor has been around since 2011. Over the years, engagement (in terms of comments) has increased by quite a lot.
Using absolute comment counts doesn't seem to be the way to go when comparing across seasons -- instead, comment counts relative to the season total seemed to work best.
The players who are the most popular according to Reddit comments pass the common sense check -- Tony and Rob Mariano top the list.
Comments (or the lack thereof) can be used as a proxy for in-game events. We've only just started to see this (without even looking at the body of the text), but we can already see that, in most cases, a contestant's Reddit mentions drop off to zero or near zero after they are eliminated from the game. This could be interesting to dive deeper into (especially if we eventually want to build a model!)
Generally, as we can see above, players who last longer tend to dominate the comment counts. Only in rare cases does a contestant voted out before the final 4 or 5 end up surpassing those finalists in Reddit comments (a rough check of this is sketched below). This passes the sniff test -- we remember contestants who last longer, even ones we wouldn't normally like, simply because they stay "in the running" as fewer contestants remain.
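As a rough check of that last point (a sketch of mine, again assuming placement is the finishing position with 1 for the winner):
# Per season: does any contestant booted before the final five out-mention the
# least-mentioned member of the final five? (placement <= 5 as a stand-in for "final 4 or 5")
totals = (reddit_df.groupby(['season_name', 'contestant_season_id', 'placement'])
          .size().rename('mentions').reset_index())
final_five_min = totals[totals['placement'] <= 5].groupby('season_name')['mentions'].min()
early_boot_max = totals[totals['placement'] > 5].groupby('season_name')['mentions'].max()
print((early_boot_max > final_five_min).value_counts())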
This has been an interesting look at the rich dataset of Reddit comments on r/survivor. But we aren't done yet!
The next installment of the Survivor analysis series will use sentiment analysis and other attributes of the comment text to try to glean information about the contestants and Reddit users' behavior. Stay tuned -- you won't want to miss it!