Analyzing the Game of Survivor -- Looking at Fan Favorites via Reddit (2)¶
After reading through [my last post] on the ETL process, you may be interested in some of the data I collected. Fear not, my dear data science enthusiast, for I have come bearing gifts of analysis!
In this second installment of the series, I will begin the analysis by digging into some of the Reddit data that I collected via the Pushshift.io API. For more information on how this data was collected, please check out the [first article in this series], where I describe the ETL process for the Pushshift data (as well as other data!).
Looking at Contestants Mentioned in Comments¶
The first query uses a few different tables. First, I use a CTE (Common Table Expression) that combines information from the contestant and episode performance stats tables. I also load the episode table into a separate dataframe, which may come in handy later.
Then, we look at instances where the first or last name is contained in the body of a comment made within a particular season. While there are cases where this will not work (for instance, when a contestant is best known by their nickname, or when a shortened or misspelled version of the name is used), it should give us a sense of the comments pertaining to particular players.
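As a tiny illustration of that caveat (this snippet is my own, not part of the notebook's query), plain substring matching on the stored first and last names will miss contestants who are referred to almost exclusively by a nickname:
# Hypothetical example: Benjamin "Coach" Wade is almost always just called "Coach" in comments.
body = "Coach was robbed this week"
print('Benjamin' in body or 'Wade' in body)  # False -- neither stored name appears in the comment
print('Coach' in body)                       # True, but nicknames aren't in the contestant table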
It's worth noting that the comments only go back to 2011 -- and the number of comments has greatly increased over time. We try a few ways of normalizing based on this information, but some players (particularly older players) will not be considered in this analysis.
import os
from sqlalchemy import create_engine
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from copy import deepcopy
from plotly.express import line, bar
# Read Postgres connection details from environment variables
pg_un, pg_pw, pg_ip, pg_port = [os.getenv(x) for x in ['PG_UN', 'PG_PW', 'PG_IP', 'PG_PORT']]

def pg_uri(un, pw, ip, port):
    return f'postgresql://{un}:{pw}@{ip}:{port}'

eng = create_engine(pg_uri(pg_un, pg_pw, pg_ip, pg_port))
sql = '''
WITH contestants_to_seasons AS (
SELECT c.contestant_id, c.first_name,
c.last_name, cs.contestant_season_id,
cs.season_id, occupation, location, age, placement,
days_lasted, votes_against,
med_evac, quit, individual_wins, attempt_number,
tribe_0, tribe_1, tribe_2, tribe_3, alliance_0,
alliance_1, alliance_2,
challenge_wins, challenge_appearances, sitout,
voted_for_bootee, votes_against_player,
total_number_of_votes_in_episode, tribal_council_appearances,
votes_at_council, number_of_jury_votes, total_number_of_jury_votes,
number_of_days_spent_in_episode, days_in_exile,
individual_reward_challenge_appearances, individual_reward_challenge_wins,
individual_immunity_challenge_appearances, individual_immunity_challenge_wins,
tribal_reward_challenge_appearances, tribal_reward_challenge_wins,
tribal_immunity_challenge_appearances, tribal_immunity_challenge_wins,
tribal_reward_challenge_second_of_three_place, tribal_immunity_challenge_second_of_three_place,
fire_immunity_challenge, tribal_immunity_challenge_third_place, episode_id
FROM survivor.contestant c
RIGHT JOIN survivor.contestant_season cs
ON c.contestant_id = cs.contestant_id
JOIN survivor.episode_performance_stats eps
ON eps.contestant_id = cs.contestant_season_id
), matched_exact AS
(
SELECT reddit.*, c.*
FROM survivor.reddit_comments reddit
JOIN contestants_to_seasons c
ON (POSITION(c.first_name IN reddit.body) > 0
OR POSITION(c.last_name IN reddit.body) > 0)
AND c.season_id = reddit.within_season
AND c.episode_id = reddit.most_recent_episode
WHERE within_season IS NOT NULL
)
SELECT *
FROM matched_exact m
'''
reddit_df = pd.read_sql(sql, eng)
ep_df = pd.read_sql('SELECT * FROM survivor.episode', eng)
season_to_name = pd.read_sql('SELECT season_id, name AS season_name FROM survivor.season', eng)
reddit_df = reddit_df.merge(season_to_name, on='season_id')
reddit_df.rename(columns={'name': 'season_name'}, inplace=True)
reddit_df = reddit_df.merge(ep_df.drop(columns=['season_id']), on='episode_id')
reddit_df['created_dt'] = pd.to_datetime(reddit_df['created_dt'])
pd.options.display.max_columns = 100
Taking a Look At The Data¶
reddit_df.head()
reddit_df.shape
There is a wealth of data here -- the actual content of the message, plus other Reddit information (like the user, upvotes, flairs, etc.). For this part of the analysis, we will just be looking at the occurrences of the names in the body of the comment. In the next installment, we will look a bit deeper at the text inside the body and how it relates to the contestants. We will also take a look at the users and other information in later installments.
One thing to note is that the above query did the heavy lifting for us -- finding the first and last names in the comment bodies. So this dataframe (at roughly 1.1M rows) represents only the comments that contained at least one of those names, and a comment can appear multiple times if it mentions multiple contestants.
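To make the row-versus-comment distinction concrete, here is a quick check I would run (a sketch of mine; it assumes the Reddit table carries a unique comment identifier column called id, which may well be named differently in the actual schema):
# Rows vs. distinct comments: a comment mentioning two contestants appears in two rows.
# 'id' is assumed to be the Reddit comment identifier column -- adjust if named differently.
print(reddit_df.shape[0])          # total rows (one per comment-contestant match)
print(reddit_df['id'].nunique())   # distinct comments matching at least one name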
Comments Per Season¶
The first -- and most obvious -- question we can answer: how many comments are there each season? And which seasons are represented on the subreddit?
from plotly.express import bar, line
def plot_season_comments(df):
comments_per_season = df.groupby('season_name').size().reset_index()
comments_per_season.rename(columns={'season_name': 'Season', 0: 'Number of Comments (with names)'}, inplace=True)
return bar(data_frame=comments_per_season.sort_values(by='Number of Comments (with names)'),
x='Season', y='Number of Comments (with names)')
plot_season_comments(reddit_df)
We can see that the seasons represented are those since 2011. We also see that certain seasons, particularly the more recent ones, have many more comments than others. This makes sense intuitively, as Reddit usage has increased substantially over the years. Winners at War, the most recent season, has the most Reddit comments; it also drew strong TV viewership, as it was an "all-star" type of season.
To see this increase a bit more clearly, we can look at this over time based on the broadcast date of the episodes:
def plot_season_comments_time(df):
comments_per_season = df.sort_values(by='firstbroadcast').groupby([df['firstbroadcast'].sort_values().dt.year, 'season_name']).size().reset_index()
comments_per_season['Season, year'] = comments_per_season['season_name'] + ', ' + comments_per_season['firstbroadcast'].astype(str)
comments_per_season.rename(columns={0: 'Number of Comments (with names)'}, inplace=True)
return line(data_frame=comments_per_season, x='Season, year', y='Number of Comments (with names)', )
plot_season_comments_time(reddit_df)
Not too different from the sorted chart above, aside from a few dips and peaks in certain years. Interestingly, two confounding factors are at play here -- the increasing popularity of Reddit over time, and the popularity of some seasons relative to others. Something to keep in mind throughout this analysis!
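One rough way to account for the first confounder is to normalize each season by the overall comment volume in its broadcast year. A quick sketch of my own (it treats the comments in this dataframe as a stand-in for overall subreddit activity, and uses the firstbroadcast column merged in from the episode table):
# Sketch (not in the original notebook): express each season's comment count as a share of
# all matched comments from seasons first broadcast in the same calendar year. Crude, since
# this dataframe only contains comments that mention a contestant by name.
tmp = reddit_df.assign(year=pd.to_datetime(reddit_df['firstbroadcast']).dt.year)
per_season = tmp.groupby(['year', 'season_name']).size().rename('comments').reset_index()
per_season['share_of_year'] = per_season['comments'] / per_season.groupby('year')['comments'].transform('sum')
print(per_season.sort_values(['year', 'share_of_year']))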
Comments Per Contestant¶
Now, on to the real reason for this query -- looking at the number of comments about particular contestants.
First, we look at the contestants with the highest absolute count of comments on Reddit. Since different seasons may have more (or fewer) comments for reasons unrelated to the popularity of the season itself, this will not necessarily give us an unbiased answer. However, it is still interesting to consider this in both absolute and relative terms.
For the absolute chart, we count the number of mentions of each contestant and plot a bar chart of the top 20. For the relative chart, we consider how many mentions they got relative to the total number of mentions that season.
def plot_sorted_n_alltime(df, total=True, n=20, top=True):
grper = ['contestant_season_id', 'first_name', 'last_name']
measured = 'Number of Mentions' if total else 'Percent of Season Mentions'
abs_rel = 'Absolute' if total else 'Relative'
top_bot = 'Top' if top else 'Bottom'
title = f'{top_bot} {n} by {abs_rel} Number of Mentions'
totals = df.groupby('season_name').apply(lambda x: x.groupby(grper).size() / (x.shape[0] if not total else 1)).sort_values()
totals_clipped = totals.tail(n) if top else totals.head(n)
totals_clipped = totals_clipped.reset_index()
if not total:
totals_clipped[0] *= 100
totals_clipped['full_name'] = totals_clipped['first_name'] + \
' ' + totals_clipped['last_name'] + \
', ' + totals_clipped['season_name'].astype(str)
totals_clipped.rename(columns={0: measured, 'full_name': 'Contestant'}, inplace=True)
return bar(data_frame=totals_clipped, x=measured, y='Contestant', title=title)
plot_sorted_n_alltime(reddit_df)
plot_sorted_n_alltime(reddit_df, False)
We see some interesting results. First, some of the big names are at the top of this list -- Tony is a very popular player, as well as a two-time winner of the game. The first chart is dominated by Winners at War contestants, which makes sense since that season has many more comments than the others. Even Adam Klein makes this list, although he may have been one of the least popular contestants on that season.
The relative chart gives a bit more of a holistic view -- the top 8 contestants are immediately recognizable to me as interesting contestants from past seasons. Rob Mariano, or "Boston Rob," is one of the most popular players the show has ever had, as is the resident nerd, John Cochran.
We could of course look at this a few different ways -- if we looked at the bottom of this list, I'm sure we'd see a lot of people we've never heard of who were voted out in the first episode. Out of curiosity, let's check it out!
plot_sorted_n_alltime(reddit_df, False, top=False)
Hm, the results here are somewhat interesting! On one end, there are some names I have certainly never heard of. On the other, there are some that are quite popular -- J.T., for instance. In his case, I suspect it's more a matching artifact than a true lack of popularity: "J.T." with the periods is probably rarely written out in comments (people likely just type "JT"), and, unfortunately for J.T., it's not worth diving into here. Other entries are contestants who were popular overall, but were probably unpopular, or voted out quickly, during a particular season (like Rupert).
Then, you have people who are notoriously unlucky, like Francesca Hogi, who was voted out in the first episode of both seasons she played. Hate to say it, but she's exactly who you'd expect to see at the bottom of this list!
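If we ever wanted to patch up cases like J.T.'s, one option (a sketch of my own, not something this notebook does) would be to compare punctuation-stripped, lower-cased versions of the names and the comment bodies -- though very short names would then need word-boundary handling to avoid false matches:
import re

def normalize_name(text):
    # drop punctuation and lowercase, so "J.T.", "JT", and "jt" all compare equal
    return re.sub(r'[^a-z0-9\s]', '', text.lower())

# hypothetical comment body; note that very short names could still over-match inside other words
print(normalize_name('J.T.') in normalize_name('I think JT played that terribly'))  # True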
Season Breakdown¶
While there were definitely some interesting takeaways from looking at this in the aggregate, the next step is to drill down into individual seasons -- when did people get the most comments? Were some people popular early on (and then voted out)?
The next plots look into this.
from plotly.express import bar
def create_count_from_episode_df(ep_df, unique_idxs):
    # Count mentions per contestant within a single episode, reindexing against the full
    # contestant list so contestants with no mentions that episode show up with a zero.
    grper = ['season_id', 'contestant_season_id', 'first_name', 'last_name']
    reindexed = ep_df.groupby(grper).size().reindex(pd.MultiIndex.from_frame(unique_idxs))
    reindexed.name = 'count'
    reindexed = reindexed.fillna(0)
    return reindexed.reset_index()
def create_episode_counts(df):
    # Per-episode mention counts for every contestant, plus running and overall totals
    grper = ['season_id', 'contestant_season_id', 'first_name', 'last_name']
    idx = df[grper].drop_duplicates()
    ep_counts = df.groupby(['episode_id', 'episode_name']).apply(lambda x: create_count_from_episode_df(x, idx))
    ep_counts = ep_counts.reset_index()
    # running total of mentions per contestant as the season progresses
    ep_counts['cumulative_player_counts'] = ep_counts.groupby('contestant_season_id')['count'].transform(lambda x: x.cumsum())
    ep_counts.sort_values(by=['episode_id', 'cumulative_player_counts'], inplace=True)
    ep_counts['total_counts'] = ep_counts.groupby('contestant_season_id')['count'].transform('sum')
    return ep_counts
def create_racetrack_by_episode(df, *args, **kwargs):
ep_counts = create_episode_counts(df)
fig = bar(y='first_name', x='cumulative_player_counts',
data_frame=ep_counts, animation_frame='episode_name',
range_x=[0, ep_counts['cumulative_player_counts'].max() * 1.05],
*args, **kwargs)
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 2000
return fig
def create_racetrack_by_episode_for_season(df, season_id, *args, **kwargs):
subset_df = df[df['season_id'] == season_id]
season_name = subset_df['season_name'].iloc[0]
set_kwargs = dict(title = f'Cumulative Comment Counts for Season: <b>{season_name}</b>',
labels=
{'first_name': 'First Name',
'cumulative_player_counts': 'Number of Comments',
'episode_name': 'Episode Name'})
set_kwargs.update(kwargs)
return create_racetrack_by_episode(subset_df, *args, **set_kwargs)
The plots below are animated plots for each of the seasons covered by the Reddit data. These are what I'm calling racetrack plots (hence the function names above) -- really just animated bar plots where the categories (in this case, the contestants) race against one another toward the right side of the chart.
In this case, to make things interesting, I only show a contestant's bar while they are still picking up comments on Reddit -- once they are voted out of the game, their total drops back to zero. You may notice that a bar dropping to zero lines up very closely with whoever was most recently voted out. This makes sense -- for most players, once they are voted out they are rarely, if ever, mentioned again on Reddit in the episode threads that follow. People have short memories.
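To put a rough number on that observation, here is a quick check of my own (not part of the original plots). It reuses create_episode_counts from above and the placement column pulled in by the SQL query, which I'm assuming is the finishing position with 1 for the winner:
# Sketch: for one season, see whether contestants who last longer (smaller placement) also
# keep picking up mentions later into the season. Assumes episode_id increases in airing
# order and placement is 1 for the winner -- both assumptions on my part.
season_id = reddit_df['season_id'].iloc[0]
season_df = reddit_df[reddit_df['season_id'] == season_id]
season_counts = create_episode_counts(season_df)

# last episode in which each contestant is mentioned at all
last_mention = (season_counts[season_counts['count'] > 0]
                .groupby('contestant_season_id')['episode_id'].max()
                .rename('last_mention_episode'))
placement = season_df.groupby('contestant_season_id')['placement'].first()

print(pd.concat([last_mention, placement], axis=1).corr(method='spearman'))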
Plotly Express makes this plot very easy to build, at the cost of some customization. The quirk mentioned above was essentially unavoidable for this plot without some big workarounds. Still, the animation capabilities of Plotly Express are worth showing off. As you can see below, you can press the play button to start the animation, and it will play episode by episode until the season is "over" (the last episode of the season). After each episode, the remaining contestants are re-sorted. Keep an eye on the players -- they can be a bit hard to track!
for season in reddit_df['season_id'].unique():
    fig = create_racetrack_by_episode_for_season(reddit_df, season, height=1000, width=1000)
    fig.show()
Episode by Episode Breakdown¶
While the last plots did show us how the cumulative comment counts grew over time for each contestant, we'd like to visualize the difference between episodes a bit more clearly. Additionally, it would be nice to have the bars stick around even after a player is no longer getting new comments. For this, we will use our own custom Plotly animation function below to generate similar plots for each of these seasons.
def plot_episode_by_episode_breakdown_by_season(df, season_id, *args, **kwargs):
reduced = df[df['season_id'] == season_id]
season_name = reduced['season_name'].iloc[0]
set_kwargs = dict(title = f'Cumulative Comment Counts by Episode for Season <b>{season_name}</b>',
xaxis=dict(title='Number of Comments', autorange=False),
yaxis=dict(title='First Name'))
set_kwargs.update(kwargs)
return plot_episode_by_episode_breakdown(reduced, *args, **set_kwargs)
def plot_episode_by_episode_breakdown(df, *args, **kwargs):
ep_counts = create_episode_counts(df)
ep_counts.sort_values(by='total_counts', inplace=True)
episodes = ep_counts['episode_id'].unique()
episodes.sort()
empty = dict(type='bar', orientation='h',
y=ep_counts['first_name'].unique(),
x=[None] * ep_counts['first_name'].nunique())
traces = [empty.copy() for i in range(len(episodes))]
frames = []
sliders_dict = {
"active": 0,
"yanchor": "top",
"xanchor": "left",
"currentvalue": {
"font": {"size": 20},
"prefix": "Episode: ",
"visible": True,
"xanchor": "right"
},
"transition": {"duration": 300, "easing": "cubic-in-out"},
"pad": {"b": 10, "t": 50},
"len": 0.9,
"x": 0.1,
"y": 0,
"steps": []
}
for i, ep in enumerate(episodes):
fr_dict = dict(type='bar', orientation='h')
new_bool = ep_counts['episode_id'] == ep
ep_name = ep_counts.loc[new_bool, 'episode_name'].iloc[0]
traces[i]['name'] = ep_name
fr_dict.update(dict(y = ep_counts.loc[new_bool, 'first_name'].reset_index(drop=True),
x = ep_counts.loc[new_bool, 'count'].reset_index(drop=True)))
        if i > 0:
            # carry forward the previous episode's traces so the bars stack cumulatively
            last_frame = deepcopy(frames[-1])
            last_frame['data'].append(fr_dict)
            last_frame['traces'] += [i]
        else:
            last_frame = dict(data=[fr_dict], traces=[0])
        # give the frame a name so the slider steps (which animate by frame name) can find it
        last_frame['name'] = ep_name
        frames.append(last_frame)
slider_step = {"args": [
[ep_name],
{"frame": {"duration": 300, "redraw": False},
"mode": "immediate",
"transition": {"duration": 300}}
],
"label": ep_name,
"method": "animate"}
sliders_dict["steps"].append(slider_step)
layout = go.Layout(width=1000,
height=1000,
showlegend=True,
hovermode='closest')
layout["sliders"] = [sliders_dict]
layout["updatemenus"] = [
{
"buttons": [
{
"args": [None, {"frame": {"duration": 500, "redraw": False},
"fromcurrent": True, "transition": {"duration": 300,
"easing": "quadratic-in-out"}}],
"label": "Play",
"method": "animate"
},
{
"args": [[None], {"frame": {"duration": 0, "redraw": False},
"mode": "immediate",
"transition": {"duration": 0}}],
"label": "Pause",
"method": "animate"
}
],
"direction": "left",
"pad": {"r": 10, "t": 87},
"showactive": False,
"type": "buttons",
"x": 0.1,
"xanchor": "right",
"y": 0,
"yanchor": "top"
}
]
try:
kwargs['xaxis'].update(range=[0, ep_counts['total_counts'].max() * 1.05])
except KeyError:
kwargs['xaxis'] = dict(range=[0, ep_counts['total_counts'].max() * 1.05])
layout.update(barmode='stack',
*args, **kwargs)
fig = go.Figure(data=traces, frames=frames, layout=layout)
return fig
for season in reddit_df['season_id'].unique():
plot_episode_by_episode_breakdown_by_season(reddit_df, season).show()
Conclusions¶
This first jump into the analysis definitely gave us some interesting results! Some high-level takeaways:
r/survivor has been around since 2011. Over the years, engagement (in terms of comments) has increased by quite a lot.
Using absolute comment counts doesn't seem to be the way to go when comparing across seasons -- instead, comment counts relative to the season total seemed to work best.
The players who are the most popular according to Reddit comments pass the common sense check -- Tony and Rob Mariano top the list.
Comments (or the lack thereof) can be used as a proxy for in-game events. We've only just started to see this (without even looking at the body of the text), but we can already see that, in most cases, a contestant's Reddit mentions drop off to zero or near zero after they are eliminated from the game. This could be interesting to dive deeper into (especially if we eventually want to build a model!)
Generally, as we can see above, players who last longer tend to dominate the comment counts. Only in rare cases does a contestant voted out before the final 4 or 5 end up surpassing those finalists in Reddit comments (a rough check of this is sketched below). This passes the sniff test -- we remember contestants who last longer, even ones we wouldn't normally like, simply because they stay "in the running" as fewer contestants remain.
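As a rough check of that last point (a sketch of mine, again assuming placement is the finishing position with 1 for the winner):
# Per season: does any contestant booted before the final five out-mention the
# least-mentioned member of the final five? (placement <= 5 as a stand-in for "final 4 or 5")
totals = (reddit_df.groupby(['season_name', 'contestant_season_id', 'placement'])
          .size().rename('mentions').reset_index())
final_five_min = totals[totals['placement'] <= 5].groupby('season_name')['mentions'].min()
early_boot_max = totals[totals['placement'] > 5].groupby('season_name')['mentions'].max()
print((early_boot_max > final_five_min).value_counts())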
This has been an interesting look at the rich dataset of Reddit comments on r/survivor. But we aren't done yet!
The next installment of the Survivor analysis series will use sentiment analysis and other attributes of the comment text to try to glean information about the contestants and Reddit users' behavior. Stay tuned -- you won't want to miss it!