
Survivor: Outwit, Outplay, Out...analyze? (Part 2) Looking at Reddit Mentions

Analyzing the Game of Survivor -- Looking at Fan Favorites via Reddit (2)

After reading through [my last post] on the ETL process, you may be interested in some of the data that I collected. Fear not, my dear data science enthusiast, for I have come bearing gifts of analysis!

In this second installment of the series, I will begin the analysis by digging into some of the Reddit data that I collected via the Pushshift.io API. For more information on how this data was collected, please check out the [first article in this series], where I describe the ETL process for the Pushshift (as well as other!) data.

Looking at Contestants Mentioned in Comments

The first query uses a few different tables. First, I use a CTE (Common Table Expression) that combines information from the contestant and episode performance stats tables. I also load the episode table into a separate dataframe, which may come in handy later.

Then, we look at instances where a contestant's first or last name appears in the body of a comment made during that contestant's season. While there are cases where this will not work (for instance, when a contestant is best known by a nickname, or when a shortened or misspelled version of the name is used), it should give us a sense of which comments pertain to which players.

It's worth noting that the comments only go back to 2011 -- and the number of comments has greatly increased over time. We try a few ways of normalizing for this below, but some players (particularly older players) will not be well represented in this analysis.

In [1]:
import os
from sqlalchemy import create_engine
import numpy as np
import pandas as pd

import plotly.graph_objects as go  
from copy import deepcopy

from plotly.express import line, bar
In [2]:
pg_un, pg_pw, pg_ip, pg_port = [os.getenv(x) for x in ['PG_UN', 'PG_PW', 'PG_IP', 'PG_PORT']]
In [3]:
def pg_uri(un, pw, ip, port):
    return f'postgresql://{un}:{pw}@{ip}:{port}'
In [4]:
eng = create_engine(pg_uri(pg_un, pg_pw, pg_ip, pg_port))
In [5]:
sql = '''
WITH contestants_to_seasons AS (
SELECT c.contestant_id, c.first_name, 
	   c.last_name, cs.contestant_season_id, 
	   cs.season_id, occupation, location, age, placement, 
	   days_lasted, votes_against, 
	   med_evac, quit, individual_wins, attempt_number, 
	   tribe_0, tribe_1, tribe_2, tribe_3, alliance_0, 
	   alliance_1, alliance_2,
	   challenge_wins, challenge_appearances, sitout, 
	   voted_for_bootee, votes_against_player, 
	   total_number_of_votes_in_episode, tribal_council_appearances, 
	   votes_at_council, number_of_jury_votes, total_number_of_jury_votes, 
	   number_of_days_spent_in_episode, days_in_exile, 
	   individual_reward_challenge_appearances, individual_reward_challenge_wins, 
	   individual_immunity_challenge_appearances, individual_immunity_challenge_wins, 
	   tribal_reward_challenge_appearances, tribal_reward_challenge_wins, 
	   tribal_immunity_challenge_appearances, tribal_immunity_challenge_wins, 
	   tribal_reward_challenge_second_of_three_place, tribal_immunity_challenge_second_of_three_place, 
	   fire_immunity_challenge, tribal_immunity_challenge_third_place, episode_id
FROM survivor.contestant c
RIGHT JOIN survivor.contestant_season cs
ON c.contestant_id = cs.contestant_id
JOIN survivor.episode_performance_stats eps
ON eps.contestant_id = cs.contestant_season_id
), matched_exact AS 
(
SELECT reddit.*, c.*
FROM survivor.reddit_comments reddit
JOIN contestants_to_seasons c
ON (POSITION(c.first_name IN reddit.body) > 0
OR POSITION(c.last_name IN reddit.body) > 0)
AND c.season_id = reddit.within_season
AND c.episode_id = reddit.most_recent_episode
WHERE within_season IS NOT NULL
)
SELECT * 
FROM matched_exact m
'''
In [6]:
reddit_df = pd.read_sql(sql, eng)
In [7]:
ep_df = pd.read_sql('SELECT * FROM survivor.episode', eng)
In [8]:
season_to_name = pd.read_sql('SELECT season_id, name AS season_name FROM survivor.season', eng)
In [9]:
reddit_df = reddit_df.merge(season_to_name, on='season_id')
In [10]:
reddit_df.rename(columns={'name': 'season_name'}, inplace=True)
In [11]:
reddit_df = reddit_df.merge(ep_df.drop(columns=['season_id']), on='episode_id')
In [12]:
reddit_df['created_dt'] = pd.to_datetime(reddit_df['created_dt'])
In [13]:
pd.options.display.max_columns = 100

Taking a Look At The Data

In [14]:
reddit_df.head()
Out[14]:
[wide dataframe preview omitted: each of the 5 rows pairs a Reddit comment's fields (author, body, created_utc, score, permalink, flair fields, ...) with the matched contestant's per-episode stats and the episode's metadata (summary, firstbroadcast, viewership, ratings). The example rows are comments mentioning "Yul" during Winners at War.]

5 rows × 126 columns

In [15]:
reddit_df.shape
Out[15]:
(1149008, 126)

There is a wealth of data here -- the actual content of the message, plus other Reddit information (the user, upvotes, flairs, etc.). For this part of the analysis, we will just be looking at the occurrences of contestant names in the body of each comment. In the next installment, we will look a bit deeper at the text inside the body and how it relates to the contestants; later installments will take a look at the users and other information.

One thing to note is that the above query already did the heavy lifting -- finding the first and last names in the comment bodies. So this dataframe (a trim 1.1M rows) contains only the comments that matched at least one name. A comment can appear multiple times if it contains multiple names.
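Since the query explodes comments into one row per matched name, a quick sanity check is to compare the row count against the number of distinct comments. A minimal sketch, assuming the Reddit `id` column (visible in the preview above) uniquely identifies a comment:

# One row per (comment, matched name) pair, so rows over-count comments.
n_rows = len(reddit_df)
n_comments = reddit_df['id'].nunique()  # assumes `id` uniquely identifies a comment
print(f'{n_rows:,} name-mention rows from {n_comments:,} distinct comments '
      f'(~{n_rows / n_comments:.2f} name matches per comment)')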

Comments Per Season

The first -- and most obvious -- questions we can answer: how many comments are there each season, and which seasons are represented on the subreddit?

In [16]:
from plotly.express import bar, line
In [17]:
def plot_season_comments(df):
    # Count name-mention rows per season and plot them, sorted by count.
    comments_per_season = df.groupby('season_name').size().reset_index()
    comments_per_season.rename(columns={'season_name': 'Season', 0: 'Number of Comments (with names)'}, inplace=True)
    return bar(data_frame=comments_per_season.sort_values(by='Number of Comments (with names)'), 
               x='Season', y='Number of Comments (with names)')
In [18]:
plot_season_comments(reddit_df)

We can see that the seasons represented are those since 2011. We also see that certain seasons, particularly the more recent ones, have many more comments than others. This makes sense intuitively, as Reddit usage has grown considerably over the years. Winners at War, the most recent season, has the most Reddit comments; it also drew strong TV viewership, as it was an "all-star" style season.

To see this increase a bit more clearly, we can look at this over time based on the broadcast date of the episodes:

In [19]:
def plot_season_comments_time(df):
    # Group by broadcast year and season, keeping seasons in broadcast order.
    comments_per_season = (df.sort_values(by='firstbroadcast')
                             .groupby([df['firstbroadcast'].sort_values().dt.year, 'season_name'])
                             .size()
                             .reset_index())
    comments_per_season['Season, year'] = comments_per_season['season_name'] + ', ' + comments_per_season['firstbroadcast'].astype(str)
    comments_per_season.rename(columns={0: 'Number of Comments (with names)'}, inplace=True)
    return line(data_frame=comments_per_season, x='Season, year', y='Number of Comments (with names)')
In [20]:
plot_season_comments_time(reddit_df)

Not too different from the sorted chart above, apart from a few dips and peaks in certain years. Interestingly, two confounding factors are at play here -- the increasing popularity of Reddit over time, and the popularity of some seasons over others. Something we must keep in mind throughout this analysis!
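One rough way to account for the platform-growth confounder is to express each season's count as a share of all mentions from seasons broadcast in the same calendar year. This is only a partial fix (most years contain just one or two seasons), but here's a minimal sketch under that assumption:

# Share of each year's mentions captured by each season; partially controls
# for Reddit's overall growth (crude, since most years have <= 2 seasons).
yearly = reddit_df.assign(year=reddit_df['firstbroadcast'].dt.year)
per_season = yearly.groupby(['year', 'season_name']).size()
pct_of_year = per_season / per_season.groupby(level='year').transform('sum') * 100
print(pct_of_year.round(1).rename('pct_of_year').reset_index())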

Comments Per Contestant¶

Now, to jump into the meat of the reason for this query: looking at the number of comments about particular contestants.

First we look at the contestants with the highest absolute count of comments on Reddit. Since different seasons may have more (or fewer) comments based on factors unrelated to the popularity of the season itself, this will not necessarily give us an unbiased answer. However, it is still interesting to consider this in both absolute and relative terms.

For the absolute chart, we count the mentions of each contestant and plot a bar chart of the top 20. For the relative chart, we consider each contestant's share of the total mentions in their season.

In [21]:
def plot_sorted_n_alltime(df, total=True, n=20, top=True):
    
    grper = ['contestant_season_id', 'first_name', 'last_name']
    
    measured = 'Number of Mentions' if total else 'Percent of Season Mentions'
    abs_rel = 'Absolute' if total else 'Relative'
    top_bot = 'Top' if top else 'Bottom'
    title = f'{top_bot} {n} by {abs_rel} Number of Mentions'
    
    # Count mentions per contestant; in the relative view, divide by the
    # season's total number of mentions before taking the top/bottom n.
    totals = df.groupby('season_name').apply(lambda x: x.groupby(grper).size() / (x.shape[0] if not total else 1)).sort_values()
    totals_clipped = totals.tail(n) if top else totals.head(n)
    totals_clipped = totals_clipped.reset_index()
    
    if not total:
        totals_clipped[0] *= 100
    totals_clipped['full_name'] = totals_clipped['first_name'] + \
                                 ' ' + totals_clipped['last_name'] + \
                                 ', ' + totals_clipped['season_name'].astype(str)
    totals_clipped.rename(columns={0: measured, 'full_name': 'Contestant'}, inplace=True)
    return bar(data_frame=totals_clipped, x=measured, y='Contestant', title=title)
    
In [22]:
plot_sorted_n_alltime(reddit_df)
In [23]:
plot_sorted_n_alltime(reddit_df, False)

We see some interesting results. First, some of the big names are at the top of this list -- Tony is a very popular player, as well as a two-time winner of the game. The first chart is dominated by contestants from Winners at War, which makes sense since that season has many more comments than the others. Even Adam Klein makes this list, although he may have been one of the least popular contestants on that season.

The relative chart gives a more holistic view -- the top 8 contestants are immediately recognizable to me as memorable contestants from past seasons. Rob Mariano, or "Boston Rob", is one of the most popular players the show has ever had. As is the resident nerd figure, John Cochran.

We could look at this a few different ways, of course -- if we looked at the bottom of this list, I'm sure we'd see a lot of people we've never heard of who were voted out in the first episode. Out of curiosity, let's check it out!

In [24]:
plot_sorted_n_alltime(reddit_df, False, top=False)

Hm, the results here are somewhat interesting! On one end, there are some names I have certainly never heard of. On the other, there are some that are quite popular -- J.T., for instance. In his case, I suspect a data artifact (one, unfortunately for J.T., that isn't worth diving into in this analysis): the exact string "J.T." is probably rarely written out in comments (more likely "JT"). Other rows are contestants who were popular overall but were probably unpopular, or voted out quickly, during that particular season (like Rupert).
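For what it's worth, the exact-substring matching (the POSITION calls in the SQL above) could be loosened to catch variants like "JT". A hypothetical sketch of a regex-based alternative -- the helper name and examples are mine, not part of the pipeline:

import re

def name_pattern(name):
    # Escape the name, then make each literal '.' optional, so 'J.T.'
    # also matches 'JT' or 'J.t'. Word boundaries avoid partial-word hits.
    pattern = re.escape(name).replace(r'\.', r'\.?')
    return re.compile(rf'\b{pattern}\b', flags=re.IGNORECASE)

jt = name_pattern('J.T.')
print(bool(jt.search('I think JT played that beautifully')))  # True
print(bool(jt.search('J.t. got blindsided')))                 # True
# Applied to the dataframe: reddit_df['body'].str.contains(jt)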

Then, you have people who are notoriously unlucky, like Francesca Hogi, who was voted out in the first episode of both seasons she played. Hate to say it, but she's exactly who you'd expect at the bottom of this list!

Season Breakdown

While there were definitely some interesting takeaways looking at this in the aggregate, the next step is to drill down into individual seasons -- when did contestants get the most comments? Were some contestants popular (and then voted out)?

The next plots look into this.

In [25]:
from plotly.express import bar
In [26]:
def create_count_from_episode_df(ep_df, unique_idxs):
    # Count this episode's mentions per contestant, reindexed so that every
    # contestant in the data appears (with 0 where they were not mentioned).
    grper = ['season_id', 'contestant_season_id', 'first_name', 'last_name']
    reindexed = ep_df.groupby(grper).size().reindex(pd.MultiIndex.from_frame(unique_idxs))
    reindexed.name = 'count'
    reindexed = reindexed.fillna(0)  # a dict here would fill by index label, not value
    return reindexed.reset_index()



def create_episode_counts(df):
    # Build per-episode mention counts for every contestant, then add a
    # running (cumulative) count and a season-total count per contestant.
    grper = ['season_id', 'contestant_season_id', 'first_name', 'last_name']
    
    idx = df[grper].drop_duplicates()
    ep_counts = df.groupby(['episode_id', 'episode_name']).apply(lambda x: create_count_from_episode_df(x, idx))
    ep_counts = ep_counts.reset_index()
    ep_counts['cumulative_player_counts'] = ep_counts.groupby('contestant_season_id')['count'].transform(lambda x: x.cumsum())
    ep_counts.sort_values(by=['episode_id', 'cumulative_player_counts'], inplace=True)
    ep_counts['total_counts'] = ep_counts.groupby('contestant_season_id')['count'].transform('sum')
    return ep_counts


def create_racetrack_by_episode(df, *args, **kwargs):
    ep_counts = create_episode_counts(df)
    fig = bar(y='first_name', x='cumulative_player_counts', 
               data_frame=ep_counts,  animation_frame='episode_name', 
               range_x=[0, ep_counts['cumulative_player_counts'].max() * 1.05],
               *args, **kwargs)
    fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 2000
    return fig

def create_racetrack_by_episode_for_season(df, season_id, *args, **kwargs):
    subset_df = df[df['season_id'] == season_id]
    season_name = subset_df['season_name'].iloc[0]
    set_kwargs = dict(title = f'Cumulative Comment Counts for Season: <b>{season_name}</b>',
                      labels=
                     {'first_name': 'First Name', 
                      'cumulative_player_counts': 'Number of Comments',
                      'episode_name': 'Episode Name'})
    set_kwargs.update(kwargs)
    return create_racetrack_by_episode(subset_df, *args, **set_kwargs)
    

The plots below are animated charts for each season in the Reddit data. These are what I'm calling racetrack plots -- really, animated bar plots in which the categories (in this case, the contestants) race one another toward the right side of the chart.

In this case, to make things interesting, I only count a contestant while they are still receiving comments on Reddit -- once you are voted out of the game, your total drops back to zero. An interesting thing you may notice is that this tracks elimination order very closely. That makes sense -- most players, once voted out, are rarely (if ever) mentioned again in the threads for subsequent episodes. People have short memories.
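To make that drop-off concrete, here's a minimal sketch using the create_episode_counts helper defined above: for a single season, find the last episode in which each contestant was mentioned at all. It assumes episode_id increases in broadcast order -- the same assumption the cumulative-count code above relies on.

# For one season, the last episode in which each contestant was mentioned.
season_id = reddit_df['season_id'].iloc[0]  # any season works here
ep_counts = create_episode_counts(reddit_df[reddit_df['season_id'] == season_id])
last_mentioned = (ep_counts[ep_counts['count'] > 0]
                  .groupby(['first_name', 'last_name'])['episode_id']
                  .max()
                  .sort_values())
print(last_mentioned)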

Plotly Express makes this plot very easy to build, at the cost of some customization. The reset-to-zero behavior mentioned above was unavoidable here without some significant workarounds. Still, the animation capabilities of Plotly Express are worth showing off. As you can see below, you can press the play button to start the animation, and it will play episode by episode until the season is "over" (the last episode of the season). After each episode, the remaining contestants are re-sorted. Keep an eye on the players -- they can be a bit hard to track!

In [27]:
for season in reddit_df['season_id'].unique():
    create_racetrack_by_episode_for_season(reddit_df, season, height=1000, width=1000).show()

Episode by Episode Breakdown

While the last plots did show us how the cumulative comment counts grew over time for each contestant, we'd like to visualize the differences between episodes a bit more clearly. Additionally, it would be nice for the bars to stick around even after a player stops receiving new comments. For this, we will use our own custom Plotly animation function below to generate similar plots for each of these seasons.

In [28]:
def plot_episode_by_episode_breakdown_by_season(df, season_id, *args, **kwargs):
    reduced = df[df['season_id'] == season_id]
    season_name = reduced['season_name'].iloc[0]
    
    set_kwargs = dict(title = f'Cumulative Comment Counts by Episode for Season <b>{season_name}</b>',
                      xaxis=dict(title='Number of Comments', autorange=False), 
                      yaxis=dict(title='First Name'))
    
    set_kwargs.update(kwargs)
    return plot_episode_by_episode_breakdown(reduced, *args, **set_kwargs)
In [29]:
def plot_episode_by_episode_breakdown(df, *args, **kwargs):
    ep_counts = create_episode_counts(df)
    ep_counts.sort_values(by='total_counts', inplace=True)
    episodes = ep_counts['episode_id'].unique()
    episodes.sort()

    empty = dict(type='bar', orientation='h', 
             y=ep_counts['first_name'].unique(),
             x=[None] * ep_counts['first_name'].nunique())


    # One (initially empty) trace per episode; each animation frame reveals
    # one more trace, stacking that episode's counts onto the running total.
    traces = [empty.copy() for i in range(len(episodes))]
    frames = []
    

    sliders_dict = {
        "active": 0,
        "yanchor": "top",
        "xanchor": "left",
        "currentvalue": {
            "font": {"size": 20},
            "prefix": "Episode: ",
            "visible": True,
            "xanchor": "right"
        },
        "transition": {"duration": 300, "easing": "cubic-in-out"},
        "pad": {"b": 10, "t": 50},
        "len": 0.9,
        "x": 0.1,
        "y": 0,
        "steps": []
    }

    
    # Build one frame per episode, carrying forward all earlier episodes'
    # bars so the stacked totals accumulate.
    for i, ep in enumerate(episodes):

        fr_dict = dict(type='bar', orientation='h')
        new_bool = ep_counts['episode_id'] == ep
        ep_name = ep_counts.loc[new_bool, 'episode_name'].iloc[0]
        traces[i]['name'] = ep_name

        fr_dict.update(dict(y = ep_counts.loc[new_bool, 'first_name'].reset_index(drop=True),
                            x = ep_counts.loc[new_bool, 'count'].reset_index(drop=True)))

        if i > 0:
            last_frame = deepcopy(frames[-1])
            last_frame['data'].append(fr_dict)
            last_frame['traces'] += [i]
        else:
            last_frame = dict(data=[fr_dict], traces=[0])

        frames.append(last_frame)
        
        slider_step = {"args": [
            [ep_name],
        {"frame": {"duration": 300, "redraw": False},
         "mode": "immediate",
         "transition": {"duration": 300}}
            ],
            "label": ep_name,
            "method": "animate"}
        sliders_dict["steps"].append(slider_step)

    layout = go.Layout(width=1000,
                       height=1000,
                       showlegend=True,
                       hovermode='closest')
    
    

    layout["sliders"] = [sliders_dict]
    layout["updatemenus"] = [
        {
        "buttons": [
            {
                "args": [None, {"frame": {"duration": 500, "redraw": False},
                                "fromcurrent": True, "transition": {"duration": 300,
                                                                    "easing": "quadratic-in-out"}}],
                "label": "Play",
                "method": "animate"
            },
            {
                "args": [[None], {"frame": {"duration": 0, "redraw": False},
                                  "mode": "immediate",
                                  "transition": {"duration": 0}}],
                "label": "Pause",
                "method": "animate"
            }
        ],
        "direction": "left",
        "pad": {"r": 10, "t": 87},
        "showactive": False,
        "type": "buttons",
        "x": 0.1,
        "xanchor": "right",
        "y": 0,
        "yanchor": "top"
        }
    ]


    try:
        kwargs['xaxis'].update(range=[0, ep_counts['total_counts'].max() * 1.05])
    except KeyError:
        kwargs['xaxis'] = dict(range=[0, ep_counts['total_counts'].max() * 1.05])
    layout.update(barmode='stack',
                  *args, **kwargs)
    fig = go.Figure(data=traces, frames=frames, layout=layout)
    return fig
In [30]:
for season in reddit_df['season_id'].unique():
    plot_episode_by_episode_breakdown_by_season(reddit_df, season).show()

Conclusions

This first jump into the analysis definitely gave us some interesting results! Some high-level takeaways:

  • r/survivor has been around since 2011. Over the years, engagement (in terms of comments) has increased by quite a lot.

  • Absolute comment counts aren't the way to go when comparing across seasons -- normalizing by each season's total comment count worked much better.

  • The players who are the most popular according to Reddit comments pass the common sense check -- Tony and Rob Mariano top the list.

  • Comments (or the lack thereof) can be used as a proxy for in-game events. We've only just started to see this (without even looking at the body of the text), but we can already see that, in most cases, a contestant's Reddit mentions drop off to zero or near zero after they are eliminated from the game. This could be interesting to dive deeper into (especially if we eventually want to build a model!)

  • Generally, as we saw above, players who last longer tend to dominate the number of comments. Only in a few rare instances do contestants voted out earlier surpass the final 4 or 5 in Reddit comments. This passes the sniff test -- we remember contestants who last longer, even ones we wouldn't normally like, simply because they stay "in the running" against fewer and fewer competitors. A rough correlation check appears in the sketch below.
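As a rough check of that last point, a minimal sketch correlating final placement (1 = winner, so lower means lasting longer) with total mentions. It assumes scipy is available and uses the placement column pulled in by the original query:

from scipy.stats import spearmanr

# Spearman correlation between a contestant's placement (1 = winner, i.e.
# lower means they lasted longer) and their total number of name-mentions.
# A strongly negative rho would support the takeaway above.
per_contestant = (reddit_df
                  .groupby(['contestant_season_id', 'placement'])
                  .size()
                  .reset_index(name='n_mentions'))
rho, p = spearmanr(per_contestant['placement'], per_contestant['n_mentions'])
print(f'Spearman rho: {rho:.2f} (p = {p:.3g})')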

This has been an interesting look at the rich dataset of Reddit comments on r/survivor. But we aren't done yet!

The next installment of the Survivor analysis series will use sentiment analysis and other attributes of the comment text to try to glean information about the contestants and Reddit users' behavior. Stay tuned -- you won't want to miss it!

Sean Ammirati - creator of Stats Works. He can be reached on GitHub, LinkedIn, and by email.

Articles in Survivor

  • « Survivor: Outwit, Outplay, Out...analyze? (Part 1) Collecting The Data
  • Survivor: Outwit, Outplay, Out...analyze? (Part 3) Sentiment Analysis Exploratory Analysis »


Published

Nov 5, 2021

Category

Survivor

Tags

  • data analysis
  • project
  • reddit
  • survivor
