Stats Works
  • About This Website

Survivor: Outwit, Outplay, Out...analyze? (Part 2) Looking at Reddit Mentions

Analyzing the Game of Survivor -- Looking at Fan Favorites via Reddit (2)¶

After reading through [my last post] on the ETL process, you may be interested in some of the data that I had collected. Fear not, my dear data science enthusiast, for I have come baring gifts of analysis!

In this second installment of the series, I will begin the analysis by digging into some of the Reddit data that I had collected via the Pushift.io API. For more information on how this data was collected, please check out the [first article in this series], where I describe the ETL process for the Pushift (as well as other!) data.

Looking at Contestants Mentioned in Comments¶

The first query I will be using uses a few different tables. First, I use a CTE (Common Table Expression) which combines information from the contestant and episode performance stats tables. I also contain, in a separate dataframe, the episode table, which may come in handy later.

Then, we look at instances where the first or last name is contained inside of the body of the comment for comments made within a particular season. While there are cases where this will not work (for instance, when a contestant is best known by their nickname or when a shortened, or misspelled, version is used), it should give us a sense of the comments pertaining to particular players.

It's worth noting that the comments only go back to 2011 -- and the number of comments have greatly increased over time. We try a few ways of normalizing based on this information, but some players (particularly older players) will not be considered in this analysis.D

In [1]:
import os
from sqlalchemy import create_engine
import numpy as np
import pandas as pd

import plotly.graph_objects as go  
from copy import deepcopy

from plotly.express import line, bar
In [2]:
pg_un, pg_pw, pg_ip, pg_port = [os.getenv(x) for x in ['PG_UN', 'PG_PW', 'PG_IP', 'PG_PORT']]
In [3]:
def pg_uri(un, pw, ip, port):
    return f'postgresql://{un}:{pw}@{ip}:{port}'
In [4]:
eng = create_engine(pg_uri(pg_un, pg_pw, pg_ip, pg_port))
In [5]:
sql = '''
WITH contestants_to_seasons AS (
SELECT c.contestant_id, c.first_name, 
	   c.last_name, cs.contestant_season_id, 
	   cs.season_id, occupation, location, age, placement, 
	   days_lasted, votes_against, 
	   med_evac, quit, individual_wins, attempt_number, 
	   tribe_0, tribe_1, tribe_2, tribe_3, alliance_0, 
	   alliance_1, alliance_2,
	   challenge_wins, challenge_appearances, sitout, 
	   voted_for_bootee, votes_against_player, 
	   total_number_of_votes_in_episode, tribal_council_appearances, 
	   votes_at_council, number_of_jury_votes, total_number_of_jury_votes, 
	   number_of_days_spent_in_episode, days_in_exile, 
	   individual_reward_challenge_appearances, individual_reward_challenge_wins, 
	   individual_immunity_challenge_appearances, individual_immunity_challenge_wins, 
	   tribal_reward_challenge_appearances, tribal_reward_challenge_wins, 
	   tribal_immunity_challenge_appearances, tribal_immunity_challenge_wins, 
	   tribal_reward_challenge_second_of_three_place, tribal_immunity_challenge_second_of_three_place, 
	   fire_immunity_challenge, tribal_immunity_challenge_third_place, episode_id
FROM survivor.contestant c
RIGHT JOIN survivor.contestant_season cs
ON c.contestant_id = cs.contestant_id
JOIN survivor.episode_performance_stats eps
ON eps.contestant_id = cs.contestant_season_id
), matched_exact AS 
(
SELECT reddit.*, c.*
FROM survivor.reddit_comments reddit
JOIN contestants_to_seasons c
ON (POSITION(c.first_name IN reddit.body) > 0
OR POSITION(c.last_name IN reddit.body) > 0)
AND c.season_id = reddit.within_season
AND c.episode_id = reddit.most_recent_episode
WHERE within_season IS NOT NULL
)
SELECT * 
FROM matched_exact m
'''
In [6]:
reddit_df = pd.read_sql(sql, eng)
In [7]:
ep_df = pd.read_sql('SELECT * FROM survivor.episode', eng)
In [8]:
season_to_name = pd.read_sql('SELECT season_id, name AS season_name FROM survivor.season', eng)
In [9]:
reddit_df = reddit_df.merge(season_to_name, on='season_id')
In [10]:
reddit_df.rename(columns={'name': 'season_name'}, inplace=True)
In [11]:
reddit_df = reddit_df.merge(ep_df.drop(columns=['season_id']), on='episode_id')
In [12]:
reddit_df['created_dt'] = pd.to_datetime(reddit_df['created_dt'])
In [13]:
pd.options.display.max_columns = 100

Taking a Look At The Data¶

In [14]:
reddit_df.head()
Out[14]:
index_x author author_created_utc author_flair_css_class author_flair_text author_fullname body controversiality created_utc distinguished gilded id link_id nest_level parent_id reply_delay retrieved_on score score_hidden subreddit subreddit_id edited user_removed mod_removed stickied author_cakeday can_gild collapsed collapsed_reason is_submitter gildings permalink permalink_url updated_utc subreddit_type no_follow send_replies author_flair_template_id author_flair_background_color author_flair_richtext author_flair_text_color author_flair_type rte_mode subreddit_name_prefixed all_awardings associated_award author_patreon_flair author_premium awarders collapsed_because_crowd_control ... attempt_number tribe_0 tribe_1 tribe_2 tribe_3 alliance_0 alliance_1 alliance_2 challenge_wins challenge_appearances sitout voted_for_bootee votes_against_player total_number_of_votes_in_episode tribal_council_appearances votes_at_council number_of_jury_votes total_number_of_jury_votes number_of_days_spent_in_episode days_in_exile individual_reward_challenge_appearances individual_reward_challenge_wins individual_immunity_challenge_appearances individual_immunity_challenge_wins tribal_reward_challenge_appearances tribal_reward_challenge_wins tribal_immunity_challenge_appearances tribal_immunity_challenge_wins tribal_reward_challenge_second_of_three_place tribal_immunity_challenge_second_of_three_place fire_immunity_challenge tribal_immunity_challenge_third_place episode_id season_name index_y summary story challenges trivia image firstbroadcast viewership wiki_link season_episode_number overall_episode_number overall_slot_rating survivor_rating episode_name created_y updated_y
0 4540248 sampete1157 NaN None None t2_yvimwo1 Yul NaN 1585333566 None NaN flo8j0x t3_fpql0y None t1_flmiy69 NaN 1.585335e+09 1.0 None survivor t5_2qhu3 NaN None None false None None None None false {} /r/survivor/comments/fpql0y/yul/flo8j0x/ None NaN None true true None None [] None text None None [] None false false [] None ... 2.0 NaN NaN NaN NaN NaN NaN NaN 0.125 0.5 0.0 0.0 3.0 4.0 1.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 NaN NaN -2.0 NaN 695.0 Winners at War 695 We're in the Majors is the seventh episode of ... Getting voted out before the merge-- that's so... Challenge: (No Title)Two members of each tribe... * "Boa Constrictor at Yara" (Day 17): At Yara... https://vignette.wikia.nocookie.net/survivor/i... 2020-03-25 818000000.0 https://survivor.fandom.com/wiki/We%27re_in_th... 7.0 590.0 8.0 1.7 We%27re in the Majors 2020-07-11 01:03:00.566347+00:00 2020-07-19 00:48:27.943994+00:00
1 4538586 lvl4lapras NaN None None t2_54mpbhfb Yul NaN 1585316533 None NaN flne39u t3_fpql0y None t3_fpql0y NaN 1.585317e+09 1.0 None survivor t5_2qhu3 NaN None None false None None None None false {} /r/survivor/comments/fpql0y/yul/flne39u/ None NaN None true true None None [] None text None None [] None false false [] None ... 2.0 NaN NaN NaN NaN NaN NaN NaN 0.125 0.5 0.0 0.0 3.0 4.0 1.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 NaN NaN -2.0 NaN 695.0 Winners at War 695 We're in the Majors is the seventh episode of ... Getting voted out before the merge-- that's so... Challenge: (No Title)Two members of each tribe... * "Boa Constrictor at Yara" (Day 17): At Yara... https://vignette.wikia.nocookie.net/survivor/i... 2020-03-25 818000000.0 https://survivor.fandom.com/wiki/We%27re_in_th... 7.0 590.0 8.0 1.7 We%27re in the Majors 2020-07-11 01:03:00.566347+00:00 2020-07-19 00:48:27.943994+00:00
2 4539257 swells61 NaN 34Gold WS33W J.T. t2_fex1f Yul NaN 1585324546 None NaN flnrm96 t3_fpql0y None t1_flmiy69 NaN 1.585325e+09 1.0 None survivor t5_2qhu3 NaN None None false None None None None false {} /r/survivor/comments/fpql0y/yul/flnrm96/ None NaN None true true None None [{'e': 'text', 't': 'J.T.'}] dark richtext None None [] None false false [] None ... 2.0 NaN NaN NaN NaN NaN NaN NaN 0.125 0.5 0.0 0.0 3.0 4.0 1.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 NaN NaN -2.0 NaN 695.0 Winners at War 695 We're in the Majors is the seventh episode of ... Getting voted out before the merge-- that's so... Challenge: (No Title)Two members of each tribe... * "Boa Constrictor at Yara" (Day 17): At Yara... https://vignette.wikia.nocookie.net/survivor/i... 2020-03-25 818000000.0 https://survivor.fandom.com/wiki/We%27re_in_th... 7.0 590.0 8.0 1.7 We%27re in the Majors 2020-07-11 01:03:00.566347+00:00 2020-07-19 00:48:27.943994+00:00
3 4539655 ekwag NaN 40Gold WW Nick t2_fssfb Yul NaN 1585328171 None NaN flnyaj9 t3_fpql0y None t3_fpql0y NaN 1.585329e+09 1.0 None survivor t5_2qhu3 NaN None None false None None None None false {} /r/survivor/comments/fpql0y/yul/flnyaj9/ None NaN None true true None None [{'e': 'text', 't': 'Nick'}] dark richtext None None [] None false false [] None ... 2.0 NaN NaN NaN NaN NaN NaN NaN 0.125 0.5 0.0 0.0 3.0 4.0 1.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 NaN NaN -2.0 NaN 695.0 Winners at War 695 We're in the Majors is the seventh episode of ... Getting voted out before the merge-- that's so... Challenge: (No Title)Two members of each tribe... * "Boa Constrictor at Yara" (Day 17): At Yara... https://vignette.wikia.nocookie.net/survivor/i... 2020-03-25 818000000.0 https://survivor.fandom.com/wiki/We%27re_in_th... 7.0 590.0 8.0 1.7 We%27re in the Majors 2020-07-11 01:03:00.566347+00:00 2020-07-19 00:48:27.943994+00:00
4 4539815 Lunarmise NaN None None t2_1r7mqcjo Yul NaN 1585329676 None NaN flo13ml t3_fpql0y None t1_flo0u7q NaN 1.585330e+09 1.0 None survivor t5_2qhu3 NaN None None false None None None None false {} /r/survivor/comments/fpql0y/yul/flo13ml/ None NaN None true true None None [] None text None None [] None false false [] None ... 2.0 NaN NaN NaN NaN NaN NaN NaN 0.125 0.5 0.0 0.0 3.0 4.0 1.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 NaN NaN -2.0 NaN 695.0 Winners at War 695 We're in the Majors is the seventh episode of ... Getting voted out before the merge-- that's so... Challenge: (No Title)Two members of each tribe... * "Boa Constrictor at Yara" (Day 17): At Yara... https://vignette.wikia.nocookie.net/survivor/i... 2020-03-25 818000000.0 https://survivor.fandom.com/wiki/We%27re_in_th... 7.0 590.0 8.0 1.7 We%27re in the Majors 2020-07-11 01:03:00.566347+00:00 2020-07-19 00:48:27.943994+00:00

5 rows ร— 126 columns

In [15]:
reddit_df.shape
Out[15]:
(1149008, 126)

There is a wealth of data here -- the actual content of the message, other Reddit information (like the user, upvotes, flairs, etc.) For this part of the analysis, we will just be looking at the occurances of the names in the body of the comment. In the next installment, we will look a bit deeper at some of the text inside the body and how that relates to the contestants. Additionally, we will take a look at some of the users and other information in later installments.

One thing to note is what the above query did the heavy lifting for -- finding the first and last names in the bodies of texts. So this dataframe (at a short 1.1 M rows) represents only the comments that had either one of these. Comments can appear multiple times if they contain multiple names.

Comments Per Season¶

The first -- and most obvious -- question we can answer is -- how many comments are there each season? And, which seasons are represented by the subreddit?

In [16]:
from plotly.express import bar, line
In [17]:
def plot_season_comments(df):
    comments_per_season = df.groupby('season_name').size().reset_index()
    comments_per_season.rename(columns={'season_name': 'Season', 0: 'Number of Comments (with names)'}, inplace=True)
    return bar(data_frame=comments_per_season.sort_values(by='Number of Comments (with names)'), 
               x='Season', y='Number of Comments (with names)')
In [18]:
plot_season_comments(reddit_df)