Analyzing the Game of Survivor -- Looking at Fan Favorites via Reddit (2)¶
After reading through [my last post] on the ETL process, you may be interested in some of the data that I had collected. Fear not, my dear data science enthusiast, for I have come baring gifts of analysis!
In this second installment of the series, I will begin the analysis by digging into some of the Reddit data that I had collected via the Pushift.io API. For more information on how this data was collected, please check out the [first article in this series], where I describe the ETL process for the Pushift (as well as other!) data.
Looking at Contestants Mentioned in Comments¶
The first query I will be using uses a few different tables. First, I use a CTE (Common Table Expression) which combines information from the contestant and episode performance stats tables. I also contain, in a separate dataframe, the episode
table, which may come in handy later.
Then, we look at instances where the first or last name is contained inside of the body of the comment for comments made within a particular season. While there are cases where this will not work (for instance, when a contestant is best known by their nickname or when a shortened, or misspelled, version is used), it should give us a sense of the comments pertaining to particular players.
It's worth noting that the comments only go back to 2011 -- and the number of comments have greatly increased over time. We try a few ways of normalizing based on this information, but some players (particularly older players) will not be considered in this analysis.D
import os
from sqlalchemy import create_engine
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from copy import deepcopy
from plotly.express import line, bar
pg_un, pg_pw, pg_ip, pg_port = [os.getenv(x) for x in ['PG_UN', 'PG_PW', 'PG_IP', 'PG_PORT']]
def pg_uri(un, pw, ip, port):
return f'postgresql://{un}:{pw}@{ip}:{port}'
eng = create_engine(pg_uri(pg_un, pg_pw, pg_ip, pg_port))
sql = '''
WITH contestants_to_seasons AS (
SELECT c.contestant_id, c.first_name,
c.last_name, cs.contestant_season_id,
cs.season_id, occupation, location, age, placement,
days_lasted, votes_against,
med_evac, quit, individual_wins, attempt_number,
tribe_0, tribe_1, tribe_2, tribe_3, alliance_0,
alliance_1, alliance_2,
challenge_wins, challenge_appearances, sitout,
voted_for_bootee, votes_against_player,
total_number_of_votes_in_episode, tribal_council_appearances,
votes_at_council, number_of_jury_votes, total_number_of_jury_votes,
number_of_days_spent_in_episode, days_in_exile,
individual_reward_challenge_appearances, individual_reward_challenge_wins,
individual_immunity_challenge_appearances, individual_immunity_challenge_wins,
tribal_reward_challenge_appearances, tribal_reward_challenge_wins,
tribal_immunity_challenge_appearances, tribal_immunity_challenge_wins,
tribal_reward_challenge_second_of_three_place, tribal_immunity_challenge_second_of_three_place,
fire_immunity_challenge, tribal_immunity_challenge_third_place, episode_id
FROM survivor.contestant c
RIGHT JOIN survivor.contestant_season cs
ON c.contestant_id = cs.contestant_id
JOIN survivor.episode_performance_stats eps
ON eps.contestant_id = cs.contestant_season_id
), matched_exact AS
(
SELECT reddit.*, c.*
FROM survivor.reddit_comments reddit
JOIN contestants_to_seasons c
ON (POSITION(c.first_name IN reddit.body) > 0
OR POSITION(c.last_name IN reddit.body) > 0)
AND c.season_id = reddit.within_season
AND c.episode_id = reddit.most_recent_episode
WHERE within_season IS NOT NULL
)
SELECT *
FROM matched_exact m
'''
reddit_df = pd.read_sql(sql, eng)
ep_df = pd.read_sql('SELECT * FROM survivor.episode', eng)
season_to_name = pd.read_sql('SELECT season_id, name AS season_name FROM survivor.season', eng)
reddit_df = reddit_df.merge(season_to_name, on='season_id')
reddit_df.rename(columns={'name': 'season_name'}, inplace=True)
reddit_df = reddit_df.merge(ep_df.drop(columns=['season_id']), on='episode_id')
reddit_df['created_dt'] = pd.to_datetime(reddit_df['created_dt'])
pd.options.display.max_columns = 100
Taking a Look At The Data¶
reddit_df.head()
reddit_df.shape
There is a wealth of data here -- the actual content of the message, other Reddit information (like the user, upvotes, flairs, etc.) For this part of the analysis, we will just be looking at the occurances of the names in the body of the comment. In the next installment, we will look a bit deeper at some of the text inside the body and how that relates to the contestants. Additionally, we will take a look at some of the users and other information in later installments.
One thing to note is what the above query did the heavy lifting for -- finding the first and last names in the bodies of texts. So this dataframe (at a short 1.1 M rows) represents only the comments that had either one of these. Comments can appear multiple times if they contain multiple names.
Comments Per Season¶
The first -- and most obvious -- question we can answer is -- how many comments are there each season? And, which seasons are represented by the subreddit?
from plotly.express import bar, line
def plot_season_comments(df):
comments_per_season = df.groupby('season_name').size().reset_index()
comments_per_season.rename(columns={'season_name': 'Season', 0: 'Number of Comments (with names)'}, inplace=True)
return bar(data_frame=comments_per_season.sort_values(by='Number of Comments (with names)'),
x='Season', y='Number of Comments (with names)')
plot_season_comments(reddit_df)