## Measures of Spread¶

In the last post on central tendency, we discussed the basics of what a statistic and a population are, as well as the common measures of central tendency: the **mean**, **median**, and **mode**. In this post, we will dive deeper into how to describe sample and population data using **measures of spread**.

When we talked about central tendency, we discussed the idea behind the "average" observation in a pool of data. When it comes to describing a dataset, the central tendency is a great place to start -- it gives us a sense of the location of our data. For instance, if we had the weights of a variety of individuals, we could get a sense of what a "typical" weight might be using measures of central tendency.

# Alien Brains¶

Imagine you are an observer who knows nothing about a particular subject. Let's say this subject is the number of neurons in a sample of aliens' brains. Or, you know, something.

Based on this information alone, an ordinary, human observer would have absolutely no sense of where to begin. That is, we have absolutely no **prior** information about aliens' brains. Let's say the neuron counts of 100 aliens found in the first UFO crash ever known to man look something like this:

```
import numpy as np
import pandas as pd
import scipy.stats
import plotly.offline as plt
import plotly.graph_objs as go
plt.init_notebook_mode(connected=True)
```

```
alien_neurons = pd.Series(np.random.negative_binomial(424242, .0042, 100))
```

```
data = [go.Histogram(x=alien_neurons.tolist())]
layout = go.Layout(
    title='Alien Brains',
    xaxis=dict(title='Number of Neurons'),
    yaxis=dict(title='Number of Aliens'),
    bargap=0.2,
    bargroupgap=0.1)
fig = go.Figure(data=data, layout=layout)
plt.iplot(fig)
```

Woah! These values are in the hundreds of millions! For reference, the human brain has roughly 86 billion neurons, so these aliens are not the most sophisticated in terms of neuron count.

Is there a way to see this **location** more quantitatively? Your answer should be ... yes! Using the techniques we learned last time, we can find the mean and median of this distribution. To spare the agony of performing this arithmetic on such huge numbers, we can use pandas to find these values.

```
alien_mean, alien_median = alien_neurons.aggregate(['mean', 'median'])
print('''Mean of Alien Neurons: {},
Median of Alien Neurons: {}'''.format(alien_mean, alien_median))
```

In this run, the mean of the sample is 100,584,186.03 and the median is 100,592,749 (your exact values will differ, since no random seed is set). These values are quite "close", which is expected because we see that the distribution is somewhat symmetric.

But... wait a minute! How are we certain that these two values are "close"? We can look at our histogram to see if this assertion is correct.

```
alien_mean - alien_median
```

```
data = [go.Histogram(x=alien_neurons.tolist())]
shapes = [{
    'type': 'line',
    'name': 'Mean',
    'line': {'color': 'red'},
    'xref': 'x',
    'yref': 'y',
    'x0': alien_mean,
    'y0': 0,
    'x1': alien_mean,
    'y1': 20,
}, {
    'type': 'line',
    'name': 'Median',
    'line': {'color': 'black'},
    'xref': 'x',
    'yref': 'y',
    'x0': alien_median,
    'y0': 0,
    'x1': alien_median,
    'y1': 20,
}]
layout = go.Layout(
    title='Alien Brains',
    xaxis=dict(title='Number of Neurons'),
    yaxis=dict(title='Number of Aliens'),
    bargap=0.2,
    bargroupgap=0.1,
    shapes=shapes)
fig = go.Figure(data=data, layout=layout)
plt.iplot(fig)
```

Well, they look pretty close. However, this is far too hand-wavy an answer. In determining the "closeness" of these two statistics, we would like an objective measure relative to scale. The actual difference is around 8,563 neurons! With that number alone, we would likely not say that the two are "close".

If your paycheck were \$8,563 less than what you expected, you would definitely be upset! The more we think about it, relative differences like these depend on the **scale** of the numbers we're talking about. 8,000 stars may be astronomically small compared to the total number of stars, but \$8,000 is significant to most people!

However, because of the **scale** of this distribution, this is not a large difference relative to the differences between observations.

Imagine we had a distribution that looked like this instead:

```
log_norm_sigma = 1.158
# for a lognormal, mean - median = e^mu (e^(sigma^2 / 2) - 1), so solve for mu
# to match the absolute mean-median difference of the alien sample
log_norm_mu = np.log(
    abs(alien_mean - alien_median) / (np.exp(log_norm_sigma**2 / 2) - 1))
sim = pd.Series(
    np.random.lognormal(mean=log_norm_mu, sigma=log_norm_sigma, size=100))
```

```
data = [go.Histogram(x=sim.tolist())]
shapes = [{
    'type': 'line',
    'name': 'Mean',
    'line': {'color': 'red'},
    'xref': 'x',
    'yref': 'y',
    'x0': sim.mean(),
    'y0': 0,
    'x1': sim.mean(),
    'y1': 50,
}, {
    'type': 'line',
    'name': 'Median',
    'line': {'color': 'black'},
    'xref': 'x',
    'yref': 'y',
    'x0': sim.median(),
    'y0': 0,
    'x1': sim.median(),
    'y1': 50,
}]
layout = go.Layout(
    title='Not Alien Brains',
    xaxis=dict(title='Counts'),
    yaxis=dict(title='Volume'),
    bargap=0.2,
    bargroupgap=0.1,
    shapes=shapes)
fig = go.Figure(data=data, layout=layout)
plt.iplot(fig)
```

*As a sidenote, this distribution was simulated so that the difference between its mean and median matches the mean-median difference of the alien sample, but on a completely different scale. This was done using the lognormal distribution, which is a skewed distribution, so it works well for this purpose!*

```
sim.mean() - sim.median()
```

We can now clearly see that the absolute difference between the mean and median (and in fact, between **any** of the observations) is not sufficient to describe their "closeness" to one another. We have an idea of the location, but we need a sense of scale.

# Why does spread matter?¶

We can already see why the measures of central tendency are insufficient to give us an idea of what a distribution looks like. To investigate further, suppose we know only that the mean of some observations is 100. What else can we say?

Well, not much. Consider the two distributions below. Both populations have mean 100, and we sample 1,000 values from each. The resulting samples have means very close to one another (and to 100), but the samples themselves are quite different.

```
ex_1 = np.random.normal(100, 2, 1000)
ex_2 = np.random.normal(100, 10, 1000)
data = [
    go.Histogram(x=ex_1.tolist(), name='Less Variant'),
    go.Histogram(x=ex_2.tolist(), name='More Variant')
]
layout = go.Layout(
    title='Mean of 100',
    xaxis=dict(title='Value'),
    yaxis=dict(title='Volume'),
)
figure = go.Figure(data=data, layout=layout)
plt.iplot(figure)
```

```
ex_1.mean()
```

```
ex_2.mean()
```

We can now see how the **spread** of these two distributions differs. How can we measure spread?

### Range¶

The simplest measure of spread is known as the **range**. This is the smallest value in the sample subtracted from the largest value in the sample. This can give us an idea of where the possible values of the sample lie. That is:

$$ x_{range} = x_{max} - x_{min} $$

For instance, in our alien example, we have:

```
max_alien_neurons, min_alien_neurons = alien_neurons.agg(['max', 'min'])
rng = max_alien_neurons - min_alien_neurons
print('''Maximum: {}
Minimum: {}
Range: {}'''.format(max_alien_neurons, min_alien_neurons, rng))
```

From this, we know the difference between the largest and smallest values. To see how the original mean-median difference compares, we can look at the **ratio** of the difference to the range.

```
abs(alien_mean - alien_median) / rng
```

Here we get a sense of how "close" the median and mean are. This ratio is necessarily between 0 and 1, with 0 meaning the two points coincide, and 1 meaning they are separated by the largest difference in the sample (that is, $x_1 - x_2 = x_{range}$).
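Since the mean and the median both lie between the sample minimum and maximum, their absolute difference can never exceed the range, which is why the ratio is bounded. A quick sketch, using a hypothetical normal sample:

```
import numpy as np

# the mean and median both lie between the sample minimum and maximum,
# so their absolute difference can never exceed the range
rng_gen = np.random.default_rng(42)  # hypothetical sample for illustration
x = rng_gen.normal(loc=100, scale=5, size=1000)
ratio = abs(x.mean() - np.median(x)) / (x.max() - x.min())
print(0 <= ratio <= 1)  # True
```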

Although simple, the range suffers from the problem that it only measures the difference between the largest and smallest points. It is therefore as variable as those points. Suppose we had a dataset that looked like the following: 0, 50, 50, 50, 90, 95, 100. Then, the mean is:

```
example_ds = pd.Series([0, 50, 50, 50, 90, 95, 100])
example_ds.mean()
```

```
example_ds.median()
```

Using the ratio calculation as before, we would calculate:

```
(example_ds.mean() - example_ds.median()) / (
    example_ds.max() - example_ds.min())
```

```
data = [go.Histogram(x=example_ds.tolist())]
shapes = [{
    'type': 'line',
    'name': 'Mean',
    'line': {'color': 'red'},
    'xref': 'x',
    'yref': 'y',
    'x0': example_ds.mean(),
    'y0': 0,
    'x1': example_ds.mean(),
    'y1': 5,
}, {
    'type': 'line',
    'name': 'Median',
    'line': {'color': 'black'},
    'xref': 'x',
    'yref': 'y',
    'x0': example_ds.median(),
    'y0': 0,
    'x1': example_ds.median(),
    'y1': 5,
}]
layout = go.Layout(
    title='Dummy',
    xaxis=dict(title='Number'),
    yaxis=dict(title='Volume'),
    shapes=shapes)
fig = go.Figure(data=data, layout=layout)
plt.iplot(fig)
```

Now let's do something crazy -- let's change the last value of the series to 1 million, and see what happens to the "ratio" of the difference between the median and mean to the range.

```
example_ds.replace({100: 1000000}, inplace=True)
```

```
example_ds.median()
```

```
(example_ds.mean() - example_ds.median()) / (
    example_ds.max() - example_ds.min())
```

```
data = [go.Histogram(x=example_ds.tolist())]
shapes = [{
    'type': 'line',
    'name': 'Mean',
    'line': {'color': 'red'},
    'xref': 'x',
    'yref': 'y',
    'x0': example_ds.mean(),
    'y0': 0,
    'x1': example_ds.mean(),
    'y1': 6,
}, {
    'type': 'line',
    'name': 'Median',
    'line': {'color': 'black'},
    'xref': 'x',
    'yref': 'y',
    'x0': example_ds.median(),
    'y0': 0,
    'x1': example_ds.median(),
    'y1': 6,
}]
layout = go.Layout(
    title='Dummy',
    xaxis=dict(title='Number'),
    yaxis=dict(title='Volume'),
    shapes=shapes)
fig = go.Figure(data=data, layout=layout)
plt.iplot(fig)
```

Huh. Here, the mean and median are very far apart -- and the mean is quite a bit off from the "average" observation, as a single observation pushed it up into the hundreds of thousands. This is, again, due to the lack of **robustness** of the mean. However, our ratio measure does not register much of a difference between this case and the previous one.

We would expect that the most extreme values in the sample are more variable than values closer to the center. In this case, we can take ranges of specific **quartiles**, as discussed in the previous post, to measure variability.

### Interquartile range¶

In a similar way, we can define a range between any two percentiles of the data. The range is a specific case of this, where we take the 0th and 100th percentiles of the data. As discussed earlier, these may be quite variable -- they are the most extreme values of our distribution.

Suppose instead we took the difference between the 25th and 75th percentiles, or the 1st and 3rd quartiles. It would follow that this measure is more **robust** than the max-min range. This is known as the **interquartile range** and is (unsurprisingly) robust to outliers. That is, we can say that the interquartile range in a finite sample is:

$$ IQR = Q_x(0.75) - Q_x(0.25) $$

where $Q_x(p)$ is the **quantile** function, i.e. the function that finds the value below which a proportion $p$ of the data lies. Arithmetically, we can approximate this by finding the medians of the values above and below the overall median. Most software packages contain functions that determine this automatically. In the case of the alien brains, we have:

```
iqr = alien_neurons.quantile(.75) - alien_neurons.quantile(.25)
iqr
```
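The "medians of the values above and below the median" arithmetic can also be sketched by hand. This is a minimal illustration on a small hypothetical sample; note that this hinge-style calculation can differ slightly from the interpolated quantiles pandas computes.

```
import pandas as pd

# hypothetical small sample to illustrate the "median of each half" arithmetic
data = pd.Series([1, 3, 5, 7, 9, 11, 13, 15])
lower_half = data[data < data.median()]  # values below the median: 1, 3, 5, 7
upper_half = data[data > data.median()]  # values above the median: 9, 11, 13, 15
iqr_by_halves = upper_half.median() - lower_half.median()
print(iqr_by_halves)  # 12.0 - 4.0 = 8.0
```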

This represents the width of the interval covering the middle 50% of the data. In other words, we have determined a measure of how spread out things are. Taking a similar ratio, we can determine how distant two points are. For example:

```
abs(alien_mean - alien_median) / iqr
```

This value ranges from 0 to infinity. In general, each **"unit"** of this ratio measures how far apart the two points are **relative to the middle 50% of the data**. A small value indicates that two points are relatively "close", while a large value (>1) indicates they are "distant" relative to the middle 50%.

### Boxplots¶

Using what we have learned about measures of spread and measures of central tendency, we can now combine everything into a plot called a **box and whisker plot**. This plot shows the median, the interquartile range, and the total range in one figure. This can give us a good "overview" of what the data looks like in terms of quantiles.

```
data = go.Box({'x': alien_neurons, 'name': 'Alien Neurons'})
plt.iplot([data])
```

The center line represents the median, the edges of the box represent the 25th and 75th percentiles, respectively, and the endpoints of the whiskers represent the minimum and maximum. Comparing to the skewed simulated distribution we discussed before, we can see large differences between the two:

```
data = go.Box({'x': sim, 'name': 'Not Alien Neurons'})
plt.iplot([data])
```

We can now see how much about spread can be understood through quantiles and the differences between them. These measures are quite robust, but they sacrifice efficiency: just as the median ignores information that the mean uses, these measures of variability do not take every data point into account, only the quantiles. The next measure of spread, the most common both historically and in practice, is the **variance**, which is defined in terms of the mean instead of quantiles. But first, we will discuss the **degrees of freedom** implicit in the mean.

# Freedom -- the dream, and the degree to which we have it¶

It is at this point that we should discuss **degrees of freedom**, before jumping into variance. Degrees of freedom are a concept many struggle with in elementary statistics, and one that is often glossed over in statistics courses. I believe that understanding degrees of freedom is essential to understanding how randomness ties into the statistical treatment of data, which is why I have included it here. If this feels too math-heavy, feel free to skip it -- it is helpful but not strictly necessary. If you're fond of mathematics, read on.

Consider it this way. Let's say I told you we have 100 points. How many degrees of freedom do these points have? When I ask this, I am essentially asking: **how many different ways can these points vary?**

Well, we only know that there are 100 points -- we know nothing of how these points are distributed. Therefore, they can conceivably be *any* 100 points, since we have no **prior** information.

Now, suppose that I tell you that the mean of these points is 100. Have the degrees of freedom changed?

The answer is yes. We can no longer vary the numbers absolutely freely -- we are constrained by the fact that the numbers must have some property -- the mean -- which equals a known value. In this case, n - 1 of the points can vary freely, after which the final number is fixed.

Let's consider an example. Imagine we pick some random numbers out of a hat and get the following: 9, 8, 2, 42, 100, -2. Suppose there is only one paper left in the hat and I told you that I, an omniscient observer, know that the mean of all the values in the hat is 1000. What is the final value?

Well, we know that

$$ \bar{X}_{mean} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum_{i=1}^{n-1} x_i}{n} + \frac{x_n}{n} $$

So, it follows that:

$$ x_n = n\bar{X}_{mean} - \sum_{i=1}^{n-1} x_i $$

This is only elementary algebra. In our case, the final value is

$$ x_7 = 7 \bar{X}_{mean} - \sum_{i=1}^{6} x_i = 7 \times 1000 - (9 + 8 + 2 + 42 + 100 - 2) $$

This is:

```
points_picked = pd.Series([9, 8, 2, 42, 100, -2])
last_value = 7000 - points_picked.sum()
last_value
```

To confirm that this is correct, let's see what the mean looks like.

```
pd.concat([points_picked, pd.Series([last_value])]).mean()
```

But we could do this for any mean! Suppose the mean was instead -1000. Then

```
last_value = -7000 - points_picked.sum()
last_value
```

```
pd.concat([points_picked, pd.Series([last_value])]).mean()
```

This describes what the mean tells us about the data -- if we know the mean, one of the data points essentially becomes redundant.

Now, suppose I wanted to create my own hat game, where I wanted the mean to be 42. How could I do this?

There are infinitely many sets of numbers I could choose that have a mean of 42. However, they all share the constraint that **if we were to remove one value, we could reconstruct it as above**! That is, the final value is not **random** but **determined**. This means that we have n - 1 degrees of freedom -- or that the dimensionality of our random vector is now n - 1.

Then, I could randomly select n - 1 numbers and select the final number deterministically such that

$$ x_7 = 7 \times 42 - \sum_{i=1}^{6} x_i $$

This will ensure that my mean is always 42. In fact, there is no other choice of final number that would give a mean of 42.

Conversely, the number I select will uniquely determine the mean. So the mean has one degree of freedom -- it can be determined entirely from one number, if the other numbers are known.

This becomes important because the more degrees of freedom we have, the more we allow random variation to play a part in our estimates. Although randomness may seem bad, in fact having more degrees of freedom can be a good thing, as is the case in linear regression -- a topic for a different day.

In general, we have

```
def find_point_from_mean(series_of_less, mean):
    # given n - 1 points and the desired mean, recover the determined final point
    n = len(series_of_less) + 1
    n_final = n * mean - series_of_less.sum()
    return n_final
```

```
find_point_from_mean(points_picked, 1000)
```

### Freedom isn't free¶

Now let's tackle the first case -- we have 100 points with a mean of 100. How many different ways can we select these 100 points?

Consider the following: knowing only the mean of the distribution, we can select any n - 1 points, in this case 99, arbitrarily and completely at random, and then select the final point such that:

$$ x_{final} = n\bar{X}_{mean} - \sum_{i=1}^{n-1} x_i $$

In this case, this is

$$ x_{final} = 10{,}000 - \sum_{i=1}^{99} x_i $$

Taken another way, we can say the following:

Knowing the mean, we can calculate:

$$ \tilde{x}_i = x_i - \bar{X}_{mean} \quad \forall x_i \in S $$

Only n - 1 of these values are allowed to vary -- once we know n - 1 of them, the last is determined by the calculation of the mean. In general, the degrees of freedom can be decomposed as follows:

$$ \begin{bmatrix}x_1 \\ x_2 \\ \vdots \\ x_n\end{bmatrix} = \begin{bmatrix}\bar{X}_{mean} \\ \bar{X}_{mean} \\ \vdots \\ \bar{X}_{mean}\end{bmatrix} + \begin{bmatrix} x_1 - \bar{X}_{mean} \\ x_2 - \bar{X}_{mean} \\ \vdots \\ x_n - \bar{X}_{mean}\end{bmatrix} $$

This is in vector notation, which makes the result easier to see. The vector on the left, our **data points** $x_1, x_2, ..., x_n$, has n degrees of freedom -- given no additional information, the points are free to vary randomly. The first vector on the right, the vector of means, has 1 degree of freedom -- if we do not know it, it can vary in one dimension; it can be any single number.

The final vector on the right has n - 1 degrees of freedom (the degrees of freedom of the original vector minus the degrees of freedom of the mean). This means that, given a mean, our data points can vary in n - 1 dimensions. The final dimension is then a **linear combination** of the other data points.
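This decomposition can be checked numerically. A minimal sketch on a small hypothetical sample: the mean vector and the deviation vector reconstruct the data exactly, and the deviations are constrained to sum to zero, which is exactly the lost degree of freedom.

```
import numpy as np

# decompose a hypothetical sample into its mean vector and deviation vector
x = np.array([2.0, 4.0, 6.0, 8.0])
mean_vec = np.full_like(x, x.mean())       # 1 degree of freedom
dev_vec = x - x.mean()                     # n - 1 degrees of freedom
# the two pieces reconstruct the data exactly
print(np.allclose(mean_vec + dev_vec, x))  # True
# the deviations are constrained: they must sum to zero
print(dev_vec.sum())  # 0.0
```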

# Variance¶

But what does any of this have to do with the spread of a dataset? Glad you asked!

One way to think about spread is: how far away are the observations from the mean? To measure this, we could take the **differences** between each point and the mean. This would look something like this:

```
x_differences = alien_neurons - alien_mean
x_differences
```

This gives us a difference for each of the points. It would be nicer to have a single number measuring the variability of the data, so we can take the average of these values:

```
x_differences.mean()
```

Hm. This value is very close to zero. This is actually expected, simply because of the **definition of the arithmetic mean**. Consider the following:

$$ \sum_{i=1}^{n}(x_i - \bar{X}_{mean}) = \sum_{i=1}^{n} x_i - n\bar{X}_{mean} = n\bar{X}_{mean} - n\bar{X}_{mean} = 0 $$

Intuitively, the mean is the value at the "center" of the distribution, in the sense that the total deviation above the mean equals the total deviation below it. The mean is precisely the value for which this sum is zero.
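We can check this cancellation exactly on a small hypothetical sample:

```
import pandas as pd

# tiny example: deviations from the mean always cancel out exactly
s = pd.Series([1, 2, 6, 7])   # mean is 4
deviations = s - s.mean()      # -3, -2, 2, 3
print(deviations.sum())  # 0.0
```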

But in our case, we are not happy with this. We are not concerned with the sign of the distance from the mean, but with how far away it is -- not the **direction** but the **magnitude**. For example, if the mean is 2 and we have two points around it, 1 and 3, we would say they are equally distant from the mean: the sign of the difference (-1 and 1) does not matter.

How can we eliminate the sign? Mathematically, there are two general approaches: we can take the **absolute value**, or we can take the **squares**. The absolute value preserves the exact magnitude, while squaring does not.

Why might we want to take the squares? Squaring the deviations puts more weight on values far from the mean and less on those close to it. In this sense, we get a **penalization** of extreme values -- extreme values inflate our variance more than "common" values do. Additionally, squares have historically been preferred because they are analytically "nicer": the square is a smooth function whose derivative we can take, while the absolute value is not differentiable everywhere. In the computing age this has become less of an issue, as numerical approaches to optimization have taken hold, but I digress.
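To see the penalization concretely, here is a sketch comparing the two sign-removal strategies on a small hypothetical sample containing one extreme value:

```
import pandas as pd

# compare the two sign-removal strategies on a sample with one outlier
s = pd.Series([1, 2, 3, 4, 100])
dev = s - s.mean()               # mean is 22
mad = dev.abs().mean()           # mean absolute deviation
msd = (dev ** 2).mean()          # mean squared deviation
# squaring weights the outlier far more heavily than the absolute value does
print(mad, msd)  # 31.2 1522.0
```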

Since the squared version is used most in the statistical literature, we will define it first. We define the **sum of squares** as the following:

$$ SS = \sum_{i=1}^{n}(x_i - \bar{X}_{mean})^2 $$

This gives us a sense of the **total amount of squared deviations from the mean.** This is a measure of spread.

It may be useful to take the average of these values, i.e. the "typical" squared deviation from the mean. This is known as the mean squared deviation from the mean, or the **variance**.

In the case of a **known population mean** $\mu$, we can calculate this variance directly from our sample. In this case, all of our values are allowed to vary freely -- they are not used to compute $\mu$. So we can take the mean of the squared deviations as follows:

$$ \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 $$

Of course, knowing the population mean would not happen often in practice. But when it is known in advance, our observed values are not linearly computable from it -- in this sense, they can be any values.

If we do not know the population mean, however, and must estimate it with the sample mean, we make a correction known as **Bessel's correction**. The **sample variance** takes this into account, as follows:
$$
\hat{s}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{X}_{mean})^2
$$

Here, we divide by n - 1 because the vector of deviations $x_i - \bar{X}_{mean}$ is only allowed to vary in n - 1 dimensions -- the last deviation is determined once the others are known. This is how degrees of freedom relate to the sample variance.
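As a quick check of the formula above, here is a minimal sketch comparing the by-hand sample variance with pandas' built-in `var`, which applies Bessel's correction (`ddof=1`) by default:

```
import pandas as pd

# sample variance by hand vs pandas, which applies Bessel's correction by default
s = pd.Series([2.0, 4.0, 6.0, 8.0])
n = len(s)
sum_sq = ((s - s.mean()) ** 2).sum()  # 20.0
by_hand = sum_sq / (n - 1)            # divide by n - 1, not n
print(by_hand, s.var())               # the two agree: pandas uses ddof=1
```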