Here we will consider the marathon runners listed in this Wikipedia page. This post will fit various generalized linear models (GLMs) with different families.
# Finishing times (in minutes) for male and female runners at each age
timesm <- c(123.98,128.633,134.266,139.484,139.533,156.5,161.95,174.8,184.9,195.9,236.567,340.017,505.283)
timesf <- c(139.316,144.893,149.00,151.083,170.55,181.5,192.95,215.483,233.7,252.733,314.433,533.133)
# Corresponding ages (the male list runs to 100, the female list to 90)
agem <- c(35,40,45,50,55,60,65,70,75,80,85,90,100)
agef <- c(35,40,45,50,55,60,65,70,75,80,85,90)
agemsq <- agem^2  # quadratic age term for the male-only models
# Stack both genders into one data frame; gender is coded 1 = female, 0 = male
time <- c(timesf,timesm)
age <- c(agef,agem)
gender <- c(rep(1,12),rep(0,13))
marathon <- data.frame(time,age,gender)
head(marathon)
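A quick plot of the raw data (an optional aside, not part of the original analysis) shows how steeply times rise at the older ages:

plot(time ~ age, data = marathon, pch = 16,
     col = ifelse(gender == 1, "red", "blue"),
     xlab = "Age", ylab = "Time (minutes)")
legend("topleft", legend = c("Female", "Male"),
       col = c("red", "blue"), pch = 16)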
First, let's fit an ordinary linear regression model, which assumes a Gaussian response.
Linear Regression (Gaussian)
lin1 <- lm(timesm ~ agem)            # male runners, linear in age
lin2 <- lm(timesm ~ agem + agemsq)   # male runners, with quadratic age term
summary(lin1)
summary(lin2)
Based on the summaries, the quadratic term appears to be quite important: it increases the R-squared substantially. So there does appear to be a quadratic effect of age on time (or at the very least, a non-linear one).
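R's anova() can make this comparison formal with an F-test on the nested fits (a small addition, using the lin1 and lin2 objects above):

anova(lin1, lin2)  # F-test for the added quadratic term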
lin3 <- lm(time ~ age + gender, data = marathon)
summary(lin3)
Here, the model suggests that gender plays some part in determining the finishing time (in minutes), but the coefficient is not significant. Taken at face value, it means that, holding age constant, a woman takes about 35.05 additional minutes to finish the race (equivalently, a man's time is about 35.05 minutes faster).
Comparing the intercepts in the models shows a pretty large difference in the predictions. The male-only intercept differs from the intercept in the combined model because the combined model does not consider the interaction of age and gender: it forces a single age slope on both genders, and the gender coefficient merely shifts the whole line (it captures the difference between, say, a 35-year-old male and female, holding everything else constant). Once the age-gender interaction is included, each gender gets its own intercept and slope, the baseline (male) intercept matches the male-only model, and this problem is eliminated, as sketched below.
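A minimal sketch of that interaction model (the object name lin4 is mine, not from the original post):

lin4 <- lm(time ~ age * gender, data = marathon)  # separate intercept and slope per gender
summary(lin4)  # the gender = 0 (male) baseline coefficients match the male-only fit lin1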
Gamma Distribution
gammalin1 <- glm(timesm~agem,family="Gamma")
summary(gammalin1)
The link function here is the canonical link function for the Gamma distribution, which is the inverse ($\frac{1}{\mu}$). All variables come up as significant.
The model can be written as $g(\mu) = X\beta$, where $\mu = \text{E}(Y)$, $Y$ follows a Gamma distribution, and the link function is $g(\mu) = \frac{1}{\mu}$.
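For comparison with the derivation in the next section (this algebra is standard and not part of the R output), the Gamma density with mean $\mu$ and shape $\nu$ can be put in the same exponential-family form: $$ f(y) = \frac{1}{\Gamma(\nu)}\left(\frac{\nu}{\mu}\right)^{\nu} y^{\nu - 1} e^{-\nu y/\mu} = \exp\left(\frac{y\left(-\frac{1}{\mu}\right) - \log \mu}{1/\nu} + c(y, \nu)\right) $$ so that $\theta = -\frac{1}{\mu}$, $a(\phi) = \frac{1}{\nu}$, $b(\theta) = -\log(-\theta) = \log \mu$, and $b'(\theta) = -\frac{1}{\theta} = \mu = \text{E}(Y)$. The canonical link is therefore proportional to $\frac{1}{\mu}$, matching the inverse link R uses by default.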
Inverse-Gaussian Distribution
The exponential family has the following form: $$ f(y; \theta, \phi) = e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)} $$ The inverse-Gaussian density can be rewritten step by step: $$ f(y) = \exp\left(-\frac{\lambda (y - \mu)^2}{2\mu^2 y} - \frac{1}{2}\log\left(\frac{2\pi y^3}{\lambda}\right)\right) $$ $$ f(y) = \exp\left(-\frac{\lambda (y^2 - 2y\mu + \mu^2)}{2\mu^2 y} - \frac{1}{2}\log\left(\frac{2\pi y^3}{\lambda}\right)\right) $$ $$ f(y) = \exp\left(\lambda\left[-\frac{y}{2\mu^2} + \frac{1}{\mu}\right] - \frac{\lambda}{2y} - \frac{1}{2}\log\left(\frac{2\pi y^3}{\lambda}\right)\right) $$ $$ f(y) = \exp\left(\frac{-\frac{y}{2\mu^2} + \frac{1}{\mu}}{\frac{1}{\lambda}} - \frac{\lambda}{2y} - \frac{1}{2}\log\left(\frac{2\pi y^3}{\lambda}\right)\right) $$
This is in the exponential form, with: $$\theta = -\frac{1}{2\mu^2}$$ $$a(\phi) = \frac{1}{\lambda} = \phi$$ $$b(\theta) = -\frac{1}{\mu} = -\sqrt{-2\theta}$$ $$c(y,\phi) = -\frac{\lambda}{2y} - \frac{1}{2}\log\left(\frac{2\pi y^3}{\lambda}\right) = -\frac{1}{2y\phi} - \frac{1}{2}\log(2\pi \phi y^3)$$
$b'(\theta) = \frac{1}{\sqrt{-2\theta}} = \mu$, which is $\text{E}(Y)$. The canonical link is the function $g$ satisfying $g(\text{E}(Y)) = \theta$; since $\theta = -\frac{1}{2\mu^2}$, the canonical link is proportional to $\frac{1}{\mu^2}$, and it is this quantity that is set equal to $X\beta$. The model is then $g(\mu) = X\beta$, where $Y$ follows an inverse-Gaussian distribution with link function $g(\mu) = \frac{1}{\mu^2}$.
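As a quick sanity check (an aside; these calls just inspect R's built-in family objects), R reports exactly these canonical links:

Gamma()$link             # "inverse"
inverse.gaussian()$link  # "1/mu^2"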
invgauslin1 <- glm(timesm~agem,family="inverse.gaussian")
summary(invgauslin1)
Both sets of coefficients are quite difficult to interpret by themselves, as they live on the link scale rather than the response scale. The predictions for the gamma model are as follows:
predict(gammalin1)                      # predictions on the link scale (1/mu)
pred1 <- 1/predict(gammalin1)           # invert the 1/mu link back to times
pred2 <- 1/sqrt(predict(invgauslin1))   # invert the 1/mu^2 link
pred3 <- predict(lin1)                  # identity link, already on the time scale
sum((pred1-timesm)^2)                   # gamma
sum((pred2-timesm)^2)                   # inverse-Gaussian
sum((pred3-timesm)^2)                   # linear
sum((predict(lin2)-timesm)^2)           # quadratic
Here we can see the sum of squared errors of the back-transformed predictions from each model. The gamma model does an exceptional job of predicting the response, time: it beats both the linear and the quadratic regressions. The inverse-Gaussian performs the best of the three candidate distributions.
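As an aside, predict() will do this back-transformation for you if you pass type = "response", which avoids inverting the links by hand:

predict(gammalin1, type = "response")     # equals 1/predict(gammalin1)
predict(invgauslin1, type = "response")   # equals 1/sqrt(predict(invgauslin1))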
Below, I do the same thing, except using the identity link. This produces more easily interpretable coefficients, but the fits predict the response less well: their sums of squared errors are larger than those of the canonical-link models.
gammalin2 <- glm(timesm ~ agem, family = Gamma(link = "identity"))
summary(gammalin2)
invgauslin2 <- glm(timesm ~ agem, family = inverse.gaussian(link = "identity"))
summary(invgauslin2)
predict(gammalin2)
predict(invgauslin2)
sum((predict(gammalin2)-timesm)^2)
sum((predict(invgauslin2)-timesm)^2)
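Finally, as a rough cross-check (an addition beyond the sums of squares above, and only a heuristic, since comparing AIC across families relies on R computing the full log-likelihood for each):

AIC(lin1, lin2, gammalin1, invgauslin1, gammalin2, invgauslin2)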