
Generalized Linear Models: An Exercise

Here we will consider the age-group marathon times listed on this Wikipedia page.

This post fits various generalized linear models (GLMs) with different response families.

In [10]:
# age-group marathon record times, in minutes (men and women)
timesm <- c(123.98,128.633,134.266,139.484,139.533,156.5,161.95,174.8,184.9,195.9,236.567,340.017,505.283)
timesf <- c(139.316,144.893,149.00,151.083,170.55,181.5,192.95,215.483,233.7,252.733,314.433,533.133)
agem <- c(35,40,45,50,55,60,65,70,75,80,85,90,100)
agef <- c(35,40,45,50,55,60,65,70,75,80,85,90)

agemsq <- agem^2  # quadratic age term for the men's models

# stacked data; gender is coded 1 = female, 0 = male
time <- c(timesf,timesm)
age <- c(agef,agem)
gender <- c(rep(1,12),rep(0,13))

marathon <- data.frame(time,age,gender)
head(marathon)
     time age gender
1 139.316  35      1
2 144.893  40      1
3 149.000  45      1
4 151.083  50      1
5 170.550  55      1
6 181.500  60      1

First, let's fit an ordinary linear regression model, which assumes a Gaussian response.

Linear Regression (Gaussian)

In [3]:
# timesm and agem are free-standing vectors, not columns of marathon,
# so R finds them in the calling environment rather than in data
lin1 <- lm(timesm~agem,data=marathon)
lin2 <- lm(timesm~agem+agemsq,data=marathon)
summary(lin1)

summary(lin2)

Call:
lm(formula = timesm ~ agem, data = marathon)

Residuals:
   Min     1Q Median     3Q    Max 
-70.80 -47.41 -15.95  28.83 149.61 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -89.203     62.054  -1.438  0.17841    
agem           4.449      0.910   4.889  0.00048 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 63.52 on 11 degrees of freedom
Multiple R-squared:  0.6848,	Adjusted R-squared:  0.6562 
F-statistic:  23.9 on 1 and 11 DF,  p-value: 0.0004801

Call:
lm(formula = timesm ~ agem + agemsq, data = marathon)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.811 -15.503   5.468  19.742  40.354 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 498.57207   98.79145   5.047 0.000502 ***
agem        -14.89307    3.13305  -4.754 0.000776 ***
agemsq        0.14557    0.02335   6.233 9.72e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.14 on 10 degrees of freedom
Multiple R-squared:  0.9355,	Adjusted R-squared:  0.9226 
F-statistic: 72.49 on 2 and 10 DF,  p-value: 1.118e-06

Based on the summary, the quadratic term appears quite important: it raises the R-squared substantially (from 0.68 to 0.94). So there does appear to be a quadratic effect of age on time (or, at the very least, a non-linear one).
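As a more formal check (a sketch, not part of the original notebook), a nested-model F-test compares the two fits directly:

# F-test of lin1 vs lin2; a small p-value means the quadratic term
# significantly improves the fit
anova(lin1, lin2)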

In [4]:
lin3 <- lm(time~age+gender)
summary(lin3)

Call:
lm(formula = time ~ age + gender)

Residuals:
   Min     1Q Median     3Q    Max 
-74.18 -42.85 -14.10  26.96 181.20 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -104.3254    48.9768  -2.130   0.0446 *  
age            4.6801     0.6979   6.706 9.68e-07 ***
gender        35.0534    25.7576   1.361   0.1873    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 64.15 on 22 degrees of freedom
Multiple R-squared:  0.6749,	Adjusted R-squared:  0.6454 
F-statistic: 22.84 on 2 and 22 DF,  p-value: 4.285e-06

Here, the model suggests that gender plays some part in determining the completion time (in minutes), but the coefficient is not significant. Taken at face value, it means that, holding age constant, a woman is predicted to take an additional 35.05 minutes to finish the race (equivalently, a man is predicted to finish 35.05 minutes faster).
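To make the size of the effect concrete, here is a small sketch (not in the original notebook) predicting the time for a 50-year-old of each gender; the two predictions differ by exactly the gender coefficient:

# male (gender = 0) then female (gender = 1); difference is ~35.05 minutes
predict(lin3, newdata = data.frame(age = 50, gender = c(0, 1)))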

Comparing the intercepts in the models shows a pretty large difference in the predictions. The male-only intercept differs from the intercept in the combined model because the combined model does not consider the interaction of age and gender; once that interaction is accounted for, the intercepts agree. The gender coefficient holds everything else constant, that is, it compares, say, a 35-year-old man with a 35-year-old woman. Including the interaction term eliminates this problem.
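As a sketch of that point (lin4 is a name I introduce here, not from the original notebook): a model with a full age-gender interaction is equivalent to fitting each gender separately, so with gender coded 0 for men, its intercept and age slope should match the male-only fit lin1 (-89.203 and 4.449).

lin4 <- lm(time ~ age * gender)
coef(lin4)  # intercept and age terms reproduce lin1's coefficients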

Gamma Distribution

In [5]:
gammalin1 <- glm(timesm~agem,family="Gamma")
summary(gammalin1)

Call:
glm(formula = timesm ~ agem, family = "Gamma")

Deviance Residuals: 
      Min         1Q     Median         3Q        Max  
-0.169540  -0.067254   0.001897   0.092610   0.116811  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.275e-02  5.748e-04   22.17 1.76e-10 ***
agem        -1.057e-04  6.839e-06  -15.46 8.30e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Gamma family taken to be 0.009790868)

    Null deviance: 2.42365  on 12  degrees of freedom
Residual deviance: 0.10997  on 11  degrees of freedom
AIC: 116.42

Number of Fisher Scoring iterations: 4

The link function here is the canonical link function for the Gamma distribution, which is the inverse ($\frac{1}{\mu}$). All variables come up as significant.

The model can be written as $g(\mu) = X\beta$, where $Y$ follows a Gamma distribution with mean $\mu$ and link function $g(\mu) = \frac{1}{\mu}$.
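Because of the inverse link, a fitted mean is recovered by inverting the linear predictor. A minimal sketch (not in the original notebook; age 40 is an arbitrary example):

# fitted mean time (minutes) at age 40 under the Gamma fit: mu = 1/(b0 + b1*age)
b <- coef(gammalin1)
1 / (b[[1]] + b[[2]] * 40)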

Inverse-Gaussian Distribution

The exponential family has the following form: $$ f(y; \theta, \phi) = \exp\left(\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right) $$ The inverse-Gaussian distribution can be written as: $$ f(y) = \exp\left(-\frac{\lambda (y - \mu)^2}{2\mu^2 y} - \frac{1}{2} \log\frac{2\pi y^3}{\lambda}\right) $$ $$ f(y) = \exp\left(-\frac{\lambda (y^2 - 2y\mu + \mu^2)}{2\mu^2 y} - \frac{1}{2} \log\frac{2\pi y^3}{\lambda}\right) $$ $$ f(y) = \exp\left(\lambda \left[-\frac{y}{2\mu^2} + \frac{1}{\mu}\right] - \frac{\lambda}{2y} - \frac{1}{2}\log\frac{2\pi y^3}{\lambda}\right) $$ $$ f(y) = \exp\left(\frac{-\frac{y}{2\mu^2} + \frac{1}{\mu}}{\frac{1}{\lambda}} - \frac{\lambda}{2y} - \frac{1}{2}\log\frac{2\pi y^3}{\lambda}\right) $$

This is in the exponential form, with $$\theta = -\frac{1}{2\mu^2}$$ $$a(\phi) = \frac{1}{\lambda} = \phi$$ $$b(\theta) = -\frac{1}{\mu} = -\sqrt{-2\theta}$$ $$c(y, \phi) = -\frac{\lambda}{2y} - \frac{1}{2}\log\frac{2\pi y^3}{\lambda} = -\frac{1}{2y\phi} - \frac{1}{2}\log(2\pi \phi y^3)$$ (Since $\theta < 0$, the quantity $-2\theta$ under the square root is positive.)

$b'(\theta) = \frac{1}{\sqrt{-2\theta}} = \mu$, which is $\text{E}(Y)$. The canonical link satisfies $g(\text{E}(Y)) = \theta = -\frac{1}{2\mu^2}$, so the canonical link is proportional to $\frac{1}{\mu^2}$. The model is then $g(\mu) = X\beta$, where $Y$ follows an inverse-Gaussian distribution with link function $g(\mu) = \frac{1}{\mu^2}$.

In [7]:
invgauslin1  <-  glm(timesm~agem,family="inverse.gaussian")
summary(invgauslin1)

Call:
glm(formula = timesm ~ agem, family = "inverse.gaussian")

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-0.0048497  -0.0023005  -0.0000681   0.0005923   0.0123111  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.958e-05  3.892e-06   25.59 3.75e-11 ***
agem        -9.617e-07  4.179e-08  -23.01 1.18e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for inverse.gaussian family taken to be 2.180685e-05)

    Null deviance: 0.01111018  on 12  degrees of freedom
Residual deviance: 0.00020857  on 11  degrees of freedom
AIC: 102.69

Number of Fisher Scoring iterations: 4
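As a quick numerical sanity check (a sketch, not part of the original notebook): under the canonical $\frac{1}{\mu^2}$ link, the fitted means should equal the inverse square root of the linear predictor.

# mu = 1/sqrt(eta), so these two vectors should agree
all.equal(as.numeric(fitted(invgauslin1)),
          as.numeric(1 / sqrt(predict(invgauslin1))))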

Both fits are quite difficult to interpret by themselves, as the coefficients act on the link scale (the transformed mean) rather than on time directly. The predictions for the Gamma model are as follows:

In [8]:
predict(gammalin1)  # linear predictor, i.e. 1/mu

pred1 <- 1/predict(gammalin1)           # invert the 1/mu link
pred2 <- 1/((predict(invgauslin1)))^.5  # invert the 1/mu^2 link: mu = 1/sqrt(eta)
pred3 <- predict(lin1)

sum((pred1-timesm)^2)

sum((pred2-timesm)^2)
sum((pred3-timesm)^2)
sum((predict(lin2)-timesm)^2)
          1           2           3           4           5           6           7 
0.009045031 0.008516406 0.007987781 0.007459155 0.006930530 0.006401905 0.005873280 
          8           9          10          11          12          13 
0.005344654 0.004816029 0.004287404 0.003758779 0.003230153 0.002172903 

6398.61022699518
5598.12999367698
44389.3243635469
9086.8812639335

Here, we can see the sum of squared errors of the back-transformed predictions from each model. The Gamma model does an exceptional job of predicting the response, time: it beats both the linear and the quadratic ordinary least squares models. The inverse Gaussian performs the best of the three candidate distributions.
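Equivalently (a sketch, not part of the original notebook), predict() can apply the inverse link for us with type = "response", avoiding the manual transformations above:

pred1b <- predict(gammalin1, type = "response")    # same as 1/predict(gammalin1)
pred2b <- predict(invgauslin1, type = "response")  # same as 1/sqrt(predict(invgauslin1))
sum((pred1b - timesm)^2)
sum((pred2b - timesm)^2)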

Below, I do the same thing, except using the identity link. This produces more easily interpretable coefficients, but they predict less accurately: the identity-link fits have noticeably larger sums of squared errors.

In [9]:
gammalin2 <- glm(timesm~agem,family=Gamma(link="identity"))
summary(gammalin2)

invgauslin2 <- glm(timesm~agem,family=inverse.gaussian(link=identity))
summary(invgauslin2)
predict(gammalin2)

predict(invgauslin2)

sum((predict(gammalin2)-timesm)^2)

sum((predict(invgauslin2)-timesm)^2)

Call:
glm(formula = timesm ~ agem, family = Gamma(link = "identity"))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.21209  -0.18853  -0.09241   0.08858   0.54268  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -8.3100    38.5463  -0.216 0.833256    
agem          3.1548     0.6898   4.574 0.000799 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Gamma family taken to be 0.06523248)

    Null deviance: 2.42365  on 12  degrees of freedom
Residual deviance: 0.60806  on 11  degrees of freedom
AIC: 138.73

Number of Fisher Scoring iterations: 9

Call:
glm(formula = timesm ~ agem, family = inverse.gaussian(link = identity))

Deviance Residuals: 
      Min         1Q     Median         3Q        Max  
-0.012827  -0.011721  -0.006183   0.004403   0.032956  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.6841    32.6336   0.327 0.749516    
agem          2.7958     0.6252   4.472 0.000944 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for inverse.gaussian family taken to be 0.000289055)

    Null deviance: 0.0111102  on 12  degrees of freedom
Residual deviance: 0.0024039  on 11  degrees of freedom
AIC: 134.47

Number of Fisher Scoring iterations: 10
       1        2        3        4        5        6        7        8        9 
102.1072 117.8811 133.6549 149.4288 165.2027 180.9766 196.7505 212.5244 228.2982 
      10       11       12       13 
244.0721 259.8460 275.6199 307.1677 

       1        2        3        4        5        6        7        8        9 
108.5357 122.5145 136.4933 150.4721 164.4509 178.4297 192.4086 206.3874 220.3662 
      10       11       12       13 
234.3450 248.3238 262.3026 290.2602 

52728.0806590428
58577.3835961494