Stats Works

Mixed Models vs GEE: Comparing the Two Hierarchical Models

This is an example using simulated ('fake') data to show the difference between lmer (which assumes a constant, independent covariance matrix for the error terms) and geepack (generalized estimating equations, which can handle an unspecified working covariance structure).

This example highlights the advantage of a GEE over an HLM when the covariance structure differs at different levels of the data. The GEE can be seen as a population-averaged model -- i.e., it is preferable when we are making inferences about the population rather than about the individuals in the sample. We do not explicitly parametrize the groups -- instead, we assume some covariance structure among the members of each group and estimate the marginal model.

The problem is formulated as follows: consider 6 athletes with characteristics given by:

In [1]:
library(lme4)
library(geepack)
Loading required package: Matrix
In [2]:
athlete <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6)  # athlete id
age     <- c(38, 40, 43, 53, 55, 56, 58, 37, 40, 42, 41, 45, 46, 54, 58, 57, 60, 62)
club    <- c(0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1)  # club membership (0/1)
time    <- c(95, 94, 93, 96, 98, 91, 93, 83, 82, 82, 91, 94, 99, 105, 111, 90, 89, 95)
logtime <- log(time)

This is an example of a repeated measures problem.

We create a dataframe from these elements:

In [3]:
df<-data.frame(athlete,age,club,time,logtime)
df
athlete  age  club  time  logtime
1 38 0 95 4.553877
1 40 0 94 4.543295
1 43 1 93 4.532599
2 53 1 96 4.564348
2 55 1 98 4.584967
2 56 1 91 4.510860
2 58 1 93 4.532599
3 37 1 83 4.418841
3 40 1 82 4.406719
3 42 0 82 4.406719
4 41 0 91 4.510860
4 45 1 94 4.543295
4 46 0 99 4.595120
5 54 1 105 4.653960
5 58 1 111 4.709530
6 57 1 90 4.499810
6 60 1 89 4.488636
6 62 1 95 4.553877

Now, we want to estimate the (log) time based on age and club. However, the data is clearly correlated -- each athlete contributes several measurements. Suppose we believe that the baseline average (log) time differs across individuals. This is a mixed model, more specifically a random intercept model.
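A minimal sketch of the random-intercept model fit below (the notation is mine, not from the original post; i indexes athletes, j their repeated measurements):

```latex
\log(\mathrm{time}_{ij}) = \beta_0 + \beta_1\,\mathrm{club}_{ij} + \beta_2 \log(\mathrm{age}_{ij}) + u_i + \varepsilon_{ij},
\qquad u_i \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2),
```

with the u_i and ε_ij mutually independent. This corresponds to mod1; mod2 simply drops the club term.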

In [4]:
mod1<-lmer(logtime~club+log(age)+(1|athlete),data=df,REML=FALSE)
mod2<-lmer(logtime~log(age)+(1|athlete),data=df,REML=FALSE)

summary(mod1)
summary(mod2)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: logtime ~ club + log(age) + (1 | athlete)
   Data: df

     AIC      BIC   logLik deviance df.resid 
   -49.2    -44.7     29.6    -59.2       13 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.3445 -0.7551 -0.1840  0.7895  1.2651 

Random effects:
 Groups   Name        Variance  Std.Dev.
 athlete  (Intercept) 0.0042173 0.06494 
 Residual             0.0008858 0.02976 
Number of obs: 18, groups:  athlete, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)  3.78158    0.45714   8.272
club        -0.01049    0.02101  -0.499
log(age)     0.19753    0.11837   1.669

Correlation of Fixed Effects:
         (Intr) club  
club      0.184       
log(age) -0.998 -0.216
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: logtime ~ log(age) + (1 | athlete)
   Data: df

     AIC      BIC   logLik deviance df.resid 
   -50.9    -47.4     29.5    -58.9       14 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.3343 -0.6924 -0.3215  0.7936  1.2927 

Random effects:
 Groups   Name        Variance  Std.Dev.
 athlete  (Intercept) 0.0042948 0.06554 
 Residual             0.0008962 0.02994 
Number of obs: 18, groups:  athlete, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)   3.8241     0.4527   8.448
log(age)      0.1846     0.1164   1.586

Correlation of Fixed Effects:
         (Intr)
log(age) -0.998
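Since both models were fit by maximum likelihood, they can be compared with a likelihood ratio test (anova(mod2, mod1) would do this directly in lme4). A back-of-envelope sketch using the deviances printed in the two summaries above:

```r
# Deviances (-2 * logLik) reported in the two lmer summaries above
dev_mod1 <- -59.2  # logtime ~ club + log(age) + (1 | athlete)
dev_mod2 <- -58.9  # logtime ~ log(age) + (1 | athlete)

# LR statistic: difference in deviance, chi-squared with 1 df
# (mod1 has one extra parameter: club)
lr_stat <- dev_mod2 - dev_mod1
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
round(p_value, 3)  # 0.584: club does not significantly improve the fit
```

This agrees with the AIC comparison: dropping club lowers the AIC from -49.2 to -50.9.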

However, this assumes that the error covariance structure at each level is independent. This assumption is overly simplistic -- the repeated measurements within an individual are likely correlated (for instance, it appears that club membership tracks particular age ranges). To account for this, we could add more random effects to the model (which would break the data down into even smaller groups), or we could use a GEE to forgo the subject-level estimation and estimate population-level effects, assuming some covariance structure within each group.
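As a concrete sketch of the "exchangeable" working structure used below (exch_corr is my own helper, not a geepack function), for a cluster of size n it puts a single common correlation alpha on every off-diagonal:

```r
# Exchangeable (compound symmetry) working correlation for a cluster of size n:
# 1 on the diagonal, a single correlation alpha everywhere else.
exch_corr <- function(n, alpha) {
  R <- matrix(alpha, nrow = n, ncol = n)
  diag(R) <- 1
  R
}

exch_corr(3, 0.61)
#      [,1] [,2] [,3]
# [1,] 1.00 0.61 0.61
# [2,] 0.61 1.00 0.61
# [3,] 0.61 0.61 1.00
```

geeglm estimates alpha from the data; here each athlete forms one cluster, identified by the id argument.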

In [5]:
mod3<-geeglm(logtime~club+log(age),data=df,id=athlete,family=gaussian,corstr="exchangeable")
mod4<-geeglm(logtime~log(age),data=df,id=athlete,family=gaussian,corstr="exchangeable")

summary(mod3)
summary(mod4)

Call:
geeglm(formula = logtime ~ club + log(age), family = gaussian, 
    data = df, id = athlete, corstr = "exchangeable")

 Coefficients:
            Estimate  Std.err   Wald Pr(>|W|)    
(Intercept)  3.67345  0.42962 73.109   <2e-16 ***
club        -0.01429  0.01422  1.010   0.3149    
log(age)     0.22584  0.11276  4.011   0.0452 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Estimated Scale Parameters:
            Estimate  Std.err
(Intercept) 0.004443 0.001609

Correlation: Structure = exchangeable  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha   0.6112  0.1829
Number of clusters:   6   Maximum cluster size: 4 

Call:
geeglm(formula = logtime ~ log(age), family = gaussian, data = df, 
    id = athlete, corstr = "exchangeable")

 Coefficients:
            Estimate Std.err  Wald Pr(>|W|)    
(Intercept)    3.742   0.488 58.84  1.7e-14 ***
log(age)       0.206   0.127  2.61     0.11    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Estimated Scale Parameters:
            Estimate Std.err
(Intercept)  0.00458 0.00154

Correlation: Structure = exchangeable  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha    0.624   0.181
Number of clusters:   6   Maximum cluster size: 4 

The main shift in inference here is from a likelihood-based model (generally preferred when its assumptions hold) to a semi-parametric model that depends only on the first two moments. GEEs are better estimators of population-level effects -- they cannot account for individual differences explicitly, as a GLMM can. In this example, however, we see that the GEE finds the log(age) coefficient significant at the 5% level where the previous model had not. Depending on the goal of the analysis, the GEE yields asymptotically valid population-averaged estimates at the cost of subject-level inference.
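One way to see the connection between the two fits: a random-intercept model itself implies an exchangeable within-athlete correlation, rho = sigma_u^2 / (sigma_u^2 + sigma^2). A quick check using the variance components printed in the mod1 summary (my own back-of-envelope comparison, not output from the post):

```r
# Variance components from the lmer (mod1) summary above
sigma2_u <- 0.0042173  # athlete (intercept) variance
sigma2_e <- 0.0008858  # residual variance

# Implied exchangeable within-athlete correlation under the random-intercept model
rho_lmm <- sigma2_u / (sigma2_u + sigma2_e)
round(rho_lmm, 3)  # 0.826
```

The GEE's working correlation estimate (alpha ≈ 0.61) is noticeably lower: the GEE lets the data pull the within-cluster correlation away from the value the random-intercept structure would impose.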

Sean Ammirati, creator of Stats Works. He can be reached via Github, LinkedIn, and email.


Published

Jan 4, 2018

Category

Linear Models

Tags

  • general estimating equations
  • generalized linear models
  • generalized mixed models
  • linear regression
  • mixed models
  • theory
