This is an example using 'fake' (simulated) data to illustrate the difference between lmer (which assumes independent errors with a constant variance, given the random effects) and geepack (generalized estimating equations, which allow an unspecified working covariance structure).
This example highlights the advantage of a GEE over an HLM when the covariance structure differs at different levels of the data. The GEE can be seen as a population-averaged (marginal) model -- i.e., it is preferable when we are trying to make inferences about the population rather than about the individuals in the sample. We do not explicitly parametrize the group effects; instead, we posit a working covariance structure among the members of each group and estimate the marginal model.
The problem is formulated as follows: consider 6 athletes with characteristics given by:
library(lme4)     # linear mixed models (lmer)
library(geepack)  # generalized estimating equations (geeglm)

Loading required package: Matrix
athlete <- c(1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,6,6,6)
age     <- c(38,40,43,53,55,56,58,37,40,42,41,45,46,54,58,57,60,62)
club    <- c(0,0,1,1,1,1,1,1,1,0,0,1,0,1,1,1,1,1)
time    <- c(95,94,93,96,98,91,93,83,82,82,91,94,99,105,111,90,89,95)
logtime <- log(time)
This is an example of a repeated measures problem.
We create a dataframe from these elements:
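That step is not shown above; a minimal sketch (repeating the vectors so the chunk runs on its own) would be:

```r
# Vectors repeated from above so this chunk is self-contained
athlete <- c(1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,6,6,6)
age     <- c(38,40,43,53,55,56,58,37,40,42,41,45,46,54,58,57,60,62)
club    <- c(0,0,1,1,1,1,1,1,1,0,0,1,0,1,1,1,1,1)
time    <- c(95,94,93,96,98,91,93,83,82,82,91,94,99,105,111,90,89,95)

# Combine into a data frame; rows are already sorted by athlete,
# which is what geeglm's id argument expects for clustering
df <- data.frame(athlete, age, club, time, logtime = log(time))
```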
Now, we want to model the (log) time as a function of age and club. However, the data are clearly correlated -- repeated measurements on the same individual are not independent. Suppose we believe that, as a baseline, the average (log) time differs across individuals. This is a mixed model, more specifically a random-intercept model.
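The random-intercept model for measurement $j$ on athlete $i$ can be written as:

$$\log(\text{time}_{ij}) = \beta_0 + \beta_1\,\text{club}_{ij} + \beta_2\,\log(\text{age}_{ij}) + u_i + \varepsilon_{ij},$$

$$u_i \sim N(0, \sigma_u^2), \qquad \varepsilon_{ij} \sim N(0, \sigma^2),$$

with the $u_i$ and $\varepsilon_{ij}$ mutually independent. The second model (mod2) is the same with the club term dropped.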
mod1 <- lmer(logtime ~ club + log(age) + (1 | athlete), data = df, REML = FALSE)
mod2 <- lmer(logtime ~ log(age) + (1 | athlete), data = df, REML = FALSE)
summary(mod1)
summary(mod2)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: logtime ~ club + log(age) + (1 | athlete)
   Data: df

     AIC      BIC   logLik deviance df.resid
   -49.2    -44.7     29.6    -59.2       13

Scaled residuals:
    Min      1Q  Median      3Q     Max
-1.3445 -0.7551 -0.1840  0.7895  1.2651

Random effects:
 Groups   Name        Variance  Std.Dev.
 athlete  (Intercept) 0.0042173 0.06494
 Residual             0.0008858 0.02976
Number of obs: 18, groups:  athlete, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)  3.78158    0.45714   8.272
club        -0.01049    0.02101  -0.499
log(age)     0.19753    0.11837   1.669

Correlation of Fixed Effects:
         (Intr) club
club      0.184
log(age) -0.998 -0.216
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: logtime ~ log(age) + (1 | athlete)
   Data: df

     AIC      BIC   logLik deviance df.resid
   -50.9    -47.4     29.5    -58.9       14

Scaled residuals:
    Min      1Q  Median      3Q     Max
-1.3343 -0.6924 -0.3215  0.7936  1.2927

Random effects:
 Groups   Name        Variance  Std.Dev.
 athlete  (Intercept) 0.0042948 0.06554
 Residual             0.0008962 0.02994
Number of obs: 18, groups:  athlete, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)   3.8241     0.4527   8.448
log(age)      0.1846     0.1164   1.586

Correlation of Fixed Effects:
         (Intr)
log(age) -0.998
However, this assumes that the errors are independent within each level. That assumption is too simplistic -- the covariates are likely correlated with the group structure (for instance, club membership appears to be concentrated in particular age ranges). To account for this, we could add more random effects to the model (which would break the data down into even smaller groups), or we could use a GEE, forgoing subject-level estimation in favour of population-level estimation under an assumed working covariance structure within groups.
mod3 <- geeglm(logtime ~ club + log(age), data = df, id = athlete,
               family = gaussian, corstr = "exchangeable")
mod4 <- geeglm(logtime ~ log(age), data = df, id = athlete,
               family = gaussian, corstr = "exchangeable")
summary(mod3)
summary(mod4)
Call:
geeglm(formula = logtime ~ club + log(age), family = gaussian,
    data = df, id = athlete, corstr = "exchangeable")

 Coefficients:
            Estimate Std.err   Wald Pr(>|W|)
(Intercept)  3.67345 0.42962 73.109   <2e-16 ***
club        -0.01429 0.01422  1.010   0.3149
log(age)     0.22584 0.11276  4.011   0.0452 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Estimated Scale Parameters:
            Estimate  Std.err
(Intercept) 0.004443 0.001609

Correlation: Structure = exchangeable  Link = identity
Estimated Correlation Parameters:
      Estimate Std.err
alpha   0.6112  0.1829
Number of clusters:   6  Maximum cluster size: 4
Call:
geeglm(formula = logtime ~ log(age), family = gaussian, data = df,
    id = athlete, corstr = "exchangeable")

 Coefficients:
            Estimate Std.err  Wald Pr(>|W|)
(Intercept)    3.742   0.488 58.84  1.7e-14 ***
log(age)       0.206   0.127  2.61     0.11
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Estimated Scale Parameters:
            Estimate Std.err
(Intercept)  0.00458 0.00154

Correlation: Structure = exchangeable  Link = identity
Estimated Correlation Parameters:
      Estimate Std.err
alpha    0.624   0.181
Number of clusters:   6  Maximum cluster size: 4
The main shift in inference here is from likelihood-based models (which are generally preferred when their assumptions hold) to a semi-parametric model that depends only on the first two moments. GEEs are better estimators of population-level effects -- they cannot account for individual differences explicitly, as a GLMM can. In this example, however, we see that the GEE finds log(age) significant where the previous model had not. Depending on the goal of the analysis, the GEE will produce better population-level estimates asymptotically, at the cost of subject-level inference.
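To see how much the working correlation assumption matters, one can refit the same GEE mean model under an independence working structure and compare the robust (sandwich) standard errors. This is a sketch, self-contained by repeating the data; the corstr values are those accepted by geepack:

```r
library(geepack)

# Data repeated from above so this chunk runs on its own
athlete <- c(1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,6,6,6)
age     <- c(38,40,43,53,55,56,58,37,40,42,41,45,46,54,58,57,60,62)
club    <- c(0,0,1,1,1,1,1,1,1,0,0,1,0,1,1,1,1,1)
logtime <- log(c(95,94,93,96,98,91,93,83,82,82,91,94,99,105,111,90,89,95))
df <- data.frame(athlete, age, club, logtime)

# Same mean model, two working correlation structures
gee_exch <- geeglm(logtime ~ club + log(age), data = df, id = athlete,
                   family = gaussian, corstr = "exchangeable")
gee_ind  <- geeglm(logtime ~ club + log(age), data = df, id = athlete,
                   family = gaussian, corstr = "independence")

# Robust standard errors side by side; the point estimates also shift
# because the working correlation reweights the observations
cbind(exchangeable = summary(gee_exch)$coefficients[, "Std.err"],
      independence = summary(gee_ind)$coefficients[, "Std.err"])
```

Because the sandwich estimator protects against misspecification of the working correlation, both fits give consistent estimates; a well-chosen structure mainly buys efficiency.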