Here we will be doing MANOVA for the dataset below. We want to know if the means of tear, gloss and opacity are statistically significantly different from one another depending on the rate groups -- Low or High.
Data from manova example: Krzanowski (1998, p. 381)
In evaluating MANOVA (Multivariate Analysis of Variance), we can consider a few different approaches to estimating the "size" of deviance from the between (the matrix $H$) and the within (the matrix $E$) distances. Each matrix represents something similar to the univariate ANOVA example, but we are now comparing the differences in matrices rather than in vectors. This makes the task more difficult, as we cannot simply use the F-distribution as a comparison of squared differences -- our results will also look like matrices, not single numbers.
There have been four proposed solutions to this problem, named after Wilks, Roys, Pillai and Hotelling respectively. Each of these attempts to quantify the magnitude of the between matrix divided by the within matrix using the determinant or the eigenvalues of the product of these two matrices. If this value is greater than a certain threshold, we conclude that the groups are different, via a similar intuition with ANOVA. However, unlike ANOVA, no such simple distribution exists to describe the two error matrices.
We also implement here the F approximations for each of these statistics to quantify our results.
return_MANOVA_results <- function(dataset, group_column) {
F_dists <- find_approx_F_dist(dataset, group_column)
results <- 1 - sapply(X = F_dists[c('Wilks','Roy','Pillai','Hotelling')], FUN = function(obj) pf(obj[1], obj[2], obj[3]))
results <- rbind(p_f = results, statistic = F_dists$untransformed[1:4],
data.frame(F_dists[c('Wilks','Roy','Pillai','Hotelling')],row.names = c('F Approx.', 'df1', 'df2')))
return(results)
}
find_approx_F_dist <- function(dataset, group_column) {
# dataset (dataframe): dataset with groupings as integer values.
# group_column (string): string name of column with group entries, integer values.
MANOVA_stat_lst <- find_MANOVA_statistics(dataset,group_column)
Vh <- MANOVA_stat_lst['n_groups'] - 1
p <- ncol(dataset) - 1
Ve <- nrow(dataset) - Vh - 1
s <- min(Vh, p)
Wilks_approx_F <- FforWilks(Lambda = MANOVA_stat_lst['Wilks'], Vh = Vh, p = p, Ve = Ve)
Roy_approx_F <- FBoundforRoy(R = MANOVA_stat_lst['Roys'], Vh = Vh, p = p, Ve = Ve)
Pillai_approx_F <- FapproxforPillai(P = MANOVA_stat_lst['Pillai'], Vh = Vh, p = p, Ve = Ve, s = s)
Hotelling_approx_F <- FapproxforHotelling(H = MANOVA_stat_lst['Hotelling'], Vh = Vh, p = p, Ve = Ve)
return(list(untransformed = MANOVA_stat_lst, Wilks = Wilks_approx_F,Roy = Roy_approx_F, Pillai = Pillai_approx_F, Hotelling = Hotelling_approx_F))
}
find_MANOVA_statistics <- function(dataset, group_column){
# dataset (dataframe): dataset with groupings as integer values.
# group_column (string): string name of column with group entries, integer values.
grouped_ds_lst <- lapply(1:length(unique(dataset[,group_column])), function(i) dataset[dataset[,group_column] == i,!names(dataset) %in% c(group_column)])
H <- Reduce("+", lapply(grouped_ds_lst,function(g) hypothesis_iv(g, dataset)))
E <- Reduce("+", lapply(grouped_ds_lst,function(g) error_iv(g, dataset)))
Einv <- solve(E)
HoverE <- H %*% Einv
eigen_det <- eigen(HoverE)
ev <- eigen_det$values
Wilks <- 1/(prod(1 + ev))
Roys <- (max(ev))/(1 + max(ev))
Pillai <- sum(ev/(1 + ev))
Hotelling <- sum(ev)
return(c(Wilks = Wilks,Roys = Roys,Pillai = Pillai,Hotelling = Hotelling, n_groups = length(grouped_ds_lst)))
}
hypothesis_iv <- function(group,dataset) {
A <- sapply(group,mean)
B <- sapply(dataset[, 1:(ncol(dataset) - 1)],mean)
dif <- A - B
return((dif %*% t(dif))*nrow(group))
}
error_iv <- function(group,dataset) {
A <- sapply(group,mean)
mattimestranspose <- function(A)
return(A %*% t(A))
mylist <- lapply(1:nrow(group), function(i) mattimestranspose(t(as.matrix(group[i,] - A))))
a1 <- Reduce('+',mylist)
return(a1)
}
FforWilks <- function(Lambda, Vh, p, Ve) {
t <- (((p^2)*(Vh^2) - 4)/((p^2) + (Vh^2) - 5))^.5
df1 <- p*Vh
w <- Ve + Vh - ((p + Vh + 1)/2)
df2 <- w*t - (((p*Vh) - 2)/2)
Fapprox <- ((1 - (Lambda^(1/t)))/(Lambda^(1/t)))*(df2/df1)
return(c(Fapprox = Fapprox, df1 = df1, df2 = df2))
}
FBoundforRoy <- function(R, Vh, p, Ve) {
v1 <- (Ve - p - 1)/2
v2 <- (p - Vh - 1)/2
return(c(Fapprox = R*v1/v2, df1 = 2*v1 + 2, df2 = 2*v2 + 2))
}
FapproxforPillai <- function(P, Vh, p, Ve, s) {
N <- (Ve - Vh + s)/2
m <- (abs(Vh - p) - 1)/2
num <- (2*N + s + 1)*P
den <- (2*m + s + 1)*(s - P)
Fapprox <- num/den
df1 <- s*(2*m + s + 1)
df2 <- s*(2*N + s + 1)
return(c(Fapprox = Fapprox, df1 = df1, df2 = df2))
}
FapproxforHotelling <- function(H, Vh, p, Ve) {
a <- p*Vh
B <- ((Ve + Vh - p - 1)*(Ve - 1))/((Ve - p - 3)*(Ve - p))
b <- 4 + ((a + 2)/(B - 1))
cnum <- a*(b - 2)
cden <- b*(Ve - p + 1)
Fapprox <- (H*cden)/cnum
return(c(Fapprox = Fapprox, df1 = a, df2 = b))
}
tear <- c(6.5, 6.2, 5.8, 6.5, 6.5, 6.9, 7.2, 6.9, 6.1, 6.3,
6.7, 6.6, 7.2, 7.1, 6.8, 7.1, 7.0, 7.2, 7.5, 7.6)
gloss <- c(9.5, 9.9, 9.6, 9.6, 9.2, 9.1, 10.0, 9.9, 9.5, 9.4,
9.1, 9.3, 8.3, 8.4, 8.5, 9.2, 8.8, 9.7, 10.1, 9.2)
opacity <- c(4.4, 6.4, 3.0, 4.1, 0.8, 5.7, 2.0, 3.9, 1.9, 5.7,
2.8, 4.1, 3.8, 1.6, 3.4, 8.4, 5.2, 6.9, 2.7, 1.9)
rate <- gl(2,10, labels = c("Low", "High"))
Y <- data.frame(cbind(tear,gloss,opacity,rate))
return_MANOVA_results(Y, 'rate')
We see that all four methods approximately agree that this would be unlikely if the two groups followed the same Multivariate Normal Distribution.