
Frequentist and Bayesian Approaches to Contingency Tables

Chi-Squared Test

Here we will investigate a few ways of dealing with contingency tables.

The first two are standard frequentist procedures: the chi-squared test and Fisher's exact test.

Consider a contingency table that looks like this:

In [4]:
row_names = c('Has traveled outside US','Has not traveled outside US')
column_names = c('Non-Artist','Artist')

cont_table <- matrix(c(5, 15, 30, 20), 
                     nrow = 2,
                     dimnames = list(row_names, column_names))
cont_table
                            Non-Artist  Artist
Has traveled outside US              5      30
Has not traveled outside US         15      20

Suppose we ask the question: are artists and non-artists equally likely to have traveled outside of the US?

The first approach we can take is the chi-squared test, which is asymptotically accurate. If each group were equally likely to travel, we would expect a contingency table like this:

In [7]:
cont_table_exp <- matrix(c(10, 10, 25, 25), 
                         nrow = 2,
                         dimnames = list(row_names, column_names))

cont_table_exp
                            Non-Artist  Artist
Has traveled outside US             10      25
Has not traveled outside US         10      25

If we assume that artistry and travel likelihood are independent, each observed count's deviation from its expectation, divided by the square root of that expectation, should be roughly standard normal, and so the sum of the squared standardized deviations will be approximately $\chi ^{2}\left ( 1 \right )$.

Here, we assume that the observed counts are Poisson distributed, so that each count has expectation and variance both equal to $E_{i}$, and standard deviation $\sqrt{E_{i}}$. As $n \to \infty$, the standardized count $(O_{i} - E_{i})/\sqrt{E_{i}}$ is asymptotically standard normal, given the hypothesis of independence.

Because we know the row and column totals, in this case we have one degree of freedom: once one cell of the contingency table is known, the row and column totals determine the others (here, knowing the 5 forces the 15, 30, and 20). In order to calculate an (approximate) probability of this example given independence (the null hypothesis), we perform the following calculation.

In [8]:
zsq <- sum((cont_table - cont_table_exp)^2 / cont_table_exp)
1 - pchisq(zsq,1)
0.00815097159350264

So we would reject the notion of independence based on these observations.

More generally:

In [15]:
chi_squared_test <- function(cont_table, cont_table_exp = NULL){
  if (is.null(cont_table_exp)) {
    # Expected counts under independence:
    # (row total * column total) / grand total
    cont_table_exp <- outer(rowSums(cont_table), colSums(cont_table)) /
      sum(cont_table)
  }
  df <- (nrow(cont_table) - 1) * (ncol(cont_table) - 1)
  zsq <- sum((cont_table - cont_table_exp)^2 / cont_table_exp)
  probability <- 1 - pchisq(zsq, df)
  return(probability)
}
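As a quick check, the function should reproduce the manual calculation above; base R's chisq.test, with the Yates continuity correction disabled, computes the same statistic. A minimal usage sketch:

chi_squared_test(cont_table)
# Base R equivalent, continuity correction turned off:
chisq.test(cont_table, correct = FALSE)$p.value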

Fisher's Exact Test

However, the chi-squared test is only asymptotically accurate. If we instead consider the counts to follow a hypergeometric distribution, we can use Fisher's exact test to calculate the exact probability, under independence, of a table like the one observed.

Given the margins of the contingency table, the probability of any particular arrangement of values under independence is as follows:
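In symbols, writing $m$ and $n$ for the two row totals, $j$ for a column total, and $i$ for the count in the corresponding cell, this is the hypergeometric probability

$$P(X = i) = \frac{\binom{m}{i}\binom{n}{j-i}}{\binom{m+n}{j}}$$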

In [16]:
probfun <- function(i, j, n, m){
  # i: the count in one cell (for instance, non-artists who have traveled)
  # j: the column total containing that cell (for instance, all non-artists)
  # m: the row total containing that cell (for instance, people who have traveled)
  # n: the other row total (for instance, people who have not traveled)

  choose(m, i) * choose(n, (j - i)) / choose(m + n, j)
}
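Since this is exactly the hypergeometric density, it can be sanity-checked against base R's dhyper:

probfun(5, 20, 35, 35)
dhyper(5, m = 35, n = 35, k = 20)  # identical by construction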

This gives us the exact probability of each possible cell count. If we wanted to find the probability that the number of non-artists who have traveled is less than or equal to five, we would do the following in this case:

In [17]:
sum(sapply(0:5,function(x) probfun(x, 20, 35, 35)))
0.008027370811152

This shows that the probability of what we have observed is quite small if the columns and rows are independent. We again reject independence at the 1% significance level.

The results of this test and the chi-squared test are also quite close, which is as expected: as $n$ goes to infinity, the chi-squared test converges to Fisher's exact test.

This is the hypergeometric distribution, which is included in R; phyper gives the same result:

In [18]:
phyper(5, 35, 35, 20)  # same result
0.00802737081115199
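Base R's fisher.test performs this same conditional calculation; a one-sided test should reproduce the probability above (a minimal sketch):

fisher.test(cont_table, alternative = "less")$p.value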

These are well-known frequentist tests. However, we can add some Bayesian analysis to the mix here!

Bayesian Approach -- Using Beta Distributions

Let us assume that the probability of having traveled, for artists and for non-artists, can be modeled using Beta distributions. That is, $P(\text{travel} \mid \text{non-artist}) = p \sim \mathrm{Beta}(\alpha_{1}, \beta_{1})$ and $P(\text{travel} \mid \text{artist}) = q \sim \mathrm{Beta}(\alpha_{2}, \beta_{2})$.

We can now make an approximation to determine the probability that $p < q$. That is, we want the posterior distribution of $p$ and $q$ given our contingency table. Because the likelihood for each group is binomial and the Beta distribution is its conjugate prior, the posterior distributions are also Beta distributed. After doing some manipulations, we can write the joint posterior of $p$ and $q$.
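Concretely, conjugacy gives a joint posterior of the form

$$\pi(p, q \mid \text{data}) \propto p^{a_{1}-1}(1-p)^{b_{1}-1}\, q^{a_{2}-1}(1-q)^{b_{2}-1},$$

where each exponent parameter is the corresponding prior parameter plus the matching observed count.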

If we assume improper "uninformative" priors, setting $\alpha = \beta = 0$ so that the posterior exponents are simply the observed counts minus one, we get the following:

In [50]:
posteriorb <- function(p, q, cont_table, num = FALSE){
  # When num = TRUE, zero out the region p >= q, so that summing this
  # function over a grid gives the numerator of P(p < q)
  if (num && p >= q) {
    return(0)
  }

  obs_row_totals <- apply(cont_table, 1, sum)
  obs_coordinates <- c(cont_table)
  # With the improper alpha = beta = 0 prior, the posterior exponents
  # are the observed counts minus one
  obs_rows_w_prior <- cont_table - 1

  # Normalizing constant of the two Beta kernels
  constant <- prod(gamma(obs_row_totals)) / prod(gamma(obs_coordinates))
  p_mat <- matrix(c(p, q, 1 - p, 1 - q), nrow = 2)
  ret <- constant * prod(p_mat ^ obs_rows_w_prior)
  return(ret)
}

Note that this will only work for 2x2 contingency tables, and it assumes the (improper) prior $\alpha = \beta = 0$ for each group. If we wanted to include prior information, we could choose different values of $\alpha$ and $\beta$ for each group, which would change the exponents above.

The probability that $p < q$ under this joint distribution is somewhat difficult to determine analytically. There are alternatives here for estimating it (gtools, MCMCpack, etc.), but I have included a numerical approximation.

In [53]:
approx <- function(f, num = FALSE, ...) {
  # Grid approximation: evaluate f on a 99 x 99 grid of (p, q) values
  # in (0.01, 0.99) and sum (note: this masks base R's approx)
  grid <- seq(0.01, 0.99, 0.01)
  sum(sapply(grid, function(p) sapply(grid, function(q) f(p, q, num, ...))))
}

estim <- approx(posteriorb,TRUE, cont_table = cont_table)/
         approx(posteriorb, cont_table = cont_table)
estim
0.996557074981106

We estimate $P(p < q) \approx 0.9966$. Here, we step through individual values of $p$ and $q$ on a grid, evaluate the posterior density at each point, and take the ratio of the mass where $p < q$ to the total mass.
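As an independent sanity check on the grid approximation (a minimal sketch, under the same improper prior): the kernel in posteriorb factors into two Beta kernels whose parameters are the rows of cont_table, so we can sample from them directly and estimate $P(p < q)$ by Monte Carlo.

set.seed(1)  # arbitrary seed, just for reproducibility
p_draws <- rbeta(1e5, 5, 30)   # Beta kernel from the first row of cont_table
q_draws <- rbeta(1e5, 15, 20)  # Beta kernel from the second row
mean(p_draws < q_draws)        # should land close to the grid estimate above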

Bayesian Approach -- Assuming a Dependence Structure

Let's use another formulation that allows $p$ and $q$ to depend on one another. It is unlikely that $p$ and $q$ are distributed independently: there is presumably an underlying 'willingness to travel' factor present in both groups, as we saw when we considered the earlier tests. If we assume that the difference in log-odds, $\ln \frac{p}{1-p} - \ln \frac{q}{1-q}$, is normally distributed with mean 0 and some arbitrary variance $\sigma^{2}$, we allow for a dependence structure whose strength depends on the $\sigma^{2}$ we select: a small $\sigma^{2}$ imposes a stronger dependence. We can then test, for a given level of expected variance in the difference, whether the two proportions differ.
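In other words, the prior contributes a factor

$$\exp\left(-\frac{\left(\operatorname{logit}(p) - \operatorname{logit}(q)\right)^{2}}{2\sigma^{2}}\right), \quad \text{where } \operatorname{logit}(x) = \ln\frac{x}{1-x},$$

which multiplies the same product of Beta kernels as before.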

In [42]:
posteriorc <- function(p, q, cont_table, num = FALSE, sigmasq = 1){
  # As before, zero out p >= q when computing the numerator of P(p < q)
  if (num && p >= q) {
    return(0)
  }

  obs_rows_w_prior <- cont_table - 1

  logit1 <- log(p / (1 - p))
  logit2 <- log(q / (1 - q))

  # Normal prior on the difference in log-odds: mean 0, variance sigmasq
  expo <- exp(-(logit1 - logit2)^2 / (2 * sigmasq))
  p_mat <- matrix(c(p, q, 1 - p, 1 - q), nrow = 2)
  ret <- expo * prod(p_mat ^ obs_rows_w_prior)
  return(ret)
}

We then approximate $P(p < q)$ given various assumed values of $\sigma^{2}$ in the prior.

In [52]:
estimb1 <- approx(posteriorc, num = TRUE, 
                  cont_table = cont_table, sigmasq = 1)/
           approx(posteriorc,  num = FALSE, 
                  cont_table = cont_table, sigmasq = 1)
estimb2 <- approx(posteriorc, num = TRUE, 
                  cont_table = cont_table, sigmasq = 0.25)/
           approx(posteriorc, num = FALSE, 
                  cont_table = cont_table, sigmasq = 0.25)
estimb3 <- approx(posteriorc, num = TRUE, 
                  cont_table = cont_table, sigmasq = 4)/
           approx(posteriorc, num = FALSE, 
                  cont_table = cont_table, sigmasq = 4)

# sigmasq = 1
estimb1
0.990231766938391

# sigmasq = 0.25
estimb2
0.961061419233541

# sigmasq = 4
estimb3
0.995239851507228

# sigmasq = 400000 (close to complete independence)
approx(posteriorc, num = TRUE, cont_table = cont_table, sigmasq = 4e+05) /
  approx(posteriorc, num = FALSE, cont_table = cont_table, sigmasq = 4e+05)
0.996557062782334

We see that the larger the variance, the higher $P(p < q)$ is. When we considered $p$ and $q$ to be fully independent, we got an estimate of 0.9965571. It makes sense, then, that as we increase the variance, making the two closer to what we would consider independent of one another, the estimate approaches the value obtained when independence is assumed explicitly.

Assuming a small prior variance ($\sigma^{2} = 0.25$) leads to a smaller probability that $p < q$ than any other method here, including the chi-squared and Fisher's exact tests. With a larger variance (even at $\sigma^{2} = 4$), we obtain a higher probability that the two proportions are different. The conclusion in each case is that the two groups differ from one another.

When the prior variance is small, we are less confident that the two groups differ from one another, in both the prior and the posterior distribution.

Sean Ammirati - creator of Stats Works. He can be reached on Github, LinkedIn and email.