Inference for many proportions

Introducing \(\chi^2\)

Ex: Diversity at Reed

In terms of ethnic diversity, how does the first year student body compare to the general population of Oregon?

The data

Ethnicity	Asian	Black	Hispanic	White	Other	Total
Reed count	49	10	34	206	55	354
Oregon %	.043	.02	.125	.77	.042	1

If the students at Reed were drawn from a population with these proportions, how many counts would we expect in each group?

\[\textrm{exp. count} = n \times p_i\]

The data

Ethnicity	Asian	Black	Hispanic	White	Other	Total
Obs. count	49	10	34	206	55	354
Exp. count	15.22	7.08	44.25	272.58	14.87	354

Some sampling variability is expected, but how far from expected is too far?

Simulating Oregonian Reedies

n <- 354
p <- c(.043, .02, .125, .77, .042)
samp <- factor(sample(c("asian", "black", "hispanic", "white", "other"), 
       size = n, replace = TRUE, prob = p), 
       levels = c("asian", "black", "hispanic", "white", "other"))
table(samp)

## samp
##    asian    black hispanic    white    other 
##       20       11       39      272       12

obs <- c(49, 10, 34, 206, 55)

Simulating Oregonian Reedies

obs <- c(49, 10, 34, 206, 55)
samp <- factor(sample(c("asian", "black", "hispanic", "white", "other"), 
       size = n, replace = TRUE, prob = p), 
       levels = c("asian", "black", "hispanic", "white", "other"))
table(samp)

## samp
##    asian    black hispanic    white    other 
##       15        8       40      276       15

samp <- factor(sample(c("asian", "black", "hispanic", "white", "other"), 
       size = n, replace = TRUE, prob = p), 
       levels = c("asian", "black", "hispanic", "white", "other"))
table(samp)

## samp
##    asian    black hispanic    white    other 
##       17        7       38      278       14

Simulating Oregonian Reedies

reed_demos %>%
  specify(response = ethnicity) %>%
  hypothesize(null = "point", p = c("asian"    = .043, 
                                    "black"    = .02, 
                                    "hispanic" = .125, 
                                    "white"    = .77, 
                                    "other"    = .042)) %>%
  generate(reps = 500, type = "simulate")

Inference on many ps

We could do a tests/CIs on \(p_{reed} - p_{oregon}\) for each group, however:

We have the whole population of Oregon.
Beware of multiple comparisons!

Creating a statistic

For each of \(k\) categories:

Calculate the difference between observed and expected counts.
Scale each difference by an estimate of the SE (\(\sqrt(exp)\)).
Square the scaled difference to get rid of negatives.

Then add them all up.

\[\chi^2 = \sum_{i = 1}^k \frac{(obs - exp)^2}{exp}\]

Ethnicity	Asian	Black	Hispanic	White	Other	Total
Obs. count	49	10	34	206	55	354
Exp. count	15.22	7.08	44.25	272.58	14.87	354

\[ Z_{asian}^2 = (49 - 15.22)^2/15.22 = 74.97 \\ Z_{black}^2 = (10 - 7.08)^2/7.08 = 1.20 \\ Z_{hispanic}^2 = (34 - 51.5)^2/51.5 = 5.95 \\ Z_{white}^2 = (206 - 272.58)^2/272.58 = 16.26 \\ Z_{other}^2 = (55 - 14.87)^2/14.87 = 108.30 \]

\[ Z_{asian}^2 + Z_{black}^2 + Z_{hispanic}^2 + Z_{white}^2 + Z_{other}^2 = 206.68 = \chi^2_{obs} \]

Simulating \(\chi^2\) under \(H_0\)

null_dist <- reed_demos %>%
  specify(response = ethnicity) %>%
  hypothesize(null = "point", p = c("asian"    = .043, 
                                    "black"    = .02, 
                                    "hispanic" = .125, 
                                    "white"    = .77, 
                                    "other"    = .042)) %>%
  generate(reps = 500, type = "simulate") %>%
  calculate(stat = "Chisq")

The null distribution

What is the probability of observing our data or more extreme (\(\chi^2 = 206.68\)) under the null hypothesis that Reedies share the same ethnicity proportions as Oregon?

About zero.

An alternate path to the null

If…

Indepedent observations
Each cell count has a count \(\ge\) 5
\(k \ge 3\)

then our statistic can be well-approximated by the \(\chi^2\) distribution with \(k - 1\) degrees of freedom.

The null distribution

1 - pchisq(206.68, df = 4)

## [1] 0