Chi-square Tests on Categorical Data

The Iris flower data set, also known as Fisher's Iris data set, is among the most renowned collections used in statistics and machine learning. It details 150 iris flowers from three different species: Iris Setosa, Iris Versicolor, and Iris Virginica. Each flower entry includes four measurements - sepal length, sepal width, petal length, and petal width - along with a corresponding label identifying the species of the flower.

The boxplot above shows how sepal lengths vary among the Iris species. Now, suppose that you want to test whether there is any difference in average sepal length across the three species. How would you address this problem?

  • Null Hypothesis (\(H_0\)): The average sepal lengths of all three Iris species are equal.
  • Alternative Hypothesis (\(H_1\)): At least one of the mean sepal lengths of Iris Setosa, Iris Versicolor, and Iris Virginica differs from the others.

Previously, we discussed how to compare the means of two groups using a t-test. Assuming that all three Iris species have normally distributed sepal lengths with equal variance, one might consider conducting three separate t-tests, one for each pair of species (Setosa vs. Versicolor, Setosa vs. Virginica, Versicolor vs. Virginica). However, this approach is not appropriate in this context.

To simplify the current discussion, let's assume that the mean sepal length is truly the same across the three Iris species, ruling out any possibility of a Type II error. Conducting each t-test at \(\alpha = 0.05\) implies a 5% chance of making a Type I error in each pairwise comparison. Put differently, each t-test avoids a Type I error with probability 95%. Thus, in this setting, if we proceed with the three separate t-tests to determine whether at least one species has a different mean sepal length, the overall Type I error rate would be approximately \(1 - 0.95^3 \approx 0.143\), assuming the tests are independent. This shows that the chance of making a Type I error is inflated almost threefold!
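This familywise error rate is easy to check numerically. Here is a minimal Python sketch (the function name is ours, and it assumes the tests are independent):

```python
# Familywise Type I error rate for k independent tests, each at level alpha.
# (Function name is illustrative, not from any library.)
def familywise_error(alpha: float, k: int) -> float:
    # P(at least one false rejection) = 1 - P(no false rejection in any test)
    return 1 - (1 - alpha) ** k

print(familywise_error(0.05, 3))  # ~0.1426, nearly triple the per-test 0.05
```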

F-Distribution

Let \(U_1 \sim \chi_{\nu_1}^2\) and \(U_2 \sim \chi_{\nu_2}^2\) be independent of each other. Then the ratio \(F = \frac{U_1/\nu_1}{U_2/\nu_2}\) follows a probability distribution known as \(F_{\nu_1, \nu_2}\).

The F-distribution takes two parameters, \(\nu_1\) and \(\nu_2\), both greater than 0. Its possible values are \(x \in (0, \infty)\) if \(\nu_1 = 1\), and \(x \in [0, \infty)\) otherwise.

One important property of the F-distribution is that, for \(X \sim F_{\nu_1, \nu_2}\) and \(Y = \frac{1}{X} \sim F_{\nu_2, \nu_1}\):

\(F_{\nu_1, \nu_2, \alpha} = \frac{1}{F_{\nu_2, \nu_1, 1-\alpha}}\)

Where \(\alpha = P(X \ge F_{\nu_1, \nu_2, \alpha})\) and \(1-\alpha = P(Y \ge F_{\nu_2, \nu_1, 1-\alpha})\). This can be easily proven as follows:

\(P(Y \ge F_{\nu_2, \nu_1, 1-\alpha}) = 1 - \alpha \\ \Rightarrow P(\frac{1}{Y} \le \frac{1}{F_{\nu_2, \nu_1, 1-\alpha}}) = 1 - \alpha \\ \Rightarrow P(\frac{1}{Y} \ge \frac{1}{F_{\nu_2, \nu_1, 1-\alpha}}) = \alpha \\ \Rightarrow P(X \ge \frac{1}{F_{\nu_2, \nu_1, 1-\alpha}}) = \alpha\)

Since \(P(X \ge F_{\nu_1, \nu_2, \alpha}) = P(X \ge \frac{1}{F_{\nu_2, \nu_1, 1-\alpha}}) = \alpha\):

\(\Rightarrow F_{\nu_1, \nu_2, \alpha} = \frac{1}{F_{\nu_2, \nu_1, 1-\alpha}}\)

This establishes a reciprocal relationship between the tail critical values of F-distributions with swapped degrees of freedom. If \(F_{\nu_1, \nu_2, \alpha}\) is the value that \(X \sim F_{\nu_1, \nu_2}\) exceeds with probability \(\alpha\), then its reciprocal is the value that \(Y \sim F_{\nu_2, \nu_1}\) exceeds with probability \(1-\alpha\).

You can take advantage of this property, particularly when you want to find a critical value of an F-distribution. While statistical software, including SAS, can compute probabilities for the complete distribution, printed F tables usually list only upper-tail critical values for a few values of \(\alpha\). For example, suppose \(X \sim F_{5, 10}\) and you want the lower-tail critical value \(F_{5, 10, 0.95}\); by the property above, \(F_{5, 10, 0.95} = \frac{1}{F_{10, 5, 0.05}}\), so an upper-tail table is enough.
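The reciprocal relationship can be sanity-checked by simulation. Below is a minimal Python sketch that draws F variates from the definition \(F = \frac{U_1/\nu_1}{U_2/\nu_2}\), using the fact that a chi-square variable with \(\nu\) degrees of freedom is a Gamma(\(\nu/2\), scale 2) variable (standard library only, no SciPy):

```python
import random

def f_sample(d1, d2, rng):
    # chi-square(nu) = Gamma(shape nu/2, scale 2); stdlib has gammavariate
    u1 = rng.gammavariate(d1 / 2, 2)
    u2 = rng.gammavariate(d2 / 2, 2)
    return (u1 / d1) / (u2 / d2)

rng = random.Random(42)
n = 200_000
x = sorted(f_sample(5, 10, rng) for _ in range(n))   # F(5, 10) draws
y = sorted(f_sample(10, 5, rng) for _ in range(n))   # F(10, 5) draws

# Upper-tail critical value F_{10,5,0.05}: the 95th percentile of F(10, 5)
f_10_5_upper = y[int(0.95 * n)]
# Lower-tail critical value F_{5,10,0.95}: the 5th percentile of F(5, 10)
f_5_10_lower = x[int(0.05 * n)]

print(f_5_10_lower, 1 / f_10_5_upper)  # the two should nearly coincide
```

Both printed numbers should land near the tabled value \(1/F_{10, 5, 0.05} = 1/4.74 \approx 0.21\), up to Monte Carlo error.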


Another interesting fact about the F-distribution is that:

\(t = \frac{Z}{\sqrt{U/n}} \sim t_{n} \rightarrow t^2 = \frac{Z^2/1}{U/n} \sim F_{1, n}\)

Where \(Z \sim N(0,\;1)\) and \(U \sim \chi_{n}^2\) are independent. Remember that the t-statistic assesses how far the observed \(\bar{X}\) is from \(\mu\) in units of the estimated standard error. Squaring it gives the ratio of a squared standard normal variable (which is \(\chi_1^2\)) to an independent scaled chi-square variable. Thus, it makes sense that the squared t-statistic follows the F-distribution with degrees of freedom 1 and \(n\).
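This fact, too, can be checked by simulation with only the standard library: squared \(t_n\) draws and directly generated \(F_{1, n}\) draws should agree in distribution. For \(n = 10\), both have mean \(n/(n-2) = 1.25\):

```python
import random

rng = random.Random(7)
n_df, n_sim = 10, 200_000

def chi2(nu):
    # chi-square(nu) equals Gamma(shape nu/2, scale 2); stdlib has gammavariate
    return rng.gammavariate(nu / 2, 2)

# Squared t_n draws: t = Z / sqrt(U/n) with Z ~ N(0, 1), U ~ chi-square(n)
t_sq = [rng.gauss(0, 1) ** 2 / (chi2(n_df) / n_df) for _ in range(n_sim)]
# F(1, n) draws generated directly from the definition of the F-distribution
f_1n = [(chi2(1) / 1) / (chi2(n_df) / n_df) for _ in range(n_sim)]

# For n = 10, the mean of F(1, 10) is n / (n - 2) = 1.25
print(sum(t_sq) / n_sim, sum(f_1n) / n_sim)
```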

Variance Ratio Test

In statistics, the variance ratio test, also known as the F-test, compares the variances of two normal populations. It helps determine whether the variances are statistically different from each other.

For two chi-squared random variables \(U_1 = \frac{(n_1 - 1)S_1^2}{\sigma_1^2} \sim \chi_{n_1 -1}^2\) and \(U_2 = \frac{(n_2 - 1)S_2^2}{\sigma_2^2} \sim \chi_{n_2 - 1}^2\):

\(\begin{aligned}F & = \frac{U_1/(n_1 - 1)}{U_2/(n_2 - 1)} = \frac{\frac{(n_1 - 1)S_1^2}{\sigma_1^2 (n_1 - 1)}}{\frac{(n_2 - 1)S_2^2}{\sigma_2^2 (n_2 - 1)}}\\ & = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} = \frac{S_1^2 / S_2^2}{\sigma_1^2 / \sigma_2^2} \sim F_{n_1 - 1, n_2 - 1}\end{aligned}\)

Under the null hypothesis, we assume that the variances of the two populations are equal. Thus, \(\sigma_1^2 / \sigma_2^2 = 1\). So, if the ratio between the two sample variances is significantly larger than the critical value from F-distribution, we reject the null hypothesis.
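As a small illustration, here is the variance ratio statistic for two made-up samples in Python (the data are hypothetical; the critical value would come from an F table or software):

```python
# Variance ratio (F) test statistic for two hypothetical samples.
def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

group1 = [4.2, 5.1, 6.3, 5.8, 4.9, 6.1]  # n1 = 6, made-up data
group2 = [5.0, 5.2, 4.8, 5.1, 4.9, 5.3]  # n2 = 6, made-up data

# F = S1^2 / S2^2, compared against F_{n1-1, n2-1} under H0: equal variances
f_stat = sample_variance(group1) / sample_variance(group2)
print(round(f_stat, 3))  # 18.514, well above typical F_{5, 5} critical values
```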

Basic Idea behind ANOVA

As we saw earlier, conducting multiple t-tests for each possible pair of groups to test equality of means increases the overall risk of incorrectly rejecting the null hypothesis. To address this issue, R.A. Fisher, a pioneering British statistician and biologist, developed Analysis of Variance (ANOVA). 

In the early 20th century, Fisher was working on agricultural experiments. He was interested in comparing the effects of different treatments (such as varying levels of fertilizer or different breeding techniques) on crop yields. To draw reliable conclusions from such experiments, he needed a statistical method that could handle the comparison of means across more than two groups.

What Fisher recognized is that, in any experiment, the total variability in the data could be decomposed into different components:

  • Within-group variability: Variation within each group. This includes random variation and measurement error.
  • Between-group variability: Variation among the group means. This reflects whether the treatments (or groups) have different effects on the outcome variable.

Fisher formulated a statistical model to describe the data:

\(Y_{ij} = \mu + \tau_i +\epsilon_{ij}\)

Where:

  • \(Y_{ij}\): The \(j\)-th observed outcome from the \(i\)-th group
  • \(\mu\): Overall mean across the groups
  • \(\tau_i\): Effect of the \(i\)-th treatment or group
  • \(\epsilon_{ij}\): Random error associated with the \(j\)-th observation from the \(i\)-th group.

Fisher proposed comparing the ratio of the between-group variability to the within-group variability (in the form of mean squares) to determine whether the group means differ significantly from each other:

\(F = \frac{\text{Between-group variability}}{\text{Within-group variability}}\)

This F-statistic follows an F-distribution under the null hypothesis of no group differences. To sum up, Fisher's insight was to test whether there are statistically significant differences among the group means by comparing the size of the between-group variability to the random error captured by the within-group variability[1].

So essentially, ANOVA is one application of the variance ratio test.
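To make the idea concrete, here is the one-way ANOVA F statistic computed from scratch on made-up data, following the between/within decomposition above:

```python
# One-way ANOVA F statistic from the variance decomposition (made-up data).
groups = [
    [5.0, 5.2, 5.1, 4.9, 5.3],   # group 1
    [5.9, 6.1, 6.0, 5.8, 6.2],   # group 2
    [5.1, 5.0, 5.2, 4.8, 5.4],   # group 3
]

k = len(groups)                       # number of groups
n = sum(len(g) for g in groups)       # total observations
grand = sum(sum(g) for g in groups) / n

# Between-group sum of squares: weighted squared deviations of group means
ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
# Within-group sum of squares: squared deviations inside each group
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

ms_between = ss_between / (k - 1)     # df = k - 1
ms_within = ss_within / (n - k)       # df = n - k
f_stat = ms_between / ms_within       # compare to F_{k-1, n-k}
print(round(f_stat, 2))
```

A large F value here indicates that the group means spread out far more than the within-group noise would explain.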


Assumptions on ANOVA

The F-test assumes that all the groups are distributed with the same variance \(\sigma^2\). One way to assess this assumption is to compare the sample variances across groups, for example with the variance ratio test described above.


One-Way ANOVA

The ANOVA procedure is one of many in SAS that perform analysis of variance. PROC ANOVA is specifically designed for balanced data - data where there are equal numbers of observations in each combination of the classification factors. The exception is one-way analysis of variance, where the data do not need to be balanced. If your data are not balanced and you are not doing one-way analysis of variance, then you should use the GLM procedure, whose statements are almost identical to those of PROC ANOVA. Although we are discussing only simple one-way analysis of variance in this section, PROC ANOVA can handle multiple classification variables and models that include nested and crossed effects as well as repeated measures. If you are unsure of the appropriate analysis for your data, or are unfamiliar with basic statistical principles, we recommend that you seek advice from a trained statistician or consult a good statistical textbook.

The ANOVA procedure has two required statements: the CLASS and MODEL statements. The following is the general form of the ANOVA procedure:

PROC ANOVA;
    CLASS variable-list;
    MODEL dependent = effects;
RUN;

The CLASS statement must come before the MODEL statement and defines the classification variables. For one-way analysis of variance, only one variable is listed. The MODEL statement defines the dependent variable and the effects. For one-way analysis of variance, the effect is the classification variable.

As you might expect, there are many optional statements for PROC ANOVA. One of the most useful is the MEANS statement, which calculates means of the dependent variable for any of the main effects in the MODEL statement. In addition, the MEANS statement can perform several types of multiple comparison tests including Bonferroni t tests (BON), Duncan's multiple-range test (DUNCAN), Scheffe's multiple-comparison procedure (SCHEFFE), pairwise t tests (T), and Tukey's studentized range test (TUKEY). The MEANS statement has the following general form:

MEANS effects / options;

The effects can be any effect in the MODEL statement, and the options include the name of the desired multiple comparison test (for example, DUNCAN).


PROC ANOVA DATA = heights;
    CLASS Region;
    MODEL Height = Region;
    MEANS Region / SCHEFFE;
RUN;

Interpreting ANOVA Table

The tabular output from PROC ANOVA has at least two parts. First, the ANOVA procedure produces a table giving information about the classification variables: the number of levels, their values, and the number of observations. Next, it produces the analysis of variance table. If you use optional statements such as MEANS, their output follows.

The example from the previous section used the following PROC ANOVA statements: 

PROC ANOVA DATA = heights;
    CLASS Region;
    MODEL Height = Region;
    MEANS Region / SCHEFFE;
RUN;

The graph produced by the ANOVA procedure is shown in the previous section. The first page of the tabular output (shown below) gives information about the classification variable Region. It has four levels with the values East, North, South, and West, and there are 64 observations.


The second part of the output is the analysis of variance table:

Highlights of the output are:

  • Source: Source of variation.
  • DF: Degrees of freedom for the model, error, and total.
  • Sum of Squares: Sum of squares for the portion attributed to the model, error, and total.
  • Mean Square: Mean square (sum of squares divided by the degrees of freedom).
  • F Value: F value (mean square for the model divided by the mean square for error).
  • Pr > F: Significance probability associated with the F statistic.
  • R-Square: R-square.
  • Coeff Var: Coefficient of variation.
  • Root MSE: Root mean square error.
  • Height Mean: Mean of the dependent variable.

Because the effect of Region is significant (p = 0.0051), we conclude that there are differences in the mean heights of girls from the four regions. The SCHEFFE option in the MEANS statement compares the mean heights between the regions. Letters are used to group the means, and means with the same letter are not significantly different from each other at the 0.05 level. The following results show that your friend's daughter is partially correct - one region (South) has taller girls than her region (West), but no other two regions differ significantly in mean height.


[1] This is why we name it "Analysis of Variance" even though it compares group means; we are testing whether the sample group means are similar to each other by comparing the relative sizes of the between-group variance and within-group variance using the F-statistic.
[2] "Student" was the pen name of Gosset. His employer did not allow him to publish his findings under his real name due to company policy.  
