Student's t-Test

Previously, we discussed how to test whether the population mean \(\mu\) equals a hypothesized value \(\mu_{H_0}\) when the population std. dev \(\sigma\) is known. According to the central limit theorem (CLT), when the sample size is sufficiently large, the z-test statistic follows the standard normal distribution, regardless of the population's underlying distribution. Even if the sample size is not large, if the population from which the sample is drawn is normally distributed, \(\bar{X}\) still follows a normal distribution with mean \(\mu_{H_0}\) and standard error \(\sigma / \sqrt{n}\) under \(H_0\), so the z-test statistic is exactly standard normal.

However, one critical issue arises here: in reality, \(\sigma\) is usually unknown. Indeed, in most cases the population standard deviation \(\sigma\) is not known; it is known only in very rare cases. How can we address this issue? If the sample size \(n\) is large enough, we can use the sample std. dev \(S\) as a substitute for \(\sigma\). A sample std. dev based on a large sample provides a reliable estimate of \(\sigma\). In fact, around the turn of the 20th century, statistical analysis primarily revolved around concepts related to populations and very large sample sizes.

What if the sample size is not large enough? As mentioned earlier, even in this case, if the population is normally distributed, \(\bar{X}\) will follow a normal distribution. However, substituting \(\sigma\) with \(S\) tends to underestimate the true standard error[1], which can inflate the chance of making a Type I error!

William Sealy Gosset, head experimental brewer at Guinness in the early 20th century, faced exactly this challenge: with only a small batch of hops and an unknown variance, how could he check whether the sugar concentration remained constant? To get around this, Gosset introduced an alternative to the standard normal distribution: the t-distribution. This distribution, originally called "Student's z,"[2] accounts for the uncertainty inherent in small samples.

Student's t-Test Statistic

Chi-square Distribution

Before delving into the details of the t-test statistic, let's briefly review the chi-square distribution. Consider a set of independent and identically distributed random variables, denoted by \(Z_1, Z_2, ..., Z_{\nu}\). If each variable follows the standard normal distribution, the sum of their squares will follow a chi-square distribution with \(\nu\) degrees of freedom, denoted by \(\chi_{\nu}^2\).

\(X_i \stackrel{i.i.d.}{\sim}N(\mu,\;\sigma^2) \Rightarrow Z_i = \frac{X_i - \mu}{\sigma} \stackrel{i.i.d.}{\sim}N(0,\;1)\)

\(\sum_{i=1}^{\nu} Z_i^2 = \sum_{i=1}^{\nu}\left(\frac{X_i - \mu}{\sigma}\right)^2 = \frac{\sum_{i=1}^{\nu}(X_i - \mu)^2}{\sigma^2} \sim \chi_{\nu}^2\)

Because the chi-square (\(\chi_{\nu}^2\)) random variable is constructed as a sum of squares of continuous random variables, it is itself continuous, and its possible values are greater than or equal to 0.

Let \(U \sim \chi^2_{\nu}\). The expected value of \(U\) can be obtained:

\(\begin{aligned}\Rightarrow E(U) & = E(Z_1^2 + Z_2^2 + ... + Z_{\nu}^2) \text{, where } Z_i \stackrel{i.i.d}{\sim} N(0, 1) \\ & = E(Z_1^2) + E(Z_2^2) + ... + E(Z_{\nu}^2)\\ & = [Var(Z_1) + E(Z_1)^2] + [Var(Z_2) + E(Z_2)^2] + ... + [Var(Z_{\nu}) + E(Z_{\nu})^2] \\ & = \underbrace{1 + 1 + ... + 1}_{\nu \text{ times}} = \nu \end{aligned}\)

The variance of \(U\) can be obtained:

\(\begin{aligned}\Rightarrow Var(U) & = Var(Z_1^2 + Z_2^2 + ... + Z_{\nu}^2) \text{, where } Z_i \stackrel{i.i.d}{\sim} N(0, 1) \\ & = Var(Z_1^2) + Var(Z_2^2) + ... + Var(Z_{\nu}^2) \\ & = [E(Z_1^4) - E(Z_1^2)^2] + [E(Z_2^4) - E(Z_2^2)^2] + ... + [E(Z_{\nu}^4) - E(Z_{\nu}^2)^2]\\ & = \underbrace{[3 - 1] + [3 - 1] + ... + [3 - 1]}_{\nu \text{ times}} = 2\nu\end{aligned}\)
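
To make these two results concrete, here is a minimal simulation sketch (the data set name ChiSqSim, the seed, and the choice of \(\nu = 5\) are arbitrary): it repeatedly sums \(\nu\) squared standard normal draws and checks that the empirical mean and variance of the sums land near \(\nu\) and \(2\nu\).

/* Simulate U = Z1^2 + ... + Z_nu^2 many times and check E(U) and Var(U). */
/* The data set name, seed, and nu = 5 are arbitrary choices.             */
DATA ChiSqSim;
CALL STREAMINIT(2024);
nu = 5;
DO rep = 1 TO 100000;
U = 0;
DO i = 1 TO nu;
Z = RAND('NORMAL');   /* Z ~ N(0, 1) */
U = U + Z**2;         /* accumulate the squared draws */
END;
OUTPUT;
END;
KEEP U;
RUN;

PROC MEANS DATA=ChiSqSim MEAN VAR;  /* mean should be near 5, variance near 10 */
VAR U;
RUN;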

Distribution of Sample Variance \(S^2\)

Let \(X_i\) denote a random variable representing the \(i\)-th sample data point. If the \(X_i\) are independent and identically normally distributed, then the sample variance calculated by \(\frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2\) will follow a chi-square distribution after appropriate scaling, i.e., for \(i = 1, 2, ..., n\):

\(X_i \stackrel{i.i.d.}{\sim} N(\mu, \sigma^2) \Rightarrow \frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2\)

Put simply, if the population is normally distributed, the sample variance \(S^2\) times a constant value of \(\frac{(n-1)}{\sigma^2}\) will follow a chi-square distribution with df of \(n-1\)! Let's see how this works:

\(\begin{aligned} \sum_{i=1}^n (X_i - \mu)^2 & = \sum_{i=1}^n \left[(X_i - \bar{X}) + (\bar{X} - \mu)\right]^2 \\ & = \sum_{i=1}^n (X_i - \bar{X})^2 + \sum_{i=1}^n (\bar{X} - \mu)^2 + 2(\bar{X} - \mu)\sum_{i=1}^n (X_i - \bar{X}) \\ & = (n-1) \frac{\sum_{i=1}^n(X_i-\bar{X})^2}{n-1} + n(\bar{X} - \mu)^2 + 2(\bar{X} - \mu) \times 0 \\ & = (n-1)S^2 + n(\bar{X} - \mu)^2 \end{aligned}\)

Note that the cross term vanishes because \(\sum_{i=1}^n (X_i - \bar{X}) = 0\) by the definition of the sample mean.

Dividing both sides by \(\sigma^2\):

\(\frac{\sum_{i=1}^n (X_i - \mu)^2}{\sigma^2} = \frac{(n-1)S^2}{\sigma^2} + \frac{n(\bar{X} - \mu)^2}{\sigma^2}\)

Notice that \(\frac{\sum_{i=1}^n (X_i - \mu)^2}{\sigma^2} \sim \chi_{n}^2\) and \(\frac{n(\bar{X} - \mu)^2}{\sigma^2} = \left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2 \sim \chi_{1}^2\). Since \(\bar{X}\) and \(S^2\) are independent for normal samples, the remaining term must account for the other \(n-1\) degrees of freedom. Therefore, we can confirm that:

\(\frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2\).

Taking advantage of this, you can perform hypothesis testing on \(\sigma^2\) based on an observed sample variance \(S^2\). This ability to draw inferences about the population variance from the sample variance gives the chi-square distribution a crucial role in statistics.
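
For example, to test \(H_0: \sigma^2 = \sigma_{H_0}^2\) against a right-tailed alternative, you can compute \(\frac{(n-1)S^2}{\sigma_{H_0}^2}\) and compare it to the \(\chi_{n-1}^2\) distribution. Below is a minimal sketch of that calculation; the data set MySample, its variable x, and the hypothesized variance of 4 are hypothetical placeholders, not objects defined elsewhere in this article.

/* Chi-square test for a single variance: (n-1)S^2 / sigma0^2 ~ chi-square(n-1) under H0. */
/* MySample, x, and sigma0_sq = 4 are hypothetical placeholders.                          */
PROC MEANS DATA=MySample N VAR NOPRINT;
VAR x;
OUTPUT OUT=VarStats N=n VAR=s2;
RUN;

DATA VarTest;
SET VarStats;
sigma0_sq = 4;                         /* hypothesized population variance */
chi_sq = (n - 1) * s2 / sigma0_sq;     /* test statistic                   */
df = n - 1;
p_right = 1 - PROBCHI(chi_sq, df);     /* right-tailed p-value             */
RUN;

PROC PRINT DATA=VarTest;
VAR n s2 chi_sq df p_right;
RUN;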

Student's t-Distribution

In statistics, the z-test locates \(\bar{X}\) relative to the hypothesized \(\mu\), given \(\sigma\). By the CLT, \(\bar{X}\) follows a normal distribution with expected value \(\mu\) and std. dev \(\sigma/\sqrt{n}\) (known as the standard error), so the z-test statistic follows the standard normal distribution. This provides the probability of observing the current value of \(\bar{X}\), or one more extreme, under the assumption of \(H_0\).

The t-test statistic, analogous to the z-statistic but with \(\sigma\) replaced by \(S\), follows a t-distribution with \(n-1\) degrees of freedom. As we saw earlier, if you have \(n\) sample data points drawn from a normally distributed population, then \(\frac{(n-1)S^2}{\sigma^2}\) follows a chi-square distribution with \(n-1\) degrees of freedom. Gosset's insight was that, by replacing the unknown population std. dev \(\sigma\) with \(S\), the resulting test statistic needs another distribution. The new distribution should have a bell-shaped curve similar to that of the standard normal, but with heavier tails, reflecting the use of an estimated value of \(\sigma\) rather than the exact population parameter.

\(\frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1}\)

This is called the t-test statistic[3] and follows the t-distribution. The expected value of \(t\) is \(E(t) = 0\), and its variance is \(Var(t) = \frac{\nu}{\nu - 2}\) for \(\nu > 2\), where \(\nu = n - 1\) is the degrees of freedom. Compared to the standard normal distribution, the t-distribution is centered at 0 but has larger variance (because \(\frac{\nu}{\nu - 2} > 1\)). However, if \(n\) is sufficiently large, then \(\frac{\nu}{\nu - 2} \approx 1\), making the t-distribution similar to the standard normal distribution.
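
In fact, rewriting the ratio shows exactly where the chi-square distribution from the previous section comes in:

\(\frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{(\bar{X} - \mu)/(\sigma / \sqrt{n})}{\sqrt{\frac{(n-1)S^2/\sigma^2}{n-1}}} = \frac{Z}{\sqrt{\chi_{n-1}^2 / (n-1)}}\)

The numerator is a standard normal random variable, the denominator is the square root of a chi-square random variable divided by its degrees of freedom, and the two are independent[3]. A ratio of this form is exactly how a t random variable with \(n-1\) degrees of freedom is defined.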

The t-distribution takes one parameter, \(\nu = n-1\), known as the degrees of freedom (df). In statistics, df represents the number of values in a calculation that are free to vary without constraints. To illustrate, imagine you have 10 data points (\(X_1, X_2, ..., X_{10}\)). If you know their average \(\bar{X}\) and all but one of the data points, say (\(X_1, X_2, ..., X_9\)), the value of the remaining data point \(X_{10}\) is already determined: \(X_{10} = 10\bar{X} - (X_1 + X_2 + ... + X_9)\). This is because the sum of all data points must equal 10 times the mean. Thus, with 10 data points and the constraint of a known mean, only 9 data points are free to vary independently. In general, each sample statistic that must match a fixed quantity imposes one constraint, so the degrees of freedom equal the number of observations \(n\) minus the number of estimated parameters.

Back in the context of hypothesis testing when \(\sigma\) is unknown, instead of using the standard normal distribution, we use the t-distribution with df \(n-1\). This reflects the fact that one estimated parameter enters the calculation. Just like the standard normal distribution, the t-distribution is centered at zero and bell-shaped. However, it has heavier tails, and the amount of probability mass in the tails is controlled by the df \(\nu\): as \(\nu \rightarrow \infty\), it approaches \(N(0,\;1)\). This property of the t-distribution accounts for the increased uncertainty when using \(S\) as an estimate of \(\sigma\), while avoiding inflation of the chance of making Type I errors.
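
A quick numeric sketch of this convergence, using the t and standard normal CDF functions (the data set name TailProbs is an arbitrary choice): compare the two-sided tail probability beyond \(\pm 2\) under the t-distribution, for several degrees of freedom, with the corresponding standard normal tail probability.

/* Compare P(|t| > 2) for several df with P(|Z| > 2) under N(0, 1). */
DATA TailProbs;
z_tail = 2 * (1 - PROBNORM(2));      /* two-sided standard normal tail            */
DO df = 2, 5, 10, 30, 100;
t_tail = 2 * (1 - PROBT(2, df));     /* two-sided t tail, heavier for small df    */
OUTPUT;
END;
RUN;

PROC PRINT DATA=TailProbs NOOBS;
VAR df t_tail z_tail;
RUN;

As the df grows, the t tail probability shrinks toward the standard normal value of about 0.0455, which is why the two distributions become practically indistinguishable for large samples.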

Assumptions on the t-Tests

As we've discussed so far, the fact that \(\frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2\) relies on an assumption: \(X_i \overset{i.i.d.}{\sim} N(\mu,\;\sigma^2)\). Since the t-test statistic uses the sample std. dev as an estimate of \(\sigma\), relying on the chi-square distribution, the t-test comes with assumptions about the population. They are:

  • Normality: The t-test assumes that the underlying populations from which the samples are drawn are normally distributed.
  • Independence: The observations within each sample are independent of each other. This can be assured if your sample is obtained through random sampling.
  • Homogeneity of Variance: In the case where you are comparing two samples from different populations, the variances of the two populations being compared are assumed to be equal. If they are not, adjustments will be necessary, as we will discuss later in this article.

It is worth noting that t-tests are fairly robust to violations of normality, particularly when the sample size is greater than 30, as in most practical cases. The t-distribution quickly approaches the standard normal distribution as the degrees of freedom increase, making it suitable for larger sample sizes even if the underlying population distribution deviates from normality.

One Sample t-Test

Calculating a t-test statistic is very simple; all you need to do is plug the sample standard deviation in for the unknown value of \(\sigma\):

\(\frac{\bar{X} - \mu_{H_0}}{S / \sqrt{n}} \sim t_{\nu}\)

In the formula, the new denominator (\(S / \sqrt{n}\)) is an estimator of the SE (\(\sigma/\sqrt{n}\)) used in the z-statistic. This test statistic follows a t-distribution with \(\nu = n-1\), rather than the standard normal distribution.

Testing | \(H_0\) | \(H_1\) | p-value
Two-sided | \(\mu = \mu_{H_0}\) | \(\mu \ne \mu_{H_0}\) | \(2 \times P(t_{\nu} > |\text{Observed value}|)\)
Right-tailed | \(\mu \le \mu_{H_0}\) | \(\mu > \mu_{H_0}\) | \(P(t_{\nu} > \text{Observed value})\)
Left-tailed | \(\mu \ge \mu_{H_0}\) | \(\mu < \mu_{H_0}\) | \(P(t_{\nu} < \text{Observed value})\)
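
For reference, the p-values in the table can be computed directly from an observed t value with the t CDF function PROBT. The values of t_obs and nu below are hypothetical placeholders, not results from any data set in this article.

/* Compute the three p-values from an observed t statistic. */
/* t_obs and nu below are hypothetical placeholders.        */
DATA TPValues;
t_obs = 1.8;                                    /* observed t value (placeholder) */
nu = 9;                                         /* degrees of freedom, n - 1      */
p_two_sided = 2 * (1 - PROBT(ABS(t_obs), nu));
p_right_tailed = 1 - PROBT(t_obs, nu);
p_left_tailed = PROBT(t_obs, nu);
RUN;

PROC PRINT DATA=TPValues NOOBS;
RUN;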

In SAS, one-sample tests for central location can be conducted using PROC UNIVARIATE. As its name implies, the UNIVARIATE procedure is designed to analyze and summarize the distribution of individual numeric variables. The syntax of PROC UNIVARIATE is fairly simple: after the PROC statement, you specify one or more numeric variables in a VAR statement[4]. For example, let's consider the following data set:

This data set summarizes results from a well-known experiment in social science. In the experiment, researchers tested the deterrence hypothesis, which suggests that imposing a penalty can decrease the occurrence of a specific behavior. For this study, they worked with 10 volunteer daycare centers that did not originally impose a fine on parents for picking up their kids late. Out of these, 6 centers were randomly selected to introduce a substantial fine for late pickups (group = 'test' in the data set). The remaining 4 centers (group = 'control' in the data set) did not introduce any fine.

The experiment spanned 8 weeks: 4 weeks before and 4 weeks after the fine was introduced. So, in the data set, the variable before records the average late pickup rate for each daycare center over the four weeks prior to introducing the fine, while after records the average late pickup rate for each daycare center over the four weeks after introducing the fine.

Now, let's suppose that we're interested in the average late pickup rate among the 10 daycare centers before the intervention (or treatment). In SAS, you can perform a t-test for a single mean using PROC TTEST as follows:

PROC TTEST DATA=DaycareFines_BeforeAfter H0=0.2 SIDES=2 ALPHA=0.1;
VAR before;
RUN;

In the PROC statement above, the H0= option specifies \(\mu_{H_0}\), which defaults to 0 if omitted. The SIDES= option specifies which type of test is performed. Here, the option is set to SIDES=2, meaning we are performing a two-sided test[5]. The ALPHA=0.1 option sets the significance level to 0.1; if omitted, it defaults to 0.05. Subsequently, the VAR statement specifies the variable on which the t-test is performed.

In the "RESULTS" tab, we see that the p-value is about 0.0071 for the test. So, at the significance level \(\alpha = 0.1\), we can reject the null hypothesis, \(H_0: \mu = 0.2\). 

Additionally, observe that SAS also calculates the 90% (\((1-\alpha) \times 100\%\)) confidence interval for the mean. This interval is calculated by the formula \(\bar{X} \pm SE \times t_{\nu,\,\alpha/2}\), and is constructed so that \((1-\alpha) \times 100\%\) of such intervals would capture the true population mean \(\mu\), assuming repeated random sampling and repeated calculation of sample means. The concept of hypothetical infinite replications and the resulting confidence level[6] is the groundwork for frequentist statistics.
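
As a sketch of the arithmetic behind that interval, the same limits can be reproduced from the sample mean, standard deviation, and sample size using the TINV quantile function (the output data set names BeforeStats and BeforeCI are arbitrary choices):

/* Reproduce the 90% confidence interval for the mean by hand; the limits     */
/* should match the interval reported by PROC TTEST with ALPHA=0.1.           */
PROC MEANS DATA=DaycareFines_BeforeAfter N MEAN STD NOPRINT;
VAR before;
OUTPUT OUT=BeforeStats N=n MEAN=xbar STD=s;
RUN;

DATA BeforeCI;
SET BeforeStats;
alpha = 0.1;
se = s / SQRT(n);                      /* estimated standard error        */
t_crit = TINV(1 - alpha / 2, n - 1);   /* t critical value with n - 1 df  */
lower = xbar - t_crit * se;
upper = xbar + t_crit * se;
RUN;

PROC PRINT DATA=BeforeCI NOOBS;
VAR n xbar s lower upper;
RUN;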

Paired Sample t-Test

PROC PRINT DATA=DaycareFines_BeforeAfter;
TITLE "DaycareFines Data: BY group";
BY group;
VAR center before after;
RUN;

Let's take a look at the data set again. If we specify a BY statement in the PRINT procedure, SAS prints the data set separately for each level of the BY variable (assuming the data set is sorted by that variable; otherwise, sort it first with PROC SORT). Within each group, we have the average late pickup rates observed before and after imposing fines for each daycare center. These rates come in pairs, meaning the before and after rates correspond to the same daycare center.

Now, let's consider testing whether there was any change in late pickup rates before and after the intervention for the test group. As mentioned earlier, the data points are paired because the before and after values are derived from the same sampling unit (daycare center). Thus, to assess the change for each daycare center, we can calculate the difference within each pair of observations.

DATA DaycareFines_Test;
SET DaycareFines_BeforeAfter;
WHERE group = 'test';
Effect = after - before;
RUN;

In the DATA step above, a new variable Effect is defined as after - before. Subsequently, we can conduct a one-sample t-test on these differences to determine if, on average, they differ from zero. If the sample mean of the differences falls within a reasonable range around zero, we fail to reject the null hypothesis \(H_0\), which states that the fine does not have any significant effect[7] on the average late pickup rates. Conversely, if the sample mean of the differences deviates significantly from zero, we can reject the null (\(H_0\)), indicating that the fine has a significant effect on the average late pickup rates.

PROC TTEST DATA=DaycareFines_Test;
VAR Effect; RUN;

In the RESULTS tab, we obtain a fairly small p-value; at the significance level of 0.05, we can reject the null. Thus, we can conclude that, on average, imposing a fine has an effect in the test group.

This kind of test is called a paired t-test. In fact, you can perform the paired t-test directly using the TTEST procedure, as follows:

PROC TTEST DATA=DaycareFines_Test;
PAIRED before * after;
RUN;

In the procedure above, the PAIRED statement lists the two paired variables to be compared, separated by an asterisk. If you run the procedure, you'll obtain the same result as before (the PAIRED statement analyzes the differences before - after, so the sign of the mean difference is flipped relative to Effect, while the p-value is identical):

Two Sample t-Test

This time, let's compare the late pickup rates after imposing fines between the test and control groups. That is, we are testing whether, on average, there is any significant difference in the late pickup rates after the intervention between the two groups. Since each observed after value comes from a different sampling unit (daycare center), we cannot subtract the after values of one group from those of another. This type of test is known as a two independent sample comparison.

One commonly employed data visualization tool for this purpose is the boxplot. It compares the distributional aspects of a variable, including central tendency and the five-number summary, across the levels of a categorical variable. For example:

PROC SGPLOT DATA=DaycareFines_BeforeAfter;
TITLE "Late pickup rates after imposing fines";
VBOX after / CATEGORY = group;
YAXIS LABEL = 'Late pickup rate';
RUN;

We see that, on average, there is a difference between the control and test groups[8]. To make sure, let's perform a confirmatory data analysis:

PROC TTEST DATA=DaycareFines_BeforeAfter;
CLASS group;
VAR after; RUN;

When comparing two independent groups, you use a CLASS statement and a VAR statement in the TTEST procedure. In the CLASS statement, you list the variable that distinguishes the two groups, which in this example is group. In the VAR statement, you list the response variable, which in this example is after.


In the RESULTS tab, we obtain two different sets of output depending on the method used: Pooled vs. Satterthwaite. The pooled variance method calculates a single sample variance, \(S_P^2\), assuming that the two independent groups have the same population variance (\(\sigma_1^2 = \sigma_2^2\)). It is calculated as the average of the two sample variances, weighted by their degrees of freedom:

\(S_P^2 = \frac{(n_1 - 1) \times S_1^2 + (n_2 - 1) \times S_2^2}{n_1 + n_2 - 2}\)

Then, take the square root of the pooled variance and calculate the t-test statistic as below:

\(t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{S_P \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\) 

The pooled SE, \(S_P \sqrt{1/n_1 + 1/n_2}\), is appropriate only when the population variances are equal, which is rarely guaranteed in practice. When you cannot reasonably assume that the population variances of the two groups are equal, use the Satterthwaite approximation. It approximates the degrees of freedom by the following formula:

 \(df = \frac{(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2})^2}{\frac{S_1^4}{n_1^2 (n_1 - 1)} + \frac{S_2^4}{n_2^2 (n_2 - 1)}}\)

Next, calculate the t-test statistic:

\(t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}\)

In the Satterthwaite method, we compare this t-test statistic to the t-distribution with the df calculated above.
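
To see how the two methods differ numerically, here is a minimal sketch that computes both versions from summary statistics. The sample sizes, means, and variances below are hypothetical placeholders, not values taken from the daycare data.

/* Pooled vs. Satterthwaite two-sample t statistics from summary statistics. */
/* All numeric inputs below are hypothetical placeholders.                    */
DATA TwoSampleT;
n1 = 6; xbar1 = 0.30; s1_sq = 0.010;   /* group 1 summary (placeholder) */
n2 = 4; xbar2 = 0.20; s2_sq = 0.004;   /* group 2 summary (placeholder) */

/* Pooled method: assumes equal population variances */
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2);
t_pool = (xbar1 - xbar2) / SQRT(sp_sq * (1 / n1 + 1 / n2));
df_pool = n1 + n2 - 2;
p_pool = 2 * (1 - PROBT(ABS(t_pool), df_pool));

/* Satterthwaite method: approximates the df when variances differ */
se_sq = s1_sq / n1 + s2_sq / n2;
df_satt = se_sq**2 / ((s1_sq / n1)**2 / (n1 - 1) + (s2_sq / n2)**2 / (n2 - 1));
t_satt = (xbar1 - xbar2) / SQRT(se_sq);
p_satt = 2 * (1 - PROBT(ABS(t_satt), df_satt));
RUN;

PROC PRINT DATA=TwoSampleT NOOBS;
VAR t_pool df_pool p_pool t_satt df_satt p_satt;
RUN;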


[1] Just like \(\bar{X}\), the sample variance \(S^2\) is a random variable whose outcome is determined by the random sampling process. However, the calculation of the sample variance involves a sum of squared deviations from the sample mean, which makes its distribution inherently skewed to the right. This property makes the median of \(S^2\) less than its expected value (\(\sigma^2\)). Consequently, the sample standard deviation \(S\) tends to be smaller than the true value \(\sigma\), making it likely that \(\frac{S}{\sqrt{n}} < \frac{\sigma}{\sqrt{n}}\).
[2] "Student" was the pen name of Gosset. His employer did not allow him to publish his findings under his real name due to company policy.  
[3] Here, \(\bar{X}\) and \(S\) are independent of each other. This can be proved by Basu's theorem.
[4] If no VAR statement is provided, the procedure applies to all numeric variables in the data set by default.  
[5] For a left-tailed test (\(H_0: \mu \ge \mu_{H_0}\) vs \(H_1: \mu < \mu_{H_0}\)), set SIDES=L. For a right-tailed test (\(H_0: \mu \le \mu_{H_0}\) vs \(H_1: \mu > \mu_{H_0}\)), set SIDES=U.
[6] Please don't confuse the concept of confidence with probability! Probability measures how likely a specific event is to occur in a single instance. Confidence, on the other hand, refers to the reliability or trustworthiness of an estimation process: it's not about a single event, but about how sure we are that our estimation procedure will capture the true value. In the context of confidence intervals, a 90% confidence interval, for example, tells us that 90% of the intervals produced in the same manner will capture the true parameter. For each interval, the event that it captures the true parameter is a Bernoulli outcome (0: does not capture, 1: captures); before the data are drawn, its success probability is the confidence level, but once the interval is computed there is no randomness left.
[7] Here, we are performing a two-sided test. So "effect" refers to both a positive effect (an increase in the late pickup rate) and a negative one (a decrease in the late pickup rate).
[8] For those who wonder how imposing fines actually "encouraged" parents to pick up their kids late: The Fines That Create Unintended Consequences (econlife.com)
