Statistical Hypothesis Testing

Before we get into the details of statistical hypothesis testing, let's get an intuitive feel for how it works. The YouTube video displayed above went viral because of a young boy's adorable little lie. In the video, his mom asks him if he had sprinkles, and he denies it.

Let's assume that the kid didn't eat the sprinkles. Mom points to a half-empty sprinkle jar as evidence against this presumption. How strong is this evidence against his innocence? If you have a cat at home, you might not find this evidence strong enough; a cat could have knocked the jar over. Then she mentions another piece of evidence: her son's face is covered in sprinkles. Can this evidence break our presumption?

Statistical hypothesis testing is a method for making decisions or drawing conclusions about a population of interest based on the sample data at hand. It starts by formulating two competing hypotheses: the null hypothesis (\(H_0\)), which typically represents the "status quo" or "no effect," and the alternative hypothesis (\(H_1\)), which posits that a different chance process is generating the observed sample data. We can think of the video watched earlier as an analogy for hypothesis testing.

  • Null Hypothesis \(H_0\): The boy is innocent (no effect)
  • Alternative Hypothesis \(H_1\): The boy is guilty (effect exists)
  • Sample Data: The collected pieces of evidence (half-empty jar, sprinkles on his face, etc.)
  • Statistical Test: Evaluation of the evidence

Let's start under the presumption of innocence (the null hypothesis is true). Essentially, what statistical hypothesis testing does is measure how likely we would be to observe the evidence at hand, or something even more extreme, if the presumption were true. If the observed evidence falls within a range we could reasonably expect, we cannot reject the null hypothesis. On the other hand, if the evidence is strong enough to fall beyond that reasonable range, we reject the null hypothesis, because such evidence would be highly unlikely under the presumption.

One key takeaway is that, just like the mom in the video, hypothesis testing does not test the alternative hypothesis. It never proves or disproves whether the alternative hypothesis is true[1]. Instead, hypothesis testing asks whether the evidence is strong enough to reject the null hypothesis.

Another important point is that failing to reject the null hypothesis does not necessarily mean that the null hypothesis is actually true. No one knows whether it is true or not. Rather, failing to reject the null simply means we do not have enough evidence to reject it[2].

One-Sample Z-Tests

A good starting point for understanding how statistical hypothesis testing works is the one-sample z-test. The one-sample z-test is used to determine whether a single sample comes from a population with a known mean[3] and standard deviation[4]. For example, according to ETS, the mean score for GRE Verbal Reasoning is 150.94 with a standard deviation of 8.48. To test whether this is the true population mean, suppose that you collected a sample of 30 scores[5] and calculated the sample mean as follows:

PROC MEANS DATA=GREScores;
  TITLE "GRE Score Sample Statistics";
  VAR GRE_V;
RUN;
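If you do not have the GREScores data set at hand (footnote [5] links to the full data), a simulated stand-in such as the sketch below will let the code in this section run. The data set name GREScores and the variables GRE_V and GRE_Q match the code shown here, but the generated values are purely illustrative and will not reproduce the sample means quoted later (155.8 and 163).

/* Hypothetical stand-in for GREScores: simulate 30 Verbal and Quantitative scores */
DATA GREScores;
  CALL STREAMINIT(2024);                          /* fix the seed for reproducibility */
  DO i = 1 TO 30;
    GRE_V = ROUND(RAND('NORMAL', 150.94, 8.48));  /* Verbal scores                    */
    GRE_Q = ROUND(RAND('NORMAL', 155.44, 9.78));  /* Quantitative scores              */
    OUTPUT;
  END;
  DROP i;
RUN;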

According to the central limit theorem, the sample mean calculated above is a realized value of a random variable following the sampling distribution \(N(\mu,\;\sigma^2/n)\). To test whether \(\mu\) really equals the known value stated above, we take that value as the null hypothesis (denoted by \(\mu_{H_0}\)) and calculate the z-test statistic as follows:

\(Z = \frac{\text{Observed} - \text{Expected}_{H_0}}{SE} = \frac{\bar{X} - \mu_{H_0}}{\sqrt{\sigma^2/n}} = \frac{155.8 - 150.94}{8.48 / \sqrt{30}}\)
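Working through the arithmetic with these numbers gives roughly:

\(Z = \frac{4.86}{8.48/\sqrt{30}} \approx \frac{4.86}{1.548} \approx 3.14\)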

This standardization transforms the random variable \(\bar{X}\) into another random variable \(Z\): given \(\bar{X} \sim N(\mu,\;\sigma^2/n)\), the z-test statistic follows a normal distribution with mean 0 and standard deviation 1 under the null hypothesis (that is, when \(\mu = \mu_{H_0}\)).

This statistic represents the standardized difference between the observed sample mean, \(\bar{X}\), and the population mean under the null hypothesis, \(\mu_{H_0}\), measured in units of the standard error[6], \(\sigma / \sqrt{n}\). Put simply, the z-test statistic indicates how many standard errors the sample mean is away from the hypothesized population mean. Using SAS, let's calculate the value of the z-statistic:

/* Z-test statistic */
DATA _NULL_;
  Z = (155.8 - 150.94) / (8.48 / SQRT(30));
  CALL SYMPUTX('z_statistic', Z);
RUN;
%PUT &z_statistic;

Based on the sample mean of 155.8 and the known population standard deviation of 8.48, the calculated test statistic is approximately 3.139. The next question is how strong this evidence is against the null hypothesis. 

One intuitive way to assess this is by calculating its associated p-value. The p-value[7] represents the probability of obtaining a value as extreme as or more extreme than the observed test statistic, assuming the null hypothesis is true. Here, the null hypothesis \(H_0\) is \(\mu = 150.94\), so the alternative hypothesis \(H_1\) becomes \(\mu \ne 150.94\). In this context, "extreme" covers situations where the sample mean is either much larger or much smaller than \(\mu_{H_0}\). Thus, we can calculate the p-value for the test statistic as follows:

\(P(|z| > \text{Observed value}) = P(z < -\text{Observed value}) + P(z > \text{Observed value})\), where \(z \sim N(0,\; 1)\)

This type of test is known as a two-sided test and is used to check whether the expected value of the sample mean equals the value under the null hypothesis. The following SAS code calculates the p-value for this case:

/* Calculate p-value */
DATA _NULL_;
  P = CDF('NORMAL', -&z_statistic) + (1 - CDF('NORMAL', &z_statistic));
  CALL SYMPUTX('p_value', P);
RUN;
%PUT &p_value;

In the code, the CDF function computes the cumulative probability up to a specified value of a random variable for a given probability distribution. The normal distribution has a support of \((-\infty,\;\infty)\). Thus, CDF('NORMAL', -&z_statistic) calculates the cumulative probability from \(-\infty\) to \(-3.139\), corresponding to \(\int_{-\infty}^{-3.139} f_Z(z) dz = P(z < -\text{Observed value})\). 

On the other hand, CDF('NORMAL', &z_statistic) corresponds to \(\int_{-\infty}^{3.139} f_Z(z) dz = P(z \le \text{Observed value})\). To find \(P(z > \text{Observed value})\), we take advantage of two facts: first, \((3.139, \; \infty)\) is the complement of \((-\infty, \; 3.139]\); second, \(\int_{-\infty}^{\infty} f_Z(z) dz = 1\). So \(1-P(z \le \text{Observed value})\), which is 1 - CDF('NORMAL', &z_statistic) in the code, corresponds to \(P(z > \text{Observed value})\). Finally, we add these two parts so that P reflects extremes in both directions.
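Because the standard normal density is symmetric about zero, the two tails are equal, so the same p-value can also be written as \(2\,P(z > |\text{Observed value}|)\). A minimal equivalent sketch, reusing the &z_statistic macro variable from above:

/* Equivalent two-sided p-value via symmetry of the standard normal */
DATA _NULL_;
  P = 2 * (1 - CDF('NORMAL', ABS(&z_statistic)));
  PUT P=;
RUN;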

The calculated p-value is less than 0.002. This suggests that under the null hypothesis, the probability of obtaining a test statistic like the current one (3.139), or one even more extreme in either direction, is less than 0.2%, which is fairly small. Thus, we can confidently reject the null hypothesis that \(\mu = 150.94\).

Next, let's explore how to perform one-sided tests. Unlike the two-sided test, the hypotheses of one-sided tests are:

| Null Hypothesis \(H_0\) | Corresponding Alternative \(H_1\) | p-value |
| --- | --- | --- |
| \(\mu \le \mu_{H_0}\) | \(\mu > \mu_{H_0}\) | \(P(\text{Test Statistic} > \text{Observed Value} \mid H_0)\) |
| \(\mu \ge \mu_{H_0}\) | \(\mu < \mu_{H_0}\) | \(P(\text{Test Statistic} < \text{Observed Value} \mid H_0)\) |

For example, according to ETS, the population mean of GRE Quantitative Reasoning is 155.44, with a standard deviation of 9.78. Now, suppose that we want to test whether the population mean \(\mu\) is greater than this known value (\(\mu > 155.44\)). In this setting, we start by assuming \(\mu \le 155.44\) and abandon that assumption only if the sample mean is much greater than 155.44:

/* Calculate sample mean: GRE_Q */
PROC MEANS DATA=GREScores NOPRINT;
  VAR GRE_Q;
  OUTPUT OUT=SampleStat_GRE_Q(WHERE=(_STAT_='MEAN'));
RUN;

/* Calculate p-value: one-sided */
DATA _NULL_;
  SET SampleStat_GRE_Q;
  Z = (GRE_Q - 155.44) / (9.78 / SQRT(30));
  P = 1 - CDF('NORMAL', Z);
  PUT GRE_Q Z P;
RUN;

Since the alternative hypothesis asks whether \(\mu > 155.44\), "extreme" in this context means a sample mean far greater than that value. Thus, the p-value in this setting is \(P(z > \text{Observed value}) = \int_{\text{Observed value}}^{\infty}f_Z(z)dz = 1 - \int_{-\infty}^{\text{Observed value}}f_Z(z)dz\).

Based on the collected sample, we can reject the null hypothesis; under the assumption of \(\mu \le 155.44\), it is highly unlikely to observe a sample mean of 163.
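As a rough check, plugging the reported sample mean of 163 into the formula above gives approximately

\(Z = \frac{163 - 155.44}{9.78/\sqrt{30}} \approx \frac{7.56}{1.786} \approx 4.23, \qquad p = P(z > 4.23) \approx 1.2 \times 10^{-5},\)

which is far below any conventional significance level.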

Statistical Decision Making

Before collecting data and calculating the p-value, we typically establish a threshold for deciding whether we can reject the null hypothesis. This threshold is known as the significance level, \(\alpha\). It is worth emphasizing that \(\alpha\) is set prior to data collection, so that the decision is based on predetermined criteria rather than subjective judgment.

Then, after calculating the p-value, we compare it to \(\alpha\). If the p-value is smaller than \(\alpha\), we reject the null hypothesis. On the other hand, if the p-value is greater than or equal[8] to \(\alpha\), we cannot reject the null hypothesis.

The common choice for \(\alpha\) is 0.05. This choice isn't based on strict mathematical reasoning, but rather on a balance between two kinds of errors we might make:

| The null hypothesis \(H_0\) is: | You decide to reject \(H_0\) (p-value \(< \alpha\)) | You decide not to reject \(H_0\) (p-value \(\ge \alpha\)) |
| --- | --- | --- |
| False | Correct decision: rejecting a false \(H_0\) | Type II error: not rejecting a false \(H_0\) |
| True | Type I error: rejecting a true \(H_0\) | Correct decision: not rejecting a true \(H_0\) |

As described in the table above, there are two different ways to make a correct decision: rejecting a false \(H_0\) and not rejecting a true \(H_0\). Conversely, there are also two different sources of error: rejecting a true \(H_0\) and not rejecting a false \(H_0\).

When we reject a true \(H_0\), this error is called a Type I error, often referred to as a false positive. If you set \(\alpha = 0.05\), a common choice for the significance level, this means that even if \(H_0\) is true, you will reject it whenever your p-value happens to fall below 0.05. In this setting, the probability of making a Type I error is capped at 5%: the significance level \(\alpha\) represents the maximum permissible probability of a Type I error!
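If you want to see this cap in action, a small simulation sketch helps: draw many samples from a population where \(H_0\) is actually true (here, reusing the GRE Verbal numbers from earlier), run the two-sided z-test on each, and count how often the test rejects at \(\alpha = 0.05\). The data set name and the number of replications below are made up for this illustration; the rejection rate should land near 5%.

/* Simulation sketch: Type I error rate when H0 is true */
DATA TypeISim;
  CALL STREAMINIT(42);
  DO rep = 1 TO 10000;                                  /* simulated studies               */
    xbar = 0;
    DO i = 1 TO 30;                                     /* each study draws n = 30 scores  */
      xbar = xbar + RAND('NORMAL', 150.94, 8.48) / 30;  /* population where H0 holds       */
    END;
    z = (xbar - 150.94) / (8.48 / SQRT(30));
    p = 2 * (1 - CDF('NORMAL', ABS(z)));                /* two-sided p-value               */
    reject = (p < 0.05);                                /* 1 = wrongly rejecting a true H0 */
    OUTPUT;
  END;
  KEEP rep xbar z p reject;
RUN;

/* The mean of reject is the simulated Type I error rate; expect a value close to 0.05 */
PROC MEANS DATA=TypeISim MEAN;
  VAR reject;
RUN;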

The second type of error, known as a Type II error, occurs when the null hypothesis is false, but we fail to reject it. The probability of making a Type II error is denoted by \(\beta\). Conversely, the probability of correctly rejecting a false null hypothesis is \(1-\beta\), which is referred to as the power of the test.

Statistical Significance vs Importance

As mentioned, the significance level \(\alpha\) is typically set at 0.05. If we obtain a p-value smaller than 0.05, we can reject the null and conclude that "it is statistically significant"; that is, there is a significant amount of evidence against the null hypothesis. On the other hand, if the p-value is greater than the chosen value of \(\alpha\), we fail to reject the null hypothesis and say that "the amount of evidence is not enough to reject the null hypothesis."

However, statistical significance is not the same thing as practical importance. Suppose the p-value comes out just slightly above 0.05, say 0.051. It is virtually the same as 0.05, yet we would report that there is no statistically significant effect.

How can that be? When we say a result is not statistically significant, it simply means that we are not confident enough to say the effect is there. The standard error of the z-statistic has the square root of the sample size in its denominator, so a small sample results in a large SE, which makes the p-value larger and can lead to Type II errors. The quick remedy I would suggest in this case is: "Get more data."
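To put a number on it, using the GRE Verbal \(\sigma = 8.48\) from earlier purely as an illustration: quadrupling the sample size halves the standard error.

\(n = 30:\; SE = \frac{8.48}{\sqrt{30}} \approx 1.55, \qquad n = 120:\; SE = \frac{8.48}{\sqrt{120}} \approx 0.77\)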

On the other hand, a large sample size results in a small SE, and with a small SE we may quickly get a highly significant test result even when the true mean differs from \(\mu_{H_0}\) by a trivial amount. This can be misleading if the size of the effect is too small to matter in practice.
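As an illustration (the numbers are chosen only for the sake of argument), a difference of just 0.1 points, trivial on the GRE scale, becomes "highly significant" once the sample is large enough:

\(n = 100{,}000:\quad Z = \frac{0.1}{8.48/\sqrt{100000}} \approx \frac{0.1}{0.027} \approx 3.7, \qquad p \approx 0.0002\)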

Statistical significance is a valuable tool, but it should be interpreted alongside the effect size to assess practical importance. A well-designed study will consider the desired effect size and power to determine the appropriate sample size.
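In SAS, one way to do this kind of planning is PROC POWER. The sketch below solves for the sample size needed to detect an assumed true mean of 153 (a roughly two-point effect, chosen purely for illustration) with 80% power in a one-sample test; note that the ONESAMPLEMEANS analysis is based on the t-test rather than the z-test, which is what you would use in practice when \(\sigma\) has to be estimated from the data.

/* Sketch: solve for the required sample size (illustrative effect size and power) */
PROC POWER;
  ONESAMPLEMEANS
    NULLMEAN = 150.94   /* mean under H0              */
    MEAN     = 153      /* assumed true mean under H1 */
    STDDEV   = 8.48     /* assumed population SD      */
    POWER    = 0.8      /* desired power (1 - beta)   */
    NTOTAL   = .;       /* quantity to solve for      */
RUN;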


[1] One very misleading phrase is "accept the alternative." If someone uses this phrase, it suggests a lack of basic understanding of statistical testing and inferential statistics. Please do not say this, particularly during job interviews for data scientist or statistician positions!  
[2] Thus, it is wrong to say that the null (or alternative) hypothesis is true or false. Saying a hypothesis is true or false suggests a binary outcome where one of the hypotheses is definitively true. However, this goes against the probabilistic nature of statistical tests and overlooks the possibility of making Type I or Type II errors.  
[3] The population mean under the null hypothesis, denoted by \(\mu_{H_0}\).  
[4] The true population standard deviation, denoted by \(\sigma\). The \(\sigma\) is not the subject of the test, and its true value must be known to compute the z-test statistic.  
[5] For the purpose of explaining the z-test statistic, we are using only a portion of the full data set. The full data can be found at: https://www.openicpsr.org/openicpsr/project/155721/version/V1/view   
[6] You can think of the standard error as the standard deviation of sample means.  
[7] The "p" in p-value stands for probability.  
[8] The normal distribution is a continuous distribution, so technically, including an equal sign doesn't really matter. Conventionally, though, we include the equal sign on this side, as the null hypothesis itself represents the status quo.
