[MathStat] 5. Hypothesis Testing
This post is an overall summary of Chapter 8 of the textbook Statistical Inference by Casella and Berger.
Table of Contents
5.1 Basics of Hypothesis Testing
- Overview of statistical testing procedure
- Type I error & Type II error
- Construction of Tests
- Evaluating Tests
5.2 Neyman-Pearson Test
- Simple hypothesis
- Power of a statistical test
- The Neyman-Pearson testing procedure
- The Neyman-Pearson Lemma
5.3 Wald Test
- Power of Wald Test
5.4 Likelihood Ratio Test (LRT)
5. Hypothesis Testing
5.1 Basics of Hypothesis Testing
5.1.1. Overview of statistical testing procedure
The most typical, almost clichéd setting of statistical hypothesis testing is that from the observations $X_1, \dots, X_n \overset{i.i.d.}{\sim} f_\theta$, we want to test whether $\theta = \theta_0$ or not.
A more concrete example is where we have a coin and we are interested in finding out whether the coin is fair (i.e. equal probability of heads and tails) or not. Letting $p$ denote the probability of heads, this procedure can formally be expressed as:

$$H_0: p = \frac{1}{2} \quad \text{vs.} \quad H_1: p \neq \frac{1}{2}$$
To generalize this problem, we have two non-overlapping sets of parameters $\Theta_0$ and $\Theta_1$ (i.e. $\Theta_0 \cap \Theta_1 = \emptyset$), and would like to test the hypothesis:

$$H_0: \theta \in \Theta_0 \quad \text{vs.} \quad H_1: \theta \in \Theta_1$$
For the case when $\Theta_0$ is a single point, we call it a simple null, whereas the more general case is referred to as a composite null.
As a general reminder, statistical hypothesis testing never delivers certainty. Namely, the question is never whether the null hypothesis is true or not. Rather, it is whether we have sufficient evidence to reject the null hypothesis, because the result of a hypothesis test is one of two possibilities: “reject the null” or “retain the null”.
5.1.2. Type I error & Type II error
Since hypothesis testing is based on random samples, it is always prone to incorrect decisions. Regarding this, there are two important concepts in hypothesis testing called the Type I error (rejecting $H_0$ when it is true) and the Type II error (retaining $H_0$ when it is false). At a high level, the more critical of the two is the Type I error, so the strategy is often to first bound the Type I error at a desired level ($\alpha$) and then minimize the Type II error.
Results of a statistical test:

| | $H_0$ true | $H_1$ true |
| --- | --- | --- |
| Retain $H_0$ | Correct decision | Type II error |
| Reject $H_0$ | Type I error | Correct decision |
5.1.3. Construction of Tests
In a nutshell, the canonical way of constructing a test is:
- Choose a test statistic $T(X_1, \dots, X_n)$ that summarizes the evidence against the null.
- Choose a rejection region $R$.
- Reject $H_0$ if $T(X_1, \dots, X_n) \in R$; otherwise retain $H_0$.
5.1.4. Evaluating Tests
Then let’s discuss how to evaluate different tests.
Suppose our decision rule for a specific testing problem is that the null hypothesis is rejected when $T(X_1, \dots, X_n) \in R$.
In this setting, we can define the power function as:

$$\beta(\theta) = P_\theta\left(T(X_1, \dots, X_n) \in R\right)$$
Naturally, we would want $\beta(\theta)$ to be small over the null region $\Theta_0$ and large over the alternative region $\Theta_1$.
In this sense, the Neyman-Pearson paradigm is the following: among all tests that keep the type I error below the significance level,

$$\sup_{\theta \in \Theta_0} \beta(\theta) \leq \alpha,$$

choose the test that maximizes the power $\beta(\theta)$ over $\theta \in \Theta_1$.
Tests satisfying the constraint above are called level-$\alpha$ tests; the constraint guarantees that the type I error is bounded by $\alpha$.
Example: A One-sided Test
Okay, so let’s quickly go over an example.
Suppose $X_1, \dots, X_n \overset{i.i.d.}{\sim} N(\theta, \sigma^2)$, with known $\sigma^2$.
We want to test against the one-sided alternative:

$$H_0: \theta = \theta_0 \quad \text{vs.} \quad H_1: \theta > \theta_0$$
A natural test statistic we can think of is the standardized average of the samples:

$$T_n = \frac{\sqrt{n}\,(\bar{X}_n - \theta_0)}{\sigma}$$
Then, our strategy is to reject the null if $T_n > t$ for some threshold $t$. To choose the optimal threshold, let's consider the power function:

$$\beta(\theta) = P_\theta(T_n > t) = P_\theta\left(\frac{\sqrt{n}(\bar{X}_n - \theta)}{\sigma} > t - \frac{\sqrt{n}(\theta - \theta_0)}{\sigma}\right) = 1 - \Phi\left(t - \frac{\sqrt{n}(\theta - \theta_0)}{\sigma}\right),$$

where $\Phi$ is the cdf of the standard normal distribution.
Then, following the Neyman-Pearson paradigm, we choose the threshold $t$ that maximizes the power subject to:

$$\beta(\theta_0) = 1 - \Phi(t) \leq \alpha.$$

Since the power is decreasing in $t$, the constraint should be tight, which gives $t = z_\alpha := \Phi^{-1}(1 - \alpha)$.
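To see these formulas in action, here is a minimal numerical sketch of the one-sided z-test; the specific values of `alpha`, `sigma`, `n`, and `theta0` are illustrative choices, not from the text:

```python
import numpy as np
from scipy.stats import norm

alpha, sigma, n = 0.05, 1.0, 25   # illustrative choices
theta0 = 0.0

# Optimal one-sided threshold: t = z_alpha = Phi^{-1}(1 - alpha)
t = norm.ppf(1 - alpha)

def power(theta):
    """beta(theta) = 1 - Phi(t - sqrt(n) * (theta - theta0) / sigma)."""
    return 1 - norm.cdf(t - np.sqrt(n) * (theta - theta0) / sigma)

print(f"threshold t = {t:.3f}")                               # ~1.645
print(f"power at theta0       = {power(theta0):.3f}")         # equals alpha
print(f"power at theta0 + 0.5 = {power(theta0 + 0.5):.3f}")   # well above alpha
```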
Similarly, for the two-sided alternative:

$$H_0: \theta = \theta_0 \quad \text{vs.} \quad H_1: \theta \neq \theta_0$$
The decision rule becomes: reject the null if

$$|T_n| > t,$$
and the corresponding power function is:

$$\beta(\theta) = P_\theta(T_n > t) + P_\theta(T_n < -t),$$

which as before can be expanded as:

$$\beta(\theta) = 1 - \Phi\left(t - \frac{\sqrt{n}(\theta - \theta_0)}{\sigma}\right) + \Phi\left(-t - \frac{\sqrt{n}(\theta - \theta_0)}{\sigma}\right).$$
To implement the Neyman-Pearson paradigm under this setting, we just have to plug in the null parameter $\theta_0$, which gives us:

$$\beta(\theta_0) = 1 - \Phi(t) + \Phi(-t) = 2\left(1 - \Phi(t)\right) = \alpha \quad \Longrightarrow \quad t = z_{\alpha/2}.$$
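As a quick numerical check, with the common (illustrative) choice $\alpha = 0.05$:

$$\text{one-sided: } t = z_{0.05} = \Phi^{-1}(0.95) \approx 1.645, \qquad \text{two-sided: } t = z_{0.025} = \Phi^{-1}(0.975) \approx 1.96.$$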
To summarize what we've done so far: we discussed how to formally set up a statistical hypothesis testing problem, and how to find the optimal rejection threshold that controls the Type I error while at the same time attaining the maximum power, i.e. the Neyman-Pearson paradigm.
5.2 Neyman-Pearson Test
Next, let's discuss a general method for constructing an optimal test. For simplicity, we will focus on simple hypothesis testing problems, but the general idea carries over to more extended settings as well.
5.2.1. Simple hypothesis
The setting we are going to consider is where both the null and alternative hypotheses are simple:

$$H_0: \theta = \theta_0 \quad \text{vs.} \quad H_1: \theta = \theta_1$$
Furthermore, let's denote the null density by $f_0$ and the alternative density by $f_1$.
5.2.2. Power of a statistical test
To recap, the power of a test is defined as the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. By definition, this is the complement of the type II error probability: $\text{power} = 1 - P(\text{type II error})$.
On the other hand, remember that our primary goal was to bound the type I error by a pre-specified significance level $\alpha$. Since the type I error and the type II error are in a tradeoff relationship, we can infer that an "optimal" test will be one whose type I error actually attains the upper bound (i.e. the significance level). We will see how to make this happen in a second, but first let's define some terminology:

$$\text{type I error} = P_{\theta_0}(X^n \in R), \qquad \text{power} = P_{\theta_1}(X^n \in R),$$

where $R$ is the rejection region and $X^n = (X_1, \dots, X_n)$.
5.2.3. The Neyman-Pearson testing procedure
The key takeaway of the Neyman-Pearson procedure is that it provides a "general recipe" for creating an optimal test by making the type I error exactly equal to the significance level, albeit only for the restricted class of simple hypothesis tests.
For $\mathbf{x} = (x_1, \dots, x_n)$, the Neyman-Pearson test statistic is the likelihood ratio:

$$T(\mathbf{x}) = \frac{\prod_{i=1}^n f_1(x_i)}{\prod_{i=1}^n f_0(x_i)},$$
and reject the null hypothesis if the value is within the rejection region.
Since we want our test to be "optimal", we define the threshold $t^*$ of the rejection region such that the type I error attains the upper bound (the significance level):

$$P_{\theta_0}\left(T(X_1, \dots, X_n) > t^*\right) = \alpha.$$
Then for a simple hypothesis testing problem, we can analytically find the threshold $t^*$, which guarantees the optimality of the test. So in a nutshell, this kind of test is the best we can think of whenever it is applicable. The theoretical foundation of this result is called the Neyman-Pearson Lemma.
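To make the recipe concrete, here is a minimal sketch of the NP test for the simple hypotheses $f_0 = N(0, 1)$ vs. $f_1 = N(1, 1)$; the densities, sample size, and seed are my own illustrative choices. For this pair the log-likelihood ratio is monotone in $\bar{X}_n$, which lets us calibrate the threshold exactly under the null:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, alpha = 50, 0.05

def log_likelihood_ratio(x):
    """log of prod f1(x_i) / prod f0(x_i) for f0 = N(0,1), f1 = N(1,1)."""
    return norm.logpdf(x, loc=1).sum() - norm.logpdf(x, loc=0).sum()

# Here the log-LR equals n * mean(x) - n / 2, which is increasing in mean(x), so
# "reject for a large LR" is equivalent to "reject when mean(x) > z_alpha / sqrt(n)".
c = norm.ppf(1 - alpha) / np.sqrt(n)   # threshold on the mean(x) scale
log_t_star = n * c - n / 2             # the same threshold on the log-LR scale

x = rng.normal(loc=1.0, size=n)        # data actually drawn from the alternative
print(f"log LR = {log_likelihood_ratio(x):.2f}, log t* = {log_t_star:.2f}")
print("reject H0:", log_likelihood_ratio(x) > log_t_star)
```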
5.2.4. The Neyman-Pearson Lemma
Let’s formally state the Neyman-Pearson lemma.
<Definition>
(Neyman-Pearson Lemma) Consider testing the simple hypotheses $H_0: \theta = \theta_0$ vs. $H_1: \theta = \theta_1$, and let $R_{NP}$ be the rejection region of the Neyman-Pearson test whose threshold $t^*$ satisfies $P_{\theta_0}(X^n \in R_{NP}) = \alpha$. Then, for any level-$\alpha$ test with rejection region $R$, we have $P_{\theta_1}(X^n \in R_{NP}) \geq P_{\theta_1}(X^n \in R)$; that is, the NP test is the most powerful level-$\alpha$ test.
<Proof>
First, let's clarify some notation: write the joint densities as $f_j(\mathbf{x}) = \prod_{i=1}^n f_j(x_i)$ for $j = 0, 1$; let $R_{NP} = \{\mathbf{x} : f_1(\mathbf{x}) > t^* f_0(\mathbf{x})\}$ be the rejection region of the NP test, with $P_{\theta_0}(X^n \in R_{NP}) = \alpha$; and let $R$ be the rejection region of an arbitrary level-$\alpha$ test, so that $P_{\theta_0}(X^n \in R) \leq \alpha$.

Our goal is to show that the power of the NP test is at least that of any level-$\alpha$ test:

$$P_{\theta_1}(X^n \in R_{NP}) \geq P_{\theta_1}(X^n \in R).$$

Then, we use the property of set operations, namely:

$$R_{NP} = (R_{NP} \cap R) \cup (R_{NP} \cap R^c), \qquad R = (R \cap R_{NP}) \cup (R \cap R_{NP}^c),$$

so the common part $R_{NP} \cap R$ cancels when comparing the two regions. Therefore,

$$\begin{aligned} P_{\theta_1}(R_{NP}) - P_{\theta_1}(R) &= \int_{R_{NP} \cap R^c} f_1(\mathbf{x})\, d\mathbf{x} - \int_{R \cap R_{NP}^c} f_1(\mathbf{x})\, d\mathbf{x} \\ &\geq t^* \int_{R_{NP} \cap R^c} f_0(\mathbf{x})\, d\mathbf{x} - t^* \int_{R \cap R_{NP}^c} f_0(\mathbf{x})\, d\mathbf{x} \\ &= t^* \left( P_{\theta_0}(R_{NP}) - P_{\theta_0}(R) \right) = t^* \left( \alpha - P_{\theta_0}(R) \right) \geq 0, \end{aligned}$$

where the inequality uses $f_1 > t^* f_0$ on $R_{NP}$ and $f_1 \leq t^* f_0$ on $R_{NP}^c$, and the last step uses $P_{\theta_0}(R) \leq \alpha$ together with $t^* \geq 0$. This completes the proof.
5.3 Wald Test
By now, we know that the Neyman-Pearson test is, by construction, the optimal test whenever it is defined. However, the NP test is quite restrictive in that it is not defined once the testing problem becomes more complex than simple hypothesis testing.
Therefore, as an alternative, we can consider other testing procedures which are not the "best", but are still very useful. More specifically, these tests are guaranteed to bound the type I error while at the same time achieving reasonably high power. In this sense, we are going to take a look at two asymptotic tests: the Wald test and the likelihood ratio test.
Okay, so first let's discuss the Wald test, which relies on the asymptotic normality of the MLE. The Wald test considers hypothesis testing problems of the form:

$$H_0: \theta = \theta_0 \quad \text{vs.} \quad H_1: \theta \neq \theta_0$$
The basic idea of this test is to use an estimator which is asymptotically normal under the null, a natural candidate being the MLE. Namely, the MLE satisfies (with mild regularity conditions):

$$\frac{\hat{\theta}_n - \theta_0}{\widehat{\mathrm{se}}(\hat{\theta}_n)} \overset{d}{\to} N(0, 1) \quad \text{under } H_0,$$

where $\widehat{\mathrm{se}}(\hat{\theta}_n)$ is an estimate of the standard error of the MLE, e.g. $\widehat{\mathrm{se}} = 1/\sqrt{n I(\hat{\theta}_n)}$.
With respect to the MLE, we can define the Wald test statistic as:

$$W = \frac{\hat{\theta}_n - \theta_0}{\widehat{\mathrm{se}}(\hat{\theta}_n)},$$

and reject the null hypothesis if:

$$|W| > z_{\alpha/2},$$

which gives an asymptotic level-$\alpha$ test.
<Example>
A simple example will be helpful in demonstrating the Wald testing procedure.
Suppose $X_1, \dots, X_n \overset{i.i.d.}{\sim} \text{Bernoulli}(p)$, and we would like to test the hypothesis $H_0: p = p_0$.
Then, since the MLE is $\hat{p} = \bar{X}_n$ with estimated standard error $\widehat{\mathrm{se}} = \sqrt{\hat{p}(1 - \hat{p})/n}$, the Wald test statistic is:

$$W = \frac{\hat{p} - p_0}{\sqrt{\hat{p}(1 - \hat{p})/n}},$$

and the corresponding rejection region is:

$$R = \left\{ \mathbf{x} : |W| > z_{\alpha/2} \right\}.$$
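A minimal sketch of this Bernoulli Wald test follows; the sample size, true parameter, and $p_0$ are illustrative choices, not from the text:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, p0, alpha = 200, 0.5, 0.05

x = rng.binomial(1, 0.6, size=n)        # data simulated with true p = 0.6
p_hat = x.mean()                        # MLE of p
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)

W = (p_hat - p0) / se_hat               # Wald statistic
z = norm.ppf(1 - alpha / 2)             # two-sided critical value

print(f"p_hat = {p_hat:.3f}, W = {W:.3f}, z_(alpha/2) = {z:.3f}")
print("reject H0:", abs(W) > z)
```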
5.3.1. Power of Wald Test
Using the fact that the Wald test statistic converges to a normal distribution, we can derive an (asymptotic) power function of the Wald test in a closed form.
For the Wald test statistic:

$$W = \frac{\hat{\theta}_n - \theta_0}{\widehat{\mathrm{se}}},$$

the probability that the Wald test correctly rejects the null when the true parameter is $\theta_1 \neq \theta_0$ is roughly (stated without proof):

$$\beta(\theta_1) \approx 1 - \Phi\left( \frac{\theta_0 - \theta_1}{\widehat{\mathrm{se}}} + z_{\alpha/2} \right) + \Phi\left( \frac{\theta_0 - \theta_1}{\widehat{\mathrm{se}}} - z_{\alpha/2} \right).$$
Some noteworthy points from this result are:
- If the difference between $\theta_0$ and $\theta_1$ is very small (i.e. the truth is close to the null), the power approaches $\alpha$.
- As $n \to \infty$ (so that $\widehat{\mathrm{se}} \to 0$), the two $\Phi$ terms approach either 0 or 1, which pushes the power to its maximum value of 1.
- As a rule of thumb, the Wald test has non-trivial power if $|\theta_1 - \theta_0|$ is large relative to $\widehat{\mathrm{se}} \approx 1/\sqrt{n}$, as the sketch below illustrates.
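The asymptotic power formula is easy to evaluate directly; the sketch below uses an illustrative standard error $\widehat{\mathrm{se}} = 1/\sqrt{n}$ and a small grid of alternatives:

```python
import numpy as np
from scipy.stats import norm

alpha, n, theta0 = 0.05, 100, 0.0
se = 1 / np.sqrt(n)                     # illustrative standard error
z = norm.ppf(1 - alpha / 2)

def wald_power(theta1):
    """Asymptotic power of the two-sided Wald test at the alternative theta1."""
    shift = (theta0 - theta1) / se
    return 1 - norm.cdf(shift + z) + norm.cdf(shift - z)

for theta1 in [0.0, 0.1, 0.2, 0.5]:
    print(f"theta1 = {theta1:.1f}: power ~ {wald_power(theta1):.3f}")
# At theta1 = theta0 the "power" equals alpha; it climbs toward 1 as theta1 moves away.
```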
5.4 Likelihood Ratio Test (LRT)
The second testing procedure is the likelihood ratio test (LRT), one of the most commonly used testing procedures across a wide range of problems.
Suppose we want to test the composite hypothesis:

$$H_0: \theta \in \Theta_0 \quad \text{vs.} \quad H_1: \theta \notin \Theta_0$$

Then the LRT statistic is defined as:

$$\lambda(X_1, \dots, X_n) = \frac{\sup_{\theta \in \Theta_0} L(\theta)}{\sup_{\theta \in \Theta} L(\theta)} = \frac{L(\hat{\theta}_0)}{L(\hat{\theta})},$$

where the numerator corresponds to the MLE under the null ($\hat{\theta}_0$), whereas the denominator uses the MLE over the entire parameter space ($\hat{\theta}$).
Then, the decision rule is to reject $H_0$ if $\lambda(X_1, \dots, X_n) \leq c$, where the threshold $c$ is determined according to the significance level $\alpha$.
One very useful property of the LRT is that the test statistic converges to a chi-square distribution, a result formally known as Wilks' phenomenon (Wilks' theorem).
Thus, if we let $T_{LRT} = -2 \log \lambda(X^n)$, then under $H_0$:

$$T_{LRT} \overset{d}{\to} \chi^2_{\nu}, \qquad \nu = \dim(\Theta) - \dim(\Theta_0),$$

so rejecting when $T_{LRT} > \chi^2_{\nu, \alpha}$ gives an asymptotic level-$\alpha$ test, where $\chi^2_{\nu, \alpha}$ is the upper $\alpha$ quantile of $\chi^2_{\nu}$.
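As a small illustration, here is a sketch of the LRT for $H_0: p = p_0$ in a Bernoulli model (so $\nu = 1$); the data-generating values are illustrative. Note that the binomial coefficient cancels in the likelihood ratio:

```python
import numpy as np
from scipy.stats import binom, chi2

rng = np.random.default_rng(2)
n, p0, alpha = 200, 0.5, 0.05

x = rng.binomial(1, 0.6, size=n)        # data simulated with true p = 0.6
k, p_hat = x.sum(), x.mean()            # p_hat is the unrestricted MLE

# log lambda = log L(p0) - log L(p_hat); the binomial coefficient cancels.
log_lambda = binom.logpmf(k, n, p0) - binom.logpmf(k, n, p_hat)
T = -2 * log_lambda

crit = chi2.ppf(1 - alpha, df=1)        # upper-alpha quantile of chi2(1)
print(f"T_LRT = {T:.3f}, critical value = {crit:.3f}")
print("reject H0:", T > crit)
```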
<Proof>
For the simple null $\Theta_0 = \{\theta_0\}$ (so that $\nu = 1$), consider the log-likelihood $l(\theta)$ and its Taylor expansion around the MLE $\hat{\theta}$:

$$l(\theta_0) \approx l(\hat{\theta}) + l^{\prime}(\hat{\theta})(\theta_0 - \hat{\theta}) + \frac{1}{2} l^{\prime\prime}(\hat{\theta})(\theta_0 - \hat{\theta})^2 = l(\hat{\theta}) + \frac{1}{2} l^{\prime\prime}(\hat{\theta})(\theta_0 - \hat{\theta})^2,$$

because $\hat{\theta}$ maximizes the log-likelihood under the regularity conditions, which makes the first derivative $l^{\prime}(\hat{\theta}) = 0$.
So we have:

$$T_{LRT} = -2\left(l(\theta_0) - l(\hat{\theta})\right) \approx \left(\sqrt{n}(\hat{\theta} - \theta_0)\right)^2 \cdot \left(-\frac{l^{\prime\prime}(\hat{\theta})}{n}\right) \overset{d}{\to} \chi^2_1,$$

where the last convergence holds because of Slutsky's theorem together with the facts that, under $H_0$:

$$-\frac{l^{\prime\prime}(\hat{\theta})}{n} \overset{P}{\to} I(\theta_0) \qquad \text{and} \qquad \sqrt{n}(\hat{\theta} - \theta_0) \overset{d}{\to} N\left(0, \frac{1}{I(\theta_0)}\right),$$

so the product converges in distribution to the square of a standard normal, i.e. $\chi^2_1$.
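Finally, a quick Monte Carlo check of Wilks' phenomenon in the same Bernoulli setting (all simulation sizes are illustrative): the empirical upper quantile of $T_{LRT}$ under the null should be close to the corresponding $\chi^2_1$ quantile.

```python
import numpy as np
from scipy.stats import binom, chi2

rng = np.random.default_rng(3)
n, p0, reps = 200, 0.5, 20_000

# Simulate T_LRT = -2 log lambda under the null many times.
k = rng.binomial(n, p0, size=reps)
p_hat = k / n
T = -2 * (binom.logpmf(k, n, p0) - binom.logpmf(k, n, p_hat))

print("empirical 95% quantile of T_LRT:", np.quantile(T, 0.95))
print("chi2(1) 95% quantile:           ", chi2.ppf(0.95, df=1))   # ~3.84
```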
Reference
- Casella, G., & Berger, R. L. (2002). *Statistical Inference* (2nd ed.). Pacific Grove, CA: Duxbury/Thomson Learning.