[MathStat] 3. Large Sample Theory

This post is an overall summary of Chapter 5 of the textbook Statistical Inference by Casella and Berger.

Table of Contents

3.1. Convergence of Random Variables

  • Convergence of Sequences
  • Almost-sure convergence
  • Convergence in probability
  • Convergence in quadratic mean
  • Convergence in distribution
  • Relationships among various convergences
  • Slutsky’s Theorem

3.2. Central Limit Theorem

  • Classical CLT
  • Lyapunov CLT
  • Multivariate CLT

3.3. Delta Method

  • First-order delta method
  • Second-order delta method
  • Multivariate delta method

3.4. Supplements

  • Dominated Convergence Theorem
  • Continuous Mapping Theorem
  • Stochastic Order Notation

 


3. Large Sample Theory

In this section, we’ll review some of the fundamental concepts and techniques in asymptotic statistics.

Statistical inference often requires knowledge of the underlying distribution of a statistic, which, however, often cannot be derived in closed form.

To this end, one of the main goals of asymptotic statistics is to address this issue by studying how the statistic behaves as the sample size grows without bound (i.e. $n \to \infty$). Thus, we want to find the limiting distribution of a given statistic, which is often much easier to handle than the exact distribution while still being accurate enough to be useful in practical scenarios.

 


3.1 Convergence of Random Variables

Convergence of Sequences

As a building block, let’s briefly review the concept of convergence in deterministic real numbers.

We say that a sequence of real numbers $a_1, a_2, \dots $ converges to a fixed real number $a$, if for every positive number $\epsilon$ there exists a natural number $N(\epsilon)$ such that for all $n \geq N(\epsilon), \quad|a_n - a| < \epsilon$.

If this is the case, then we call $a$ the limit of the sequence and write $\underset{n \to \infty}{\text{lim}}\, a_n = a$.

Now we focus on how to extend this notion to “sequences” of random variables. Specifically, we will see the following modes of convergence:

  • almost sure convergence
  • convergence in probability
  • convergence in quadratic mean
  • convergence in distribution

 

Almost-sure convergence

<Definition>

We say that a sequence of random variables $X_1, X_2, \dots$ converges almost surely to a random variable $X$ if:

$$ \begin{aligned} &P\Big( \big\{\omega \in \Omega: \underset{n \to \infty}{\text{lim}}X_n(\omega) = X(\omega) \big\}\Big) = 1 \\[10pt] &\Leftrightarrow \text{for almost every } \omega, \text{ for every } \epsilon > 0 \text{ there exists } N(\epsilon, \omega) \text{ such that } \big| X_n(\omega) - X(\omega)\big| \leq \epsilon \;\text{ for all } n \geq N(\epsilon, \omega) \end{aligned} $$

Conceptually, this can be thought of as follows: there may be a set of “exceptional” outcomes on which $X_n$ fails to converge to $X$, but this exceptional set has probability zero.

 

Convergence in probability

A sequence of random variables $X_1, X_2, \dots$ converges in probability to a random variable $X$ if for every $\epsilon > 0$, we have that

$$ \begin{aligned} &\underset{n \to \infty}{\text{lim}} P\big(|X_n - X| \geq \epsilon \big) = 0 \\[8pt] &\Leftrightarrow X_n \overset{p}{\to} X \end{aligned} $$

Conceptually, convergence in probability implies that as $n$ gets large, the distribution of $X_n$ gets more peaked around the value of convergence. Hence the random variable converges in a “probabilistic” sense.

A famous example of convergence in probability is the weak law of large numbers. Suppose that $Y_1,Y_2,\dots$ are i.i.d. with $E[Y_i]=\mu$ and $Var(Y_i) = \sigma^2 < \infty$; then

$$ X_n = \frac{1}{n}\sum_{i=1}^nY_i \overset{p}{\rightarrow} \mu $$
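
As a quick numerical illustration, here is a minimal NumPy sketch; the Exponential(1) choice (so $\mu = 1$), the sample sizes, and the seed are all arbitrary:

```python
import numpy as np

# WLLN illustration: sample means of i.i.d. Exponential(1) draws (mu = 1)
# concentrate around mu as the sample size n grows.
rng = np.random.default_rng(0)
mu = 1.0

for n in [10, 100, 1_000, 10_000, 100_000]:
    x_bar = rng.exponential(scale=1.0, size=n).mean()
    print(f"n = {n:>6d}   |x_bar - mu| = {abs(x_bar - mu):.5f}")
```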

 

Convergence in quadratic mean

<Definition>

We say that a sequence converges to $X$ in quadratic mean if:

$$ \begin{aligned} &E\big[(X_n-X)^2\big] \to 0 \quad\text{ as } n\to\infty \\[8pt] &\Leftrightarrow X_n \overset{qm}{\to} X \end{aligned} $$

This convergence is often used to prove convergence in probability as it is a stronger condition.

 

Convergence in distribution

<Definition>

We say that a sequence converges to $X$ in distribution if:

$$ \underset{n \to \infty}{\text{lim}} F_{X_n}(t) = F_X(t), \quad \text{for all points }t \text{ where }F_X \text{ is continuous} $$

This is by far the weakest form of convergence.

A fundamental example of convergence in distribution is the Central Limit Theorem (CLT) and we will see this in a second.

 

Relationships among various convergences

  1. Almost-sure convergence implies convergence in probability, but the converse is not true.

  2. Convergence in quadratic mean implies convergence in probability (the converse is not true; a numerical check is sketched after this list).

  • Suppose that $X_n$ converges in quadratic mean to $X$. Then, by Markov's inequality,
$$ P\big(|X_n - X| \geq \epsilon\big) = P\big(|X_n-X|^2 \geq \epsilon^2\big) \leq \frac{E\big[(X_n-X)^2\big]}{\epsilon^2} \to 0 $$

  3. Convergence in probability implies convergence in distribution (the converse is not true).

  • Although the mathematical detail is omitted here, this statement can be proved with the idea of “trapping” the CDF of $X_n$ by the CDF of $X$ with an interval of length converging to 0.
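
As referenced in item 2 above, here is a minimal NumPy sketch of the Markov/Chebyshev bound, applied to the sample mean of Exponential(1) data (an arbitrary choice with $\mu = \sigma^2 = 1$, for which $E[(\bar X_n - \mu)^2] = \sigma^2/n$); all numerical settings are illustrative:

```python
import numpy as np

# Check of the Markov/Chebyshev bound from item 2 for the sample mean of
# Exponential(1) data: E[(X_bar_n - mu)^2] = sigma^2 / n, hence
# P(|X_bar_n - mu| >= eps) <= sigma^2 / (n * eps^2) -> 0.
rng = np.random.default_rng(0)
mu, sigma2, eps, reps = 1.0, 1.0, 0.1, 10_000

for n in [50, 200, 800]:
    means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    empirical = np.mean(np.abs(means - mu) >= eps)
    bound = sigma2 / (n * eps**2)
    print(f"n = {n:>3d}   empirical = {empirical:.4f}   bound = {bound:.4f}")
```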

 

Slutsky’s Theorem

<Definition>

$$ \begin{aligned} &\text{If }Y_n \overset{p}{\to} c \text{ and } X_n \overset{d}{\to} X, \quad (c\text{ is a constant)} \\[10pt] &\Rightarrow X_n + Y_n \overset{d}{\to} X + c \\[5pt] &\Rightarrow X_nY_n \overset{d}{\to} cX \end{aligned} $$

Note that in general, $X_n \overset{d}{\to} X$ and $Y_n \overset{d}{\to} Y$ do not guarantee that the sum and product also converge in distribution.
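
As a small illustration that does not rely on the CLT, consider the following NumPy sketch (all choices are arbitrary): for i.i.d. Uniform(0,1) variables, $X_n = n(1 - \max_i U_i)$ converges in distribution to Exp(1), while $Y_n = \bar U_n \overset{p}{\to} 1/2$, so Slutsky's theorem gives $X_n Y_n \overset{d}{\to} \tfrac{1}{2}\,\text{Exp}(1)$:

```python
import numpy as np

# Slutsky illustration: with U_1, ..., U_n i.i.d. Uniform(0, 1),
#   X_n = n * (1 - max_i U_i)  converges in distribution to Exp(1),
#   Y_n = mean_i U_i           converges in probability to 1/2,
# so X_n * Y_n converges in distribution to (1/2) * Exp(1),
# a limit with mean 0.5 and median ln(2)/2 ~ 0.347.
rng = np.random.default_rng(0)
n, reps = 1_000, 10_000

u = rng.uniform(size=(reps, n))
x_n = n * (1.0 - u.max(axis=1))
y_n = u.mean(axis=1)
prod = x_n * y_n

print("mean   (should be near 0.5)  :", prod.mean().round(3))
print("median (should be near 0.347):", np.median(prod).round(3))
```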

 


3.2 Central Limit Theorem

The central limit theorem is one of the most famous and important examples of convergence in distribution.

 

Classical CLT

<Definition>

Let $X_1, \dots, X_n$ be an i.i.d. sequence of random variables from an arbitrary probability distribution with finite mean $\mu$ and variance $\sigma^2$. Then for the sample mean $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$,

$$ \begin{aligned} &Z_n = \frac{\sqrt{n}(\bar X_n - \mu)}{\sigma} \overset{d}{\to} Z \sim N(0,1) \\[10pt] &\Leftrightarrow\bar X_n \overset{d}{\approx} N\Big(\mu, \frac{\sigma^2}{n}\Big) \end{aligned} $$

The variance $\sigma^2$ can be replaced by the sample variance $\hat \sigma_n^2$ and the CLT still holds, by Slutsky's theorem (CLT with estimated variance).

$$ \frac{\sqrt{n}(\bar X_n - \mu)}{\hat \sigma_n} \overset{d}{\to} N(0,1) $$
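
As a quick sanity check, here is a minimal NumPy sketch (Exponential(1) data with $\mu = 1$ is an arbitrary choice): the studentized mean is already close to $N(0,1)$ for moderate $n$, which is exactly the CLT combined with Slutsky's theorem.

```python
import numpy as np

# Studentized sample mean for i.i.d. Exponential(1) data (mu = 1):
# T_n = sqrt(n) * (X_bar_n - mu) / sigma_hat_n should be roughly N(0, 1).
rng = np.random.default_rng(0)
n, reps, mu = 200, 20_000, 1.0

x = rng.exponential(scale=1.0, size=(reps, n))
t = np.sqrt(n) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)

# The 5%, 50% and 95% empirical quantiles should be close to the
# standard normal quantiles (-1.645, 0, 1.645).
print(np.quantile(t, [0.05, 0.5, 0.95]).round(3))
```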

Note that the CLT is remarkably general, in that it applies to any distribution with a finite variance.

Suppose that we want to perform a statistical test for the unknown mean $\mu$. In order to find a confidence interval, we need to know the underlying distribution of the test statistic. Since the CLT guarantees a normal approximation of the sample mean (or an equivalent sum), an asymptotic confidence interval can be derived as follows:

$$ \begin{aligned} &P(\mu \in C) \geq 1-\alpha, \quad \big(C=[\hat \mu-t, \hat \mu +t] \text{ is the confidence interval, } \alpha\in(0,1)\big) \\[8pt] &\Rightarrow P\big(|\hat\mu - \mu| \leq t\big) = P\Big(\frac{\sqrt{n}|\hat\mu - \mu|}{\sigma} \leq \frac{\sqrt{n}t}{\sigma} \Big) \approx P\big(|Z| \leq \frac{\sqrt{n}t}{\sigma} \big), \quad \big(Z\sim N(0,1)\big) \\[8pt] &\Rightarrow \text{under the CDF of the standard normal, } \frac{\sqrt{n}t}{\sigma} = z_{\alpha/2} \Leftrightarrow t = \frac{\sigma z_{\alpha/2}}{\sqrt{n}} \\[8pt] &\therefore C = \Big[\hat\mu - \frac{\sigma z_{\alpha/2}}{\sqrt{n}}, \; \hat\mu + \frac{\sigma z_{\alpha/2}}{\sqrt{n}} \Big] \quad\Rightarrow\quad P(\mu \in C) \to 1-\alpha \end{aligned} $$
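
The NumPy sketch below checks the asymptotic coverage of this interval by simulation, using Exponential(1) data with true mean $\mu = 1$ as an arbitrary example; the empirical coverage should approach $1-\alpha = 0.95$ as $n$ grows.

```python
import numpy as np

# Empirical coverage of the asymptotic 95% confidence interval
# [mu_hat +/- z_{alpha/2} * sigma_hat / sqrt(n)] for Exponential(1) data
# with true mean mu = 1.  The coverage should approach 1 - alpha = 0.95.
rng = np.random.default_rng(0)
mu, reps = 1.0, 20_000
z = 1.959964  # standard normal quantile z_{alpha/2} for alpha = 0.05

for n in [20, 100, 500]:
    x = rng.exponential(scale=1.0, size=(reps, n))
    mu_hat = x.mean(axis=1)
    half = z * x.std(axis=1, ddof=1) / np.sqrt(n)
    coverage = np.mean((mu_hat - half <= mu) & (mu <= mu_hat + half))
    print(f"n = {n:>3d}   empirical coverage = {coverage:.3f}")
```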

 

<Proof>

The CLT can be proved by using the property of mgfs. Specifically, we are going to use the following facts:

$$ \begin{aligned} &1.\quad \text{If }X \text{ and }Y \text{ are independent with mgfs }M_X, M_Y, \text{ then } Z = X + Y \; \text{ has mgf }\; M_Z(t) = M_X(t)M_Y(t) \\[8pt] &2.\quad \text{If }X \text{ has mgf } M_X,\text{ then }\;Y= a + bX \;\text{ has mgf } M_Y(t) = \exp(at)M_X(bt) \\[8pt] &3.\quad \text{The derivatives of the mgf at 0 return the moments,} \;\text{ i.e. }\;M_X^{(r)}(0) = E[X^r] \\[8pt] &4.\quad \text{Let }X_1, \dots, X_n \text{ be a sequence of random variables with mgfs } M_{X_1}, \dots, M_{X_n}. \\ &\quad\;\; \text{ If } M_{X_n}(t) \to M_X(t) \text{ for all }t \text{ in an open interval around 0, then } X_n \overset{d}{\to} X \\[8pt] &5.\quad \text{If }Z \sim N(0,1), \text{ then }M_Z(t) = e^{t^2/2} \\[15pt] \end{aligned} $$ $$ \begin{aligned} &Z_n = \frac{\sqrt{n}(\bar X_n - \mu)}{\sigma} = \frac{1}{\sqrt{n}}\sum_{i=1}^nA_i, \quad \Big(A_i = \frac{X_i- \mu}{\sigma}\Big) \\[10pt] &\Rightarrow M_{Z_n}(t) = E\Big[\exp(tZ_n)\Big] = E\Big[\exp\Big(\frac{t}{\sqrt{n}}\sum_{i=1}^nA_i\Big)\Big] = \prod_{i=1}^n E\Big[\exp\big(\frac{t}{\sqrt{n}}A_i\big)\Big] = M_A\Big(\frac{t}{\sqrt{n}}\Big)^n\\[10pt] &\Rightarrow M_A\Big(\frac{t}{\sqrt{n}}\Big) \approx M_A(0) + \frac{t}{\sqrt{n}}M_A^\prime(0) + \frac{t^2}{2n}M_A^{\prime\prime}(0) = 1 + \frac{t^2}{2n} \quad (\because\text{Taylor expansion of }M_A, \text{ with } M_A^\prime(0) = E[A] = 0,\; M_A^{\prime\prime}(0) = E[A^2] = 1) \\[10pt] &\Leftrightarrow M_A\Big(\frac{t}{\sqrt{n}}\Big)^n \approx \Big(1 + \frac{t^2}{2n}\Big)^n \to e^{t^2/2} = M_Z(t) \quad \text{ as }n \to \infty \quad\Big(\because \underset{n \to \infty}{\text{lim}} \Big(1 + \frac{x}{n}\Big)^n = e^x\Big) \end{aligned} $$

Thus, we have shown that the mgf converges to that of the standard normal, which completes the proof.
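
For a concrete check of this argument, here is a small NumPy sketch; the Bernoulli($p = 0.3$) choice is arbitrary, picked only because its mgf has a simple closed form. We evaluate $M_A(t/\sqrt{n})^n$ directly and watch it approach $e^{t^2/2}$:

```python
import numpy as np

# Numerical check of the mgf argument for Bernoulli(p) data: with
# A = (X - p) / sigma, the quantity M_A(t / sqrt(n))^n should approach
# exp(t^2 / 2), the mgf of N(0, 1).
p = 0.3
sigma = np.sqrt(p * (1 - p))

def mgf_A(s):
    # mgf of the standardized variable A = (X - p) / sigma, X ~ Bernoulli(p)
    return p * np.exp(s * (1 - p) / sigma) + (1 - p) * np.exp(-s * p / sigma)

t = 1.0
for n in [10, 100, 1_000, 10_000]:
    print(f"n = {n:>5d}   M_A(t/sqrt(n))^n = {mgf_A(t / np.sqrt(n)) ** n:.5f}")

print(f"target exp(t^2/2)        = {np.exp(t ** 2 / 2):.5f}")
```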

 

Lyapunov CLT

Suppose that $X_1, X_2, \dots$ are independent, but not necessarily identically distributed.

A CLT for the standardized sum still holds, but it requires an extra condition to ensure that no single random variable (or small group of random variables) dominates the sum.

In this sense, the Lyapunov Central Limit Theorem guarantees a normal approximation in this more general setting.

<Definition>

Suppose $X_1, \dots, X_n$ are independent. Let:

$$ \begin{aligned} \mu_i &= \mathbb{E}\big[X_i\big] \\[5pt] \sigma^2_i &= Var(X_i) \\[5pt] s_n^2 &= \sum_{i=1}^n \sigma^2_i \end{aligned} $$

Suppose that the following Lyapunov condition is satisfied:

$$ \underset{n \to \infty}{\text{lim}} \frac{1}{s_n^{2+\delta}}\sum_{i=1}^n \mathbb{E}\big[|X_i-\mu_i|^{2+\delta}\big] = 0, \quad \text{for some } \delta >0 $$

Then, it holds that:

$$ \frac{1}{s_n}\sum_{i=1}^n \big(X_i - \mu_i\big) \overset{d}{\to} N(0,1) $$

Intuitively, the danger with non-identically distributed random variables is that a single variable may dominate the sum, in which case we are not really averaging diverse information across the random variables. The Lyapunov condition is added precisely to rule out such extreme cases.
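
Below is a minimal NumPy sketch of this setting, using the arbitrary choice $X_i \sim \text{Exponential}(\text{scale}=i)$, so that $\mu_i = i$ and $\sigma_i^2 = i^2$ (a case where the Lyapunov condition holds); the standardized sum should look approximately standard normal.

```python
import numpy as np

# Lyapunov CLT illustration with independent, non-identically distributed
# variables: X_i ~ Exponential(scale = i), so mu_i = i and sigma_i^2 = i^2.
# The standardized sum (1/s_n) * sum_i (X_i - mu_i) should look N(0, 1).
rng = np.random.default_rng(0)
n, reps = 500, 10_000

scales = np.arange(1, n + 1, dtype=float)   # scale_i = i
s_n = np.sqrt(np.sum(scales ** 2))          # s_n^2 = sum_i sigma_i^2

x = rng.exponential(scale=scales, size=(reps, n))
z = (x - scales).sum(axis=1) / s_n

print("mean (should be near 0):", z.mean().round(3))
print("std  (should be near 1):", z.std().round(3))
print("5%/95% quantiles (near +/-1.645):", np.quantile(z, [0.05, 0.95]).round(3))
```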

 

Multivariate CLT

As its name explicitly suggests, the multivariate central limit theorem is an extension of the classical CLT to random vectors.

<Definition>

If $X_1, \dots, X_n$ are i.i.d. random vectors with mean vector $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$, then for the sample mean vector $\hat\mu = \frac{1}{n}\sum_{i=1}^n X_i$,

$$ \sqrt{n}\big(\hat\mu - \mu\big) \overset{d}{\to} N(0,\Sigma) $$

Although the dimension $d$ can be sufficiently larger than 1, it is taken to be fixed as $n \to \infty$.

For the cases where $d$ is also allowed to grow with the sample size $n$, the corresponding high-dimensional CLTs are considerably more delicate and are an active topic of research.
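
Here is a small NumPy sketch in $d = 2$, using the arbitrary construction $X = (U, U+V)$ for i.i.d. Exponential(1) variables $U, V$, so that $\mu = (1, 2)$ and $\Sigma = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}$; the covariance of $\sqrt{n}(\hat\mu - \mu)$ across replications should be close to $\Sigma$.

```python
import numpy as np

# Multivariate CLT illustration in d = 2: with U, V i.i.d. Exponential(1)
# and X = (U, U + V), we have mu = (1, 2) and Sigma = [[1, 1], [1, 2]].
# The covariance of sqrt(n) * (mu_hat - mu) should be close to Sigma.
rng = np.random.default_rng(0)
n, reps = 500, 10_000
mu = np.array([1.0, 2.0])

u = rng.exponential(size=(reps, n))
v = rng.exponential(size=(reps, n))
mu_hat = np.stack([u.mean(axis=1), (u + v).mean(axis=1)], axis=-1)   # (reps, 2)

z = np.sqrt(n) * (mu_hat - mu)
print(np.cov(z, rowvar=False).round(2))    # should be roughly [[1, 1], [1, 2]]
```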

 


3.3 Delta Method

Building on top of the CLT, a natural question is what happens to a smooth function $g$ of an asymptotically normal sequence, i.e. what the limiting distribution of $g(X_n)$ is when $\sqrt{n}(X_n - \mu) \overset{d}{\to} N(0, \sigma^2)$.

 

First-order delta method

By combining a first-order Taylor expansion of $g$ around $\mu$ with Slutsky's theorem and the continuous mapping theorem (see section 3.4 of this post), we can derive the following (first-order) delta method: if $\sqrt{n}(X_n - \mu) \overset{d}{\to} N(0, \sigma^2)$, then

$$ \begin{aligned} &\frac{\sqrt{n}\big(g(X_n) - g(\mu)\big)}{\sigma} \overset{d}{\to} N(0,\big[g^\prime(\mu)\big]^2), \\[10pt] &(g\text{ is continuously differentiable such that } g^\prime(\mu) \neq 0) \end{aligned} $$
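
A quick NumPy check, under the arbitrary choices $X_i \sim \text{Exponential}(\text{scale}=2)$ (so $\mu = \sigma = 2$) and $g(x) = x^2$ (so $g^\prime(\mu) = 4$): the standardized quantity should have standard deviation close to $|g^\prime(\mu)|$.

```python
import numpy as np

# First-order delta method check: X_i ~ Exponential(scale = 2), so mu = 2,
# sigma = 2, and g(x) = x^2 with g'(mu) = 4.  The quantity
# sqrt(n) * (g(X_bar_n) - g(mu)) / sigma should be roughly N(0, g'(mu)^2).
rng = np.random.default_rng(0)
n, reps = 1_000, 10_000
mu, sigma = 2.0, 2.0
g = lambda x: x ** 2
g_prime_mu = 2 * mu                          # = 4

x_bar = rng.exponential(scale=2.0, size=(reps, n)).mean(axis=1)
w = np.sqrt(n) * (g(x_bar) - g(mu)) / sigma

print("std (should be near |g'(mu)| = 4):", w.std().round(2))
```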

 

Second-order delta method

However, there might be some instances where $g^\prime(\mu) = 0$, so that the first-order term vanishes. For these cases, the (second-order) delta method applies, with the faster rate $n$ in place of $\sqrt{n}$:

$$ \begin{aligned} &n\big(g(X_n) - g(\mu)\big) \overset{d}{\to} \sigma^2 \frac{g^{\prime\prime}(\mu)}{2}\chi^2_1, \\[10pt] &(g\text{ is twice continuously differentiable such that } g^\prime(\mu)=0 \text{ and }g^{\prime\prime}(\mu) \neq 0) \end{aligned} $$
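
A quick NumPy check, under the arbitrary choices $X_i \sim \text{Uniform}(-1, 1)$ (so $\mu = 0$, $\sigma^2 = 1/3$) and $g(x) = x^2$ (so $g^\prime(0) = 0$, $g^{\prime\prime}(0) = 2$): here $n(g(\bar X_n) - g(\mu))$ should behave like $\tfrac{1}{3}\chi^2_1$.

```python
import numpy as np

# Second-order delta method check: X_i ~ Uniform(-1, 1), so mu = 0 and
# sigma^2 = 1/3, with g(x) = x^2 (g'(0) = 0, g''(0) = 2).  Then
# n * (g(X_bar_n) - g(mu)) should behave like (1/3) * chi^2_1 (mean 1/3).
rng = np.random.default_rng(0)
n, reps = 1_000, 10_000

x_bar = rng.uniform(-1.0, 1.0, size=(reps, n)).mean(axis=1)
w = n * x_bar ** 2

reference = rng.chisquare(df=1, size=reps) / 3.0
print("mean (should be near 1/3):", w.mean().round(3))
print("95% quantile (w vs. ref) :", np.quantile(w, 0.95).round(3),
      np.quantile(reference, 0.95).round(3))
```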

 

Multivariate delta method

Suppose we have i.i.d. random vectors $X_1, \dots, X_n \in \mathbb{R}^d$ with mean $\mu$ and covariance matrix $\Sigma$, and $g: \mathbb{R}^d \to \mathbb{R}$ is a continuously differentiable function with $\nabla g(\mu) \neq 0$.

Then,

$$ \begin{aligned} &\sqrt{n}\big(g(\hat\mu) - g(\mu) \big) \overset{d}{\to} N(0, \tau^2) \\[10pt] &\text{where}\quad \hat\mu = \frac{1}{n}\sum_{i=1}^n X_i, \\[10pt] &\text{and } \tau^2 = \nabla g(\mu)^T\,\Sigma\,\nabla g(\mu), \quad \nabla g(\mu) = \begin{pmatrix} \frac{\partial g(x)}{\partial x_1} \\ \vdots \\ \frac{\partial g(x)}{\partial x_d} \end{pmatrix}_{x=\mu} \end{aligned} $$
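
Here is a NumPy sketch for the ratio of means, $g(x, y) = x/y$, reusing the arbitrary $X = (U, U+V)$ construction from the multivariate CLT example above (so $\mu = (1,2)$, $\Sigma = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}$, $\nabla g(\mu) = (1/2, -1/4)$, and hence $\tau^2 = 0.125$):

```python
import numpy as np

# Multivariate delta method check for g(x, y) = x / y, with the same
# X = (U, U + V) construction as above: mu = (1, 2), Sigma = [[1, 1], [1, 2]],
# grad g(mu) = (1/2, -1/4), hence tau^2 = grad^T Sigma grad = 0.125.
rng = np.random.default_rng(0)
n, reps = 1_000, 10_000
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 1.0], [1.0, 2.0]])
grad = np.array([1.0 / mu[1], -mu[0] / mu[1] ** 2])
tau2 = grad @ Sigma @ grad                   # = 0.125

u = rng.exponential(size=(reps, n))
v = rng.exponential(size=(reps, n))
ratio_hat = u.mean(axis=1) / (u + v).mean(axis=1)

w = np.sqrt(n) * (ratio_hat - mu[0] / mu[1])
print("tau^2 =", tau2)
print("empirical variance (should be near tau^2):", w.var().round(3))
```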

 


3.4 Supplements

Dominated Convergence Theorem

If $\{f_n : \mathbb{R} \to \mathbb{R} \}$ is a sequence of measurable functions which converge pointwise almost everywhere to $f$, and if there exists an integrable function $g$ such that $|f_n(x)| \leq |g(x)|$ for all $n$ and for all $x$, then $f$ is integrable and

$$ \int_{-\infty}^{\infty} \underset{n \to \infty}{\text{lim}} f_n(x)dx = \underset{n \to \infty}{\text{lim}} \int_{-\infty}^{\infty} f_n(x)dx $$

 

Continuous Mapping Theorem

If a sequence of random variables $X_1, X_2, \dots$ converges in probability to $X$, then for any continuous function $h$:

$$ h(X_n) \overset{p}{\to} h(X) $$

(The same statement holds when convergence in probability is replaced by almost-sure convergence or convergence in distribution.)

A direct consequence of this theorem in terms of convergence in probability is:

$$ \begin{aligned} &X_n \overset{p}{\to}X,\; Y_n \overset{p}{\to}Y \\[8pt] &\Rightarrow X_n + Y_n \overset{p}{\to} X+Y \\[8pt] &\Rightarrow X_nY_n \overset{p}{\to} XY \end{aligned} $$

 

Stochastic Order Notation

The stochastic order notation is defined as follows:

$$ \begin{aligned} &1.\quad X_n = o_P(1) \quad\Leftrightarrow\quad P\big(|X_n| \geq \epsilon\big) \to 0 \;\text{ for every } \epsilon > 0 \quad(\text{i.e. } X_n \overset{p}{\to} 0) \\[15pt] &2.\quad X_n = O_P(1) \quad\Leftrightarrow\quad \text{for every } \epsilon > 0 \text{ there exists } C(\epsilon) \text{ such that } P\big(|X_n|\geq C(\epsilon)\big) \leq \epsilon \text{ for all large } n \quad(\text{i.e. } X_n \text{ is bounded in probability}) \end{aligned} $$

More generally, $X_n = o_P(a_n)$ and $X_n = O_P(a_n)$ mean that $X_n / a_n = o_P(1)$ and $X_n / a_n = O_P(1)$, respectively.

Typical use cases of the stochastic order notation are the WLLN and the CLT:

$$ \begin{aligned} &\hat \mu - \mu = o_P(1), \quad \text{(WLLN)}\\[15pt] &\hat \mu - \mu = O_P\Big(\frac{1}{\sqrt{n}}\Big), \quad \text{(CLT)} \end{aligned} $$
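
A small NumPy illustration of these two statements with Exponential(1) data (an arbitrary choice with $\mu = 1$): the raw error $\hat\mu - \mu$ shrinks to $0$, while the rescaled error $\sqrt{n}(\hat\mu - \mu)$ has quantiles that stabilize as $n$ grows.

```python
import numpy as np

# mu_hat - mu = O_P(1/sqrt(n)): the raw error shrinks to 0 (WLLN), while
# the rescaled error sqrt(n) * (mu_hat - mu) stays stochastically bounded,
# with quantiles that stabilize as n grows (CLT).
rng = np.random.default_rng(0)
reps, mu = 2_000, 1.0

for n in [100, 1_000, 10_000]:
    err = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1) - mu
    print(f"n = {n:>5d}   95% of |err| = {np.quantile(np.abs(err), 0.95):.4f}"
          f"   95% of sqrt(n)*|err| = {np.quantile(np.sqrt(n) * np.abs(err), 0.95):.4f}")
```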

 


Reference

  • Casella, G., & Berger, R. L. (2002). Statistical inference. 2nd ed. Australia ; Pacific Grove, CA: Thomson Learning.
