[MathStat] 2. Statistical Inequalities

This post is an overall summary of Chapters 3 and 4 of the textbook Statistical Inference by Casella and Berger.

Table of Contents

2.1. Inequalities for expectations

  • Jensen’s Inequality
  • Hölder’s Inequality
  • Minkowski’s Inequality
  • Association Inequality

2.2. Inequalities for variances

  • Efron-Stein Inequality

2.3. Inequalities for probabilities

  • Markov’s Inequality
  • Chebyshev’s Inequality
  • Hoeffding’s Inequality
  • Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality

 


2. Statistical Inequalities

One of the main goals in statistics and related fields is to make inferences about unknown parameters. For this purpose, we define an estimator that can serve as a reasonable surrogate for the true parameter.

However, the problem is that there are many possible choices of estimators. So we often consider a risk function between the true parameter $\theta$ and its estimator $\hat \theta$ to evaluate the estimator's performance.

Nevertheless, a problem still remains: computing a risk function exactly is rarely simple. Thus, we make use of various inequalities to upper bound the risk function by a quantity that is easier to manipulate, and to study how fast the risk function goes to zero as the sample size increases.

 


2.1 Inequalities for expectations

Jensen’s Inequality

Let $g$ be a convex function, i.e.,

$$ \lambda g(x) + (1-\lambda)g(y) \geq g(\lambda x + (1-\lambda)y) $$

for all $\lambda \in (0,1)$ and $x,y \in \mathbb{R}$. Then, provided that both expectations exist,

$$ E\big[ g(X)\big] \geq g\big(E[X] \big) $$
  • An application of Jensen’s Inequality is the relationship between different means.
$$ \begin{aligned} &\text{Arithmetic Mean (AM)} = \frac{\sum_{i=1}^n x_i}{n} \\[5pt] &\text{Geometric Mean (GM)} = \Big(\prod_{i=1}^n x_i\Big)^{1/n} \\[5pt] &\text{Harmonic Mean (HM)} = \frac{1}{\frac{1}{n}\sum_{i=1}^n\frac{1}{x_i}} \end{aligned} $$ $$ \Rightarrow \text{HM} \leq \text{GM} \leq \text{AM} $$
  • Another famous application is the nonnegativity of the Kullback-Leibler divergence.
$$ D_{KL}(P\,\|\,Q) = \sum_x p(x)\log\Big(\frac{p(x)}{q(x)}\Big) = E\Big[ \log\Big(\frac{p(X)}{q(X)}\Big)\Big] = E\Big[-\log\Big(\frac{q(X)}{p(X)}\Big)\Big] \geq -\log E\Big[\frac{q(X)}{p(X)}\Big] = -\log 1 = 0 $$

where the inequality is Jensen's Inequality applied to the convex function $-\log$, and the expectation is taken with $X \sim P$.
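
To see these relations in action, here is a minimal numerical sanity check using NumPy; the Gamma samples, the convex function $g(x) = e^x$, and the discrete distributions $p, q$ below are arbitrary choices for illustration, not from the textbook.

```python
# Monte Carlo sanity check of Jensen's inequality, the HM <= GM <= AM ordering,
# and the nonnegativity of KL divergence. Distribution choices are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.0, size=100_000)   # positive samples

# Jensen: for the convex function g(x) = exp(x), E[g(X)] >= g(E[X]).
print(np.mean(np.exp(x)) >= np.exp(np.mean(x)))      # True

# Means: HM <= GM <= AM for positive numbers.
am = np.mean(x)
gm = np.exp(np.mean(np.log(x)))
hm = 1.0 / np.mean(1.0 / x)
print(hm <= gm <= am)                                # True

# KL divergence between two discrete distributions is nonnegative.
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
print(np.sum(p * np.log(p / q)) >= 0)                # True
```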

 

Hölder’s Inequality

For $p, q \in (1,\infty)$ with $\frac{1}{p} + \frac{1}{q} = 1$,

$$ \big|E[XY]\big| \leq E\big[|XY|\big] \leq \big(E\big[|X|^p\big]\big)^{1/p} \big(E\big[|Y|^q\big]\big)^{1/q} $$
  • A special case of Hölder’s Inequality when $p=q=2$ is the Cauchy-Schwarz inequality such that:
$$ \big|E[XY]\big| \leq E\big[|XY|\big] \leq \sqrt{E\big[X^2\big] E\big[Y^2\big]} $$
  • By applying the Cauchy-Schwarz inequality, we have the covariance inequality such that:
$$ Cov(X, Y) = E\big[(X-\mu_X)(Y-\mu_Y)\big] \leq \big\{E\big[(X-\mu_X)^2\big] \big\}^{1/2} \big\{E\big[(Y-\mu_Y)^2\big] \big\}^{1/2} $$ $$ \Leftrightarrow \{Cov(X, Y)\}^2 \leq Var(X)Var(Y) $$
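
As a quick numerical illustration of the Cauchy-Schwarz and covariance inequalities, consider the following sketch; the correlated-normal setup is an arbitrary choice, not from the textbook.

```python
# Numerical check of the Cauchy-Schwarz and covariance inequalities
# on simulated (correlated) normal data.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(size=n)        # correlated with x

# |E[XY]| <= sqrt(E[X^2] E[Y^2])
lhs = abs(np.mean(x * y))
rhs = np.sqrt(np.mean(x**2) * np.mean(y**2))
print(lhs <= rhs)                        # True

# Cov(X, Y)^2 <= Var(X) Var(Y)
cov = np.mean((x - x.mean()) * (y - y.mean()))
print(cov**2 <= np.var(x) * np.var(y))   # True
```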

 

Minkowski’s Inequality

For two random variables $X$ and $Y$ and a constant $1 < p < \infty$,

$$ \big(E[|X+Y|^p] \big)^{1/p} \leq \big(E[|X|^p]\big)^{1/p} + \big(E[|Y|^p]\big)^{1/p} $$ $$ \Leftrightarrow \;||X+Y||_p\; \leq \;||X||_p \;+\; ||Y||_p $$

which is the triangle inequality for the $L^p$ norm $||X||_p = \big(E[|X|^p]\big)^{1/p}$.
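
A quick numerical check of Minkowski's Inequality; the choice $p = 3$ and the two distributions below are arbitrary illustrations.

```python
# Check the triangle inequality for the empirical L^p norm of random variables.
import numpy as np

rng = np.random.default_rng(2)
p = 3
x = rng.normal(size=100_000)
y = rng.exponential(size=100_000)

def lp_norm(z, p):
    """Empirical L^p norm: (E[|Z|^p])^(1/p)."""
    return np.mean(np.abs(z) ** p) ** (1 / p)

print(lp_norm(x + y, p) <= lp_norm(x, p) + lp_norm(y, p))   # True
```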

 

Association Inequality

Suppose we have a random variable $X$ and functions $f, g: \mathbb{R} \to \mathbb{R}$. Assuming that all expectations are well-defined, we have:

$$ \begin{aligned} &1.\quad f,g \text{ non-decreasing implies } E\big[f(X)g(X)\big] \geq E\big[f(X)\big]E\big[g(X)\big] \\[10pt] &2.\quad f,g \text{ non-increasing implies } E\big[f(X)g(X)\big] \geq E\big[f(X)\big]E\big[g(X)\big] \\[10pt] &3.\quad f \text{ non-decreasing and }g \text{ non-increasing implies } E\big[f(X)g(X)\big] \leq E\big[f(X)\big]E\big[g(X)\big] \\[10pt] \end{aligned} $$
  • By a direct application of association inequality, we have:
$$ E\big[X^4\big] \geq E\big[X\big]E\big[X^3\big] $$ $$ E\big[Xe^{-X}\big] \leq E\big[X\big]E\big[e^{-X}\big] $$ $$ E\big[X\mathbb{1}(X\geq a)\big] \geq E\big[X\big] P\big(X \geq a\big) $$
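
These three examples can be checked with a quick Monte Carlo simulation; the exponential distribution for $X$ and the threshold $a = 1$ below are arbitrary illustration choices.

```python
# Monte Carlo illustration of the association inequalities.
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=1_000_000)

# f, g both non-decreasing: E[X^4] >= E[X] E[X^3]
print(np.mean(x**4) >= np.mean(x) * np.mean(x**3))                  # True

# f non-decreasing, g non-increasing: E[X e^{-X}] <= E[X] E[e^{-X}]
print(np.mean(x * np.exp(-x)) <= np.mean(x) * np.mean(np.exp(-x)))  # True

# f(x) = x and g(x) = 1{x >= a} both non-decreasing, with a = 1:
a = 1.0
print(np.mean(x * (x >= a)) >= np.mean(x) * np.mean(x >= a))        # True
```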

 


2.2 Inequalities for variances

Efron-Stein Inequality

Suppose that $X_1, \dots, X_n, X_1^\prime, \dots, X_n^\prime$ are independent with $X_i$ and $X_i^\prime$ having the same distribution for all $i \in \{1,\dots,n\}$. Let $X = (X_1, \dots, X_n)$ and $X^{(i)} = (X_1, \dots, X_{i-1}, X_i^\prime, X_{i+1}, \dots, X_n)$. Then, for any function $f$ such that the expectations below exist,

$$ Var\big[f(X)\big] \leq \frac{1}{2}\sum_{i=1}^nE\big[\{f(X) - f(X^{(i)})\}^2\big] $$
  • As an example, if $X_1, \dots, X_n$ are i.i.d., we can apply the Efron-Stein Inequality to the sample mean:
$$ \begin{aligned} &\text{For }f(X) = \frac{1}{n}\sum_{i=1}^n X_i, \\[5pt] &\Rightarrow Var\big(f(X)\big) \leq \sum_{i=1}^n \frac{1}{2n^2}E\big[(X_i - X_i^\prime)^2\big] = \frac{Var(X_1)}{n} \end{aligned} $$

using $E\big[(X_i - X_i^\prime)^2\big] = 2Var(X_i)$. Since $Var(\bar X) = Var(X_1)/n$ exactly, the Efron-Stein bound is tight for the sample mean.
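
A small simulation can confirm this for the sample mean; the exponential distribution, the sample size $n = 20$, and the number of replications below are arbitrary illustration choices.

```python
# Simulation check of the Efron-Stein bound for f(X) = sample mean.
import numpy as np

rng = np.random.default_rng(4)
n, n_rep = 20, 50_000

X  = rng.exponential(size=(n_rep, n))   # X_1, ..., X_n for each replication
Xp = rng.exponential(size=(n_rep, n))   # independent copies X_1', ..., X_n'

f = X.mean(axis=1)                      # f(X) for each replication

# Right-hand side: (1/2) * sum_i E[(f(X) - f(X^{(i)}))^2].
# Replacing X_i by X_i' changes the sample mean by (X_i' - X_i) / n.
rhs = 0.5 * np.sum(np.mean(((Xp - X) / n) ** 2, axis=0))

# For the mean the bound is tight, so both sides agree up to Monte Carlo error:
# each is approximately Var(X_1) / n = 1 / 20 = 0.05 here.
print(np.var(f), rhs)
print(np.isclose(np.var(f), rhs, rtol=0.05))   # True
```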

 


2.3 Inequalities for probabilities

Markov’s Inequality

For any $t > 0$ and integrable nonnegative random variable $X$, we have:

$$ \begin{aligned} P(X \geq t) \leq \frac{E[X]}{t} \end{aligned} $$ $$ \Leftrightarrow E[X] \geq t P(X \geq t) $$

Markov’s Inequality is typically the weakest of these bounds, and yet it cannot be improved in general unless we put some restrictions on the shape of the distribution. For example, if we assume that the density is non-increasing on $[0, \infty)$ (i.e. $f_X^\prime(x) \leq 0$ for all $x \geq 0$), we have:

$$ P(X \geq t) \leq \frac{E[X]}{2t} $$
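
As a quick numerical illustration, the Exp(1) distribution has the non-increasing density $e^{-x}$ on $[0, \infty)$, so its tail probabilities should fall below both $E[X]/(2t)$ and $E[X]/t$; the distribution and the values of $t$ below are arbitrary choices.

```python
# Compare the tail P(X >= t) of Exp(1) (E[X] = 1) with Markov's bound E[X]/t
# and the sharper bound E[X]/(2t) for non-increasing densities.
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=1.0, size=1_000_000)

for t in [1.0, 2.0, 4.0]:
    tail = np.mean(x >= t)                    # approximately e^{-t}
    print(t, tail, 1.0 / (2 * t), 1.0 / t)    # tail <= E[X]/(2t) <= E[X]/t
```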

 

Chebyshev’s Inequality

Let $X$ be a random variable with finite mean $\mu$ and finite variance $\sigma^2 > 0$. Then,

$$ P\big(|X-\mu|\geq t\big) \leq \frac{Var(X)}{t^2} $$
  • The proof is simple: since $|X-\mu| \geq t$ if and only if $(X-\mu)^2 \geq t^2$, we can apply Markov’s Inequality to the nonnegative random variable $(X-\mu)^2$.

 

Hoeffding’s Inequality

Before digging into Hoeffding’s Inequality, let’s recall Hoeffding’s Lemma: for a random variable $X$ with $E[X] = 0$ and $a \leq X \leq b$ almost surely, and for any $t \in \mathbb{R}$,

$$ E\big[\exp(tX)\big] \leq \exp\Big(\frac{1}{8}t^2(b-a)^2\Big) $$

Based on this lemma, suppose $X_1, \dots, X_n$ are independent bounded random variables with $X_i \in [a_i, b_i]$ almost surely. Then, for any $t > 0$, we have

$$ \begin{aligned} P\Big( \Big|\sum_{i=1}^n(X_i - E[X_i])\Big| \geq t \Big) \leq 2\exp\Big(\frac{-2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\Big) \end{aligned} $$

Note that Chebyshev’s bound on the tail probability decays only polynomially in $t$, while Hoeffding’s bound decays exponentially. Equivalently, the deviation guaranteed at confidence level $1-\delta$ grows polynomially in $1/\delta$ under Chebyshev but only logarithmically in $1/\delta$ under Hoeffding. To see this, suppose we have i.i.d. Rademacher random variables (taking values $\pm 1$ with probability $1/2$ each, so that $E[X_i] = 0$ and $Var(X_i) = 1$), and let’s apply both inequalities to the sample mean of $X_1, \dots, X_n$. Then we have:

$$ \begin{aligned} &\text{Chebyshev:} \\[8pt] & P\big(|\bar X - E[\bar X]| \geq t\big) \leq \frac{1}{nt^2} \\ &\Leftrightarrow P\Big(|\bar X - E[\bar X]| \geq \sqrt{\frac{1}{n\delta}} \Big) \leq \delta, \quad \Big(\delta = \frac{1}{nt^2}\Big) \\[30pt] &\text{Hoeffding:} \\[8pt] & P\big(|\bar X - E[\bar X]| \geq t\big) \leq 2\exp\Big(-\frac{nt^2}{2}\Big) \\ &\Leftrightarrow P\Big(|\bar X - E[\bar X]| \geq \sqrt{\frac{2}{n}\log\Big(\frac{2}{\delta}\Big)}\;\Big) \leq \delta, \quad \Big(\delta = 2\exp\Big(-\frac{nt^2}{2}\Big)\Big) \end{aligned} $$

For instance, suppose we set $\delta = 0.001$. Then with probability at least 0.999, the sample mean deviates from its expectation by at most:

$$ \begin{aligned} &\text{Chebyshev:} \\[8pt] &\sqrt{\frac{1}{n \times 0.001}} \approx \frac{31.6}{\sqrt{n}} \\[30pt] &\text{Hoeffding:} \\[8pt] &\sqrt{\frac{2}{n}\log\Big(\frac{2}{0.001}\Big)} \approx \frac{3.9}{\sqrt{n}} \end{aligned} $$

Now it is clear that Hoeffding’s Inequality is a huge improvement over the bound of Chebyshev’s Inequality.
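
The same comparison can be reproduced numerically. The sketch below (with the arbitrary choices $n = 1000$ and $10{,}000$ replications) computes both bounds and the empirically observed $(1-\delta)$-quantile of $|\bar X|$ for Rademacher samples; the empirical deviation sits below the Hoeffding bound, which in turn is far below the Chebyshev bound.

```python
# Chebyshev vs. Hoeffding deviation bounds for the sample mean of n i.i.d.
# Rademacher variables, compared with the empirical (1 - delta)-quantile.
import numpy as np

rng = np.random.default_rng(6)
n, n_rep, delta = 1_000, 10_000, 0.001

chebyshev = np.sqrt(1.0 / (n * delta))               # ~ 31.6 / sqrt(n) = 1.00
hoeffding = np.sqrt(2.0 / n * np.log(2.0 / delta))   # ~ 3.9  / sqrt(n) = 0.12

samples = rng.choice([-1.0, 1.0], size=(n_rep, n))
empirical = np.quantile(np.abs(samples.mean(axis=1)), 1 - delta)

print(f"Chebyshev bound    : {chebyshev:.4f}")
print(f"Hoeffding bound    : {hoeffding:.4f}")
print(f"Empirical deviation: {empirical:.4f}")       # roughly 0.10 here
```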

 

Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality

The DKW inequality is useful when we want to upper bound the deviation of the empirical CDF $F_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}(X_i \leq x)$ from the true CDF $F$.

Specifically, there exists a finite positive constant $C$ (in fact $C = 2$ works, by Massart’s theorem) such that

$$ Pr\Big(\underset{x\in\mathbb{R}}{\text{sup}} \big|F_n(x) - F(x)\big| \geq t \Big) \leq Ce^{-2nt^2}, \quad (\text{for all } t \geq 0) $$
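
A common use of this bound is to build a confidence band for the CDF: setting $2e^{-2n\varepsilon^2} = \delta$ gives the band half-width $\varepsilon = \sqrt{\log(2/\delta)/(2n)}$. Below is a minimal sketch with NumPy, using a Uniform(0,1) sample as an arbitrary choice so that the true CDF is simply $F(x) = x$.

```python
# DKW confidence band: with probability at least 1 - delta, the true CDF lies
# within eps = sqrt(log(2/delta) / (2n)) of the empirical CDF everywhere.
import numpy as np

rng = np.random.default_rng(7)
n, delta = 500, 0.05

x = np.sort(rng.uniform(size=n))                 # sample with known CDF F(x) = x
eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))   # band half-width, ~0.061

F_n = np.arange(1, n + 1) / n                    # empirical CDF at the sorted points

print(eps)
print(np.all(np.abs(F_n - x) <= eps))            # True with probability >= 0.95
```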

 


Reference

  • Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Pacific Grove, CA: Thomson Learning.
