A Glimpse at Probability Inequalities

Whether you calculate the result of a poll, a model’s error rate or a system’s average response time: It might not be enough to calculate the respective statistic, you also might need to know the probability of this estimate being badly wrong.

Probability inequalities give a guaranteed ceiling on that chance, using only partial knowledge of the distribution. In a way, every such bound makes a deal:

The more you can assume about your data, the tighter the guarantee you get in return.

Setup

Think of flipping a biased coin.

$$X \sim \text{Bernoulli}(\mu)$$

Each flip lands heads with probability $\mu$. After $n$ flips, the fraction of heads $\bar{X}_n$ estimates $\mu$. What is the probability that the estimate strays far from true parameter? Formally, this means bounding

$$P(|\bar{X}_n - \mu| \ge t)$$

and for this example, we fix

$$n = 30, \mu = 0.3$$

Since $S_n = n\bar{X}_n \sim \text{Binomial}(n, \mu)$, the exact tail serves as ground truth for this example:

$$P(|\bar{X}_n - \mu| \ge t)$$

$$= P(\bar{X}_n - \mu \ge t) + P(\bar{X}_n - \mu \le -t)$$

Substituting $\bar{X}_n = S_n / n$ and multiplying by $n$:

$$= P(S_n \ge n(\mu + t)) + P(S_n \le n(\mu - t))$$

# for illustration only!
# floating-point arithmetic will cause issues in some cases
prob_exact <- function(t, n, mu) (1 - pbinom(n * (mu + t), n, mu)) +
                                       pbinom(n * (mu - t), n, mu)

Since $\bar{X}_n = S_n/n$ only takes values in $\{0, 1/n, 2/n, \ldots, 1\}$, the probability $P(|\bar{X}_n - \mu| \ge t)$ only changes when $t$ crosses a multiple of $1/n$, yielding a staircase pattern:

Markov’s Inequality

Markov’s Inequality follows from the observation that a non-negative variable cannot frequently exceed a level much larger than its mean.

$$P(\bar{X}_n \ge a) \le \frac{E[\bar{X}_n]}{a}$$

Its only assumption is that $\bar{X}_n$ is non-negative with a known mean. Note that this is a one-sided bound: it only controls the right tail. As $E[\bar{X}_n] = \mu$, substituting $a = \mu + t$ gives

$$P(\bar{X}_n \ge \mu + t) \le \frac{\mu}{\mu + t}$$

ineq_markov <- function(t, mu) mu / (mu + t)

As this bound only uses information about the mean of the distribution, it is very conservative and decays only hyperbolically. Additionally, the bound has $\mu$ in the numerator, which means it grows even looser as $\mu$ increases beyond $0.3$.

Chebyshev’s Inequality

Chebyshev’s Inequality is derived by applying Markov’s inequality to the squared deviation $(\bar{X}_n - \mu)^2$.

$$P(|\bar{X}_n - \mu| \ge t) \le \frac{\mu(1-\mu)}{nt^2}$$

This bound requires a known variance and mean.

ineq_cheby <- function(t, n, mu) mu * (1 - mu) / (n * t^2)

By using information about the variance of $\bar{X}_n$, this inequality shrinks polynomially in $t$ and is generally tighter than Markov, but remains conservative for large deviations. In our case, since the bound is proportional to $\mu(1-\mu)$, it peaks at $\mu=0.5$ and tightens toward the extremes. In contrast to Markov, it improves with $n$.

Hoeffding’s Inequality

Hoeffding’s Inequality exploits the fact that each $X_i \in [0, 1]$ is bounded.

$$P(|\bar{X}_n - \mu| \ge t) \le 2e^{-2nt^2}$$

The exponential decay in $t$ makes this far tighter than Chebyshev for large deviations. Unlike Chebyshev, it requires no knowledge of the variance.

ineq_hoeff <- function(t, n) 2 * exp(-2 * n * t^2)

Notably, the bound contains no $\mu$ and consequently applies equally, regardless of where the distribution is centered. It also improves with $n$.

Berry-Esseen Bound

The Berry-Esseen theorem quantifies how close the CLT approximation is to the true distribution. It turns the CLT into a valid upper bound by adding a correction term that limits the normal approximation error.

$$P(|\bar{X}_n - \mu| \ge t) \le 2\left(1 - \Phi\left(\frac{t\sqrt{n}}{\sigma}\right)\right) + \frac{2C\rho}{\sigma^3\sqrt{n}}$$

where $\Phi$ is the standard normal CDF, $\sigma = \sqrt{\mu(1-\mu)}$ is the standard deviation of a single draw, $\rho = E[|X - \mu|^3]$ is the third absolute moment, and $C = 0.4748$.

For Bernoulli$(\mu)$, $\rho = \mu(1-\mu)[\mu^2 + (1-\mu)^2]$. With $\mu = 0.3$ and $n = 30$ the correction term is approximately $0.22$.

bound_be <- function(t, n, mu) {
  sigma <- sqrt(mu * (1 - mu))
  rho   <- mu * (1 - mu) * (mu^2 + (1 - mu)^2)
  eps   <- 0.4748 * rho / (sigma^3 * sqrt(n))
  2 * (1 - pnorm(t * sqrt(n) / sigma)) + 2 * eps
}

One might expect the CLT-based bound to be the tightest, but the correction term acts as a floor. However for large $n$ and deviations of order $1/\sqrt{n}$, this term becomes negligible and the bound converges to the CLT approximation, the tightest any bound of this form can achieve at that deviation scale.

Setup#

Markov’s Inequality#

Chebyshev’s Inequality#

Hoeffding’s Inequality#

Berry-Esseen Bound#

Setup

Markov’s Inequality

Chebyshev’s Inequality

Hoeffding’s Inequality

Berry-Esseen Bound