Whether you calculate the result of a poll, a model’s error rate or a system’s average response time: It might not be enough to calculate the respective statistic, you also might need to know the probability of this estimate being badly wrong.
Probability inequalities give a guaranteed ceiling on that chance, using only partial knowledge of the distribution. In a way, every such bound makes a deal:
The more you can assume about your data, the tighter the guarantee you get in return.
Setup
Think of flipping a biased coin.
$$X \sim \text{Bernoulli}(\mu)$$Each flip lands heads with probability $\mu$. After $n$ flips, the fraction of heads $\bar{X}_n$ estimates $\mu$. What is the probability that the estimate strays far from true parameter? Formally, this means bounding
$$P(|\bar{X}_n - \mu| \ge t)$$and for this example, we fix
$$n = 30, \mu = 0.3$$Since $S_n = n\bar{X}_n \sim \text{Binomial}(n, \mu)$, the exact tail serves as ground truth for this example:
$$P(|\bar{X}_n - \mu| \ge t)$$$$= P(\bar{X}_n - \mu \ge t) + P(\bar{X}_n - \mu \le -t)$$Substituting $\bar{X}_n = S_n / n$ and multiplying by $n$:
$$= P(S_n \ge n(\mu + t)) + P(S_n \le n(\mu - t))$$# for illustration only!
# floating-point arithmetic will cause issues in some cases
prob_exact <- function(t, n, mu) (1 - pbinom(n * (mu + t), n, mu)) +
pbinom(n * (mu - t), n, mu)
Since $\bar{X}_n = S_n/n$ only takes values in $\{0, 1/n, 2/n, \ldots, 1\}$, the probability $P(|\bar{X}_n - \mu| \ge t)$ only changes when $t$ crosses a multiple of $1/n$, yielding a staircase pattern:
Markov’s Inequality
Markov’s Inequality follows from the observation that a non-negative variable cannot frequently exceed a level much larger than its mean.
$$P(\bar{X}_n \ge a) \le \frac{E[\bar{X}_n]}{a}$$Its only assumption is that $\bar{X}_n$ is non-negative with a known mean. Note that this is a one-sided bound: it only controls the right tail. As $E[\bar{X}_n] = \mu$, substituting $a = \mu + t$ gives
$$P(\bar{X}_n \ge \mu + t) \le \frac{\mu}{\mu + t}$$ineq_markov <- function(t, mu) mu / (mu + t)
As this bound only uses information about the mean of the distribution, it is very conservative and decays only hyperbolically. Additionally, the bound has $\mu$ in the numerator, which means it grows even looser as $\mu$ increases beyond $0.3$.
Chebyshev’s Inequality
Chebyshev’s Inequality is derived by applying Markov’s inequality to the squared deviation $(\bar{X}_n - \mu)^2$.
$$P(|\bar{X}_n - \mu| \ge t) \le \frac{\mu(1-\mu)}{nt^2}$$This bound requires a known variance and mean.
ineq_cheby <- function(t, n, mu) mu * (1 - mu) / (n * t^2)
By using information about the variance of $\bar{X}_n$, this inequality shrinks polynomially in $t$ and is generally tighter than Markov, but remains conservative for large deviations. In our case, since the bound is proportional to $\mu(1-\mu)$, it peaks at $\mu=0.5$ and tightens toward the extremes. In contrast to Markov, it improves with $n$.
Hoeffding’s Inequality
Hoeffding’s Inequality exploits the fact that each $X_i \in [0, 1]$ is bounded.
$$P(|\bar{X}_n - \mu| \ge t) \le 2e^{-2nt^2}$$The exponential decay in $t$ makes this far tighter than Chebyshev for large deviations. Unlike Chebyshev, it requires no knowledge of the variance.
ineq_hoeff <- function(t, n) 2 * exp(-2 * n * t^2)
Notably, the bound contains no $\mu$ and consequently applies equally, regardless of where the distribution is centered. It also improves with $n$.
Berry-Esseen Bound
The Berry-Esseen theorem quantifies how close the CLT approximation is to the true distribution. It turns the CLT into a valid upper bound by adding a correction term that limits the normal approximation error.
$$P(|\bar{X}_n - \mu| \ge t) \le 2\left(1 - \Phi\left(\frac{t\sqrt{n}}{\sigma}\right)\right) + \frac{2C\rho}{\sigma^3\sqrt{n}}$$where $\Phi$ is the standard normal CDF, $\sigma = \sqrt{\mu(1-\mu)}$ is the standard deviation of a single draw, $\rho = E[|X - \mu|^3]$ is the third absolute moment, and $C = 0.4748$.
For Bernoulli$(\mu)$, $\rho = \mu(1-\mu)[\mu^2 + (1-\mu)^2]$. With $\mu = 0.3$ and $n = 30$ the correction term is approximately $0.22$.
bound_be <- function(t, n, mu) {
sigma <- sqrt(mu * (1 - mu))
rho <- mu * (1 - mu) * (mu^2 + (1 - mu)^2)
eps <- 0.4748 * rho / (sigma^3 * sqrt(n))
2 * (1 - pnorm(t * sqrt(n) / sigma)) + 2 * eps
}
One might expect the CLT-based bound to be the tightest, but the correction term acts as a floor. However for large $n$ and deviations of order $1/\sqrt{n}$, this term becomes negligible and the bound converges to the CLT approximation, the tightest any bound of this form can achieve at that deviation scale.