### Theorem 10.20. Markov’s Inequality.

Let \(X\) be a random variable in a probability space \((S,P)\text{.}\) Then for every \(k>0\text{,}\)

\begin{equation*}
P(|X|\ge k) \le E(|X|)/k.
\end{equation*}

Consider the following two situations:

- Situation 1. A small town decides to hold a lottery to raise funds for charitable purposes. A total of \(10,001\) tickets are sold, and the tickets are labeled with numbers from the set \(\{0,1,2,\dots,10,000\}\text{.}\) At a public ceremony, duplicate tickets are placed in a big box, and the mayor draws the winning ticket from out of the box. Just to heighten the suspense as to who has actually won the prize, the mayor reports that the winning number is at least \(7,500\text{.}\) The citizens ooh and aah and they can’t wait to see who among them will be the final winner.
- Situation 2. Behind a curtain, a fair coin is tossed \(10,000\) times, and the number of heads is recorded by an observer, who is reputed to be honest and impartial. Again, the outcome is an integer in the set \(\{0,1,2,\dots,10,000\}\text{.}\) The observer then emerges from behind the curtain and announces that the number of heads is at least than \(7,500\text{.}\) There is a pause and then someone says “What? Are you out of your mind?”

So we have two probability spaces, both with sample space \(S=\{0,1,2,\dots,10,000\}\text{.}\) For each, we have a random variable \(X\text{,}\) the winning ticket number in the first situation, and the number of heads in the second. In each case, the expected value, \(E(X)\text{,}\) of the random variable \(X\) is \(5,000\text{.}\) In the first case, we are not all that surprised at an outcome far from the expected value, while in the second, it seems intuitively clear that this is an extraordinary occurrence. The mathematical concept here is referred to as central tendency, and it helps us to understand just how likely a random variable is to stray from its expected value.

For starters, we have the following elementary result.

Let \(X\) be a random variable in a probability space \((S,P)\text{.}\) Then for every \(k>0\text{,}\)

\begin{equation*}
P(|X|\ge k) \le E(|X|)/k.
\end{equation*}

Of course, the inequality holds trivially unless \(k> E(|X|)\text{.}\) For \(k\) in this range, we establish the equivalent inequality: \(k P(|X|\ge k)\le E(|X|)\text{.}\)

\begin{align*}
k P(|X|\ge k) \amp = \sum_{r\ge k} k P(|X|=r)\\
\amp \le \sum_{r \ge k} r P(|X|=r)\\
\amp \le \sum_{r> 0} r P(|X|=r)\\
\amp = E(|X|).
\end{align*}

To make Markov’s inequality more concrete, we see that on the basis of this trivial result, the probability that either the winning lottery ticket or the number of heads is at least \(7,500\) is at most \(5000/7500=2/3\text{.}\) So nothing alarming here in either case. Since we still feel that the two cases are quite different, a more subtle measure will be required.

Again, let \((S,P)\) be a probability space and let \(X\) be a random variable. The quantity \(E((X-E(X))^2)\) is called the variance of \(X\) and is denoted \(\var(X)\text{.}\) Evidently, the variance of \(X\) is a non-negative number. The standard deviation of \(X\text{,}\) denoted \(\sigma_X\) is then defined as the quantity \(\sqrt{\var(x)}\text{,}\) i.e., \(\sigma_X^2 =\var(X)\text{.}\)

For the spinner shown at the beginning of the chapter, let \(X(i)=i^2\) when the pointer stops in region \(i\text{.}\) Then we have already noted that the expectation \(E(X)\) of the random variable \(X\) is \(109/8\text{.}\) It follows that the variance \(\var(X)\) is:

\begin{align*}
\var(X) =\amp(1^2-\frac{109}{8})^2\frac{1}{8}+(2^2-\frac{109}{8})^2\frac{1}{4}+
(3^2-\frac{109}{8})^2\frac{1}{8}+(4^2-\frac{109}{8})^2\frac{1}{8}\\
\amp+
(5^2-\frac{109}{8})^2\frac{3}{8}\\
=\amp (108^2+105^2+100^2+93^2+84^2)/512\\
=\amp 48394/512
\end{align*}

It follows that the standard deviation \(\sigma_X\) of \(X\) is then \(\sqrt{48394/512}\approx 9.722\text{.}\)

Suppose that \(0\lt p\lt 1\) and consider a series of \(n\) Bernoulli trials with the probability of success being \(p\text{,}\) and let \(X\) count the number of successes. We have already noted that \(E(X)=np\text{.}\) Now we claim the variance of \(X\) is given by:

\begin{equation*}
\var(X)=\sum_{i=0}^n (i-np)^2\binom{n}{i}p^i(1-p)^{n-i} = np(1-p)
\end{equation*}

There are several ways to establish this claim. One way is to proceed directly from the definition, using the same method we used previously to obtain the expectation. But now you need also to calculate the second derivative. Here is a second approach, one that capitalizes on the fact that separate trials in a Bernoulli series are independent.

Let \(\cgF=\{X_1,X_2,\dots,X_n\}\) be a family of random variables in a probability space \((S,P)\text{.}\) We say the family \(\cgF\) is independent if for each \(i\) and \(j\) with \(1\le i\lt j\le n\text{,}\) and for each pair \(a,b\) of real numbers with \(0\le a,b\le 1\text{,}\) the following two events are independent: \(\{x\in S: X_i(x)\le a\}\) and \(\{x\in S:X_j(x)\le b\}\text{.}\) When the family is independent, it is straightforward to verify that

\begin{equation*}
\var(X_1+X_2+\dots+X_n)=\var(X_1)+\var(X_2)+\dots+\var(X_n).
\end{equation*}

With the aid of this observation, the calculation of the variance of the random variable \(X\) which counts the number of successes becomes a trivial calculation. But in fact, the entire treatment we have outlined here is just a small part of a more complex subject which can be treated more elegantly and ultimately much more compactly—provided you first develop additional background material on families of random variables. For this we will refer you to suitable probability and statistics texts, such as those given in our references.

Let \(X\) be a random variable in a probability space \((S,P)\text{.}\) Then \(\var(X)= E(X^2)-E^2(X)\text{.}\)

Let \(E(X)=\mu\text{.}\) From its definition, we note that

\begin{align*}
\var(X) \amp = \sum_{r} (r -\mu)^2\prob(X=r)\\
\amp = \sum_{r} (r^2 -2r \mu+\mu^2)\prob(X=r)\\
\amp = \sum_r r^2\prob(X=r) - 2 \mu\sum_r r\prob(X=r) +\mu^2\sum_r\prob(X=r)\\
\amp = E(X^2) -2\mu^2+\mu^2\\
\amp = E(X^2) - \mu^2\\
\amp = E(X^2) - E^2(X).
\end{align*}

Variance (and standard deviation) are quite useful tools in discussions of just how likely a random variable is to be near its expected value. This is reflected in the following theorem.

Let \(X\) be a random variable in a probability space \((S,P)\text{,}\) and let \(k>0\) be a positive real number. If the expectation \(E(X)\) of \(X\) is \(\mu\) and the standard deviation is \(\sigma_X\text{,}\) then

\begin{equation*}
\prob(|X- E(X)|\le k\sigma_X)\ge 1-\frac{1}{k^2}.
\end{equation*}

Let \(A=\{r\in \reals:|r-\mu|>k\sigma_X\}\text{.}\)

Then we have:

\begin{align*}
\var(X) \amp = E((X-\mu)^2)\\
\amp = \sum_{r\in \reals}(r-\mu)^2\prob(X=r)\\
\amp \ge \sum_{r\in A}(r-\mu)^2\prob(X=r)\\
\amp \ge k^2\sigma_X^2\sum_{r\in A}\prob(X=r)\\
\amp \ge k^2\sigma_X^2\prob(|X-\mu|>k\sigma_X).
\end{align*}

Since \(\var(X)=\sigma_X^2\text{,}\) we may now deduce that \(1/k^2\geq \prob(|X-\mu|)>k\sigma_X)\text{.}\) Therefore, since \(\prob(|X-\mu|\le k\sigma_X)=1-\prob(|X-\mu|> k\sigma_X)\text{,}\) we conclude that

\begin{equation*}
\prob(|X- \mu|\le k\sigma_X)\ge 1-\frac{1}{k^2}.
\end{equation*}

Here’s an example of how Chebyshev’s Inequality can be applied. Consider \(n\) tosses of a fair coin with \(X\) counting the number of heads. As noted before, \(\mu=E(X)=n/2\) and \(\var(X)=n/4\text{,}\) so \(\sigma_X=\sqrt{n}/2\text{.}\) When \(n=10,000\) and \(\mu=5,000\) and \(\sigma_X=50\text{.}\) Setting \(k=50\) so that \(k\sigma_X=2500\text{,}\) we see that the probability that \(X\) is within \(2500\) of the expected value of \(5000\) is at least \(0.9996\text{.}\) So it seems very unlikely indeed that the number of heads is at least \(7,500\text{.}\)

Going back to lottery tickets, if we make the rational assumption that all ticket numbers are equally likely, then the probability that the winning number is at least \(7,500\) is exactly \(2501/100001\text{,}\) which is very close to \(1/4\text{.}\)

In the case of Bernoulli trials, we can use basic properties of binomial coefficients to make even more accurate estimates. Clearly, in the case of coin tossing, the probability that the number of heads in \(10,000\) tosses is at least \(7,500\) is given by

\begin{equation*}
\sum_{i = 7,500}^{10,000} \binom{10,000}{i}/2^{10,000}
\end{equation*}

Now a computer algebra system can make this calculation exactly, and you are encouraged to check it out just to see how truly small this quantity actually is.