We have seen in the previous note that if we toss a coin \(n\) times with bias \(p\), then the expected number of heads is \(np\). What this means is that if we repeat the experiment multiple times, where in each experiment we toss the coin \(n\) times, then on average we get \(np\) heads. But in any single experiment, the number of heads observed can be any value between \(0\) and \(n\). What can we say about how far off we are from the expected value? That is, what is the typical deviation of the number of heads from \(np\)?

Random Walk

Let us consider a simpler setting that is equivalent to tossing a fair coin \(n\) times, but is more amenable to analysis. Suppose we have a particle that starts at position \(0\) and performs a random walk. At each time step, the particle moves either one step to the right or one step to the left with equal probability, and the move at each time step is independent of all other moves. We think of these random moves as taking place according to whether a fair coin comes up heads or tails. The expected position of the particle after \(n\) moves is back at \(0\), but how far from \(0\) should we typically expect the particle to end up at?

Denoting a right-move by \(+1\) and a left-move by \(-1\), we can describe the probability space here as the set of all sequences of length \(n\) over the alphabet \(\{\pm 1\}\), each having equal probability \(2^{-n}\). Let the r.v. \(X\) denote the position of the particle (relative to our starting point \(0\)) after \(n\) moves. Thus, we can write \[ X=X_1+X_2+\cdots +X_n,\](1) where \(X_i=+1\) if the \(i\)-th move is to the right and \(X_i = -1\) otherwise.

Now obviously we have \({\mathbb{E}}[X]=0\). The easiest way to see this is to note that \({\mathbb{E}}[X_i]=(1/2)\times 1+(1/2)\times(-1)=0\), so by linearity of expectation \({\mathbb{E}}[X]=\sum_{i=1}^n {\mathbb{E}}[X_i]=0\). But of course this is not very informative, and is due to the fact that positive and negative deviations from \(0\) cancel out.

What we are really asking is: What is the expected value of \(|X|\), the distance of the particle from \(0\)? Rather than consider the r.v. \(|X|\), which is a little difficult to work with due to the absolute value operator, we will instead look at the r.v. \(X^2\). Notice that this also has the effect of making all deviations from \(0\) positive, so it should also give a good measure of the distance from \(0\). However, because it is the squared distance, we will need to take a square root at the end.

We will now show that the expected square distance after \(n\) steps is equal to \(n\):

Proposition 1

For the random variable \(X\) defined in Equation 1, we have \({\mathbb{E}}[X^2] = n\).

Proof. We use the expression Equation 1 and expand the square: \[\begin{aligned} {\mathbb{E}}[X^2]&={\mathbb{E}}[(X_1+X_2+\cdots+X_n)^2]\cr &={\mathbb{E}}\!\!\left[\sum_{i=1}^n X_i^2 + \sum_{i\ne j} X_iX_j\right]\cr &=\sum_{i=1}^n {\mathbb{E}}[X_i^2] + \sum_{i\ne j}{\mathbb{E}}[X_iX_j]\end{aligned}\] In the last line we have used linearity of expectation. To proceed, we need to compute \({\mathbb{E}}[X_i^2]\) and \({\mathbb{E}}[X_iX_j]\) (for \(i\ne j\)). Let’s consider first \(X_i^2\). Since \(X_i\) can take on only values \(\pm 1\), clearly \(X_i^2=1\) always, so \({\mathbb{E}}[X_i^2]=1\). What about \({\mathbb{E}}[X_iX_j]\)? Well, \(X_iX_j=+1\) when \(X_i=X_j=+1\) or \(X_i=X_j=-1\), and otherwise \(X_iX_j=-1\). Therefore, \[\begin{aligned}{\mathbb{P}}[X_i X_j = 1] &= {\mathbb{P}}[(X_i=X_j=+1) \vee (X_i=X_j=-1)] \cr &= {\mathbb{P}}[X_i=X_j=+1] \:+\: {\mathbb{P}}[X_i=X_j=-1] \cr &= {\mathbb{P}}[X_i = +1] \times {\mathbb{P}}[X_j = +1] \:+\: {\mathbb{P}}[X_i = -1] \times {\mathbb{P}}[X_j = -1] \cr &= \frac{1}{4}+\frac{1}{4} = \frac{1}{2},\end{aligned}\] where in the above calculation we used the fact that the events \(X_i=+1\) and \(X_j=+1\) are independent, and similarly the events \(X_i=-1\) and \(X_j=-1\) are independent. Thus \({\mathbb{P}}[X_iX_j=-1] = 1/2\) as well, and hence \({\mathbb{E}}[X_iX_j]=0\).

Plugging these values into the above equation gives \[{\mathbb{E}}[X^2] = \sum_{i=1}^n 1 + \sum_{i\ne j} 0 = n,\] as claimed. \(\square\)

So we see that our expected squared distance from \(0\) is \(n\). One interpretation of this is that we might expect to be a distance of about \(\sqrt{n}\) away from \(0\) after \(n\) steps. However, we have to be careful here: we cannot simply argue that \({\mathbb{E}}[|X|]=\sqrt{{\mathbb{E}}[X^2]}=\sqrt{n}\). (Why not?) We will see later in the lecture how to make precise deductions about \(|X|\) from knowledge of \({\mathbb{E}}[X^2]\).

For the moment, however, let’s agree to view \({\mathbb{E}}[X^2]\) as an intuitive measure of “spread” of the r.v. \(X\). In fact, for a more general r.v. with expectation \({\mathbb{E}}[X]=\mu\), what we are really interested in is \({\mathbb{E}}[(X-\mu)^2]\), the expected squared distance from the mean. In our random walk example, we had \(\mu=0\), so \({\mathbb{E}}[(X-\mu)^2]\) just reduces to \({\mathbb{E}}[X^2]\).

Definition 1 (Variance)

For a r.v. \(X\) with expectation \({\mathbb{E}}[X]=\mu\), the variance of \(X\) is defined to be \[{\operatorname{var}}(X)={\mathbb{E}}[(X-\mu)^2].\] The square root \(\sigma(X) :=\sqrt{{\operatorname{var}}(X)}\) is called the standard deviation of \(X\).

The point of the standard deviation is merely to “undo” the squaring in the variance. Thus the standard deviation is “on the same scale as” the r.v. itself. Since the variance and standard deviation differ just by a square, it really doesn’t matter which one we choose to work with as we can always compute one from the other immediately. We shall usually use the variance. For the random walk example above, we have that \({\operatorname{var}}(X)=n\), and the standard deviation of \(X\), \(\sigma(X)\), is \(\sqrt{n}\).

The following easy observation gives us a slightly different way to compute the variance that is simpler in many cases.

Theorem 1

For a r.v. \(X\) with expectation \({\mathbb{E}}[X]=\mu\), we have \({\operatorname{var}}(X)={\mathbb{E}}[X^2]-\mu^2\).

Proof. From the definition of variance, we have \[{\operatorname{var}}(X)={\mathbb{E}}[(X-\mu)^2]={\mathbb{E}}[X^2-2\mu X+\mu^2] ={\mathbb{E}}[X^2]-2\mu{\mathbb{E}}[X]+\mu^2 ={\mathbb{E}}[X^2]-\mu^2.\] In the third step above, we used linearity of expectation. Moreover, note that \(\mu = {\mathbb{E}}[X]\) is a constant, so \({\mathbb{E}}[\mu X] = \mu {\mathbb{E}}[X] = \mu^2\) and \({\mathbb{E}}[\mu^2] = \mu^2\). \(\square\)

Another important property that will come in handy is the following: For any random variable \(X\) and constant \(c\), we have \[{\operatorname{var}}(cX)=c^2{\operatorname{var}}(X).\] The proof is simple and left as an exercise.


Let’s see some examples of variance calculations.

  1. Fair die. Let \(X\) be the score on the roll of a single fair die. Recall from the previous note that \({\mathbb{E}}[X]=7/2\). So we just need to compute \({\mathbb{E}}[X^2]\), which is a routine calculation: \[{\mathbb{E}}[X^2] = {1\over 6}\left(1^2+2^2+3^2+4^2+5^2+6^2\right)={{91}\over 6}.\] Thus from Theorem 1, \[{\operatorname{var}}(X)={\mathbb{E}}[X^2]-({\mathbb{E}}[X])^2 = {{91}\over 6} - {{49}\over 4} = {{35}\over{12}}.\]

  2. Uniform distribution. More generally, if \(X\) is a uniform random variable on the set \(\{1,\dots,n\}\), so \(X\) takes on values \(1, \ldots, n\) with equal probability \(1/n\), then the mean, variance and standard deviation of \(X\) are given by: \[\begin{aligned} {\mathbb{E}}[X] = \frac{n+1}{2}, \qquad {\operatorname{var}}(X) = \frac{n^2-1}{12}, \qquad \sigma(X) = \sqrt{\frac{n^2-1}{12}}.\end{aligned}\](2) You should verify these as an exercise.

  3. Biased coin. Let \(X\) the the number of Heads in \(n\) tosses of a biased coin with Heads probability \(p\) (i.e., \(X\) has the binomial distribution with parameters \(n,p\)). We already know that \({\mathbb{E}}[X]=np\). Writing as usual \(X=X_1+X_2+\cdots +X_n\), where \(X_i=\begin{cases} 1&\text{if $i$th toss is Head;}\\ 0&\text{otherwise}\end{cases}\) we have \[\begin{aligned}{\mathbb{E}}[X^2]&={\mathbb{E}}[(X_1+X_2+\cdots+X_n)^2]\cr &=\sum_{i=1}^n {\mathbb{E}}[X_i^2] + \sum_{i\ne j}{\mathbb{E}}[X_iX_j]\cr &=(n\times p) + (n(n-1)\times p^2)\cr &=n^2p^2 + np(1-p).\end{aligned}\] In the third line here, we have used the facts that \({\mathbb{E}}[X_i^2]=p\), and that

    \[{\mathbb{E}}[X_iX_j]= {\mathbb{P}}[X_i=X_j = 1]= {\mathbb{P}}[X_i = 1] \cdot {\mathbb{P}}[X_j = 1] = p^2,\]

    (since \(X_i=1\) and \(X_j=1\) are independent events). Note that there are \(n(n-1)\) pairs \(i,j\) with \(i\ne j\).

    Finally, we get that \({\operatorname{var}}(X)={\mathbb{E}}[X^2]-({\mathbb{E}}[X])^2=np(1-p)\). Notice that in fact \({\operatorname{var}}(X)=\sum_i{\operatorname{var}}(X_i)\), and the same was true in the random walk example. This is in fact no coincidence. We will explore for what kinds of random variables this is true later in the next lecture.

    As an example, for a fair coin the expected number of Heads in \(n\) tosses is \(n/2\), and the standard deviation is \(\sqrt{n}/2\). Note that since the maximum number of Heads is \(n\), the standard deviation is much less than this maximum number for large \(n\). This is in contrast to the previous example of the uniformly distributed random variable, where the standard deviation \[\sigma(X) = \sqrt{(n^2-1)/12} \approx n/\sqrt{12}\] is of the same order as the largest value \(n\). In this sense, the spread of a binomially distributed r.v. is much smaller than that of a uniformly distributed r.v.

  4. Poisson distribution. What is the variance of a Poisson r.v. \(X\)? \[{\mathbb{E}}[X^2] = \sum_{i=0}^\infty i^2 e^{-\lambda}{{\lambda^i}\over{i!}} = \lambda\sum_{i=1}^\infty i e^{-\lambda}{{\lambda^{i-1}}\over{(i-1)!}} = \lambda\left(\sum_{i=1}^\infty (i-1) e^{-\lambda}{{\lambda^{i-1}}\over{(i-1)!}} + \sum_{i=1}^\infty e^{-\lambda}{{\lambda^{i-1}}\over{(i-1)!}}\right) = \lambda(\lambda + 1).\] [Check you follow each of these steps. In the last step, we have noted that the two sums are respectively \({\mathbb{E}}[X]\) and \(\sum_i{\mathbb{P}}[X=i] = 1\).]

    Finally, we get \({\operatorname{var}}(X)={\mathbb{E}}[X^2]-({\mathbb{E}}[X])^2 = \lambda\). So, for a Poisson random variable, the expectation and variance are equal.

  5. Number of fixed points. Let \(X\) be the number of fixed points in a random permutation of \(n\) items (i.e., the number of students in a class of size \(n\) who receive their own homework after shuffling). We saw in the previous note that \({\mathbb{E}}[X]=1\), regardless of \(n\). To compute \({\mathbb{E}}[X^2]\), write \(X=X_1+X_2+\cdots+X_n\), where \(X_i=1\) if \(i\) is a fixed point, and \(X_i = 0\) otherwise. Then as usual we have \[ {\mathbb{E}}[X^2] = \sum_{i=1}^n {\mathbb{E}}[X_i^2] + \sum_{i\ne j}{\mathbb{E}}[X_iX_j].\](3) Since \(X_i\) is an indicator r.v., we have that \({\mathbb{E}}[X_i^2]={\mathbb{P}}[X_i=1]=1/n\). Since both \(X_i\) and \(X_j\) are indicators, we can compute \({\mathbb{E}}[X_iX_j]\) as follows: \[{\mathbb{E}}[X_iX_j] = {\mathbb{P}}[X_iX_j = 1] = {\mathbb{P}}[X_i=1\wedge X_j=1] = {\mathbb{P}}[\text{both $i$ and $j$ are fixed points}] = {1\over{n(n-1)}}.\] Make sure that you understand the last step here. Plugging this into Equation 3 we get \[{\mathbb{E}}[X^2] = \sum_{i=1}^n \frac{1}{n} + \sum_{i\ne j}\frac{1}{n(n-1)} = \left(n\times {1\over n}\right) + \left(n(n-1)\times{1\over{n(n-1)}}\right) = 1+1 = 2.\] Thus \({\operatorname{var}}(X)={\mathbb{E}}[X^2]-({\mathbb{E}}[X])^2 = 2-1 = 1\). That is, the variance and the mean are both equal to \(1\). Like the mean, the variance is also independent of \(n\). Intuitively at least, this means that it is unlikely that there will be more than a small number of fixed points even when the number of items, \(n\), is very large.

Independent Random Variables

Independence for random variables is defined in analogous fashion to independence for events:

Definition 2

Random variables \(X\) and \(Y\) on the same probability space are said to be independent if the events \(X=a\) and \(Y=b\) are independent for all values \(a,b\). Equivalently, the joint distribution of independent r.v.’s decomposes as \[{\mathbb{P}}[X=a,Y=b] = {\mathbb{P}}[X=a]{\mathbb{P}}[Y=b] \quad \forall a,b.\]

Mutual independence of more than two r.v.’s is defined similarly. A very important example of independent r.v.’s is indicator r.v.’s for independent events. Thus, for example, if \(\{X_i\}\) are indicator r.v.’s for the \(i\)-th toss of a coin being Heads, then the \(X_i\) are mutually independent r.v.’s.

One of the most important and useful facts about variance is if a random variable \(X\) is the sum of independent random variables \(X = X_1 + \cdots + X_n\), then its variance is the sum of the variances of the individual r.v.’s. In particular, if the individual r.v.’s \(X_i\) are identically distributed (i.e., they have the same distribution), then \({\operatorname{var}}(X) = \sum_i {\operatorname{var}}(X_i) = n \cdot {\operatorname{var}}(X_1)\). This means that the standard deviation is \(\sigma(X) = \sqrt{n} \cdot \sigma(X_1)\). Note that by contrast, the expected value is \({\mathbb{E}}[X] = n \cdot {\mathbb{E}}[X_1]\). Intuitively this means that whereas the average value of \(X\) grows proportionally to \(n\), the spread of the distribution grows proportionally to \(\sqrt{n}\), which is much smaller than \(n\). In other words the distribution of \(X\) tends to concentrate around its mean.

Let us now formalize these ideas. First, we have the following result which states that the expected value of the product of two independent random variables is equal to the product of their expected values.

Theorem 2

For independent random variables \(X,Y\), we have \({\mathbb{E}}[XY]={\mathbb{E}}[X]{\mathbb{E}}[Y]\).

Proof. We have \[\begin{aligned} {\mathbb{E}}[XY] & = & \sum_a\sum_b ab\times{\mathbb{P}}[X=a, Y=b] \\ &=& \sum_a\sum_b ab\times{\mathbb{P}}[X=a]\times{\mathbb{P}}[Y=b] \\ &=& \left(\sum_a a\times{\mathbb{P}}[X=a]\right)\times \left(\sum_b b\times{\mathbb{P}}[Y=b]\right)\\ &=& {\mathbb{E}}[X]\times{\mathbb{E}}[Y],\end{aligned}\] as required. In the second line here we made crucial use of independence. \(\square\)

We now use the above theorem to conclude the nice property that the variance of the sum of independent random variables is equal to the sum of their variances.

Theorem 3

For independent random variables \(X,Y\), we have \[{\operatorname{var}}(X+Y) = {\operatorname{var}}(X) + {\operatorname{var}}(Y).\]

Proof. From the alternative formula for variance in Theorem 1, we have, using linearity of expectation extensively, \[\begin{aligned} {\operatorname{var}}(X+Y) &=& {\mathbb{E}}[(X+Y)^2] - {\mathbb{E}}[X+Y]^2\\ &=& {\mathbb{E}}[X^2]+{\mathbb{E}}[Y^2]+2{\mathbb{E}}[XY] - ({\mathbb{E}}[X]+{\mathbb{E}}[Y])^2\\ &=& ({\mathbb{E}}[X^2]-{\mathbb{E}}[X]^2) + ({\mathbb{E}}[Y^2]-{\mathbb{E}}[Y]^2) + 2({\mathbb{E}}[XY]-{\mathbb{E}}[X]{\mathbb{E}}[Y])\\ &=& {\operatorname{var}}(X) + {\operatorname{var}}(Y) + 2({\mathbb{E}}[XY]-{\mathbb{E}}[X]{\mathbb{E}}[Y]).\end{aligned}\] Now because \(X,Y\) are independent, by Theorem 2 the final term in this expression is zero. Hence we get our result. \(\square\)

Note: The expression \({\mathbb{E}}[XY]-{\mathbb{E}}[X]{\mathbb{E}}[Y]\) appearing in the above proof is called the covariance of \(X\) and \(Y\), and is a measure of the dependence between \(X,Y\). It is zero when \(X,Y\) are independent.

It is very important to remember that neither of these two results is true in general, without the assumption that \(X,Y\) are independent. As a simple example, note that even for a \(0\)-\(1\) r.v. \(X\) with \({\mathbb{P}}[X=1]=p\), \({\mathbb{E}}[X^2]=p\) is not equal to \({\mathbb{E}}[X]^2=p^2\) (because of course \(X\) and \(X\) are not independent!). This is in contrast to the case of the expectation, where we saw that the expectation of a sum of r.v.’s is the sum of the expectations of the individual r.v.’s always.


Let’s return to our motivating example of a sequence of \(n\) coin tosses. Let \(X\) the the number of Heads in \(n\) tosses of a biased coin with Heads probability \(p\) (i.e., \(X\) has the binomial distribution with parameters \(n,p\)). As usual, we write \(X=X_1+X_2+\cdots +X_n\), where \(X_i=1\) if the \(i\)-th toss is Head, and \(X_i = 0\) otherwise.

We already know \({\mathbb{E}}[X]= \sum_{i=1}^n {\mathbb{E}}[X_i] = np\). We can compute \({\operatorname{var}}(X_i) = {\mathbb{E}}(X_i^2) - {\mathbb{E}}(X_i)^2 = p - p^2 = p(1-p)\). Since the \(X_i\)’s are independent, by Theorem 3 we get \({\operatorname{var}}(X) = \sum_{i=1}^n {\operatorname{var}}(X_i) = np(1-p)\).

As an example, for a fair coin (\(p = 1/2\)) the expected number of Heads in \(n\) tosses is \(n/2\), and the standard deviation is \(\sqrt{n/4} = \sqrt{n}/2\). Note that since the maximum number of Heads is \(n\), the standard deviation is much less than this maximum number for large \(n\). This is in contrast to the previous example of the uniformly distributed random variable Equation 2, where the standard deviation \(\sigma(X) ~=~ \sqrt{(n^2 - 1)/12} ~\approx~ n/\sqrt{12}\) is of the same order as the largest value \(n\). In this sense, the spread of a binomially distributed r.v. is much smaller than that of a uniformly distributed r.v.