 # Lecture 29: Law of Large Numbers and Central Limit Theorem | Statistics 110

All right, so let’s get started. So today,
we’re gonna talk about what are probably the two most famous theorems in
the entire history of probability. They’re called the law of large
numbers and the central limit theorem. They’re closely related, so it makes sense
to do them together, kind of compare and contrast them. I can’t think of a more famous
probability theorem than these two. So the setup for today is that
we have i.i.d. random variables. Let’s just call them X1, X2, and so on, i.i.d. Since they’re i.i.d., they have
the same mean and variance, if the mean and variance exist, and
we’ll assume they do. So the mean, we’ll just call it mu. And the variance, sigma squared. So we’re assuming that
these are finite for now. The mean and variance exist. And both of these theorems tell us what happens to the sample
mean as n gets large. So, the sample mean is
just defined as Xn bar. Standard notation in statistics
is to put a bar on top to mean an average, and that’s just the average of the first n. So take the first n random variables,
and average them, so that’s just called the sample mean. So the question is, what can we
say about Xn bar as n gets large? So the way we would interpret this or
use this is that we get to observe these Xs. They’re random variables, but
after we observe them they become data. We’re never going to have
an infinite amount of data so at some point we stop it at n. We can think of that as the sample size
and hopefully we get a large sample size. Of course, it depends on the problem. Some problems,
you may not be able to get large n. Well, we assume n is large, and just take the average,
question is just, what can we say? All right, so first,
here’s what the law of large numbers says. It’s a very simple statement, and hopefully pretty intuitive, too. The law of large numbers says that Xn bar converges to mu as n goes to infinity, with probability 1. That’s the fine print, probability 1. With probability 0,
something really crazy could happen. But we don’t worry too much about it,
because it has probability 0. So Xn bar
is the sample mean, and the theorem says that, with probability 1, the sample mean
converges to the true mean. So, that is a pretty nice,
intuitive, easy to remember result.
By true I mean the theoretical mean. That is, the expected value of Xj, which is the same for
any j, is the true mean, mu. Whereas Xn bar is a random variable. Right?
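In symbols, and only as a restatement of what was just said, the setup and the law of large numbers read as follows. The notation $\bar{X}_n$ for Xn bar is the standard one; nothing here goes beyond the spoken statement.

```latex
\bar{X}_n = \frac{1}{n}\sum_{j=1}^{n} X_j,
\qquad E(X_j) = \mu, \quad \operatorname{Var}(X_j) = \sigma^2 < \infty,
\qquad
P\!\left(\lim_{n\to\infty} \bar{X}_n = \mu\right) = 1 .
```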
We’re taking an average of random variables. That’s a random variable. So mu is just a constant, but
Xn bar is a random variable. But it’s gonna converge, and
I should say a little bit more about what this convergence
statement actually means. You’ve all seen limits of sequences, but
when we are talking about limits of random variables we have to be
a little more careful. How do we actually define this? The definition of this statement
is just pointwise, which means, remember, Xn bar is a random variable. A random variable, mathematically
speaking, is a function. So
if you evaluate it at some specific outcome of the experiment,
then you’ll get a sequence of numbers. That is, if you actually observe the
values, this kind of crystallizes into numbers when you evaluate it at
the outcome of the experiment. And the statement is that those numbers converge to mu. In other words, this is an event. Either these random variables converge to mu or
they don’t. And we say that event has probability 1. That’s what the statement
of the theorem is. So to just give a simple example. Let’s think about what happens
if we have Bernoulli p. So if Xj is Bernoulli p,
then intuitively we’re just imagining an infinite
sequence of coin tosses, where the probability of heads is p. And then this says that if we add up
all of these Bernoullis up to n, that’s just, in the first n coin flips,
how many times did the coin land heads, and that divided by the number of flips should
converge to p with probability 1. So for example, so
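To see this with a simulated coin, here is a minimal sketch. It assumes Python's standard `random` module as the source of coin flips; the function name `running_proportion_of_heads`, the seed, and the checkpoint values are illustrative choices, not part of the lecture.

```python
import random

def running_proportion_of_heads(p, checkpoints, seed=0):
    """Flip a p-coin max(checkpoints) times and record the proportion of
    heads among the first n flips, for each n in checkpoints."""
    rng = random.Random(seed)
    wanted = set(checkpoints)
    heads = 0
    result = {}
    for n in range(1, max(checkpoints) + 1):
        heads += rng.random() < p      # True counts as 1, False as 0
        if n in wanted:
            result[n] = heads / n
    return result

# Fair coin: the proportion should drift toward 0.5 as n grows.
props = running_proportion_of_heads(p=0.5, checkpoints=[10, 1_000, 100_000])
for n, prop in props.items():
    print(f"n = {n:>7}: proportion of heads = {prop:.4f}")
```

A single run only illustrates the statement. The theorem itself is about what happens, with probability 1, as n goes to infinity.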
this is a very intuitive statement. If it’s a fair coin and
you flip the coin a million times, well, you’re not really expecting that it
will be exactly 500,000 heads and 500,000 tails. But you do think that,
in the long run, it should be the case that it’s going to be essentially
half heads, half tails. Not exactly, but essentially. And the proportion should get closer and
closer to the true value. This qualification, with probability 1, is
needed because, mathematically speaking, even if you have a fair coin,
there’s nothing in the math that says it’s impossible that the coin would land
heads every single time, forever. That’s not actually gonna happen in reality. It’s just not gonna happen. It’s a fair coin. It might land heads, heads, heads for
a time if you’re very lucky or unlucky or whatever. But it’s not gonna be heads forever, even though nothing in the math
says that’s an invalid sequence. So there’s some weird
pathological cases like that. But with probability one,
we get what we expect. If we didn’t have this result,
how would we ever even estimate p? You might imagine, if you
didn’t know what p was, kind of the obvious thing to do is
flip the coin a lot of times and take the proportion of heads and
use that as your approximation for p. But what justification could you have for doing that approximation
if you didn’t have this? So this is a very, very necessary result. But I guess to comment a little bit
more about what it actually says for the coin, because this is kind of
related to the gambler’s fallacy, and things like that. The gambler’s fallacy is the idea
that, let’s say you’re gambling and you lose like ten times in a row, and then
there’s the feeling that you’re due to win. You lost all these times, and
you might try to justify that using the law of large numbers and say, well, you
know, let’s say heads you win money, tails you lose money, and
you just lost money ten times in a row. But the law of large numbers says that,
in the long run, the proportion of heads is gonna go back to
one-half if the coin is fair. So somehow you need to start
winning a lot to compensate. That’s not the way it works. The coin is memoryless. The coin does not care how many failures
or how many losses you had before. So the way it works is not that, if
you’re unlucky at the beginning, somehow it gets offset later
by an increase in heads. The way it works is through
what we might call swamping. Let’s say the coin landed
tails 100 times in a row. It doesn’t mean that the probability
has changed for the 101st flip. What it means, though, is that we’re
letting n go to infinity here, okay? So no matter how unlucky you
were in the first 100 or the first million trials, that’s
nothing compared to infinity, right? So those first million just get swamped
out by the entire infinite future. So that’s what’s going on here. Yeah, so
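The swamping idea can be checked with plain arithmetic. The sketch below is an illustration under stated assumptions: the coin starts with a run of tails (100 here, matching the example) and every later flip is fair, so on average half of the later flips are heads. The function name is hypothetical.

```python
def expected_head_proportion(initial_tails, later_fair_flips):
    """Expected overall proportion of heads after an all-tails start
    followed by fair flips. The later flips are not adjusted in any way
    to 'make up' for the bad start."""
    total_flips = initial_tails + later_fair_flips
    expected_heads = 0.5 * later_fair_flips
    return expected_heads / total_flips

# The bad start is never cancelled; it is just outnumbered.
for later in [100, 10_000, 1_000_000]:
    proportion = expected_head_proportion(100, later)
    print(f"{later:>9} later flips: expected proportion of heads = {proportion:.5f}")
```

The probability of heads on each later flip stays at one-half throughout; only the weight of the first 100 flips in the average shrinks.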
to tell you one little story about the law of large numbers,
a colleague of mine told me this story. He had a student once who
said he hated statistics. And of course,
my colleague was very shocked, like how can anyone hate statistics? And so he asked, why? How is it possible that
you hate statistics? And the student was an athlete,
and he was training every day, and he had just learned
the law of large numbers. And he was very, very depressed by this
because he said, the law of large numbers says in the long run, I’m gonna only
be average and I can’t improve. So well, of course the fallacy there is that
we assumed iid right now. Now there are generalizations
of this theorem beyond iid, but we can’t just get rid of iid. So the iid is saying that the distribution
is not changing with time. That doesn’t mean that you can’t actually
improve. If you improve your own distribution, then it would not be iid. So don’t be depressed by this,
and in fact this theorem I think is crucial in order for
science to actually be possible. Because just
imagine a hypothetical counterfactual world where this
theorem was actually false. It would be really depressing to try
to ever learn about the world, right? Cuz this is saying,
you’re collecting more and more data. You’re letting your sample
size go to infinity. And this says
you converge to the truth, right? Otherwise it would be some weird setting where
you get more and more data, and more and more data, and yet you’re not able
to converge to the truth, right? So that would be really bad. So this is very intuitive, very important. Okay, so let’s prove this, or
at least a similar version. So what I stated is actually sometimes called
the strong law of large numbers. And we’re actually gonna
prove what’s sometimes called the weak law of large numbers. I don’t really like the terminology
strong and weak here, but that’s kind of standard. The strong law of large numbers
is what I just said, where it’s converging
pointwise with probability 1. That is, these random variables
converge to this constant, except on some bad event
that has probability 0. The weak law of large
numbers says that for any c greater than 0, the probability that the absolute value of Xn bar minus mu is greater than c goes to 0 as n goes to infinity. So it’s a very similar looking statement. It’s not exactly equivalent. It’s possible to show, though you have to
go through some real analysis for this that is not necessary for
our purposes, that once you’ve proven the strong law, it
implies this form of convergence. This is called convergence in probability,
but the intuition is very similar. So just to interpret this statement
in words: we can choose c, and we should interpret c as
being some small number. So let’s say we choose c to be 0.001, okay? And then it says that this probability
goes to 0 as n goes to infinity. So this says that if n is large enough,
then it’s extremely unlikely that
Xn bar and mu are more than 0.001 apart. In other words, if n is large, it’s extremely likely that the sample mean is
extremely close to the true mean, right? So it’s a very similar statement:
when n is large, it’s extremely likely that the sample
mean is very close to the true mean. Okay, so that’s what it says. So we’ll prove the weak law, because to prove the strong law takes
a lot of work and a lot of time. The weak law
looks like a nice-looking theorem, and it is a nice theorem, but we can prove it very easily
using Chebyshev’s inequality. Okay, so
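The weak law, together with the Chebyshev bound sigma squared over n c squared that the proof arrives at, can be checked by simulation. This is a minimal sketch under assumptions that are not from the lecture: the Xs are taken to be Uniform(0, 1) (so mu = 1/2 and sigma squared = 1/12), c is set to 0.05, and the function name, trial count, and seed are illustrative.

```python
import random

def estimate_deviation_probability(n, c, trials=2000, seed=1):
    """Monte Carlo estimate of P(|Xbar_n - mu| > c) for i.i.d. Uniform(0, 1) draws."""
    rng = random.Random(seed)
    mu = 0.5
    exceed = 0
    for _ in range(trials):
        xbar = sum(rng.random() for _ in range(n)) / n
        if abs(xbar - mu) > c:
            exceed += 1
    return exceed / trials

c = 0.05
sigma2 = 1 / 12                      # variance of Uniform(0, 1)
results = {}
for n in [10, 100, 1000]:
    estimate = estimate_deviation_probability(n, c)
    bound = sigma2 / (n * c * c)     # Chebyshev: Var(Xbar_n) / c^2 = sigma^2 / (n c^2)
    results[n] = (estimate, bound)
    print(f"n = {n:>4}: estimated probability = {estimate:.4f}, Chebyshev bound = {bound:.4f}")
```

The bound can exceed 1 for small n, in which case it says nothing. What matters for the weak law is only that it goes to 0 as n grows.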
let’s prove the weak law of large numbers. So all we need to do is show
that this goes to 0, right? That’s what the statement is. So let’s just bound it using, this looks
pretty similar to what we were doing last time, where we did Markov’s inequality,
Chebyshev’s inequality. This looks similar to that
kind of stuff from last time, which is why I did that, well,
one reason for doing that last time. We need the inequalities anyway,
but it’s especially useful here. So we just need to show
that the probability that the absolute value of Xn bar minus mu is greater than c goes to 0. By Chebyshev’s inequality,
this probability is less than or equal to the variance of Xn
bar divided by c squared. That’s just exactly
Chebyshev from last time. Now we just need the variance of Xn bar.
For the variance of Xn bar, well, just stare at the definition
of Xn bar for a second. There’s a 1 over n in front,
and that comes out as 1 over n squared. And then since I’m assuming
they’re iid, and hence independent, the variance of the sum is just n
times the variance of one term. So the bound is 1 over n squared times n sigma squared,
divided by c squared, which is sigma squared over n c squared. Sigma is a constant, c is a constant,
n goes to infinity, so this goes to 0. So that proves the weak law of large
numbers. It’s just a one line thing. Okay, so that tells us what happens
point-wise when we average a bunch of iid random variables, and
it converges to the mean. So let me just rewrite that statement. Then we’ll write the central limit
theorem and kind of compare them. So another way to write
what we just showed is that Xn bar minus mu
goes to 0 as n goes to infinity, which is a good thing to know. However, it doesn’t tell us what
the distribution of Xn bar looks like. So this is true with probability one,
but what is the distribution? What does the distribution
of Xn bar look like? So this says it’s getting closer,
Xn bar is getting closer and closer to this constant mu. Okay, but that’s not really
telling us the shape, and it’s not really telling us the rate. This goes to 0, but at what rate? So one way to think about problems like
that, when you have something going to 0, and you wanna study
how fast it goes to 0, not just here, but as a general approach
to that kind of problem, is this. We know this goes to 0, but
we don’t know how fast. One way to study that would be to multiply it
by something that goes to infinity, right? Now, if we multiply it by
something that goes to infinity, such that the
product goes to infinity, then we know that the part that blows
up is dominating over the part that goes to 0. And if we multiply by something
that goes to infinity, but the whole thing still goes to 0,
then the part going to 0 is winning, and that tells us something about the rate, right? So what’s gonna happen is that we
can imagine multiplying Xn bar minus mu by n to some power, and we’re gonna
show that there’s a threshold power here. So it’s n to some power, fill in the blank. What we’re gonna show is that, if the power here is above some
threshold, n to a big power, it’s gonna go to infinity fast, and
the whole thing will just blow up. And if we put a smaller power than the
threshold here, then n to that power is still going to infinity, as long as it’s a positive
power of n, but Xn bar minus mu is going to 0,
and that part’s dominating, right? So this term is competing with this term. This one goes to infinity,
this one goes to 0, okay? So then the question is, what’s
that magic threshold value? And the answer is one-half. So that’s what we’re
gonna study right now. So we’re gonna take the square
root of n times Xn bar minus mu. This is kind of the happy medium, where we’re gonna get a non-degenerate
distribution. This is gonna converge in distribution to an actual distribution.
It’s not gonna just get killed to 0 or blow up to infinity. It’s actually
gonna give us a nice distribution. Okay, and I’m also gonna divide by
sigma here, which makes it a little bit cleaner. So this is the central limit theorem now. I’m stating it, then we’ll prove it. The central limit theorem says that
if you take the square root of n times Xn bar minus mu, divided by sigma, and look at what happens
as n goes to infinity, it converges to the standard
normal in distribution. By convergence in
distribution, what we mean is that the distribution of this converges
to the standard normal distribution. In other words, you could take the CDF. I mean, the Xs may be discrete or continuous
or a mixture of discrete and continuous, so this doesn’t necessarily have a PDF,
but every random variable has a CDF. So it says if you take the CDF of this, it’s gonna converge to capital Phi,
the standard normal CDF. So I think this is kind of an amazing
result that this holds in such generality, right? Because, I mean, the
standard normal is just this one particular distribution. It’s a nice looking bell
curve, but that’s just one distribution. And those Xs, they could be discrete,
they could be continuous, they could be extremely nasty
looking distributions, right? It could look like anything, the only thing we assumed was
that there was a finite variance. Other than that,
they could have an incredibly complicated, messy distribution. But it’s always gonna
go to standard normal. So this is one of the reasons why
the standard normal distribution is so important and so
widely used. This is a theorem about what happens as n goes to infinity,
but the way it’s used in practice is that people use normal approximations all the
time, and a lot of the justification for normal approximations is coming from this,
because this says that if n is large, then the sample mean will approximately
have a normal distribution. Even if the original data did not look
like they came from a normal distribution, when you average lots and
lots of them, it looks normal, okay. So this is in a sense a better
theorem than the law of large numbers, because it’s kind of more
informative to know the distribution and know something about the rate. And
you know, it’s interesting that square root of n is the power
of n that’s just right, right? With a larger power it’s gonna blow up,
with a smaller power it’s gonna go to 0. N to the one-half is the compromise,
where you always get a normal distribution. It’s more informative in some sense, but you should also keep in mind,
it is a different sense of convergence. In the law of large numbers, we’re talking about the random
variables actually converging. Literally, the random variables
converge: the sample mean converges pointwise, with
probability 1, to the true mean. Here, we’re talking about
convergence in distribution. So we’re not talking about
convergence of random variables. We’re just saying the distribution of this
converges to the normal 0, 1 distribution. So that’s a different sense
of convergence, but anyway, both of them are telling us what’s gonna
happen to Xn bar when n is large, okay? So well, let’s prove this theorem. Here’s another way to write this,
by the way, it’s good to be familiar with both ways. It’s just algebra to go
from one to the other, but they’re both useful enough
to be worth mentioning. Let’s just write the central limit
theorem in terms of the sum of X’s rather than in terms of the sample mean. So I’m just gonna take the sum of Xj,
j equals 1 to n. And so, we can either think of
the central limit theorem as, either think of it as telling us what
happens to the sample mean or we can think of it as telling us what happens
to the sum, or the convolution, okay? It’s equivalent because
they differ by just a factor of n. We just have to be careful not
to mess up the factor of n, but we can go from one to the other
cuz it’s just a factor of n. So the claim is that this is
approximately normal when n is large, but if we just have this thing,
this could easily just blow up. You’re just adding more and more terms. So we wanna
standardize this first. So we take this thing, and
because this thing has mean n mu, right, let’s subtract n mu. Then it has zero mean.
I just want it to match: I wanna make the mean 0 and
the variance 1, so that it kind of matches up with the standard normal,
rather than just letting it blow up. So this is called centering.
By linearity, the mean is n mu, so we
just subtract n mu. And then let’s divide by
the standard deviation. This is just how we did
standardization before. So over there we showed that the variance
of Xn bar is sigma squared over n. And the variance of this sum
is just n sigma squared. So let’s just divide by
the standard deviation, right, which is square root of n times sigma,
okay? Cuz the variance is n sigma squared. So that’s just the standardized version. And the statement is again that this
converges to the standard normal in distribution. So if we take this sum and standardize it,
then it’s gonna go standard normal. Okay, so, all right, so
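Before the proof, the statement can be tried out numerically. The following is a rough sketch rather than a demonstration of the theorem: it assumes the Xs are Expo(1) (a skewed distribution with mu = 1 and sigma = 1, chosen only for illustration), standardizes the sum as described, and compares the empirical CDF with capital Phi, computed from `math.erf`. Function names, n, trial count, and seed are all illustrative.

```python
import math
import random

def standard_normal_cdf(z):
    """Phi(z), the N(0, 1) CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def standardized_sums(n, trials, seed=2):
    """Draw `trials` copies of (S_n - n*mu) / (sigma*sqrt(n)) for i.i.d. Expo(1) terms."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0             # mean and standard deviation of Expo(1)
    values = []
    for _ in range(trials):
        s_n = sum(rng.expovariate(1.0) for _ in range(n))
        values.append((s_n - n * mu) / (sigma * math.sqrt(n)))
    return values

values = standardized_sums(n=200, trials=5000)
for z in [-1.0, 0.0, 1.0]:
    empirical = sum(v <= z for v in values) / len(values)
    print(f"z = {z:+.1f}: empirical CDF = {empirical:.3f}, Phi(z) = {standard_normal_cdf(z):.3f}")
```

The individual Expo(1) draws look nothing like a bell curve, yet the standardized sum of 200 of them should already track Phi fairly closely.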
now we’re ready to prove this theorem. It’s sort of just a calculation,
but it’s kind of a nice calculation in some ways.
This theorem is always true as
long as the variance exists. We don’t need to assume that the third
moment or the fourth moment exists. But the proof is much more complicated
if you do it in that generality. So we’re gonna assume that the MGF exists,
and then we can actually work with MGFs. Because when you see this thing, a
sum of independent random variables, then we know the MGF is gonna be
something useful if it exists. And there are ways to extend this proof
to cases where the MGF doesn’t exist. But for our purposes,
we may as well just assume the MGF exists. So assume the MGF of Xj exists, and let’s call it M(t). They’re iid, so if one of them
has an MGF, they all have the same MGF. We’ll just assume that that exists. Once we have MGFs, then our strategy
is to show that the MGFs converge. So that’s a theorem about MGFs, that
if the MGFs converge to some other MGF, then the random variables
converge in distribution, right? We had a homework problem related to that,
where you found that the MGFs converged to some MGF, and that implies
convergence of the distributions, right? Okay, so that’s the whole strategy. So that means all we need to
do is find the MGF of this and then take the limit, okay? So basically at this point,
it’s just like, write down the MGF, take the limit, and
use a few facts about MGFs, okay? So first of all, let’s just assume mu=0 and sigma=1, just to simplify the notation. This is without loss of generality, because
I wrote the standardized thing this way,
but I could’ve just written it as
standardizing each X separately. I could’ve written Xj minus mu, over sigma. So this would be standardizing each
of them separately, summed from j=1 to n, and then we have a 1 over root n in front. That will be the same thing
that we’re looking at. This just says standardize
them separately first. But then you could just, I mean if
you want, call this thing Yj. And once you have the central limit theorem
for the Yj, then you know that it’s true for the Xj. So you might as well just assume that
they’ve already been standardized. And so just to have some notation,
let’s just let Sn equal the sum, S for sum, of the first n terms. And what we wanna look at is the MGF of Sn over root n.
That’s what we’re looking at, right? We let mu equal zero, sigma equals one,
so we’re looking at Sn over root n. And we wanna show that its MGF goes
to the standard normal MGF. Right, so we just need to find this MGF,
take a limit. Okay, so let’s just find the MGF. So by definition, that’s the expected
value of e to the t times Sn over root n. And Sn is just the sum. And we’re assuming independence,
so you can write this as e to the t X1 over root n, times e
to the t X2 over root n, blah, blah, blah. All of those factors are independent,
therefore, they’re uncorrelated. So we can just split it up as a product: E of e to the t
X1 over root n, blah, blah, blah, same thing, where E of e to the t Xj over root n
is the general term, right? I’m just using the fact that
those are uncorrelated, so we can write E of the product
as the product of the expectations. But since these X’s are iid, these are really just the same
thing written n times. So really,
this is just this thing to the nth power. And this thing,
that should remind you of an MGF, right? That’s just the MGF of X1, except that instead of being evaluated at t,
it’s evaluated at t over root n. So really, that’s just the MGF, evaluated at t over root n,
raised to the nth power. So that’s what we have. Now we need to take the limit
as n goes to infinity. So let’s just look at what’s gonna
happen here as n goes to infinity. This thing on the inside becomes M of 0. M of 0 is 1 for any MGF, right? Cuz e to the 0 is 1. So this is of the form 1 to the infinity,
which is an indeterminate form, right? It could evaluate to anything. So going back to calculus,
how do you deal with 1 to the infinity, or 0 over 0, or whatever? Usually we try to reduce it to something
where we can use L’Hopital’s Rule for those problems, right? Or we can use a Taylor
series type of thing. So, how do we get into that form? Take the log,
because this looks like 1 to the infinity. If we take the log,
it’ll look like infinity times log of 1. So it’ll look like infinity times 0.
Take logs. Then we just have to remember to
exponentiate at the end to undo the log. Okay, so
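As a numerical illustration of this indeterminate form, here is a small sketch with one concrete distribution. The choice is an assumption made only for the example: X equal to +1 or -1 with probability one-half each, which has mean 0, variance 1, and MGF cosh(t). The inside of the power goes to 1 while the nth power settles near e to the t squared over 2, the standard normal MGF that the proof is aiming for. The value t = 1.5 is arbitrary.

```python
import math

def mgf_rademacher(t):
    """MGF of X = +1 or -1 with probability 1/2 each (mean 0, variance 1):
    E(e^{tX}) = (e^t + e^{-t}) / 2 = cosh(t)."""
    return math.cosh(t)

t = 1.5
for n in [1, 10, 100, 10_000]:
    inside = mgf_rademacher(t / math.sqrt(n))   # tends to M(0) = 1
    power = inside ** n                          # the "1 to the infinity" quantity
    print(f"n = {n:>6}: M(t/sqrt(n)) = {inside:.6f}, M(t/sqrt(n))^n = {power:.6f}")

print(f"exp(t^2 / 2) = {math.exp(t * t / 2):.6f}")
```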
let’s write down then what we have. After taking the log, and
we’re trying to do a limit, so we’re doing the limit as n goes
to infinity, and we take the log. It’s n log M(t over root n). So that’s of the form infinity times 0. If we want 0 over 0 or
infinity over infinity, we can just write it with 1
over n in the denominator. Okay, and now it’s of the form 0 over 0. So we can almost use L’Hopital’s Rule,
but not quite. We have to be a little bit careful. Because first of all,
I’m assuming n is an integer, and you can’t do calculus on integers. Secondly, even if we
pretended that n is a real number, then the derivative of 1 over n would
be minus 1 over n squared, and that’s kind of annoying to deal with. And it’s kind of annoying to
deal with this square root here. So let’s first make a change of variables. Let’s just let y=1 over root n, and
also let y be real, not necessarily of the form 1 over
square root of an integer, okay? So it’s the same limit, just written
in terms of y instead of in terms of n. So as n goes to infinity y goes to 0, and
1 over n is y squared, so the denominator is just y squared. The reason I do it this way is
that 1 over root n is just y by definition, so
then the numerator is just log M of yt. That’s a lot easier to deal with
because we got rid of the square roots. So it’s still of the form 0 over 0. So we’re gonna use L’Hopital’s Rule. So limit, y goes to 0. Take the derivative of the numerator and
the denominator separately. The derivative of the denominator is 2y. For the derivative of the numerator, well, we’re just going to
have to use the chain rule. The derivative of log of something
is 1 over that thing, so that’s 1 over M of yt, times the derivative
of that thing, which again by the chain rule is M prime of
yt times the derivative of yt. We’re treating t as constant,
we’re differentiating with respect to y. So t comes out. And now let’s see what we have. Let’s just summarize
a couple facts about MGFs. So M of t is the expected
value of e to the tX1. So M of 0=1, okay. And when we first started doing MGFs we
said that if we take derivatives of the MGF and evaluate them at 0, we get the moments. That is why it’s
called the moment generating function. So the first derivative at 0 is the mean,
but we assumed that mu is 0. So M prime of 0 is 0 here. And the second derivative,
while we’re doing this: the second derivative at 0 is the second moment,
and since we assumed that the variance is 1 and the mean is 0,
the second moment is 1, okay? So over here, as we let y go to 0,
the denominator’s still going to 0. The numerator’s also going to 0,
because M prime of 0 is 0, so it’s still of the form 0 over 0, so
let’s just do what we were told again. So first I can simplify it a little bit,
this t can come out, because that’s acting as a constant,
and the 2 can come out. And limit y goes to 0 and
this M of yt part, that’s just going to 1. So we can write that as part
of a separate limit, but that other limit is just going to 1. You can think of it as just
the limit of this part times the limit of the rest of it. But that part’s just going to 1,
so we can get rid of that. So really, what’s left is just t over 2 times the limit of M prime of yt divided by y. Everything else is gone, so
it’s actually pretty nicely simplified. Now, using L’Hopital’s Rule
a second time, the derivative of
the denominator is just 1, okay? And for the numerator,
chain rule, it’s t times M double prime of yt. That was a t, not a t squared,
but now it’s a t squared, because by the chain rule, the derivative of
yt is t, so we have a t squared over 2. Now when we let y go to 0,
M double prime of 0 is 1, so this limit is just 1. So we get t squared over 2, and that’s what we wanted, because t squared over 2 is the log of e to the t squared over 2, and e to the t squared over 2 is
exactly the normal 0,1 MGF. Okay, so that’s the end of
the proof of the central limit theorem. All we had to do was use basic facts
about MGFs and use L’Hopital’s Rule twice. And there we have one of the most famous and
important theorems in statistics. Now so
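For reference, the whole calculation can be collected in one display. This is a transcription of the spoken steps under the same assumptions (mu = 0, sigma = 1, the MGF M exists, S_n is the sum of the first n terms, and y = 1 over root n), using M(0) = 1, M'(0) = 0, M''(0) = 1.

```latex
\begin{aligned}
E\!\left(e^{tS_n/\sqrt{n}}\right)
  &= \left(E\, e^{tX_1/\sqrt{n}}\right)^{n}
   = M\!\left(\tfrac{t}{\sqrt{n}}\right)^{n}, \\[4pt]
\lim_{n\to\infty} n \log M\!\left(\tfrac{t}{\sqrt{n}}\right)
  &= \lim_{y\to 0} \frac{\log M(yt)}{y^{2}}
     \qquad \left(y = 1/\sqrt{n}\right) \\
  &= \lim_{y\to 0} \frac{t\,M'(yt)}{2y\,M(yt)}
   = \frac{t}{2}\,\lim_{y\to 0} \frac{M'(yt)}{y} \\
  &= \frac{t^{2}}{2}\,\lim_{y\to 0} M''(yt)
   = \frac{t^{2}}{2}, \\[4pt]
\text{so}\quad M\!\left(\tfrac{t}{\sqrt{n}}\right)^{n}
  &\longrightarrow e^{t^{2}/2},
  \quad\text{the } \mathcal{N}(0,1) \text{ MGF.}
\end{aligned}
```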
there are more general versions of this, like you can extend this in various
ways where it’s not iid, but it still has to satisfy
some assumptions, right. But anyway,
this is the basic central limit theorem. Okay, so that’s pretty good. Let’s do an example,
like how do we actually use this, for the sake of approximations,
things like that. Last time I was talking about
the difference between inequalities and approximations, right? And we talked about Poisson
approximation before. We haven’t really talked
about normal approximation. This result is giving us the ability
to use normal approximations when we’re studying the sample mean and
n is large, okay? So historically, though,
the first version of the central limit theorem
that was ever proven, I think, was for binomials, okay? So what we’re saying is that a Binomial(n, p), under some conditions,
will be approximately normal. And well, in the old days that was an
incredibly important fact, because they didn’t have computers to
deal with binomials, with things like n choose k, where n is large and
you have all these factorials. You can’t do these things by hand. Now we have fast computers,
so it’s a little bit better. But it’s still a lot easier working
with normal distributions than binomial distributions most of the time,
right? And even now factorials still grow so
fast that even with a fast computer with large memory and
everything, you may quickly exceed its ability when you’re doing
some big complicated binomial problem. And normals have a lot of nice properties,
as we’ve seen, okay? The question is,
when can we approximate a binomial using a normal, and
how do we do that, okay? So this is the binomial approximation to the normal, or rather the other way around. Normal approximation,
I’ll say binomial approximated by normal, the normal approximation to the binomial. When is that valid? To contrast it with
the Poisson approximation that we’ve seen before, okay? So let X be Binomial(n, p). And as we’ve done many times before, we can represent X as
a sum of iid Bernoullis, X1 plus X2 and so on up to Xn. Right?
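The representation of a binomial as a sum of Bernoulli indicators, and what the central limit theorem then suggests, can be sketched as follows. Everything concrete here is an assumption for illustration: n = 100, p = 0.5, the interval from 45 to 55, the seed, and the function names. The exact probability sums the PMF using `math.comb`, and the approximation standardizes with mean np and standard deviation root npq, which is the basic normal approximation the lecture goes on to write down.

```python
import math
import random

def standard_normal_cdf(z):
    """Phi(z), the N(0, 1) CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binomial_as_bernoulli_sum(n, p, rng):
    """One draw of X = X_1 + ... + X_n, with X_j = 1 if trial j succeeds, else 0."""
    return sum(1 if rng.random() < p else 0 for _ in range(n))

def exact_probability(n, p, a, b):
    """P(a <= X <= b) for X ~ Bin(n, p), summing the PMF from a to b."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, b + 1))

def normal_approximation(n, p, a, b):
    """Phi((b - np)/sqrt(npq)) - Phi((a - np)/sqrt(npq)), with q = 1 - p."""
    mean, sd = n * p, math.sqrt(n * p * (1 - p))
    return standard_normal_cdf((b - mean) / sd) - standard_normal_cdf((a - mean) / sd)

n, p, a, b = 100, 0.5, 45, 55        # illustrative values, not from the lecture
rng = random.Random(3)
draws = [binomial_as_bernoulli_sum(n, p, rng) for _ in range(20_000)]
simulated = sum(a <= x <= b for x in draws) / len(draws)
sample_mean = sum(draws) / len(draws)
exact = exact_probability(n, p, a, b)
approx = normal_approximation(n, p, a, b)
print(f"sample mean of the sums = {sample_mean:.2f} (n*p = {n * p:.2f})")
print(f"simulated P = {simulated:.4f}, exact P = {exact:.4f}, basic normal approx = {approx:.4f}")
```

The simulated and exact values agree because the sum of indicators really is binomial. The normal value is only an approximation to them.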
Well, these are just 1 if success on the jth trial, 0 otherwise, so
the Xj are iid Bernoulli p. So this does fit into
the framework of the central limit theorem, that is, we are adding
up iid random variables. So the central limit theorem says that
if n is large this will be approximately normal, at least after
we have standardized it, okay? So
suppose we’re interested in the probability
that X is between a and b. And I want to approximate that. First we’ll write an equality, then
we’ll approximate it. So, I mean if you had to do this exactly on
a computer, or by hand, which you wouldn’t want to, what you would do
would be to take the PMF and sum up all the values of
the PMF from a to b, right. So okay, you would not want to do
that by hand most of the time. But suppose we just want an approximation
for this, not the exact thing. So first, the strategy is just gonna
be to take X and standardize it. So we’re gonna subtract the mean,
and we know that the mean is np, and we’re gonna divide by
the standard deviation, which we know is the square root of npq, where
q is 1 minus p. So, I’m just standardizing it right now. So this is still equal,
we haven’t done any approximations yet. And then, now that we’ve standardized it, we can apply the central limit theorem,
if n is large enough, right? The central limit
theorem says n goes to infinity, and that doesn’t answer the question
of how large n has to be. And for that, there are various theorems and
various rules of thumb. A lot of books will ask,
how large does n have to be? And some books at least will say 30,
and that’s just a rule of thumb. That’s not always gonna work for all distributions.
There are separate rules of thumb for the binomial, like you want n
times p to be reasonably large and n times 1 minus p to be large.
There are different rules of thumb. But anyway, if n is large enough, then what we’ve just proven is that
this is gonna look like it has a normal distribution because
that’s a sum of iid things. And we standardized it correctly, because
we already knew the mean and the variance, so we just standardized it. Okay, so this is approximate now. Now we’re going to use
the normal approximation, we’re going to say this
is approximately normal. And if I want the probability that
the normal is between something and something, that’s just the CDF
here minus the CDF here, right? Because for the normal, I mean this
is discrete but we’re approximating using something continuous, we just
say, integrate the PDF from here to here. And by the fundamental theorem of calculus,
that just says take the CDF and subtract, okay. So we’re just gonna do Phi of b minus np over square root of npq, minus Phi of a minus np over square root of npq. So that would be the basic
normal approximation, I’ll talk a little bit about how
to improve this approximation. But to contrast it with
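
Before moving on, here is a minimal Python sketch of the two calculations just described: the exact answer, obtained by summing the binomial PMF from A to B, and the basic normal approximation, Phi of (B minus NP) over root NPQ minus Phi of (A minus NP) over root NPQ. The function names and the values of N, P, A, and B are illustrative assumptions, not something from the lecture, and Phi is computed from the standard library's error function.

```python
import math


def binom_pmf(k, n, p):
    """PMF of Bin(n, p) at the integer k."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def std_normal_cdf(z):
    """Phi(z), the standard normal CDF, written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def exact_prob(a, b, n, p):
    """Exact P(a <= X <= b): sum the PMF over the integers a, ..., b."""
    return sum(binom_pmf(k, n, p) for k in range(a, b + 1))


def normal_approx_prob(a, b, n, p):
    """Basic normal approximation (no continuity correction)."""
    q = 1.0 - p
    mean = n * p                 # mean of Bin(n, p)
    sd = math.sqrt(n * p * q)    # standard deviation of Bin(n, p)
    return std_normal_cdf((b - mean) / sd) - std_normal_cdf((a - mean) / sd)


# Illustrative values only (hypothetical, not from the lecture).
n, p, a, b = 100, 0.5, 45, 55
exact = exact_prob(a, b, n, p)
approx = normal_approx_prob(a, b, n, p)
print(f"exact  P({a} <= X <= {b}) = {exact:.4f}")
print(f"normal approximation      = {approx:.4f}")
```

With these particular values the two numbers are close but not equal, which is the gap that the improvement mentioned above is meant to narrow.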
the Poisson approximation. We talked before about the fact that,
and we proved the fact that if N goes to infinity, and
P goes to 0, and N times P is fixed. Then the binomial distribution
converts to the Poisson distribution, we proved that before. So in the Poisson approximation, so for the Poisson approximation what we had was
N is large but P was very small, right? And we let lambda equal NP and
x as moderate. And most important thing is that
P is small here, P is close to 0. We proved it in the case where this goes
to infinity and this goes to 0, okay? So Poisson is relevant when we’re
dealing with a large number of very rare unlikely things. That’s really in contrast to this, in this case for the normal approximation. Then, while we still want N to be large,
but if you kind of think intuitively
about when is this gonna work well, we actually want P to
be close to one half. Because think about the symmetry, if you
have a binomial with P equal to one half, that’s a symmetric distribution. And
every normal distribution is symmetric. If P is far from one half, then the
binomial is very skewed, and in that case it kind of doesn’t make that much
sense to approximate using a normal. So if P is very small,
the Poisson approximation
makes a lot more sense than the normal approximation. However, think about the statement
of the central limit theorem. In that theorem I never said
P was close to one half, in fact that was just a general theorem,
we didn’t even have P in the statement of the central limit theorem, but
somehow this still has to eventually work. But as a practical matter
as an approximation, if P is close to one half this
is going to work quite well, if N is like 30 or 50 or
100, it will work fine. But if P is .001,
the central limit theorem is still true, that as N goes to infinity
it’s gonna work, okay. But if N is kind of not
that enormous of a number, then it’s gonna be a pretty
bad approximation. And let’s just try to reconcile these
statements, though. Suppose we let N go to infinity with
P very small. I still said that as N goes to infinity, it’s still gonna converge to
normal, just much more slowly, right? So, how could the binomial
look both normal and Poisson? Well, the answer is that
the Poisson also looks normal. So if you have a Poisson lambda
where lambda is very large, that’s also gonna look normal. Now, here
we’re approximating a discrete distribution
using something continuous, and I just wanna add something about that. Let’s look at
what could go wrong with this approximation. What if we look at the case A equals B? So then we’re just asking for
the probability that X equals A; that is, we’re approximating the binomial PMF. And one kind of weird thing about this is that
the exact probability would change if we changed the inequalities to strict inequalities, but
the normal approximation would not. As soon as we say that this is
approximately normal, then we don’t care whether the inequalities are strict anymore. So there’s something called the continuity
correction, which I just wanted to briefly mention. It’s an improvement to deal with
the fact that you’re using something continuous to approximate
something discrete. And it’s often not explained very well, but
if you understand what it does in this simple case,
then it’s not hard to see the idea. The idea is that if you just said this is
approximately normal, then you would just say zero, right? Because the probability of any single point is zero for a continuous distribution,
and that’s not very useful, right? We want something more useful than zero. So the idea is just
simply to rewrite the event. Let’s assume A is
an integer, and X is discrete. Well, X equals A is the same thing
as saying that X is between A minus one-half and A plus one-half, right? So do this rewriting first, and then apply the normal approximation. More generally, for each value in the range from A to B, replace it by an interval
of length 1 centered there. That’s exactly the same event, because X
is an integer anyway. But here at least we’re giving the normal
an interval to work with instead of just saying zero, so
that improves the approximation. Anyway, it’s just the central limit theorem. All right, so see you next time.
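
As a numerical companion to the last part of the lecture, here is a minimal Python sketch of the continuity correction for a single point: the probability that X equals A is approximated by the normal probability of the interval from A minus one-half to A plus one-half. It also checks the earlier contrast between the Poisson and normal approximations when P is very small. The function names and all parameter values are hypothetical choices for illustration, not taken from the lecture.

```python
import math


def std_normal_cdf(z):
    """Phi(z), the standard normal CDF, written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binom_pmf(k, n, p):
    """PMF of Bin(n, p) at the integer k."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """PMF of Pois(lam) at the integer k."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def corrected_point_prob(a, n, p):
    """Normal approximation to P(X = a), using P(a - 1/2 <= X <= a + 1/2)."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    upper = std_normal_cdf((a + 0.5 - mean) / sd)
    lower = std_normal_cdf((a - 0.5 - mean) / sd)
    return upper - lower


# Case 1 (illustrative): P = 1/2, where the normal approximation is expected to do well.
n1, p1, a1 = 100, 0.5, 50
exact_mid = binom_pmf(a1, n1, p1)
normal_mid = corrected_point_prob(a1, n1, p1)
print(f"P = 0.5:   exact {exact_mid:.4f}, corrected normal {normal_mid:.4f}")

# Case 2 (illustrative): P very small with N * P moderate, the Poisson regime.
n2, p2, a2 = 1000, 0.001, 0
exact_rare = binom_pmf(a2, n2, p2)
normal_rare = corrected_point_prob(a2, n2, p2)
poisson_rare = poisson_pmf(a2, n2 * p2)
print(f"P = 0.001: exact {exact_rare:.4f}, corrected normal {normal_rare:.4f}, "
      f"Poisson {poisson_rare:.4f}")
```

Without the half-unit widening, the normal approximation to a single point would be exactly zero, so the first case shows what the correction buys; the second case is one small check, under these assumed values, that the Poisson approximation is the better fit when P is close to 0.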