# Lecture 29: Law of Large Numbers and Central Limit Theorem | Statistics 110


All right, so let’s get started. So today,

we’re gonna talk about what are probably the two most famous theorems in

the entire history of probability. They’re called the law of large

numbers and the central limit theorem. They’re closely related, so it makes sense

to do them together, kind of compare and contrast them. I can’t think of a more famous

probability theorem than these two. So the setup for today is that

we have i.i.d. random variables. Let’s just call them X1, X2, and so on, i.i.d. Since they’re i.i.d., they have

the same mean and variance, if the mean and variance exist, but

we’ll assume they do. So the mean, we’ll just call it mu. And the variance, sigma squared. So we’re assuming that

these are finite for now. The mean and variance exist. And both of these theorems tell us what happens to the sample

mean as n gets large. So, the sample mean is

just defined as Xn bar. Standard notation in statistics

is to put a bar on top to denote an average, and that’s just the average of the first n. So we take the first n random variables,

and average them, so that’s just called the sample mean. So the question is, what can we

say about Xn bar as n gets large? So the way we would interpret this, or

use this, is that we get to observe these Xs. They’re random variables, but

after we observe them they become data. We’re never going to have

an infinite amount of data so at some point we stop it at n. We can think of that as the sample size

and hopefully we get a large sample size. Of course, it depends on the problem. Some problems,

you may not be able to get large n. But we’ll assume n is large, and just take the average,

and the question is just, what can we say? All right, so first,

here’s what the law of large numbers says. It’s a very simple statement, and hopefully pretty intuitive, too. The Law of Large Numbers says that Xn bar converges to mu as n goes to infinity, with probability 1. That’s the fine print: with probability 1. On an event of probability 0,

something really crazy could happen. But we don’t worry too much about it,

because it has probability 0. With probability 1,

this is the sample mean, and it says that the sample mean

converges to the true mean. So, that is a pretty nice,

intuitive, easy to remember result. That is,

by true I mean the theoretical mean. That is the expected value of Xj for

any j is the true expected value. Whereas this, is a random variable. Right?
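This convergence is easy to see in a quick simulation sketch (my illustration, not part of the lecture; the Uniform(0, 1) distribution and the sample sizes are illustrative choices): average more and more i.i.d. draws, whose true mean is mu = 0.5, and watch the random quantity Xn bar settle down at mu.

```python
import random

# Simulation sketch (illustrative choices, not from the lecture):
# the sample mean of n iid Uniform(0, 1) draws, whose true mean is 0.5.
random.seed(0)

mu = 0.5
for n in [10, 1_000, 100_000]:
    xbar = sum(random.random() for _ in range(n)) / n
    print(n, xbar)
```

Each xbar here is random; rerun with a different seed and you get different numbers, but for large n they all cluster tightly around 0.5.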

We’re taking an average of random variables. That’s a random variable. So this is just a constant but

this is a random variable. But it’s gonna converge and

I should say a little bit more about what this convergence

statement actually means. You’ve all seen limits of sequences, but

when we are talking about limits of random variables we have to be

a little more careful. How do we actually define this? The definition of this statement

is just pointwise, which means: remember, Xn bar is a random variable. A random variable, mathematically

speaking, is a function. So, say, for each possible outcome:

if you evaluate this at some specific outcome of the experiment,

then you’ll get a sequence of numbers. That is if you actually observed the

values and this kind of crystallizes into numbers when you evaluate it at

the outcome of the experiment. And so those numbers converge to mu. In other words, this is an event. Either these random variables converge or

they don’t. And we say that event has probability 1. That’s what the statement

of the theorem is. So to just give a simple example. Let’s think about what happens

if we have Bernoulli p. So if Xj is Bernoulli p,

then intuitively we’re just imagining an infinite

sequence of coin tosses, where the probability of heads is p. And then this says that if we add up

all of these Bernoullis up to n, that’s just counting, in the first n coin flips,

how many times the coin landed heads; and that, divided by the number of flips, should

converge to p with probability 1. So for example, so

this is a very intuitive statement. If it’s a fair coin and

you flip the coin a million times, well, you’re not really expecting that it

will be 500,000 heads and 500,000 tails. But you do think that,

in the long run, it should be the case that it’s going to be essentially

half heads, half tails. Not exactly, but essentially. And the proportion should get closer and

closer to the true value. This qualification, with probability 1, is

needed because mathematically speaking even if you have a fair coin,

there’s nothing in the math that says it’s impossible that the coin would land

heads, heads, heads, heads, heads forever. You know that that’s never

actually gonna happen in reality. It’s just not gonna happen. It’s a fair coin. It might land heads, heads, heads for

a time if you’re very lucky or unlucky or whatever. But it’s not gonna be heads,

heads, heads forever. But there’s nothing in the math that

says that’s an invalid sequence. So there’s some weird

pathological cases like that. But with probability one,

we get what we expect. If we didn’t have this result,

how would we ever even estimate p? You might imagine if you

didn’t know what p was, kind of the obvious thing to do is

flip the coin a lot of times and take the proportion of heads and

use that as your approximation for p. But what justification could you have for doing that approximation

if you didn’t have this? So this is a very, very necessary result. But I guess to comment a little bit

more about what it actually says for

related to gambler’s fallacy, and things like that. The gambler’s fallacy is the idea

that, let’s say, you’re gambling and you lose like ten times in a row, and then

it’s the feeling that you’re due to win. You lost all these times, and

you might try to justify that using the law of large numbers and say, well,

the coin might land, let’s say, heads you win money, tails you lose money,

and you just lost money ten times in a row. But the law of large numbers says,

in the long run, it’s gonna go back to

one-half if it’s fair. So, the thinking goes, somehow you need to start

winning a lot to compensate. That’s not the way it works. The coin is memoryless. The coin does not care how many failures

or how many losses you had before. So the way it works is not that, if

you’re unlucky at the beginning, it somehow gets offset later

by an increase in heads. The way it works is through

what we might call swamping. And let’s say the coin landed

tails 100 times in a row. It doesn’t mean that the probability

has changed for the 101st flip. What it means, though, is that we’re

letting n go to infinity here, okay? So no matter how unlucky you

were in the first 100 or the first million trials, that’s

nothing compared to infinity, right? So those first million just get swamped

out by the entire infinite future, so that’s what’s going on here. Yeah, so
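A small simulation sketch of this swamping idea (the numbers here are my illustration): force 100 straight tails, then flip a fair coin a million times, and look at the overall proportion of heads.

```python
import random

# Swamping sketch (illustrative numbers, my choice): a run of 100 forced
# tails followed by a million fair flips.  The unlucky start is never
# "compensated" by extra heads; it's just diluted.
random.seed(1)

n_unlucky, n_fair = 100, 1_000_000
heads = sum(random.random() < 0.5 for _ in range(n_fair))  # fair flips only
proportion = heads / (n_unlucky + n_fair)
print(proportion)
```

The proportion comes out within a whisker of 0.5: the 100 forced tails can shift it by at most 100/1,000,100, about 0.0001.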

to tell you one little story about the law of large numbers,

a colleague of mine told me this story. He had a student once who

said he hated statistics. And of course,

my colleague was very shocked, like how can anyone hate statistics? And so he asked, why? How is it possible that

you hate statistics? And the student was an athlete;

he was training every day, and he had just learned

the law of large numbers. And he was very, very depressed by this

because he said, the law of large numbers says in the long run, I’m gonna only

be average and I can’t improve. So, well, of course, the fallacy there is that

we assumed iid. Now there are generalizations

of this theorem beyond iid, but we can’t just get rid of iid. So the iid is saying that the distribution

is not changing with time. That doesn’t mean that you can’t actually

improve; if your own distribution changes over time, then it is not iid. So don’t be depressed by this,

and in fact this theorem I think is crucial in order for

science to actually be possible. Because if you kind of

imagine a kind of hypothetical, counterfactual world where this

theorem was actually false. That would be really depressing to try

to ever learn about the world, right? Cuz this is saying,

you’re collecting more and more data. You’re letting your sample

size go to infinity. And this says,

you converge to the truth, right? And it would be some weird setting, where

you get more and more data, and more and more data, and yet you’re not able

to converge to the truth, right? So that would be really bad. So this is very intuitive, very important. Okay, so let’s prove this

or at least a similar version of it. So this is actually sometimes called

the strong law of large numbers. And we’re actually gonna

prove what’s sometimes called the weak law of large numbers. I don’t really like the terminology

strong and weak here, but that’s the standard terminology. The strong law of large

is what I just said, where it’s converging

point-wise with probability 1. That is just these random variables

converged to this constant, except on some bad event

that has probability 0. The weak law of large

numbers says that, for any c greater than 0, the probability that Xn bar is more than c away from mu goes to 0 as n goes to infinity. So it’s a very similar-looking statement. It’s not exactly equivalent. It’s possible to show, though you have to

go through some real analysis for this that is not necessary for

our purposes, that once you’ve proven the strong law, it

implies this form of convergence. This is called convergence in probability,

but the intuition is very similar. So just to interpret this statement

in words: we get to choose c, and we should interpret c as

being some small number. So let’s say we chose c to be 0.001, okay? And then it says that this thing

goes to 0, so in other words, this, as n goes to infinity again. So this says that if n is large enough,

then it’s extremely unlikely that

these are more than 0.001 apart. In other words, if n is large, it’s extremely likely that this is

extremely close to this, right? So it’s a very similar statement,

n is large, it’s extremely likely that the sample

mean is very close to the true mean. Okay, so that’s what it says. So we’ll prove this one, the weak law, because proving the strong law takes

a lot of work and a lot of time. The weak law,

it looks like it’s a nice-looking theorem. And it is a nice theorem, but we can prove it very easily

using Chebyshev’s inequality. Okay, so
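Before the proof, here is an empirical sketch of the statement itself (the simulation parameters are my own, not the lecture’s): estimate the probability that the sample mean of n fair-coin flips is more than c away from mu, and watch that probability shrink as n grows.

```python
import random

# Empirical sketch of the weak law (illustrative parameters, my choice):
# estimate the probability that the sample mean of n fair-coin flips is
# more than c = 0.05 away from mu = 0.5, for growing n.
random.seed(3)

p, c, reps = 0.5, 0.05, 2000
for n in [10, 100, 1000]:
    exceed = 0
    for _ in range(reps):
        xbar = sum(random.random() < p for _ in range(n)) / n
        exceed += abs(xbar - p) > c
    print(n, exceed / reps)
```

The estimated probability drops toward 0 as n grows, which is exactly what the weak law asserts.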

let’s prove the weak law of large numbers. So all we need to do is show

that this goes to 0, right? That’s what the statement is. So let’s just bound it using, this looks

pretty similar to what we were doing last time, where we did Markov’s inequality,

Chebyshev’s inequality. This looks similar to that

kind of stuff from last time, which is why I did that, well,

one reason for doing that last time. We need the inequalities anyway,

but it’s especially useful here. So we just need to show

this thing, the probability that Xn bar minus mu is greater than c in absolute value, goes to 0. By Chebyshev’s inequality,

this is less than or equal to the variance of Xn

bar divided by c squared, that’s just exactly

Chebyshev from last time. Now we just need the variance of Xn bar,

variance of Xn bar, well, just stare at the definition

of Xn bar for a second. There’s a 1 over n in front,

that comes out as 1 over n squared. And then since I’m assuming

they’re iid, and hence independent, the variance of the sum is just n

times the variance of one term. So that’s n sigma squared

divided by c squared, which is sigma squared over nc squared. Sigma is a constant, c is a constant,

n goes to infinity, so this goes to 0. So that proved the weak law of large

numbers, in just one line. Okay, so that tells us what happens

point-wise when we average a bunch of iid random variables, and

it converges to the mean. So let me just rewrite that statement. Then we’ll write the central limit

theorem and kind of compare them. So another way to write

what we just showed is that Xn bar minus mu

goes to 0 as n goes to infinity, which is a good thing to know. However, it doesn’t tell us what

the distribution of Xn bar looks like. So this is true with probability one,

but what is the distribution? What does the distribution

of Xn bar look like? So this says it’s getting closer,

Xn bar is getting closer and closer to this constant mu. Okay, but that’s not really

telling us the shape, and it’s not really telling us the rate. This goes to 0, but at what rate? So one way to think about problems like

that, when you have something going to 0, and you wanna study something about,

how fast does it go to 0? This is a useful trick, not just here, but as a general approach

to that kind of problem. We know this goes to 0, but

we don’t know how fast. One way to study that would be to multiply it

by something that goes to infinity, right. Now, if we multiply it by

something that goes to infinity, such that this times

this goes to infinity. Then we know that this part that blows

up is dominating over this part. And if we multiply by something

that goes to infinity, but this whole thing still goes to 0,

then that’s more informative, right? So what’s gonna happen is that we

can imagine multiplying here by n to some power and we’re gonna

show that there’s a power here, n to some power, fill in the blank. What we’re gonna show is that, if the power here is above some

threshold, n to that big power is gonna go to infinity fast,

this thing will just blow up. And if we put a smaller power than the

threshold here, then this is still going to infinity as long as this is a positive

power of n, this is still going to infinity, this part’s going to 0,

but this part’s dominating, right? So this term is competing with this term. This one goes to infinity,

this one goes to 0, okay? So then the question is what’s

that magic threshold value? And the answer is one-half. So that’s what we’re

gonna study right now. So we’re gonna take the square

root of n times xn bar minus mu. This is kind of the happy medium, where we’re gonna get a non-degenerate

distribution, that this is gonna converge in distribution to an actual distribution,

it’s not gonna just get killed to 0 or blow up to infinity, it’s actually

gonna give us a nice distribution. Okay, and I’m also gonna divide by the

sigma here, which makes it a little bit cleaner. So this is the central limit theorem now. I’m stating it, then we’ll prove it. The central limit theorem says,

if you take this and look at what happens

as n goes to infinity, it converges to standard

normal in distribution. By convergence in

distribution, what we mean is that the distribution of this converges

to the standard normal distribution. In other words, you could take the CDF. I mean these may be discrete or continuous

or a mixture of discrete and continuous. So it doesn’t necessarily have a PDF,

but every random variable has a CDF. So it says if you take the CDF of this, it’s gonna converge to capital Phi,

the standard normal CDF. So I think this is kind of an amazing

result that this holds in such generality, right, because I mean the normal is just

this one, standard normal is just this one particular, it’s a nice looking bell

curve, but that’s just one distribution. And those x’s they could be discrete,

they could be continuous, they could be extremely nasty

looking distributions, right? It could look like anything, the only thing we assumed was

that there was a finite variance. Other than that,

they could have an incredibly complicated, messy distribution. But it’s always gonna

go to standard normal. So this is one of the reasons why

the standard normal distribution is so important on the one hand and so,

widely used, because this is a theorem as n goes to infinity is what it says,

but the way it’s used in practice is then people use normal approximations all the

time and a lot of the justification for normal approximations is coming from this,

because this says that if n is large, then the sample mean will approximately

have a normal distribution. Even if the original data did not look

like they came from a normal distribution, when you average lots and

lots of them, it looks normal, okay. So this is in a sense a better

theorem than the law of large numbers, because it’s kind of more

informative to know the distribution, know something about the rate, and

you know it’s interesting that it’s, square root of n is kind of the power

of n that’s just right, right? A larger power it’s gonna blow up,

a smaller power it’s gonna go to 0. N to the one-half is the compromise,

then you always get a normal distribution. It’s more informative in some sense, but you should also keep in mind,

it is a different sense of convergence. Up here, we’re talking about the random

variables actually converging: literally, the random variables,

that is the sample means, converge point-wise, with

probability 1, to the true mean. Here, we’re talking about

convergence in distribution. So we’re not talking about

convergence of random variables. We’re just saying the distribution of this

converges to the normal 0, 1 distribution. So that’s a different sense

of convergence, but anyway, both of them are telling us what’s gonna

happen to Xn bar when n is large, okay? So well, let’s prove this theorem. Here’s another way to write this,

by the way, it’s good to be familiar with both ways. It’s just algebra to go

from one to the other, but they’re both useful enough

to be worth mentioning. Let’s just write the central limit

theorem in terms of the sum of X’s rather than in terms of the sample mean. So I’m just gonna take the sum of Xj,

j equals 1 to n. And so, we can either think of

the central limit theorem as, either think of it as telling us what

happens to the sample mean or we can think of it as telling us what happens

to the sum, or the convolution, okay? It’s equivalent because

they’re just a factor of, we just have to be careful not

to mess up the factor of n, but we can go from one to the other

cuz it’s just a factor of n. So the claim is that this is

approximately normal when n is large, but if we just have this thing,

this could easily just blow up. You’re just adding more and more terms. But somehow we wanna

standardize this first. So if we take this thing,

because this thing has mean n mu, right, so let’s subtract n mu. Because then it has zero mean,

because I just want to match. I wanna make the mean 0 and

the variance 1, so that it kind of matches up with that,

rather than just letting it blow up. So this is called centering,

we just subtracted it: by linearity, the mean is n mu, so

we just subtract n mu. And then let’s divide by

the standard deviation, this is just how we did

standard deviation before. So over there we showed that the variance

of Xn bar is sigma squared over n. And the variance of this sum

is just n sigma squared. So let’s just divide by

the standard deviation, right, which is square root of n times sigma,

okay? Cuz the variance is n sigma squared. So that’s just the standardized version. And the statement is again that this

converges to the standard normal in distribution. So if we take this sum and standardize it,

then it’s gonna go to standard normal. All right, so
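Here is a quick simulation sketch of that statement (the distribution and sample sizes are my illustration, not the lecture’s): take a badly skewed distribution, Exponential(1), form the standardized sum (Sn minus n mu) over (root n times sigma), and check that it behaves like a standard normal.

```python
import math
import random

# CLT simulation sketch (illustrative choices, my own): standardize the sum
# of n iid Exponential(1) draws, (S_n - n*mu) / (sqrt(n)*sigma), and check
# that about 68% of the standardized values land within 1 of 0, as they
# would for a standard normal.
random.seed(2)

n, reps = 500, 4000
mu = sigma = 1.0                      # mean and sd of Exponential(1)
within = 0
for _ in range(reps):
    s = sum(-math.log(1.0 - random.random()) for _ in range(n))
    z = (s - n * mu) / (math.sqrt(n) * sigma)
    within += abs(z) < 1
print(within / reps)
```

For a standard normal, the probability of landing within 1 standard deviation of 0 is about 0.68, and the simulated fraction comes out close to that even though the underlying distribution is nothing like a normal.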

now we’re ready to prove this theorem. It’s sort of just a calculation,

but it’s kind of a nice calculation in some ways,

and we’re gonna prove it. Well, this theorem is always true as

long as the variance exists. We don’t need to assume that the third

moment or the fourth moment exists. But the proof is much more complicated

to do it in that generality. So we’re gonna assume that the MGF exists,

then we can actually work with the MGFs. Because when you see this thing,

sum of independent random variables, then we know the MGF is gonna be

something useful if it exists. And there’s ways to extend this proof

to cases where the MGF doesn’t exist. But for our purposes,

we may as well just assume the MGF exists. So assume the MGF of Xj exists; let’s call it M(t). They’re iid, so if one of them

has an MGF, they all have the same MGF. Once we have MGFs, then our strategy

is to show that the MGFs converge. So that’s a theorem about MGFs, that

if the MGFs converge to some other MGF, then the random variables

converge in distribution, right? We had a homework problem related to that,

where you found that the MGFs converged to some MGF, and that implies

convergence of the distributions, right? Okay, so that’s the whole strategy. So that means all we need to

do is find the MGF of this and then take the limit, okay? So basically at this point,

it’s just like, write down the MGF, take the limit, and

use a few facts about MGFs, okay? So first of all, let’s just assume mu=0 and sigma=1, just to simplify the notation. This is without loss of generality, because of how we could rewrite this:

I wrote the standardized thing this way,

but I could’ve just written it as

standardizing each X separately. I could’ve written Xj minus mu, over sigma. So this would be standardizing each

of them separately, j=1 to n, and then we have a 1 over root n. That will be the same thing

that we’re looking at. This just says standardize

them separately first. But then you could just, I mean if

you want, just call this thing Yj. And once you have the central limit theorem

for the Yj, then you know that that’s true. So you might as well just assume that

they’ve already been standardized. And so just to have some notation,

let’s just let Sn equal the sum, S for sum, of the first n terms. And what we wanna show is that the MGF of Sn over root n,

that’s what we’re looking at, right? We let mu equal zero, sigma equal one,

so we’re looking at Sn over root n. And we wanna show that that goes

to the standard normal MGF. Right, so we just need to find this MGF,

take a limit. Okay, so let’s just find the MGF. So by definition, that’s the expected

value of e to the t times Sn over root n. And Sn is just the sum. So, and we’re assuming independence,

which means that these, you can write this as e to the t x1 over root n, e

to the t x2 over root n, blah, blah, blah. All of those factors are independent,

therefore, they’re uncorrelated. So we can just split it up as a product,

e to the t X1 over root n, blah, blah, blah, same thing, where e to the t Xj over root n

is the general term, right? I’m just using the fact that

those are uncorrelated, so we can write e of the product

of the expectations. But since these X’s are iid, these are really just the same

thing written, n times. So really,

this is just this thing to the nth power. And this thing,

that should remind you of an MGF, right? That’s just the MGF of X1, except that instead of evaluated at t,

it’s evaluated at t over root n. So really, that’s just the MGF, evaluated at t over root n

raised to the nth power. So that’s what we have. Now we need to take the limit

as n goes to infinity. So let’s just look at what’s gonna

happen here, n is going to infinity. This thing on the inside becomes M of 0. M of 0 is 1 for any MGF, right? Cuz e to the 0 is 1. So this is of the form 1 to the infinity

which is an indeterminate form, right? It could evaluate to anything. So going back to calculus,

how do you deal with 1 to the infinity, or 0 over 0, or whatever. Usually we try to reduce it to something

where we can use L’Hopital’s Rule for those problems, right? Or we can use a Taylor

series type of thing. So, how do we get into that form? Take the log,

because this looks like 1 to infinity. If we take the log,

it’ll look like infinity times log of 1. So it’ll look like infinity times 0,

take logs. Then we just have to remember to

exponentiate at the end to undo the log. Okay, so

let’s write down then what we have. After taking the log, and

we’re trying to do a limit, so we’re doing the limit as n goes

to infinity, and we take the log. It’s n log M(t over root n). So that’s of the form infinity times 0. If we want 0 over 0 or

infinity over infinity, we can just write it as 1

over n in the denominator. Okay, and now it’s of the form 0 over 0. So we can almost use L’Hopital’s Rule,

but not quite. We have to be a little bit careful. Because first of all,

I’m assuming n is an integer, and you can’t do calculus on integers. Secondly, it’s just kind of, even if we

pretended that n is a real number and then the derivative of n would

be minus 1 over n squared, and that’s kind of annoying to deal with. And it’s kind of annoying to

deal with this square root here. So let’s first make a change of variables. Let’s just let y=1 over root n and

also let y be real, not necessarily of the form 1 over

square root of an integer, okay? So it’s the same limit, just written

in terms of y instead of in terms of n. So as n goes to infinity y goes to 0 and

1 over n is y squared, so the denominator is just y squared. The reason I do it this way is

that 1 over root n is just y by definition but

then the numerator is just log M of yt. That’s a lot easier to deal with

because we got rid of the square roots. So it’s still of the form 0 over 0. So we’re gonna use L’Hospital’s Rule. So limit, y goes to 0. Take the derivative of the numerator and

the denominator separately. The derivative of the denominator is 2y. The derivative of the numerator, well we’re just going to

have to use the chain rule. Derivative of log something

is 1 over that thing, so that’s 1 over M of yt, times the derivative

of that thing which again by the chain rule is M prime of

yt times the derivative of yt. We’re treating t as constant,

we’re differentiating with respect to y. So t comes out. And now let’s see what we have. Let’s just summarize

a couple facts about MGFs. So M of t is the expected

value of e to the tX1. So M of 0 = 1, okay? And when we first started doing MGFs, we

said that we take derivatives of the MGF and evaluate it at 0. We get the moments, that is why it’s

called the moment generating function. So the first derivative at 0 is the mean,

but we assume that mu is 0. So this is 0, here. And the second derivative,

while we’re at it: the second derivative is the second moment,

but since we assumed that the variance is 1 and the mean is 0,

the second moment is 1, okay? So over here, as we let y go to 0,

denominator’s still going to 0. Numerator’s also going to 0,

because M prime of 0 is 0, so it’s still of the form 0 over 0, so

let’s just do what we were told again. So first I can simplify it a little bit,

this t can come out, because that’s acting as a constant,

and the 2 can come out. And limit y goes to 0 and

this M of yt part, that’s just going to 1. So we can write that as part

of a separate limit, but that other limit is just going to 1. You can think of it as just

the limit of this part times the limit of the rest of it. But that part’s just going to 1,

so we can get rid of that. So really, what’s left is just the limit of M prime of yt divided by y. Everything else is gone, so

it’s actually pretty nicely simplified. Now, using L’Hospital’s Rule

a second time, now the derivative of

the denominator is just 1, okay? And for the numerator,

chain rule, M double prime of yt. That was a t not a t squared,

but now it’s a t squared, because by the chain rule, derivative of

yt is t, so we have a t squared over 2. Now when we let y go to 0,

now M double prime of 0 is 1, so now this limit is just 1. So we get t squared over 2, and that’s what we wanted, because t squared over 2 is the log of e to the t squared over 2, and e to the t squared over 2 is

exactly the normal 0,1 MGF. Okay, so that’s the end of

the proof of the central limit theorem. All we had to do was use basic facts

about MGFs and L’Hospital’s Rule twice. And there we have one of the most famous

and important theorems in statistics. Now,
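The limit at the heart of the proof is easy to check numerically for one concrete choice (my example, not the lecture’s): take Xj = plus or minus 1 with probability 1/2 each, so mu = 0, sigma = 1, and M(t) = cosh(t); then M(t over root n) to the n should approach e to the t squared over 2.

```python
import math

# Numerical check of the MGF limit for X = +/-1 with probability 1/2 each
# (my example): here M(t) = cosh(t), and M(t / sqrt(n))**n should approach
# exp(t**2 / 2), the N(0,1) MGF, as n grows.
t = 1.0
for n in [10, 1_000, 1_000_000]:
    print(n, math.cosh(t / math.sqrt(n)) ** n)
print(math.exp(t ** 2 / 2))  # the target: N(0,1) MGF at t = 1
```

By n of a million the two values agree to about six decimal places.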

there are more general versions of this, like you can extend this in various

ways where they’re not iid, but they still have to satisfy

some assumptions, right. But anyway,

this is the basic central limit theorem. Okay, so that’s pretty good. Let’s do an example,

like how do we actually use this, for the sake of approximations,

things like that. Last time I was talking about

the difference between inequalities and approximations, right? And we talked about Poisson

approximation before. We haven’t really talked

about normal approximation. This result is giving us the ability

to use normal approximations when we’re studying sample mean and

is large, okay? So historically, though,

the first version of the central limit theorem

that was ever proven, I think was for binomials, okay? So what we’re saying is that binomial np under some conditions

will be approximately normal. And well, in the old days that was an

incredibly important fact, because they didn’t have computers. With

binomials, how do you deal with, like, n choose k, when n is large and

k, you have all these factorials. You can’t do these things by hand. Now we have fast computers,

so it’s a little bit better. But it’s still a lot easier working

with normal distributions than binomial distributions most of the time,

right? And even now factorials still grow so

fast that even with a fast computer with large memory and

everything, you may quickly exceed its ability when you’re doing

some big complicated binomial problem. And normals have a lot of nice properties,

as we’ve seen, okay? The question is,

when can we approximate a binomial using a normal, and

how do we do that, okay? So this is not the binomial approximation to the normal, but the other way around:

I’ll say the binomial is approximated by the normal, the normal approximation to the binomial. When is that valid? To contrast it with

the Poisson approximation that we’ve seen before, okay? So let’s let X be binomial np. And as we’ve done many times before, we can represent X as

a sum of iid Bernoulli. Right?
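This representation is easy to sketch in code (n, p, and the repetition count are illustrative choices of mine): simulate X as a sum of n Bernoulli(p) indicators, and check that the average over many such draws is near np, the binomial mean.

```python
import random

# Sketch of X = X_1 + ... + X_n with the X_j iid Bernoulli(p) indicators
# (n, p, reps are illustrative).  Averaging many such draws should give
# roughly n*p, the Binomial(n, p) mean.
random.seed(4)

n, p, reps = 50, 0.3, 20_000
draws = [sum(random.random() < p for _ in range(n)) for _ in range(reps)]
print(sum(draws) / reps)  # should be near n*p = 15
```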

Well, these are just 1 if success on the jth trial, 0 otherwise, so

the Xj are iid Bernoulli p. So this does fit into

the framework of the central limit theorem that is we are adding

up iid random variables. So the central limit theorem says that,

if N is large, this will be

we have standardized it, okay? So suppose we wanted to approximate,

suppose we’re interested in the probability

that x is between A and B. And I want to approximate that, first we’ll do equality then

we’re approximating it. So, I mean if you had to do this on

a computer what you would do or by hand, which you wouldn’t want to,

would be to take the PMF and sum up all the values of

the PMF from A to B, right. So okay, you would not want to do

that by hand most of the time. But suppose we just want an approximation

for this, not the exact thing. So first, the strategy is just gonna

be to take x and standardize it first. So we’re gonna subtract the mean,

so we know that the mean is NP, and we’re gonna divide by

the standard deviation, which we know is the square root of NPQ, where

Q is 1 minus P. So, I’m just standardizing it right now.

we haven’t done any approximations yet. And then, now that we’ve standardized it, we can apply the central limit theorem,

if N is large enough, right? The central limit

theorem says N goes to infinity; that doesn’t answer the question

of how large N has to be. And for that, there’s various theorems and

various rules of thumb. A lot of books will say,

how large does N have to be? And some books at least will say 30,

and that’s just a rule of thumb. That’s not always gonna work for all,

there’s separate rules of thumb for the binomial, like you want N

times P to be reasonably large and N times 1 minus P to be large,

there are different rules of thumb. But anyway, if N is large enough, then what we’ve just proven is that

this is gonna look like it has a normal distribution, because it’s a standardized sum of i.i.d. things, and we standardized it correctly because we already knew the mean and the variance. Okay, so at this point it becomes an approximation. Now we’re going to use

the normal approximation: we’re going to say this is approximately normal. And if I want the probability that a normal is between two values, that’s just the CDF at one endpoint minus the CDF at the other, right? I mean, this is discrete, but we’re approximating it using something continuous, so we just say: integrate the PDF from here to here. And by the fundamental theorem of calculus, that just means taking the difference of the CDF values, okay. So we’re just gonna do Φ((b − np)/√(npq)) − Φ((a − np)/√(npq)). So that would be the basic

normal approximation. I’ll talk a little bit later about how to improve this approximation.
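Here’s a quick numerical sketch of this basic normal approximation (my own illustration in plain Python, not code from the lecture; Φ is computed from `math.erf`):

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, p = 100, 0.5
q = 1 - p
a, b = 45, 55

# Exact answer: sum the Binomial(n, p) PMF from a to b.
exact = sum(math.comb(n, k) * p**k * q**(n - k) for k in range(a, b + 1))

# Basic normal approximation: Phi((b - np)/sqrt(npq)) - Phi((a - np)/sqrt(npq)).
mu, sd = n * p, math.sqrt(n * p * q)
approx = Phi((b - mu) / sd) - Phi((a - mu) / sd)

print(round(exact, 4), round(approx, 4))  # roughly 0.729 vs 0.683
```

The gap between the two numbers comes from approximating a discrete distribution with a continuous one; the continuity correction discussed later in the lecture shrinks it considerably.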

The Poisson approximation: we talked before about the fact, and we proved, that if n goes to infinity and p goes to 0 with np fixed, then the binomial distribution converges to the Poisson distribution. So for the Poisson approximation, what we had was

n is large but p was very small, right? And we let lambda equal np, with lambda moderate. And the most important thing is that p is small here, p is close to 0. We proved it in the case where n goes to infinity and p goes to 0, okay? So Poisson is relevant when we’re dealing with a large number of very rare, unlikely things. That’s really in contrast to the normal approximation: there we still want n to be large,

but if you kind of think intuitively about when this is gonna work well, we actually want p to be close to one half. Because think about the symmetry: if you have a binomial with p equals one half, that’s a symmetric distribution. And the normal is symmetric; every normal distribution is symmetric. If p is far from one half, then the binomial is very, very skewed, and in that case it doesn’t make that much sense to approximate it using a normal. So the normal approximation works well when p is near one half, while if p is very small, the Poisson approximation makes a lot more sense. However, think about the statement

of the central limit theorem. In that theorem I never said p was close to one half; in fact, it was a completely general theorem, we didn’t even have p in the statement of the central limit theorem, but somehow this still has to eventually work. As a practical matter, as an approximation, if p is close to one half this is going to work quite well; if n is like 30 or 50 or 100, it will work fine. But if p is .001, the central limit theorem is still true, as n goes to infinity it’s gonna work, okay. But if n is not that enormous a number, then it’s gonna be a pretty bad approximation. And let’s try to reconcile these statements: if we let n go to infinity and p be very small, I still said that as n goes to infinity, it’s still gonna converge to normal, just much more slowly, right? So, how could the binomial

look both normal and Poisson? Well, the answer is that the Poisson also looks normal. So if you have a Poisson(λ) where λ is very large, that’s also gonna look approximately normal, so there is a case where those come together.
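To make the contrast concrete, here’s a quick numerical sketch (my own illustration using only Python’s standard library, not code from the lecture): for n large and p small, the Poisson(np) approximation to the binomial is much closer than the basic normal approximation.

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, p = 1000, 0.002   # n large, p small: Poisson territory
lam = n * p          # lambda = np = 2
a, b = 0, 4

# Exact Binomial(n, p) probability of a <= X <= b.
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, b + 1))

# Poisson(lambda) approximation: sum of e^-lam * lam^k / k!.
poisson = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(a, b + 1))

# Basic normal approximation with mean np and sd sqrt(npq).
sd = math.sqrt(n * p * (1 - p))
normal = Phi((b - lam) / sd) - Phi((a - lam) / sd)

print(round(exact, 4), round(poisson, 4), round(normal, 4))
```

With these numbers the Poisson approximation agrees with the exact binomial to a few decimal places, while the normal approximation is off by roughly a tenth; this is the skewed, small-p regime where the normal fits poorly.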

Okay, one last thing about this is that there is something kind of weird here, in the sense that we’re approximating a discrete distribution using something continuous. What if we wanted to approximate the same kind of problem, but with something added? Let’s just look at that, to see what could go wrong. What if we look at the case a equals b? So then we’re just saying

the probability that X equals a, that is, we’re approximating the binomial PMF. And one kind of weird thing about this is: the left side would change if we changed the inequalities to strict ones, but the right side would not. As soon as we say that this is approximately normal, we don’t care about that anymore. So there’s something called the continuity

correction, which I just wanted to briefly mention. It’s an improvement to deal with the fact that you’re using something continuous to approximate something discrete. It’s often not explained very well, but if you understand what it does in this simple case, then it’s not hard to see the idea. The idea is that if you just said this is

approximately normal, then you would just say zero, right? Because a single point has probability zero for a continuous distribution, and that’s not very useful, right? We want something more useful than zero. So the idea is simply to rewrite this. Here, let’s assume a is an integer and X is discrete. Well, X equals a is the same thing as saying that X is between a minus one-half and a plus one-half, right? So we just use that fact first. So for each value in the range, replace it by an interval of length 1 centered there; that’s exactly the same event, because X is an integer anyway. But here at least we’re giving the normal an interval to work with instead of just saying zero, so that improves the approximation. Anyway, that’s the central limit theorem. All right, so see you next time.
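As a postscript, here’s a quick numerical check of the continuity correction (my own sketch in plain Python, not code from the lecture):

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, p = 100, 0.5
a = 50
mu, sd = n * p, math.sqrt(n * p * (1 - p))

# Exact PMF: P(X = a) for X ~ Binomial(n, p).
exact = math.comb(n, a) * p**a * (1 - p)**(n - a)

# Naive plug-in gives zero: a continuous distribution puts no mass on a point.
naive = Phi((a - mu) / sd) - Phi((a - mu) / sd)

# Continuity correction: treat {X = a} as {a - 1/2 <= X <= a + 1/2}.
corrected = Phi((a + 0.5 - mu) / sd) - Phi((a - 0.5 - mu) / sd)

print(round(exact, 4), naive, round(corrected, 4))
```

The naive point evaluation is exactly zero, while the half-interval version matches the exact binomial PMF to several decimal places, which is exactly what giving the normal "an interval to work with" buys you.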