
STAT101

Statistics can be divided into two parts:
* Descriptive Statistics
* Inferential Statistics

We start with Descriptive Statistics.
A population is a set of objects/individuals/things whose numerical characteristic(s) we are interested in. For example, we may be interested in the heights of all residents of the US, or the number of pages in all the books in the US Library of Congress.

One way to describe a population is to list all the numerical values; for example we can list the heights of all the residents of the US, a database that will have almost 300 million entries.

Alternatively we can list some characteristics of the population, like the mean and standard deviation. This will just be 2 numbers instead of 300 million, but will give us an idea of what the population looks like.

The mean of a population is the average of all the individual values in it and is represented by the Greek letter μ:

μ = (Σ_{i=1 to N} xi) / N

The above notation is a formal way of writing that μ equals the sum of xi as i varies from 1 to N, divided by N.

xi is the value of an individual data point within the population, and N is the total number of data points in the population. Specifically, note that the symbol Σ is used to represent the sum of whatever follows it, with i (the parameter of summation) varying from the amount appearing at the lower-right of Σ, i.e. 1, to the amount appearing at the top-right of Σ, i.e. N.

So we have μ to simply be the sum of the values of all data points in the population divided by the population size.

σ is a measure of how variable the values are within the population. To measure σ we find the deviation of an individual data point from the mean μ, then square it, and add these squared deviations for all data points. Why do we square? Because if we simply add without squaring we will get a big grand ZERO as a result.

Finally we once again divide by N, and the result is the variance σ², the square-root of which is the standard deviation σ.

And the standard deviation σ is given by:

σ = [ Σ_{i=1 to N} (xi - μ)² / N ]^(1/2)

There is actually more to the standard deviation σ than just avoiding getting zero by summing. There are other ways of avoiding the zero, for example by taking absolute values or the fourth power. However σ is special; for example it is a parameter of the normal distribution, which is probably the most important probability distribution. We will get to that later.

How are means and standard deviations useful? For example, we may find that the mean height of adult men in the US is 5'9" and the standard deviation 4", whereas the mean height of adult Dutch men is 5'10" and the standard deviation 2". This information enables us to form an idea about the heights of the US male and Dutch male populations. Alternatively we could actually list the heights of all adult males, but then we would have to deal with many million values instead of just 2 for each country.

Think of a bowl with many Red Balls and Blue Balls, say a total of 1,000,000 (only a few of which are shown in the picture).

To apply numerical methods we assign numerical values of 0 to Reds and 1 to Blues. That implies, for example, that if the population has 30% (300,000) Reds and 70% (700,000) Blues, the population mean will be 0.70.

Of course we could have assigned other values, for example 5 to Reds and -1 to Blues; that would just be an origin shift and scaling. For the remainder of this course assume that the pot actually has 52% Reds and 48% Blues.
If we actually counted all the balls in the pot, we would find 480,000 Blues and 520,000 Reds. Therefore:

μ = (520K*0 + 480K*1) / 1,000,000 = 0.48

σ² = (520K*(0 - 0.48)² + 480K*(1 - 0.48)²) / 1,000K = 0.2496

=> σ = 0.4996
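The same numbers can be checked with a minimal Python sketch (assuming Reds are coded 0 and Blues 1, as above), applying the definitions of μ and σ directly:

from math import sqrt

# Population of 520,000 Reds (coded 0) and 480,000 Blues (coded 1)
n_red, n_blue = 520_000, 480_000
N = n_red + n_blue

# Population mean: sum of all values divided by N
mu = (n_red * 0 + n_blue * 1) / N

# Population variance: average squared deviation from the mean
var = (n_red * (0 - mu) ** 2 + n_blue * (1 - mu) ** 2) / N
sigma = sqrt(var)

print(mu, var, round(sigma, 4))   # 0.48, 0.2496, 0.4996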

Distributions:
The population of Red and Blue Balls is an example of a binomial (discrete) distribution: a data point can take only one of 2 values.

There are many variables that can take an infinite number of values, for example height. These variables have a continuous distribution. In such distributions a particular value has a very very small probability (tending to Zero).
For example, consider a height of exactly 5'10". You will be hard put to find anyone in a population who is absolutely exactly 5'10". Even those who are very very close will probably differ by at least a billionth of an inch. In that case it becomes meaningful to talk about the number of individuals for a range of outcomes (say heights with values between 5'9" and 5'10") rather than individual outcomes (like 5'9" exactly or 5'10" exactly). Distributions that give density (amounts for ranges) are continuous distributions.

The most useful and hence famous example of a continuous distribution is the Normal distribution.

We see above that the Normal Distribution is symmetric around the mean μ. For example, the probability density 2 standard deviations above the mean is the same as that 2 standard deviations below the mean.

What in the world is normally distributed? Well, heights aren't! Suppose an apple orchard produces a million apples and puts them randomly into bags of a thousand apples each. Then the weights of the bags would be approximately normal.

Values for the normal distribution are usually given in the form of cumulative tables. These tables are given in terms of the z variable, which is:

Z = (Variable Value - Mean)/Std Dev

For example, a variable that is one standard deviation above the mean will have a value for Z = 1.

Normally distributed data (a variable) can be origin-shifted and scaled to a standard distribution called the Standardized Normal Distribution, or Z-distribution. The origin shifting is done by subtracting the mean, and the scaling is done by dividing the variable by its standard deviation. Once this is done, the data has the same distribution as all other standardized normal variables. The mean of the Z variable is 0 and the standard deviation is 1, and tables are available for cumulative density values.

For example, from the above we see that the area (probability) of the random variable having a value from minus infinity to 1 standard deviation above the mean (Z = 1) is 0.8413 = 84.13%.

As the probability from minus infinity to the mean is 50% due to symmetry around the mean, this also implies that the probability of the random variable lying between the mean and the mean plus one standard deviation is: 84.13% - 50% = 34.13%.
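These table values can be checked with a short Python sketch; the standard normal CDF is built here from math.erf, which is an assumption of the sketch (any normal table gives the same figures):

from math import erf, sqrt

def phi(z):
    # Cumulative probability of the standard normal from minus infinity to z
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

print(round(phi(1.0), 4))        # 0.8413: area up to Z = 1
print(round(phi(1.0) - 0.5, 4))  # 0.3413: area between the mean and mean + 1 std dev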

Example: Suppose you are told that the consumption of potatoes by the Irish is normally distributed with a mean of 30 lbs a year, and a standard deviation of 5 lbs. The population of Ireland is 4 million. Calculate the number who consume between 27 lbs and 31 lbs.

The deviation of 27 lbs from the mean of 30 lbs = -3 lbs = 3/5 = 0.6 std dev below the mean. Therefore the probability mass from 27 lbs to 30 lbs = (0.7257 - 0.5000) = 0.2257. Similarly from 30 lbs to 31 lbs, the deviation = 1/5 = 0.20 std dev above the mean, and the probability mass = (0.5793 - 0.5000) = 0.0793.

So the total probability = 0.2257 + 0.0793 = 0.3050, and the number of Irish consuming between 27 lbs and 31 lbs = 0.3050 * 4M = 1.22M.
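A sketch of the same calculation in Python, again assuming the erf-based normal CDF:

from math import erf, sqrt

def phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mean, sd, population = 30.0, 5.0, 4_000_000

# Probability that consumption lies between 27 lbs and 31 lbs
p = phi((31 - mean) / sd) - phi((27 - mean) / sd)
print(round(p, 4))               # about 0.3050
print(round(p * population))     # about 1.22 million people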

Inferential Statistics
Think again of the bowl with many Red Balls and Blue Balls. Now instead of counting the number of Reds and Blues, we will Sample and make inferences about the totals.

A random variable is one whose value is not known to us. In the future the value will be known to us (if observable). For example, the value of a random draw from the population in the jar is unknown to us; it is a random variable, but once we have made the pick it will be either 0 or 1. We may know the probability of getting a particular value of a random variable; for example, in a jar containing 52% Reds and 48% Blues, the probability of getting 0 is 52% and of getting 1 is 48%. This is the probability distribution for this random variable.
You want to estimate what fraction of the total is Red and what fraction is Blue. You don't have time to count them all, so you just dip your hand in and select a few randomly (a Random Sample). From your sample you make an Estimate of the fractions of Blues and Reds.

The branch of mathematics that deals with such issues is Inferential Statistics. When you take a Random Sample from the jar, each ball has an equal Probability of being picked.

Suppose you pick up one ball every time you dip your hand into the jar. If there are a total of N balls in the jar, then each ball has a probability of 1/N of being randomly picked on every dip of your hand.

The total probability that any one of the balls will be picked is the sum of the probabilities for all balls: 1/N + 1/N + ... + 1/N (for the N balls) = 1. So the probability that some ball will be picked on one dip of the hand is 1.

A probability of 1 is a Certain event. The reason we can add the probabilities of 1/N is because the events (of a particular ball being picked) are Mutually Exclusive. That is, if one ball is picked then another can't be picked.

When you pick a ball, the result is either Red or Blue. So your data takes one of two values, which is the simplest possible result. Your data is said to be Binomial, and the Probability Distribution of the data is also Binomial.


Your data has a Probability Distribution because every time you dip your hand into the jar, you cannot predict Ex-ante (prior to the dip) what data you will get, but you can say what the probability will be. For example, if the fraction of Reds equals 52%, then you can say the probability of picking a Red (randomly) is 52% and a Blue is 48%. The Probability Distribution then is (Red = 0.52; Blue = 0.48).
Most of the time your data will take many values, in which case there will be probabilities associated with not 2, but many possible outcomes. If the outcomes are Continuous (that is, not discrete) the number of possible outcomes becomes infinite, and each individual outcome has a very very small probability (tending to Zero).

In that case it becomes meaningful to talk about probabilities for a range of outcomes (say outcomes with values between 2.5 and 3.1) rather than individual outcomes (like 2.5 or 2.9 or 3.1).

When a random variable can take an infinite number of values, the probability of one particular value (usually) tends to zero. Then we do not have probabilities associated with particular outcomes, like a height of 5'10", but probability densities given by a (continuous) probability distribution. The probability for a range of outcomes is given by the integral of the density over the range, that is the area under the probability (density) distribution curve. If the range is small we can approximate the area (probability) as the average of the densities at the endpoints of the range, multiplied by the length of the range.

The probability distributions then become Continuous Probability Distributions, the most famous example of which is the Normal Distribution. Other examples are the Chi-Square, t, and F distributions.

Suppose you have dipped your hand in the pot 80 times and find that you have picked 38 Reds and 42 Blues, that is a frequency of 0.475 for Reds and 0.525 for Blues. What can you conclude?

Now there are different ways in which we can think of the situation and different questions we can answer. Two major ways in which we can do statistical analysis are:
1) Frequentist (or Classical) Statistics
2) Bayesian Statistics

Though there is disagreement between the above two approaches, each is logically valid. They just view the world differently and ask different questions. Classical Statistics is by far the more commonly used, and that is what we will study. I will also discuss Bayes' Rule in an Advanced MBA Lecture.
As is proper for a course of this nature, I will take an applied approach. That is, I will sacrifice rigor while striving to give you the important intuitions. Remember that the question of estimating the population frequency of Reds and Blues is equivalent to estimating the population mean.

To make an inference from the data we have (Reds = 0.475 and Blues = 0.525, for a sample mean of 0.525) we first need to understand what kind of samples a binomial population will generate.

For example, it is quite likely that our population (52% Reds and 48% Blues) will generate a sample with, say, 40 Reds and 40 Blues, but it is NOT IMPOSSIBLE that it will generate a sample of 0 Reds and 80 Blues.

If we see that our sample has 0 Reds and 80 Blues, we cannot therefore conclude that the population does not have 52% Reds and 48% Blues, but we can conclude it is UNLIKELY. Similarly, if we see a sample of 70 Reds and 10 Blues, we would conclude that it is LIKELY that the population has more Reds than Blues.
Essentially, to make inferences from samples we first need to know with what probabilities a population produces different samples. Our inferences are in terms of what is likely and what is not; that is, from a sample we cannot make an inference that a population is exactly something, only that it is likely to be something and unlikely to be something else. This likeliness/unlikeliness is expressed in terms of rejection of (or failure to reject) a hypothesis, or in terms of confidence intervals for the estimates of the mean, standard deviation, etc.

Our binomial population generates samples according to a binomial distribution, which involves Combinatorial mathematics (Permutations and Combinations). Instead of going there we will use an Asymptotic result, the Central Limit Theorem (CLT).

CLT is undoubtedly the most useful result in statistics. It says the mean of a large sample from a population (discrete or continuous) is distributed according to a Normal Distribution, with mean equal to the population mean, and standard deviation equal to the population standard deviation divided by the square-root of the sample size. CLT applies to populations with all kinds of distributions as long as they don't have some unusual features (like infinite variance).

As a practical matter, a sample larger than 30 can be thought of as a large sample.

Now think of repeated samples of 80 from our population. How would the means of these samples be distributed? CLT says that the means would be distributed with a mean of 0.48 and a standard deviation of 0.4996/80^(1/2) = 0.05586.

Or, when you do repeated sampling of the population many times, you will see that about 68.26% of the means of the samples lie between 0.48 - 0.05586 = 0.42414 and 0.48 + 0.05586 = 0.53586. I repeat, the result given by CLT is that the sample means are normally distributed with mean 0.48 and standard deviation 0.05586.
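A small simulation sketch (using Python's random module, an assumption of this illustration) that draws repeated samples of 80 from the 52% Red / 48% Blue population; the mean and standard deviation of the sample means should come out close to 0.48 and 0.05586, and about 68% of them should fall within one such standard deviation of 0.48:

import random
from math import sqrt

random.seed(0)
p_blue, n, trials = 0.48, 80, 20_000

# Each sample mean is the fraction of Blues (coded 1) among 80 random draws
sample_means = [
    sum(1 if random.random() < p_blue else 0 for _ in range(n)) / n
    for _ in range(trials)
]

m = sum(sample_means) / trials
s = sqrt(sum((x - m) ** 2 for x in sample_means) / trials)
within_1sd = sum(abs(x - 0.48) <= 0.05586 for x in sample_means) / trials

print(round(m, 4), round(s, 4))  # close to 0.48 and 0.0559
print(round(within_1sd, 3))      # close to 0.68, as CLT predicts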

You may wonder how a discrete distribution, in which the data takes values of only 0 or 1, leads to a continuous normal distribution for the means. The answer is that a sample of 80 has means that take one of 81 values, starting from 0 (80 Reds, 0 Blues) and increasing in steps of 0.0125 (for one more Blue) all the way to 1 (0 Reds, 80 Blues). Though the mean is still discrete, the number of possible values it takes is large enough (81) to think of it as approximately continuous.
The next thing to do is to define precisely the inference we wish to test. We call it the Null Hypothesis. This is the hypothesis we test using the sample, and we either Reject it, or Fail to Reject it.

Let's say the Null Hypothesis (represented by H0) is:
H0: μ = 0.40, i.e., the population is 60% Reds and 40% Blues.

We are almost ready to make our first statistical inference, exciting! CLT says that the mean of the probability distribution of the sample mean is the population mean μ. So if we had many samples we could find an estimate of the mean of the means. However we don't have many samples, but just one. So how do we proceed?

This is the joy of statistics, to be able to say something logical about something of which we have only partial knowledge! We do the following two things:

1) We start by assuming the Null Hypothesis is true. Then we will have the distribution for sample means given by CLT:

μ = 0.40, and σ² = 60%*(0 - 0.40)² + 40%*(1 - 0.40)² = 0.24, or σ = 0.48990.

This implies the sample means will be distributed normally with mean 0.40 and std dev of 0.4899/80^(1/2) = 0.05477.
Note: As the actual population has 48% (not 40%) Blues, the actual distribution of sample means will have a mean of 0.48 and a std dev of 0.05586. However we do not know what the actual population is (if we did, the game would be over and we could go home), so we proceed assuming the hypothesis that we are testing to be true.

2) Next we look at the sample mean of the sample we have taken, and ask the question:

Can we say that it is really UNLIKELY that we would have got this mean from a distribution with a mean of 0.40 and a std dev of 0.05477? If the answer is Yes, then we Reject the Null Hypothesis. If the answer is No, we Fail to Reject the Null Hypothesis.
Next we need to define what we mean by UNLIKELY. It is defined in terms of how far away the sample mean is from the mean as per the Null Hypothesis. This distance is presented in terms of the percentage of the total area under the normal curve.

We decide before taking the sample what the level for rejection is. The choice is ours, but generally two levels are used: 95% and 99%. A level of 95% means that:

IF the sample mean DOES NOT lie in the 95% area surrounding the mean, THEN we infer that it is unlikely that the sample was produced by a distribution with the characteristics of the Null Hypothesis, and therefore REJECT the Null Hypothesis. OTHERWISE we FAIL TO REJECT the Null Hypothesis.

From the above picture we see that the Null Hypothesis would be rejected if the Z for the sample mean is greater than 1.96 or less than -1.96. Similarly, if we set our standard for rejection at 99%, then we reject the Null only if the sample mean does not lie in the 99% area surrounding the mean.

Now the value of the mean we have for the sample is 0.525. We are assuming (having not yet rejected the Null Hypothesis) that this has been produced by a normal distribution with a mean of 0.40 and a standard deviation of 0.05477. The number of standard deviations by which the value 0.525 exceeds the mean is:

Z = (0.525 - 0.40) / 0.05477 = 2.28

The above statistic is called the Z statistic. Now if we had chosen our level for rejection to be 95% BEFORE taking the sample, then we will REJECT the Null Hypothesis of μ = 0.40. However if we had chosen our level for rejection to be 99% BEFORE taking the sample, then we will FAIL TO REJECT the Null Hypothesis of μ = 0.40.

You should understand the above example of statistical inference well. We will see many different examples of statistical inference in this course, but the underlying logic presented above, which is the Classical (Frequentist) approach, remains the same.

The p-value
With a Z of 2.28, the area under the curve for the right and left tails (the area for which Z < -2.28 or Z > +2.28) equals 1.13% * 2 = 2.26% = 0.0226. This is referred to as the p-value. It is the probability that the population (assuming the Null is true) would produce a sample mean at least as far from 0.40 as the 0.525 we observed.
Often statistical inference is approached not by deciding beforehand the confidence level (say 95% or 99%) but by reporting the p-value and letting those considering the results judge for themselves whether the p-value warrants rejection of the Null Hypothesis.

The smaller the p-value, the smaller the probability that the population under the Null Hypothesis would have produced the sample. Hence the smaller the p-value, the more confidence we can have in rejecting a Null Hypothesis.
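A sketch of the whole test in Python (the erf-based normal CDF is again an assumption of the sketch); it reproduces the Z of about 2.28, the p-value of about 0.0226, and the opposite decisions at the 95% and 99% levels:

from math import erf, sqrt

def phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Null hypothesis: population mean 0.40 (60% Reds, 40% Blues)
mu0 = 0.40
sigma0 = sqrt(mu0 * (1 - mu0))           # 0.4899 under the Null
n, sample_mean = 80, 0.525

se = sigma0 / sqrt(n)                    # 0.05477, std dev of sample means under the Null
z = (sample_mean - mu0) / se             # about 2.28
p_value = 2 * (1 - phi(abs(z)))          # two-tailed p-value, about 0.0225 (the 2.26% above, up to rounding)

print(round(z, 2), round(p_value, 4))
print("Reject at 95%:", abs(z) > 1.96)   # True
print("Reject at 99%:", abs(z) > 2.576)  # False (2.576 is the two-tailed 99% critical value)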

Recap of the Classical (Frequentist) Approach to Statistical Testing
1) We start with a population that we want to test a Null Hypothesis about, say that the mean of the population equals a certain value.
2) We describe an appropriate test statistic. Appropriate implies a statistic that we will be able to test.

3) Assuming that the Null Hypothesis is true, we find the distribution of the test statistic. For means, the CLT is a very powerful result that provides us the distribution when the sample size is large (in practical terms > 30). It is the statistician's job to tell us what the test statistic is, how it is distributed, and to provide us tables for critical values. We will take these as given rather than delving deep into these issues.

4) We optionally define the level of confidence (level of significance/rejection). Alternatively we skip this step and report the p-value at step 6.
5) We take a sample of the population and estimate the test statistic.
6) On the distribution of the test statistic implied by the Null, we see where the test statistic lies. If it lies in the "regions of rejection" away from the Null value we REJECT the Null Hypothesis. Otherwise we FAIL TO REJECT the Null Hypothesis. The regions of rejection are those which are UNLIKELY for the test statistic to end up in if the Null Hypothesis is true, for example the right and left tails of the standard normal distribution when we were testing means. If the test statistic does not lie in the regions of rejection we conclude the sample estimates are NOT UNLIKELY ENOUGH to justify rejecting the Null Hypothesis.
You should get comfortable with the above logic. This is what statistical testing is; it will appear in many different forms, but the logic remains the same. Also, various words/phrases are used to describe the same thing: for example 95% confidence level = 5% level of significance = 5% level of rejection.

One-tailed and Two-tailed tests:
Look back at the picture of the 95% region around the mean. We reject the hypothesis that the population mean equals 0.40 if the Z-value of the sample mean lies in either of the shaded regions on the right or the left. This is an example of a Two-tailed test, that is, we reject if the sample mean lies in either tail.

Sometimes we may want to test the hypothesis that the population mean is greater than a particular number, rather than different from a particular number. For example, we may want to test that the population mean is greater than 0.40. Why would we want to do that?
Suppose the 40% is the success rate of a drug. We have a population (an existing drug that is already in use) that we know has a population mean of 0.40, and we have a preference for greater means (greater success). Success could imply, for example, that the patient is alive after 5 years. Hence the population is distributed binomially: the patient is either alive or not alive. We have to choose between the known population (existing drug) and the population to be sampled (new drug).
We want to continue using the existing drug unless it is UNLIKELY that the new drug has a smaller success rate, that is, that the population to be sampled has a smaller mean. Then the Null Hypothesis becomes μ < 0.40. In this situation what is the rejection region for the Null Hypothesis? The tail on the left can no longer be a rejection region. So only the right tail is the rejection region, and now this has to have the entire probability mass of 5%, so the value of Z for rejection falls to 1.64.

Note that in the above test comparing two drugs, the benefit of the doubt is being given to the existing drug; the Null Hypothesis is that it is the better drug. The Null Hypothesis gets the benefit of the doubt. We are ready to accept that the new drug is an inferior treatment unless the sample result shows that to be UNLIKELY, as in having a sample mean that has a Z > 1.64. This corresponds to sample mean > 0.40 + 1.64 * 0.05477, i.e. sample mean > 0.4898 (sample size 80).

It is not enough for the new drug to have a sample mean merely greater than 0.40; for example, 0.42 won't do. The amount by which 0.42 exceeds 0.40 is not sufficient to reject the Null Hypothesis. Statistics says that if we use 95% one-tailed confidence then the sample mean must exceed 0.4898.

We are favoring the Null Hypothesis, in the sense that we are making it difficult to reject. Even if the sample mean is 0.42 we still do not reject the Null that the population mean of the new drug is less than 0.40. Even though we favor the Null Hypothesis, we may still sometimes make the mistake of rejecting it mistakenly (this can happen with a small probability).
For example, suppose the population mean of the new drug is actually 0.38; then there is still a chance (less than 5%) that we will end up with a sample mean greater than 0.4898. We would then have wrongly rejected the Null Hypothesis. An error of this type is called a Type I Error.
We are more likely to make the error of the other sort, that is, not rejecting the Null Hypothesis even though the new drug has a success rate larger than 0.40. For example, if the success rate of the new drug is 0.42, then the population standard deviation will be σ = (0.42 * 0.58)^(1/2) = 0.4936. Then the Z for the sample mean (of a sample of size 80) that equals 0.4898 is:

Z = (0.4898 - 0.42) / (0.4936/80^(1/2)) = 1.265

The probability of the sample mean exceeding 0.4898 can be found by using the above Z and the table for the Standard Normal Distribution. It is 10.29%. So there is a 100% - 10.29% = 89.71% probability that with a population mean of 0.42 we will erroneously FAIL TO REJECT the Null Hypothesis that the population mean of the new drug is less than 0.40 (while in reality it is 0.42 > 0.40). This is a Type II Error, the error of not rejecting a false Null.
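A Python sketch of this Type II error calculation (erf-based normal CDF assumed); it finds the one-tailed rejection cutoff of about 0.4898 and the roughly 89.7% chance of failing to reject when the true mean is 0.42:

from math import erf, sqrt

def phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n = 80
# Rejection cutoff under the Null (mean 0.40), one-tailed 95% test
sigma_null = sqrt(0.40 * 0.60)
cutoff = 0.40 + 1.64 * sigma_null / sqrt(n)   # about 0.4898

# Suppose the new drug's true success rate is 0.42
mu_true = 0.42
sigma_true = sqrt(0.42 * 0.58)                # about 0.4936
se_true = sigma_true / sqrt(n)

z = (cutoff - mu_true) / se_true              # about 1.265
type_ii = phi(z)                              # P(sample mean falls below the cutoff), about 0.897
print(round(cutoff, 4), round(z, 3), round(type_ii, 4))
print("Power:", round(1 - type_ii, 4))        # about 0.103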

Another quick example: If the population mean for the new drug is 0.4898, then the probability of making a Type II error with the rejection level at 95% (also called the 5% significance level) is 50%. This follows as there is a 50% chance that the sample mean from a population with mean 0.4898 will turn out to be less than 0.4898.
If a test of the Null has a lower probability of making a Type II Error, it is said to possess greater Power. The greater the Power of a test, the more faith we can have in our statistical inference. Tests should be designed to maximize Power for a given probability of making a Type I error. We should be aware of the Power of our tests.
If we use, say, the 99% confidence level (1% significance level) instead of the 95% confidence level, we make it more difficult to reject the Null Hypothesis. This reduces the probability of making a Type I Error, but increases the probability of making a Type II Error. It decreases the Power of the test.

The t-distribution
The example above was of a population which was Binomially distributed. Specifically, such a distribution enabled us to calculate the standard deviation that was implied by the mean of the Null Hypothesis. If the population is not distributed Binomially, then the Null will not be enough to compute the standard deviation, and we will have to form an estimate of the population standard deviation from the sample standard deviation.

When the population standard deviation also has to be estimated, the test statistic becomes the deviation of the sample mean from the population mean, divided by the estimate of the population standard deviation divided by the square-root of N. This statistic is called the t-statistic and it is said to have N-1 degrees of freedom (dof).

Henceforth we will call the sample mean xs and the sample standard deviation ss. These estimators are random variables. From the sample we get one particular estimate for each of these two random variables. We use these estimates to make inferences about the population.

Fortunately for us, the t-distribution has been extensively studied and tables similar to the Standard Normal table have been prepared. So the only change we need to make from the earlier example is to calculate the sample standard deviation, compute the t-statistic, and use the t tables.

If the size of the sample is large (say greater than 30) we can approximate the population standard deviation by the sample standard deviation, and use the Z-statistic instead of the t-statistic. t-statistic tables are available on the internet, for example:
http://www.statsoft.com/textbook/sttable.html

Confidence Intervals
Besides testing a Null Hypothesis that the population parameter (for example the mean μ) is of a particular value, we may also wish, by sampling, to ESTIMATE the population parameter. For finite (especially small) samples we cannot determine the population parameter for sure, but can only make a statement. The statement that we make is in the form of a CONFIDENCE INTERVAL.

In Classical Statistics a Confidence Interval is an Interval Estimator. The Confidence Interval that we compute based on a sample is one particular realization of the Interval Estimator.

A Confidence Interval for a population parameter consists of two endpoints (the interval is between the endpoints) and a confidence level p% (usually 95%). For the population mean, the sample mean is the midpoint of the Confidence Interval.
We consider 4 situations:
1) If the underlying population is distributed normally (which is rare), then sample means are distributed normally (even if the sample size is small). If the population standard deviation σ is known, then we have the confidence interval to be:

xs ± Zα*σ/N^(1/2)

where α = (100% - p%)/2 and Zα is defined as the value for which Z > Zα has a probability mass of α.

2) If the underlying population has a known standard deviation (σ) and the sample size is large (say > 80), then once again we have the confidence interval to be:

xs ± Zα*σ/N^(1/2)

3) If the underlying population has an unknown standard deviation and the sample size is small, then the confidence interval is:

xs ± tα*ss/N^(1/2)

4) If the underlying population has an unknown standard deviation and the sample size is large (say > 80), then the t-distribution can be approximated by the Z-distribution and the confidence interval is:

xs ± Zα*ss/N^(1/2)

Note that the sample standard deviation estimator ss is:

ss = [ Σ_{i=1 to N} (xi - xs)² / (N - 1) ]^(1/2)

Why divide by N-1 rather than N? The intuition is that we are not using the actual mean μ but the fitted mean xs, which will lead to the deviations being underestimated; to correct for that we divide by N-1 rather than N.

Online resource for Stats: http://onlinestatbook.com

Example: Suppose we wish to find the Confidence Interval at the 95% level for estimating the mean of a binomial population (with values 0 or 1). The sample mean is 52.5% for a sample of size 80 (38 0s and 42 1s). First estimate the sample standard deviation ss:

ss² = (38*(0 - 0.525)² + 42*(1 - 0.525)²)/(80 - 1)
=> ss = 0.5025

Our estimate of the standard deviation for the mean is: 0.5025/80^(1/2) = 0.0562. Given the sample size of 80, we can use the Z approximation in place of t. The critical values for the 5% two-tails become ±1.96. The Confidence Interval then becomes:

52.5% ± 1.96 * 5.62%
= 52.5% ± 11.02%
= (41.48%, 63.52%)
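A sketch of the same interval in Python, with the 1.96 critical value taken from the normal table as in the text:

from math import sqrt

n, n_zeros, n_ones = 80, 38, 42
x_bar = n_ones / n                            # 0.525

# Sample standard deviation (divide by N - 1)
ss = sqrt((n_zeros * (0 - x_bar) ** 2 + n_ones * (1 - x_bar) ** 2) / (n - 1))
se = ss / sqrt(n)                             # about 0.0562

low, high = x_bar - 1.96 * se, x_bar + 1.96 * se
print(round(ss, 4), round(se, 4))
print(round(low, 4), round(high, 4))          # close to (41.48%, 63.52%)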
In the newspapers the results of the sampling will be reported as "In our opinion poll, the Reds have 52.5% of the vote and the Blues have 47.5%, with an error of ±11.02%." The confidence level is not reported; the default level is 95% for such sampling.

What does the Confidence Interval of (41.48%, 63.52%) mean? It means that if the population standard deviation is really the one we estimated (which is a fair approximation given the largish sample size of 80), then if the population mean were 41.48%, the probability of getting a sample mean of 52.5% or greater would be 2.5%. And if the population mean were 63.52%, the probability of getting a sample mean of 52.5% or lesser would be 2.5%.
Another way to think of the confidence interval: pick any value from the interval. Assume that it is the population mean. Then the 95% region around that population mean will contain the sample estimate of the mean (0.525). If the sample is large, the 95% region will be given by the z-distribution with the population standard deviation approximated by the sample standard deviation. If the sample is small, then the 95% region is given by the t-distribution.

Yet another way to think of a confidence interval: it is the set of values for which we would fail to reject a Null Hypothesis setting the population mean equal to those values.

What a Confidence Interval in Classical Statistics does NOT mean is that there is a 95% chance that the population mean lies between 41.48% and 63.52%. The population parameter in this sort of statistics is not a random variable, hence to talk about probability with respect to it is meaningless.

Difference of Means
Another example of statistical testing is to determine if two populations are different in terms of their means. This differs slightly from the earlier existing-and-new-drug example. There we knew the population mean of one population (existing drug) and wanted to test if the other drug was better. Here we simply wish to test whether two populations are different, without any predisposition about one having a greater or lesser mean. For example, we may wish to test if the populations of adult females of the US and Canada have the same heights.

We take one sample each from both populations. Both these sample means are normally distributed. We also assume that the population variances (standard deviations) are equal, though we do not know what they are. Under the Null that both populations have the same mean, the following statistic is distributed as t:

t = [(xs1 - xs2) - (μ1 - μ2)] / [sp * (1/n1 + 1/n2)^(1/2)]
where the pooled variance sp² = [(n1 - 1)*ss1² + (n2 - 1)*ss2²] / (n1 + n2 - 2)

Here n1 and n2 are the sizes of the samples of populations 1 and 2 respectively, xs1 and xs2 are the sample means respectively, ss1 and ss2 are the sample standard deviations respectively, and μ1 and μ2 are the population means under the Null Hypothesis respectively.

If we do not assume the population variances are equal, then we have the t-statistic given by:

t = [(xs1 - xs2) - (μ1 - μ2)] / (ss1²/n1 + ss2²/n2)^(1/2)
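A minimal Python sketch of both statistics just described; the two height samples are made-up illustrative numbers, not data from the text:

from math import sqrt

def two_sample_t(x, y, equal_var=True):
    # t-statistic for the Null that the two population means are equal (mu1 - mu2 = 0)
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    v1 = sum((a - m1) ** 2 for a in x) / (n1 - 1)
    v2 = sum((b - m2) ** 2 for b in y) / (n2 - 1)
    if equal_var:
        # Pooled-variance version (assumes equal population variances)
        sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        return (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))
    # Unequal-variance version
    return (m1 - m2) / sqrt(v1 / n1 + v2 / n2)

# Hypothetical height samples (inches) for two populations
us = [64.1, 63.5, 65.2, 64.8, 63.9, 64.4]
ca = [63.8, 64.0, 63.2, 64.6, 63.5, 63.9]
print(round(two_sample_t(us, ca), 3))
print(round(two_sample_t(us, ca, equal_var=False), 3))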

Paired Comparison Test
Did firms increase R&D spending as a percentage of total sales after Congress passed tax breaks for R&D expenditure? Suppose we look at a sample of 25 firms. One way to do the test is to look at the two populations of R&D spending and compare their means using the above tests of unequal means. A test with more power is to instead study the difference in each firm's spending before and after.


That is, take the population
of changes and test whether
its mean is greater than zero.
This is a Paired
Comparison Test and it has
more power because we are
not losing information by

aggregating the R&D


spending.
Suppose the mean for the
population of changes is 7%,
and the standard deviation is
5%. Assume that the
distribution is normal, so the
test will use the t-statistic.

The degrees of freedom (dof) are 25 - 1 = 24. Suppose we test at the 95% confidence level; then the critical value as given by the t-tables is 1.7109. The t-statistic will be:

t24 = 7% / (5% / 25^(1/2)) = 7
As the t-statistic exceeds the critical value, we reject the Null Hypothesis that the tax breaks passed by Congress did not increase R&D expenditure.
The t-tables are given in terms of the right tail. If you want them for two-tailed tests, you have to take the critical value for half the significance level. That is, if you want 95% confidence, which corresponds to 5% significance, then look up the critical value for 2.5% significance. This works because the t-distribution is symmetric (like the normal distribution), so 2.5% on each tail adds up to 5% significance. The 2.5% critical value for only the right tail from the tables with 24 dof is 2.0639, which would be the critical value for the two-tailed test.

How does the Paired Comparison test increase power? Suppose there are 8 firms in the sample, 4 of which move from 5% to 6% and the other 4 move from 25% to 26%. Then if we aggregate, the aggregate moves from 15% to 16%. There is however substantial standard deviation in the samples due to the difference between 5% and 25% and between 6% and 26%. The standard deviations for the two samples (pre and post change) turn out to be about 10%.


If we do the t-test we get:

t14 = 0.01 / (0.10²/8 + 0.10²/8)^(1/2) = 0.2

The above t-stat is too small to reject the Null at any meaningful level (the p-value is greater than 40%). Given this, the 1% change may not show up as significant, and we may fail to reject the Null. The above is a test of differences in means.
However, when we form a sample of differences, all the differences are 1%, making the sample standard deviation zero. This makes the t-stat infinite and the p-value zero! We can reject the Null at all levels of significance. This is a test of the mean of differences.
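A sketch of the 8-firm illustration in Python; because every difference is exactly 1%, the paired standard deviation is exactly zero, so the sketch reports that rather than dividing by zero:

from math import sqrt

pre  = [5, 5, 5, 5, 25, 25, 25, 25]   # R&D as % of sales before the tax break
post = [6, 6, 6, 6, 26, 26, 26, 26]   # and after: every firm moved up by 1%

def mean(v): return sum(v) / len(v)
def sd(v):
    m = mean(v)
    return sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

# Test of difference in means (unpaired): tiny t-stat, cannot reject
n = len(pre)
se_unpaired = sqrt(sd(pre) ** 2 / n + sd(post) ** 2 / n)
print(round((mean(post) - mean(pre)) / se_unpaired, 3))   # about 0.19

# Test of mean of differences (paired): the differences have zero spread
diffs = [b - a for a, b in zip(pre, post)]
print(mean(diffs), sd(diffs))   # 1.0 and 0.0, so the t-stat is unbounded and we reject at any level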
This is an example of how the power of tests can be increased by paying attention to the design of the test. How did the increase in the power of the test happen? In the unequal-means test there was variation in the samples due to variations between firms. We are however interested only in the variation due to the tax breaks that Congress passed. So considering only intra-firm variations enables us to focus on the tax breaks and eliminate the noise due to inter-firm differences, and results in a test of greater power.

Of course, if the samples are independent (say trials of two drugs) then we can't use the Paired Comparison test. However, when possible, when samples are not independent (as in the case of changes in firms' R&D spending) we should use the Paired Comparison (mean of differences).

Tests of Variances
For a normally distributed population we can test whether the variance equals a particular value. This is similar to testing whether the mean equals a particular value, as we have done before, but with the variance substituted for the mean. Suppose the Null Hypothesis H0 and the alternative Hypothesis HA are:

H0: σ² = σ0²
HA: σ² ≠ σ0²
The logic for the test remains the same as for the Classical (Frequentist) approach to statistical testing. That is, suppose the population is normally distributed and has σ² = σ0²; then the test statistic will have a particular distribution. This distribution was Z (standard normal) or t for the statistic to test the mean, and is Chi-squared for testing the variance.

If the test statistic lies beyond the critical value then we conclude our initial assumption that the population conformed to the Null Hypothesis was wrong, and we Reject. If the test statistic lies within the critical values then we Fail to Reject the Null Hypothesis. The Chi-squared test statistic χ² has N-1 degrees of freedom and is:

χ²(N-1) = (N - 1) * ss² / σ0²
The two-tailed Chi-squared test rejects the Null if the value of the test statistic is too small (sample variance much smaller than expected under the Null) or too large (sample variance much larger than expected under the Null). For example, we divide a 5% level of significance into a 2.5% right tail and a 2.5% left tail. The Chi-squared table, and also the F-tables that we will need soon, are available at:
http://www.statsoft.com/textbook/sttable.html
Example: A sample of 25 daily returns for a stock has a variance of 0.20%². Assuming that the process that generates the returns is stable (unchanging), test at the 1% significance level the hypothesis that the variance of the daily stock returns is 0.25%². The dof is 24 and the two tails at the 1% level of significance are 9.8862 and 45.5585.

The test statistic from the sample is:

χ²24 = 24 * 0.20² / 0.25² = 15.36

As 45.5585 > 15.36 > 9.8862, the test statistic does not lie in the region of rejection and we FAIL TO REJECT the Null Hypothesis that the variance of daily stock returns is 0.25%².
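The same test as a Python sketch, with the critical values hard-coded from the Chi-squared table quoted above:

# Chi-squared test that the variance of daily returns equals 0.25%^2
# (1% significance, two-tailed; critical values for 24 dof)
n = 25
sample_var = 0.20 ** 2      # in %^2 units
null_var = 0.25 ** 2

chi2_stat = (n - 1) * sample_var / null_var
lower, upper = 9.8862, 45.5585   # 0.5% and 99.5% points of Chi-squared with 24 dof

print(round(chi2_stat, 2))                                     # 15.36
print("Reject Null:", chi2_stat < lower or chi2_stat > upper)  # False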

Test of equality of variances
Equality of variances can be tested using a test statistic that is the ratio of the variances of the samples and that, under the Null, has an F-distribution. The larger variance is put in the numerator and the smaller in the denominator. The test statistic has two dofs, first the numerator's, then the denominator's. The order of the dofs is important as they are not interchangeable. As before, the dofs are one less than the sample sizes.
The larger variance is put in the numerator, so the test is single (right) tailed. This putting of the larger sample variance on top means that we eliminate the lower tail. So when we test for inequality of variances, if the level of significance is 5%, the right tail should only have a probability of 2.5%.

Example: An analyst is comparing the monthly returns of 2-year T-bonds and bonds issued by GM. She decides to investigate whether the T-bonds have the same variance as the GM bonds by taking a sample of 20 monthly returns of the former and 31 monthly returns of the latter. The variances are 0.001² and 0.0012². What can you conclude if the level of significance is 10%? The critical level for the F-stat at the 10% level of significance with degrees of freedom 30 and 19 is 2.07 (right tail with 5% probability).

The F-statistic for the test of equality of variances is:

F30,19 = 0.0012² / 0.0010² = 1.44

As the F-stat is less than the critical value of 2.07, we cannot reject the Null Hypothesis that the two variances are equal.
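The same F-test as a Python sketch, with the 2.07 critical value taken from the F-table as quoted in the text:

# F-test that the two bond-return variances are equal, 10% significance
# (two-tailed, so 5% in the right tail); larger sample variance in the numerator
var_gm, n_gm = 0.0012 ** 2, 31
var_tb, n_tb = 0.0010 ** 2, 20

f_stat = max(var_gm, var_tb) / min(var_gm, var_tb)
critical = 2.07   # F-table value for (30, 19) degrees of freedom, 5% right tail

print(round(f_stat, 2))                    # 1.44
print("Reject Null:", f_stat > critical)   # False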

Covariance, Correlation Coefficient
These are (descriptive) statistics that measure the characteristics not of one variable (like the mean and standard deviation) but of a pair of variables. The Covariance of a pair of variables is a measure of how much they move together. For example, if one of the variables increases by 5%, then the covariance is a measure of how much we would expect the other variable to have increased. If the variables have a tendency to move in opposite directions the covariance will be negative.

The formula for Covariance is similar to that for Variance, except that it takes the deviations from the mean for the two variables and multiplies them, rather than taking the square of either. The formula for the estimator of the covariance in a sample between variables X and Y is:

Cov(X,Y) = Σ_{i=1 to N} (xi - μX)(yi - μY) / N

We take the pairwise occurrences of the variables X and Y, find their deviations from their means and multiply them. Finally we divide by the number of occurrences to find the covariance. If the population means are not known, then we can use the sample means xs and ys to form an estimate of the covariance.

The correlation coefficient ρXY is the covariance between X and Y normalized (scaled down) by the standard deviations of the two variables:

ρXY = Cov(X,Y) / (σX * σY)

The correlation coefficient lies in the range (-1, +1).
A correlation of +1 means that X and Y are perfectly correlated: every increase (decrease) in X is accompanied by a proportional increase (decrease) in Y. A correlation of -1 means that X and Y are perfectly negatively correlated: every decrease (increase) in X is accompanied by a proportional increase (decrease) in Y.
We denote the estimator of ρXY as rXY. It is a random variable. Sometimes it is called the sample correlation coefficient. Prior to the sample being taken we have the random variable (which we know how to compute using the formula). Post sampling we have one realization of (an estimate of) the estimator.

Outliers are data points that deviate significantly from the norm. As the computation of variances and covariances involves squaring, outliers can have a disproportionate impact on their calculation. For example, a data point that deviates from the mean by 3 units will have the same impact as 9 data points that deviate from the mean by 1 unit in the calculation of the variance. Hence we should examine the impact of outliers on our estimates, and may even exclude them from our calculations.

Correlation and Causality
Correlation is a statistical estimate; it does not imply causality. Just because X and Y are positively correlated does not mean that X causes Y or vice versa. Consider two variables, the number of boats out on Lake Michigan and the number of cars on the highways. If you compute a correlation you will find that they are positively correlated. However it is not the boats that are causing more cars to be on the highways, or vice versa. Rather, both are activities that increase in the summer, leading to the positive correlation. The correlation between boats and cars is "spurious" in the sense that it is not causal.

Testing whether the correlation coefficient is ZERO
We represent the estimator of the correlation coefficient ρXY by rXY. The Null to be tested is:

H0: ρXY = 0

The following test statistic is for a two-tailed test of the Null and is distributed as t with N-2 degrees of freedom:

t(N-2) = rXY * (N - 2)^(1/2) / (1 - rXY²)^(1/2)

Example: You have 22 pairwise observations for daily returns to stocks X and Y. The correlation coefficient is 0.42. Test the Null at the 1% level of significance. The critical level for t at the 1% level of significance with 20 dofs is 2.85. The test statistic has a value:

t20 = 0.42 * 20^(1/2) / (1 - 0.42²)^(1/2) = 2.07

So we cannot reject the Null that the two daily stock returns have a zero correlation (are uncorrelated) at the 1% level of significance.
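A sketch of this test in Python, with the 2.85 critical value taken from the t-table as in the text:

from math import sqrt

# Test H0: correlation = 0, given r = 0.42 from 22 pairwise observations
n, r = 22, 0.42
t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)
critical = 2.85   # t-table value, 20 dof, 1% significance two-tailed

print(round(t_stat, 2))                          # about 2.07
print("Reject Null:", abs(t_stat) > critical)    # False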

Linear Regressions
We now come to an area of statistical inference and estimation that is of particular importance to Economics and Finance: Regressions. The basic single-independent-variable Regression model is:

Yi = α + β*Xi + εi

The above model says that the dependent variable Y is determined by the value of the independent variable X multiplied by a constant β, plus a constant α, plus other influences on Y summarized in the error term ε. The constant β is also called the slope coefficient, and the constant α the intercept.

The reason should be clear from the following graph.

The work now is to estimate the coefficients of the Regression Equation, namely α and β.

If we define the vectors b, X1, X2, and Y:

b = [α, β]
X1 = [1, 1, ..., 1]
X2 = [X1, X2, ..., XN] (the data values of the independent variable)
Y = [Y1, Y2, ..., YN]

where X1 and X2 represent the 1st and 2nd columns of the Nx2 matrix X. The first column can be thought of as using 1s (a constant) as a variable to extract the coefficient α.

That is, like the other coefficient β, the coefficient α is also multiplied by a variable, except that the variable happens to be the constant 1. The linear regression model becomes:

Y = Xb + ε

Then the Ordinary Least Squares (OLS) estimator for the coefficients is the 2-dimensional vector:

bOLS = (X'X)^(-1) X'Y

with the first element of bOLS the intercept α and the second element the slope β.
If you do not want vector notation, you can use the summary statistics:

Sx = x1 + x2 + ... + xN
Sy = y1 + y2 + ... + yN
Sxx = x1² + x2² + ... + xN²
Sxy = x1y1 + x2y2 + ... + xNyN

βOLS = (N*Sxy - Sx*Sy) / (N*Sxx - Sx*Sx)
αOLS = (Sy - βOLS*Sx) / N
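A minimal Python sketch of the summary-statistic formulas, run on a small made-up data set (the x and y values are purely illustrative):

# OLS slope and intercept from the summary statistics
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
N = len(xs)

Sx  = sum(xs)
Sy  = sum(ys)
Sxx = sum(x * x for x in xs)
Sxy = sum(x * y for x, y in zip(xs, ys))

beta  = (N * Sxy - Sx * Sy) / (N * Sxx - Sx * Sx)
alpha = (Sy - beta * Sx) / N

print(round(alpha, 3), round(beta, 3))   # intercept and slope of the fitted line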

Let the vector of residuals generated by the regression be e. It is an N-dimensional vector. OLS chooses the coefficients such that the sum of squares of the ei (i = 1, 2, ..., N) is minimized.

The method was applied in 1801 by Gauss to predict the position of the asteroid Ceres. Other mathematicians who had tried to predict (forecast) the asteroid's position had been unsuccessful. Legendre shares the credit with Gauss for discovering OLS.

The OLS estimator has some desirable features, provided some conditions are satisfied, prominently:
1) X and ε are uncorrelated
2) ε has the same variance across different data points
3) ε has a zero expected value (note that α and β can be defined to make this true)
4) ε is not serially correlated (that is, εi is uncorrelated with εi+1, εi+2, etc.)

If these conditions are satisfied the OLS estimator is BLUE (Best Linear Unbiased Estimator). Best as in lowest variance; unbiased implies that if you found the OLS estimators repeatedly, their average would be α and β.

If the conditions for OLS are not satisfied, a variety of other estimators have been designed by statisticians, for example:
1) Generalized Least Squares (GLS) if ε does not have the same variance across different data points.
2) Instrumental Variable estimators if X and ε are correlated.

Just like we had a variance for the estimator of the mean of a population, we have variances for the OLS estimators. As of now, I will not get into how the variances are calculated; rather I will provide you the variance. Usually exam questions and statistical packages provide the variances rather than expect you to calculate them.

Hypothesis Testing
The most common Null Hypothesis to be tested for the linear regression is that one of the coefficients is zero. As before, the Null is rejected if the value of the coefficient estimate divided by its standard deviation exceeds (in magnitude) some critical value (usually either Z or t).

Example: A regression with 50 data points produces an estimate of 2.27 for the βOLS estimator. The sample estimate of its standard deviation ssβ is 1.14. Test, at the 95% level, the Null that the slope coefficient is 0 against the alternative that it is greater than 0. As the number of data points is large at 50, we can use the Z-distribution. The critical value for the Z-stat at the one-tailed 95% level is 1.64.

test statistic = 2.27 / 1.14 = 1.99

As the test statistic is more than the critical value, we reject the Null Hypothesis that the slope coefficient is 0.

Following the Frequentist approach, we can form Confidence Intervals (which are estimators) for the coefficients of the linear model with critical values tC from the t-table:

βOLS ± tC*ssβ
αOLS ± tC*ssα

where ssβ and ssα are the standard deviations of the βOLS and αOLS estimators.

Example: A regression with 70 data points yields αOLS and βOLS estimates of 1.57 and 2.49. The standard deviations ssα and ssβ are 0.49 and 1.08. Estimate the 95% confidence intervals for the coefficients. As the number of data points is large, we can use the two-tailed 95% Z critical values of ±1.96.

αOLS ± ZC*ssα = 1.57 ± 1.96*0.49 = (0.61, 2.53)
βOLS ± ZC*ssβ = 2.49 ± 1.96*1.08 = (0.37, 4.61)

Analysis of Variance (ANOVA) for Regressions
When we say "Sum of Squares" we mean the Sum of Squared Deviations from the Mean.

The Sum of Squares for Y is called the Total Sum of Squares (SST). The Sum of Squares for βOLS*X is called the Regression Sum of Squares (SSR). This is the amount of variation that would exist for Y if X were the only cause of variation. This is the amount "explained" by the regression.
The Sum of Squares for the residuals e is called the Error Sum of Squares (SSE). This is the amount of variation that the regression does not "explain", and attributes to the error term. It is the "unexplained" part. This is also called the Sum of Squared Residuals, as it is indeed calculated by squaring the estimated residuals and adding.

By algebra we have: SSR + SSE = SST.

The Mean Squared Error (MSE) equals SSE/(N-2). MSE is the average of the squared residual errors.

Normally when we calculate an average we divide by the number of observations N, but here we divide by the number of observations minus 2. Why N-2 rather than N? The answer is that the degrees of freedom are N-2.

At which point you may be inclined to ask "Why is the degrees of freedom N-2?" Without getting into a detailed answer, note that our goal for the regression is to minimize the residuals in some optimal fashion. Our ability to choose the intercept and slope (the two unknowns) implies we can pick these parameters such that 2 of the residuals become zero or any arbitrary number we wish to assign.

We are not really facing a vector of N residuals that are beyond our control, but rather a vector of N-2 residuals. That is, the residuals can vary in N-2 dimensions, rather than N dimensions, hence N-2 degrees of freedom.

The R² (R-squared or Coefficient of Determination) for a regression is:

R² = SSR/SST

The better the data fits the linear regression, the higher the R² will be. If X can explain all the variation in Y, the R² will be 100%. That would mean there would be no variation needing to be explained by the error term, and X and Y would be perfectly correlated.

An R² of zero implies that X can explain none of Y, that is, X is orthogonal to Y; they have zero correlation.

Confidence Interval for Forecasts of Y
We can use the regression coefficients to make Forecasts of the values of Y for given values of X. The Predicted Value of Y is:

Ŷ = αOLS + βOLS*X

The sign over Y is a "hat" and the variable is called "Y-hat". This is standard statistics notation.

Define the Standard Error of Estimate (SEE):

SEE = (SSE/(N - 2))^(1/2)

And then the variance s²FY for the Forecast of Y is:

s²FY = SEE² * [1 + 1/N + (XG - xSX)² / ((N - 1)*sSX²)]

where XG is the given value of X for which Y is predicted, and xSX and sSX are the sample mean and standard deviation of X respectively.

Example: A linear regression with 40 data points yields αOLS and βOLS estimates of 7.45 and 1.93. The SEE from the regression is 0.059, xSX is 1.25 and sSX is 0.31. Estimate the Predicted Y value for X = 1.89 and the 95% confidence interval.

The Predicted Value = 7.45 + 1.93 * 1.89 = 11.10

We next compute the variance for the predicted value of Y to be:

0.059² * (1 + (1/40) + (1.89 - 1.25)²/((40 - 1)*0.31²)) = 0.00394846

This gives the standard deviation to be 0.0628. As N is larger than 30, we can use the Z critical values. If it were smaller, we should use t with N-2 dofs.

The critical values for the 95% confidence interval are ±1.96. The 95% confidence interval is:

11.10 ± 1.96*0.0628
= (10.97, 11.22)
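A sketch of this forecast interval in Python, using the inputs quoted in the example above:

from math import sqrt

alpha_ols, beta_ols = 7.45, 1.93
see, n = 0.059, 40
x_bar, s_x = 1.25, 0.31
x_g = 1.89

y_hat = alpha_ols + beta_ols * x_g    # about 11.10
var_f = see ** 2 * (1 + 1 / n + (x_g - x_bar) ** 2 / ((n - 1) * s_x ** 2))
sd_f = sqrt(var_f)                    # about 0.0628

print(round(y_hat, 2), round(sd_f, 4))
print(round(y_hat - 1.96 * sd_f, 2), round(y_hat + 1.96 * sd_f, 2))   # about (10.97, 11.22)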
