
STAT101

Statistics can be divided into two parts:
* Descriptive Statistics
* Inferential Statistics

We start with Descriptive Statistics.
A population is a set of objects/individuals/things whose numerical characteristic(s) we are interested in. For example, we may be interested in the heights of all residents of the US, or the number of pages in all the books in the US Library of Congress.

One way to describe a population is to list all the numerical values; for example we can list the heights of all the residents of the US, a database that will have almost 300 million entries.

Alternatively we can list some characteristics of the population, like the mean and standard deviation. This will just be 2 numbers instead of 300 million, but will give us an idea of what the population looks like.

The mean of a population is the average of all the individual values in it and is represented by the Greek letter μ:

μ = (Σ_{i=1 to N} xi) / N

The above notation is a formal way of writing that μ equals the sum of xi as i varies from 1 to N, divided by N.

xi is the value of an individual data point within the population, and N is the total number of data points in the population. Specifically, note that the symbol Σ is used to represent the sum of whatever follows it, with i (the parameter of summation) varying from the amount appearing at the lower-right of Σ, i.e. 1, to the amount appearing at the top-right of Σ, i.e. N.

So we have μ to simply be the sum of the values of all data points in the population divided by the population size.

σ is a measure of how variable the values are within the population. To measure σ we find the deviation of an individual data point from the mean μ, then square it, and add these squared deviations for all data points. Why do we square? Because if we simply add without squaring we will get a big grand ZERO as a result.

Finally we once again divide by N, and the result is the variance σ², the square-root of which is the standard deviation σ.

And the standard deviation σ is given by:

σ = [ Σ_{i=1 to N} (xi - μ)² / N ]^(1/2)

There is actually more to the standard deviation σ than just avoiding getting zero by summing. There are other ways of avoiding the zero, for example by taking absolute values or the fourth power. However σ is special; for example it is a parameter of the normal distribution, which is probably the most important probability distribution. We will get to that later.

How are means and standard deviations useful? For example, we may find that the mean height of adult men in the US is 5'9" and the standard deviation 4", whereas the mean height of adult Dutch men is 5'10" and the standard deviation 2". This information enables us to form an idea about the heights of the US male and Dutch male populations. Alternatively we could actually list the heights of all adult males, but then we would have to deal with many million values instead of just 2 for each country.

Think of a bowl with many Red Balls and Blue Balls, say a total of 1,000,000 (only a few of which are shown in the picture).

To apply numerical methods we assign numerical values of 0 to Reds and 1 to Blues. That implies, for example, that if the population has 30% (300,000) Reds and 70% (700,000) Blues, the population mean will be 0.70.

Of course we could have assigned other values, for example 5 to Reds and -1 to Blues; that would just be an origin shift and scaling. For the remainder of this course assume that the pot actually has 52% Reds and 48% Blues.
If we actually counted all the balls in the pot, we would find 480,000 Blues and 520,000 Reds. Therefore:

μ = (520K*0 + 480K*1) / 1,000,000 = 0.48

σ² = (520K*(0 - 0.48)² + 480K*(1 - 0.48)²) / 1,000K = 0.2496

=> σ = 0.4996
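The same numbers can be checked with a minimal Python sketch (assuming Reds are coded 0 and Blues 1, as above), applying the definitions of μ and σ directly:

from math import sqrt

# Population of 520,000 Reds (coded 0) and 480,000 Blues (coded 1)
n_red, n_blue = 520_000, 480_000
N = n_red + n_blue

# Population mean: sum of all values divided by N
mu = (n_red * 0 + n_blue * 1) / N

# Population variance: average squared deviation from the mean
var = (n_red * (0 - mu) ** 2 + n_blue * (1 - mu) ** 2) / N
sigma = sqrt(var)

print(mu, var, round(sigma, 4))   # 0.48, 0.2496, 0.4996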

Distributions:
The population of Red and Blue Balls is an example of a binomial (discrete) distribution: a data point can take only one of 2 values.

There are many variables that can take an infinite number of values, for example height. These variables have a continuous distribution. In such distributions a particular value has a very very small probability (tending to Zero).
For example, consider a height of exactly 5'10". You will be hard put to find anyone in a population who is absolutely exactly 5'10". Even those who are very very close will probably differ by at least a billionth of an inch. In that case it becomes meaningful to talk about the number of individuals for a range of outcomes (say heights with values between 5'9" and 5'10") rather than individual outcomes (like 5'9" exactly or 5'10" exactly). Distributions that give density (amounts for ranges) are continuous distributions.

The most useful and hence famous example of a continuous distribution is the Normal distribution.

We see above that the Normal Distribution is symmetric around the mean μ. For example, the probability density 2 standard deviations above the mean is the same as that 2 standard deviations below the mean.

What in the world is normally distributed? Well, heights aren't! Suppose an apple orchard produces a million apples and puts them randomly into bags of a thousand apples each. Then the weights of the bags would be approximately normal.

Values for the normal distribution are usually given in the form of cumulative tables. These tables are given in terms of the z variable, which is:

Z = (Variable Value - Mean)/Std Dev

For example, a variable that is one standard deviation above the mean will have a value for Z = 1.

Normally distributed data (a variable) can be origin-shifted and scaled to a standard distribution called the Standardized Normal Distribution, or Z-distribution. The origin shifting is done by subtracting the mean, and the scaling is done by dividing the variable by its standard deviation. Once this is done, the data has the same distribution as all other standardized normal variables. The mean of the Z variable is 0 and the standard deviation is 1, and tables are available for cumulative density values.

For example, from the above we see that the area (probability) of the random variable having a value from minus infinity to 1 standard deviation above the mean (Z = 1) is 0.8413 = 84.13%.

As the probability from minus infinity to the mean is 50% due to symmetry around the mean, this also implies that the probability of the random variable lying between the mean and the mean plus one standard deviation is: 84.13% - 50% = 34.13%.
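These table values can be checked with a short Python sketch; the standard normal CDF is built here from math.erf, which is an assumption of the sketch (any normal table gives the same figures):

from math import erf, sqrt

def phi(z):
    # Cumulative probability of the standard normal from minus infinity to z
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

print(round(phi(1.0), 4))        # 0.8413: area up to Z = 1
print(round(phi(1.0) - 0.5, 4))  # 0.3413: area between the mean and mean + 1 std dev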

Example: Suppose you are told that the consumption of potatoes by the Irish is normally distributed with a mean of 30 lbs a year, and a standard deviation of 5 lbs. The population of Ireland is 4 million. Calculate the number who consume between 27 lbs and 31 lbs.

The deviation of 27 lbs from the mean of 30 lbs = -3 lbs = 3/5 = 0.6 std dev below the mean. Therefore the probability mass from 27 lbs to 30 lbs = (0.7257 - 0.5000) = 0.2257. Similarly from 30 lbs to 31 lbs, the deviation = 1/5 = 0.20 std dev above the mean, and the probability mass = (0.5793 - 0.5000) = 0.0793.

So the total probability = 0.2257 + 0.0793 = 0.3050, and the number of Irish consuming between 27 lbs and 31 lbs = 0.3050 * 4M = 1.22M.
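A sketch of the same calculation in Python, again assuming the erf-based normal CDF:

from math import erf, sqrt

def phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mean, sd, population = 30.0, 5.0, 4_000_000

# Probability that consumption lies between 27 lbs and 31 lbs
p = phi((31 - mean) / sd) - phi((27 - mean) / sd)
print(round(p, 4))               # about 0.3050
print(round(p * population))     # about 1.22 million people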

Inferential Statistics
Think again of the bowl with many Red Balls and Blue Balls. Now instead of counting the number of Reds and Blues, we will Sample and make inferences about the totals.

A random variable is one whose value is not known to us. In the future the value will be known to us (if observable). For example, the value of a random draw from the population in the jar is unknown to us; it is a random variable, but once we have made the pick it will be either 0 or 1. We may know the probability of getting a particular value of a random variable; for example, in a jar containing 52% Reds and 48% Blues, the probability of getting 0 is 52% and of getting 1 is 48%. This is the probability distribution for this random variable.
You want to estimate what fraction of the total is Red and what fraction is Blue. You don't have time to count them all, so you just dip your hand in and select a few randomly (a Random Sample). From your sample you make an Estimate of the fractions of Blues and Reds.

The branch of mathematics that deals with such issues is Inferential Statistics. When you take a Random Sample from the jar, each ball has an equal Probability of being picked.

Suppose you pick up one ball every time you dip your hand into the jar. If there are a total of N balls in the jar, then each ball has a probability of 1/N of being randomly picked on every dip of your hand.

The total probability that any one of the balls will be picked is the sum of the probabilities for all balls: 1/N + 1/N + ... + 1/N (for the N balls) = 1. So the probability that some ball will be picked on one dip of the hand is 1.

A probability of 1 is a Certain event. The reason we can add the probabilities of 1/N is because the events (of a particular ball being picked) are Mutually Exclusive. That is, if one ball is picked then another can't be picked.

When you pick a ball, the result is either Red or Blue. So your data takes one of two values, which is the simplest possible result. Your data is said to be Binomial, and the Probability Distribution of the data is also Binomial.


Your data has a Probability Distribution because every time you dip your hand into the jar, you cannot predict Ex-ante (prior to the dip) what data you will get, but you can say what the probability will be. For example, if the fraction of Reds equals 52%, then you can say the probability of picking a Red (randomly) is 52% and a Blue is 48%. The Probability Distribution then is (Red = 0.52; Blue = 0.48).
Most of the time your data will take many values, in which case there will be probabilities associated with not 2, but many possible outcomes. If the outcomes are Continuous (that is, not discrete) the number of possible outcomes becomes infinite, and each individual outcome has a very very small probability (tending to Zero).

In that case it becomes meaningful to talk about probabilities for a range of outcomes (say outcomes with values between 2.5 and 3.1) rather than individual outcomes (like 2.5 or 2.9 or 3.1).

When a random variable can take an infinite number of values, the probability of one particular value (usually) tends to zero. Then we do not have probabilities associated with particular outcomes, like a height of 5'10", but probability densities given by a (continuous) probability distribution. The probability for a range of outcomes is given by the integral of the density over the range, that is the area under the probability (density) distribution curve. If the range is small we can approximate the area (probability) as the average of the densities at the endpoints of the range, multiplied by the length of the range.

The probability distributions then become Continuous Probability Distributions, the most famous example of which is the Normal Distribution. Other examples are the Chi-Square, t, and F distributions.

Suppose you have dipped your hand in the pot 80 times and find that you have picked 38 Reds and 42 Blues, that is a frequency of 0.475 for Reds and 0.525 for Blues. What can you conclude?

Now there are different ways in which we can think of the situation and different questions we can answer. Two major ways in which we can do statistical analysis are:
1) Frequentist (or Classical) Statistics
2) Bayesian Statistics

Though there is disagreement between the above two approaches, each is logically valid. They just view the world differently and ask different questions. Classical Statistics is by far the more commonly used, and that is what we will study. I will also discuss Bayes' Rule in an Advanced MBA Lecture.
As is proper for a course of this nature, I will take an applied approach. That is, I will sacrifice rigor while striving to give you the important intuitions. Remember that the question of estimating the population frequency of Reds and Blues is equivalent to estimating the population mean.

To make an inference from the data we have (Reds = 0.475 and Blues = 0.525, for a sample mean of 0.525) we first need to understand what kind of samples a binomial population will generate.

For example, it is quite likely that our population (52% Reds and 48% Blues) will generate a sample with, say, 40 Reds and 40 Blues, but it is NOT IMPOSSIBLE that it will generate a sample of 0 Reds and 80 Blues.

If we see that our sample has 0 Reds and 80 Blues, we cannot therefore conclude that the population does not have 52% Reds and 48% Blues, but we can conclude it is UNLIKELY. Similarly, if we see a sample of 70 Reds and 10 Blues, we would conclude that it is LIKELY that the population has more Reds than Blues.
Essentially, to make inferences from samples we first need to know with what probabilities a population produces different samples. Our inferences are in terms of what is likely and what is not; that is, from a sample we cannot make an inference that a population is exactly something, only that it is likely to be something and unlikely to be something else. This likeliness/unlikeliness is expressed in terms of rejection of (or failure to reject) a hypothesis, or in terms of confidence intervals for the estimates of the mean, standard deviation, etc.

Our binomial population generates samples according to a binomial distribution, which involves Combinatorial mathematics (Permutations and Combinations). Instead of going there we will use an Asymptotic result, the Central Limit Theorem (CLT).

CLT is undoubtedly the most useful result in statistics. It says the mean of a large sample from a population (discrete or continuous) is distributed according to a Normal Distribution, with mean equal to the population mean, and standard deviation equal to the population standard deviation divided by the square-root of the sample size. CLT applies to populations with all kinds of distributions as long as they don't have some unusual features (like infinite variance).

As a practical matter, a sample larger than 30 can be thought of as a large sample.

Now think of repeated samples of 80 from our population. How would the means of these samples be distributed? CLT says that the means would be distributed with a mean of 0.48 and a standard deviation of 0.4996/80^(1/2) = 0.05586.

Or, when you do repeated sampling of the population many times, you will see that about 68.26% of the means of the samples lie between 0.48 - 0.05586 = 0.42414 and 0.48 + 0.05586 = 0.53586. I repeat, the result given by CLT is that the sample means are normally distributed with mean 0.48 and standard deviation 0.05586.
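A small simulation sketch (using Python's random module, an assumption of this illustration) that draws repeated samples of 80 from the 52% Red / 48% Blue population; the mean and standard deviation of the sample means should come out close to 0.48 and 0.05586, and about 68% of them should fall within one such standard deviation of 0.48:

import random
from math import sqrt

random.seed(0)
p_blue, n, trials = 0.48, 80, 20_000

# Each sample mean is the fraction of Blues (coded 1) among 80 random draws
sample_means = [
    sum(1 if random.random() < p_blue else 0 for _ in range(n)) / n
    for _ in range(trials)
]

m = sum(sample_means) / trials
s = sqrt(sum((x - m) ** 2 for x in sample_means) / trials)
within_1sd = sum(abs(x - 0.48) <= 0.05586 for x in sample_means) / trials

print(round(m, 4), round(s, 4))  # close to 0.48 and 0.0559
print(round(within_1sd, 3))      # close to 0.68, as CLT predicts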

You may wonder how a discrete distribution, in which the data takes values of only 0 or 1, leads to a continuous normal distribution for the means. The answer is that a sample of 80 has means that take one of 81 values, starting from 0 (80 Reds, 0 Blues) and increasing in steps of 0.0125 (for one more Blue) all the way to 1 (0 Reds, 80 Blues). Though the mean is still discrete, the number of possible values it takes is large enough (81) to think of it as approximately continuous.
The next thing to do is to define precisely the inference we wish to test. We call it the Null Hypothesis. This is the hypothesis we test using the sample, and we either Reject it, or Fail to Reject it.

Let's say the Null Hypothesis (represented by H0) is:
H0: μ = 0.40, i.e., the population is 60% Reds and 40% Blues.

We are almost ready to make our first statistical inference, exciting! CLT says that the mean of the probability distribution of the sample mean is the population mean μ. So if we had many samples we could find an estimate of the mean of the means. However we don't have many samples, but just one. So how do we proceed?

This is the joy of statistics, to be able to say something logical about something of which we have only partial knowledge! We do the following two things:

1) We start by assuming the Null Hypothesis is true. Then we will have the distribution for sample means given by CLT:

μ = 0.40, and σ² = 60%*(0 - 0.40)² + 40%*(1 - 0.40)² = 0.24, or σ = 0.48990.

This implies the sample means will be distributed normally with mean 0.40 and std dev of 0.4899/80^(1/2) = 0.05477.
Note: As the actual population has 48% (not 40%) Blues, the actual distribution of sample means will have a mean of 0.48 and a std dev of 0.05586. However we do not know what the actual population is (if we did, the game would be over and we could go home), so we proceed assuming the hypothesis that we are testing to be true.

2) Next we look at the sample mean of the sample we have taken, and ask the question:

Can we say that it is really UNLIKELY that we would have got this mean from a distribution with a mean of 0.40 and a std dev of 0.05477? If the answer is Yes, then we Reject the Null Hypothesis. If the answer is No, we Fail to Reject the Null Hypothesis.
Next we need to define what we mean by UNLIKELY. It is defined in terms of how far away the sample mean is from the mean as per the Null Hypothesis. This distance is presented in terms of the percentage of the total area under the normal curve.

We decide before taking the sample what the level for rejection is. The choice is ours, but generally two levels are used: 95% and 99%. A level of 95% means that:

IF the sample mean DOES NOT lie in the 95% area surrounding the mean, THEN we infer that it is unlikely that the sample was produced by a distribution with the characteristics of the Null Hypothesis, and therefore REJECT the Null Hypothesis. OTHERWISE we FAIL TO REJECT the Null Hypothesis.

From the above picture we see that the Null Hypothesis would be rejected if the Z for the sample mean is greater than 1.96 or less than -1.96. Similarly, if we set our standard for rejection at 99%, then we reject the Null only if the sample mean does not lie in the 99% area surrounding the mean.

Now the value of the mean we have for the sample is 0.525. We are assuming (having not yet rejected the Null Hypothesis) that this has been produced by a normal distribution with a mean of 0.40 and a standard deviation of 0.05477. The number of standard deviations by which the value 0.525 exceeds the mean is:

Z = (0.525 - 0.40) / 0.05477 = 2.28

The above statistic is called the Z statistic. Now if we had chosen our level for rejection to be 95% BEFORE taking the sample, then we will REJECT the Null Hypothesis of μ = 0.40. However if we had chosen our level for rejection to be 99% BEFORE taking the sample, then we will FAIL TO REJECT the Null Hypothesis of μ = 0.40.

You should understand the above example of statistical inference well. We will see many different examples of statistical inference in this course, but the underlying logic presented above, which is the Classical (Frequentist) approach, remains the same.

The p-value
With a Z of 2.28, the area under the curve for the right and left tails (the area for which Z < -2.28 or Z > +2.28) equals 1.13% * 2 = 2.26% = 0.0226. This is referred to as the p-value. It is the probability that the population (assuming the Null is true) would produce a sample mean at least as far from 0.40 as the 0.525 we observed.
Often statistical inference is approached not by deciding beforehand the confidence level (say 95% or 99%) but by reporting the p-value and letting those considering the results judge for themselves whether the p-value warrants rejection of the Null Hypothesis.

The smaller the p-value, the smaller the probability that the population under the Null Hypothesis would have produced the sample. Hence the smaller the p-value, the more confidence we can have in rejecting a Null Hypothesis.
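A sketch of the whole test in Python (the erf-based normal CDF is again an assumption of the sketch); it reproduces the Z of about 2.28, the p-value of about 0.0226, and the opposite decisions at the 95% and 99% levels:

from math import erf, sqrt

def phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Null hypothesis: population mean 0.40 (60% Reds, 40% Blues)
mu0 = 0.40
sigma0 = sqrt(mu0 * (1 - mu0))           # 0.4899 under the Null
n, sample_mean = 80, 0.525

se = sigma0 / sqrt(n)                    # 0.05477, std dev of sample means under the Null
z = (sample_mean - mu0) / se             # about 2.28
p_value = 2 * (1 - phi(abs(z)))          # two-tailed p-value, about 0.0225 (the 2.26% above, up to rounding)

print(round(z, 2), round(p_value, 4))
print("Reject at 95%:", abs(z) > 1.96)   # True
print("Reject at 99%:", abs(z) > 2.576)  # False (2.576 is the two-tailed 99% critical value)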

Recap of the Classical (Frequentist) Approach to Statistical Testing
1) We start with a population that we want to test a Null Hypothesis about, say that the mean of the population equals a certain value.
2) We describe an appropriate test statistic. Appropriate implies a statistic that we will be able to test.

3) Assuming that the Null Hypothesis is true, we find the distribution of the test statistic. For means, the CLT is a very powerful result that provides us the distribution when the sample size is large (in practical terms > 30). It is the statistician's job to tell us what the test statistic is, how it is distributed, and to provide us tables for critical values. We will take these as given rather than delving deep into these issues.

4) We optionally define the level of confidence (level of significance/rejection). Alternatively we skip this step and report the p-value at step 6.
5) We take a sample of the population and estimate the test statistic.
6) On the distribution of the test statistic implied by the Null, we see where the test statistic lies. If it lies in the "regions of rejection" away from the Null value we REJECT the Null Hypothesis. Otherwise we FAIL TO REJECT the Null Hypothesis. The regions of rejection are those which are UNLIKELY for the test statistic to end up in if the Null Hypothesis is true, for example the right and left tails of the standard normal distribution when we were testing means. If the test statistic does not lie in the regions of rejection we conclude the sample estimates are NOT UNLIKELY ENOUGH to justify rejecting the Null Hypothesis.
You should get comfortable with the above logic. This is what statistical testing is; it will appear in many different forms, but the logic remains the same. Also, various words/phrases are used to describe the same thing: for example 95% confidence level = 5% level of significance = 5% level of rejection.

One-tailed and Two-tailed tests:
Look back at the picture of the 95% region around the mean. We reject the hypothesis that the population mean equals 0.40 if the Z-value of the sample mean lies in either of the shaded regions on the right or the left. This is an example of a Two-tailed test, that is, we reject if the sample mean lies in either tail.

Sometimes we may want to test the hypothesis that the population mean is greater than a particular number, rather than different from a particular number. For example, we may want to test that the population mean is greater than 0.40. Why would we want to do that?
Suppose the 40% is the success rate of a drug. We have a population (an existing drug that is already in use) that we know has a population mean of 0.40, and we have a preference for greater means (greater success). Success could imply, for example, that the patient is alive after 5 years. Hence the population is distributed binomially: the patient is either alive or not alive. We have to choose between the known population (existing drug) and the population to be sampled (new drug).
We want to continue using the existing drug unless it is UNLIKELY that the new drug has a smaller success rate, that is, that the population to be sampled has a smaller mean. Then the Null Hypothesis becomes μ < 0.40. In this situation what is the rejection region for the Null Hypothesis? The tail on the left can no longer be a rejection region. So only the right tail is the rejection region, and now this has to have the entire probability mass of 5%, so the value of Z for rejection falls to 1.64.

Note that in the above test comparing two drugs, the benefit of the doubt is being given to the existing drug; the Null Hypothesis is that it is the better drug. The Null Hypothesis gets the benefit of the doubt. We are ready to accept that the new drug is an inferior treatment unless the sample result shows that to be UNLIKELY, as in having a sample mean that has a Z > 1.64. This corresponds to sample mean > 0.40 + 1.64 * 0.05477, i.e. sample mean > 0.4898 (sample size 80).

It is not enough for the new drug to have a sample mean merely greater than 0.40; for example, 0.42 won't do. The amount by which 0.42 exceeds 0.40 is not sufficient to reject the Null Hypothesis. Statistics says that if we use 95% one-tailed confidence then the sample mean must exceed 0.4898.

We are favoring the Null Hypothesis, in the sense that we are making it difficult to reject. Even if the sample mean is 0.42 we still do not reject the Null that the population mean of the new drug is less than 0.40. Even though we favor the Null Hypothesis, we may still sometimes make the mistake of rejecting it mistakenly (this can happen with a small probability).
For example, suppose the population mean of the new drug is actually 0.38; then there is still a chance (less than 5%) that we will end up with a sample mean greater than 0.4898. We would then have wrongly rejected the Null Hypothesis. An error of this type is called a Type I Error.
We are more likely to make the error of the other sort, that is, not rejecting the Null Hypothesis even though the new drug has a success rate larger than 0.40. For example, if the success rate of the new drug is 0.42, then the population standard deviation will be σ = (0.42 * 0.58)^(1/2) = 0.4936. Then the Z for the sample mean (of a sample of size 80) that equals 0.4898 is:

Z = (0.4898 - 0.42) / (0.4936/80^(1/2)) = 1.265

The probability of the sample mean exceeding 0.4898 can be found by using the above Z and the table for the Standard Normal Distribution. It is 10.29%. So there is a 100% - 10.29% = 89.71% probability that with a population mean of 0.42 we will erroneously FAIL TO REJECT the Null Hypothesis that the population mean of the new drug is less than 0.40 (while in reality it is 0.42 > 0.40). This is a Type II Error, the error of not rejecting a false Null.
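A Python sketch of this Type II error calculation (erf-based normal CDF assumed); it finds the one-tailed rejection cutoff of about 0.4898 and the roughly 89.7% chance of failing to reject when the true mean is 0.42:

from math import erf, sqrt

def phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n = 80
# Rejection cutoff under the Null (mean 0.40), one-tailed 95% test
sigma_null = sqrt(0.40 * 0.60)
cutoff = 0.40 + 1.64 * sigma_null / sqrt(n)   # about 0.4898

# Suppose the new drug's true success rate is 0.42
mu_true = 0.42
sigma_true = sqrt(0.42 * 0.58)                # about 0.4936
se_true = sigma_true / sqrt(n)

z = (cutoff - mu_true) / se_true              # about 1.265
type_ii = phi(z)                              # P(sample mean falls below the cutoff), about 0.897
print(round(cutoff, 4), round(z, 3), round(type_ii, 4))
print("Power:", round(1 - type_ii, 4))        # about 0.103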

Another quick example: If the population mean for the new drug is 0.4898, then the probability of making a Type II error with the rejection level at 95% (also called the 5% significance level) is 50%. This follows as there is a 50% chance that the sample mean from a population with mean 0.4898 will turn out to be less than 0.4898.
If a test of the Null has a lower probability of making a Type II Error, it is said to possess greater Power. The greater the Power of a test, the more faith we can have in our statistical inference. Tests should be designed to maximize Power for a given probability of making a Type I error. We should be aware of the Power of our tests.
If we use, say, the 99% confidence level (1% significance level) instead of the 95% confidence level, we make it more difficult to reject the Null Hypothesis. This reduces the probability of making a Type I Error, but increases the probability of making a Type II Error. It decreases the Power of the test.

The t-distribution
The example above was of a population which was Binomially distributed. Specifically, such a distribution enabled us to calculate the standard deviation that was implied by the mean of the Null Hypothesis. If the population is not distributed Binomially, then the Null will not be enough to compute the standard deviation, and we will have to form an estimate of the population standard deviation from the sample standard deviation.

When the population standard deviation also has to be estimated, the test statistic becomes the deviation of the sample mean from the population mean, divided by the estimate of the population standard deviation divided by the square-root of N. This statistic is called the t-statistic and it is said to have N-1 degrees of freedom (dof).

Henceforth we will call the sample mean xs and the sample standard deviation ss. These estimators are random variables. From the sample we get one particular estimate for each of these two random variables. We use these estimates to make inferences about the population.

Fortunately for us, the t-distribution has been extensively studied and tables similar to the Standard Normal table have been prepared. So the only change we need to make from the earlier example is to calculate the sample standard deviation, compute the t-statistic, and use the t tables.

If the size of the sample is large (say greater than 30) we can approximate the population standard deviation by the sample standard deviation, and use the Z-statistic instead of the t-statistic. t-statistic tables are available on the internet, for example:
http://www.statsoft.com/textbook/sttable.html

Confidence Intervals
Besides testing a Null Hypothesis that the population parameter (for example the mean μ) is of a particular value, we may also wish, by sampling, to ESTIMATE the population parameter. For finite (especially small) samples we cannot determine the population parameter for sure, but can only make a statement. The statement that we make is in the form of a CONFIDENCE INTERVAL.

In Classical Statistics a Confidence Interval is an Interval Estimator. The Confidence Interval that we compute based on a sample is one particular realization of the Interval Estimator.

A Confidence Interval for a population parameter consists of two endpoints (the interval is between the endpoints) and a confidence level p% (usually 95%). For the population mean, the sample mean is the midpoint of the Confidence Interval.
We consider 4 situations:
1) If the underlying population is distributed normally (which is rare), then sample means are distributed normally (even if the sample size is small). If the population standard deviation σ is known, then we have the confidence interval to be:

xs ± Zα*σ/N^(1/2)

where α = (100% - p%)/2 and Zα is defined as the value for which Z > Zα has a probability mass of α.

2) If the underlying population has a known standard deviation (σ) and the sample size is large (say > 80), then once again we have the confidence interval to be:

xs ± Zα*σ/N^(1/2)

3) If the underlying population has an unknown standard deviation and the sample size is small, then the confidence interval is:

xs ± tα*ss/N^(1/2)

4) If the underlying population has an unknown standard deviation and the sample size is large (say > 80), then the t-distribution can be approximated by the Z-distribution and the confidence interval is:

xs ± Zα*ss/N^(1/2)

Note that the sample standard deviation estimator ss is:

ss = [ Σ_{i=1 to N} (xi - xs)² / (N - 1) ]^(1/2)

Why divide by N-1 rather than N? The intuition is that we are not using the actual mean μ but the fitted mean xs, which will lead to the deviations being underestimated; to correct for that we divide by N-1 rather than N.

Online resource for Stats: http://onlinestatbook.com

Example: Suppose we wish to find the Confidence Interval at the 95% level for estimating the mean of a binomial population (with values 0 or 1). The sample mean is 52.5% for a sample of size 80 (38 0s and 42 1s). First estimate the sample standard deviation ss:

ss² = (38*(0 - 0.525)² + 42*(1 - 0.525)²)/(80 - 1)
=> ss = 0.5025

Our estimate of the standard deviation for the mean is: 0.5025/80^(1/2) = 0.0562. Given the sample size of 80, we can use the Z approximation in place of t. The critical values for the 5% two-tails become ±1.96. The Confidence Interval then becomes:

52.5% ± 1.96 * 5.62%
= 52.5% ± 11.02%
= (41.48%, 63.52%)
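A sketch of the same interval in Python, with the 1.96 critical value taken from the normal table as in the text:

from math import sqrt

n, n_zeros, n_ones = 80, 38, 42
x_bar = n_ones / n                            # 0.525

# Sample standard deviation (divide by N - 1)
ss = sqrt((n_zeros * (0 - x_bar) ** 2 + n_ones * (1 - x_bar) ** 2) / (n - 1))
se = ss / sqrt(n)                             # about 0.0562

low, high = x_bar - 1.96 * se, x_bar + 1.96 * se
print(round(ss, 4), round(se, 4))
print(round(low, 4), round(high, 4))          # close to (41.48%, 63.52%)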
In the newspapers the results of the sampling will be reported as "In our opinion poll, the Reds have 52.5% of the vote and the Blues have 47.5%, with an error of ±11.02%." The confidence level is not reported; the default level is 95% for such sampling.

What does the Confidence Interval of (41.48%, 63.52%) mean? It means that if the population standard deviation is really the one we estimated (which is a fair approximation given the largish sample size of 80), then if the population mean were 41.48%, the probability of getting a sample mean of 52.5% or greater would be 2.5%. And if the population mean were 63.52%, the probability of getting a sample mean of 52.5% or lesser would be 2.5%.
Another way to think of the confidence interval: pick any value from the interval. Assume that it is the population mean. Then the 95% region around that population mean will contain the sample estimate of the mean (0.525). If the sample is large, the 95% region will be given by the z-distribution with the population standard deviation approximated by the sample standard deviation. If the sample is small, then the 95% region is given by the t-distribution.

Yet another way to think of a confidence interval: it is the set of values for which we would fail to reject a Null Hypothesis setting the population mean equal to those values.

What a Confidence Interval in Classical Statistics does NOT mean is that there is a 95% chance that the population mean lies between 41.48% and 63.52%. The population parameter in this sort of statistics is not a random variable, hence to talk about probability with respect to it is meaningless.

Difference of Means
Another example of statistical testing is to determine if two populations are different in terms of their means. This differs slightly from the earlier existing-and-new-drug example. There we knew the population mean of one population (existing drug) and wanted to test if the other drug was better. Here we simply wish to test whether two populations are different, without any predisposition about one having a greater or lesser mean. For example, we may wish to test if the populations of adult females of the US and Canada have the same heights.

We take one sample each from both populations. Both these sample means are normally distributed. We also assume that the population variances (standard deviations) are equal, though we do not know what they are. Under the Null that both populations have the same mean, the following statistic is distributed as t:

t = [(xs1 - xs2) - (μ1 - μ2)] / [sp * (1/n1 + 1/n2)^(1/2)]
where the pooled variance sp² = [(n1 - 1)*ss1² + (n2 - 1)*ss2²] / (n1 + n2 - 2)

Here n1 and n2 are the sizes of the samples of populations 1 and 2 respectively, xs1 and xs2 are the sample means respectively, ss1 and ss2 are the sample standard deviations respectively, and μ1 and μ2 are the population means under the Null Hypothesis respectively.

If we do not assume the population variances are equal, then we have the t-statistic given by:

t = [(xs1 - xs2) - (μ1 - μ2)] / (ss1²/n1 + ss2²/n2)^(1/2)
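A minimal Python sketch of both statistics just described; the two height samples are made-up illustrative numbers, not data from the text:

from math import sqrt

def two_sample_t(x, y, equal_var=True):
    # t-statistic for the Null that the two population means are equal (mu1 - mu2 = 0)
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    v1 = sum((a - m1) ** 2 for a in x) / (n1 - 1)
    v2 = sum((b - m2) ** 2 for b in y) / (n2 - 1)
    if equal_var:
        # Pooled-variance version (assumes equal population variances)
        sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        return (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))
    # Unequal-variance version
    return (m1 - m2) / sqrt(v1 / n1 + v2 / n2)

# Hypothetical height samples (inches) for two populations
us = [64.1, 63.5, 65.2, 64.8, 63.9, 64.4]
ca = [63.8, 64.0, 63.2, 64.6, 63.5, 63.9]
print(round(two_sample_t(us, ca), 3))
print(round(two_sample_t(us, ca, equal_var=False), 3))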

Paired Comparison Test
Did firms increase R&D spending as a percentage of total sales after Congress passed tax breaks for R&D expenditure? Suppose we look at a sample of 25 firms. One way to do the test is to look at the two populations of R&D spending and compare their means using the above tests of unequal means. A test with more power is to instead study the difference in each firm's spending before and after.


That is, take the population
of changes and test whether
its mean is greater than zero.
This is a Paired
Comparison Test and it has
more power because we are
not losing information by

aggregating the R&D


spending.
Suppose the mean for the
population of changes is 7%,
and the standard deviation is
5%. Assume that the
distribution is normal, so the
test will use the t-statistic.

The degrees of freedom (dof) are 25 - 1 = 24. Suppose we test at the 95% confidence level; then the critical value as given by the t-tables is 1.7109. The t-statistic will be:

t24 = 7% / (5% / 25^(1/2)) = 7
As the t-statistic exceeds the critical value, we reject the Null Hypothesis that the tax breaks passed by Congress did not increase R&D expenditure.
The t-tables are given in terms of the right tail. If you want them for two-tailed tests, you have to take the critical value for half the significance level. That is, if you want 95% confidence, which corresponds to 5% significance, then look up the critical value for 2.5% significance. This works because the t-distribution is symmetric (like the normal distribution), so 2.5% on each tail adds up to 5% significance. The 2.5% critical value for only the right tail from the tables with 24 dof is 2.0639, which would be the critical value for the two-tailed test.

How does the Paired Comparison test increase power? Suppose there are 8 firms in the sample, 4 of which move from 5% to 6% and the other 4 move from 25% to 26%. Then if we aggregate, the aggregate moves from 15% to 16%. There is however substantial standard deviation in the samples due to the difference between 5% and 25% and between 6% and 26%. The standard deviations for the two samples (pre and post change) turn out to be about 10%.


If we do the t-test we get:

t14 = 0.01 / (0.10²/8 + 0.10²/8)^(1/2) = 0.2

The above t-stat is too small to reject the Null at any meaningful level (the p-value is greater than 40%). Given this, the 1% change may not show up as significant, and we may fail to reject the Null. The above is a test of differences in means.
However, when we form a sample of differences, all the differences are 1%, making the sample standard deviation zero. This makes the t-stat infinite and the p-value zero! We can reject the Null at all levels of significance. This is a test of the mean of differences.
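A sketch of the 8-firm illustration in Python; because every difference is exactly 1%, the paired standard deviation is exactly zero, so the sketch reports that rather than dividing by zero:

from math import sqrt

pre  = [5, 5, 5, 5, 25, 25, 25, 25]   # R&D as % of sales before the tax break
post = [6, 6, 6, 6, 26, 26, 26, 26]   # and after: every firm moved up by 1%

def mean(v): return sum(v) / len(v)
def sd(v):
    m = mean(v)
    return sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

# Test of difference in means (unpaired): tiny t-stat, cannot reject
n = len(pre)
se_unpaired = sqrt(sd(pre) ** 2 / n + sd(post) ** 2 / n)
print(round((mean(post) - mean(pre)) / se_unpaired, 3))   # about 0.19

# Test of mean of differences (paired): the differences have zero spread
diffs = [b - a for a, b in zip(pre, post)]
print(mean(diffs), sd(diffs))   # 1.0 and 0.0, so the t-stat is unbounded and we reject at any level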
This is an example of how the power of tests can be increased by paying attention to the design of the test. How did the increase in the power of the test happen? In the unequal-means test there was variation in the samples due to variations between firms. We are however interested only in the variation due to the tax breaks that Congress passed. So considering only intra-firm variations enables us to focus on the tax breaks and eliminate the noise due to inter-firm differences, and results in a test of greater power.

Of course, if the samples are independent (say trials of two drugs) then we can't use the Paired Comparison test. However, when possible, when samples are not independent (as in the case of changes in firms' R&D spending) we should use the Paired Comparison (mean of differences).

Tests of Variances
For a normally distributed population we can test whether the variance equals a particular value. This is similar to testing whether the mean equals a particular value, as we have done before, but with the variance substituted for the mean. Suppose the Null Hypothesis H0 and the alternative Hypothesis HA are:

H0: σ² = σ0²
HA: σ² ≠ σ0²
The logic for the test remains the same as for the Classical (Frequentist) approach to statistical testing. That is, suppose the population is normally distributed and has σ² = σ0²; then the test statistic will have a particular distribution. This distribution was Z (standard normal) or t for the statistic to test the mean, and is Chi-squared for testing the variance.

If the test statistic lies beyond the critical value then we conclude our initial assumption that the population conformed to the Null Hypothesis was wrong, and we Reject. If the test statistic lies within the critical values then we Fail to Reject the Null Hypothesis. The Chi-squared test statistic χ² has N-1 degrees of freedom and is:

χ²(N-1) = (N - 1) * ss² / σ0²
The two-tailed Chi-squared test rejects the Null if the value of the test statistic is too small (sample variance much smaller than expected under the Null) or too large (sample variance much larger than expected under the Null). For example, we divide a 5% level of significance into a 2.5% right tail and a 2.5% left tail. The Chi-squared table, and also the F-tables that we will need soon, are available at:
http://www.statsoft.com/textbook/sttable.html
Example: A sample of 25 daily returns for a stock has a variance of 0.20%². Assuming that the process that generates the returns is stable (unchanging), test at the 1% significance level the hypothesis that the variance of the daily stock returns is 0.25%². The dof is 24 and the two tails at the 1% level of significance are 9.8862 and 45.5585.

The test statistic from the sample is:

χ²24 = 24 * 0.20² / 0.25² = 15.36

As 45.5585 > 15.36 > 9.8862, the test statistic does not lie in the region of rejection and we FAIL TO REJECT the Null Hypothesis that the variance of daily stock returns is 0.25%².
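The same test as a Python sketch, with the critical values hard-coded from the Chi-squared table quoted above:

# Chi-squared test that the variance of daily returns equals 0.25%^2
# (1% significance, two-tailed; critical values for 24 dof)
n = 25
sample_var = 0.20 ** 2      # in %^2 units
null_var = 0.25 ** 2

chi2_stat = (n - 1) * sample_var / null_var
lower, upper = 9.8862, 45.5585   # 0.5% and 99.5% points of Chi-squared with 24 dof

print(round(chi2_stat, 2))                                     # 15.36
print("Reject Null:", chi2_stat < lower or chi2_stat > upper)  # False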

Test of equality of variances
Equality of variances can be tested using a test statistic that is the ratio of the variances of the samples and that, under the Null, has an F-distribution. The larger variance is put in the numerator and the smaller in the denominator. The test statistic has two dofs, first the numerator's, then the denominator's. The order of the dofs is important as they are not interchangeable. As before, the dofs are one less than the sample sizes.
The larger variance is put in the numerator, so the test is single (right) tailed. This putting of the larger sample variance on top means that we eliminate the lower tail. So when we test for inequality of variances, if the level of significance is 5%, the right tail should only have a probability of 2.5%.

Example: An analyst is comparing the monthly returns of 2-year T-bonds and bonds issued by GM. She decides to investigate whether the T-bonds have the same variance as the GM bonds by taking a sample of 20 monthly returns of the former and 31 monthly returns of the latter. The variances are 0.001² and 0.0012². What can you conclude if the level of significance is 10%? The critical level for the F-stat at the 10% level of significance with degrees of freedom 30 and 19 is 2.07 (right tail with 5% probability).

The F-statistic for the test of equality of variances is:

F30,19 = 0.0012² / 0.0010² = 1.44

As the F-stat is less than the critical value of 2.07, we cannot reject the Null Hypothesis that the two variances are equal.
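The same F-test as a Python sketch, with the 2.07 critical value taken from the F-table as quoted in the text:

# F-test that the two bond-return variances are equal, 10% significance
# (two-tailed, so 5% in the right tail); larger sample variance in the numerator
var_gm, n_gm = 0.0012 ** 2, 31
var_tb, n_tb = 0.0010 ** 2, 20

f_stat = max(var_gm, var_tb) / min(var_gm, var_tb)
critical = 2.07   # F-table value for (30, 19) degrees of freedom, 5% right tail

print(round(f_stat, 2))                    # 1.44
print("Reject Null:", f_stat > critical)   # False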

Covariance, Correlation Coefficient
These are (descriptive) statistics that measure the characteristics not of one variable (like the mean and standard deviation) but of a pair of variables. The Covariance of a pair of variables is a measure of how much they move together. For example, if one of the variables increases by 5%, then the covariance is a measure of how much we would expect the other variable to have increased. If the variables have a tendency to move in opposite directions the covariance will be negative.

The formula for Covariance is similar to that for Variance, except that it takes the deviations from the mean for the two variables and multiplies them, rather than taking the square of either. The formula for the estimator of the covariance in a sample between variables X and Y is:

Cov(X,Y) = Σ_{i=1 to N} (xi - μX)(yi - μY) / N

We take the pairwise occurrences of the variables X and Y, find their deviations from their means and multiply them. Finally we divide by the number of occurrences to find the covariance. If the population means are not known, then we can use the sample means xs and ys to form an estimate of the covariance.

The correlation coefficient ρXY is the covariance between X and Y normalized (scaled down) by the standard deviations of the two variables:

ρXY = Cov(X,Y) / (σX * σY)

The correlation coefficient lies in the range (-1, +1).
A correlation of +1 means that X and Y are perfectly correlated: every increase (decrease) in X is accompanied by a proportional increase (decrease) in Y. A correlation of -1 means that X and Y are perfectly negatively correlated: every decrease (increase) in X is accompanied by a proportional increase (decrease) in Y.
We denote the estimator of ρXY as rXY. It is a random variable. Sometimes it is called the sample correlation coefficient. Prior to the sample being taken we have the random variable (which we know how to compute using the formula). Post sampling we have one realization of (an estimate of) the estimator.

Outliers are data points that deviate significantly from the norm. As the computation of variances and covariances involves squaring, outliers can have a disproportionate impact on their calculation. For example, a data point that deviates from the mean by 3 units will have the same impact as 9 data points that deviate from the mean by 1 unit in the calculation of the variance. Hence we should examine the impact of outliers on our estimates, and may even exclude them from our calculations.

Correlation and Causality
Correlation is a statistical estimate; it does not imply causality. Just because X and Y are positively correlated does not mean that X causes Y or vice versa. Consider two variables, the number of boats out on Lake Michigan and the number of cars on the highways. If you compute a correlation you will find that they are positively correlated. However it is not the boats that are causing more cars to be on the highways, or vice versa. Rather, both are activities that increase in the summer, leading to the positive correlation. The correlation between boats and cars is "spurious" in the sense that it is not causal.

Testing whether the correlation coefficient is ZERO
We represent the estimator of the correlation coefficient ρXY by rXY. The Null to be tested is:

H0: ρXY = 0

The following test statistic is for a two-tailed test of the Null and is distributed as t with N-2 degrees of freedom:

t(N-2) = rXY * (N - 2)^(1/2) / (1 - rXY²)^(1/2)

Example: You have 22 pairwise observations for daily returns to stocks X and Y. The correlation coefficient is 0.42. Test the Null at the 1% level of significance. The critical level for t at the 1% level of significance with 20 dofs is 2.85. The test statistic has a value:

t20 = 0.42 * 20^(1/2) / (1 - 0.42²)^(1/2) = 2.07

So we cannot reject the Null that the two daily stock returns have a zero correlation (are uncorrelated) at the 1% level of significance.
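A sketch of this test in Python, with the 2.85 critical value taken from the t-table as in the text:

from math import sqrt

# Test H0: correlation = 0, given r = 0.42 from 22 pairwise observations
n, r = 22, 0.42
t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)
critical = 2.85   # t-table value, 20 dof, 1% significance two-tailed

print(round(t_stat, 2))                          # about 2.07
print("Reject Null:", abs(t_stat) > critical)    # False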

Linear Regressions
We now come to an area of statistical inference and estimation that is of particular importance to Economics and Finance: Regressions. The basic single-independent-variable Regression model is:

Yi = α + β*Xi + εi

The above model says that the dependent variable Y is determined by the value of the independent variable X multiplied by a constant β, plus a constant α, plus other influences on Y summarized in the error term ε. The constant β is also called the slope coefficient, and the constant α the intercept.

The reason should be clear from the following graph.

The work now is to estimate the coefficients of the Regression Equation, namely α and β.

If we define the vectors b, X1, X2, and Y:

b = [α, β]
X1 = [1, 1, ..., 1]
X2 = [X1, X2, ..., XN] (the data values of the independent variable)
Y = [Y1, Y2, ..., YN]

where X1 and X2 represent the 1st and 2nd columns of the Nx2 matrix X. The first column can be thought of as using 1s (a constant) as a variable to extract the coefficient α.

That is, like the other coefficient β, the coefficient α is also multiplied by a variable, except that the variable happens to be the constant 1. The linear regression model becomes:

Y = Xb + ε

Then the Ordinary Least Squares (OLS) estimator for the coefficients is the 2-dimensional vector:

bOLS = (X'X)^(-1) X'Y

with the first element of bOLS the intercept α and the second element the slope β.
If you do not want vector notation, you can use the summary statistics:

Sx = x1 + x2 + ... + xN
Sy = y1 + y2 + ... + yN
Sxx = x1² + x2² + ... + xN²
Sxy = x1y1 + x2y2 + ... + xNyN

βOLS = (N*Sxy - Sx*Sy) / (N*Sxx - Sx*Sx)
αOLS = (Sy - βOLS*Sx) / N
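A minimal Python sketch of the summary-statistic formulas, run on a small made-up data set (the x and y values are purely illustrative):

# OLS slope and intercept from the summary statistics
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
N = len(xs)

Sx  = sum(xs)
Sy  = sum(ys)
Sxx = sum(x * x for x in xs)
Sxy = sum(x * y for x, y in zip(xs, ys))

beta  = (N * Sxy - Sx * Sy) / (N * Sxx - Sx * Sx)
alpha = (Sy - beta * Sx) / N

print(round(alpha, 3), round(beta, 3))   # intercept and slope of the fitted line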

Let the vector of residuals generated by the regression be e. It is an N-dimensional vector. OLS chooses the coefficients such that the sum of squares of the ei (i = 1, 2, ..., N) is minimized.

The method was applied in 1801 by Gauss to predict the position of the asteroid Ceres. Other mathematicians who had tried to predict (forecast) the asteroid's position had been unsuccessful. Legendre shares the credit with Gauss for discovering OLS.

The OLS estimator has some desirable features, provided some conditions are satisfied, prominently:
1) X and ε are uncorrelated
2) ε has the same variance across different data points
3) ε has a zero expected value (note that α and β can be defined to make this true)
4) ε is not serially correlated (that is, εi is uncorrelated with εi+1, εi+2, etc.)

If these conditions are satisfied the OLS estimator is BLUE (Best Linear Unbiased Estimator). Best as in lowest variance; unbiased implies that if you found the OLS estimators repeatedly, their average would be α and β.

If the conditions for OLS are not satisfied, a variety of other estimators have been designed by statisticians, for example:
1) Generalized Least Squares (GLS) if ε does not have the same variance across different data points.
2) Instrumental Variable estimators if X and ε are correlated.

Just like we had a variance for the estimator of the mean of a population, we have variances for the OLS estimators. As of now, I will not get into how the variances are calculated; rather I will provide you the variance. Usually exam questions and statistical packages provide the variances rather than expect you to calculate them.

Hypothesis Testing
The most common Null Hypothesis to be tested for the linear regression is that one of the coefficients is zero. As before, the Null is rejected if the value of the coefficient estimate divided by its standard deviation exceeds (in magnitude) some critical value (usually either Z or t).

Example: A regression with 50 data points produces an estimate of 2.27 for the βOLS estimator. The sample estimate of its standard deviation ssβ is 1.14. Test, at the 95% level, the Null that the slope coefficient is 0 against the alternative that it is greater than 0. As the number of data points is large at 50, we can use the Z-distribution. The critical value for the Z-stat at the one-tailed 95% level is 1.64.

test statistic = 2.27 / 1.14 = 1.99

As the test statistic is more than the critical value, we reject the Null Hypothesis that the slope coefficient is 0.

Following the Frequentist approach, we can form Confidence Intervals (which are estimators) for the coefficients of the linear model with critical values tC from the t-table:

βOLS ± tC*ssβ
αOLS ± tC*ssα

where ssβ and ssα are the standard deviations of the βOLS and αOLS estimators.

Example: A regression with 70 data points yields αOLS and βOLS estimates of 1.57 and 2.49. The standard deviations ssα and ssβ are 0.49 and 1.08. Estimate the 95% confidence intervals for the coefficients. As the number of data points is large, we can use the two-tailed 95% Z critical values of ±1.96.

αOLS ± ZC*ssα = 1.57 ± 1.96*0.49 = (0.61, 2.53)
βOLS ± ZC*ssβ = 2.49 ± 1.96*1.08 = (0.37, 4.61)

Analysis of Variance (ANOVA) for Regressions
When we say "Sum of Squares" we mean the Sum of Squared Deviations from the Mean.

The Sum of Squares for Y is called the Total Sum of Squares (SST). The Sum of Squares for βOLS*X is called the Regression Sum of Squares (SSR). This is the amount of variation that would exist for Y if X were the only cause of variation. This is the amount "explained" by the regression.
The Sum of Squares for the residuals e is called the Error Sum of Squares (SSE). This is the amount of variation that the regression does not "explain", and attributes to the error term. It is the "unexplained" part. This is also called the Sum of Squared Residuals, as it is indeed calculated by squaring the estimated residuals and adding.

By algebra we have: SSR + SSE = SST.

The Mean Squared Error (MSE) equals SSE/(N-2). MSE is the average of the squared residual errors.

Normally when we calculate an average we divide by the number of observations N, but here we divide by the number of observations minus 2. Why N-2 rather than N? The answer is that the degrees of freedom are N-2.

At which point you may be inclined to ask "Why is the degrees of freedom N-2?" Without getting into a detailed answer, note that our goal for the regression is to minimize the residuals in some optimal fashion. Our ability to choose the intercept and slope (the two unknowns) implies we can pick these parameters such that 2 of the residuals become zero or any arbitrary number we wish to assign.

We are not really facing a vector of N residuals that are beyond our control, but rather a vector of N-2 residuals. That is, the residuals can vary in N-2 dimensions, rather than N dimensions, hence N-2 degrees of freedom.

The R² (R-squared or Coefficient of Determination) for a regression is:

R² = SSR/SST

The better the data fits the linear regression, the higher the R² will be. If X can explain all the variation in Y, the R² will be 100%. That would mean there would be no variation needing to be explained by the error term, and X and Y would be perfectly correlated.

An R² of zero implies that X can explain none of Y, that is, X is orthogonal to Y; they have zero correlation.

Confidence Interval for Forecasts of Y
We can use the regression coefficients to make Forecasts of the values of Y for given values of X. The Predicted Value of Y is:

Ŷ = αOLS + βOLS*X

The sign over Y is a "hat" and the variable is called "Y-hat". This is standard statistics notation.

Define the Standard Error of Estimate (SEE):

SEE = (SSE/(N - 2))^(1/2)

And then the variance s²FY for the Forecast of Y is:

s²FY = SEE² * [1 + 1/N + (XG - xSX)² / ((N - 1)*sSX²)]

where XG is the given value of X for which Y is predicted, and xSX and sSX are the sample mean and standard deviation of X respectively.

Example: A linear regression with 40 data points yields αOLS and βOLS estimates of 7.45 and 1.93. The SEE from the regression is 0.059, xSX is 1.25 and sSX is 0.31. Estimate the Predicted Y value for X = 1.89 and the 95% confidence interval.

The Predicted Value = 7.45 + 1.93 * 1.89 = 11.10

We next compute the variance for the predicted value of Y to be:

0.059² * (1 + (1/40) + (1.89 - 1.25)²/((40 - 1)*0.31²)) = 0.00394846

This gives the standard deviation to be 0.0628. As N is larger than 30, we can use the Z critical values. If it were smaller, we should use t with N-2 dofs.

The critical values for the 95% confidence interval are ±1.96. The 95% confidence interval is:

11.10 ± 1.96*0.0628
= (10.97, 11.22)
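A sketch of this forecast interval in Python, using the inputs quoted in the example above:

from math import sqrt

alpha_ols, beta_ols = 7.45, 1.93
see, n = 0.059, 40
x_bar, s_x = 1.25, 0.31
x_g = 1.89

y_hat = alpha_ols + beta_ols * x_g    # about 11.10
var_f = see ** 2 * (1 + 1 / n + (x_g - x_bar) ** 2 / ((n - 1) * s_x ** 2))
sd_f = sqrt(var_f)                    # about 0.0628

print(round(y_hat, 2), round(sd_f, 4))
print(round(y_hat - 1.96 * sd_f, 2), round(y_hat + 1.96 * sd_f, 2))   # about (10.97, 11.22)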
