$x_i$ is the value of an
individual data point within
the population, and $N$ is the
total number of data points
in the population
Specifically note that the
symbol is used to represent
So $\mu$ is simply the sum of
the values of all data
points in the population
divided by the population
size: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
$\sigma$ is a measure of how
variable the values are
within the population
To measure $\sigma$ we find the
deviation of an individual
data point from the mean $\mu$,
then square it and add these
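$\mu = \frac{520\text{K} \times 0 + 480\text{K} \times 1}{1{,}000{,}000} = 0.48$
$\sigma^2 = \frac{520\text{K} \times (0 - 0.48)^2 + 480\text{K} \times (1 - 0.48)^2}{1{,}000\text{K}} = 0.2496$
$\Rightarrow \sigma = 0.4996$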
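The arithmetic above can be checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the excerpt does not say which colour is coded 0 and which is coded 1, so the counts are simply labelled by their code.

```python
# Population of 0/1-coded balls, using the counts from the worked example.
# Which colour maps to 0 or 1 is not stated here, so the names are neutral.
count_zero = 520_000  # data points with value 0
count_one = 480_000   # data points with value 1
n = count_zero + count_one

# Population mean: sum of all values divided by the population size.
mu = (count_zero * 0 + count_one * 1) / n

# Population variance: average squared deviation from the mean.
variance = (count_zero * (0 - mu) ** 2 + count_one * (1 - mu) ** 2) / n

# Population standard deviation: square root of the variance.
sigma = variance ** 0.5

print(round(mu, 4), round(variance, 4), round(sigma, 4))
```

Note that the divisor is the full population size N, because this is a population rather than a sample.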
Distributions:
The population of Red and
Blue Balls is an example of a
binomial (discrete)
distribution: a data point can
take only one of two values
probability (tending to
Zero).
For example a height of
exactly 5'10". You will be
hard put to find anyone in a
population who is absolutely
exactly 5'10". Even those
who are very very close will
$Z = \frac{\text{Variable Value} - \text{Mean}}{\text{Std Dev}}$
For example, a variable that
is one standard deviation
above the mean will have a
value for Z = 1
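As a small illustration of the Z formula, here is a sketch in Python. The function name `z_score` and the numbers (a mean of 70 and a standard deviation of 3) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations by which `value` lies above the mean."""
    return (value - mean) / std_dev


# Hypothetical numbers: mean 70, standard deviation 3.
one_sd_above = z_score(73.0, 70.0, 3.0)   # one standard deviation above
two_sd_below = z_score(64.0, 70.0, 3.0)   # two standard deviations below

print(one_sd_above, two_sd_below)
```

A negative Z simply means the value lies below the mean.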
Inferential Statistics
Think again of the bowl with
many Red Balls and Blue
Balls
Now instead of counting the
number of Reds and Blues
A probability of 1 is a
Certain event. The reason
we can add the probabilities
of 1/N is because the events
(of a ball being picked) are
Mutually Exclusive. That
is, if one ball is picked then
another can't be picked.
probability (density)
distribution curve. If the
range is small we can
approximate the area
(probability) as the average
of densities at the endpoints
of the range, multiplied by
the length of the range.
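The approximation just described can be tried out numerically. The sketch below assumes a standard normal density purely as an example (the text does not tie the rule to any particular curve), and the helper names are invented for illustration. It compares the "average of endpoint densities times length" estimate with the exact area.

```python
import math


def normal_pdf(x, mean=0.0, std_dev=1.0):
    """Density of a normal distribution at x."""
    z = (x - mean) / std_dev
    return math.exp(-0.5 * z * z) / (std_dev * math.sqrt(2 * math.pi))


def normal_cdf(x, mean=0.0, std_dev=1.0):
    """Exact cumulative probability P(X <= x) for a normal distribution."""
    return 0.5 * (1 + math.erf((x - mean) / (std_dev * math.sqrt(2))))


def approx_prob(a, b, pdf=normal_pdf):
    """Average of the densities at the endpoints, times the range length."""
    return 0.5 * (pdf(a) + pdf(b)) * (b - a)


a, b = 1.0, 1.1  # a small range, chosen arbitrarily
approx = approx_prob(a, b)
exact = normal_cdf(b) - normal_cdf(a)

print(approx, exact)
```

For a range this narrow the two numbers agree to several decimal places; widening the range makes the approximation worse.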
2) Bayesian Statistics
Though there is
disagreement between the
above two approaches, each
is logically valid. They just
view the world differently
and ask different questions.
Classical Statistics is by far
As a practical matter, a
sample larger than 30 can be
thought of as a large
sample.
deviation of $0.4996 / 80^{1/2} = 0.05586$
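The figure above is the population standard deviation divided by the square root of the sample size. A minimal check in Python, with variable names chosen only for illustration:

```python
import math

sigma = 0.4996      # population standard deviation from the example
sample_size = 80    # number of data points in each sample

# Standard deviation of the sample mean (often called the standard error).
standard_error = sigma / math.sqrt(sample_size)

print(round(standard_error, 5))
```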
Or, when you do repeated
sampling of the population
many times, you will see that
about 68.26% of the means
of the samples lie between
The p-value
With a Z of 2.28, the area
under the curve for the right
and left tails (area for which
Z < -2.28 and Z > +2.28)
equals 1.13% * 2 = 2.26% =
0.0226. This is referred to as
of significance = 5% level of
rejection
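The tail area quoted for Z = 2.28 can be reproduced without a printed table. The sketch below uses the complementary error function from Python's standard library; the function name `two_tailed_p_value` is hypothetical.

```python
import math


def two_tailed_p_value(z):
    """Area under the standard normal curve where |Z| exceeds |z|."""
    return math.erfc(abs(z) / math.sqrt(2))


p = two_tailed_p_value(2.28)

print(round(p, 4))   # combined area of the left and right tails
print(p < 0.05)      # compare against a 5% level of significance
```

Each tail holds half of this area, which matches the 1.13% per tail used above.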
The t-distribution
The example above was of a
population which was
Binomially distributed.
Specifically, such a
distribution enabled us to
calculate the standard
http://www.statsoft.com/textbook/sttable.html
Confidence Intervals
Besides testing Null
Hypothesis that the
population parameter (for
example the mean ) is of a
particular value, we can also
wish by sampling to
of a CONFIDENCE
INTERVAL.
In Classical Statistics a
Confidence Interval is an
Interval Estimator. The
Confidence Interval that we
compute based on a sample
$\bar{x}_s \pm Z_\alpha \cdot \sigma / N^{1/2}$
where $\alpha = (100\% - p\%)/2$
and
$Z_\alpha$ is defined as $Z > Z_\alpha$
having a probability
mass of $\alpha$.
2) If the underlying
population has a known
standard deviation ($\sigma$) and
sample size is large (say >
80) then once again we have
the confidence interval to be:
$\bar{x}_s \pm Z_\alpha \cdot \sigma / N^{1/2}$
3) If the underlying
population has an unknown
standard deviation and the
sample size is small, then the
confidence interval is:
$\bar{x}_s \pm t_\alpha \cdot s_s / N^{1/2}$
4) If the underlying
population has an unknown
standard deviation and the
sample size is large (say >
80), then the t distribution
can be approximated by the
Z distribution and the
confidence interval is:
$\bar{x}_s \pm Z_\alpha \cdot s_s / N^{1/2}$
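The Z-based intervals above can be sketched in Python as follows. The function name is hypothetical, and the inputs (a sample mean of 0.525, a standard deviation of 0.5025, and 80 data points) are the sample figures used in this text's worked example. The t-based interval is not shown because the standard library has no t-distribution quantile.

```python
import math
from statistics import NormalDist


def z_confidence_interval(sample_mean, std_dev, n, confidence=0.95):
    """Interval: sample mean plus or minus Z_alpha * std_dev / sqrt(n)."""
    alpha = (1 - confidence) / 2               # probability mass in one tail
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # P(Z > z_alpha) = alpha
    half_width = z_alpha * std_dev / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


low, high = z_confidence_interval(0.525, 0.5025, 80)

print(round(low, 4), round(high, 4))
```

With a 95% confidence level, alpha is 2.5% and Z_alpha is roughly 1.96.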
Note that the sample
standard deviation
estimator s is:
standard deviation s.
$s_s^2 = \frac{38 \times (0 - 0.525)^2 + 42 \times (1 - 0.525)^2}{80 - 1}$
$\Rightarrow s_s = 0.5025$
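The same calculation can be sketched in Python by rebuilding the sample as 38 zeros and 42 ones. The variable names are illustrative; the cross-check uses `statistics.stdev`, which also divides by N-1.

```python
import math
from statistics import stdev

# Sample of 80 data points: 38 coded 0 and 42 coded 1.
sample = [0] * 38 + [1] * 42
n = len(sample)

sample_mean = sum(sample) / n

# Sample variance divides by (n - 1), not n.
s_squared = sum((x - sample_mean) ** 2 for x in sample) / (n - 1)
s = math.sqrt(s_squared)

print(sample_mean, round(s, 4))
print(abs(s - stdev(sample)) < 1e-9)  # agrees with the library routine
```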
population standard
deviation approximated to
the sample standard
deviation.
If the sample is small, then
the 95% region is given by
the t-distribution.
Difference of Means
Another example of
statistical testing is to
determine if two populations
are different in terms of their
means. This differs slightly
from the earlier existing and
$t_{24} = \frac{7\%}{5\% / 25^{1/2}} = 7$
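The t-statistic above has the usual shape of a mean divided by its standard error. The sketch below assumes, from the form of the formula, that 7% is the sample mean, 5% is the sample standard deviation, 25 is the sample size, and the Null value is zero; the function name is hypothetical.

```python
import math


def one_sample_t(sample_mean, sample_std, n, null_mean=0.0):
    """t-statistic: (sample mean - null mean) / (sample std / sqrt(n))."""
    return (sample_mean - null_mean) / (sample_std / math.sqrt(n))


# Values in percent, assumed from the worked formula.
t_stat = one_sample_t(7.0, 5.0, 25)
degrees_of_freedom = 25 - 1

print(t_stat, degrees_of_freedom)
```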
As the t-statistic exceeds the
critical value we will reject
the Null Hypothesis that the
tax breaks passed by
is a test of differences in
means.
However when we form a
sample of differences, all
differences are 1%, making
the sample standard
deviation zero. This makes
the t-stat infinite and the p-
Tests of Variances
For a normally distributed
population we can test
whether the variance equals
a particular value. This is
similar to testing whether the
mean equals a particular
$H_A: \sigma^2 \neq \sigma_0^2$
The logic for the test remains
the same as for the Classical
(Frequentist) approach to
statistical testing. That is,
suppose the population is
normally distributed and has
$\sigma^2 = \sigma_0^2$; then the test statistic
The two-tailed Chi-squared
test rejects the Null if the
value of the test statistic is
too small (sample variance
much smaller than expected
under the Null) or too large
(sample variance much
http://www.statsoft.com/textbook/sttable.html
Example: A sample of 25
daily returns for a stock has
a variance of $(0.20\%)^2$.
Assuming that the process
that generates the returns is
stable (unchanging) test at
$\chi^2_{24} = \frac{24 \times 0.20^2}{0.25^2} = 15.36$
As 45.5585 > 15.36 > 9.8862
the test statistic does not lie
in the region of rejection and
we FAIL TO REJECT the
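The test statistic and the comparison against the two critical values can be sketched in Python. The function name is hypothetical, and the critical values 9.8862 and 45.5585 are copied from the comparison above rather than computed, since the standard library has no Chi-squared quantile function.

```python
def chi_squared_statistic(sample_variance, null_variance, n):
    """(n - 1) * sample variance / variance assumed under the Null."""
    return (n - 1) * sample_variance / null_variance


# Variances in percent squared, from the worked example.
stat = chi_squared_statistic(0.20 ** 2, 0.25 ** 2, 25)

# Critical values for 24 degrees of freedom, as quoted in the text.
lower_critical, upper_critical = 9.8862, 45.5585

# Two-tailed test: reject if the statistic is too small or too large.
reject_null = stat < lower_critical or stat > upper_critical

print(round(stat, 2), reject_null)
```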
Example: An analyst is
comparing the monthly
returns of 2 year T-bonds
and bonds issued by GM.
She decides to investigate
Covariance, Correlation
Coefficient
These are (descriptive)
statistics that measure the
characteristics not of one
variable (like mean and
correlated, every P%
increase (decrease) in X will
result in a P% increase
(decrease) in Y.
A correlation of -1 means that
X and Y are perfectly
negatively correlated, every
P% decrease (increase) in X
$H_0: \rho_{XY} = 0$
The following test statistic is
for a two-tailed test of the
Null and is distributed as t
with N-2 degrees of freedom:
$t = \frac{0.42 \times 20^{1/2}}{(1 - 0.42^2)^{1/2}} = 2.07$
So we cannot reject the Null
that the two daily stock
returns have a zero
correlation (uncorrelated) at
the 1% level of significance.
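The correlation test statistic can be sketched in Python. The function name is hypothetical, and the sample size of 22 is inferred from the 20 degrees of freedom (N-2) that appear in the worked formula; it is not stated directly in this excerpt.

```python
import math


def correlation_t_stat(r, n):
    """t-statistic for a sample correlation r, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# r = 0.42 from the example; n = 22 is inferred so that n - 2 = 20.
t_stat = correlation_t_stat(0.42, 22)

print(round(t_stat, 2))
```

The statistic is then compared with the critical t value for N-2 degrees of freedom at the chosen level of significance.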
Linear Regressions
We now come to an area of
statistical inference and
estimation that is of
particular importance to
Economics and Finance:
Regressions.
$Y_i = \alpha + \beta X_i + \varepsilon_i$
The above model says that
the dependent variable Y
is determined by the value of
$Y = Xb$
$b_{OLS} = (X'X)^{-1} X'Y$
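$S_x = x_1 + x_2 + \dots + x_N$
$S_y = y_1 + y_2 + \dots + y_N$
$S_{xx} = x_1^2 + x_2^2 + \dots + x_N^2$
$S_{xy} = x_1 y_1 + x_2 y_2 + \dots + x_N y_N$
$\beta_{OLS} = \frac{N \cdot S_{xy} - S_x \cdot S_y}{N \cdot S_{xx} - S_x \cdot S_x}$
$\alpha_{OLS} = \frac{S_y - \beta_{OLS} \cdot S_x}{N}$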
of the $e_i$ ($i = 1, 2, \dots, N$) is
minimized.
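The sums-based estimator can be sketched in Python as follows. The function name and the four data points are made up for illustration. The slope uses the standard closed form, whose denominator is N*Sxx minus Sx*Sx.

```python
def ols_from_sums(xs, ys):
    """Intercept and slope of a one-variable OLS fit from running sums."""
    n = len(xs)
    s_x = sum(xs)
    s_y = sum(ys)
    s_xx = sum(x * x for x in xs)
    s_xy = sum(x * y for x, y in zip(xs, ys))

    # Slope: (N*Sxy - Sx*Sy) / (N*Sxx - Sx*Sx)
    beta = (n * s_xy - s_x * s_y) / (n * s_xx - s_x * s_x)
    # Intercept: (Sy - beta*Sx) / N
    alpha = (s_y - beta * s_x) / n
    return alpha, beta


# Hypothetical data, roughly following y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

alpha, beta = ols_from_sums(xs, ys)
residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]

print(round(alpha, 4), round(beta, 4))
```

With an intercept in the model, the fitted residuals sum to zero up to rounding, which is a quick sanity check on the estimates.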
Hypothesis Testing
The most common Null
Hypothesis to be tested for
the linear regression is that
one of the coefficients is zero.
As before, the Null is
rejected if the value of the
$\alpha_{OLS} \pm t_C \cdot s_\alpha$
$\beta_{OLS} \pm t_C \cdot s_\beta$
where $s_\alpha$ and $s_\beta$ are the
standard deviations for the
$\alpha_{OLS}$ and $\beta_{OLS}$ estimators.
Analysis of Variance
(ANOVA) for Regressions
When we say Sum of
Squares we mean Sum of
Squared Deviations from the
Mean
The $R^2$ (R-squared or
Coefficient of
Determination) for a
regression is:
$R^2 = SSR/SST$
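A minimal sketch of this ratio in Python is given below. It assumes SSR means the regression (explained) sum of squares and SST the total sum of squares, both measured as squared deviations from the mean of Y; the function name and the data are hypothetical.

```python
def r_squared(xs, ys):
    """R-squared of a one-variable OLS fit, computed as SSR / SST."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # OLS slope and intercept in deviation-from-mean form.
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    alpha = mean_y - beta * mean_x
    fitted = [alpha + beta * x for x in xs]

    # SSR: squared deviations of the fitted values from the mean of Y.
    ssr = sum((f - mean_y) ** 2 for f in fitted)
    # SST: squared deviations of the observed values from the mean of Y.
    sst = sum((y - mean_y) ** 2 for y in ys)
    return ssr / sst


noisy = r_squared([1.0, 2.0, 3.0, 4.0], [3.1, 4.9, 7.2, 8.8])
perfect = r_squared([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

print(round(noisy, 4), round(perfect, 4))
```

Data lying exactly on a straight line gives a ratio of 1, while noisier data gives a smaller value.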
The better the data fits the
linear regression, the higher