Statistical Methods (Jadhav)

Statistical Methods
Friday, September 03, 2010 11:00 AM
Statistics: employment down 54,000 -> government (laid off 125,000) -> private sector (gained 67,000)
- The stock market goes up when employment goes down
- Look behind the numbers
- They take a sample and use it to represent the population

(Receipts + Deficits) = Govt. expenditure
- entitlements
- services (highways, education, etc.)
- discretionary -> defense, $600 billion
y = a + b * (other expenditure)

Sample data:
- Qualitative
- Quantitative
  - Discrete: 0, 1, 2, 3, ..., 100
  - Continuous: e.g., 2.1

Systematic sampling
- Take every third item, or every k-th item of a large section
- Example: beer bottles; every thousandth bottle is tested to make sure all contain 12 oz

Random sampling
- Every member of the universe has an equal probability of being included in the sample

Ex. heights: 5'3'' 5'4'' 5'7'' 5'8'' 5'9'' 5'11'' 6'0'' 6'2''
The average does not have to be a number in the sample.
The class boundaries must be mutually exclusive.

Class interval    Frequency (f)   Midpoint (Xm)   f*Xm
5'3'' - 5'7''     3               5'5''           3 x 5'5'' = 16'3''
5'8'' - 5'11''    3               5'9''           3 x 5'9'' = 17'3''
6'0'' and up      2               6'1''           2 x 6'1'' = 12'2''
Total             8                               45'8''

Average height in the class = 45'8'' / 8 = 5'8.5''

Frequency histogram: classes on the x-axis, frequency on the y-axis, with the average marked.
Presentation:
[Figures: scatter plot using the raw data (5'3'' to 6'2''); frequency histogram at the class midpoints 5'5'', 5'9'', 6'1''; ogive ("less than" and "more than"); skewed distributions (negative skew, positive skew).]
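Presentation is best when there is a story that is simple and tells exactly what's going on.
Product life cycle (Q/sales on the y-axis, time on the x-axis): introduction, growth, maturity, decline
- Slope in the decline phase: y = a - bx
- Growth toward a limit L (logistic curve): \( y = L / (1 + a e^{-bt}) \)
e = base of the natural logarithm = 2.72 (a log is the power to which the base is raised to give the number)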
Central Tendency
Mean: (central limits)
Standard deviation = 0 (wife shoots husband after confession)
Ex. Polls (sample), +/- 4: Obama 52%, Other 48%. Pay attention to the numbers: within the margin these are essentially the same.

            Size   Mean
Parameter   N      \( \mu = \sum_{i=1}^{N} X_i / N \)
Statistic   n      \( \bar{x} = \sum_{i=1}^{n} X_i / n \)

p.60 exercise 1: n = 5, \( \sum X_i = 27 \), \( \bar{x} = 27/5 = 5.4 \)
X    X - mean
6    0.6
3    -2.4
5    -0.4
7    1.6
6    0.6
The deviations from the mean sum to zero: -2.8 + 2.8 = 0
#13 p.62
W1 = 300 shares @ $20 per share
W2 = 400 shares @ $25 per share
W3 = 400 shares @ $23 per share
Weighted mean: \( (W_1 X_1 + W_2 X_2 + W_3 X_3) / (W_1 + W_2 + W_3) = 25{,}200 / 1{,}100 = 22.9091 \)
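The weighted mean in exercise #13 can be checked with a few lines of Python. This is a minimal sketch; the function name `weighted_mean` is an illustrative assumption and not part of the course material.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

prices = [20, 25, 23]     # price per share (X)
shares = [300, 400, 400]  # number of shares bought at each price (W)

average_price = weighted_mean(prices, shares)
print(round(average_price, 4))
```

The simple (unweighted) mean of the three prices would be 22.67; weighting by shares pulls the average toward the prices at which more shares were bought.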
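MEAN: simple, weighted
- Arithmetic mean \( \bar{x} \)
- Geometric mean \( = \sqrt[3]{x_1 x_2 x_3} \) for three values -> for positive values the arithmetic mean is never lower than the geometric mean; the two are equal only when all values are equal.
MEDIAN: the middle value in ordered data (the mid-point)
MODE: the most frequently occurring value
[Figure: skewed distributions showing the order of mode, median, and mean.]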
Two types of DATA:

1) Time series (time on the x-axis, data on the y-axis)
2) Cross-sectional: one period (e.g., 2008 or 2009) across units such as states in the USA (Alabama, Alaska, ...), with values f such as 120, 110
Annual equilibrium

Class interval   Frequency (f)   Midpoint (Xm)   f*Xm
5.5-10.5         1               8               8
10.5-15.5        2               13              26
15.5-20.5        3               18              54
20.5-25.5        5               23              115
25.5-30.5        4               28              112
30.5-35.5        3               33              99
35.5-40.5        2               38              76
Total            n = 20                          490

Mean: \( \bar{x} = \sum f X_m / n = 490/20 = 24.5 \)
Mode: the group with the highest frequency is the modal class, here 20.5-25.5 (f = 5, midpoint Xm = 23)
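Dispersion
How far is each value from the mean?
Range: it's best to choose the distribution with the shorter range, because the data are more uniform.
[Figure: distribution from 5.5 to 40.5 with mean \( \bar{x} = 24.5 \).]
Absolute value (every sign is positive): mean absolute deviation \( = \sum |x - \bar{x}| / n \)
Standard deviation: sample \( s = \sqrt{\sum (x - \bar{x})^2 / (n-1)} \), population \( \sigma = \sqrt{\sum (x - \mu)^2 / N} \)
Standard deviations from the mean: \( z = (x - \bar{x}) / s \); about 68% of values lie within 1s, 95% within 2s, 99.7% within 3s
Population mean is more confident the further it is
Variance (grouped data): \( s^2 = [\sum f X_m^2 - (\sum f X_m)^2 / n] / (n-1) \)
\( = [13310 - (490)^2/20] / 19 = 68.7 \), so \( s = \sqrt{68.7} = 8.3 \)
Ch. 1-4 summary:
- How many observations lie beneath the distribution?
- What is the average (the mean)?
- How far are the values spread from it (standard deviation)?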
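The grouped-data mean and variance can be reproduced from the midpoints and frequencies of the class table. The sketch below assumes the sample (n - 1) form of the variance, as used in the notes; the variable names are illustrative.

```python
from math import sqrt

midpoints = [8, 13, 18, 23, 28, 33, 38]  # Xm for each class
freqs = [1, 2, 3, 5, 4, 3, 2]            # f for each class

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))       # sum of f * Xm
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))  # sum of f * Xm^2

mean = sum_fx / n
variance = (sum_fx2 - sum_fx**2 / n) / (n - 1)
std_dev = sqrt(variance)

print(n, sum_fx, sum_fx2)
print(mean, round(variance, 1), round(std_dev, 2))
```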
Probability
The chance that something is going to happen
Experiment: results in an outcome
Sample space: all possible outcomes. Event: one or more of those outcomes.
Coin: H/T. Die: 1 -> 6.
P(2) = 1/6
P(even) = P(odd) = 3/6 = 0.5
Venn diagram: # UD students, 100 of 3,000 -> in sample (red, blue)
Conditional probability
[Figure: groups of 20 and 30] Probability with replacement

Classical
P(E) = (# of favorable outcomes) / (total # of possibilities) = n(E) / n(S)
= 4/52 (to get an ace)
P(E) = 0.08, \( P(\bar{E}) \) = 0.92 (the alternative)
\( P(E) + P(\bar{E}) = 1 \), \( 0 \le P(E) \le 1 \)

Empirical
P(E) = (# of times the event occurred) / (total # of observations)

Subjective
Knowledge or authority; there is not necessarily data to support the decision

1 ace or 1 king: 0.08 + 0.08 -> 8/52 = 0.1538
1 ace and 1 king (two draws, with replacement): 0.08 * 0.08 = 0.0064
Coin (2 consecutive heads): H, H -> 1/2 * 1/2 = 0.5 * 0.5 = 0.25
Efficiency, Bias
[Venn diagram: R, R&B, B]
Q: 10 t/f 10 mult choice example
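Probability of A or B (mutually exclusive): P(A or B) = P(A) + P(B); \( P(A) + P(\bar{A}) = 1 \)
Blood types (n = 50), frequency f: A B O AB = 22 5 2 21
P(A) = f/n = 22/50; P(not A) = 28/50
Not mutually exclusive: P(A or B) = P(A) + P(B) - P(A and B)
[Venn diagram: A, B, and the overlap "A and B"]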
Independent: the probability of selecting A doesn't affect the probability of selecting B
King 4/52, Queen 4/52 (with replacement)
P(A and B) = P(A) * P(B); P(K and Q) = (4/52) * (4/52)
Not independent
P(A and B) = P(A) * P(B | A): King 4/52, Queen 4/51 (without replacement)

P(B | A) = P(A and B) / P(A)
Probability checklist: Is it mutually exclusive? See if the events are independent.
Permutations: ABC ACB BAC BCA CAB CBA
nPr = n! / (n-r)!; 3P3 = 3! / (3-3)! = (3 x 2 x 1) / 0! = 6/1 = 6
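Raw data: mean, standard deviation; standardize the distribution
Coefficient of variation: \( CV = s / \bar{x} \)
[Figure: distribution centered at 25 with -1s at 20 and +1s at 30 (s = 5); grouped table with classes 10-20 and 20.5-30, columns Xm and f*Xm.]
Combinations
nCr = n! / ((n-r)! * r!)
30C3 = 30! / (27! * 3!) = (30 x 29 x 28) / (3 x 2 x 1) = 4,060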
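The permutation and combination counts above can be verified with Python's standard library (`math.perm` and `math.comb` require Python 3.8 or later). This is only a quick check of the arithmetic, not part of the original notes.

```python
from itertools import permutations
from math import comb, perm

# All orderings of three letters: 3P3 = 3! / 0! = 6
orderings = ["".join(p) for p in permutations("ABC")]
print(orderings)

print(perm(3, 3))   # nPr = n! / (n - r)!
print(comb(30, 3))  # nCr = n! / ((n - r)! * r!)
```

Order matters for permutations (ABC and ACB count separately) and does not matter for combinations, which is why nCr divides by an extra r!.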
Midpoint is the average of the values
Three coin tosses: 2 x 2 x 2 = 8 outcomes
HHH HHT HTH HTT THH THT TTH TTT, each with P = 1/8 = 0.125
Mean: \( \mu = \sum x P(x) \). Variance: \( \sigma^2 = \sum (x - \mu)^2 P(x) \)

Outcome    X   P(x)          x*P(x)   (x - mu)   (x - mu)^2   (x - mu)^2 * P(x)
3 h        3   1/8 = .125    .375     1.5        2.25         .28
2 h 1 t    2   3/8 = .375    .75      0.5        0.25         .09
1 h 2 t    1   3/8 = .375    .375     -0.5       0.25         .09
0 h 3 t    0   1/8 = .125    0        -1.5       2.25         .28
Total                        1.5                              0.75

\( \mu = 1.5 \)
\( \sigma^2 = 0.75 \) (the rounded column adds to 0.74), \( \sigma = \sqrt{0.75} = 0.87 \)
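The distribution of the number of heads in three tosses can be built by enumerating the 8 outcomes. The sketch below uses unrounded probabilities, so the variance comes out as 0.75 rather than the 0.74 obtained by adding rounded terms; the variable names are illustrative.

```python
from collections import Counter
from itertools import product
from math import sqrt

outcomes = list(product("HT", repeat=3))  # 2 x 2 x 2 = 8 outcomes
heads_count = Counter(o.count("H") for o in outcomes)

# P(x) for x = number of heads
dist = {x: c / len(outcomes) for x, c in heads_count.items()}

mean = sum(x * p for x, p in dist.items())
variance = sum((x - mean) ** 2 * p for x, p in dist.items())
std_dev = sqrt(variance)

print(sorted(dist.items()))
print(mean, variance, round(std_dev, 2))
```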
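Binomial
1. Limited # of trials = n (three tosses = three trials)
2. Only 2 possibilities
3. P(success) = p, P(failure) = q, p + q = 1
\( P(x) = {}_nC_x \, p^x q^{n-x} \), where \( {}_nC_x = n! / ((n-x)! \, x!) \)
Ex. 2 heads in 3 tosses: 3C2 * 0.5^2 * 0.5^1 = [3! / (1! * 2!)] * 0.25 * 0.5 = 3 * 0.25 * 0.5 = .375
p.760, Appendix B.9: binomial distribution table
- Once you know n, p, q: \( \mu = np \), \( \sigma^2 = npq \)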
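The binomial formula above translates directly into code. This is a minimal sketch; `binomial_pmf` is an illustrative name, and the check uses the three-toss example (n = 3, p = 0.5).

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x successes in n trials) = nCx * p^x * q^(n - x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p**x * q ** (n - x)

n, p = 3, 0.5
p_two_heads = binomial_pmf(2, n, p)
all_probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

mean = n * p                # mu = n * p
variance = n * p * (1 - p)  # sigma^2 = n * p * q

print(p_two_heads, all_probs, mean, variance)
```

The mean and variance from the shortcut formulas (1.5 and 0.75) agree with the values computed from the full probability table for three tosses.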

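# on balls: 0, 1, 2, 3, 4, each with P(x) = 1/5 = .2
\( \mu = \sum x P(x) \)
\( \sigma^2 = \sum (x - \mu)^2 P(x) \)
Simple formula to find the variance: \( \sigma^2 = \sum x^2 P(x) - \mu^2 \)
\( \mu = 0 + 0.2 + 0.4 + 0.6 + 0.8 = 2 \)
\( \sigma^2 = (0 + 1 + 4 + 9 + 16)(0.2) - 2^2 = (30 \times 0.2) - 4 = 6 - 4 = 2 \)
\( \sigma = \sqrt{2} = 1.4 \)
For equally likely values, \( \mu = \sum x / n \) = mean
[Figure: normal curve with areas .023, .136, .34, .34, .136, .023 between -3σ, -2σ, -1σ, μ, +1σ, +2σ, +3σ; μ = 2, σ = 1.4.]
Standard normal: \( z = (x - \mu) / \sigma \), e.g., (2 - 2) / 1.4 = 0
\( z \sigma = x - \mu \)
Area under the normal curve: \( z = (X_i - \mu) / \sigma \), so \( X_i = \mu + \sigma z \)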
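The closer we are to the mean the more accurate we are. If the sample size is 30 or more, repeated samples give about the same result.
Ex. μ = 50, σ = 15: one standard deviation below the mean is X = 50 - 15 * 1 = 35; the area between the mean and 1σ is .3413 on each side.
Table p.750: find the row for z = 1.0 and the .04 column to read the area for z = 1.04.
[Figure: normal curve on a 0 to 100 scale, μ = 50, σ = 15, with .3413 on each side of the mean out to 1σ.]
Ex. μ = 80, σ = 14: find Xi with 80% of the values above it. The area between Xi and the mean is .30, so z = -.84:
Xi = 80 - 14 * .84 = 80 - 11.76 = 68.24
\( X = \mu + \sigma z \) with μ = 80, σ = 14:
z   -3   -2   -1   0    1    2     3
X   38   52   66   80   94   108   122

Area under the graph
\( z = (x - \mu) / \sigma \)
Find the area between z = 0 and z = 1.8: .4641 (about 46%)
Find the area between z = -2.48 and z = -0.83: .4934 - .2967 = .1967 = 19.67%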
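Instead of the printed z table, the areas above can be computed with `statistics.NormalDist` from the Python standard library (Python 3.8+). This sketch assumes the second example asks for the value with 20% of the distribution below it (80% above), which matches z = -0.84 in the notes; exact z values differ slightly from two-decimal table values.

```python
from statistics import NormalDist

std = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area between z = 0 and z = 1.8
area_0_to_1_8 = std.cdf(1.8) - 0.5

# Area between z = -2.48 and z = -0.83
area_left = std.cdf(-0.83) - std.cdf(-2.48)

# Value with 20% of the area below it, for mu = 80, sigma = 14: X = mu + sigma * z
z_20 = std.inv_cdf(0.20)
x_20 = 80 + 14 * z_20

print(round(area_0_to_1_8, 4), round(area_left, 4))
print(round(z_20, 2), round(x_20, 2))
```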
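Why a sample rather than the entire population? Measuring everyone is sometimes not possible, so you take a representative sample and come to a conclusion about the population.
\( \bar{x} - \mu \) = bias, sampling error
[Figure: sample means \( \bar{x}_1, \bar{x}_2, \bar{x}_3, \bar{x}_4 \) scattered around μ.]
The means of repeated samples are approximately normally distributed around the population mean: you always end up with a normal distribution. The result is better the higher the number of samples.
\( z = (\bar{x} - \mu) / (\sigma / \sqrt{n}) \)
[Figure: as the # of samples increases, (1) the sample means close in on the population mean and (2) bias goes down.]
How do we find the area under this?!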
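Chebyshev: for any distribution, the area under the curve within k standard deviations of the mean is at least \( 1 - 1/k^2 \), for k > 1.
k = 2: \( 1 - 1/2^2 = 1 - 1/4 = 3/4 = 0.75 = 75\% \)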
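Chebyshev's bound is easy to tabulate for a few values of k. The helper name `chebyshev_lower_bound` below is an illustrative assumption.

```python
def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2

bounds = {k: chebyshev_lower_bound(k) for k in (2, 3, 4)}
print(bounds)
```

For comparison, the normal-curve rule gives about 95% within 2 standard deviations, while Chebyshev guarantees only 75% because it makes no assumption about the shape of the distribution.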
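Stratified sampling:
[Figure: population divided into strata (100, 150, 250, 600); n = n1 + n2 + n3]
\( \bar{X} \) is an estimator of \( \mu \):
1) Unbiased
2) Consistent - as n increases, bias decreases
3) Efficient - smallest "s"
From \( z = (\bar{X} - \mu) / (\sigma / \sqrt{n}) \): \( z (\sigma / \sqrt{n}) = \bar{X} - \mu \)
Max error: \( E = z_{\alpha/2} (\sigma / \sqrt{n}) \)
Confidence level @95% = .95 (z is about 2)
\( \alpha = 1 - \) conf. level \( = 1 - .95 = 0.05 \); \( \alpha/2 = .025 \) in each tail, area .475 on each side of the mean -> \( z_{\alpha/2} = 1.96 \)
Interval: \( \bar{X} - z_{\alpha/2} (\sigma / \sqrt{n}) < \mu < \bar{X} + z_{\alpha/2} (\sigma / \sqrt{n}) \)
Ex. Average age of students: n = 50, \( \bar{X} = 21.2 \), σ = 2 years, confidence level = .95
E = 1.96 * 2 / √50 = .6; 21.2 - .6 = 20.6 and 21.2 + .6 = 21.8, so 20.6 < μ < 21.8 years
How many students are older than 20.6? -> one-sided
Ex. Average height of UD students @ .99 confidence level, within 1'' (E = 1), σ = 3''
\( \alpha = 1 - \) confidence level = .01, \( \alpha/2 = .005 \), area .495 -> z = 2.57
\( n = (z_{\alpha/2} \, \sigma / E)^2 = (2.57 \times 3 / 1)^2 = 7.71^2 = 59.4 \), so n = 60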
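The two confidence-interval calculations above (interval for a mean, and required sample size) can be sketched as follows. The function names are illustrative assumptions, and the critical z comes from `statistics.NormalDist` rather than the printed table, so it is 1.960 and 2.576 instead of the rounded table values; the final answers agree.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical z for a given confidence level, e.g. 0.95 -> about 1.96."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def mean_interval(xbar, sigma, n, confidence):
    """Confidence interval xbar +/- z * sigma / sqrt(n)."""
    max_error = z_critical(confidence) * sigma / sqrt(n)
    return xbar - max_error, xbar + max_error

def sample_size(sigma, max_error, confidence):
    """Smallest n with z * sigma / sqrt(n) <= max_error, rounded up."""
    return ceil((z_critical(confidence) * sigma / max_error) ** 2)

low, high = mean_interval(21.2, 2, 50, 0.95)  # average age of students
n_needed = sample_size(3, 1, 0.99)            # average height, E = 1, sigma = 3

print(round(low, 2), round(high, 2), n_needed)
```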
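When to use z or t:
σ known:      n > 30 -> z;  n < 30, population normally distributed -> z
σ not known:  n > 30 -> z;  n < 30, population normally distributed -> t

Degrees of freedom: df = n - 1
X:   7   9   10   11   13   sum = 50, \( \bar{x} = 10 \)
X1:  8   7   10   11   14   sum = 50
df = 5 - 1 = 4

As df increases, the t distribution approaches the z distribution - obviously, because you are approaching 30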
Hypothesis Testing
Hypothesis: some statement about some population parameter
- Could be non-numerical: "UD students drink beer"; Copernicus's heliocentric model
Null hypothesis: \( H_0: \mu = k \) (e.g., 350)
Alternate hypothesis: \( H_1: \mu \neq k \)
Level of significance: \( \alpha = 1 - \) confidence level
Test value: convert \( \bar{X} \) to z. If the test value falls beyond the critical value, H0 is rejected.
[Figure: two-tailed test centered at 350 with σ = 15; α/2 = 0.025 in each rejection zone; critical values z = -1.96 (about 321) and z = +1.96 (about 379); "do not reject" between them.]
How many students drink at least 325 bottles?
When the statement is "at least" or "more than", it goes in the null and the test is left-sided:
Left-tailed (325 vs. 350): \( H_0: \mu \geq K \)
\( H_1: \mu < K \)

             Reject H0       Do not reject H0
H0 true      Type I error    correct
H0 false     correct         Type II error
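A medical report states that the average cost of rehabilitation is $24,672. The claim is about the mean being equal to a value, so it's two-sided.
Researcher -> n = 35, \( \bar{x} = 25{,}226 \), σ = 3,251, α = 0.01. Is the number significantly different?
H0: μ = 24,672; H1: μ ≠ 24,672
CV = ±2.58
TV = \( (\bar{x} - \mu) / (\sigma / \sqrt{n}) = (25{,}226 - 24{,}672) / (3{,}251 / \sqrt{35}) = 1.01 \)
-2.58 < 1.01 < +2.58 -> do not reject H0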
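The average starting salary for a nurse is $24,000. n = 10, \( \bar{x} = 23{,}450 \), s = 400, α = 0.05
H0: μ = 24,000; H1: μ ≠ 24,000
σ not known and n < 30 -> t; df = (n - 1) = 9, α/2 = .025 in each tail (.5 - .025 = 0.475) -> CV = ±2.262
TV = \( (23{,}450 - 24{,}000) / (400 / \sqrt{10}) = -4.35 \)
-4.35 < -2.262 -> Rejected!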
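The average salary of an assistant professor > $42,000 (one-sided example)
n = 30, \( \bar{x} = 43{,}260 \), σ = 5,230, α = 0.05
H0: μ <= 42,000; H1: μ > 42,000
CV: z = +1.65 (rejection zone to the right)
TV = \( (43{,}260 - 42{,}000) / (5{,}230 / \sqrt{30}) = 1.32 \)
1.32 < 1.65 -> do not reject H0. If H0 is actually false, not rejecting it is a Type II error.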
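The average price of a shoe < $80. n = 36, \( \bar{x} = \$75 \), s = $19.2, α = 0.10
H0: μ >= 80
H1: μ < 80
TV = \( (75 - 80) / (19.2 / \sqrt{36}) = -1.56 \)
CV = -1.28; -1.56 < -1.28 -> reject H0. P(Type I error) = α = 0.10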
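The four worked tests above all use the same test-value formula, (sample mean - hypothesized mean) / (standard deviation / sqrt(n)). The sketch below recomputes them; the helper names and the `tail` labels are illustrative assumptions, and the critical values are the table values quoted in the notes.

```python
from math import sqrt

def standardized_statistic(xbar, mu0, sd, n):
    """Test value: (xbar - mu0) / (sd / sqrt(n))."""
    return (xbar - mu0) / (sd / sqrt(n))

def reject_h0(tv, cv, tail):
    """Compare a test value with a positive critical value cv."""
    if tail == "two":
        return abs(tv) > cv
    if tail == "right":
        return tv > cv
    if tail == "left":
        return tv < -cv
    raise ValueError("tail must be 'two', 'right', or 'left'")

# (label, xbar, mu0, sd, n, critical value, tail)
examples = [
    ("rehabilitation", 25226, 24672, 3251, 35, 2.58, "two"),
    ("nurse salary", 23450, 24000, 400, 10, 2.262, "two"),
    ("professor salary", 43260, 42000, 5230, 30, 1.65, "right"),
    ("shoe price", 75, 80, 19.2, 36, 1.28, "left"),
]

results = {}
for label, xbar, mu0, sd, n, cv, tail in examples:
    tv = standardized_statistic(xbar, mu0, sd, n)
    results[label] = (tv, reject_h0(tv, cv, tail))
    print(label, round(tv, 2), "reject H0" if results[label][1] else "do not reject H0")
```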
Quiz answers
n    H0         H1        α     df   CV (t)   TV     Decision
28   μ <= 23    μ > 23    .05   27   1.703    4.5    Reject H0
29   μ <= 24    μ > 24    .05   28   1.701    1.88   Reject H0
27   μ <= 25    μ > 25    .1    26   1.315    1.84   Reject H0
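P-value method: when is H0 true?
TV: z = 1.55 -> area between 0 and z = 0.4394, so P = .5 - 0.4394 = .0606
P = P(z > TV | H0 is true)
α >= P -> reject H0
P > α -> do not reject H0
Here α = 0.05 and P = .0606 > α -> do not reject H0
CV 2.33 (α = 0.01, one-tailed)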
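The P-value for a right-tailed z test is the area beyond the test value. A minimal sketch using `statistics.NormalDist` (Python 3.8+); `right_tail_p_value` is an illustrative name.

```python
from statistics import NormalDist

def right_tail_p_value(z):
    """Area under the standard normal curve to the right of z."""
    return 1 - NormalDist().cdf(z)

alpha = 0.05
p_value = right_tail_p_value(1.55)
reject = p_value <= alpha  # reject H0 only when P is at or below alpha

print(round(p_value, 4), "reject H0" if reject else "do not reject H0")
```

For a two-tailed test the same area would be doubled before comparing it with alpha.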
The Project:
Simple linear regression in Excel: Y = a + b*X
Select any two series where the independent variable (X) impacts the dependent variable (Y):
- Qd = a - b*Price
- Infl = a + b*M1
- Housing starts = a - b*Mortgage rate (mortgage rates affect housing starts)
- Cars sold vs. price of car
- Number of classes missed and grade achieved
- Interest rate goes up, borrowing goes down
Find two variables: cause and effect.
Data: pairs (Y1, X1), (Y2, X2), ...; minimum pairs. Stationary (within a 1-year timeframe, for one specific year). DON'T DO A TIME SERIES.
Hypothesis: one line saying what I'm trying to relate.
Data source: e.g., Bureau of Labor Statistics. Give the source of the data (Appendix A).
ajadhav@udallas.edu
[Figure: scatter plot with x-axis from 15k to 20k.]

Testing how close our sample mean is to the population mean
Compare two sample means: the sample means are independent of each other; populations are normally distributed
Ex. your IQ before stat class and after
H0: \( \mu_1 = \mu_2 \) or \( \mu_1 - \mu_2 = 0 \)
H1: \( \mu_1 \neq \mu_2 \) or \( \mu_1 - \mu_2 \neq 0 \)
Standard error: \( \sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2} \)
z = (observed value - expected value) / standard error \( = [(\bar{x}_1 - \bar{x}_2) - 0] / \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2} \)
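Ex. The average price of a hotel room: Dallas \( \bar{x}_1 = \$88.42 \), n1 = 50, s1 = $5.62; Denton \( \bar{x}_2 = \$80.61 \), n2 = 50, s2 = $4.83; α = 0.05
Test value = \( (88.42 - 80.61) / \sqrt{5.62^2/50 + 4.83^2/50} = 7.45 \)
CV = ±1.96; tv = 7.45 falls in the rejection zone -> reject H0

Ex. # of sports for boys \( \bar{x}_1 = 8.6 \), for girls \( \bar{x}_2 = 7.9 \); n1 = 50, n2 = 50, s1 = 3.3, s2 = 3.3, α = 0.10
H0: \( \mu_1 \leq \mu_2 \); H1: \( \mu_1 > \mu_2 \)
z = \( (8.6 - 7.9) / \sqrt{3.3^2/50 + 3.3^2/50} = 1.06 \) -> area = 0.3554
P = .5 - .3554 = .1446 > α = 0.10 -> do not reject H0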
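Both two-sample examples can be recomputed with one helper. This sketch assumes the large-sample z form with variances s²/n under the square root and uses s2 = 4.83 for Denton, the value in the test-value line; `two_sample_z` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(x1, x2, s1, s2, n1, n2):
    """z = (x1 - x2) / sqrt(s1^2 / n1 + s2^2 / n2), testing mu1 - mu2 = 0."""
    return (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)

# Hotel rooms: Dallas vs. Denton, two-tailed, alpha = 0.05, CV = 1.96
z_hotel = two_sample_z(88.42, 80.61, 5.62, 4.83, 50, 50)
reject_hotel = abs(z_hotel) > 1.96

# Sports: boys vs. girls, right-tailed, alpha = 0.10, P-value method
z_sports = two_sample_z(8.6, 7.9, 3.3, 3.3, 50, 50)
p_sports = 1 - NormalDist().cdf(z_sports)
reject_sports = p_sports <= 0.10

print(round(z_hotel, 2), reject_hotel)
print(round(z_sports, 2), round(p_sports, 4), reject_sports)
```

The P-value here differs slightly from the table-based .1446 because the table rounds z to 1.06 before looking up the area.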
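Find the P value. In the P-value method you always contest the alternate hypothesis.
Confidence interval for the difference:
\( (\bar{x}_1 - \bar{x}_2) - z_{\alpha/2} \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + z_{\alpha/2} \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2} \)
Small samples (t):
- \( S_1 \neq S_2 \) => \( t = (\bar{x}_1 - \bar{x}_2) / \sqrt{s_1^2/n_1 + s_2^2/n_2} \)
- \( S_1 = S_2 \) => 1) \( S_p = \sqrt{[(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2] / (n_1 + n_2 - 2)} \)  2) \( t = (\bar{x}_1 - \bar{x}_2) / (S_p \sqrt{1/n_1 + 1/n_2}) \)
[Figure: distribution curves for df = 1, df = 9, df = 49.]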
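Chi-square values for a variance interval: α = 0.1, conf = 0.9, n = 25, df = 24
- Right -> α/2 = 0.1/2 = 0.05 -> 36.415
- Left -> 1 - α/2 = 0.95 -> 13.848
\( (n-1) s^2 / \chi^2_{right} < \sigma^2 < (n-1) s^2 / \chi^2_{left} \)
Smaller < 2.56 < larger
STD DEV
Test: study through the probabilities, Ch. 8, 9, 10, 11
Many samples: central limit theorem - as n increases the bias goes down
2 errors: 1 - Bias: \( \bar{X} \sim \mu \); 2 - Efficiency: \( s \sim \sigma \)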
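\( z = (\bar{X} - \mu) / (\sigma / \sqrt{n}) \)
\( \bar{X} - z_{\alpha/2} \, \sigma/\sqrt{n} < \mu < \bar{X} + z_{\alpha/2} \, \sigma/\sqrt{n} \)

\( E = z_{\alpha/2} \, \sigma/\sqrt{n} \)
Hypothesis Testing - Confidence level
z = area under half the curve; t = series of distributions, depends on degrees of freedom (n - 1)
σ known: n > 30 -> z; n < 30, population normally distributed -> z
σ not known: n > 30 -> z; n < 30, population normally distributed -> t
How we resolve the hypothesis: μ = (two-sided); μ < or > (one-sided)
Whatever the claim is, that becomes the alternate hypothesis
Type 1 error, Type 2 error
P value method
2 sample means: H0: \( \mu_1 = \mu_2 \), i.e., \( \mu_1 - \mu_2 = 0 \)
\( S_1 \neq S_2 \): standard error of \( \bar{X}_1 - \bar{X}_2 \) is \( \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2} \)
\( S_1 = S_2 \): \( S_p = \sqrt{[(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2] / (n_1 + n_2 - 2)} \), \( t = TV = (\bar{X}_1 - \bar{X}_2) / (S_p \sqrt{1/n_1 + 1/n_2}) \)
Left tail means the rejection zone is at the left end: H0: \( \mu_1 \geq \mu_2 \), H1: \( \mu_1 < \mu_2 \)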
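Variance interval: \( (n-1) s^2 / \chi^2_{right} < \sigma^2 < (n-1) s^2 / \chi^2_{left} \)
90% confidence or α = 0.1

Find the two values when the degrees of freedom are df = 24, α/2 = 0.05: table: 36.415 (right), 13.848 (left)
Ex. 95% confidence interval, standard deviation s = 1.6 mg, α = 0.05, n = 19
(n - 1) * s² = 18 * 1.6² = 46.08
46.08 / 32.852 < σ² < 46.08 / 8.907
(32.852 and 8.907 are the table values for df = 19; for df = 18 the table gives 31.526 and 8.231, so 1.46 < σ² < 5.60)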
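The variance interval divides (n - 1)s² by the right and left chi-square table values. The sketch below takes those table values as inputs rather than computing them. It uses n = 19 and s = 1.6 from the example, with 31.526 and 8.231 as the standard table values for df = 18 at 95% confidence; those two numbers are an assumption brought in from a chi-square table, not from the notes.

```python
def variance_interval(n, s, chi2_right, chi2_left):
    """Return ((n-1)s^2 / chi2_right, (n-1)s^2 / chi2_left) as bounds on sigma^2."""
    numerator = (n - 1) * s**2
    return numerator / chi2_right, numerator / chi2_left

# n = 19, s = 1.6 mg; chi-square table values for df = 18, alpha/2 = 0.025 (assumed)
low, high = variance_interval(19, 1.6, chi2_right=31.526, chi2_left=8.231)

print(round(low, 2), round(high, 2))                # bounds on the variance
print(round(low**0.5, 2), round(high**0.5, 2))      # bounds on the standard deviation
```

Taking square roots of the two bounds gives the corresponding interval for the standard deviation.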
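F test: \( F = s_1^2 / s_2^2 \); n1 = 25 -> dfN = 24, n2 = 29 -> dfD = 28. F is always positive.
The F value is used when we are comparing variances in two different samples. We find a range around each variance.
- Samples are independent: one sample has no effect on the other
- Both populations are normal, so assume that they are normally distributed
- The variances of the populations are critical
n = size of sample, \( \bar{x} \) = mean, s = standard deviation; s in one sample vs. s in another sample

       2-sided                            Left                                 Right
H0:    \( \sigma_1^2 = \sigma_2^2 \)      \( \sigma_1^2 \geq \sigma_2^2 \)     \( \sigma_1^2 \leq \sigma_2^2 \)
H1:    \( \sigma_1^2 \neq \sigma_2^2 \)   \( \sigma_1^2 < \sigma_2^2 \)        \( \sigma_1^2 > \sigma_2^2 \)

If the variances are equal, \( F = s_1^2 / s_2^2 \approx 1 \)
As n increases, the distribution around the mean becomes more and more uniform. The F distribution is positively skewed, starts at 0, and centers near 1.
Degrees of freedom: n1 - 1 = dfN (e.g., 15), n2 - 1 = dfD (e.g., 21)
F table for α = 0.05 (or α/2 = 0.025 per tail): dfN across the top, dfD down the side; dfN = 15, dfD = 21 -> 2.18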
Critical value of the F distribution
Ex. A medical researcher wants to see whether the variances in heart rates of smokers and non-smokers are different.
Smokers: n1 = 26, s1² = 36; non-smokers: n2 = 18, s2² = 10
H0: \( \sigma_1^2 = \sigma_2^2 \); H1: \( \sigma_1^2 \neq \sigma_2^2 \)
α = 0.1, α/2 = 0.05; dfN = 25 (use the 24 column), dfD = 17 -> CV = 2.19
Test value: F = 36/10 = 3.6
[Figure: F curve starting at 0, "do not reject" region up to CV = 2.19, test value 3.6 in the rejection zone.]
3.6 > 2.19 -> reject H0
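The F test value is just the ratio of the two sample variances, with the larger one on top so that only the right tail of the table is needed. A minimal sketch; `f_test_value` is an illustrative name and the critical value 2.19 is the table value quoted in the notes.

```python
def f_test_value(var1, var2):
    """Ratio of sample variances with the larger variance in the numerator."""
    larger, smaller = max(var1, var2), min(var1, var2)
    return larger / smaller

smokers_var = 36      # s1^2, n1 = 26 -> dfN = 25
nonsmokers_var = 10   # s2^2, n2 = 18 -> dfD = 17
critical_value = 2.19 # table value for alpha/2 = 0.05

f_value = f_test_value(smokers_var, nonsmokers_var)
reject = f_value > critical_value

print(f_value, "reject H0" if reject else "do not reject H0")
```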
Project: plug the data into Excel, then select the scatter function
Ex. Higher divergence in church 1 vs. church 2
H0: \( \sigma_1^2 / \sigma_2^2 \leq 1 \); H1: \( \sigma_1^2 / \sigma_2^2 > 1 \)

Always look at the right-tailed test
The null must be "less/more than or equal to"
Ex. Variation of joggers in US vs. Africa
Whether = or ≠: use level of significance / 2
If hypothesis is correct use the right table
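Correlation - linear relationship
r = sample correlation, on a scale from -1 through 0 to 1 (scale labels in the figure: "perfectly elastic", "completely inelastic")
We want negative correlation or positive correlation:
\( r = \sum (x - \bar{x})(y - \bar{y}) / [(n-1) S_x S_y] = [n \sum xy - (\sum x)(\sum y)] / \sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]} \)
H0: \( \rho = 0 \); H1: \( \rho \neq 0 \)
\( tv = r \sqrt{(n-2) / (1 - r^2)} \), df = n - 2

Number of absences (X)   Grade (Y)   X*Y    X^2   Y^2
6                        82          492    36    6724
2                        86          172    4     7396
15                       43          645    225   1849
9                        74          666    81    5476
12                       58          696    144   3364
5                        90          450    25    8100
8                        78          624    64    6084
Sum: 57                  511         3745   579   38993

[Figure: scatter plot of grade (0 to 100) against absences (0 to 15), sloping downward.]
\( r = [7(3745) - (57)(511)] / \sqrt{[7(579) - 57^2][7(38993) - 511^2]} = -0.944 \)
α = 0.1, df = 5 -> CV = ±2.015
\( tv = -0.944 \sqrt{5 / (1 - 0.944^2)} = -6.4 \); |tv| > 2.015 -> reject H0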
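The correlation example can be recomputed from the raw absences and grades using the computational formula for r and the t test value. This is a sketch with illustrative variable names; it keeps r unrounded, so the test value is about -6.41 rather than the figure obtained from r rounded to -0.944.

```python
from math import sqrt

absences = [6, 2, 15, 9, 12, 5, 8]     # X
grades = [82, 86, 43, 74, 58, 90, 78]  # Y

n = len(absences)
sum_x = sum(absences)
sum_y = sum(grades)
sum_xy = sum(x * y for x, y in zip(absences, grades))
sum_x2 = sum(x * x for x in absences)
sum_y2 = sum(y * y for y in grades)

# r = [n*Sxy - Sx*Sy] / sqrt([n*Sx2 - Sx^2] * [n*Sy2 - Sy^2])
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2)
)

# Test value for H0: rho = 0, with df = n - 2
t_value = r * sqrt((n - 2) / (1 - r**2))
critical_value = 2.015  # table value quoted in the notes for df = 5
reject = abs(t_value) > critical_value

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(r, 3), round(t_value, 2), reject)
```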
