
Statistical Methods

Friday, September 03, 2010 11:00 AM

Statistics: employment down 54,000 -> government (laid off 125,000) -> private (gained 67,000)
- Stock market goes up when employment goes down
- Look behind the numbers
- They take a sample and use it to represent the population
(Receipts + Deficits) = Govt. Exp.
- entitlements
- services (highways, education etc.)
- discretionary -> defense $600 billion
y = a + b*other exp.

Sample data:
- Qualitative
- Quantitative
  - Discrete: 0, 1, 2, 3, ... 100
  - Continuous: 2.1

Systematic sampling - take every third item or a large section
Example: beer bottles - every thousandth bottle is tested to make sure all have 12 oz.
Random sampling - every member of the universe has equal probability of being included in the sample

Ex. heights: 5'3'' 5'4'' 5'7'' 5'8'' 5'9'' 5'11'' 6'0'' 6'2''
The average does not have to be a number in the sample
The class boundaries must be mutually exclusive.

Class interval   Frequency   Midpoint   Frequency x midpoint
5'3''-5'7''      3           5'5''      3 x 5'5'' = 16'3''
5'8''-5'11''     3           5'9''      3 x 5'9'' = 17'3''
6'0'' and up     2           6'1''      2 x 6'1'' = 12'2''
Total            8                      45'8''
Average height in the class = 45'8''/8 = 5'8.5''

Presentation: frequency histogram (with the average marked)

Presentation: scatter plot (using raw data, 5'3'' to 6'2''), histogram by class (midpoints 5'5'', 5'9'', 6'1''), ogive (less than / more than), skewed distributions (negative skew, positive skew)

Statistical Methods Page 1

Presentation is best when there is a story that is simple and tells exactly what's going on

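Slope: y = a - bx (declining line)
Q/Sales over time: introduction, growth, maturity, decline
Logistic growth curve: y = L / (1 + a*e^(-bt)), where L is the upper limit and t is time
e = base of the natural log = 2.72 (a log is the power the base is raised to in order to get the number)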

Central Tendency Mean: (central limits)

Standard deviation= 0 (wife shoots husband after confession)

Ex. Polls (sample), margin of error +/- 4: Obama 52%, Other 48%. Pay attention to the numbers: within the margin of error these are essentially the same

        Parameter (population)   Statistic (sample)
Size    N                        n
Mean    μ = ΣXi / N              x̄ = ΣXi / n

p.60 exercise 1: n = 5, ΣXi = 27, x̄ = 27/5 = 5.4
x    x - x̄
6     0.6
3    -2.4
5    -0.4
7     1.6
6     0.6
The deviations from the mean sum to zero: -2.8 + 2.8 = 0

#13 p.62 (weighted mean)
W1 300 @ $20 a share
W2 400 @ $25 per share
W3 400 @ $23 per share
[(W1*X1) + (W2*X2) + (W3*X3)] / (W1 + W2 + W3) = 25,200/1,100 = 22.9091
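The weighted-mean arithmetic for the share purchases can be checked with a short sketch. The function name `weighted_mean` is illustrative and not from the notes; the prices and share counts are the ones listed in the exercise.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    return sum(w * x for x, w in zip(values, weights)) / total_weight


prices = [20, 25, 23]     # dollars per share (X1, X2, X3)
shares = [300, 400, 400]  # number of shares (W1, W2, W3)

print(round(weighted_mean(prices, shares), 4))  # 22.9091
```

The simple mean of the three prices would be 22.67; the weighted mean is higher because more shares were bought at the higher prices.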

MEAN (x̄):
- Simple
- Weighted
- Arithmetic mean
- Geometric mean = cube root of (x1*x2*x3) for three values -> for positive values the arithmetic mean is never lower than the geometric mean (they are equal only when all values are the same)

MEDIAN (med): the middle value in ordered data (the mid-point)

MODE: most frequently occurring value

Two types of DATA:
1) Time series (time on x axis, data on y axis), e.g. annual values for 2008, 2009
   Annual equilibrium
2) Data across units at one point in time, e.g. states in USA - Alabama, Alaska

Class interval   Frequency (f)   Xm   f*Xm
5.5-10.5         1                8     8
10.5-15.5        2               13    26
15.5-20.5        3               18    54
20.5-25.5        5               23   115
25.5-30.5        4               28   112
30.5-35.5        3               33    99
35.5-40.5        2               38    76
                 n=20                 ΣfXm=490

x̄ = ΣfXm / n = 490/20 = 24.5

Group with highest number of frequencies = modal class: 20.5-25.5 (f = 5, Xm = 23 is its midpoint)

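Dispersion: how far is each value from the mean
Range: it's best to choose the distribution with the shorter range, because the data is more uniform
Ex. the grouped data runs from 5.5 to 40.5 with x̄ = 24.5

Absolute value (every sign is positive): |x - x̄|
Standard deviation: s = sqrt[Σ(x - x̄)² / (n-1)] (sample), σ = sqrt[Σ(x - μ)² / N] (population)
Standardized value: (x - x̄) / s
Within 1, 2 and 3 standard deviations of the mean: 68%, 95%, 99.7% of the values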

Population mean is more confident the further it is

Variance = s² = [Σ(f*Xm²) - (ΣfXm)²/n] / (n-1)
= [13310 - (490)²/20] / 19
= 68.7
Standard deviation s = √68.7 = 8.29

How many distributions lie beneath
What is the average (the mean)
How far are the values spread from it (standard deviation)
Ch1-4
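The grouped mean and variance can be reproduced from the class midpoints and frequencies. This is a minimal sketch; `grouped_stats` is a hypothetical helper name, and it uses the shortcut formula with n-1 in the denominator, matching the division by 19 in the notes.

```python
import math


def grouped_stats(midpoints, freqs):
    """Mean and sample variance of grouped data from midpoints and frequencies."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(midpoints, freqs))       # sum of f*Xm
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, freqs))  # sum of f*Xm^2
    mean = sum_fx / n
    variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
    return n, mean, variance


midpoints = [8, 13, 18, 23, 28, 33, 38]
freqs = [1, 2, 3, 5, 4, 3, 2]

n, mean, variance = grouped_stats(midpoints, freqs)
print(n, mean, round(variance, 1), round(math.sqrt(variance), 2))  # 20 24.5 68.7 8.29
```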

Probability
The chance that something is going to happen
Experiment: results in an outcome
Sample space: all outcomes
Event: one or more outcomes from the sample space
Coin: H/T
Die: 1 -> 6, each with probability 1/6
Even (2, 4, 6) = Odd (1, 3, 5) = 3/6 = 0.5

Venn Diagram: # UD students 100/ 3000 -> in sample Red Blue

Conditional probability

20 30 Probability with replacement

Classical
P(E) = # of favorable outcomes / total # of possibilities = n(E) / n(S)
= 4/52 (to get an ace)
P(E) = 0.08, P(not E) = 0.92 (the alternative)
P(E) + P(not E) = 1
0 ≤ P(E) ≤ 1

Empirical
P(E) = # times event occurred / total # of observations

Subjective
(Knowledge or authority - there is not necessarily data to support the decision)

1 ace or 1 king: 0.08 + 0.08 -> 8/52 = 0.1538
1 ace and 1 king: 0.08*0.08 = 0.0064
Coin (2 consecutive heads): H H -> 1/2 * 1/2 = .5*.5 = 0.25
Efficiency, bias
Venn diagram: R, R&B, B

Q: 10 t/f 10 mult choice example


Probability of A or B
Mutually exclusive: P(A or B) = P(A) + P(B)
P(A) + P(not A) = 1

Blood types (f): A B O AB -> 22 5 2 21, n=50
P(A) = f/n = 22/50; P(not A) = 28/50

Not mutually exclusive: P(A or B) = P(A) + P(B) - P(A and B)

A and B
Independent: the probability of selecting A doesn't affect the probability of selecting B
K: 4/52, Q: 4/52 (with replacement)
P(A and B) = P(A) * P(B)
P(K and Q) = (4/52) * (4/52)

Not independent
P(A and B) = P(A) * P(B | A)
K: 4/52, Q: 4/51 (without replacement)
P(B | A) = P(A and B) / P(A)

For any probability problem: first ask whether the events are mutually exclusive, then see if they are independent
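The addition and multiplication rules can be checked with exact fractions. This is a small sketch under the notes' card and coin examples; the variable names are illustrative only.

```python
from fractions import Fraction

p_ace = Fraction(4, 52)
p_king = Fraction(4, 52)
p_queen = Fraction(4, 52)

# Addition rule for mutually exclusive events: one card cannot be both an ace and a king.
p_ace_or_king = p_ace + p_king

# Multiplication rule for independent events: two coin tosses.
p_two_heads = Fraction(1, 2) * Fraction(1, 2)

# King then queen WITH replacement: the second draw is independent of the first.
p_kq_with = p_king * p_queen

# King then queen WITHOUT replacement: P(A) * P(B | A), with 51 cards left for the queen.
p_kq_without = p_king * Fraction(4, 51)

print(float(p_ace_or_king))  # about 0.1538
print(float(p_two_heads))    # 0.25
print(float(p_kq_with), float(p_kq_without))
```

Drawing without replacement gives a slightly larger probability for the second card because the deck has shrunk while all four queens are still in it.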

Permutations (order matters): ABC ACB BAC BCA CAB CBA
nPr = n! / (n-r)!
3P3 = 3! / (3-3)! = (3x2x1) / 0! = 6/1 = 6

Raw data: mean, standard deviation, standardize the distribution
Coefficient of variation: CV = s / x̄

Combinations (order does not matter)
nCr = n! / [(n-r)! * r!]
30C3 = 30! / [(27)! * 3!] = (30x29x28) / (3x2x1) = 4060

Midpoint (Xm) is the average of the values
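The permutation and combination counts can be verified with Python's standard library (`math.perm` and `math.comb` need Python 3.8 or later). This is only a check of the two worked examples, 3P3 and 30C3.

```python
import math
from itertools import permutations

# All orderings of the three letters: order matters, so each arrangement counts.
arrangements = ["".join(p) for p in permutations("ABC")]
print(arrangements)       # ['ABC', 'ACB', 'BAC', 'BCA', 'CAB', 'CBA']

print(math.perm(3, 3))    # 3! / (3-3)! = 6
print(math.comb(30, 3))   # 30! / (27! * 3!) = 4060

# The shortcut used in the notes: cancel 27! and divide by 3!.
print(30 * 29 * 28 // (3 * 2 * 1))  # 4060
```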


Three coin tosses: 2x2x2 = 8 outcomes
HHH HHT HTH HTT THH THT TTH TTT, each with P = 1/8 = .125

μ = Σ x*P(x)
σ² = Σ (x-μ)² * P(x)

Outcome    X   P(x)         x*P(x)   (x-μ)   (x-μ)²   (x-μ)²*P(x)
3 h        3   1/8 = .125   .375      1.5    2.25     0.28
2 h 1 t    2   3/8 = .375   .75       0.5    0.25     .09
1 h 2 t    1   3/8 = .375   .375     -0.5    0.25     .09
0 h 3 t    0   1/8 = .125   0        -1.5    2.25     0.28

μ = 1.5
σ² = 0.74 (from the rounded terms; 0.75 exactly)
σ = √0.74 = 0.86
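The three-toss distribution can be rebuilt by enumerating the sample space. This is a minimal sketch using exact fractions, so it gives the unrounded variance 0.75 rather than the 0.74 obtained from rounded terms; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the 2x2x2 = 8 equally likely outcomes of three tosses.
outcomes = list(product("HT", repeat=3))

# X = number of heads in each outcome; count how often each value occurs.
heads_counts = Counter(outcome.count("H") for outcome in outcomes)
dist = {x: Fraction(count, len(outcomes)) for x, count in heads_counts.items()}

mean = sum(x * p for x, p in dist.items())                     # sum of x*P(x)
variance = sum((x - mean) ** 2 * p for x, p in dist.items())   # sum of (x-mean)^2*P(x)
std_dev = float(variance) ** 0.5

for x in sorted(dist, reverse=True):
    print(x, float(dist[x]))
print(float(mean), float(variance), round(std_dev, 3))
```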

Binomial
1. Limited # of trials = n
2. Only 2 possibilities
3. P(success) = p, P(failure) = q, p + q = 1

P(x) = nCx * p^x * q^(n-x)
Three tosses = three trials, two heads:
P(2) = 3C2 * 0.5^2 * 0.5^1 = [3! / (1! * 2!)] * 0.25 * 0.5 = 3 * 0.25 * 0.5 = .375
nCr = n! / [(n-r)! * r!]

p.760 appendix B9: binomial distribution table

- Once you know n, p, q: μ = n*p, σ² = n*p*q
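The binomial formula can be evaluated directly instead of reading the appendix table. This is a small sketch; `binomial_pmf` is a hypothetical helper name, and `math.comb` requires Python 3.8 or later.

```python
import math


def binomial_pmf(x, n, p):
    """P(x successes in n trials) = nCx * p^x * q^(n-x), with q = 1 - p."""
    q = 1 - p
    return math.comb(n, x) * p ** x * q ** (n - x)


n, p = 3, 0.5
print(binomial_pmf(2, n, p))                         # 0.375
print([binomial_pmf(x, n, p) for x in range(n + 1)])  # [0.125, 0.375, 0.375, 0.125]

# Shortcuts once n, p and q are known.
mean = n * p                # 1.5
variance = n * p * (1 - p)  # 0.75
print(mean, variance)
```

The mean and variance from the shortcuts agree with the values found by working through the full three-toss table.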


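# on balls: 0, 1, 2, 3, 4, each with P(x) = 1/5 = .2
Simple formula to find the variance:
μ = Σ x*P(x)
σ² = Σ[(x-μ)² * P(x)] = Σ[x² * P(x)] - μ²

μ = 0 + 0.2 + 0.4 + 0.6 + 0.8 = 2
σ² = (0 + 1 + 4 + 9 + 16)*0.2 - 2² = (30*0.2) - 4 = 6 - 4 = 2
σ = √2 = 1.4

μ = Σx / n = mean

Normal curve with μ = 2 and σ = 1.4: areas .023, .136, .34, .34, .136, .023 between -3σ, -2σ, -1σ, μ, +1σ, +2σ, +3σ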


Standard normal: z = (x - μ) / σ, e.g. (2 - 2) / 1.4 = 0
z*σ = x - μ
Area under the normal curve: z = (Xi - μ) / σ, so Xi = μ + σ*z

The closer we are to the mean the more accurate we are. If the sample size is 30 or more, the sample means are approximately normally distributed.

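Ex. μ = 50, σ = 15, z = -1: X = 50 - 15*1 = 35
Area between the mean and z = 1 on either side: .3413
p.750 table: z = 1.04 is read at row 1.0 of the z column and the .04 column

Ex. μ = 80, σ = 14: find Xi with 80% of the values above it (area .30 between Xi and the mean, z = -.84)
Xi = 80 - 14*.84 = 80 - 11.76 = 68.24

X = μ + σ*z
z:   -3   -2   -1    0    1    2    3
X:   38   52   66   80   94  108  122

Area under the graph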


z = (x - μ) / σ
Find the area between z = 0 and z = 1.8: .4641 (about 46%)

Find the area between z = -2.48 and z = -0.83:
area from 0 to -2.48 = .4934, area from 0 to -.83 = .2967
.4934 - .2967 = .1967 = 19.67%
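The table lookups for areas under the normal curve can be cross-checked with `statistics.NormalDist` from the standard library (Python 3.8 or later). This is a sketch, assuming the μ = 80, σ = 14 example for the conversion X = μ + σ*z; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Area between z = 0 and z = 1.8 (the printed table gives .4641).
area_0_to_1_8 = std_normal.cdf(1.8) - std_normal.cdf(0)

# Area between z = -2.48 and z = -0.83 (table: .4934 - .2967 = .1967).
area_left = std_normal.cdf(-0.83) - std_normal.cdf(-2.48)

print(round(area_0_to_1_8, 4), round(area_left, 4))

# Convert z scores back to raw values: X = mu + sigma * z.
mu, sigma = 80, 14
x_values = [mu + sigma * z for z in range(-3, 4)]
print(x_values)  # [38, 52, 66, 80, 94, 108, 122]

# Raw value with 80% of the distribution above it (z is about -0.84).
x_20th = NormalDist(mu, sigma).inv_cdf(0.20)
print(round(x_20th, 1))
```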

Why a sample rather than the entire population? It's not always possible to measure everyone, so you take a representative sample and come to a conclusion about the population.
μ - x̄ = bias (sampling error); each sample gives its own mean: x̄1, x̄2, x̄3, x̄4

The means of the samples always end up normally distributed around μ. The result is better the higher the number of samples.

z = (x̄ - μ) / (σ/√n)

As the # of samples goes up:
(1) the estimate gets closer to the population value
(2) bias goes down

How do we find the area under this?!

Chebyshev: for any distribution, the area under the curve within k standard deviations of the mean is at least 1 - (1/k²), for k > 1
k = 2: 1 - (1/4) = 3/4 = 0.75 = 75%
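Chebyshev's bound is simple enough to tabulate for a few values of k. This is a minimal sketch; `chebyshev_lower_bound` is an illustrative name, not from the notes.

```python
def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem needs k > 1")
    return 1 - 1 / k ** 2


for k in (1.5, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 4))
```

For k = 2 the bound is 0.75, which is weaker than the 95% of the normal curve because Chebyshev's theorem makes no assumption about the shape of the distribution.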


Stratified sampling:

100 150

250

600

n = n1 + n2 + n3

X̄ is an estimator of μ:
1) Unbiased
2) Consistent - as n goes up, bias goes down
3) Efficient - smallest "s"

Max error: E = z(α/2) * (σ/√n)
Interval: X̄ - E < μ < X̄ + E
z = (X̄ - μ) / (σ/√n), so z*(σ/√n) = X̄ - μ

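Confidence level @ 95% = .95 (z is about 2)
α = 1 - conf. level = 1 - .95 = 0.05
X̄ - z(α/2)*(σ/√n) < μ < X̄ + z(α/2)*(σ/√n)
α/2 = .025 in each tail, area .475 on each side of the mean -> z(α/2) = 1.96

Average age of students: n = 50, X̄ = 21.2, σ = 2 years, confidence level = .95
E = 1.96*(2/√50) = .55, rounded to .6
21.2 - .6 = 20.6 < μ < 21.2 + .6 = 21.8
Years: 20.6 to 21.8 corresponds to z: -1.96 to +1.96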

How many students are older than 20.6 One sided

Average height of UD students @ .99 confidence level, within one inch (E = 1), σ = 3''
α = 1 - confidence level = .01, α/2 = .005, area .5 - .005 = .495 -> z(α/2) = 2.57
n = [z(α/2) * σ / E]² = (2.57*3/1)² = 7.71² = 59.4 -> round up to n = 60
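The confidence-interval and sample-size formulas can be worked through numerically. This is a sketch under the notes' examples (student ages and student heights); the function names are hypothetical, and the z values 1.96 and 2.57 are taken from the notes rather than computed.

```python
import math


def margin_of_error(z, sigma, n):
    """E = z(alpha/2) * sigma / sqrt(n)."""
    return z * sigma / math.sqrt(n)


def confidence_interval(mean, z, sigma, n):
    """Return (lower, upper) limits for the population mean."""
    e = margin_of_error(z, sigma, n)
    return mean - e, mean + e


def required_sample_size(z, sigma, max_error):
    """n = (z * sigma / E)^2, rounded up to the next whole observation."""
    return math.ceil((z * sigma / max_error) ** 2)


# Average age: n = 50, mean 21.2, sigma = 2, 95% confidence.
low, high = confidence_interval(21.2, 1.96, 2, 50)
print(round(low, 2), round(high, 2))

# Average height: sigma = 3, within E = 1, 99% confidence.
print(required_sample_size(2.57, 3, 1))  # 60
```

The unrounded margin of error is about 0.55, so the interval is slightly narrower than the 20.6 to 21.8 obtained when E is rounded to .6.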

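Choosing z or t:
- σ known: n > 30 -> z; n < 30 and population normally distributed -> z
- σ not known: n > 30 -> z; n < 30 and population normally distributed -> t

Degrees of freedom: df = n - 1
Ex. five values with x̄ = 10 must add up to 50 (7 9 10 11 13, or 8 7 10 11 14), so only 5 - 1 = 4 of them are free to vary

As df goes up the t distribution approaches the z distribution (you are approaching n = 30)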

Hypothesis Testing
Hypothesis: some statement about some population parameter
- Could be non numerical: UD students drink beer; Copernicus heliocentric model

Null hypothesis: Ho: μ = k (350)
Alternate hypothesis: H1: μ ≠ k
Level of significance α - Confidence level
Test value: X̄ -> z

Two sided, α/2 = 0.025 in each tail: critical values z = -1.96 and z = +1.96
With μ = 350 and σ = 15: do not reject between 321 and 379; the rejection zones are outside
If the test value falls beyond a critical value, Ho is rejected

How many students drink at least 325 bottles?
Left tailed (325 lies to the left of 350): Ho: μ ≥ k, H1: μ < k

                   Ho true         Ho false
Reject Ho          Type I error    correct
Do not reject Ho   correct         Type II error

A medical report states that the average cost of rehabilitation is $24,672 (it's the mean, so it's two sided)
Researcher -> n = 35, x̄ = 25,226, σ = 3,251, α = 0.01. Is the number significantly different?
Ho: μ = 24,672   H1: μ ≠ 24,672
CV = ±2.58
TV = (x̄ - μ) / (σ/√n) = (25,226 - 24,672) / (3,251/√35) = 1.01
-2.58 < 1.01 < +2.58 -> do not reject Ho

The average starting salary for a nurse is $24,000
n = 10, x̄ = 23,450, s = 400, α = 0.05
Ho: μ = 24,000   H1: μ ≠ 24,000
CV = t: α/2 = .025 (area .5 - .025 = 0.475), df = (n-1) = 9 -> ±2.262
TV = (23,450 - 24,000) / (400/√10) = -4.35
-4.35 < -2.262 -> Rejected!

The average salary of an assistant professor > $42,000 - one sided example
n = 30, x̄ = 43,260, σ = 5,230, α = 0.05
Ho: μ ≤ 42,000   H1: μ > 42,000
z = +1.65 = CV (rejection zone to the right of 1.65)
TV = (43,260 - 42,000) / (5,230/√30) = 1.32
1.32 < 1.65 -> do not reject
If Ho is false and we do not reject it: Type II error

The average price of shoe < $80
n = 36, x̄ = $75, s = $19.2, α = 0.10
Ho: μ ≥ 80   H1: μ < 80
CV = -1.28
TV = (75 - 80) / (19.2/√36) = -1.56
-1.56 < -1.28 -> reject Ho
P(Type I error) is at most α = 0.10
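The four worked tests (rehabilitation cost, nurse salary, professor salary, shoe price) all use the same test-value formula and differ only in the critical value and the tail. This is a minimal sketch; the function names are illustrative, and the critical values are the table values quoted in the notes, not computed here.

```python
import math


def standardized_test_value(sample_mean, claimed_mean, sd, n):
    """TV = (sample mean - claimed mean) / (sd / sqrt(n)); used for both z and t."""
    return (sample_mean - claimed_mean) / (sd / math.sqrt(n))


def reject_null(tv, cv, tail):
    """Compare the test value with a positive critical value cv."""
    if tail == "two":
        return abs(tv) > cv
    if tail == "right":
        return tv > cv
    if tail == "left":
        return tv < -cv
    raise ValueError("tail must be 'two', 'right' or 'left'")


examples = [
    # name, sample mean, claimed mean, sd, n, critical value, tail
    ("rehabilitation", 25226, 24672, 3251, 35, 2.58, "two"),
    ("nurse salary", 23450, 24000, 400, 10, 2.262, "two"),
    ("professor salary", 43260, 42000, 5230, 30, 1.65, "right"),
    ("shoe price", 75, 80, 19.2, 36, 1.28, "left"),
]

for name, xbar, mu, sd, n, cv, tail in examples:
    tv = standardized_test_value(xbar, mu, sd, n)
    decision = "reject Ho" if reject_null(tv, cv, tail) else "do not reject Ho"
    print(f"{name}: TV = {tv:.2f}, {decision}")
```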

Quiz answers

n    Ho       H1       α     df   CV (t)   TV     Decision
28   μ ≤ 23   μ > 23   .05   27   1.703    4.5    Reject Ho
29   μ ≤ 24   μ > 24   .05   28   1.701    1.88   Reject Ho
27   μ ≤ 25   μ > 25   .1    26   1.315    1.84   Reject Ho


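P value method: assume Ho is true and ask how likely the test value is
TV: z = 1.55 -> area 0.4394 -> P = .5 - .4394 = .0606
α > P -> reject Ho
P > α -> do not reject Ho
α = 0.05: P = .0606 > .05 -> do not reject Ho

CV 2.33 (one sided, α = 0.01)

P = P(z > TV | Ho is true)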

The Project:
Select any two series where the independent variable impacts the dependent variable: Y = a + b*x
- Qd = a - b*Price
- Infl = a + b*M1
- Housing starts = a - b*Mortgage rate (mortgage rates affect housing starts)
- Interest rate goes up, borrowing goes down
- Number of classes missed and grade achieved
- Price of car (15k 16k 17k 18k 19k 20k) and cars sold in 1 yr
Find two variables: cause and effect
Data source: Bureau of Labor Statistics
Stationary (within 1 year timeframe, for one specific year) - DON'T DO A TIME SERIES
Minimum pairs: (Y1, X1), (Y2, X2), ...
Simple linear regression in Excel
Hypothesis: one line saying what I'm trying to relate
Give source of data
ajadhav@udallas.edu

Appendix A
Testing how close our sample mean is to population mean
Compare two sample means: the sample means are independent of each other, populations are normally distributed
Ex. your IQ before stat class and after
Ho: μ1 = μ2 or μ1 - μ2 = 0
H1: μ1 ≠ μ2 or μ1 - μ2 ≠ 0
σ(x̄1 - x̄2) = sqrt(σ1²/n1 + σ2²/n2)
z = (observed value - expected value) / sqrt(σ1²/n1 + σ2²/n2)

The average price of a hotel room: Dallas x̄1 = $88.42, n1 = 50, s1 = $5.62; Denton x̄2 = $80.61, n2 = 50, s2 = $4.83; α = 0.05
CV = ±1.96
Test value = (88.42 - 80.61) / sqrt(5.62²/50 + 4.83²/50) = 7.45
7.45 > +1.96 -> TV is in the rejection zone, reject Ho

# of sports for boys = 8.6, # of sports for girls = 7.9
n1 = 50, n2 = 50, s1 = 3.3, s2 = 3.3, α = 0.10
Ho: μ1 ≤ μ2   H1: μ1 > μ2
z = (8.6 - 7.9) / sqrt(3.3²/50 + 3.3²/50) = 1.06 -> area 0.3554
Find the P value: P = .5 - .3554 = .1446
P = .1446 > α = .10 -> do not reject Ho
In the P value method you always contest the alternate hypothesis
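The two-sample z computations can be reproduced in a few lines. This is a sketch: `two_sample_z` is a hypothetical helper, the Denton standard deviation is taken as 4.83 (the value used in the test-value arithmetic, which yields 7.45), and the P value is computed from the exact normal curve, so it differs from the table-based .1446 in the fourth decimal.

```python
import math
from statistics import NormalDist


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2, hypothesized_diff=0.0):
    """z = [(x1 - x2) - hypothesized difference] / sqrt(sd1^2/n1 + sd2^2/n2)."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return ((mean1 - mean2) - hypothesized_diff) / standard_error


# Hotel rooms: Dallas vs Denton, two sided, critical values +/-1.96.
z_hotel = two_sample_z(88.42, 5.62, 50, 80.61, 4.83, 50)
print(round(z_hotel, 2), abs(z_hotel) > 1.96)

# Sports: boys vs girls, right tailed, alpha = 0.10.
z_sports = two_sample_z(8.6, 3.3, 50, 7.9, 3.3, 50)
p_value = 1 - NormalDist().cdf(z_sports)  # area in the right tail beyond z
print(round(z_sports, 2), round(p_value, 4), p_value < 0.10)
```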

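Confidence interval for the difference:
(x̄1 - x̄2) - z(α/2)*sqrt(σ1²/n1 + σ2²/n2) < μ1 - μ2 < (x̄1 - x̄2) + z(α/2)*sqrt(σ1²/n1 + σ2²/n2)

Small samples:
- S1 ≠ S2: t = (x̄1 - x̄2) / sqrt(s1²/n1 + s2²/n2)
- S1 = S2:
  1) Sp² = [(n1-1)s1² + (n2-1)s2²] / (n1 + n2 - 2)
  2) t = (x̄1 - x̄2) / [Sp * sqrt(1/n1 + 1/n2)]

The distribution changes shape with the degrees of freedom: df = 1, df = 9, df = 49

Variance interval: α = 0.1, conf = 0.9, n = 25, df = 24
Right -> α/2 = 0.1/2 = 0.05 -> 36.415
Left -> 0.95 -> 13.848
(n-1)s²/χ²(right) < σ² < (n-1)s²/χ²(left)
smaller < 2.56 < larger (interval around the sample variance; the square roots give the STD DEV interval)

Test: study through the probabilities 8, 9, 10, 11
Many samples: central limit theorem - as n increases the bias goes down
2 errors: 1- Bias (X̄ ~ μ) 2- Efficiency (s ~ σ)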


z = (X̄ - μ) / (σ/√n)

X̄ - z(α/2)*σ/√n < μ < X̄ + z(α/2)*σ/√n

E = z(α/2)*σ/√n
Hypothesis Testing - Confidence level

z = area under half the curve
t = series of distributions, depends on degrees of freedom (n-1)

σ known
- n > 30: z
- n < 30, population normally distributed: z
σ not known
- n > 30: z
- n < 30, population normally distributed: t

How we resolve the hypothesis: = (two sided), < or > (one sided)
Whatever the claim is, that becomes the alternate hypothesis
Type 1 error, Type 2 error
P value method

2 sample means: Ho: μ1 = μ2, i.e. μ1 - μ2 = 0
S1 ≠ S2: σ²(x̄1 - x̄2) = σ²(x̄1) + σ²(x̄2) = σ1²/n1 + σ2²/n2
S1 = S2: Sp² = [(n1-1)S1² + (n2-1)S2²] / (n1 + n2 - 2), t = TV = (x̄1 - x̄2) / [Sp * sqrt(1/n1 + 1/n2)]

Left tail means the rejection zone is at the left end: Ho: μ1 ≥ μ2, H1: μ1 < μ2

Variance interval: (n-1)s²/χ²(right) < σ² < (n-1)s²/χ²(left)

90% confidence or α = 0.1

Find the two values when the degrees of freedom are df = 24, α/2 = 0.05; table: 36.415 (right), 13.848 (left)


95% confidence interval, standard deviation s = 1.6 mg, α = 0.05, n = 19
(n-1)*s² = 18*1.6² = 46.08
46.08/32.852 < σ² < 46.08/8.907

F = S1²/S2²; n1 = 25 -> dfN = 24, n2 = 29 -> dfD = 28; F is always positive

F value is when we are comparing variances in two different samples. We find a range around each variance.
- The samples are independent: one sample has no effect on the other
- Both populations are normal - so assume that they are normally distributed
- The variances of the populations are what is being tested
n = size of sample, x̄ = mean, s = standard deviation
S in one sample vs s in another sample

       2 sided      Left         Right
Ho:    σ1² = σ2²    σ1² ≥ σ2²    σ1² ≤ σ2²
H1:    σ1² ≠ σ2²    σ1² < σ2²    σ1² > σ2²

If the variances are equal, F = S1²/S2² is about 1

As n increases the distribution around the mean becomes more and more unified. The F distribution is positively skewed and starts at 0.

F = S1²/S2²
N1 - 1 = dfN = 15
N2 - 1 = dfD = 21
α = 0.05 (for a two sided test look up α/2 = 0.025): F table, dfN = 15 across and dfD = 21 down -> 2.18

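Critical value of F distribution
A medical researcher wants to see whether the variances in heart rates of smokers and non-smokers are different
Smokers n1 = 26, s1² = 36; non-smokers n2 = 18, s2² = 10
Ho: σ1² = σ2²   H1: σ1² ≠ σ2²
α = 0.1, α/2 = 0.05 -> CV = 2.19
F = 36/10 = 3.6
Do not reject below 2.19; the test value 3.6 > 2.19 falls in the rejection zone -> reject Ho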


F = 36/10 = 3.6 - test value
dfN = 25 (use 24 in the table), dfD = 17 -> CV = 2.19
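The F test value and its degrees of freedom follow directly from the two sample variances. This is a sketch under the smoker example, assuming 36 and 10 are the sample variances; `f_test_value` is an illustrative name, and the critical value 2.19 is the table value quoted in the notes rather than something computed here.

```python
def f_test_value(var1, n1, var2, n2):
    """Return (F, dfN, dfD) with the larger variance in the numerator."""
    if var1 >= var2:
        return var1 / var2, n1 - 1, n2 - 1
    return var2 / var1, n2 - 1, n1 - 1


f_value, df_n, df_d = f_test_value(36, 26, 10, 18)
critical_value = 2.19  # from the F table at alpha/2 = 0.05, as quoted in the notes

print(f_value, df_n, df_d)       # 3.6 25 17
print(f_value > critical_value)  # True: the test value is in the rejection zone
```

Putting the larger variance on top keeps F at or above 1, which is why only the right tail of the table is needed.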

Project: plug data into excel then select the scatter function

Higher divergence in church 1 vs church 2

Ho: σ1²/σ2² ≤ 1   H1: σ1²/σ2² > 1

Always look at the right tailed test (put the larger variance on top)
Null must be less/more than or equal to
Ex. variation of joggers in US vs Africa
Whether = or ≠ (two sided): use level of significance/2

If the hypothesis is set up correctly, use the right tail of the table

Correlation - linear relationship

r - sample correlation, from -1 through 0 to +1 (perfectly elastic ... completely inelastic)

We want negative correlation or positive correlation:
r = Σ(x - x̄)*(y - ȳ) / [(n-1)*Sx*Sy]
  = [n*Σ(x*y) - (Σx)*(Σy)] / sqrt{[n*Σx² - (Σx)²] * [n*Σy² - (Σy)²]}

Ho: ρ = 0   H1: ρ ≠ 0

tv = r*√(n-2) / √(1-r²) = r * sqrt[(n-2) / (1-r²)]

Number of absences (X)   Grade (Y)   X*Y    X²     Y²
 6                       82          492     36    6724
 2                       86          172      4    7396
15                       43          645    225    1849
 9                       74          666     81    5476
12                       58          696    144    3364
 5                       90          450     25    8100
 8                       78          624     64    6084
ΣX=57                    ΣY=511      ΣXY=3745   ΣX²=579   ΣY²=38993

Scatter plot: grade (0 to 100) against number of absences (0 to 15)

r = [7*3745 - 57*511] / sqrt{[7*579 - (57)²] * [7*38993 - (511)²]} = -0.944
α = 0.1, df = n - 2 = 5, CV = ±2.015

tv = -0.944 * sqrt[5 / (1 - 0.944²)] = -6.40
-6.40 < -2.015 -> reject Ho
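The correlation coefficient and its test value can be recomputed from the raw absences and grades with the computational formula. This is a minimal sketch; the function names are illustrative, and because it carries r unrounded the test value comes out near -6.41 rather than the value obtained from r rounded to three decimals.

```python
import math


def pearson_r(xs, ys):
    """r = [n*Sxy - Sx*Sy] / sqrt([n*Sxx - Sx^2] * [n*Syy - Sy^2])."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


def correlation_t(r, n):
    """Test value for Ho: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]

r = pearson_r(absences, grades)
tv = correlation_t(r, len(absences))
print(round(r, 3), round(tv, 2))
print(abs(tv) > 2.015)  # critical value from the notes (alpha = 0.1, df = 5)
```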
