If X is an rv with cdf $F_X(x)$, then $Y = g(X)$ is also an rv. If we write $y = g(x)$, the function $g(x)$ defines a mapping from the original sample space of X, $\mathcal{X}$, to a new sample space, $\mathcal{Y}$, the sample space of the rv Y:
$$g(x): \mathcal{X} \to \mathcal{Y}$$
If X is a discrete rv, then $\mathcal{X}$ is countable. The sample space for $Y = g(X)$ is $\mathcal{Y} = \{y : y = g(x),\ x \in \mathcal{X}\}$, also countable. The pmf for Y is
$$f_Y(y) = P(Y = y) = \sum_{x \in g^{-1}(y)} P(X = x) = \sum_{x \in g^{-1}(y)} f_X(x)$$
Let $Y = X^2$.
$B = \{0, 1, 4\}$
$$p(y) = \begin{cases} 1/5, & y = 0 \\ 7/30, & y = 1 \\ 17/30, & y = 4 \end{cases}$$
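The pmf of a transformed discrete rv can be sketched by summing $f_X(x)$ over each preimage. The pmf of X is not given in the text, so the one below is a hypothetical choice, picked only because it is consistent with the p(y) values stated above.

```python
from collections import defaultdict
from fractions import Fraction as F

def pmf_of_transform(pmf_x, g):
    """Return the pmf of Y = g(X) by summing f_X(x) over each preimage g^{-1}(y)."""
    pmf_y = defaultdict(F)
    for x, p in pmf_x.items():
        pmf_y[g(x)] += p
    return dict(pmf_y)

# Hypothetical pmf for X (NOT given in the text); it sums to 1
pmf_x = {-2: F(1, 5), -1: F(1, 6), 0: F(1, 5), 1: F(1, 15), 2: F(11, 30)}
pmf_y = pmf_of_transform(pmf_x, lambda x: x * x)
print(pmf_y)
```

Exact fractions are used so the result can be compared directly with 1/5, 7/30 and 17/30.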
a) If g is an increasing function on $\mathcal{X}$, then $F_Y(y) = F_X(g^{-1}(y))$ for $y \in \mathcal{Y}$.
b) If g is a decreasing function on $\mathcal{X}$ and X is a continuous r.v., then $F_Y(y) = 1 - F_X(g^{-1}(y))$ for $y \in \mathcal{Y}$.
EXAMPLE
Uniform(0,1) distribution
Suppose $X \sim f_X(x) = 1$ if $0 < x < 1$ and 0 otherwise. Find the distribution function of $Y = g(X) = -\log X$.
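A possible check of this example, assuming the intended transformation is the decreasing map $g(x) = -\log x$ (the sign is an assumption, since minus signs appear to be lost in this text). Part (b) of the result above then gives $F_Y(y) = 1 - F_X(e^{-y}) = 1 - e^{-y}$ for $y > 0$. The simulation size and seed are arbitrary.

```python
import math
import random

def cdf_y(y):
    """CDF of Y = -log(X) for X ~ Uniform(0,1), from F_Y(y) = 1 - F_X(g^{-1}(y))."""
    if y <= 0:
        return 0.0
    x = math.exp(-y)          # g^{-1}(y) for the decreasing map g(x) = -log(x)
    return 1.0 - x            # F_X(x) = x on (0, 1)

# Monte Carlo check: fraction of simulated Y values that are <= 1
random.seed(0)
samples = [-math.log(1.0 - random.random()) for _ in range(100_000)]
empirical = sum(s <= 1.0 for s in samples) / len(samples)
print(cdf_y(1.0), empirical)
```

`1.0 - random.random()` is used so the argument of the logarithm is never zero.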
EXPECTED VALUES
Let X be a rv with pdf (or pmf) $f_X(x)$ and g(X) be a function of X. Then the expected value (or the mean or the mathematical expectation) of g(X) is
$$E[g(X)] = \begin{cases} \sum_x g(x) f_X(x), & \text{if X is discrete} \\ \int_{-\infty}^{\infty} g(x) f_X(x)\,dx, & \text{if X is continuous} \end{cases}$$
EXPECTED VALUES
E[g(X)] is finite if E[| g(X) |] is finite.
$$E|g(X)| = \begin{cases} \sum_x |g(x)| f_X(x) < \infty, & \text{if X is discrete} \\ \int_{-\infty}^{\infty} |g(x)| f_X(x)\,dx < \infty, & \text{if X is continuous} \end{cases}$$
$$E(X) = \mu = \sum_{\text{all } x_i} x_i\, p(x_i)$$
Population Variance
Let X be a discrete random variable with possible values $x_i$ that occur with probabilities $p(x_i)$, and let $E(X) = \mu$. The variance of X is defined by
$$V(X) = \sigma^2 = E[(X - \mu)^2] = \sum_{\text{all } x_i} (x_i - \mu)^2\, p(x_i)$$
EXPECTED VALUE
The expected value or mean value of a continuous random variable X with pdf f(x) is
$$\mu = E(X) = \int_{\text{all } x} x f(x)\,dx$$
The variance of a continuous random variable X with pdf f(x) is
$$\sigma^2 = Var(X) = E(X - \mu)^2 = \int_{\text{all } x} (x - \mu)^2 f(x)\,dx$$
$$\phantom{\sigma^2} = E(X^2) - \mu^2 = \int_{\text{all } x} x^2 f(x)\,dx - \mu^2$$
EXAMPLE
The pmf for the number of defective items in a lot is as follows:
$$p(x) = \begin{cases} 0.35, & x = 0 \\ 0.39, & x = 1 \\ 0.19, & x = 2 \\ 0.06, & x = 3 \\ 0.01, & x = 4 \end{cases}$$
Find the expected number and the variance of defective items.
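One way to work this example is to apply $E(X) = \sum x\,p(x)$ and $Var(X) = E(X^2) - \mu^2$ directly to the pmf. A minimal sketch (variable names are illustrative):

```python
# pmf of the number of defective items, as given in the example
pmf = {0: 0.35, 1: 0.39, 2: 0.19, 3: 0.06, 4: 0.01}

mean = sum(x * p for x, p in pmf.items())                 # E(X)
second_moment = sum(x**2 * p for x, p in pmf.items())     # E(X^2)
variance = second_moment - mean**2                        # E(X^2) - mu^2

print(round(mean, 4), round(variance, 4))
```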
EXAMPLE
What is the mathematical expectation if we win $10 when a die comes up 1 or 6 and lose $5 when it comes up 2, 3, 4 or 5? Let X = amount of profit.
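A short sketch of this calculation, assuming the die is fair (each face has probability 1/6, which the example does not state explicitly):

```python
from fractions import Fraction

# Profit for each face of a fair die: win $10 on 1 or 6, lose $5 on 2, 3, 4, 5
profit = {1: 10, 2: -5, 3: -5, 4: -5, 5: -5, 6: 10}
expected_profit = sum(Fraction(1, 6) * v for v in profit.values())
print(expected_profit)
```

Under that assumption the gains and losses balance exactly, so the game is fair.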
EXAMPLE
A grab-bag contains 6 packages worth $2 each, 11 packages worth $3 each, and 8 packages worth $4 each. Is it reasonable to pay $3.50 for the option of selecting one of these packages at random? Let X = worth of the selected package.
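The expected worth can be sketched as a probability-weighted average, assuming each of the 25 packages is equally likely to be selected:

```python
from fractions import Fraction

# (number of packages, worth in dollars) as stated in the example
packages = [(6, 2), (11, 3), (8, 4)]
total = sum(n for n, _ in packages)
expected_worth = sum(Fraction(n, total) * w for n, w in packages)
print(float(expected_worth))
```

Comparing `expected_worth` with the $3.50 price answers the question: the price exceeds the expected worth.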
EXAMPLE
Let X be a random variable denoting the life length of a light bulb. Its pdf is $f(x) = 2(1 - x)$, $0 < x < 1$. Find E(X) and Var(X).
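Integrating by hand gives $E(X) = \int_0^1 2x(1-x)\,dx = 1/3$, $E(X^2) = \int_0^1 2x^2(1-x)\,dx = 1/6$, and so $Var(X) = 1/6 - 1/9 = 1/18$. The sketch below checks this numerically with a midpoint rule; the helper `integrate` and the step count are illustrative choices, not part of the example.

```python
def integrate(h, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of h over [a, b]."""
    step = (b - a) / n
    return step * sum(h(a + (i + 0.5) * step) for i in range(n))

def pdf(x):
    return 2 * (1 - x)   # f(x) = 2(1 - x) on (0, 1)

mean = integrate(lambda x: x * pdf(x), 0, 1)
second_moment = integrate(lambda x: x**2 * pdf(x), 0, 1)
variance = second_moment - mean**2
print(mean, variance)
```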
Laws of Variance
$V(c) = 0$
$V(X + c) = V(X)$
$V(cX) = c^2 V(X)$
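These laws can be checked on a small set of values treated as an equally likely population. The data and the constant `c` below are illustrative assumptions.

```python
from statistics import pvariance

data = [8, 9, 10, 11, 12]      # illustrative values
c = 3

print(pvariance([c] * len(data)))         # V(c) = 0
print(pvariance([x + c for x in data]))   # equals V(X)
print(pvariance([c * x for x in data]))   # equals c^2 * V(X)
```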
Moments:
$$\mu_k^* = E[X^k], \quad \text{the k-th moment}$$
SKEWNESS
Measure of lack of symmetry in the pdf.
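$$\text{Skewness} = \frac{E(X - \mu)^3}{\sigma^3} = \frac{\mu_3}{\mu_2^{3/2}}$$
where $\mu_k = E(X - \mu)^k$ denotes the k-th central moment.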
KURTOSIS
Measure of the peakedness of the pdf. Describes the shape of the r.v.
$$\text{Kurtosis} = \frac{E(X - \mu)^4}{\sigma^4} = \frac{\mu_4}{\mu_2^2}$$
Kurtosis = 3: Normal
Kurtosis > 3: Leptokurtic (peaked and fat tails)
Kurtosis < 3: Platykurtic (less peaked and thinner tails)
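Skewness and kurtosis can be computed from central moments. The sketch below does this for a discrete pmf; the fair-die pmf is an illustrative assumption, not an example from these notes. It is symmetric (skewness 0) and flatter than the normal (kurtosis below 3).

```python
def central_moment(pmf, k):
    """k-th central moment E[(X - mu)^k] of a discrete pmf given as {x: p}."""
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** k * p for x, p in pmf.items())

def skewness(pmf):
    return central_moment(pmf, 3) / central_moment(pmf, 2) ** 1.5

def kurtosis(pmf):
    return central_moment(pmf, 4) / central_moment(pmf, 2) ** 2

# Illustrative pmf (an assumption, not from the text): a fair six-sided die
die = {x: 1 / 6 for x in range(1, 7)}
print(skewness(die), kurtosis(die))
```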
The measure of central location reflects the locations of all the actual data points.
Sample mean (n = sample size):
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Population mean (N = population size):
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
$$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{0 + 7 + \cdots + 22}{10} = 11.0$$
Example 2
Suppose the telephone bills represent the population of measurements. The population mean is
$$\mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{42.19 + 38.45 + \cdots + 45.77}{200} = 43.59$$
The Median
The Median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude. It divides the data in half.
Example 3
Find the median of the time on the internet for the 10 adults of Example 1.
Comment
Suppose only 9 adults were sampled (exclude, say, the longest time (33)).
The Median
Median of 8 2 9 11 1 6 3
n = 7 (odd sample size). First order the data:
1 2 3 6 8 9 11
Median = 6
The Median
The engineering group receives e-mail requests for technical information from sales and service personnel. The daily numbers for 6 days were 11, 9, 17, 19, 4, and 15. What is the central location of the data?
For even sample sizes, the median is the average of the (n/2)th and (n/2 + 1)th ordered observations.
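The odd and even rules can be sketched as one small function and applied to the data sets quoted in these slides (the function name is illustrative):

```python
def median(values):
    """Middle ordered value for odd n; average of the (n/2)th and (n/2+1)th for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([8, 2, 9, 11, 1, 6, 3]))                 # odd n
print(median([11, 9, 17, 19, 4, 15]))                 # even n: e-mail requests
print(median([0, 7, 12, 5, 33, 14, 8, 0, 9, 22]))     # internet times, Example 1
```

Dropping the longest time (33) from the last data set leaves 9 observations, so the odd-n branch applies.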
The Mode
The Mode of a set of observations is the value that occurs most frequently. A set of data may have one mode (or modal class), or two or more modes.
For large data sets the modal class is much more relevant than a single-value mode.
The Mode
Find the mode for the data in Example 1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Solution
All observations except 0 occur once. There are two 0s. Thus, the mode is zero. Is this a good measure of central location? The value 0 does not reside at the center of this set (compare with the mean = 11.0 and the median = 8.5).
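The three measures of central location for the Example 1 data can be compared with Python's standard `statistics` module. This is a sketch using the data quoted above.

```python
from statistics import mean, median, multimode

times = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]   # Example 1 data

print(multimode(times))   # all values tied for the highest frequency
print(mean(times), median(times))
```

`multimode` returns a list, so data sets with two or more modes are handled as well.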
Measures of variability
Measures of central location fail to tell the whole story about the distribution. A question of interest still remains unanswered:
How much are the observations spread out around the mean value?
Measures of variability
Observe two hypothetical data sets:
The average value provides a good representation of the observations in the data set.
Small variability
Larger variability
The same average value does not provide as good representation of the observations in the data set as before.
The Range
The range of a set of observations is the difference between the largest and smallest observations. Its major advantage is the ease with which it can be computed. Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.
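But how do all the observations spread out? The range cannot assist in answering this question.
[Figure: number line from the smallest observation to the largest observation; the range spans the two, and the positions of the observations in between are unknown.]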
The Variance
This measure reflects the dispersion of all the observations.
The variance of a population of size N, $x_1, x_2, \ldots, x_N$, whose mean is $\mu$, is defined as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
The variance of a sample of n observations $x_1, x_2, \ldots, x_n$ whose mean is $\bar{x}$ is defined as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
A measure of dispersion
Consider two small populations:
A: 8, 9, 10, 11, 12
B: 4, 7, 10, 13, 16
The mean of both populations is 10, but measurements in B are more dispersed than those in A.
The sum of deviations is zero for both populations; therefore, it is not a good measure of dispersion.
Deviations for B: 4 - 10 = -6, 7 - 10 = -3, 10 - 10 = 0, 13 - 10 = +3, 16 - 10 = +6. Sum = 0
The Variance
Let us calculate the variance of the two populations:
$$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$$
$$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$$
Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of variation instead? After all, the sum of squared deviations increases in magnitude when the variation of a data set increases!
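The contrast between the sum of deviations and the variance for populations A and B can be sketched as follows (variable names are illustrative):

```python
from statistics import pvariance

A = [8, 9, 10, 11, 12]
B = [4, 7, 10, 13, 16]

for name, data in (("A", A), ("B", B)):
    mu = sum(data) / len(data)
    sum_dev = sum(x - mu for x in data)     # always 0, so uninformative
    print(name, mu, sum_dev, pvariance(data))
```

`pvariance` divides by N, matching the population formula.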
The Variance
Let us calculate the sum of squared deviations for both data sets. Which data set has a larger dispersion?
Data set B is more dispersed around the mean.
[Figure: dot plots of data set A on an axis marked 1, 2, 3 and data set B on an axis marked 1, 3, 5.]
The Variance
$\text{Sum}_A = (1-2)^2 + \cdots + (1-2)^2 + (3-2)^2 + \cdots + (3-2)^2 = 10$
$\text{Sum}_B = (1-3)^2 + (5-3)^2 = 8$
$\text{Sum}_A > \text{Sum}_B$. This is inconsistent with the observation that set B is more dispersed.
The Variance
However, when calculated on a per-observation basis (variance), the data set dispersions are properly ranked.
The Variance
Example 4
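The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Find its mean and variance.
Solution
$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14 \text{ jobs}$$
$$s^2 = \frac{\sum_{i=1}^{6} (x_i - \bar{x})^2}{n - 1} = \frac{(17-14)^2 + (15-14)^2 + \cdots + (13-14)^2}{6 - 1} = 33.2 \text{ jobs}^2$$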
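Example 4 can be reproduced with the standard `statistics` module. Its `variance` function uses the n - 1 divisor of the sample formula.

```python
from statistics import mean, variance

jobs = [17, 15, 23, 7, 9, 13]

x_bar = mean(jobs)        # sample mean
s2 = variance(jobs)       # sample variance, divides by n - 1
print(x_bar, s2)
```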
Standard Deviation
The standard deviation of a set of observations is the square root of the variance.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Standard Deviation
Example 5
To examine the consistency of shots for a new innovative golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club and 75 with the new club. The distances were recorded. Which 7-iron is more consistent?
Standard Deviation
Example 5 solution
Excel printout, from the Descriptive Statistics submenu.

Statistic             Current      Innovation
Mean                  150.5467     150.1467
Standard Error        0.668815     0.357011
Median                151          150
Mode                  150          149
Standard Deviation    5.792104     3.091808
Sample Variance       33.54847     9.559279
Kurtosis              0.12674      -0.88542
Skewness              -0.42989     0.177338
Range                 28           12
Minimum               134          144
Maximum               162          156
Sum                   11291        11261
Count                 75           75

The innovation club is more consistent, and because the means are close, it is considered a better club.
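The printout can be checked for internal consistency. The standard deviation should be the square root of the sample variance, and the reported standard error appears to equal $s/\sqrt{n}$ (the usual definition, assumed here). The numbers below are copied from the printout.

```python
import math

# Values copied from the printout: (sample variance, standard deviation, standard error, count)
printout = {
    "Current": (33.54847, 5.792104, 0.668815, 75),
    "Innovation": (9.559279, 3.091808, 0.357011, 75),
}

for club, (var, sd, se, n) in printout.items():
    print(club, math.sqrt(var), sd / math.sqrt(n))
```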
The empirical rule: If a sample of observations has a mound-shaped distribution, the interval
$(\bar{x} - s,\ \bar{x} + s)$ contains approximately 68% of the measurements
$(\bar{x} - 2s,\ \bar{x} + 2s)$ contains approximately 95% of the measurements
$(\bar{x} - 3s,\ \bar{x} + 3s)$ contains approximately 99.7% of the measurements
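Interval                           Chebyshev's theorem              Empirical rule
$(\bar{x} - s,\ \bar{x} + s)$      at least 0%  ($1 - 1/1^2$)       approximately 68%
$(\bar{x} - 2s,\ \bar{x} + 2s)$    at least 75% ($1 - 1/2^2$)       approximately 95%
$(\bar{x} - 3s,\ \bar{x} + 3s)$    at least 89% ($1 - 1/3^2$)       approximately 99.7%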
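The "at least" figures follow from the bound $1 - 1/k^2$, which holds for any distribution. The "approximately" figures apply only to mound-shaped data. A small sketch of the comparison (names are illustrative):

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of observations within k standard deviations of the mean."""
    return 1 - 1 / k**2

empirical_rule = {1: 0.68, 2: 0.95, 3: 0.997}   # mound-shaped distributions only

for k in (1, 2, 3):
    print(k, chebyshev_lower_bound(k), empirical_rule[k])
```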
Example
Suppose your score is the 60th percentile of an SAT test. Then 60% of all the scores lie at or below your score, and 40% lie above it.
Sample Percentiles
To determine the sample 100p percentile of a data set of size n, find a data value such that
a) At least np of the values are less than or equal to it.
b) At least n(1-p) of the values are greater than or equal to it.
Find the 10th percentile of 6 8 3 6 2 8 1
Order the data: 1 2 3 6 6 8 8
Find np and n(1-p): 7(0.10) = 0.70 and 7(1 - 0.10) = 6.3
We need a data value such that at least 0.7 of the values are less than or equal to it and at least 6.3 of the values are greater than or equal to it. So, the first observation is the 10th percentile.
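Conditions (a) and (b) can be applied literally by testing each distinct data value. This is a sketch with an illustrative function name. It returns every value that satisfies both conditions, which may be more than one in other data sets.

```python
def sample_percentiles(data, p):
    """Data values v with at least n*p values <= v and at least n*(1-p) values >= v."""
    n = len(data)
    result = []
    for v in sorted(set(data)):
        at_or_below = sum(x <= v for x in data)
        at_or_above = sum(x >= v for x in data)
        if at_or_below >= n * p and at_or_above >= n * (1 - p):
            result.append(v)
    return result

data = [6, 8, 3, 6, 2, 8, 1]
print(sample_percentiles(data, 0.10))
```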
Quartiles
Commonly used percentiles
First (lower) decile = 10th percentile
First (lower) quartile, Q1 = 25th percentile
Second (middle) quartile, Q2 = 50th percentile
Third (upper) quartile, Q3 = 75th percentile
Ninth (upper) decile = 90th percentile
Location of Percentiles
Find the location of any percentile using the formula
$$L_P = (n + 1)\frac{P}{100}$$
where $L_P$ is the location of the $P$th percentile.
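A sketch of the location formula applied to the Example 1 internet times. The formula gives only a position in the ordered data. When that position is fractional, this sketch interpolates linearly between the two neighbouring observations, which is an assumed convention and is not stated in these notes.

```python
def percentile_location(n, P):
    """Location L_P = (n + 1) * P / 100 in the ordered data (1-based)."""
    return (n + 1) * P / 100

def percentile_value(data, P):
    """Value at L_P, interpolating linearly between neighbours (an assumed convention)."""
    ordered = sorted(data)
    loc = percentile_location(len(ordered), P)
    lower = int(loc)                 # 1-based index of the observation at or below L_P
    frac = loc - lower
    if lower < 1:
        return ordered[0]
    if lower >= len(ordered):
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

times = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]   # Example 1 data
print(percentile_location(len(times), 25), percentile_value(times, 25))
```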
Interquartile Range
This is a measure of the spread of the middle 50% of the observations. A large value indicates a large spread of the observations.
Interquartile range = $Q_3 - Q_1$
Box Plot
This is a pictorial display that provides the main descriptive measures of the data set:
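L - the largest observation
Q3 - the upper quartile
Q2 - the median
Q1 - the lower quartile
S - the smallest observation
[Figure: box plot with a box from Q1 to Q3, a line at Q2, and whiskers extending at most 1.5(Q3 - Q1) beyond the box toward S and L.]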
Box Plot
Example 10
Bills: 42.19, 38.45, 29.23, 89.35, 118.04, 110.46, ...
Smallest = 0
[Figure: box plot of the bills; the values -104.226 and 198.4438 are marked on the axis.]
Box Plot
The following data give noise levels measured at 36 different times directly outside of Grand Central Station in Manhattan.
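NOISE: 82, 89, 94, 110, ...
Smallest = 60
Q1 = 75
Median = 90
Q3 = 107
Largest = 125
IQR = 32
Q1 - 1.5(IQR) = 75 - 1.5(32) = 27
Q3 + 1.5(IQR) = 107 + 1.5(32) = 155
Outliers = none (all observations lie between 27 and 155)
[Figure: box plot of the noise levels on an axis from 60 to 130.]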
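The 1.5(IQR) limits for the noise data can be sketched from the five-number summary alone. The function name is illustrative, and the multiplier 1.5 is the one used on the slide.

```python
def box_plot_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

smallest, q1, median, q3, largest = 60, 75, 90, 107, 125   # noise-level summary
lower, upper = box_plot_fences(q1, q3)
has_outliers = smallest < lower or largest > upper
print(lower, upper, has_outliers)
```

Because the smallest and largest observations both fall inside the fences, no observation is flagged as an outlier.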
Box Plot
NOISE - continued
[Figure: box plot from 60 to 125 with Q1 = 75, Q2 = 90, Q3 = 107; 25% of the observations lie below Q1, 50% lie between Q1 and Q3, and 25% lie above Q3.]
Box Plot
Example 11
A study was organized to compare the quality of service in 5 drive through restaurants. Interpret the results
Example 11 solution
Minitab box plot (MINITAB 15 demo: http://www.minitab.com/en-US/products/minitab/freetrial.aspx?langType=1033)
To download SPSS18, follow the link http://bidb.odtu.edu.tr/ccmscontent/articleRead/articleId/ articleRead/articleId/390
Box Plot
Jack in the Box is the slowest in service.
Hardee's service time variability is the largest.
[Figure: Minitab box plots of service time (axis C6, ticks at 100, 200, 300) for the five restaurants.]
Box Plot
[Figure: the same box plots, grouped by restaurant (axis C7; labels include Jack in the Box and Hardee's), with the annotation "Times are symmetric".]
Covariance
$$\text{Population covariance} = COV(X, Y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$
$\mu_x$ ($\mu_y$) is the population mean of the variable X (Y). N is the population size.
Covariance
Compare the following three sets
Set 1 (x̄ = 5, ȳ = 20)
xi    yi    (x - x̄)    (y - ȳ)    (x - x̄)(y - ȳ)
2     13    -3          -7          21
6     20    1           0           0
7     27    2           7           14
Cov(x,y) = 17.5

Set 2 (x̄ = 5, ȳ = 20)
xi    yi    (x - x̄)    (y - ȳ)    (x - x̄)(y - ȳ)
2     27    -3          7           -21
6     20    1           0           0
7     13    2           -7          -14
Cov(x,y) = -17.5

Set 3 (x̄ = 5, ȳ = 20)
xi    yi    (x - x̄)    (y - ȳ)    (x - x̄)(y - ȳ)
2     20    -3          0           0
6     27    1           7           7
7     13    2           -7          -14
Cov(x,y) = -3.5
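The Cov(x,y) values quoted for the three sets appear to match the sum of cross-products divided by n - 1 (for example 35/2 = 17.5) rather than by N. Treating them as sample covariances is an assumption. The sketch below supports both divisors through a hypothetical `sample` flag.

```python
def covariance(xs, ys, sample=True):
    """Covariance of paired data; divides by n - 1 if sample, else by N."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    total = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

x = [2, 6, 7]
for y in ([13, 20, 27], [27, 20, 13], [20, 27, 13]):
    print(y, covariance(x, y))
```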
Covariance
If the two variables move in the same direction (both increase or both decrease), the covariance is a large positive number.
If the two variables move in opposite directions (one increases when the other one decreases), the covariance is a large negative number.
If the two variables are unrelated, the covariance will be close to zero.