
Six Sigma Green Belt Training

Correlation/Regression
© 2004 American Society for Quality. All Rights Reserved.

About This Module . . .

Correlation Analysis is used to quantify the degree of association between variables.
Regression Analysis is used to quantify the functional relationship between variables.

Six Sigma, A Quest for Process Perfection: Attack Variation and Meet Goals

Data files:
\DataFile\Correl.mtw
\DataFile\RegressAnova.mtw
\DataFile\Correg Your Turn.mtw


What We Will Learn . . .

Correlation
- How to measure the linear relationship between two variables
- How to interpret the Pearson Correlation Coefficient, r

Regression
- Y = f(X): how to regress a dependent variable, Y, on an independent variable, X (simple linear regression)
- How to interpret the Coefficient of Determination, R-Sq
- How to interpret the ANOVA table for simple linear regression
- How to analyze residuals

Real World Examples


ADMINISTRATIVE: A software company wants to know the relationship between calls in queue and service time.

MANUFACTURING: A quality manager wants to predict the strength of a plastic molding by destructively testing a coupon.

DESIGN: A chemical engineer, designing a new process, wants to investigate the relationship between a key input variable and the stack-loss of ammonia.

Terms
Correlation

- Used when both Y and X are continuous
- Measures the strength of the linear relationship between Y and X
- Metric: Pearson Correlation Coefficient, r (r varies between -1 and +1)
  - Perfect positive relationship: r = +1
  - No relationship: r = 0
  - Perfect negative relationship: r = -1

Regression
- Simple linear regression is used when both Y and X are continuous
- Quantifies the relationship between Y and X (Y = b0 + b1X)
- Metric: Coefficient of Determination, R-Sq (varies from 0.0 to 1.0, or zero to 100%)
  - None of the variation in Y is explained by X: R-Sq = 0.0
  - All of the variation in Y is explained by X: R-Sq = 1.0
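For readers who want to check these two metrics numerically, here is a minimal sketch (not part of the original module), assuming Python with numpy and scipy; the data values are purely illustrative. It also shows that, for simple linear regression, R-Sq is simply r squared.

```python
# Minimal sketch (not from the module): Pearson r and R-Sq on illustrative data,
# assuming numpy and scipy are installed.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # illustrative X values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])  # illustrative Y values

r, p_value = stats.pearsonr(x, y)       # correlation: strength of linear association
fit = stats.linregress(x, y)            # regression: Y = b0 + b1*X

print(f"r    = {r:.3f}")
print(f"R-Sq = {fit.rvalue**2:.3f}")    # for simple linear regression, R-Sq = r squared
```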

Correlation Coefficients: Illustration


[Three scatterplots of Y versus X: a perfect positive relationship (r = +1.0), a perfect negative relationship (r = -1.0), and no relationship (r = 0.0).]


Correlation: Minitab Example


Station 1:  8.6  8.8  9.0  9.1  9.0  9.1  9.1  9.2  9.1  9.1  9.0  8.8  9.0  9.1  9.4  9.3  8.8  9.2  9.0  8.8
Station 2:  8.7  9.0  9.1  9.3  9.1  9.2  9.2  9.4  9.2  9.2  9.2  9.0  9.2  9.2  9.6  9.5  9.0  9.4  9.0  8.9

(20 paired voltage readings)

Voltage for the same power supply is measured at Station 1 and Station 2.

Determine the correlation for voltage between the two stations.

Approach:

Open Datafile\CORREL.mtw (the data are displayed in the Data Window)


Go to Stat > Basic Statistics > Correlation


Correlation: Minitab Example (Cont)


1. Select C1 Station 1 and C2 Station 2
2. Push Select
3. Observe Station 1 and Station 2 as Variables:
4. Select Display p-values
5. Select OK


Correlation: Minitab Example (Cont)


From the Minitab Session Window:

Correlations: Station 1, Station 2
Pearson correlation of Station 1 and Station 2 = 0.959
P-Value = 0.000

Null hypothesis, H0: NO correlation between Station 1 and Station 2. H0 is rejected because p is less than 0.05.
[Scatterplot of Station 1 vs Station 2 (Graph > Scatterplot), showing a strong positive linear relationship.]
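As a cross-check outside Minitab, the same Pearson correlation can be computed with a short script. This is a sketch assuming Python with numpy/scipy; the values are the Station 1/Station 2 readings listed above, and the expected output is the r = 0.959, p near 0 reported in the Session Window.

```python
# Sketch (assumes Python with numpy/scipy): verify the Minitab correlation output.
import numpy as np
from scipy import stats

station1 = np.array([8.6, 8.8, 9.0, 9.1, 9.0, 9.1, 9.1, 9.2, 9.1, 9.1,
                     9.0, 8.8, 9.0, 9.1, 9.4, 9.3, 8.8, 9.2, 9.0, 8.8])
station2 = np.array([8.7, 9.0, 9.1, 9.3, 9.1, 9.2, 9.2, 9.4, 9.2, 9.2,
                     9.2, 9.0, 9.2, 9.2, 9.6, 9.5, 9.0, 9.4, 9.0, 8.9])

r, p = stats.pearsonr(station1, station2)
print(f"Pearson r = {r:.3f}, p-value = {p:.4f}")  # expected: r near 0.959, p near 0
```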


Simple Linear Regression Analysis

- Used to fit lines and curves to data when the parameters (the b's) are linear
- The fitted lines:
  - Quantify the relationship between the predictor (input) variable (X) and the response (output) variable (Y)
  - Help to identify the vital few Xs (funneling)
  - Enable predictions of the response Y to be made from a knowledge of the predictor X
  - Identify the impact of controlling a process input variable (X) on a process output variable (Y)

- Produces an equation of the form:

  Ŷ = b0 + b1X

  where Ŷ is an estimate (the "fitted value") of the population value Y

Regression: Minitab Example


Station 1:  8.6  8.8  9.0  9.1  9.0  9.1  9.1  9.2  9.1  9.1  9.0  8.8  9.0  9.1  9.4  9.3  8.8  9.2  9.0  8.8
Station 2:  8.7  9.0  9.1  9.3  9.1  9.2  9.2  9.4  9.2  9.2  9.2  9.0  9.2  9.2  9.6  9.5  9.0  9.4  9.0  8.9

The voltage at Station 1 is correlated with the voltage at Station 2.

A Green Belt is given the task of predicting voltage at Station 2 from the voltage at Station 1.

Approach:

Open Datafile\CORREL.mtw (the data are displayed in the Data Window)
Go to Stat > Regression > Fitted Line Plot


Regression: Minitab Example (Cont)

1. Select C1 Station 1 and C2 Station 2
2. Push Select
3. Observe Station 1 as Response (Y): and Station 2 as Predictor (X):
4. Select Linear as Type of Regression Model
5. Select OK


Regression: Minitab Example (Cont)


Fitted Line Plot (Minitab output):

Prediction equation:  Station 1 = 1.020 + 0.8729 Station 2
S = 0.0557288   R-Sq = 92.0%   R-Sq(adj) = 91.5%

[Fitted Line Plot of Station 1 versus Station 2 with the fitted regression line overlaid on the data.]

- Fitted line: obeys the prediction equation
- Coefficient of Determination: use R-Sq for simple linear regression (one X)


Linear Regression of Station 1 on Station 2

How is the dependent variable, Station 1, related to the independent variable, Station 2? In other words, what is the regression of Station 1 on Station 2?

From the Session Window, the regression equation is:

Station 1 = 1.020 + 0.8729 Station 2

(Intercept b0 = 1.020; Slope b1 = 0.8729)

- The intercept, b0, is where the fitted line (regression line) crosses the Y-axis, that is, the value of Y when X = 0
- The slope, b1, is rise over run, ΔY/ΔX
- The coefficients b0 and b1 are estimates of the population parameters β0 and β1; they are linear coefficients
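The same regression of Station 1 on Station 2 can be reproduced outside Minitab. A minimal sketch, assuming Python with numpy/scipy; the expected intercept and slope are the b0 = 1.020 and b1 = 0.8729 reported above.

```python
# Sketch (assumes Python with numpy/scipy): regress Station 1 (Y) on Station 2 (X).
import numpy as np
from scipy import stats

station1 = np.array([8.6, 8.8, 9.0, 9.1, 9.0, 9.1, 9.1, 9.2, 9.1, 9.1,
                     9.0, 8.8, 9.0, 9.1, 9.4, 9.3, 8.8, 9.2, 9.0, 8.8])
station2 = np.array([8.7, 9.0, 9.1, 9.3, 9.1, 9.2, 9.2, 9.4, 9.2, 9.2,
                     9.2, 9.0, 9.2, 9.2, 9.6, 9.5, 9.0, 9.4, 9.0, 8.9])

fit = stats.linregress(x=station2, y=station1)
print(f"intercept b0 = {fit.intercept:.3f}")  # expected near 1.020
print(f"slope     b1 = {fit.slope:.4f}")      # expected near 0.8729
print(f"R-Sq         = {fit.rvalue**2:.3f}")  # expected near 0.92
```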

Origin of the Regression Equation


[Scatter plot of Time to Invoice (Y) versus Items Ordered (X), with a cross marking the means of Y and X.]

What is the best fitted line between the Time to Invoice and the Items Ordered?

The best fitted line goes through the means of Y and X (shown by the cross).


Least Squares Method

Residual, r = Observed Value - Predicted Value

[Plot: Fitted Line and Residuals, showing Time to Invoice (Y) versus Items Ordered (X) with the fitted line and the residuals drawn from each point to the line.]

- The least squares method minimizes the sum of the squared residuals
- The resulting equations for the intercept and slope are called the normal equations
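A minimal sketch of the normal-equation solution, assuming Python with numpy. The Items Ordered / Time to Invoice values below are illustrative (the module does not list the data behind this plot), but the calculation is the general least squares formula for b0 and b1.

```python
# Sketch (assumes Python with numpy): the normal-equation solution that minimizes
# the sum of squared residuals. Data values are illustrative, not from the module.
import numpy as np

items_ordered = np.array([45.0, 52.0, 61.0, 70.0, 78.0, 85.0, 93.0])    # X (illustrative)
time_to_invoice = np.array([48.0, 55.0, 64.0, 70.0, 81.0, 86.0, 95.0])  # Y (illustrative)

x_bar, y_bar = items_ordered.mean(), time_to_invoice.mean()
b1 = np.sum((items_ordered - x_bar) * (time_to_invoice - y_bar)) / np.sum((items_ordered - x_bar) ** 2)
b0 = y_bar - b1 * x_bar                      # the fitted line passes through (x_bar, y_bar)

residuals = time_to_invoice - (b0 + b1 * items_ordered)   # observed minus predicted
print(b0, b1, np.sum(residuals ** 2))        # intercept, slope, minimized sum of squares
```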


Least Squares Method (Cont)


[Plot: Fitted Line and Residuals for Time to Invoice (Y) versus Items Ordered (X), with a positive residual, a zero residual, and a negative residual labeled.]

A residual may be positive, negative, or zero:
- Positive: point above the fitted line
- Zero: point on the fitted line
- Negative: point below the fitted line


Statistical Significance

An analysis of variance (ANOVA) table informs us about the statistical significance of the regression analysis

The null hypothesis, H0: the regression results from common cause variation. When H0 is true, there is no statistically significant regression, and the best prediction of Y is the mean of Y.

As before, the p-value is used to evaluate the null hypothesis: if p is less than 0.05, the null hypothesis is rejected and the regression is statistically significant.

Approach:

Use Datafile\REGRESSANOVA.mtw
Go to Stat > Regression > Regression
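For reference, the overall F-test and its p-value can also be obtained outside Minitab. A sketch assuming Python with numpy/statsmodels, using the X/Y values shown on the "Identifying subgroups" slide later in this module (the REGRESSANOVA data):

```python
# Sketch (assumes Python with numpy/statsmodels): the overall F-test for the regression.
import numpy as np
import statsmodels.api as sm

x = np.array([1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5], dtype=float)
y = np.array([1.00, 1.40, 1.70, 1.80, 2.20, 2.10, 3.30,
              3.20, 3.30, 4.10, 4.40, 5.60, 5.40, 5.50])

model = sm.OLS(y, sm.add_constant(x)).fit()   # Y = b0 + b1*X
print(f"F = {model.fvalue:.2f}, p = {model.f_pvalue:.4g}")
# If p < 0.05, reject H0 (common cause variation only): the regression is significant.
```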


ANOVA for Simple Linear Regression


1. Select Options
2. Select Pure Error in Lack of Fit Tests
3. Select OK


ANOVA for Simple Linear Regression


Observe the ANOVA (Minitab Session Window)

Regression is significant: p < 0.05

Analysis of Variance

Source            DF      SS       MS        F       P
Regression         1   32.123   32.123   722.31   0.000
Residual Error    12    0.534    0.044
  Lack of Fit      3    0.212    0.071     1.98   0.188
  Pure Error       9    0.322    0.036
Total             13   32.657

No lack of fit: p >= 0.05

- The sum of squares (SS) for Regression involves each predicted value of Y minus the mean of Y
- The SS for Residual Error involves each observed value minus the predicted value, that is, the residual
- SS for Residual Error can be further decomposed into SS Lack of Fit and SS Pure Error
- SS Pure Error is the within-subgroup variation, and SS Lack of Fit is SS Residual Error minus SS Pure Error
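The whole decomposition, including the lack-of-fit / pure-error split, can be reproduced by hand. A sketch assuming Python with numpy and pandas, again using the REGRESSANOVA X/Y values listed on the "Identifying subgroups" slide later in this module:

```python
# Sketch (assumes Python with numpy/pandas): reproduce the sums of squares,
# including the lack-of-fit / pure-error split, for the REGRESSANOVA data.
import numpy as np
import pandas as pd

x = np.array([1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5], dtype=float)
y = np.array([1.00, 1.40, 1.70, 1.80, 2.20, 2.10, 3.30,
              3.20, 3.30, 4.10, 4.40, 5.60, 5.40, 5.50])

# Least squares fit, then the ANOVA decomposition
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ss_regression = np.sum((y_hat - y.mean()) ** 2)   # predicted minus mean
ss_residual = np.sum((y - y_hat) ** 2)            # observed minus predicted
ss_total = np.sum((y - y.mean()) ** 2)

# Pure error: within-subgroup variation among rows sharing the same X value
grouped = pd.DataFrame({"x": x, "y": y}).groupby("x")["y"]
ss_pure_error = grouped.apply(lambda g: np.sum((g - g.mean()) ** 2)).sum()
ss_lack_of_fit = ss_residual - ss_pure_error

print(ss_regression, ss_residual, ss_lack_of_fit, ss_pure_error, ss_total)
# Expected to match the table above: 32.123, 0.534, 0.212, 0.322, 32.657 (rounded)
```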

Degrees of Freedom (Linear Regression)

- Every sum of squares (SS) has a number called degrees of freedom, DOF
- DOF is the number of independent pieces of information, involving the n independent responses Y1, Y2, ..., Yn, that are needed to compile the sum of squares
- SS about the mean needs (n - 1) pieces of information
- SS due to regression needs one piece of information, b1
- SS of residuals needs (n - 2) pieces of information: in general, DOF for SS of residuals equals (number of observations - number of parameters estimated)
  - b0 and b1 for simple linear regression



ANOVA for Simple Linear Regression (Contd)


Source of Variation and Degrees of Freedom

Source            Degrees of Freedom
Regression        1
Residual Error    n - 2
  Lack of Fit     DOF Residual Error - DOF Pure Error
  Pure Error      Σ over the p subgroups of (mj - 1)
Total             n - 1

where mj = sample size of subgroup j, and p = number of subgroups


ANOVA for Simple Linear Regression (Contd)


Identifying subgroups and sample size (observe the Minitab Data Window):
- A subgroup contains all of the predictor variables, Xi, that have the same value
- The sample size, m, is the number of cells in each subgroup

Predictor, Xi:  1.00  1.00 | 2.00  2.00  2.00  2.00 | 3.00  3.00  3.00 | 4.00  4.00 | 5.00  5.00  5.00
Response, Yi:   1.00  1.40 | 1.70  1.80  2.20  2.10 | 3.30  3.20  3.30 | 4.10  4.40 | 5.60  5.40  5.50

Subgroup #1, m = 2; Subgroup #2, m = 4; Subgroup #3, m = 3; Subgroup #4, m = 2; Subgroup #5, m = 3

Each subgroup contributes within-subgroup (pure error) variation.

DOF Pure Error = (2-1) + (4-1) + (3-1) + (2-1) + (3-1) = 1 + 3 + 2 + 1 + 2 = 9
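A small sketch of the same degrees-of-freedom count, assuming Python with pandas: group the rows by predictor value and sum (mj - 1) over the subgroups.

```python
# Sketch (assumes Python with pandas): pure-error degrees of freedom = sum of (m_j - 1).
import pandas as pd

data = pd.DataFrame({
    "x": [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5],
    "y": [1.00, 1.40, 1.70, 1.80, 2.20, 2.10, 3.30,
          3.20, 3.30, 4.10, 4.40, 5.60, 5.40, 5.50],
})

subgroup_sizes = data.groupby("x").size()      # m_j for each subgroup (2, 4, 3, 2, 3)
dof_pure_error = int((subgroup_sizes - 1).sum())
print(dof_pure_error)                          # 9, matching the slide
```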



ANOVA for Simple Linear Regression (Contd)


Source of Variation, Sum of Squares, Mean Square, and F-value

Source            Sum of Squares                       Mean Square          F
Regression        Σ (Ŷi - Ȳ)²                          MS Regression        MS Regression / MS Residual Error
Residual Error    Σ (Yi - Ŷi)²                         MS Residual Error
  Lack of Fit     SS Residual Error - SS Pure Error    MS Lack of Fit       MS Lack of Fit / MS Pure Error
  Pure Error      Σj Σk (Yjk - Ȳj)²                    MS Pure Error
Total             Σ (Yi - Ȳ)²

Sums over i run from 1 to n; j runs over the p subgroups, and k runs from 1 to mj within subgroup j. Ŷi is the predicted value, Ȳ is the mean of Y, and Ȳj is the mean of subgroup j. Each mean square is the corresponding sum of squares divided by its degrees of freedom.


Analysis of Residuals
- Residuals are used to test the adequacy of the prediction equation (model)
- In residual plots, three types of patterns indicate model inadequacy; the patterns will be dramatic, not subtle!
  1. Fans
  2. Bands sloping up or down
  3. Curved bands

Approach:
- Open Datafile\Residuals
- Go to Stat > Regression > Fitted Line Plot
- Note: Fitted Line Plot does not have Lack of Fit Tests
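A residuals-versus-fits plot, the main plot to inspect for fans, sloping bands, or curved bands, can be drawn outside Minitab as well. A sketch assuming Python with numpy and matplotlib, reusing the REGRESSANOVA values as stand-in data:

```python
# Sketch (assumes Python with numpy/matplotlib): a residuals-versus-fits plot.
# The REGRESSANOVA values stand in for whatever data set is being checked.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5], dtype=float)
y = np.array([1.00, 1.40, 1.70, 1.80, 2.20, 2.10, 3.30,
              3.20, 3.30, 4.10, 4.40, 5.60, 5.40, 5.50])

b1, b0 = np.polyfit(x, y, deg=1)   # slope, intercept of the linear fit
fitted = b0 + b1 * x
residuals = y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted Value")
plt.ylabel("Residual")
plt.title("Residuals Versus the Fitted Values")
plt.show()                          # look for fans, sloping bands, or curved bands
```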

Analysis of Residuals (Cont)

1. In the Fitted Line Plot dialog box, select Graphs
2. Select Four in One Plot
3. Select OK
4. Select OK


Analysis of Residuals (Cont)

Fitted Line Plot (linear model):

Units = -2.343 + 0.08993 Minutes
S = 1.78117   R-Sq = 89.7%   R-Sq(adj) = 89.2%

[Fitted Line Plot of Units versus Minutes with the linear fit overlaid.]

- R-Sq is 89.7%
- The regression is significant
- Can we do better? How do the residuals look?


Analysis of Residuals (Cont)


[Residual Plots for Units (four-in-one): Normal Probability Plot of the Residuals, Residuals Versus the Fitted Values, Histogram of the Residuals, and Residuals Versus the Order of the Data.]


Analysis of Residuals (Cont)


Residuals must be normally distributed. Are they? First, store the residuals, then:

Stat > Basic Statistics > Normality Test

[Normal Probability Plot of the Residuals (response is Units); Probability Plot of RESI1 (Normal): Mean = -9.69595E-15, StDev = 1.742, N = 24, AD = 0.336, P-Value = 0.479.]

p > 0.05: we can assume the residuals are normal.
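Equivalent normality checks exist outside Minitab. A sketch assuming Python with numpy/scipy; since the module's stored residuals (RESI1, from the Units/Minutes fit) are not listed here, the REGRESSANOVA residuals stand in as an illustration. scipy's Anderson-Darling routine reports the statistic with critical values rather than a p-value, so a Shapiro-Wilk test (a different normality test) is added for a direct p-value.

```python
# Sketch (assumes Python with numpy/scipy): normality checks on stored residuals.
import numpy as np
from scipy import stats

x = np.array([1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5], dtype=float)
y = np.array([1.00, 1.40, 1.70, 1.80, 2.20, 2.10, 3.30,
              3.20, 3.30, 4.10, 4.40, 5.60, 5.40, 5.50])
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)       # stand-in for the stored residuals (RESI1)

ad = stats.anderson(residuals, dist="norm")   # Anderson-Darling statistic + critical values
w, p = stats.shapiro(residuals)               # Shapiro-Wilk test gives a p-value directly
print(f"AD statistic = {ad.statistic:.3f}")
print(f"Shapiro-Wilk p-value = {p:.3f}")      # p > 0.05: no evidence against normality
```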


Analysis of Residuals (Cont)

[Plot: Residuals Versus the Fitted Values (response is Units), showing a curved band.]

The plot of Residuals vs. Fits shows a curved band. Try Stat > Regression > Fitted Line Plot and select Quadratic. Select Graphs > Four in One Plot.
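The linear-versus-quadratic comparison can be sketched outside Minitab as well. The module's Units/Minutes data file is not listed here, so the values below are generated for illustration; the point is only that adding the quadratic term raises R-Sq when the residuals show a curved band.

```python
# Sketch (assumes Python with numpy): compare linear and quadratic fits by R-Sq.
# Data are illustrative; they are generated with mild curvature plus noise.
import numpy as np

rng = np.random.default_rng(1)
minutes = np.array([10, 25, 40, 60, 80, 100, 120, 140, 160, 180, 200], dtype=float)
units = 2.5 - 0.02 * minutes + 0.0005 * minutes ** 2 + rng.normal(0, 0.8, minutes.size)

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

linear_fit = np.polyval(np.polyfit(minutes, units, deg=1), minutes)
quadratic_fit = np.polyval(np.polyfit(minutes, units, deg=2), minutes)

print(f"Linear    R-Sq = {r_squared(units, linear_fit):.3f}")
print(f"Quadratic R-Sq = {r_squared(units, quadratic_fit):.3f}")  # higher when curvature is real
```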


Analysis of Residuals (Cont)


Fitted Line Plot (quadratic model):

Units = 2.672 - 0.02075 Minutes + 0.000466 Minutes**2
S = 1.26903   R-Sq = 95.0%   R-Sq(adj) = 94.5%

Improving the model adequacy increased R-Sq from 89.7% to 95.0%.

[Residual Plots for Units (four-in-one): Normal Probability Plot of the Residuals, Residuals Versus the Fitted Values, Histogram of the Residuals, and Residuals Versus the Order of the Data.]

How do the residuals look?


Your Turn

Open Datafile\CORREG YOUR TURN
Analyze the data sets:
1. Do the variables correlate?
2. What is the prediction equation?
3. Is the regression statistically significant?
4. Does the analysis of residuals indicate anything unusual?

Another approach: Stat > Regression > Regression > Options > Lack of Fit Tests
- Select Pure Error when your data are replicated
- Select Data Subsetting when your data are not replicated

What We Have Learned . . .

Correlation
- How to measure the linear relationship between two variables
- How to interpret the Pearson Correlation Coefficient, r

Regression
- Y = f(X): how to regress a dependent variable, Y, on an independent variable, X (simple linear regression)
- How to interpret the Coefficient of Determination, R-Sq
- How to interpret the ANOVA table for simple linear regression
- How to analyze residuals
