
Hypothesis Testing, Design of Experiments and Regression Analysis

Learning points/objectives:
1. Definition and meaning of Hypothesis and the construction of hypothesis
testing.
2. How and why tests are used to make decisions about a population when its mean,
variance or probability of success is specified.
3. Physical meanings of Type I and Type II errors. How to calculate the
probability of a Type II error (β) when a value of the alternative hypothesis is given
and the values of α, σ, and n are known.
4. For tests of a single mean, proportion, or variance, know how to compute the
value of the decision criterion (or criteria) for different alternative hypotheses
and know how to use those values to determine whether the null hypothesis
should be rejected.
5. Understand how the values of n, c, α, and β are related to one another and to the OC
curve.
6. Meaning, scope, procedure, analysis on design of experiments.
7. Regression analysis for developing statistical relationships between predictor
variables and response variable.

Defining Hypotheses
A statistical hypothesis is a statement or a claim about the parameters of a
probability distribution or the parameters of a model.
Ex. The mean tension bond strengths of two formulations are claimed to be unequal. So

H0: μ1 = μ2
H1: μ1 ≠ μ2

Population parameters can be approximated based on sample statistics. A statistical
response can be described by
y = μ + ε (mean value + a normal random variable that represents noise in the system)
In estimation there is no preconceived notion about the actual value of a parameter. But
when testing a hypothesis on a population parameter, there is a preconceived notion
concerning its value.

There are two theories/hypotheses that can be entertained: the hypothesis
proposed by the experimenter, i.e., the alternative or research hypothesis, denoted by Ha or
H1, and the negation of this hypothesis, called the null hypothesis, H0. Experiments are done
to establish H1 or to show evidence to refute H0. These are called simple hypotheses.
Ha: There are significant relationships between the activities of the departments of
commerce, planning, design, procurement and production in manufacturing and the
time to delivery and expenditure in a production process, and E-collaboration can
reduce them significantly.
H0: Those relationships are insignificant, and E-collaboration cannot reduce them
significantly.
or
Ha: A new composite of ... and ... material/metal can increase the compressive
strength of ... structure significantly.
H0: The new composite of ... and ... material/metal cannot increase the compressive
strength of ... structure.
or
Ha: An automated production system can increase the production rate of car door
items significantly.
H0: An automated production system cannot increase the production rate of car door
items significantly.
Guidelines for Hypothesis Testing (decision procedure): Three guidelines:
H0 pinpoints a specific numerical value that could be the actual value (status
quo) of a population parameter. This is called the null value and may be denoted
by μ0 (for a population mean) or p0 (for a proportion). A statement of equality will always be
included in H0.
Detection or support is centered around the alternative hypothesis.
As H1 is the research hypothesis, evidence should lead to rejecting H0 and thus
accepting H1.
Ex. Many factors affect productivity improvement performance in manufacturing.
One is the proper ratio of the mix of raw materials. It is thought that over 50% of the
outputs have a disproportional mix of raw materials. It is now required to support this
contention statistically. As we wish to support the statement that p > 0.5, this
contention is taken as the alternative or research hypothesis.

Let's state the hypotheses and explain their meanings.

H0: p ≤ 0.5 (includes equality; 0.5 is the possible value for p, i.e., the null value is p0 = 0.5)
H1: p > 0.5 (if the null hypothesis is rejected, a new mixture is necessary)
What you need to determine, know or set in hypothesis testing:
Sampling (random) and sample size, n
Type of (sampling) distribution: normal, t-, F-, binomial, etc.
Approximation of the binomial by the normal distribution, and so on
Parameter(s): population mean, standard deviation, etc.
Sample mean, sample variance
Type I and Type II errors
Degrees of freedom
Now, Hypothesis Testing (a decision procedure)
Of course, a decision must be made from a sample. It is about rejecting H0 or
failing to do so. The decision is made under the assumption that the null value is the true
value of the population parameter θ. The statistic computed from the observations is called the
test statistic. Say the test statistic assumes a value that is rarely seen when θ = θ0 but tends
towards H1; then reject H0 in favor of H1, otherwise do not reject H0. So, after any study, you
are in exactly one of the following situations:
1. You have rejected H0 although it was true, committing an error
called a Type I error (producer's risk).
2. You have made the correct decision of rejecting H0 when H1 was true.
3. You have failed to reject H0 when H1 was true, committing another
error, known as a Type II error (consumer's risk).
4. You have made the correct decision of failing to reject H0 when H0 was true.
Truth                        | Decision: H0 (p ≤ 0.5) is not rejected | Decision: H0 (p ≤ 0.5) is rejected
H0: p is less than or equal to 0.5 | No error is made                 | Type I (α) error is made
H1: p is greater than 0.5          | Type II (β) error is made        | No error is made

Ex. Take the previous example and explain the physical meanings of the hypotheses.
H0: p ≤ 0.5
H1: p > 0.5 (the majority of the mixes of raw materials are improper)
Meaning: if a Type I error is made, H0 is rejected although H0 is true (the current mixture is
ok). So one makes a mistake and concludes that new mixtures are needed, although this is
not necessary. But if H0 cannot be rejected although H0 is not true, a Type
II error is committed. That is, a new mixture is needed but we cannot conclude in its favor.
Therefore, making a Type I or Type II error is always a possibility. There is no way to
avoid this dilemma. How to deal with it?
The question is the amount of probability associated with each of them. Design a
method for deciding whether or not to reject H0 so that the probability of making either
error is small.
That is,

α = P(rejecting H0 when H0 is true)
β = P(failing to reject H0 when H0 is false)

Two methods to decide:

1. Hypothesis testing: based on a given α value
Set the size of the test or level of significance, i.e., the probability of committing a Type I
error, α. Find the value bounding the rejection region based on the sample statistic. This is
called the critical value, and the corresponding set of values the rejection region for the test.
Ex. Test of hypothesis. Previous example. Now collect data, say a sample of size 20.
Check how many disproportion cases there are.
H0: p ≤ 0.5
H1: p > 0.5 (the majority of the mixes of raw materials are improper)
The random variable (RV) X = number of disproportional mixes is binomially distributed (BD), so that
E[X] = np = 10.
If H0 is true (p = 0.5), for every 20 mixes, on average, 10 have
disproportions. H0 will be rejected if the count sufficiently exceeds 10 (H1 will be favored). So,

P[X ≥ 14 | p = 0.5] = 1 − P[X < 14 | p = 0.5] = 1 − P[X ≤ 13 | p = 0.5] = 1 − 0.9423 = 0.0577

Say we agree to reject H0 in favor of H1 if the observed value of the test statistic is 14
or larger. Then split the possible values into two sets: A = {0, 1, 2, …, 12, 13} and
Ā = {14, 15, 16, …, 20}. If the observed value falls in set Ā, reject H0;
otherwise do not reject it. If the test is so designed, the probability of committing a Type I
error is the value obtained above, approximately 0.05 as desired.
For values of p below 0.5 the Type I error probability is even smaller.
Now about the Type II error, β: It is possible that the observed value of the test statistic
does not fall into the rejection region even though H0 is not true and should be
rejected. A Type II error occurs in this case. β depends on the alternative hypothesis; a
value of the alternative needs to be specified by the experimenter.
Ex. The critical region is Ā = {14, 15, 16, …, 20}. If the true proportion is 0.7, what
is the probability that the test is unable to detect this situation? That is, find β, the
probability that H0 will not be rejected:

P[Type II error] = β = P[fail to reject H0 | p = 0.7] = P[X ≤ 13 | p = 0.7] = 0.3920
If p = 0.8, find β and compare.
Ex. H0: p ≤ 0.1 and H1: p > 0.1. n = 20, test statistic X = no. of successes in 20 trials.
The critical region is C = {9, 10, 11, …, 20}.
The null value is 0.1. If p = p0, then for the BD, E(X) = np0 = 20(0.1) = 2.

P[Type I error] = α = P[X ≥ 9 | p = 0.1] = 1 − P[X < 9 | p = 0.1]
= 1 − P[X ≤ 8 | p = 0.1] = 1 − 0.9999 ≈ 0.0001

Power of the test:
Putting in the value p = 0.7, there is a 39.2% chance that we will not be able to
detect the departure from H0. Thus the power of the test to detect this alternative value of p is
1 − 0.392 = 0.608.
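These binomial error probabilities are easy to verify numerically; below is a minimal sketch using scipy.stats.binom (variable names are mine):

```python
from scipy.stats import binom

n, p0 = 20, 0.5
c = 13  # reject H0 when X >= 14, i.e., when X > c

alpha = 1 - binom.cdf(c, n, p0)   # Type I error: P[X >= 14 | p = 0.5] ~ 0.0577
beta = binom.cdf(c, n, 0.7)       # Type II error: P[X <= 13 | p = 0.7] ~ 0.3920
print(f"alpha = {alpha:.4f}, beta(0.7) = {beta:.4f}, power(0.7) = {1 - beta:.4f}")
print(f"beta(0.8) = {binom.cdf(c, n, 0.8):.4f}")  # for comparison
```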
2. Significance testing:
Another method for deciding whether or not to reject a null hypothesis. It is gaining wide
acceptance due to its reasonable appeal and the use of computers. Here, we do not use a
preset α value or a rigid critical region. Then, what?
Evaluate the test statistic and determine the probability of observing a value of the test
statistic at least as extreme as the one obtained. This probability has several names: critical level,
descriptive level of significance, or P-value of the test. The P-value (0 to 1) indicates the strength of
disagreement between the sample and H0. Therefore it measures the plausibility of H0.

Ex. An engine emits a toxic gas at 100 mg/s. A new design (engine) is
proposed that would reduce the emission level. A sample of 50 values from a
modified engine is taken and tested. The sample mean (x̄) is 92 mg/s and the standard deviation is 21
mg/s. Can we produce the new engine based on these sample statistics? The population
mean may be more or less than 92 mg/s.
A smaller P-value indicates that the analyst is more certain that H0 should be rejected.
The larger the P-value, the more plausible H0 is, but we are never certain that it is true.
A rule of thumb is to reject H0 whenever P ≤ 0.05 (not scientific). If P ≤ 0.05,
the result is statistically significant at the 5% level. If P ≤ 0.01, the result is
statistically significant at the 1% level, and so on.
A test is statistically significant at significance level α if the P-value is ≤ α. The test result
is then significant at the 100α% level (the null hypothesis is rejected). So, reporting on a test is
best done with the P-value itself, rather than just comparing it to 5% or 1%.
The P-value is not the probability that H0 is true. In other words, do not accept or reject
H0 mechanically when P is near α; leave borderline cases to subjective judgment.
General criteria:
Strong evidence against H0: P < 0.01
Moderately strong evidence against H0: 0.01 < P < 0.05
Relatively weak evidence against H0: 0.05 < P < 0.10
The P-value for a lower-tailed test is defined as
P-value = P(the sample mean would be as small as the observed value if H0 were true)
Software finds the observed P-value and the judgment is up to the user.

Suggestions for computing the P-value

In case the data distribution is normal, N(μ, σ²) where σ² is known, and the test is
based on the value of the statistic

z0 = (x̄ − μ0)/(σ/√n)

with observed value z, then

P-value = Φ(z) ................. lower-tailed test
P-value = 1 − Φ(z) ............. upper-tailed test
P-value = 2[1 − Φ(|z|)] ........ two-tailed test

Ex. Use the data from the previous exercise.

H0: μ ≥ 100
H1: μ < 100

n = 50, x̄ − μ0 = 92 − 100 = −8

z = (x̄ − μ0)/(σ/√n) = −8/(21/√50) ≈ −2.69

P-value = Φ(−2.69) ≈ 0.0036, so the evidence against H0 is strong.
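The same computation in a few lines of Python, as a sketch:

```python
from math import sqrt
from scipy.stats import norm

xbar, mu0, s, n = 92, 100, 21, 50
z = (xbar - mu0) / (s / sqrt(n))   # standardized test statistic
p_value = norm.cdf(z)              # lower-tailed P-value

print(f"z = {z:.2f}, P-value = {p_value:.4f}")  # z ~ -2.69, P ~ 0.0036
```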

Testing a simple hypothesis against a one-sided alternative

Suppose a manufacturer claims that a new process produces a product of mean
compressive strength greater than μ0; otherwise it is defective. Now take a sample
(size n) and find the mean value. Is it enough to accept the claim? Assume that
strength is normally distributed and its standard deviation σ is known from past experience. So

H0: μ = μ0 (no change; μ0 is the null value). As σ is known, H0 is a simple
hypothesis.
H1: μ > μ0 (the new process is better. As H1 does not specify the distribution
completely, it is a composite hypothesis).
So, choose a number (cutoff value or criterion) c > μ0. Do not reject H0 if x̄ ≤ c.
Reject H0 if x̄ > c (this defines the rejection region of the test).

Decision | H0 is true: strength from the new process does not exceed the past value (μ = μ0) | H1 is true: strength from the new process exceeds the past value (μ > μ0)
x̄ ≤ c   | Correct decision | Type II error
x̄ > c   | Type I error     | Correct decision

Performance of the test (how well the test controls errors) can be analyzed by:
computing the Type I and Type II error probabilities, and
finding the power function of the test.

α = P[x̄ > c | μ0] = P[(x̄ − μ0)/(σ/√n) > (c − μ0)/(σ/√n)] = 1 − Φ((c − μ0)/(σ/√n))

(Type I error probability)

β(μ) = P[x̄ ≤ c | μ] = P[(x̄ − μ)/(σ/√n) ≤ (c − μ)/(σ/√n)] = Φ((c − μ)/(σ/√n))

(Type II error probability)

The power function of the test is

π(μ) = P[reject H0 | μ] = P[x̄ > c | μ] = 1 − Φ((c − μ)/(σ/√n))

So, P[accept H0 | μ] + P[reject H0 | μ] = 1.0

The probabilities of the four possible correct and incorrect decisions can be listed as:

Decision | H0 is true (μ = μ0)           | H1 is true (μ > μ0)
x̄ ≤ c   | Φ((c − μ0)/(σ/√n))            | Φ((c − μ)/(σ/√n))
x̄ > c   | 1 − Φ((c − μ0)/(σ/√n))        | 1 − Φ((c − μ)/(σ/√n))

Do Type I and Type II errors carry the same significance?

No. People are usually more concerned with the possibility of wrongly rejecting H0, so more
attention is given to controlling the Type I error than the Type II error.
Ex. A manufacturer is more concerned about the Type I error (mistakenly
rejecting conforming product) than about mistakenly accepting nonconforming product.
Therefore, α refers to the significance level of the test or the size of the rejection region. The
size of the rejection region is

P[x̄ > c | H0 is true] = P[x̄ > c | μ0] = α.

Choose the value of c so that the Type I error probability equals α (typically 10%, 5% or 1%).

From 1 − Φ((c − μ0)/(σ/√n)) = α, i.e., (c − μ0)/(σ/√n) = z(α),

c = μ0 + z(α)·σ/√n, and β(μ0) = 1 − α.

Computing the Type II error probability at an alternative μ > μ0:

β(μ) = P[x̄ ≤ c | μ] = Φ((c − μ)/(σ/√n)), and π(μ) = 1 − β(μ)

So, for constructing a level-α test of H0: μ = μ0 against H1: μ > μ0:
Choose c from c = μ0 + z(α)·σ/√n.
The probability of a Type I error is P[x̄ > μ0 + z(α)(σ/√n) | μ0] = α.
Now accept H0 if x̄ ≤ μ0 + z(α)(σ/√n), or
reject H0 if x̄ > μ0 + z(α)(σ/√n)

(upper one-sided test)

How to draw an OC curve and find the sample size?

Say the sample size n, the standard deviation σ, and the Type I error α are known. Find
the value of c from c = μ0 + z(α)·σ/√n. Consider a set of values of μ and find

β(μ) = P[x̄ ≤ c | μ] = Φ((c − μ)/(σ/√n))

For known levels of the Type I and Type II errors, the c values are c(α) = μ0 + z(α)·σ/√n
and c(β) = μ − z(β)·σ/√n, and equating them gives the sample size

n = σ²[z(α) + z(β)]² / (μ − μ0)²
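A small helper for the sample-size formula above; the numbers plugged in are illustrative, not from the text:

```python
from math import ceil, sqrt
from scipy.stats import norm

def sample_size(mu0, mu1, sigma, alpha, beta):
    """n = sigma^2 * (z(alpha) + z(beta))^2 / (mu1 - mu0)^2, rounded up."""
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(1 - beta)
    return ceil((sigma * (z_a + z_b) / (mu1 - mu0)) ** 2)

n = sample_size(mu0=50, mu1=52, sigma=6, alpha=0.05, beta=0.10)
c = 50 + norm.ppf(0.95) * 6 / sqrt(n)   # critical value for the level-alpha test
print(f"n = {n}, c = {c:.3f}")
```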

Equivalent test based on a standardized test statistic

The one-sided test in this form uses

Z0 = (x̄ − μ0)/(σ/√n)

The null distribution of the test statistic Z0 is standard normal:

Z0 = (x̄ − μ0)/(σ/√n) ~ N(0, 1) when μ = μ0

Locate the critical value from the table. The rejection region is {Z0 > z(α)}. That is,

x̄ > μ0 + z(α)(σ/√n)  is equivalent to  Z0 = (x̄ − μ0)/(σ/√n) > z(α).

So the equivalent test of H0: μ = μ0 against H1: μ > μ0 is:
Accept H0 if Z0 = (x̄ − μ0)/(σ/√n) ≤ z(α), or
reject H0 if Z0 = (x̄ − μ0)/(σ/√n) > z(α)

Ex. Suppose the current mean is unknown but the standard deviation is 6. The standard mean is 50. From
the sample data (n = 9), the mean is calculated and found to be 52.5.
Construct the hypotheses. Locate the acceptance and rejection regions for α = 5%.
Make a decision on H0. Find the Type II error function and draw it.
Determine the power function and draw its curve. Do the equivalent test.
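A worked sketch of this exercise in Python (the alternative mean 54 used for the power check is my own choice):

```python
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, xbar, alpha = 50, 6, 9, 52.5, 0.05
z_alpha = norm.ppf(1 - alpha)          # 1.645
c = mu0 + z_alpha * sigma / sqrt(n)    # critical value ~53.29

z0 = (xbar - mu0) / (sigma / sqrt(n))  # observed statistic = 1.25
print(f"c = {c:.2f}, z0 = {z0:.2f}")
print("reject H0" if z0 > z_alpha else "fail to reject H0")

power = lambda mu: 1 - norm.cdf((c - mu) / (sigma / sqrt(n)))  # power function
print(f"power(54) = {power(54):.3f}")
```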

For a lower one-sided test

For constructing a level-α test of H0: μ = μ0 against H1: μ < μ0:
In this case reject when x̄ < c, where c = μ0 − z(α)·σ/√n.
Now accept H0 if x̄ ≥ μ0 − z(α)(σ/√n), or
reject H0 if x̄ < μ0 − z(α)(σ/√n).

Power function: π(μ) = P[x̄ < c | μ] = Φ((c − μ)/(σ/√n)) = 1 − β(μ),
where β(μ) = P[x̄ ≥ c | μ] = 1 − Φ((c − μ)/(σ/√n)).

Ex. The population data are normally distributed. n = 6, μ0 = 0.2, σ = 0.04. 5% level test.
H0: μ = 0.2
H1: μ < 0.2
Conduct the lower-sided test and make a decision at x̄ = 0.17. Draw the
power function and find π(0.17).

z(α) = z(0.05) = 1.645.

c = μ0 − z(α)·σ/√n = 0.2 − 1.645 × 0.04/√6 = 0.1731. Since x̄ = 0.17 < 0.1731, reject H0.

Now, π(μ) = Φ((0.1731 − μ)/(0.04/√6)). Draw the power vs μ curve.

Equivalent test for the lower-sided case

Test statistic: Z0 = (x̄ − μ0)/(σ/√n).

Accept H0 if Z0 = (x̄ − μ0)/(σ/√n) ≥ −z(α), or reject H0 if Z0 = (x̄ − μ0)/(σ/√n) < −z(α).

Two-sided (two-tailed) Test of Hypothesis

For constructing a level-α test of H0: μ = μ0 against H1: μ ≠ μ0:
The rejection region is |x̄ − μ0| > c. The Type I error is

α = P[|x̄ − μ0| > c | H0 is true] = P[|x̄ − μ0| > c | μ0] = P[|(x̄ − μ0)/(σ/√n)| > c√n/σ]

so c = z(α/2)·σ/√n, and the rejection region is

|x̄ − μ0|/(σ/√n) > z(α/2)

Accept H0 if |x̄ − μ0| ≤ z(α/2)·σ/√n and reject it if |x̄ − μ0| > z(α/2)·σ/√n.

Or, equivalently, accept H0 if |Z0| = |x̄ − μ0|/(σ/√n) ≤ z(α/2) and reject it if |Z0| > z(α/2).

Power function: π(μ) = 1 − Φ((μ0 + c − μ)/(σ/√n)) + Φ((μ0 − c − μ)/(σ/√n))

Ex. μ0 = 0.15 and σ = 0.005. The population is normally distributed. A sample of size 10 gave
x̄ = 0.152. Conduct a 5% test and identify the rejection region. Draw the power
function.
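A sketch of the two-sided test for this example:

```python
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, xbar, alpha = 0.15, 0.005, 10, 0.152, 0.05
z0 = (xbar - mu0) / (sigma / sqrt(n))   # ~1.26
z_crit = norm.ppf(1 - alpha / 2)        # 1.96
p_value = 2 * (1 - norm.cdf(abs(z0)))   # two-tailed P-value

print(f"z0 = {z0:.3f}, critical = {z_crit:.2f}, P = {p_value:.3f}")
print("reject H0" if abs(z0) > z_crit else "fail to reject H0")
```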

Power function and sample size

To check whether the Type II error can be reduced: this is about increasing the power of the test
by increasing the sample size at a given α. That is,
Power of the test: π(μ0) = α and π(μ) = 1 − β

So,

n = σ²[z(α) + z(β)]²/(μ − μ0)²  (for lower- or upper-tailed tests)

For two-sided tests: n = σ²[z(α/2) + z(β)]²/(μ − μ0)²
OC Curve
β = f(n, α, δ). For a fixed value of α, there is a set of curves, usually called
operating characteristic (OC) curves. The vertical and horizontal axes are respectively
β and

d = δ/σ = |true mean μ1 − hypothesized mean μ0| / σ

OC curves are useful in determining the sample size (n) required to detect a specified
difference with a particular probability.

Ex. You wish to determine the sample size necessary to have a 0.90
probability of rejecting H0: μ = 16.0 if the true mean is 16.05, at a known/specified α
level. The population standard deviation is σ = 0.1.

Large-sample tests concerning the mean of an arbitrary distribution

The distribution is not known to be normal, but the sample size is large enough (n ≥ 30) to
apply the CLT. The mean and variance are unknown. So the standardized variable

Z = (x̄ − μ)/(σ/√n)

has an approximate standard normal distribution, N(0, 1). Estimate σ by s for such (large) n.

Approximate level-α one-sided and two-sided tests are based on the standardized
test statistic

Z = (x̄ − μ0)/(s/√n)

Alternate hypothesis | Rejection region
μ > μ0 | (x̄ − μ0)/(s/√n) > z(α)
μ < μ0 | (x̄ − μ0)/(s/√n) < −z(α)
μ ≠ μ0 | |x̄ − μ0|/(s/√n) > z(α/2)

Ex. n = 30, x̄ = 0.09, s = 0.03. If μ ≥ 0.10, the material cannot be used without additional
processing.
Now, H0: μ ≥ 0.10 and H1: μ < 0.10.

z_observed = (x̄ − μ0)/(s/√n) = (0.09 − 0.1)/(0.03/√30) = −1.83

Approximate P-value = P(Z ≤ z_observed) = P(Z ≤ −1.83) ≈ 0.03. Since P = 0.03 < 0.05, the
observed value is significant at the 5% level. Reject H0.
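The computation, as a sketch:

```python
from math import sqrt
from scipy.stats import norm

n, xbar, s, mu0 = 30, 0.09, 0.03, 0.10
z = (xbar - mu0) / (s / sqrt(n))   # ~ -1.83
p_value = norm.cdf(z)              # lower-tailed P-value

print(f"z = {z:.2f}, P = {p_value:.3f}")  # P < 0.05 -> reject H0
```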

Tests concerning the mean of a distribution with unknown variance

The variance is unknown and the sample size is small (< 30), so the CLT cannot be applied.
Assume the distribution is approximately normal. Use the t-distribution and set the Type I error:

T = (x̄ − μ)/(s/√n) ~ t(n−1)

α = P[x̄ > c | μ] = P[(x̄ − μ)/(s/√n) > (c − μ)/(s/√n)] = P[t(n−1) > (c − μ)/(s/√n)]

Control the Type I error using the t-distribution instead of the normal distribution.

For a two-sided test, reject H0 if |t0| > t(α/2, n−1), the upper α/2 percentage point of the
t-distribution with n − 1 degrees of freedom.
One-sided tests, unknown variance
For constructing a level-α lower-tailed test of H0: μ = μ0 against H1: μ < μ0, use the
standardized t-statistic

t0 = (x̄ − μ0)/(s/√n)

The rejection region is (x̄ − μ0)/(s/√n) < −t(α, n−1).

The power function is π(μ) = P[(x̄ − μ0)/(s/√n) < −t(α, n−1) | μ] = P[t′(n−1, δ) < −t(α, n−1)]

where t′(n−1, δ) has a noncentral t-distribution with noncentrality parameter δ = (μ − μ0)/(σ/√n).

In the upper one-sided case, that is H0: μ = μ0 against H1: μ > μ0, reject H0 if

(x̄ − μ0)/(s/√n) > t(α, n−1)

Ex. Say n = 6, x̄ = 0.17, and σ is unknown but s = 0.04. Test the hypothesis for a null
value μ0 = 0.2 at the 5% significance level.
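A sketch of this one-sample t-test (lower-tailed, H1: μ < 0.2 as in the earlier example):

```python
from math import sqrt
from scipy.stats import t

n, xbar, s, mu0, alpha = 6, 0.17, 0.04, 0.2, 0.05
t0 = (xbar - mu0) / (s / sqrt(n))   # ~ -1.84
t_crit = t.ppf(alpha, df=n - 1)     # lower-tail critical value ~ -2.015

print(f"t0 = {t0:.3f}, critical = {t_crit:.3f}")
print("reject H0" if t0 < t_crit else "fail to reject H0")
```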

Two-sided tests: variance unknown

One can use the confidence interval for the mean of the assumed normal distribution, using the
t-statistic discussed under interval estimation: see if the null value falls within the
interval (CI), in support of accepting H0.
Based on the standardized t-statistic t0 = (x̄ − μ0)/(s/√n) with n − 1 degrees of freedom,
the rejection regions are:

Alternate hypothesis | Rejection region
μ > μ0 | (x̄ − μ0)/(s/√n) > t(α, n−1)
μ < μ0 | (x̄ − μ0)/(s/√n) < −t(α, n−1)
μ ≠ μ0 | |x̄ − μ0|/(s/√n) > t(α/2, n−1)
Test of Hypothesis for the difference of means μ1 − μ2: variances known
X1 and X2 are RVs for a characteristic in two different populations. Are their means equal
or significantly different? Assume X1 and X2 are normally distributed with variances σ1² and σ2².

H0: μ1 − μ2 = δ0
with various possibilities for the alternative hypothesis, say
H1: μ1 − μ2 ≠ δ0, or μ1 − μ2 > δ0, or μ1 − μ2 < δ0.

Since X1 ~ N(μ1, σ1²) and X2 ~ N(μ2, σ2²), X1 − X2 is also a
normal RV, N(μ1 − μ2, σ1² + σ2²). The distribution of (x̄1 − x̄2) is

N(μ1 − μ2, σ1²/n1 + σ2²/n2)

Test statistic: z = [(x̄1 − x̄2) − δ0] / √(σ1²/n1 + σ2²/n2) ~ N(0, 1)


Decision rules for H0: μ1 − μ2 = δ0:

H1 | Rejection region
μ1 − μ2 > δ0 | (x̄1 − x̄2) > c = δ0 + z(α)·√(σ1²/n1 + σ2²/n2)
μ1 − μ2 < δ0 | (x̄1 − x̄2) < c = δ0 − z(α)·√(σ1²/n1 + σ2²/n2)
μ1 − μ2 ≠ δ0 | (x̄1 − x̄2) < c1 = δ0 − z(α/2)·√(σ1²/n1 + σ2²/n2), or (x̄1 − x̄2) > c2 = δ0 + z(α/2)·√(σ1²/n1 + σ2²/n2)

Ex. A new ball bearing is claimed to have lower frictional resistance under very
heavy loading conditions. 36 of the old pieces and 25 of the new are placed on test.
x̄_old = 52 and x̄_new = 44; σ²_old = 64 and σ²_new = 144. At the 10% significance level,
can you support the claim?

What about the Type II error? OC curve.
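A sketch of the two-sample z-test for this example:

```python
from math import sqrt
from scipy.stats import norm

x_old, x_new = 52, 44
var_old, var_new = 64, 144
n_old, n_new = 36, 25

se = sqrt(var_old / n_old + var_new / n_new)   # std. error of the difference
z = (x_old - x_new) / se                       # ~2.91
z_crit = norm.ppf(0.90)                        # one-sided, 10% level: 1.282

print(f"z = {z:.2f} vs z(0.10) = {z_crit:.3f}")
print("claim supported" if z > z_crit else "claim not supported")
```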

If the sample sizes are equal

Then, with n1 = n2 = n,

σ1²/n1 + σ2²/n2 = (σ1² + σ2²)/n

This implies a required sample size of

n = (σ1² + σ2²)[z(α) + z(β)]² / (δ0 − δ_alt)²

where δ_alt is the alternative value of μ1 − μ2 to be detected.

Ex.


Test of Hypothesis for the difference of means μ1 − μ2: variances unknown

For comparing a characteristic of two populations. Assume the RVs X1 and X2 are normally
distributed with the same/identical variance σ², and that the two samples are independent.
A pooled or uncorrelated t-test can be used. One can do a hypothesis test, a significance test,
or find a confidence interval.
H0: μ1 − μ2 = δ0
with various possibilities for the alternative hypothesis,
H1: μ1 − μ2 ≠ δ0, say, or μ1 − μ2 > δ0, or μ1 − μ2 < δ0.

Now take the samples (sizes n1 and n2) from both populations. The sample means are x̄1 and
x̄2, and the variances are s1² and s2².

The random variable (x̄1 − x̄2) is normal with mean (μ1 − μ2) and variance

V(x̄1 − x̄2) = σ²(1/n1 + 1/n2)

The standardized normal value for this RV is

z0 = [(x̄1 − x̄2) − (μ1 − μ2)] / [σ·√(1/n1 + 1/n2)]

Since σ² is unknown, it must be estimated from the samples by a pooled sample
variance. The estimated standard error is s(x̄1 − x̄2) = s_p·√(1/n1 + 1/n2).

The pooled estimator of σ² is denoted by s_p²:

s_p² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)

So the normal distribution can be replaced by the t-distribution with df = (n1 − 1) + (n2 − 1) = n1 + n2 − 2.
The (n1 − 1) and (n2 − 1) degrees of freedom come from the first and second samples
respectively.
The pooled t-test statistic is

t0 = [(x̄1 − x̄2) − (μ1 − μ2)] / [s_p·√(1/n1 + 1/n2)]

compared against t(α/2, n1+n2−2) for a two-sided test. This t-distribution is called the
reference distribution.

The most commonly encountered hypothesized value δ0 is zero.



Alternate hypothesis | Rejection region
μ1 − μ2 ≠ δ0 (two-tailed test)  | t0 > t(α/2, n1+n2−2) or t0 < −t(α/2, n1+n2−2)
μ1 − μ2 > δ0 (right-tailed test) | t0 > t(α, n1+n2−2)
μ1 − μ2 < δ0 (left-tailed test)  | t0 < −t(α, n1+n2−2)

Ex. A study of the tensile strength of ductile iron annealed at two
different temperatures is conducted. It is thought that the lower temperature will yield
the higher mean tensile strength. The data are:

1450°F: n1 = 10, x̄1 = 18,900, s1² = 1600
1650°F: n2 = 16, x̄2 = 17,500, s2² = 2500
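A sketch of the pooled t-test on these summary statistics (the 5% one-sided level is my assumption):

```python
from math import sqrt
from scipy.stats import t

n1, x1, v1 = 10, 18900, 1600    # 1450F
n2, x2, v2 = 16, 17500, 2500    # 1650F

sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)   # pooled variance
t0 = (x1 - x2) / sqrt(sp2 * (1 / n1 + 1 / n2))
t_crit = t.ppf(0.95, df=n1 + n2 - 2)                    # right-tailed test

print(f"t0 = {t0:.2f} vs t(0.05, {n1 + n2 - 2}) = {t_crit:.3f}")
```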

Comparing the means of 2 populations when variances are unknown & unequal

Pooling is inappropriate if differences are detected when the population variances are
compared. But it is possible to compare means using an approximate t-statistic.
The desired statistic is found by modifying the z random variable

z = [(x̄1 − x̄2) − (μ1 − μ2)] / √(σ1²/n1 + σ2²/n2)

Each variance is estimated separately and they are not combined. So the unequal-variance
test statistic is

t = [(x̄1 − x̄2) − (μ1 − μ2)] / √(s1²/n1 + s2²/n2)

This time, the number of df must be estimated from the data. One method for doing so is
the Smith-Satterthwaite procedure:

df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]

If this is not an integer, round down (not up) to take a conservative
approach. As the df of a t RV increases, the bell-shaped curve becomes more compact.


Ex. Two types of materials are used in producing a product. It is intended to compare
the strength of one to the other. Is material 1, on average, better able to withstand
a heavy load than material 2? The data are:

Material 1: n1 = 25, x̄1 = 380 lb, s1² = 100
Material 2: n2 = 16, x̄2 = 370 lb, s2² = 400

Write down the hypotheses and conduct the test to answer the question.
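A sketch of the unequal-variance (Smith-Satterthwaite) test for this example, at an assumed 5% one-sided level:

```python
from math import floor, sqrt
from scipy.stats import t

n1, x1, v1 = 25, 380, 100   # material 1
n2, x2, v2 = 16, 370, 400   # material 2

se2 = v1 / n1 + v2 / n2
t0 = (x1 - x2) / sqrt(se2)
# Smith-Satterthwaite df, rounded down (conservative)
df = floor(se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)))
t_crit = t.ppf(0.95, df)

print(f"t0 = {t0:.3f}, df = {df}, critical = {t_crit:.3f}")
```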
Tests comparing two means when data are paired
The random variables (X and Y) are not independent. Each observation in one sample
is naturally or by design paired with an observation in the other.

Population I: problem solved using the old method; sample x1, x2, …, xn; mean μx = ?
Population II: problem solved using the new method; sample y1, y2, …, yn; mean μy = ?
Paired sample of size n; μx − μy = ?

New RV: D = X − Y, of sample size n.

μx − μy = E[X] − E[Y] = E[D] = μD

H0: μX = μY, i.e., μD = 0, and H1: μX > μY. The t-test statistic is

t = (D̄ − 0)/(s_d/√n), where df = n − 1.
Ex. Data for 10 programs:

Old (x): 8.05, 24.74, 28.33, 8.45, 9.19, 25.20, 14.05, 20.33, 4.82, 8.54
New (y): 0.71, 0.74, 0.74, 0.77, 0.80, 0.83, 0.82, 0.77, 0.71, 0.72
difference, d = x − y
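The paired t-test on these data, as a sketch using scipy:

```python
from scipy import stats

old = [8.05, 24.74, 28.33, 8.45, 9.19, 25.20, 14.05, 20.33, 4.82, 8.54]
new = [0.71, 0.74, 0.74, 0.77, 0.80, 0.83, 0.82, 0.77, 0.71, 0.72]

t0, p_two = stats.ttest_rel(old, new)           # paired t-test, df = n - 1
p_one = p_two / 2 if t0 > 0 else 1 - p_two / 2  # one-sided H1: mean(old) > mean(new)

print(f"t0 = {t0:.3f}, one-sided P = {p_one:.5f}")
```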


Tests of Hypotheses concerning the variances of 2 normal distributions: F-distribution

Comparison of variability is necessary in process engineering (chemical processes,
health care processes, manufacturing processes). Procedures for tests on variances are
sensitive to the normality assumption.
The test is based on the chi-square distribution (reference distribution):

χ²(n−1) = (n − 1)s²/σ²

Let s1² and s2² be the sample variances corresponding to random samples of sizes n1 and n2
taken from two independent normal distributions. Then

F0 = {[(n1 − 1)s1²/σ1²]/(n1 − 1)} / {[(n2 − 1)s2²/σ2²]/(n2 − 1)} = (s1²/σ1²)/(s2²/σ2²) ~ F(n1 − 1, n2 − 1)

If σ1² = σ2², then the test statistic is the ratio of the sample variances, F0 = s1²/s2².

Sample variances are unbiased estimators of the population variances.


Null hypothesis H0: σ1² = σ2²
Alternate hypothesis H1: σ1² ≠ σ2²
Rejection region: F0 > F(α/2, n1−1, n2−1) (upper α/2 point) or
F0 < F(1−α/2, n1−1, n2−1) (lower point); numerator df = n1 − 1, denominator df = n2 − 1.

Indeed, F(1−α/2, n1−1, n2−1) = 1/F(α/2, n2−1, n1−1).
Accept H0 if F(1−α/2, n1−1, n2−1) ≤ F0 ≤ F(α/2, n1−1, n2−1).

For one-sided alternative hypotheses:

Alternate hypothesis | Test statistic and rejection region
σ1² > σ2² | F0 = s1²/s2² > F(α, n1−1, n2−1)
σ1² < σ2² | F0 = s2²/s1² > F(α, n2−1, n1−1)

Critical values are obtained from the F table.


Ex. A chemical process. The test is on the inherent variability of two processes.
n1 = 12, n2 = 10, s1² = 14.5, s2² = 10.8.
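A sketch of the two-sided F-test on these summary statistics, at an assumed 5% level:

```python
from scipy.stats import f

n1, n2 = 12, 10
s1_sq, s2_sq = 14.5, 10.8
alpha = 0.05

F0 = s1_sq / s2_sq                              # ~1.34
lower = f.ppf(alpha / 2, n1 - 1, n2 - 1)
upper = f.ppf(1 - alpha / 2, n1 - 1, n2 - 1)

print(f"F0 = {F0:.2f}, acceptance region = [{lower:.2f}, {upper:.2f}]")
print("reject H0" if (F0 < lower or F0 > upper) else "fail to reject H0")
```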

Test concerning the population proportion p of a binomial distribution

Applied in quality control, hospital management, politics, social sciences, etc.
Let the data come from a random sample of size n taken from a Bernoulli distribution.
If n is large (np > 5 if p ≤ 0.5, or n(1 − p) > 5 if p > 0.5), the normal distribution forms
an excellent approximation to the binomial, so the normal distribution can be used in testing the hypotheses.

p̂ = x/n (x is the number of success events); if p is not near zero (0) or 1, p̂ is approximately
normal with μ_p̂ = p and σ_p̂ = √(p(1 − p)/n). Then testing a single proportion is similar to
testing a single mean.
Null hypothesis H0: p = p0
Alternate hypothesis H1: p > p0. Set a cutoff point, say x > c. Then accept H0 if x ≤ c,
or reject H0 if x > c.
The power of the test is

π(p) = P[X > c | p] = Σ (x = c+1 to n) b(x; n, p) = 1 − B(c; n, p)

Test of a population proportion (large sample)

H0: p = p0 against H1: p > p0. In this case the normal approximation can be used, with p̂ = x/n.

p̂ > p0 provides evidence in favor of H1, so the rejection region is

p̂ > p0 + z(α)·√(p0(1 − p0)/n), i.e., z0 = (p̂ − p0)/√(p0(1 − p0)/n) > z(α)

Accept H0 if z0 ≤ z(α) and reject it if z0 > z(α).

β(p) = Φ((c − p)/√(p(1 − p)/n)) and π(p) = 1 − Φ((c − p)/√(p(1 − p)/n))

Alternate hypothesis | Rejection region
p > p0 | (p̂ − p0)/√(p0(1 − p0)/n) > z(α)
p < p0 | (p̂ − p0)/√(p0(1 − p0)/n) < −z(α)
p ≠ p0 | |p̂ − p0|/√(p0(1 − p0)/n) > z(α/2)

Ex. It is claimed that 90% of torsion springs will survive beyond the accepted maximum standard of
performance. Investigation revealed that 168 springs in a sample of 200 exceed that
standard. Is the claim valid at the 5% significance level?
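A sketch of the large-sample proportion test for the springs example (lower-tailed, since 168/200 = 0.84 falls below the claimed 0.90):

```python
from math import sqrt
from scipy.stats import norm

n, x, p0 = 200, 168, 0.90
p_hat = x / n                                   # 0.84
z0 = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)     # ~ -2.83
p_value = norm.cdf(z0)                          # lower-tailed: H1: p < 0.9

print(f"z0 = {z0:.2f}, P = {p_value:.4f}")      # small P -> claim not supported
```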

Test of a population proportion (small sample)

Say n = 10 and α = 5%.
The rejection region is x > c. So, α(p0) = P(X > c | p0) = 1 − B(c; n, p0).

Tests of hypotheses concerning two binomial distributions (large samples)

Take two independent samples of sizes n1 and n2. The sample proportions p̂1 and p̂2 are
independent and approximately normally distributed, so p̂1 − p̂2 is also approximately
normal, with

E(p̂1 − p̂2) = p1 − p2 and V(p̂1 − p̂2) = p1(1 − p1)/n1 + p2(1 − p2)/n2

estimated by

s(p̂1 − p̂2) = √[p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2]

You can use the confidence-interval approach to test hypotheses with the 100(1 − α)% interval (L, U),
where

L = (p̂1 − p̂2) − z(α/2)·s(p̂1 − p̂2) and
U = (p̂1 − p̂2) + z(α/2)·s(p̂1 − p̂2).

Reject H0 if zero is not in the CI = (L, U); otherwise accept H0.

In terms of a test statistic,

Z = [(p̂1 − p̂2) − (p1 − p2)] / √[p1(1 − p1)/n1 + p2(1 − p2)/n2], where p̂1 = x1/n1 and p̂2 = x2/n2.

If H0 is true, then p1 = p2 = p, estimated by the pooled proportion p̂ = (x1 + x2)/(n1 + n2),
and the test statistic is

Z = (p̂1 − p̂2) / √[p̂(1 − p̂)(1/n1 + 1/n2)] ~ N(0, 1)

Alternate hypothesis | Rejection region
p1 > p2 | (p̂1 − p̂2)/s(p̂1 − p̂2) > z(α)
p1 < p2 | (p̂1 − p̂2)/s(p̂1 − p̂2) < −z(α)
p1 ≠ p2 | |p̂1 − p̂2|/s(p̂1 − p̂2) > z(α/2)
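A sketch of the two-proportion z-test; the counts below are hypothetical, just to show the mechanics:

```python
from math import sqrt
from scipy.stats import norm

x1, n1, x2, n2 = 48, 200, 30, 180   # hypothetical success counts

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)      # pooled estimate under H0: p1 = p2
z = (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided

print(f"z = {z:.3f}, P = {p_value:.4f}")
```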


Design of Experiments for Factor Selection

A powerful technique for improving strength, quality, productivity,
production rate, etc. in manufacturing and other engineering fields. Through
experimentation, changes are introduced into the variables/inputs of a process or system
(including noise) to observe their effect on performance. It is the
most efficient statistical method for optimizing changes. Other approaches are
ad-hoc observation or heuristics (not so reliable).
Experimental Design
Engineering (product and/or process) design generally involves several
variables/factors. A test or series of tests, called experiments, is performed on the
input variables/factors to observe and identify the reasons for changes in the
output variable/response.
Experimental design is an active statistical method. Experimentation plays an
important role in product realization activities, consisting of product design &
development, process development, and process improvement, or designing
a robust/insensitive (affected minimally by external sources of variation)
process. It is a vital part of the engineering/scientific method. Well-designed
experiments develop models that are called empirical models.
Experiments are done to study the performance of processes and systems.
An experiment is done by a person called the experimenter.
Variables/factors in a process or system can be broadly classified as
controllable and uncontrollable (limited-control) factors, as depicted in the
figure below. Process or system variables that can be manipulated by the
experimenter are called controllable variables, x1, x2, …; the observable
variable(s) or output of interest is y (or y1, y2, …); other variables are
uncontrollable/natural.
Knowing these variables, you can try to:
i. Optimize the choice of values of input variables in terms of maximizing
or minimizing the output, y.
ii. Find those variables that have a significant effect on y (then
control them to reduce variation in y).


Figure: General model of a process or system. Inputs (raw materials, components,
subassemblies, etc.) enter the process; controllable factors x1, x2, …, xn act on it,
along with unavoidable noise or uncontrollable factors; y is the output.
Now, signal factors are set by the designer or operator
to obtain the intended value of the response variable. Noise factors are not
controlled, or are very expensive or difficult to control. Both are summarized by a
single measure known as the signal-to-noise (S/N) ratio. Its
mathematical expression depends on the type of orientation (e.g., larger-is-better or
smaller-is-better).
Goals and Objectives of experimental analysis
Goals:
Determine the input variables and their magnitudes that influence the
response, y.
Determine the levels for these variables.
Determine how to manipulate these variables to control the response.
Objectives may be noted as:
Determine which factors are most influential on the response, y.
Locate where to set the influential x's so that y is near the nominal
requirement.
Determine where to set the influential x's so that variability in y is small or
minimal.
Locate where to set the influential x's so that the effects of
uncontrollable/noise variables are minimal.
In manufacturing, experimental design techniques applied early in process
design and development can result in:
1. improved yield
2. reduced variability and closer conformance to nominal or desired output
3. reduced development time
4. reduced overall costs

Guidelines for Designing Experiments

1. Identification of the problem(s) and clear problem statements
2. Scanning and choice of factors and their levels
3. Deciding the output or response variable
4. Choice of experimental design
5. Performing the experiment
6. Data collection and analysis
7. Results, conclusions and recommendations (act on them)

Steps 1 & 2 constitute pre-experimental planning. Steps 2 & 3 can be done
simultaneously, or in reverse order.
What you need to consider, know, determine or set:
Factors (classify them into signal and noise factors)
Levels of each factor
Number of factors in consideration
Output or response
Hypotheses and other relevant issues on testing
Signal-to-noise ratio
Confirmation test

Experimental Approaches
1. One can combine factors arbitrarily, then test them and see the results.
This strategy is called the best-guess approach.
2. One-factor-at-a-time approach. Select a starting point (a baseline set of
levels) for each factor, then change the level of one factor while keeping
all others at their baseline levels. Do the test and record the response. But
this method cannot detect interaction (of factors) effects.


Figure: One-factor-at-a-time approach (no interaction). Response plotted against
factor x1 (low to high) and against factor x2 (low to high).


Table: Effects and data for a one-factor-at-a-time experiment

TC | A | B | C | D | … | H | Response y
1  | 1 | 1 | 1 | 1 | … | 1 | y0 (benchmark or ref. value)
2  | 2 | 1 | 1 | 1 | … | 1 | yA
3  | 1 | 2 | 1 | 1 | … | 1 | yB
4  | 1 | 1 | 2 | 1 | … | 1 | yC
5  | 1 | 1 | 1 | 2 | … | 1 | yD
…  |   |   |   |   |   |   | …
h  | 1 | 1 | 1 | 1 | … | 2 | yh

TC = treatment conditions; 1 = base level, 2 = changed level

Effects: eA = yA − y0 (for changing factor A from level 1 to level 2),
eB = yB − y0, and so on. Which effect is strongest, and what is the
direction of each effect (+ve or −ve)?


Ex. Researchers are studying the effects of tire pressure, gas type, oil type,
and vehicle speed on gas mileage. What combination of factors provides the
best gas mileage? The data are:

Factor        | Level 1   | Level 2
Tire pressure | 28 psi    | 35 psi
Speed         | 55 mph    | 65 mph
Oil           | 30 weight | 40 weight
Gas           | Regular   | Premium

Response variable: gas mileage

3. Factorial approach: the correct approach. Factors are varied together.
The simplest way is to set two levels of each factor, so for k factors the number
of experiments is 2^k. The experimenter can see the main effects
(the effect of each factor separately) and the interaction effects (results that
arise when 2 or more factors are changed together). In the case of a
full factorial for 2 factors at two levels, there are four (2²) possible
runs (TCs).

How a factorial experiment is conducted

Say there are two factors, both at two levels. The four runs (TCs)
form the corners of a square. This is a 2² factorial design. One can see main and
interaction effects.

Figure: Factorial experiment approach. Response plotted against factor x1 (low to
high) and against factors x1 and x2 together (low to high). When an interaction is
noticed (left figure), the main effects (vertical and horizontal arrows) and
interaction effects (diagonal arrows) can be computed and compared.

In the case of a three-factor factorial experiment, the 2³ = 8 test combinations of the
three factors, each at two levels, can be represented geometrically as the
corners of a cube. This is called a 2³ factorial design. If there are 4 factors, 2 with 2
levels and 2 with 3 levels, the number of combinations is 2² × 3² = 36.
Table: Signs for effects and data for a three-factor experimental design
(− is the base level, + is the changed level)

TC | A | B | C | AB | AC | BC | ABC | Response Y
1  | − | − | − | +  | +  | +  | −   | y11, y12, …, y1n
2  | + | − | − | −  | −  | +  | +   | y21, y22, …, y2n
3  | − | + | − | −  | +  | −  | +   | y31, y32, …, y3n
4  | + | + | − | +  | −  | −  | −   | y41, y42, …, y4n
5  | − | − | + | +  | −  | −  | +   | y51, y52, …, y5n
6  | + | − | + | −  | +  | −  | −   | y61, y62, …, y6n
7  | − | + | + | −  | −  | +  | −   | y71, y72, …, y7n
8  | + | + | + | +  | +  | +  | +   | y81, y82, …, y8n

Analysis of Variance for a single-factor experiment

Let the number of treatments or different levels be a. Say the experiment is
replicated n times. The observed response from each of the a treatments is a RV.
The number of observations can be different for each factor level. In the data
layout table below, each row is one factor level:

Treatment (factor level) | Observations        | Totals | Averages
1                        | y11, y12, …, y1n    | y1.    | ȳ1.
2                        | y21, y22, …, y2n    | y2.    | ȳ2.
…                        | …                   | …      | …
a                        | ya1, ya2, …, yan    | ya.    | ȳa.
                         |                     | y..    | ȳ..

An observation of the experiment can be written as

yij = μ + τi + εij,  i = 1, 2, …, a;  j = 1, 2, …, n

where yij = jth observation at the ith factor level. The overall mean μ is common to all
treatments. The treatment effect particular to the ith treatment is τi, and εij is the
random error component.
27

Random errors are independently and normally distributed with mean zero
and variance σ². So each treatment generates a normal population with
mean μ + τi and variance σ². Thus there are a population means. This is a
completely randomized experimental design and the respective analysis is a
fixed-effects-model ANOVA. ANOVA is applied to test for equality of
treatment effects. The treatment effects (deviations from the common
mean μ) satisfy

Σ (i = 1 to a) τi = 0

Say yi. is the total of the observations under the ith treatment:

yi. = Σ (j = 1 to n) yij, and its average is ȳi. = yi./n

The grand total and grand mean of all observations are respectively

y.. = Σi Σj yij and ȳ.. = y../N, where N = an (total no. of observations).

The hypotheses are

H0: τ1 = τ2 = … = τa = 0 and H1: τi ≠ 0 for at least one i.

If the null hypothesis is true, changing the level of the factor has no effect on the
mean response.
The total variability in the data, i.e., the total sum of squares, is

SST = Σi Σj (yij − ȳ..)² = Σi Σj y²ij − y²../N

ANOVA partitions SST through the sum-of-squares identity

Σi Σj (yij − ȳ..)² = n Σi (ȳi. − ȳ..)² + Σi Σj (yij − ȳi.)²

That is, SST splits into the difference between treatments, the treatment sum of squares

SS_treatments = n Σi (ȳi. − ȳ..)² = Σi y²i./n − y²../N

and the difference between observations within a treatment and the treatment
mean, the random error or error sum of squares

SSE = Σi Σj (yij − ȳi.)², or SSE = SST − SS_treatments
Now we need to examine the expected values of SS_treatments and SSE. This
leads us to the appropriate statistic for testing the null hypothesis.
The expected value of SS_treatments is

E{SS_treatments} = (a − 1)σ² + n Σi τi²

If the null hypothesis is true, each τi is equal to zero and

E{SS_treatments} = (a − 1)σ²

If the alternate hypothesis is true,

E{SS_treatments/(a − 1)} = σ² + n Σi τi²/(a − 1)

The ratio MS_treatments = SS_treatments/(a − 1) is called the mean square for treatments. If H0
is true, MS_treatments is an unbiased estimator of σ². If H0 is not accepted,
MS_treatments estimates σ² plus a positive quantity that comes from variation due to
systematic differences in treatment means.
Similarly, the expected value of SSE is

E{SSE} = a(n − 1)σ², and MSE = SSE/(N − a) = SSE/[a(n − 1)] is a pooled
estimate of the common variance within each of the a treatments. It is an unbiased
estimator of σ² regardless of whether H0 is true or not.
SST is a sum of squares in normally distributed random variables, and SST/σ² is distributed as
chi-square with df = an − 1. SSE/σ² and SS_treatments/σ² are also distributed as
chi-square, with df = a(n − 1) and df = a − 1 respectively. W. G. Cochran's theorem
is used to establish the independence of SSE and SS_treatments. It states: for Zl ~ N(0, 1),
l = 1, 2, …, ν, with

Σ (l = 1 to ν) Z²l = Q1 + Q2 + … + Qs, where s ≤ ν and Qk has νk df (k = 1, 2, …, s),

Q1, Q2, …, Qs are independent chi-square RVs with df ν1, ν2, …, νs
respectively, if and only if ν = ν1 + ν2 + … + νs.
Then the test statistic, or observed F value, is

F0 = MS_treatments/MSE = [SS_treatments/(a − 1)] / [SSE/(a(n − 1))]

If the null hypothesis is true, numerator and denominator have equal expectations and F0 ≈ 1.
If the alternate hypothesis is true, the numerator tends to be larger than the
denominator. In other words, we should reject H0 for large values of F0. This
also implies that the test is appropriate at the upper tail. So reject H0 if

F0 > F(α, a − 1, a(n − 1))
Now you can prepare a summary ANOVA table for a single-factor experiment:

Source of variation | SS            | df       | MS            | F0
Treatments          | SS_treatments | a − 1    | MS_treatments | F0 = MS_treatments/MSE
Error               | SSE           | a(n − 1) | MSE           |
Total               | SST           | an − 1   |               |

Ex. Plasma etching experiment

In integrated circuit manufacturing, wafers are completely coated with a layer
of material such as silicon or a metal. The unwanted material is then
selectively removed by etching through a mask, thereby creating circuit
patterns, electrical interconnects, and areas in which diffusions or metal
depositions are to be made. A plasma etching process is used. Energy is
supplied by a radio-frequency (RF) generator and causes plasma to be
generated in the gap between the electrodes. An engineer is interested in
investigating the relationship between the RF power setting and the etch rate for
a single-wafer plasma etching tool. The engineer runs a completely
randomized experiment with four levels of RF power and five replicates. The
data are given in the table below:

RF power (W) | Observed etch rate      | Totals yi. | Averages ȳi.
160          | 575, 542, 530, 539, 570 | 2756       | 551.2
180          | 565, 593, 590, 579, 610 | 2937       | 587.4
200          | 600, 651, 610, 637, 629 | 3127       | 625.4
220          | 725, 700, 715, 685, 710 | 3535       | 707.0
             |                         | y.. = 12355 | ȳ.. = 617.75

Conduct a test of whether the mean values at all 4 levels are equal against the alternative that some mean
is unequal. Show the results on an F-distribution diagram.
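The one-way ANOVA for these data can be checked with scipy.stats.f_oneway; a sketch:

```python
from scipy.stats import f_oneway

etch_160 = [575, 542, 530, 539, 570]
etch_180 = [565, 593, 590, 579, 610]
etch_200 = [600, 651, 610, 637, 629]
etch_220 = [725, 700, 715, 685, 710]

F0, p_value = f_oneway(etch_160, etch_180, etch_200, etch_220)
print(f"F0 = {F0:.2f}, P = {p_value:.3g}")   # tiny P -> reject equal means
```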
For Two Factors
An observation of the experiment can be written as:

yijk = μ + Ai + Bj + (AB)ij + εijk

No. of levels for factor A = a
No. of levels for factor B = b
No. of observations in each cell (replicates) = n
Total no. of observations N = nab

Data layout:

Factor A \ Factor B | 1                  | 2                  | … | b
1                   | y111, y112, …, y11n | y121, y122, …, y12n | … | y1b1, y1b2, …, y1bn
2                   | y211, y212, …, y21n | y221, y222, …, y22n | … | y2b1, y2b2, …, y2bn
…                   | …                  | …                  | … | …
a                   | ya11, ya12, …, ya1n | ya21, ya22, …, ya2n | … | yab1, yab2, …, yabn

With cell totals yij., row totals yi.., column totals y.j., and grand total y...:

SST = Σi Σj Σk y²ijk − y².../N

SST = SSA + SSB + SSAB + SSE, where

SSA = Σi y²i../(bn) − y².../N
SSB = Σj y².j./(an) − y".../N (i.e., Σj y².j./(an) − y².../N)
SSAB = Σi Σj y²ij./n − y".../N − SSA − SSB (i.e., Σi Σj y²ij./n − y².../N − SSA − SSB)

Then SSE = SST − SSA − SSB − SSAB.

dfA = a − 1, dfB = b − 1, dfAB = (a − 1)(b − 1), dfE = N − ab

MSA = SSA/(a − 1); MSB = SSB/(b − 1); MSAB = SSAB/dfAB; MSE = SSE/(N − ab)
ANOVA table for a two-factor experiment:

Source of variation | SS   | df             | MS   | F0       | F critical (from table) | Significant (yes/no)
A                   | SSA  | a − 1          | MSA  | MSA/MSE  |                         |
B                   | SSB  | b − 1          | MSB  | MSB/MSE  |                         |
AB                  | SSAB | (a − 1)(b − 1) | MSAB | MSAB/MSE |                         |
Error               | SSE  | N − ab         | MSE  |          |                         |
Total               | SST  | abn − 1        |      |          |                         |

Orthogonal Arrays (OA)

Orthogonal means that the experimental design is balanced. It is an efficient
method.
Orthogonal arrays were developed by R. A. Fisher (England) in the 1930s; Taguchi
added three OAs to the list in 1956, and then others added another three. It is
a simplified method of putting together an experiment. It can handle dummy
factors and can be modified.
Table: OA8 (8 is the no. of rows, or no. of treatment conditions (TC), and df.
Columns 1-7 hold up to 7 factors at 2 levels each (1 and 2, or − and +). For more
levels use arrays with 3, 4, or 5 levels.)

TC | 1 | 2 | 3 | 4 | 5 | 6 | 7
1  | 1 | 1 | 1 | 1 | 1 | 1 | 1
2  | 1 | 1 | 1 | 2 | 2 | 2 | 2
3  | 1 | 2 | 2 | 1 | 1 | 2 | 2
4  | 1 | 2 | 2 | 2 | 2 | 1 | 1
5  | 2 | 1 | 2 | 1 | 2 | 1 | 2
6  | 2 | 1 | 2 | 2 | 1 | 2 | 1
7  | 2 | 2 | 1 | 1 | 2 | 2 | 1
8  | 2 | 2 | 1 | 2 | 1 | 1 | 2

The orthogonal property of an OA is not compromised by changing the rows or
the columns. Taguchi changed the rows so that TC 1 is composed of all level 1s
and could thereby represent the existing conditions. Also, columns were
switched so that the least amount of change occurs in the columns on the left. This
arrangement provides the capability to assign factors with long setup
times to those columns.
Procedure:
1. Define the number of factors and their levels.
2. Determine the degrees of freedom.
3. Select an orthogonal array.
4. Consider whether there are any interactions.
Step 2: Set degrees of freedom:
This determines the minimum number of treatment conditions. It is equal to the
sum of
[no. of factors × (no. of levels − 1)] +
[no. of interactions × (no. of levels − 1) × (no. of levels − 1)] + 1 (one for the
average).

Ex. A 2-level, 4-factor experiment with factors A, B, C, and D; the expected interactions are BC and
CD. Determine df. Assume that interactions AB, AC, AD, BD, ABC, ABD,
ACD, BCD, and ABCD do not occur.
df = 4(2 − 1) + 2(2 − 1)(2 − 1) + 1 = 7
For 3 levels:
df = 4(3 − 1) + 2(3 − 1)(3 − 1) + 1 = 17
So the number of levels has a considerable influence on the number of TCs. But cost!
The maximum degrees of freedom, or number of TCs, is df = l^f (l = no. of levels, f = no. of factors).
Ex. l = 2, f = 4: max. df = 2⁴ = 16.
Table: Max. df for a 4-factor, 2-level experimental design

Design space            | df
A, B, C, D              | 4
AB, AC, AD, BC, BD, CD  | 6
ABC, ABD, ACD, BCD      | 4
ABCD                    | 1
Average                 | 1
Sum                     | 16

Step 3: Selecting an OA
The no. of TCs = the no. of rows in an OA, which must be ≥ df. OAs are
available up to OA36 and beyond (see table).

OA    | No. of rows | Max. no. of factors | Max. 2-level columns | Max. 3-level columns | Max. 4-level columns | Max. 5-level columns
OA4   | 4           | 3                   | 3                    | -                    | -                    | -
OA8   | 8           | 7                   | 7                    | -                    | -                    | -
OA9   | 9           | 4                   | -                    | 4                    | -                    | -
OA12  | 12          | 11                  | 11                   | -                    | -                    | -
OA16  | 16          | 15                  | 15                   | -                    | -                    | -
OA16′ | 16          | 5                   | -                    | -                    | 5                    | -
OA18  | 18          | 8                   | 1                    | 7                    | -                    | -
OA25  | 25          | 6                   | -                    | -                    | -                    | 6
OA27  | 27          | 13                  | -                    | 13                   | -                    | -
OA32  | 32          | 31                  | 31                   | -                    | -                    | -
OA32′ | 32          | 10                  | 1                    | -                    | 9                    | -
OA36  | 36          | 23                  | 11                   | 12                   | -                    | -
OA36′ | 36          | 16                  | 3                    | 13                   | -                    | -
…     | …           | …                   | …                    | …                    | …                    | …

So, if df = 13, the next available OA is OA16. Note the geometric progression for the
2-level arrays OA4, OA8, OA16, OA32, …, which is 2², 2³, 2⁴, 2⁵, …, and for the
3-level arrays OA9, OA27, OA81, …, which is 3², 3³, 3⁴, …. OAs can be modified.

Interaction table
Try to avoid confounding (the inability to distinguish the effects of one
factor from another factor and/or an interaction). Find which column to
use for each factor via an interaction table.
Table: OA8 (reprinted from above)

TC | 1 | 2 | 3 | 4 | 5 | 6 | 7
1  | 1 | 1 | 1 | 1 | 1 | 1 | 1
2  | 1 | 1 | 1 | 2 | 2 | 2 | 2
3  | 1 | 2 | 2 | 1 | 1 | 2 | 2
4  | 1 | 2 | 2 | 2 | 2 | 1 | 1
5  | 2 | 1 | 2 | 1 | 2 | 1 | 2
6  | 2 | 1 | 2 | 2 | 1 | 2 | 1
7  | 2 | 2 | 1 | 1 | 2 | 2 | 1
8  | 2 | 2 | 1 | 2 | 1 | 1 | 2

Table: Interaction table for OA8 (each entry is the column in which the interaction
of the two intersecting columns appears)

Column | 2 | 3 | 4 | 5 | 6 | 7
(1)    | 3 | 2 | 5 | 4 | 7 | 6
(2)    |   | 1 | 6 | 7 | 4 | 5
(3)    |   |   | 7 | 6 | 5 | 4
(4)    |   |   |   | 1 | 2 | 3
(5)    |   |   |   |   | 3 | 2
(6)    |   |   |   |   |   | 1


Assume that factor A is assigned to column 1 and factor B to column 2. If
there is an interaction between these two factors, then column 3 is reserved for
the interaction AB. Say factor C is assigned to column 4. For an interaction
between factor A (column 1) and factor C (column 4), the interaction AC
will appear in column 5.
The columns reserved for interactions are used so that calculations
can be made to determine whether there is a strong interaction. If there are no
interactions, all the columns can be used for factors. The actual
experiment is conducted using the columns designated for the factors, which
compose the design matrix. All the columns together are referred to as the
design space.
Control factors and noise factors
A team of the relevant people decides on these types of factors and their
levels. Noise factors are normally uncontrollable or expensive to control.
Levels must also be chosen for them.
Inner array OA9 for the control factors A, B, C, D (3 levels each), crossed with an
outer array OA4 for the noise factors K, L, M (2 levels each):

OA9 (control factors):
TC | A | B | C | D
1  | 1 | 1 | 1 | 1
2  | 1 | 2 | 2 | 2
3  | 1 | 3 | 3 | 3
4  | 2 | 1 | 2 | 3
5  | 2 | 2 | 3 | 1
6  | 2 | 3 | 1 | 2
7  | 3 | 1 | 3 | 2
8  | 3 | 2 | 1 | 3
9  | 3 | 3 | 2 | 1

OA4 (noise factors):
Run | K | L | M
1   | 1 | 1 | 1
2   | 1 | 2 | 2
3   | 2 | 1 | 2
4   | 2 | 2 | 1

Nine treatment conditions are used for the controllable factors (OA9) and an
OA4 is used for the noise factors. Four runs are made under TC 1 (all level 1s),
one at each of the noise TCs; the results are y1, y2, y3 and y4. Repeat for all 9
TCs (giving y33, …, y36 for TC 9).
Ex. A design team identified high nonconformity, and an average scrap cost
far exceeding expectations, in an air filter injection process. They
sorted out the factors involved, related to materials, machines, mould
(method), people, conditions, and environment, and prepared a cause-and-effect
diagram and a correlation table between defect causes and phenomena.
The defect phenomena considered were: felt explosion, material shortage,
filter paper fiber explosion, broken paper, glue non-penetration, and non-flange
fabric. The causes marked (X) against these phenomena included: folding paper;
different altitude for cutting; incomplete cut-off in the cut line; different
altitude for folding paper; different degree of color of background; mould
precision (big gear crevice); uncontrolled mould temperature; precision of the
injection machine; inaccurate amount of glue; and injection conditions
(pressure, speed, temperature).

A total of 11 control factors were listed, and their levels were determined:

Factor                 | Level 1 | Level 2 | Level 3
A. Injection time      | 10      | 18      | -
B. Nozzle temperature  | 195     | 200     | 215
C. Tube temperature    | 190     | 205     | 215
D. Cooling time        | 15      | 20      | 25
E. Injection pressure  | 59      | 64      | 69
F. Injection speed     | 60      | 70      | 80
G. Charge              | 1040    | 1070    | 1100
H. Mould temperature   | 30      | 40      | 50
I. Charge speed        | 80      | 90      | -
J. Charge pressure     | 70      | 80      | 90
K. Suck-back speed     | 30      | 40      | 50

For each experiment, 8 sets of control factors are considered: one set at level 2 and
7 sets at level 3.
The average defect rate in the manufacturing process for air cleaners from
Jan 2008 to Sept 2009 was 20184 ppm. From Oct 2009 to Sept 2010, it
decreased to 4017 ppm after implementation of the recommendations of the
Taguchi method. How was the method applied?
Taguchi emphasized looking for the strong few interactions and neglecting the trivial
ones (not concentrating on all of them). An orthogonal array OA9 (nine rows or
TCs) with levels 1 (low), 2 (medium) and 3 (high) is given below:
OA9 (for controllable variables)

Run | A | B | C | D
1   | 1 | 1 | 1 | 1
2   | 1 | 2 | 2 | 2
3   | 1 | 3 | 3 | 3
4   | 2 | 1 | 2 | 3
5   | 2 | 2 | 3 | 1
6   | 2 | 3 | 1 | 2
7   | 3 | 1 | 3 | 2
8   | 3 | 2 | 1 | 3
9   | 3 | 3 | 2 | 1
OA8 (3 noise variables and 4 interactions, each at 2 levels)

Run | E | F | EF | G | EG | FG | EFG
1   | 1 | 1 | 1  | 1 | 1  | 1  | 1
2   | 1 | 1 | 1  | 2 | 2  | 2  | 2
3   | 1 | 2 | 2  | 1 | 1  | 2  | 2
4   | 1 | 2 | 2  | 2 | 2  | 1  | 1
5   | 2 | 1 | 2  | 1 | 2  | 1  | 2
6   | 2 | 1 | 2  | 2 | 1  | 2  | 1
7   | 2 | 2 | 1  | 1 | 2  | 2  | 1
8   | 2 | 2 | 1  | 2 | 1  | 1  | 2
Controllable and uncontrollable (noise) variables should be identified while
designing the experiment. The controllable variables form an inner array (in this
case 3 levels: low, medium, high) and the noise variables form an outer array (in
this case two levels: low and high).

Outer array OA8 for the noise factors:
E: 1 1 1 1 2 2 2 2
F: 1 1 2 2 1 1 2 2
G: 1 2 1 2 1 2 1 2

Inner array OA9 for the control factors, with the eight response values per run:

Run | A | B | C | D | Values of the response variable, y | Row-wise average ȳ | S/N_L
1   | 1 | 1 | 1 | 1 | 76, 67, 87, 78, 62, 65, 73, 67     | 71.875             | 36.99
2   | 1 | 2 | 2 | 2 | 81, 52, 69, 53, 57, 76, 68, 62     | 64.75              | 35.92
3   | 1 | 3 | 3 | 3 | 78, 68, 72, 61, 54, 81, 69, 75     | 69.75              | 36.66
4   | 2 | 1 | 2 | 3 | 72, 74, 51, 77, 67, 89, 88, 85     | 75.375             | 37.14
5   | 2 | 2 | 3 | 1 | 69, 81, 66, 75, 71, 82, 81, 92     | 77.125             | 37.61
6   | 2 | 3 | 1 | 2 | 56, 62, 93, 83, 70, 69, 63, 97     | 74.125             | 36.96
7   | 3 | 1 | 3 | 2 | 72, 71, 71, 90, 86, 71, 67, 85     | 76.625             | 37.55
8   | 3 | 2 | 1 | 3 | 87, 90, 56, 75, 48, 77, 76, 73     | 72.75              | 36.69
9   | 3 | 3 | 2 | 1 | 92, 82, 49, 61, 81, 84, 82, 61     | 74                 | 36.82

The S/N_L column is obtained by taking the inverse of y² for each y value and
applying the larger-is-better formula below.

In case of larger-is-better: S/N_L = −10 log₁₀[(1/n) Σ (i = 1 to n) 1/y²i]

In case of smaller-is-better: S/N_S = −10 log₁₀[(1/n) Σ (i = 1 to n) y²i]

Let's say we prefer larger-is-better.

You can use Microsoft Excel (or any software) to complete the tables below.
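Instead of Excel, the row averages and larger-is-better S/N ratios can be reproduced with numpy; a sketch:

```python
import numpy as np

# eight responses per run, from the inner-array table above
y = np.array([
    [76, 67, 87, 78, 62, 65, 73, 67],
    [81, 52, 69, 53, 57, 76, 68, 62],
    [78, 68, 72, 61, 54, 81, 69, 75],
    [72, 74, 51, 77, 67, 89, 88, 85],
    [69, 81, 66, 75, 71, 82, 81, 92],
    [56, 62, 93, 83, 70, 69, 63, 97],
    [72, 71, 71, 90, 86, 71, 67, 85],
    [87, 90, 56, 75, 48, 77, 76, 73],
    [92, 82, 49, 61, 81, 84, 82, 61],
])

ybar = y.mean(axis=1)                               # row-wise averages
snl = -10 * np.log10((1.0 / y**2).mean(axis=1))     # larger-is-better S/N

for run, (m, s) in enumerate(zip(ybar, snl), 1):
    print(f"run {run}: ybar = {m:.3f}, S/N_L = {s:.2f}")
```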
Average response and S/N by factor level (the Low row of ȳ is computed; complete
the remaining cells in the same way):

              | A      | B      | C      | D
ȳ, Low        | 68.792 | 74.625 | 72.917 | 74.333
ȳ, Medium     |        |        |        |
ȳ, High       |        |        |        |
S/N_L, Low    |        |        |        |
S/N_L, Medium |        |        |        |
S/N_L, High   |        |        |        |

Now, the interaction effect between A & E: the average response for variable A
(Low, Med, High), computed separately at each level of E:

E \ A | Low   | Med   | High
Low   | 70.17 | 71.58 | 74.67
High  |       |       |

Plot these two lines against the levels of A.

Similarly, we can draw such diagrams for the other variables. Finally we draw:

Figure: average response ȳ for all control variables separately, plotted against the
levels (Low, Med, High); connect the dots with straight lines.

And:

Figure: average S/N_L for all control variables separately, plotted against the
levels (Low, Med, High); connect the dots with straight lines.

Looking at both of the last charts, for maximizing the response variable y, read off
the optimal values of the control variables:

Controllable variable | Level
A                     |
B                     |
C                     |
D                     |

You can do the confirmation test afterward and validate your experimental
results.


REGRESSION ANALYSIS
A methodology to develop and utilize the relation between two or more quantitative
variables so as to predict one variable from the other or others. It has a wide range of uses.
Difference between correlation and regression
Correlation shows the direction and strength of association between two random
variables (both assumed normally distributed). Here there is no need to consider which
variable is independent or dependent, so it does not say how a unit change in X
or Y changes the other.
Linear regression is about the dependency of one variable (Y) on one or more other
variables (X's). It expresses the change in the response due to a change in the predictor(s).
So regression analysis serves a model-building purpose.
A model is a representation of the world (but not a perfect one). A model is not a
duplication.
Why? Error!
Data = model + error/residual. If data = model, we would have a perfect duplication.
Statistical relation
There is no perfect relation. Say the year-end evaluation of an employee's performance is the
dependent or response variable, Y, and the mid-year evaluation is the independent, explanatory,
regressor or predictor variable, X. Y given x is a conditional RV with mean μ_{Y|x}. Its estimated
mean is μ̂_{Y|x} = f(x), where x is known in advance. Any value of X can be used, within
a known range, to predict Y, where the x values are real numbers and each Y is a RV.

{(x1, Y1), (x2, Y2), …, (xn, Yn)} are the n pairs of data; the xi are observed values of X.

You can draw a scatter diagram/plot and see the trend. You can probably assume a
line/curve but definitely cannot obtain a perfect relation.
Each point in this plot is called a trial or a case.

Construction of Regression Models

Selection of predictor variable(s)
Functional form of the regression relation
Scope of the model: some interval or range of x (or the x's)


Simple Regression Model with the distribution of error terms unspecified
By simple, we mean only one predictor variable, and the regression function is linear (in
the parameters). The model is

Yi = β0 + β1 xi + εi

where i represents the ith trial (i = 1, 2, 3, …, n), the betas are the regression parameters/
regression coefficients: β1 is the slope of the regression line (the change in Y for one
additional unit of X) and β0 is the intercept of the regression line (you can
draw a figure and explain); εi is a random error with mean E{εi} = 0 and variance
σ²{εi} = σ². εi and εj are uncorrelated, so their covariance is zero (i.e.,
σ{εi, εj} = 0 for all i, j, i ≠ j).

This model is also called the first-order model.

Its main features are:
1. Yi is a random variable, as it is the sum of two components: a constant term and a
random term.
2. E{Yi} = E{β0 + β1 xi + εi} = β0 + β1 xi + E{εi} = β0 + β1 xi
So the regression function for the above model is E{Y} = β0 + β1 x.
3. The response of the ith trial exceeds or falls short of the value of the regression
function by the error term εi.
4. εi is assumed to be normally distributed with zero mean and constant
variance σ²{εi} = σ². So the response Yi is also normally distributed and has the
same constant variance, σ²{Yi} = σ². That is,

σ²{Yi} = σ²{β0 + β1 xi + εi} = σ²{εi} = σ², and εi ~ N(0, σ²)

That is, the probability distributions of Y have the same variance σ², regardless of the
level of X.
5. The error terms are uncorrelated. The outcome of any one trial has no effect on the
error term of any other trial. So are the response terms, say Yi and Yj.
6. In summary, the responses Yi come from probability distributions with mean
E{Yi} = β0 + β1 xi and variance σ², i.e., Yi ~ N(β0 + β1 xi, σ²). The random
variables Yi are independently and normally distributed.



Ex. Suppose the following regression model is appropriate in a certain case:

Yi = 9.5 + 2.1 xi + εi, so E{Y} = 9.5 + 2.1x

If xi = 45 at a given time, then E{Yi} = 9.5 + 2.1(45) = 104.
If the error term is 4, then Yi = 104 + 4 = 108.

So εi = Yi − E{Yi} for a single value of x.

The probability distribution of Yi can be shown.
Alternative versions of the Regression Model
An equivalent form of the simple linear regression model is

Yi = β0 x0 + β1 xi + εi, where x0 ≡ 1 (a dummy variable)

Another alternative version is

Yi = β0 + β1(xi − x̄) + β1 x̄ + εi = (β0 + β1 x̄) + β1(xi − x̄) + εi
   = β0* + β1(xi − x̄) + εi
Data for Regression Analysis
Observational data: non-experimental/preselected or uncontrolled data. Adequate
care is needed to draw cause-and-effect relationships.
Experimental data: from a controlled experiment, e.g., productivity vs. length of training.
Take several workers, train them, and record their performance for several weeks.
Length of training is called a treatment. Completely randomized data: every
experimental unit has an equal chance to receive any one of the treatments.
Overview of steps in Regression Analysis


Figure: Overview flowchart. Start; perform exploratory data analysis; develop one or
more tentative regression models; check whether one or more of the models is
suitable for the data at hand; if not, revise and develop new models (and repeat);
if so, identify the most suitable model, make inferences on the basis of the
regression model, and stop.

Least Squares Method

Step 1: Construct a scatter plot (xi, Yi), i = 1, 2, …, n, to fit a straight line to the
bivariate data. See if the points appear to approximate a straight line. Otherwise,
consider an alternative model for the data.

Step 2: Does the predicted straight line pass through the origin (Yi = β1 xi)? Or does it
fit an equation like Yi = β0 + β1 xi? Then, what are good estimated values of β0
and β1 for a best-fit line? To solve this problem, the method used is the
least squares method.

Line fitting using the Least Squares Method
Suppose the line that best fits the data is Ŷ = b0 + b1 x. The ith
fitted/predicted value is defined by Ŷi = b0 + b1 xi.
The ith random error or residual is the difference between the ith observed value and
the ith predicted value, which is

ei = Yi − (b0 + b1 xi)

Random errors take positive and negative values (see figure).

Figure: scatter of the Y values against the X values, with the fitted line and residuals.

The least squares criterion is

SSE = Q = Σ e²i = Σ (Yi − Ŷi)² = Σ [Yi − (b0 + b1 xi)]²

It should be as small as possible. The distribution of the residuals is ei ~ N(0, σ²).

For the regression model, the estimators should have values that minimize Q for any
given set of sample data. Use the rules of summation to obtain the following
normal equations:

n·b0 + b1 Σ xi = Σ Yi and b0 Σ xi + b1 Σ x²i = Σ xi Yi

The (point) estimates b1 (slope) and b0 (intercept) are
b1 = [n Σxiyi − (Σxi)(Σyi)] / [n Σxi² − (Σxi)²]
In simpler form,
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
and
b0 = ȳ − b1x̄
The normal equations can be derived by calculus.

∂Q/∂β0 = ∂(SSE)/∂β0 = −2 Σ[yi − (β0 + β1xi)] = 0 and
∂Q/∂β1 = ∂(SSE)/∂β1 = −2 Σ[yi − (β0 + β1xi)]xi = 0
Or,
n·b0 + b1 Σxi = Σyi .............. (1)
b0 Σxi + b1 Σxi² = Σ xiyi ........ (2)
Then,
b1 = [n Σxiyi − (Σxi)(Σyi)] / [n Σxi² − (Σxi)²] and b0 = ȳ − b1x̄
These estimated values b0 and b1 are logical and thus good estimators of β0 and β1. The estimated regression line is ŷ = Ŷx = b0 + b1xi. The estimators are linear combinations of the Yi and hence are called linear estimators.
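A minimal sketch of these formulas in Python with NumPy (the function name fit_line is my own, not from the notes):

    import numpy as np

    def fit_line(x, y):
        """Least squares estimates b0 (intercept) and b1 (slope)."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        xbar, ybar = x.mean(), y.mean()
        b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
        b0 = ybar - b1 * xbar
        return b0, b1

The same numbers fall out of np.polyfit(x, y, 1), which can serve as a cross-check.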

Properties of Least square estimators
1. The Gauss-Markov theorem states that, under the conditions of the regression model, the least square estimators are unbiased and have minimum variance among all unbiased linear estimators. That is, E(b0) = β0 and E(b1) = β1 (the estimators do not tend to overestimate or underestimate systematically).
2. The estimators are more precise (i.e., their sampling distributions are less variable) than any other estimators belonging to the class of unbiased estimators that are linear functions of the observations Y1, Y2, ..., Yn. The estimators are linear functions of the Yi.
Summation properties
1. Σ(xi − x̄) = 0
2. Σ(xi − x̄)(yi − ȳ) = Σ(xi − x̄)yi
3. Σ(xi − x̄)² = Σ(xi − x̄)xi
4. Σ(xi − x̄)(yi − ȳ) = [n Σxiyi − (Σxi)(Σyi)] / n
5. Σ(xi − x̄)² = Σxi² − (Σxi)²/n
(all sums over i = 1 to n)

Distribution of b1
To develop confidence intervals or test hypotheses on the slope of the regression line, we need to know its distribution. The least square estimator b1 is an unbiased estimator for this parameter. Now
b1 = [n Σxiyi − (Σxi)(Σyi)] / [n Σxi² − (Σxi)²]
Apply summation properties 2, 3 and 5; then
b1 = Σ(xi − x̄)yi / Σ(xi − x̄)²
Let cj = (xj − x̄) / Σ(xi − x̄)², where j = 1, 2, 3, ..., n. Then b1 = Σ cjYj, a linear function of the independent random variables Y1, Y2, ..., Yn.
Since any linear function of independent normal random variables is normally distributed, b1 is normal. For the different values of b1 obtained by repeated sampling when x is held constant, a sampling distribution of b1 can be developed. Its mean is E(b1) = β1 and its variance is
σ²(b1) = σ² / Σ(xi − x̄)². So, b1 ~ N(β1, σ² / Σ(xi − x̄)²)
That is, the point estimator is
b1 = Σ(xi − x̄)(Yi − Ȳ) / Σ(xi − x̄)² = Σ(xi − x̄)Yi / Σ(xi − x̄)² = Σ cjYj, and b0 = ȳ − b1x̄,
where the constant cj = (xj − x̄) / Σ(xi − x̄)² is a function of the xi only. As the xi have fixed values, the cj are fixed quantities.
Properties of cj:
Σcj = 0, Σcjxj = 1, and Σcj² = 1 / Σ(xi − x̄)²
Distribution of b0
This is also normally distributed, with E(b0) = β0 and variance
σ²(b0) = σ² Σxi² / [n Σ(xi − x̄)²]
So, b0 ~ N(β0, σ² Σxi² / [n Σ(xi − x̄)²])

Estimator of unknown variance σ²
σ² denotes the variability of each of the random variables Yi about the true regression line. We use information concerning the variability of the data points about the fitted regression line. Since a residual measures the unexplained random deviation of a data point from the estimated regression line, the residuals are used to estimate σ². The estimator for σ² is
s² = σ̂² = SSE / (n − 2)

Ex. In a small-scale study of persistence, an experimenter gave three subjects a very difficult task. Data on the age of the subject (x) and on the number of attempts to accomplish the task before giving up (y) follow:

Subject i:              1    2    3
Age xi:                20   55   30
Number of attempts Yi:  5   12   10

Apply the least square SSE (also called Q) criterion for the fit of the regression line.
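Using the fit_line sketch from above on these three points (a hedged illustration, not part of the original notes):

    x = [20, 55, 30]   # ages
    y = [5, 12, 10]    # attempts
    b0, b1 = fit_line(x, y)
    print(b0, b1)      # roughly 2.81 and 0.177, so y-hat = 2.81 + 0.177x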
Ex. A company manufactures refrigeration equipment and many replacement parts. For one part (produced periodically in lots of varying sizes), the company wishes to determine the optimum lot size. Production involves an unavoidable setup of the process plus machining and assembly operations. One key input for determining the optimum lot size is the relationship between lot size and the labor hours required to produce the lot. Data were collected to develop that relationship under stable production conditions.

Production run i:  1    2    3    ...  23   24   25  | Total  Mean
Lot size xi:       80   30   50   ...  40   80   70  | 1,750  70.0
Work hours Yi:     399  121  221  ...  244  342  323 | 7,807  312.28

Summary quantities: Σ(xi − x̄)(Yi − Ȳ) = 70,690; Σ(xi − x̄)² = 19,800; Σ(Yi − Ȳ)² = 307,203

Construct the regression equation.

If you use software, you will get a table like the one below:

Predictor   Coefficient   Stdev    t-ratio   P
Constant    62.37         26.18    2.38      0.026
x           3.5702        0.3470   10.29     0.000
S = 48.82   R-sq = 82.2%   R-sq(adj) = 81.4%

Write the regression equation:
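As a cross-check of the coefficients (my own arithmetic from the summary quantities above, not part of the software output):

    Sxy, Sxx = 70_690, 19_800
    xbar, ybar = 70.0, 312.28
    b1 = Sxy / Sxx           # 3.5702
    b0 = ybar - b1 * xbar    # about 62.37
    print(b0, b1)            # matches the output: y-hat = 62.37 + 3.5702x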
Point Estimation of Mean Response
Suppose the sample estimates of the parameters have been computed. The mean (expected) response of the regression function is
E{y} = β0 + β1x (the mean of the probability distribution of y for a level of x)
The estimated regression function is ŷ = b0 + b1x (the point estimator of the mean response when the level of the predictor variable is x; ŷ is an unbiased estimator of E{y}).
For the cases in the study, ŷi = b0 + b1xi, i = 1, 2, ..., n (the fitted value for the ith trial).
Ex. Write the estimated regression function/equation for the previous example. Plot it alongside the scatter plot and comment on whether it represents a good relationship between lot size and work hours.

Production   Lot size   Work hours   Estimated mean    Residual       Squared residual
run i        xi         yi           response ŷi      yi − ŷi = ei   (yi − ŷi)² = ei²
1            80         399          347.98            51.02          2,603.0
2            30         121          169.47            ...            ...
3            50         221          240.88            ...            ...
...          ...        ...          ...               ...            ...
23           40         244          ...               ...            ...
24           80         342          ...               ...            ...
25           70         323          ...               ...            ...
Total        1,750      7,807        7,807             0              54,825
Mean         70.0       312.28

Residuals
The ith residual is
ei = Yi − Ŷi = Yi − (b0 + b1xi)
The residual is different from the model error term εi = Yi − E{Yi}. The error term involves the vertical deviation of Yi from the unknown true regression line, and it is therefore also unknown. The residual is the vertical deviation of Yi from the fitted value Ŷi on the estimated regression line, and it is known.
Residuals are highly important for studying whether a given regression model is appropriate for the data at hand.
Properties of fitted regression line
The following properties of the least squares fitted/estimated regression function do not apply to all regression models and thus deserve attention.
1. The sum of the residuals is zero: Σ ei = 0
2. The sum of the squared residuals, Σ ei², is a minimum. This is the quantity SSE that is minimized when the least squares parameter estimates are used.


3. The sum of the observed values Yi equals the sum of the fitted values ŷi. That is, Σ Yi = Σ ŷi
4. The sum of the weighted residuals is zero when the residual in a trial is weighted by the level of the predictor variable in that trial. That is, Σ xiei = 0
5. The sum of the weighted residuals is zero when the residual in a trial is weighted by the fitted value of the response variable for that trial. That is, Σ Ŷiei = 0
6. The regression line always goes through the point (x̄, ȳ).


Estimation of Error Term (εi) Variance σ²
This is important for obtaining an indication of the variability of the probability distributions of Y in Yi = β0 + β1xi + εi, as well as for a variety of inferences concerning the regression function.
For point estimation, consider sampling from a single population of unknown mean. The variance is estimated by the sample variance s². That is, the unbiased estimator of the variance σ² is the sample variance (often called a mean square)
s² = Σ(yi − ȳ)² / (n − 1)
The numerator is called a sum of squares. The variance of each observation yi for the above regression model is σ², the same as that of the error term εi. The yi come from different probability distributions with different means that depend upon the level of xi. The deviation, or residual, of an observation yi from its estimated mean is
ei = yi − ŷi
The sum of squares of the residuals, called the error (or residual) sum of squares (SSE), is defined as
SSE = Σ ei² = Σ(yi − ŷi)² = Σ[yi − (b0 + b1xi)]²
SSE is a measure of the deviation of the data points from the predicted line. The method of least squares finds those values of β0 and β1 that minimize SSE. SSE is a function of β0 and β1, given by
SSE = Σ εi² = Σ[yi − (β0 + β1xi)]² = f(β0, β1), summed over i = 1 to n


SSE has n − 2 degrees of freedom: two degrees of freedom are lost, one for each of the two estimated parameters. Hence the error mean square, or residual mean square, is defined as
MSE = SSE / (n − 2) = Σ ei² / (n − 2)
MSE is an unbiased estimator of σ². That is, E{MSE} = σ².
Ex. Find the estimate of the variance σ² for the previous example.
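A quick check in Python, using the residual total SSE = 54,825 reported in the table above (my own arithmetic):

    SSE, n = 54_825, 25
    MSE = SSE / (n - 2)   # about 2,383.7
    s = MSE ** 0.5        # about 48.8, matching S = 48.82 in the software output
    print(MSE, s)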

Regression analysis can also be done by the following route.
SSE can be partitioned into three summands (each summand is nonnegative). Define
Sxx = Σ(xi − x̄)² = Σxi² − n·x̄²
Syy = Σ(yi − ȳ)² = Σyi² − n·ȳ²
Sxy = Σ(xi − x̄)(yi − ȳ) = Σxiyi − n·x̄·ȳ
The decomposition of the sum of squares into a sum of three summands is
f(β0, β1) = SSE = Syy(1 − r²) + (β1√Sxx − r√Syy)² + n[ȳ − (β0 + β1x̄)]²
where r = Sxy / √(Sxx·Syy) is the sample correlation coefficient. Here the second and third summands depend on the unknown parameters β0 and β1. Let us attempt to minimize f(β0, β1) by choosing these parameters so that the 2nd and 3rd terms equal zero. That is,
(β1√Sxx − r√Syy)² = 0 ......... (1)
ȳ − (β0 + β1x̄) = 0 ........... (2)
Solve these two equations and find the values of β0 (intercept) and β1 (slope). The estimated values are
b1 = r√(Syy/Sxx) = Sxy/Sxx, which is an unbiased estimator of β1. This estimator is normally distributed with variance σ²/Sxx,
and
b0 = ȳ − b1x̄, which is an unbiased estimator of β0 and is normally distributed with variance σ²(Σxi²)/(n·Sxx),
and s² = SSE/(n − 2) is the unbiased estimator of σ².
Finally, the least squares line, or estimated regression function of y on x, is
ŷ = b0 + b1x
You can reach the same findings by differentiating SSE with respect to β0 and β1:
SSE = Σεi² = Σ[yi − (β0 + β1xi)]²
∂(SSE)/∂β0 = −2 Σ[yi − (β0 + β1xi)] = 0 and ∂(SSE)/∂β1 = −2 Σ[yi − (β0 + β1xi)]xi = 0
Even though the regression equation actually estimates the mean of y for a given value of x, it is used extensively to estimate the value of Y itself. It is an estimated average value:
ŷ = Ŷ = b0 + b1x

Interval estimation and hypothesis testing
Hypotheses: H0: β1 = 0 (no slope) against any of the following three hypotheses: H1: β1 < 0 (negative slope, left-tailed test); H1: β1 > 0 (positive slope, right-tailed test); H1: β1 ≠ 0 (two-tailed test).
Test statistic: if H0 is true, then SSR = 0 (the regression model explains none of the variation). The statistic is
t0 = b1 / (S/√Sxx) ~ t(n − 2)
For the two-tailed test, reject H0 if |t0| > tα/2, n−2. Find the critical value from the t-table and draw the conclusion.
The confidence interval on β1, the slope of the regression line, is
b1 ± tα/2, n−2 · S/√Sxx
The test statistic on β1 for a general null value β1,0 is
t0 = (b1 − β1,0) / (S/√Sxx) ~ t(n − 2)
The confidence interval on β0, the intercept of the regression line, is
b0 ± tα/2, n−2 · S√(Σxi² / (n·Sxx))
The test statistic for H0: β0 = 0 (no intercept) is
t0 = b0 / [S√(Σxi² / (n·Sxx))] ~ t(n − 2)
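A sketch of the slope test for the lot-size example, assuming SciPy is available (the numbers come from the software output above):

    from scipy.stats import t

    b1, se_b1, n = 3.5702, 0.3470, 25
    t0 = b1 / se_b1                      # about 10.29
    t_crit = t.ppf(1 - 0.05 / 2, n - 2)  # about 2.069 for alpha = 0.05
    print(t0 > t_crit)                   # True: reject H0, the slope is significant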

Inferences about the estimated mean:
The distribution of the regression line ŷ = Ŷx = b0 + b1x is
Ŷx ~ N(μY|x, [1/n + (x − x̄)²/Sxx]·σ²)
The 100(1 − α)% confidence interval for the mean value of Y when X = x is
Ŷx ± tα/2, n−2 · S√(1/n + (x − x̄)²/Sxx), or (b0 + b1x) ± tα/2, n−2 · S√(1/n + (x − x̄)²/Sxx)
The prediction interval for the response variable for a given value of x is
(b0 + b1x) ± tα/2, n−2 · S√(1 + 1/n + (x − x̄)²/Sxx)
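A hedged sketch of both intervals for the same example at x = 65 (an arbitrary level I chose for illustration):

    import numpy as np
    from scipy.stats import t

    b0, b1, S, n = 62.37, 3.5702, 48.82, 25
    xbar, Sxx = 70.0, 19_800
    x = 65.0                               # arbitrary level of X
    y_hat = b0 + b1 * x                    # about 294.4
    tc = t.ppf(0.975, n - 2)
    ci = tc * S * np.sqrt(1/n + (x - xbar)**2 / Sxx)      # half-width, about 20.5
    pi = tc * S * np.sqrt(1 + 1/n + (x - xbar)**2 / Sxx)  # half-width, about 103
    print(y_hat - ci, y_hat + ci)          # confidence interval for the mean
    print(y_hat - pi, y_hat + pi)          # prediction interval for a new response

The prediction interval is much wider because it must cover the variability of a single new observation, not just the uncertainty in the estimated mean.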

The ANOVA method of regression analysis can be applied at this point.
The ANOVA method is based on a partitioning of sums of squares and degrees of freedom associated with the response variable Y. The variation is conventionally measured in terms of the deviations of the yi from ȳ: (yi − ȳ).
The total variation, or total sum of squares SST, is defined as SST = Σ(yi − ȳ)².
The magnitude of SST indicates the level of uncertainty in Y when the predictor variable X is not taken into account. SST can be decomposed as
Total deviation (yi − ȳ) = deviation of the fitted regression value around the mean (ŷi − ȳ) + deviation around the fitted regression line (yi − ŷi)
When the predictor variable X is used, the measure of variation in the yi observations is SSE:
SSE = Σ(yi − ŷi)²
The regression sum of squares (or sum of squared deviations of the fitted values) is termed SSR:
SSR = SST − SSE = Σ(ŷi − ȳ)²
Breakdown of Degrees of freedom
Associated with SST, df = n − 1. Associated with SSE, df = n − 2. Associated with SSR, df = 1.

SST = SSE + SSR. Their degrees of freedom are also additive, i.e., n − 1 = (n − 2) + 1.
Mean Squares (MS)
An MS is a sum of squares divided by its respective df. The regression mean square (MSR) is
MSR = SSR/df = SSR/1 = SSR
The error mean square is
MSE = SSE/(n − 2)
An important note is that the two mean squares MSR and MSE do not add to SST divided by its df; that is, mean squares are not additive.
Two important implications of the mean squares are:
The mean of the sampling distribution of MSE is σ² whether or not X and Y are linearly related, i.e., whether or not β1 = 0.
The mean of the sampling distribution of MSR is also σ² when β1 = 0. In that case the sampling distributions of MSR and MSE are located identically, and MSR and MSE will tend to be of the same order of magnitude.
If β1 ≠ 0, then E{MSR} = σ² + β1² Σ(xi − x̄)² > σ² (see the E{MS} column in the table below), so MSR will tend to be larger than MSE.

ANOVA Table: Two types of table
1. Basic Table (for simple linear regression)

Source of variation   SS                    df      MS                  E{MS}
Regression            SSR = Σ(ŷi − ȳ)²     1       MSR = SSR/1         σ² + β1² Σ(xi − x̄)²
Error                 SSE = Σ(yi − ŷi)²    n − 2   MSE = SSE/(n − 2)   σ²
Total                 SST = Σ(yi − ȳ)²     n − 1

2. Modified Table: a table showing one additional element of decomposition, based on the fact that SST can itself be partitioned into two parts, as follows:
SST = Σ(yi − ȳ)² = Σyi² − n·ȳ². The total uncorrected sum of squares is SSTU = Σyi², and the correction for the mean is SScorrection = n·ȳ².

Source of variation    SS                    df      MS
Regression             SSR = Σ(ŷi − ȳ)²     1       MSR = SSR/1
Error                  SSE = Σ(yi − ŷi)²    n − 2   MSE = SSE/(n − 2)
Total                  SST = Σ(yi − ȳ)²     n − 1
Correction for mean    SScorrection = n·ȳ²   1
Total, uncorrected     SSTU = Σyi²           n

F Test of β1 = 0 versus β1 ≠ 0
For regression analysis, the ANOVA approach provides us with a test for
H0: β1 = 0
H1: β1 ≠ 0
Test statistic: for ANOVA it is denoted by F0. It compares MSR and MSE in the following fashion:
F0 = MSR / MSE
Large values of F0 support H1 and values of F0 near 1 support H0. That is, the appropriate test is an upper-tail test.
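For the lot-size example, the ANOVA quantities follow from the totals already computed (my own arithmetic; scipy.stats.f supplies the critical value):

    from scipy.stats import f

    SST, SSE, n = 307_203, 54_825, 25
    SSR = SST - SSE                    # 252,378
    F0 = (SSR / 1) / (SSE / (n - 2))   # about 105.9, the squared t-ratio 10.29**2
    F_crit = f.ppf(0.95, 1, n - 2)     # about 4.28
    print(F0 > F_crit)                 # True: reject H0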
Coefficient of Determination
We have seen that SST is a measure of the uncertainty in predicting Y when no account of the predictor is taken, whereas SSE measures the variation in the Yi when a regression model utilizing the predictor variable X is employed. A natural measure of the effect of X in reducing the uncertainty in predicting Y is the reduction in variation, SST − SSE = SSR, expressed as a proportion of the total variation:
R² = SSR/SST = 1 − SSE/SST = 1 − (SS around the least squares line)/(total sum of squares around ȳ)
The measure R² is called the coefficient of determination. As 0 ≤ SSE ≤ SST,
0 ≤ R² ≤ 1
We may interpret R² as the proportionate reduction of the total variation associated with the use of the predictor variable X. The larger R² is, the more the total variation of Y is reduced by introducing the predictor variable X. The limiting values of R² occur as follows:
When all observations fall on the fitted regression line, SSE = 0 and R² = 1. The closer R² is to 1, the greater is said to be the degree of linear association between X and Y.
When the fitted regression line is horizontal, so that b1 = 0 and Ŷi = Ȳ, then SSE = SST and R² = 0. Here there is no linear association between X and Y in the sample data; the predictor variable X is of no help in reducing the variation in the observations Yi with linear regression.
In computer output, R² is labeled R-sq (in percent form), and R-sq(adj) is the adjusted coefficient of determination, in which each sum of squares is divided by its respective degrees of freedom.
R² and r carry different meanings for linear regression:
A high r does not necessarily mean that a useful prediction can be made.
A high r does not necessarily mean that the regression line is a good fit.
A zero value of r does not necessarily indicate that x and y are not related.
Adjusted R²
This reduced value of R² attempts to estimate the value of R² in the population:
Adj R² = 1 − (1 − R²)(N − 1)/(N − k − 1), where N is the sample size and k is the number of X variables. As N increases (for fixed k and R²), the adjustment diminishes and Adj R² approaches R².
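Checking against the lot-size output (my own arithmetic):

    SST, SSE, N, k = 307_203, 54_825, 25, 1
    R2 = 1 - SSE / SST                            # about 0.822, i.e., R-sq = 82.2%
    adjR2 = 1 - (1 - R2) * (N - 1) / (N - k - 1)  # about 0.814, i.e., R-sq(adj) = 81.4%
    print(R2, adjR2)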

Lack-of-fit Tests
For experimental data, a regression model can be fitted even when the form of the relationship was previously unknown. So, check whether the model is correct by performing a lack-of-fit test.
For the test, partition SSE as SSE = SS due to pure error (SSpe) + SS due to lack of fit (SSlof). Calculate the first one directly, and then
SSlof = SSE − SSpe
For SSpe, repeated observations on y at at least one level/value of x are required.
Hypotheses: H0: the model adequately fits the data
H1: the model does not adequately fit the data

Ex.
x:  2.0   2.0   3.0   4.4   4.4   5.1   5.1   5.1   5.8   6.0
y:  2.4   2.9   3.2   4.9   4.7   5.7   5.7   6.0   6.3   6.5

There are repeated observations on y at the levels x = 2.0, 4.4 and 5.1. Construct the regression line.
For SSpe:

x      y                ȳ     Σ(yi − ȳ)²   df
2.0    2.4, 2.9                             2 − 1 = 1
4.4    4.9, 4.7                             2 − 1 = 1
5.1    5.7, 5.7, 6.0                        3 − 1 = 2

SSpe = Σ(yi − ȳ)², summed within each repeated-x group.

Source of variation    SS   df   MS       F0
Regression                  1             MSR/MSE
Error/residual              8             ?
Pure error, SSpe            4    MSpe
Lack of fit                 4    MSlof    F0 = MSlof/MSpe

From the F-table, F0.05, 4, 4 = 6.39. Do not reject H0 (no lack of fit) if F0 ≤ F0.05, 4, 4.
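A sketch of the whole computation in Python/NumPy (my own code; the numbers in the comments are what these formulas give for the data above):

    import numpy as np

    x = np.array([2.0, 2.0, 3.0, 4.4, 4.4, 5.1, 5.1, 5.1, 5.8, 6.0])
    y = np.array([2.4, 2.9, 3.2, 4.9, 4.7, 5.7, 5.7, 6.0, 6.3, 6.5])

    # Fit the line and get SSE
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    SSE = np.sum((y - (b0 + b1 * x)) ** 2)

    # Pure-error SS: within-group variation at repeated x levels
    # (groups with a single observation contribute zero)
    SSpe = sum(np.sum((y[x == v] - y[x == v].mean()) ** 2) for v in np.unique(x))
    SSlof = SSE - SSpe
    F0 = (SSlof / 4) / (SSpe / 4)   # df: lack of fit = 4, pure error = 4; about 1.3
    print(F0 <= 6.39)               # True: do not reject H0, no lack of fit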

Multiple Regression Models: Consider several predictor variables
First-order model with two predictor variables
The model is: Yi = β0 + β1Xi1 + β2Xi2 + εi
As E{εi} = 0, the regression function for the model is
E{Y} = β0 + β1X1 + β2X2
(it is not a line but a plane, called the regression surface or response surface). The error term is defined as εi = Yi − E{Yi}.
First-order model with more than two predictor variables: General Linear Model
Yi = β0 + β1Xi1 + β2Xi2 + ... + βk−1Xi,k−1 + εi


Assuming that E{εi} = 0, the response function for the model is
E{Y} = β0 + β1X1 + β2X2 + ... + βk−1Xk−1
which is a hyperplane (a plane in more than two dimensions).
Polynomial Regression Models and transformed variables
Yi = β0 + β1xi + β2xi² + ... + βp−1xi^(p−1) + εi
Now, we can write down such equations for all the x data (i = 1, 2, ..., n), then write the matrices as shown below and solve.
Matrix Approach to Least Squares method
For a complex model having several X variables, the matrix form eases the regression analysis. We can:
a. Express the general linear model in matrix form.
b. Find a matrix expression for the normal equations.
c. Find expressions for the least square estimates by solving the normal equations.
d. Apply the results obtained to the polynomial and multiple linear regression models.
The general linear model: Yi = β0 + β1x1i + β2x2i + ... + βkxki + εi, i = 1, 2, ..., n
Its expanded form is:
Y1 = β0 + β1x11 + β2x21 + ... + βkxk1 + ε1
Y2 = β0 + β1x12 + β2x22 + ... + βkxk2 + ε2
...
Yn = β0 + β1x1n + β2x2n + ... + βkxkn + εn
The three column vectors are:
Response vector (n×1 matrix): Y = [Y1, Y2, ..., Yn]'
Parameters vector ((k+1)×1 matrix): β = [β0, β1, ..., βk]'
Random error vector (n×1 matrix): ε = [ε1, ε2, ..., εn]'
Then the k-predictor-variables matrix, called the model specification matrix, is the n×(k+1) matrix

        | 1  x11  x21  ...  xk1 |
        | 1  x12  x22  ...  xk2 |
    X = | .   .    .         .  |
        | 1  x1n  x2n  ...  xkn |

To change from one model to another, simply change this matrix. The product Xβ is an n×1 matrix. Thus the multiple regression model matrices are combined as
Y = Xβ + ε

To find the matrix formulation of the normal equations, consider the matrix X'X:

           | n      Σx1i      Σx2i      ...  Σxki    |
           | Σx1i   Σx1i²     Σx1ix2i   ...  Σx1ixki |
    X'X =  | Σx2i   Σx1ix2i   Σx2i²     ...  Σx2ixki |
           | .      .         .              .       |
           | Σxki   Σx1ixki   Σx2ixki   ...  Σxki²   |

and the vector X'y:

           | Σyi    |
           | Σx1iyi |
    X'y =  | Σx2iyi |
           | .      |
           | Σxkiyi |

(all sums over i = 1 to n). The normal equations in matrix form are
(X'X)b = X'y
Therefore, b = (X'X)⁻¹X'y
Now the estimated model is
ŷ = b0 + b1x1 + b2x2 + ... + bkxk
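A minimal NumPy sketch of b = (X'X)⁻¹X'y (the small data set here is invented purely for illustration):

    import numpy as np

    # Invented illustration data: 5 cases, two predictors x1 and x2
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
    y  = np.array([3.1, 4.0, 7.2, 7.9, 10.8])

    X = np.column_stack([np.ones_like(x1), x1, x2])  # model specification matrix
    b = np.linalg.solve(X.T @ X, X.T @ y)            # solves (X'X)b = X'y
    print(b)                                         # b0, b1, b2

Solving the normal equations with np.linalg.solve is preferred over forming the inverse explicitly, for numerical stability.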

Ex.
An equation is to be developed for the performance of workers based on their skill levels and years of service. The data are as follows:

Worker no.:     1     2     3     4     5     6     7     8     9     10
Skill level:    1.35  1.90  1.70  1.80  1.30  2.05  1.60  1.80  1.85  1.40
Yr in service:  17.9  16.5  16.4  16.8  18.8  15.5  17.5  16.4  15.9  18.3
