REPORT

An Introduction to Multivariate Calibration and Analysis

Kenneth R. Beebe
Bruce R. Kowalski
Department of Chemistry, BG-10
Laboratory for Chemometrics
University of Washington
Seattle, Wash. 98195

The advent of the laboratory computer has allowed analytical chemists to collect enormous amounts of data on a wide variety of problems of interest. With this ability, however, has come the realization that more data does not necessarily mean more information. One can collect reams of computer output without knowing anything more about the system under investigation. Only when the data are interpreted and put to use do they become valuable to the chemist and to society in general; then data become information. For this reason, data analysis methods that can handle large quantities of data are becoming a necessity for the practicing analytical chemist. The number of books in this area (1-3) is increasing as the awareness of the need for more sophisticated techniques grows. However, widespread use of the full set of analytical tools will be realized only when the analyst is familiar with the general goals and advantages of the methods.

This REPORT serves as an introduction to the area of multivariate analyses known as multivariate calibration. Our goal is to give the chemist insight into the workings of a collection of statistical techniques and to enable him or her to judge the appropriateness of the techniques for an application of interest. The list of methods described is by no means a complete representation of those that can be applied to multivariate data. Also, we do not compare the results obtained using one method over another; instead, we describe the methods in the line of a tutorial. For this reason, the examples chosen illustrate how the methods work rather than compare among them. The three multivariate methods that will be discussed are multiple linear regression (MLR), principal component regression (PCR), and partial least squares (PLS). MLR, an extension of ordinary least squares, is the easiest to understand and also the most commonly used; PCR and PLS have yet to achieve widespread acceptance in chemistry.

Calibration and prediction

Multivariate statistics is a collection of powerful mathematical tools that can be applied to chemical analysis when more than one measurement is acquired for each sample. For the sake of consistency, near-infrared (near-IR) reflectance analysis will be used as an example of such an analytical method throughout this REPORT. However, the techniques described can be used in any situation where multiple measurements are acquired.

One example of an analytical problem solved by near-IR reflectance analysis is the estimation of the protein and moisture content of wheat samples. As with many analytical methods, this procedure consists of two phases: calibration and prediction. The chemist begins by constructing a data matrix (R) from the instrument responses at selected wavelengths for a given set of calibration samples. In the case of near-IR reflectance analysis, the R matrix can be constructed from the logarithmic reflectances (log(1/ref)) or from some other combination of reflectances obtained at the various wavelengths (4). A matrix of concentration values (C) is then formed using independent or referee methods such as Kjeldahl nitrogen analysis for protein determination and oven-drying for the determination of moisture content. The goal of the calibration phase is to produce a model that relates the near-IR spectra to the values obtained by the independent or referee methods.

Figure 1 illustrates the resulting matrices for this example. R is a 3 × 8 matrix with three rows of near-IR spectra from the analysis of three samples and eight columns corresponding to the eight near-IR wavelengths chosen for the analysis. (Throughout this REPORT, these columns will be referred to as the original variables for R.) The C matrix is 3 × 2 dimensional for this example. Again, the three samples occupy the three rows and the two columns represent the protein and moisture content of the samples as determined by the referee methods.

According to this notation, the terms $c_{i1}$ and $c_{i2}$ represent the protein content
0003-2700/87/A359-1007/$01.50/0 ANALYTICAL CHEMISTRY, VOL. 59, NO. 17, SEPTEMBER 1, 1987 · 1007 A
© 1987 American Chemical Society
and moisture content, respectively, for the ith wheat sample found in the ith row of C.

Figure 1. Configuration of R and C matrices.

The next step in the calibration phase is the foundation of the entire analysis. The analyst must choose an appropriate mathematical method that will best reproduce C given the R matrix. How the analyst defines best will ultimately determine the method that is chosen.

MLR assumes that the best approach to estimating C from R is to find the linear combination of the variables in R that minimizes the errors in reproducing C. It proposes a relationship between C and R such that C = RS + E, where S is a matrix of regression coefficients and E is a matrix of errors associated with the MLR model. S is estimated by linear regression, and the term

$$\mathrm{Err} = \sum_{k=1}^{K}\sum_{i=1}^{I}\left(c_{ik}-\hat{c}_{ik}\right)^{2} = \sum_{k=1}^{K}\sum_{i=1}^{I} e_{ik}^{2} \qquad (1)$$

is minimized. In this expression, $c_{ik}$ is the actual concentration element in the ith row and kth column of C, $\hat{c}_{ik}$ is the MLR estimate of the same element using S, and $e_{ik}$ is the corresponding element of the matrix E.

This approach appears to be reasonable, and in fact it is the best method when the analyst is dealing with well-behaved systems (linear responses, no interfering signals, no analyte-analyte interactions, low noise, and no collinearities). However, problems can be encountered with the use of real data. A model chosen solely according to this criterion attempts to use all of the variance in the R matrix, including any irrelevant information, to model C. When the resulting model is then applied to a new sample, the model will assume that the correlation found between the calibration R and C matrices also exists in that sample. Because the model was built using irrelevant information in the R matrix, this assumption will not be true. Unfortunately, even noise has a very high probability of being used to build the model. The following example illustrates this point. Consider the matrices

$$R = \begin{bmatrix} 75 & 152 & 102 \\ 63 & 132 & 82 \\ 96 & 218 & 176 \\ 69 & 157 & 124 \end{bmatrix} \qquad C = \begin{bmatrix} 2 & 7 & 5 \\ 4 & 3 & 3 \\ 9 & 12 & 3 \\ 6 & 8 & 2 \end{bmatrix}$$

$$S = \begin{bmatrix} -0.71 & 0.55 & 0.48 \\ 0.42 & -0.41 & -0.24 \\ -0.08 & 0.28 & 0.05 \end{bmatrix}$$

MLR was used to determine the S matrix, as described earlier. A standard measure of the effectiveness of the model is the value of Err as presented above. For this example Err = 0.49, and it will be assumed that the matrix S is an accurate estimate of the true model. The coefficients in S, therefore, will closely approximate the true relationship between the variables in R and C.

To illustrate how a calibration method can be inappropriate, a column of random numbers ranging from zero to 100 was added to the R matrix. The addition of this column is analogous to the inclusion of a wavelength in near-IR analysis that has no useful information for describing the protein or moisture content of the samples. The resulting matrix $R_2$ and MLR model are

$$R_2 = \begin{bmatrix} 75 & 152 & 102 & 91 \\ 63 & 132 & 82 & 36 \\ 96 & 218 & 176 & 74 \\ 69 & 157 & 124 & 51 \end{bmatrix} \qquad C = \begin{bmatrix} 2 & 7 & 5 \\ 4 & 3 & 3 \\ 9 & 12 & 3 \\ 6 & 8 & 2 \end{bmatrix}$$

$$S_2 = \begin{bmatrix} 0.71 & 0.18 & 0.42 \\ -0.42 & -0.19 & -0.20 \\ 0.24 & 0.20 & 0.03 \\ -0.12 & 0.03 & 0.01 \end{bmatrix}$$

The model resulting from the regression of the same C as in the first example onto this new matrix $R_2$ is given in $S_2$. For this model, Err = 0.07, where the smaller value implies that this second model is more effective at modeling C. To further evaluate these results, note that when R is multiplied by S, the jth column of R is always multiplied by the jth row of S. The importance of variable j to the model can therefore be determined by examining the jth row of S. The nonzero entries in the fourth row of $S_2$ reveal that a variable consisting of random numbers (column number four of $R_2$) was chosen as a significant contributor to the calibration model. In an ideal analysis, this random variable would have been ignored in the model-building phase of the analysis. Furthermore, the inclusion of this column in $R_2$ has changed the estimated model coefficients so that they no longer represent the true model. The upper 3 × 3 portion of $S_2$, which represents the model for the first three variables in $R_2$, is not equal to S, which represents the true model for the same variables. The addition of a column of random numbers has resulted in a model that appears to be better, in that it is more effective at reproducing C, and yet it does not describe the true relationship. This is because MLR uses all of the matrix R to build the model, regardless of whether or not it is relevant in describing the true model. Therefore, an erroneous model can be derived and subsequently used to predict the characteristics (e.g., protein content) of new samples. Thus MLR alone often will generate misleading models with subsequent errors in prediction.

The remainder of this REPORT investigates the advantages of PCR and PLS over MLR. To understand these advantages, it is necessary to understand how the methods work. Graphical representations of many of the concepts have been included so that the reader not well versed in linear algebra will be able to follow the discussions.

Notation

Standard linear algebra notation will be employed throughout this article. Bold upper case letters refer to matrices; plain upper case letters are used for their row and column dimensions. For example, the letter R refers to the matrix of responses and is I × J dimensional. Bold lower case letters signify column vectors such as $r_2$, which represents the second column of matrix R. By this convention $r_2^{T}$ represents the second column of R written as a row vector. Plain lower case letters are scalars, representing single elements of vectors ($a_i$), matrices ($r_{ij}$), or other constants such as regression coefficients (b). The transpose of a matrix or vector is represented by a superscript T. (See
Figure 2. The matrix R in row space.
Figure 3. The matrix R in column space.
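The MLR example with the added column of random numbers can be checked numerically. What follows is a minimal sketch, assuming NumPy is available and taking R, C, and the random column exactly as printed in the example; the helper names `mlr_fit` and `err` are illustrative and do not come from the article. Because the augmented matrix is square (four samples, four variables), a least-squares solve can reproduce C almost exactly, so the computed Err for it may be smaller than the rounded figure quoted in the text, and individual coefficients may differ from the printed ones in the last digit.

```python
import numpy as np

# Calibration data as printed in the example: 4 samples, 3 response variables.
R = np.array([[75, 152, 102],
              [63, 132,  82],
              [96, 218, 176],
              [69, 157, 124]], dtype=float)
C = np.array([[2,  7, 5],
              [4,  3, 3],
              [9, 12, 3],
              [6,  8, 2]], dtype=float)


def mlr_fit(R, C):
    """Least-squares estimate of S in the model C = RS + E."""
    S, *_ = np.linalg.lstsq(R, C, rcond=None)
    return S


def err(R, C, S):
    """Sum of squared residuals over all samples and properties (Equation 1)."""
    E = C - R @ S
    return float(np.sum(E ** 2))


S = mlr_fit(R, C)
err_R = err(R, C, S)

# Append the column of random numbers from the example to form R2.
noise = np.array([[91.0], [36.0], [74.0], [51.0]])
R2 = np.hstack([R, noise])
S2 = mlr_fit(R2, C)
err_R2 = err(R2, C, S2)

print("S =\n", np.round(S, 2))
print("Err for R  =", round(err_R, 2))
print("S2 =\n", np.round(S2, 2))
print("Err for R2 =", err_R2)
```

Comparing the first three rows of `S2` with `S`, and inspecting the fourth row of `S2`, shows the two symptoms discussed in the example: the random column receives nonzero coefficients, and the coefficients of the genuine variables shift even though the fit to C improves.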
stood if one refers back to two or three dimensions and considers situations of higher dimensionality as expansions of these simpler cases.

Projections. Another concept that is important to understand is that of projecting either a point or a vector onto a vector or plane. Each of these can be viewed as being the perpendicular shadow of one object onto another. Figure 4 illustrates the result of projecting the vector a onto a plane in three-dimensional space.

Factors. The last concept that is very basic to understanding both PCR and PLS is that of a factor. This is because both PCR and PLS are factor-based modeling procedures. For our purposes, a factor is defined as any linear combination of the original variables in R or C. It can be shown that given J factors for an I × J matrix R, one can also represent the variables in R as a linear combination of these same

small weights when forming R'.

This is only one of the advantages of a factor-based calibration method over the methods that simply use the raw data. The other possible advantages can be understood by examining one of the factor methods mentioned in the introduction: PCR (3, 7, 8). The following discussion concerns the preliminary step of PCR: principal component analysis (PCA). An intuitive feel for how PCA works can be gained by considering the result of PCA performed on the 5 × 2 matrix R shown below.

$$R = \begin{bmatrix} 2 & 4 \\ 1 & 2 \\ 0 & 0 \\ -1 & -2 \\ -2 & -4 \end{bmatrix}$$

In column space, the matrix is five points in two dimensions as shown in Figure 5.

Figure 5. The matrix R plotted in column space with the first eigenvector.

In real analyses, the columns of R are often mean-centered and scaled. Mean centering simply involves subtracting the average of a column from each of the entries in that column. Scaling,
which gives equal weights to each variable, involves dividing each entry by
the standard deviation of the column. PCA is then
performed on the covariance matrix
$R^{T}R$ formed from the mean-centered
and scaled matrix. The first eigenvector corresponding to the largest eigenvalue is, by definition, the direction in the space defined by the columns of R that describes the maximum amount of variation or spread in the samples. Figure 5 shows the data and the direction of the first eigenvector where the space defined by R is a plane.

In this example, all of the variation in the data can be described using one eigenvector. The samples all fall on a line in column space, and therefore all of the variation lies in one direction.
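The preprocessing and eigenvector steps can be tried on the 5 × 2 example matrix. This is a minimal sketch, assuming NumPy and assuming the five samples lie on the line through (2, 4) and (-2, -4), as the text and Figure 5 indicate; the function name `autoscale` is hypothetical. Scaling in this sketch divides by the sample standard deviation, the common autoscaling convention that gives every column unit variance; that choice is an assumption of the sketch.

```python
import numpy as np

# The 5 x 2 example matrix: rows are samples, columns are variables.
R = np.array([[ 2.0,  4.0],
              [ 1.0,  2.0],
              [ 0.0,  0.0],
              [-1.0, -2.0],
              [-2.0, -4.0]])


def autoscale(X):
    """Mean-center each column, then divide by its sample standard deviation."""
    centered = X - X.mean(axis=0)
    return centered / centered.std(axis=0, ddof=1)


Z = autoscale(R)

# Eigen-decomposition of Z^T Z. eigh returns eigenvalues in ascending order,
# so reorder to put the largest eigenvalue (first eigenvector) first.
eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of the total variation described by each eigenvector.
explained = eigvals / eigvals.sum()
first = eigvecs[:, 0]

print("eigenvalues:", eigvals)
print("fraction of variation:", explained)
print("first eigenvector:", first)
```

For these data the first entry of `explained` is essentially one, which is the numerical counterpart of the statement that a single eigenvector describes all of the variation. The sign of an eigenvector is arbitrary, so `first` may point in either direction along the line.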
When all of the variation in the samples cannot be accounted for using only one eigenvector, a second eigenvector can be found that is perpendicular or orthogonal to the first and describes the maximum amount of residual variation (not described by the first eigenvector) in the data set. Figure 6 is the plot of a 30 × 2 matrix and the associated first two eigenvectors. The direction of the first eigenvector describes the maximum amount of variation or spread in the data. In this particular case, the samples in column space happen to fall within a two-dimensional ellipse, and the first eigenvector corresponds to the major axis of the ellipse.

Figure 4. The projection of a vector onto a plane.
Figure 6. A 30 × 2 matrix plotted in column space with the first two eigenvectors.
Figure 7. The plane formed by the two columns of matrix R plotted in row space.
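The relationship between the two eigenvectors, and the idea of a projection as a perpendicular shadow, can be illustrated with simulated data. The sketch below assumes NumPy and uses a hypothetical 30 × 2 data set drawn from a seeded random generator; it is not the data behind Figure 6. It checks that the second eigenvector is orthogonal to the first, computes each sample's projection onto the first eigenvector (a linear combination of the original variables, that is, a factor), and confirms that what remains lies entirely along the second eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical 30 x 2 data: two correlated columns form an elongated cloud.
latent = rng.normal(size=(30, 1))
X = np.hstack([latent + 0.3 * rng.normal(size=(30, 1)),
               0.5 * latent + 0.3 * rng.normal(size=(30, 1))])
Xc = X - X.mean(axis=0)  # mean-center the columns

# Eigenvectors of Xc^T Xc, sorted so the largest eigenvalue comes first.
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
v1, v2 = eigvecs[:, 0], eigvecs[:, 1]

# Scores: the coordinate of each sample along each eigenvector.
t1 = Xc @ v1
t2 = Xc @ v2

# Perpendicular projection of every sample onto the line spanned by v1,
# and the residual variation left for the second eigenvector.
projection = np.outer(t1, v1)
residual = Xc - projection

print("v1 . v2 =", float(v1 @ v2))
print("variation along v1:", float(np.sum(t1 ** 2)))
print("variation along v2:", float(np.sum(t2 ** 2)))
```

The sums of squared scores equal the corresponding eigenvalues, so the ratio of the two printed variations indicates how elongated the cloud is along its major axis.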
References
(1) Sharaf, M. A.; Illman, D. L.; Kowalski, B. R. Chemometrics; Wiley: New York, 1986.
(2) Malinowski, E. R.; Howery, D. G. Factor Analysis in Chemistry; Wiley: New York, 1980.
(3) Draper, N.; Smith, H. Applied Regression Analysis, 2nd ed.; Wiley: New York, 1981.
(4) Wetzel, D. L. Anal. Chem. 1983, 55, 1165A.
(5) Strang, G. Linear Algebra and Its Applications; Academic Press: New York, 1980.
(6) Wesson, J. R. Lessons in Linear Algebra; Charles E. Merrill Publishing Co.: Columbus, 1974.

Kenneth R. Beebe is a research assistant at the University of Washington. He received his B.S. in chemistry in 1981, and his M.S. in 1985 from the University of Washington. Currently he is enrolled in the Ph.D. program in the analytical division of the chemistry department. His research interest is in the application of various chemometrics techniques to optimize multivariate calibration procedures.