
Analytica Chimica Acta 391 (1999) 203–225

Comparison of alternative measurement methods


Siriporn Kuttatharmmakul, D. Luc Massart, Johanna Smeyers-Verbeke*
ChemoAC, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090, Brussel, Belgium
Received 19 February 1998; received in revised form 14 July 1998; accepted 14 July 1998

Abstract
A procedure to compare the performance (precision and bias) of an alternative measurement method and a reference method
is described in detail. It is based on ISO 5725-6, which has been adapted to the intralaboratory situation. This means
that the proposed approach does not evaluate the reproducibility, but considers the (operator+instrument+time)-different
intermediate precision and/or the time-different intermediate precision. A 4-factor nested design is used for the study. The
different variance estimates are calculated from the experimental data by ANOVA. The Satterthwaite
approximation is included to determine the number of degrees of freedom associated with the compound variances. Formulae
are given to determine the number of measurements required for the comparison, taking into account the acceptable bias,
the acceptable ratio between the precision parameters of the two methods, the significance level
and the probability of wrongly accepting an alternative method with an unacceptable performance. For the evaluation of
the bias, interval hypothesis testing is included as an alternative to point hypothesis testing. Two examples are given
as an illustration of the proposed approach. © 1999 Elsevier Science B.V. All rights reserved.
Keywords: Comparison; Alternative measurement method; Bias; Precision; Repeatability; Time-different intermediate precision;
(Operator+instrument+time)-different intermediate precision; Nested design; ANOVA; Satterthwaite approximation; Interval hypothesis
testing

1. Introduction
When a laboratory wants to replace an existing analytical method by a new method (e.g. because
the latter is cheaper or easier to use), it has to show that the new method performs at least as
well as the existing one. The performance (precision and bias) of both methods therefore has to be
compared. One of the most advanced guidelines for the comparison of two methods can be found in ISO
5725-6 [1]. However, the ISO guideline is based on interlaboratory studies and is therefore not
applicable in the intralaboratory situation. Indeed, within a single laboratory the reproducibility,
as evaluated by ISO, cannot be determined, but intermediate precision conditions, such as changes in
operator, equipment and time, should be considered since they contribute to the variability of
measurements performed in the laboratory.

*Corresponding author. Tel.: +32-2477-4737; fax: +32-2477-4735; e-mail: asmeyers@vub.vub.ac.be

In the ISO guideline the reference method is an international standard method that was studied in an
interlaboratory test program, and its precision ($\sigma^2$) is assumed to be known. This assumption is
reasonable since the precision is obtained from a large number of measurements. In the intralaboratory situation a

0003-2670/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved.
PII: S0003-2670(99)00115-4


S. Kuttatharmmakul et al. / Analytica Chimica Acta 391 (1999) 203–225

laboratory has developed a first method and later on wishes to compare a new method to the older,
already internally validated method. For the latter, referred to as the reference method, only an
estimate of the precision ($s^2$) will be available, since the precision is determined from a rather
limited number of measurements. This of course determines the statistical tests to be used in the
comparison of the performance characteristics of both methods.
Moreover, the ISO standard is meant to show that both methods have similar precision and/or trueness,
whereas a laboratory that performs a method comparison study is interested in evaluating whether the new
method is at least as good as the reference method. This implies that some two-sided statistical tests
included in the ISO guideline are not appropriate for the comparison of two methods in a single
laboratory, where, for example, one-sided tests have to be considered in the evaluation of the precision.
In the decision making concerning the new alternative method it is important (i) not to reject an
alternative method which in fact is appropriate, and (ii) not to accept an alternative method which in
fact is not appropriate. The former is related to the α-error of the statistical tests used in the
comparison and is controlled through the selection of the significance level. The latter is related to
the β-error and, when it is considered, it is generally taken into account by including sample size
calculations. This approach is also included in the ISO guideline.
In this article we propose an adaptation of the ISO guideline to the intralaboratory comparison of two
methods. It is also applicable to the situation in which two laboratories of, e.g., the same
organisation are involved, each laboratory being specialized in one of the methods. For the evaluation
of the bias, in addition to point hypothesis testing, interval hypothesis testing [2], in which the
probability of accepting a method that is too much biased is controlled, is also included.
Due to the specified acceptance criteria for the alternative method, the proposed approach might
require a large number of measurements. An alternative approach (which will be described in a
subsequent article) is to perform the method comparison with a user-defined number of measurements and
to evaluate the probability that a method with an unacceptable performance will be accepted.

2. Methods
All symbols and abbreviations used in this paper are defined in Table 1.
2.1. Experimental design
A 4-factor nested experimental design is used [3–7]. This design is also one of the designs
recommended by ISO [3]. The schematic layout of the design is given in Fig. 1. The four factors
represent four sources of variation that contribute to the variability of the measurements within one
laboratory. The factors considered are operator, instrument, time, and random error. The experimental
approach can be described as follows. For each analytical method, the sample is analysed by m
operators. Each operator performs, on each of q instruments, n replicated measurements on each of p
different days. To avoid an underestimation of the day effect, the sets of p days used for the
different instruments must not overlap, i.e. two instruments cannot be operated on the same day.

Fig. 1. Schematic layout for the 4-factor nested experimental design applied. Only the nested
structure under the ith operator, jth instrument and the kth day is shown here. The nested structure
under other operators, instruments and days has the same pattern (instru = instrument, rep = replicate).


Table 1
Definition of symbols and abbreviations applied in the document

Symbol: Definition
$d$: Absolute difference between the grand means obtained with two methods
$D$: Component of day effect in a test result
$E$: Random error component occurring in every test result
$F_{I(OIT)}$: Calculated F-value obtained from the comparison of (operator+instrument+time)-different intermediate precision (variance)
$F_{I(T)}$: Calculated F-value obtained from the comparison of time-different intermediate precision (variance)
$F_r$: Calculated F-value obtained from the comparison of repeatability variance
$F_{\alpha(\nu_B,\nu_A)}$: Value of the F-distribution with $\nu_B$ degrees of freedom associated with the numerator and $\nu_A$ degrees of freedom associated with the denominator; $\alpha$ represents the portion of the F-distribution to the right of the given F-value
$F_{\beta(\nu_A,\nu_B)}$: Value of the F-distribution with $\nu_A$ degrees of freedom associated with the numerator and $\nu_B$ degrees of freedom associated with the denominator; $\beta$ represents the portion of the F-distribution to the right of the given F-value
$I$: Component of instrumental effect in a test result
$m$: Number of operators
$M$: General mean (expectation) of the test results
$MS$: Mean squares
$n$: Number of replicates performed on each day
$\bar{n}$: Average number of replicates performed on each day
$N$: Total number of measurements
$O$: Component of operator effect in a test result
$p$: Number of days
$q$: Number of instruments
$s$: Estimate of $\sigma$
$s^2$: Estimate of $\sigma^2$
$t_{cal}$: Calculated t-value obtained from the comparison of the means obtained with two methods
$t_{\alpha/2}$: Two-sided tabulated t-value at significance level $\alpha$ and degrees of freedom $\nu$
$t_{\alpha}$ or $t_{\beta}$: One-sided tabulated t-value at the significance level given by the subscript and degrees of freedom $\nu$
UCL: Upper confidence limit
$y$: Test result
$\bar{\bar{y}}$: Grand mean of test results
$\bar{y}_i$: Arithmetic mean of the test results obtained from the ith operator
$\bar{y}_{ij}$: Arithmetic mean of the test results obtained from the ith operator and the jth instrument
$\bar{y}_{ijk}$: Arithmetic mean of the test results obtained from the ith operator, the jth instrument and the kth day
$y_{ijkL}$: Particular test result related to the Lth replicate of the kth day, the jth instrument and the ith operator
$z_{\alpha/2}$: Two-sided tabulated z-value of the standard normal distribution at significance level $\alpha$
$\alpha$: Significance level (type I error probability)
$\beta$: Type II error probability
$\lambda$: Detectable difference between the means obtained from the two methods
$\nu$: Number of degrees of freedom
$\rho$: Detectable ratio between the repeatability standard deviations of method B and method A
$\sigma$: True value of a standard deviation
$\sigma^2$: True value of variance
$\varphi_{I(OIT)}$: Detectable ratio between the square roots of the (operator+instrument+day) mean squares (or the (operator+instrument+time)-different intermediate precision (standard deviation)) of method B and method A
$\varphi_{I(T)}$: Detectable ratio between the square roots of the between-day mean squares (or the time-different intermediate precision (standard deviation)) of method B and method A

Symbols used as superscripts and subscripts
A: Method A
B: Method B
d: Difference between the grand means obtained with two methods
D: Between-day
E: Residual
I: Between-instrument
I(T): Time-different intermediate precision
I(OIT): (Operator+instrument+time)-different intermediate precision
i: Index for a particular operator
j: Index for a particular instrument
k: Index for a particular day
L: Index for a particular test result performed by the ith operator, on the jth instrument and kth day
m: Number of operators
$n_{ijk}$: Number of replicates performed by the ith operator on the jth instrument and kth day
$p_{ij}$: Number of days performed by the ith operator on the jth instrument
$q_i$: Number of instruments performed by the ith operator
O: Between-operator
OID: (Operator+instrument+day)
r: Repeatability

2.2. Basic statistical model


To understand the statistical approach that follows, it is necessary to briefly explain the basic
statistical model. More details can be found in [3].
Here, we assume that every test result y obtained with a particular analytical method is the sum of
five components

$y = M + O + I + D + E,$    (1)

where M is the general mean (expectation) of the test results, O the random effect caused by changing
the operator, I the random effect caused by changing the instrument, D the random effect caused by the
fact that measurements are performed on different days, and E the random error occurring in every
measurement under repeatability conditions.
These four factors (operator, instrument, time, and random error under repeatability conditions) are
selected for our approach, since they are the main sources that contribute to the variability of the
measurements within a laboratory. The precision of the method is then determined by the contribution
from the variance ($\sigma^2$) of each factor, i.e. $\sigma_O^2$, $\sigma_I^2$, $\sigma_D^2$ and
$\sigma_E^2$, which are estimated as $s_O^2$, $s_I^2$, $s_D^2$ and $s_E^2$, respectively. Since these
variance components can be assumed to be independent, the estimate of the overall precision parameter,
also called the (operator+instrument+time)-different intermediate precision $s_{I(OIT)}^2$, can be
obtained as the sum of all variance components: $s_O^2 + s_I^2 + s_D^2 + s_E^2$. It is an estimate of
the variance of an individual measurement made by an arbitrary operator on an arbitrary instrument.
When the analyses in the laboratory are performed by the same operator on a single instrument, the
overall precision corresponds to the time-different intermediate precision, which is obtained as
$s_{I(T)}^2 = s_D^2 + s_E^2$. The intermediate precision is useful for indicating the ability of the
analytical method to repeat the test result under the defined conditions.
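As a small numerical illustration of the two sums above, the following Python sketch combines four variance component estimates into the time-different and the (operator+instrument+time)-different intermediate precision. The function name and the numerical values are hypothetical and are not taken from the paper.

```python
def intermediate_precisions(s2_o, s2_i, s2_d, s2_e):
    """Combine variance component estimates (all expressed as variances).

    Returns (s2_I(T), s2_I(OIT)):
      s2_I(T)   = s2_D + s2_E                  (same operator, same instrument)
      s2_I(OIT) = s2_O + s2_I + s2_D + s2_E    (arbitrary operator and instrument)
    """
    s2_it = s2_d + s2_e
    s2_ioit = s2_o + s2_i + s2_d + s2_e
    return s2_it, s2_ioit


# Hypothetical variance components, for illustration only.
s2_it, s2_ioit = intermediate_precisions(s2_o=0.02, s2_i=0.01, s2_d=0.04, s2_e=0.09)
```

The (operator+instrument+time)-different estimate can never be smaller than the time-different one, since it adds two non-negative components.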
2.3. Calculation of the variance estimates
In analogy with the ISO guidelines [3], the different variance estimates are calculated by
ANOVA (see Table 2). If the numbers of replicates per day ($n_{ijk}$), as well as the numbers of
instruments used by each operator ($q_i$), are equal for all i = 1, 2, ..., m, j = 1, 2, ..., q and
k = 1, 2, ..., $p_{ij}$, the calculation simplifies as shown in Table 3. The number of days ($p_{ij}$)
might not be constant for different operators and instruments if the detection of outlying day means
$\bar{y}_{ijk}$ leads to the rejection of some data. If, however, $p_{ij}$ is equal for all i and j,
then the terms $\sum_{i=1}^{m}\sum_{j=1}^{q} p_{ij}$ and $\sum_{i=1}^{m}\sum_{j=1}^{q} (p_{ij} - 1)$
which appear in Table 3 are simply replaced by $mqp$ and $mq(p - 1)$, respectively. Throughout the rest
of the text the calculations as represented in Table 3 will be considered.
No calculation is given in the ANOVA tables for the individual variance component for operators
$s_O^2$ or for the individual variance component for instruments $s_I^2$. Since the number of
operators and instruments within a single laboratory is generally limited, a small number of degrees
of freedom associated with the variance components $s_O^2$ and $s_I^2$ is to be expected.
Consequently, poor estimates for $s_O^2$ and $s_I^2$ will be obtained. Therefore, besides the
time-different intermediate precision $s_{I(T)}^2$, the (operator+instrument+time)-different
intermediate precision $s_{I(OIT)}^2$ is estimated as shown in Table 3. This estimate includes the
calculation of $MS_{OID}$, which is obtained from the sum of squared differences between

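Table 2
Calculation of the variance components (ANOVA table)

Source: (Operator+instrument+day)
Mean squares: $MS_{OID} = \dfrac{\sum_{i=1}^{m}\sum_{j=1}^{q_i}\sum_{k=1}^{p_{ij}} n_{ijk}\,(\bar{y}_{ijk} - \bar{\bar{y}})^2}{\sum_{i=1}^{m}\sum_{j=1}^{q_i} p_{ij} - 1}$
Estimate of: $\sigma_r^2 + \bar{n}\,\sigma_{OID}^2$

Source: Day
Mean squares: $MS_D = \dfrac{\sum_{i=1}^{m}\sum_{j=1}^{q_i}\sum_{k=1}^{p_{ij}} n_{ijk}\,(\bar{y}_{ijk} - \bar{y}_{ij})^2}{\sum_{i=1}^{m}\sum_{j=1}^{q_i} (p_{ij} - 1)}$
Estimate of: $\sigma_r^2 + \bar{n}\,\sigma_D^2$

Source: Residual
Mean squares: $MS_E = \dfrac{\sum_{i=1}^{m}\sum_{j=1}^{q_i}\sum_{k=1}^{p_{ij}}\sum_{L=1}^{n_{ijk}} (y_{ijkL} - \bar{y}_{ijk})^2}{\sum_{i=1}^{m}\sum_{j=1}^{q_i}\sum_{k=1}^{p_{ij}} (n_{ijk} - 1)}$
Estimate of: $\sigma_r^2$

with
$\bar{y}_{ijk} = \dfrac{\sum_{L=1}^{n_{ijk}} y_{ijkL}}{n_{ijk}}$;
$\bar{y}_{ij} = \dfrac{\sum_{k=1}^{p_{ij}} n_{ijk}\,\bar{y}_{ijk}}{\sum_{k=1}^{p_{ij}} n_{ijk}}$;
$\bar{\bar{y}} = \dfrac{\sum_{i=1}^{m}\sum_{j=1}^{q_i}\sum_{k=1}^{p_{ij}} n_{ijk}\,\bar{y}_{ijk}}{N}$;
$\bar{n} = \dfrac{N - \sum_{i=1}^{m}\sum_{j=1}^{q_i}\sum_{k=1}^{p_{ij}} n_{ijk}^2 / N}{\sum_{i=1}^{m}\sum_{j=1}^{q_i} p_{ij} - 1}$;
$N = \sum_{i=1}^{m}\sum_{j=1}^{q_i}\sum_{k=1}^{p_{ij}} n_{ijk}$ = total number of measurements.

Calculation of the variance estimates
The repeatability variance: $s_r^2 = MS_E$; $\nu_r = \sum_{i=1}^{m}\sum_{j=1}^{q_i}\sum_{k=1}^{p_{ij}} (n_{ijk} - 1)$
The between-day variance component: $s_D^2 = \dfrac{MS_D - MS_E}{\bar{n}}$; if $s_D^2 < 0$, set $s_D^2 = 0$
The (operator+instrument+day) variance component: $s_{OID}^2 = \dfrac{MS_{OID} - MS_E}{\bar{n}}$; if $s_{OID}^2 < 0$, set $s_{OID}^2 = 0$
Time-different intermediate precision (variance): $s_{I(T)}^2 = s_D^2 + s_r^2 = \dfrac{MS_D + (\bar{n} - 1)\,MS_E}{\bar{n}}$
(Operator+instrument+time)-different intermediate precision (variance): $s_{I(OIT)}^2 = s_{OID}^2 + s_r^2 = \dfrac{MS_{OID} + (\bar{n} - 1)\,MS_E}{\bar{n}}$
Variance of the day means $\bar{y}_{ijk}$: $s_{\bar{y}_{ijk}}^2 = \dfrac{\sum_{i=1}^{m}\sum_{j=1}^{q_i}\sum_{k=1}^{p_{ij}} (\bar{y}_{ijk} - \bar{\bar{y}})^2}{\sum_{i=1}^{m}\sum_{j=1}^{q_i} p_{ij} - 1} = \dfrac{MS_{OID}}{\bar{n}} = s_{OID}^2 + s_r^2/\bar{n}$

$n_{ijk}$ is the number of replicates on the kth day performed on the jth instrument by the ith operator (L = 1, 2, ..., $n_{ijk}$); $p_{ij}$ the number of days performed on the jth instrument by the ith
operator (k = 1, 2, ..., $p_{ij}$); $q_i$ the number of instruments performed by the ith operator (j = 1, 2, ..., $q_i$); m is the number of operators (i = 1, 2, ..., m).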

Table 3
Calculation of the variance components in case of equal $n_{ijk}$ and equal $q_i$ for all i = 1, 2, ..., m, j = 1, 2, ..., q and k = 1, 2, ..., $p_{ij}$. Only $p_{ij}$ may be unequal for different operators
and instruments, due to possible rejection of some discordant data (ANOVA table)

Source: (Operator+instrument+day)
Mean squares: $MS_{OID} = \dfrac{n \sum_{i=1}^{m}\sum_{j=1}^{q}\sum_{k=1}^{p_{ij}} (\bar{y}_{ijk} - \bar{\bar{y}})^2}{\sum_{i=1}^{m}\sum_{j=1}^{q} p_{ij} - 1}$
Estimate of: $\sigma_r^2 + n\,\sigma_{OID}^2$

Source: Day
Mean squares: $MS_D = \dfrac{n \sum_{i=1}^{m}\sum_{j=1}^{q}\sum_{k=1}^{p_{ij}} (\bar{y}_{ijk} - \bar{y}_{ij})^2}{\sum_{i=1}^{m}\sum_{j=1}^{q} (p_{ij} - 1)}$
Estimate of: $\sigma_r^2 + n\,\sigma_D^2$

Source: Residual
Mean squares: $MS_E = \dfrac{\sum_{i=1}^{m}\sum_{j=1}^{q}\sum_{k=1}^{p_{ij}}\sum_{L=1}^{n} (y_{ijkL} - \bar{y}_{ijk})^2}{(n - 1)\sum_{i=1}^{m}\sum_{j=1}^{q} p_{ij}}$
Estimate of: $\sigma_r^2$

with
$\bar{y}_{ijk} = \dfrac{\sum_{L=1}^{n} y_{ijkL}}{n}$;
$\bar{y}_{ij} = \dfrac{\sum_{k=1}^{p_{ij}} \bar{y}_{ijk}}{p_{ij}}$;
$\bar{\bar{y}} = \dfrac{\sum_{i=1}^{m}\sum_{j=1}^{q}\sum_{k=1}^{p_{ij}}\sum_{L=1}^{n} y_{ijkL}}{n \sum_{i=1}^{m}\sum_{j=1}^{q} p_{ij}}$.

Calculation of the variance estimates
The repeatability variance: $s_r^2 = MS_E$; $\nu_r = (n - 1)\sum_{i=1}^{m}\sum_{j=1}^{q} p_{ij}$
The between-day variance component: $s_D^2 = \dfrac{MS_D - MS_E}{n}$; if $s_D^2 < 0$, set $s_D^2 = 0$
The (operator+instrument+day) variance component: $s_{OID}^2 = \dfrac{MS_{OID} - MS_E}{n}$; if $s_{OID}^2 < 0$, set $s_{OID}^2 = 0$
Time-different intermediate precision (variance): $s_{I(T)}^2 = s_D^2 + s_r^2 = \dfrac{MS_D + (n - 1)\,MS_E}{n}$
(Operator+instrument+time)-different intermediate precision (variance): $s_{I(OIT)}^2 = s_{OID}^2 + s_r^2 = \dfrac{MS_{OID} + (n - 1)\,MS_E}{n}$
Variance of the day means $\bar{y}_{ijk}$: $s_{\bar{y}_{ijk}}^2 = \dfrac{\sum_{i=1}^{m}\sum_{j=1}^{q}\sum_{k=1}^{p_{ij}} (\bar{y}_{ijk} - \bar{\bar{y}})^2}{\sum_{i=1}^{m}\sum_{j=1}^{q} p_{ij} - 1} = \dfrac{MS_{OID}}{n} = s_{OID}^2 + s_r^2/n$

n is the number of replicates (L = 1, 2, ..., n); $p_{ij}$ the number of days performed on the jth instrument by the ith operator (k = 1, 2, ..., $p_{ij}$); q the number of instruments (j = 1, 2, ..., q);
m is the number of operators (i = 1, 2, ..., m).

the day means $\bar{y}_{ijk}$ and the grand mean $\bar{\bar{y}}$. This might result in an
underestimation of the effects of the instrument and the operator, since those factors are not changed
for every $\bar{y}_{ijk}$ obtained. However, this is the best possible approach to estimate the
intermediate precision $s_{I(OIT)}^2$ with small numbers of operators and instruments and, although it
might not adequately reflect the true precision, it is useful for comparison studies as long as the
numbers of operators and instruments are equal for the methods being compared.
The formulae in Table 3 for the between-day variance component $s_D^2$ and the
(operator+instrument+day) variance component $s_{OID}^2$ can yield negative values. For example, if
due to random effects $MS_D$ is smaller than $MS_E$, a negative value is obtained for $s_D^2$. In that
case, the negative variance estimates are set to 0. This is the usual practice, which is also followed
by ISO [8] if a negative value for the between-laboratory variance $s_L^2$ is obtained. Another
approach to deal with negative variance estimates is reported in [9]. It applies the method of pooling
minimal mean squares with predecessors.
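The calculations of Table 3 can be sketched in a few lines of Python. This is a minimal illustration, assuming equal numbers of replicates n on every day (the number of days per instrument may differ); the function name, the nested-list data layout and the example data are hypothetical and not taken from the paper.

```python
def nested_anova(y):
    """Variance estimates following Table 3 (equal n for every day).

    y[i][j][k] is the list of n replicate results obtained by operator i
    on instrument j on day k. Returns a dict of mean squares and estimates.
    """
    n = len(y[0][0][0])
    # Day means, with the same operator/instrument/day nesting as y.
    day_means = [[[sum(reps) / n for reps in inst] for inst in op] for op in y]
    all_days = [dm for op in day_means for inst in op for dm in inst]
    total_days = len(all_days)                    # sum of p_ij over i and j
    n_cells = sum(len(op) for op in day_means)    # number of (i, j) pairs
    grand_mean = sum(all_days) / total_days       # valid because n is constant

    ss_oid = n * sum((dm - grand_mean) ** 2 for dm in all_days)
    ss_d = 0.0
    for op in day_means:
        for inst in op:
            mean_ij = sum(inst) / len(inst)
            ss_d += n * sum((dm - mean_ij) ** 2 for dm in inst)
    ss_e = sum((v - dm) ** 2
               for op, op_m in zip(y, day_means)
               for inst, inst_m in zip(op, op_m)
               for reps, dm in zip(inst, inst_m)
               for v in reps)

    ms_oid = ss_oid / (total_days - 1)
    ms_d = ss_d / (total_days - n_cells)          # sum of (p_ij - 1)
    ms_e = ss_e / ((n - 1) * total_days)

    s2_r = ms_e
    s2_d = max((ms_d - ms_e) / n, 0.0)            # negative estimates set to 0
    s2_oid = max((ms_oid - ms_e) / n, 0.0)
    return {"MS_OID": ms_oid, "MS_D": ms_d, "MS_E": ms_e,
            "s2_r": s2_r, "s2_D": s2_d, "s2_OID": s2_oid,
            "s2_I(T)": s2_d + s2_r, "s2_I(OIT)": s2_oid + s2_r}


# Hypothetical data: 1 operator, 2 instruments, 2 days each, 2 replicates per day.
data = [[[[10.0, 10.2], [10.4, 10.6]],
         [[9.8, 10.0], [10.2, 10.0]]]]
result = nested_anova(data)
```

Computing the intermediate precisions as sums of the truncated components keeps them consistent with the rule that negative variance estimates are set to 0.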
2.4. Number of measurements
As mentioned earlier, the probability of accepting an alternative method which in fact is not
appropriate (β-error), because it is not precise enough or too much biased in comparison with the
reference method, can be controlled by determining the number of measurements required to detect a
certain bias as well as a certain difference in precision (if it exists). This implies that an
acceptable difference between the means of the two methods, as well as an acceptable ratio between the
precision parameters of the two methods, has to be specified. The former is called by ISO the
detectable difference between the biases of the two methods, $\lambda$, and is defined as the minimum
difference between the means of the two methods that the experimenter wishes to detect with high
probability. The latter is called by ISO the detectable ratio between the precision parameters of the
two methods. It is defined as the minimum ratio of precision parameters that the experimenter wishes
to detect with high probability from the results obtained with the two methods. In analogy with what
is given in ISO, the detectable ratios to be considered in the intralaboratory situation are:

$\rho = \dfrac{\sigma_{rB}}{\sigma_{rA}}$ for the comparison of repeatabilities;

$\varphi_{I(T)} = \sqrt{\dfrac{MS_{DB}}{MS_{DA}}}$ for the comparison of time-different intermediate precisions;

$\varphi_{I(OIT)} = \sqrt{\dfrac{MS_{OIDB}}{MS_{OIDA}}}$ for the comparison of (operator+instrument+time)-different intermediate precisions.

Due to the complexity in the determination of the degrees of freedom associated with $s_{I(T)}^2$ and
$s_{I(OIT)}^2$ (see further), the detectable ratios $\varphi_{I(T)}$ and $\varphi_{I(OIT)}$ are given
in terms of the mean squares.
It is recommended to use a significance level of $\alpha = 0.05$ in the comparison of the precision
parameters and the means ($\alpha$ represents the probability that the alternative method B is
rejected when in fact its performance is not worse than that of the reference method A). ISO
recommends that the risk of failing to detect the chosen minimum ratio of standard deviations or the
minimum difference between the means is set at $\beta = 0.05$. For the intralaboratory situation this
might be too stringent and therefore $\beta = 0.05$ as well as $\beta = 0.2$ will be considered. The
latter is inspired by the requirement in bioequivalence studies [10], where it is demanded that the
statistical tests have 80% power (power = $100(1 - \beta)$).
2.4.1. Determination of the minimum number of measurements required for the detection of $\lambda$
In the ISO document [1], the precision ($\sigma^2$) is assumed to be known and the repeatability
variance as well as the between-laboratory variance is included in the calculation of the optimal
number of measurements. In what follows, this is adapted to the situation in which only an estimate of
the precision ($s^2$) is available. This requires the use of t-values instead of the z-values applied
in ISO. Moreover, the repeatability variance as well as the (operator+instrument+day) variance
component is considered.
The following equation is used for the determination of the minimum number of measurements
required for the detection of $\lambda$.

$\lambda \ge (t_{\alpha/2} + t_{\beta}) \sqrt{\dfrac{(m_A q_A p_A - 1)(s_{OIDA}^2 + s_{rA}^2/n_A) + (m_B q_B p_B - 1)(s_{OIDB}^2 + s_{rB}^2/n_B)}{m_A q_A p_A + m_B q_B p_B - 2} \left(\dfrac{1}{m_A q_A p_A} + \dfrac{1}{m_B q_B p_B}\right)},$    (2)

where the subscripts A and B refer to method A and method B, respectively; $t_{\alpha/2}$ is the
two-sided tabulated t-value at significance level $\alpha$ and degrees of freedom
$\nu = m_A q_A p_A + m_B q_B p_B - 2$; and $t_{\beta}$ is the one-sided tabulated t-value at
significance level $\beta$ and degrees of freedom $\nu = m_A q_A p_A + m_B q_B p_B - 2$.
This expression is based on the t-test for the comparison of two means and therefore assumes that the
precisions of both methods are equal. This assumption should be acceptable for an estimation of the
optimal number of measurements. If the precision of the alternative method B is unknown, which might
often be the case, it is substituted by the precision of the reference method A:

$\lambda \ge (t_{\alpha/2} + t_{\beta}) \sqrt{\dfrac{(m_A q_A p_A - 1)(s_{OIDA}^2 + s_{rA}^2/n_A) + (m_B q_B p_B - 1)(s_{OIDA}^2 + s_{rA}^2/n_B)}{m_A q_A p_A + m_B q_B p_B - 2} \left(\dfrac{1}{m_A q_A p_A} + \dfrac{1}{m_B q_B p_B}\right)},$    (3)

where $\lambda$ is the acceptable difference between the means, which one wants to detect with
$(1 - \beta)100\%$ confidence from a two-tailed t-test performed at the significance level $\alpha$.
The t-distribution of the non-zero mean difference is a non-central t-distribution. Therefore, instead
of $(t_{\alpha/2} + t_{\beta})$, the non-centrality parameter of the non-central t-distribution should
be used. An evaluation of the effect of approximating this by means of the central t-distribution
indicated that very similar results are obtained. Therefore the central t-distribution is used.
As indicated earlier, it is strongly recommended to have the same numbers of operators ($m_A = m_B$)
and instruments ($q_A = q_B$) for both methods. If, moreover, the number of days as well as the number
of replicates are taken the same for both methods, i.e. $p_A = p_B$ and $n_A = n_B$, Eq. (3)
simplifies to

$\lambda \ge (t_{\alpha/2} + t_{\beta}) \sqrt{\dfrac{2(s_{OIDA}^2 + s_{rA}^2/n_A)}{m_A q_A p_A}}.$    (4)

Generally, the number of operators ($m_A = m_B$) and instruments ($q_A = q_B$) will be fixed by
practical constraints. It is recommended that the number of replicates per day is equal to 2 (n = 2)
and to focus on the number of days required, since this will lead to a balanced design in which the
number of degrees of freedom associated with the repeatability is almost the same as the number of
degrees of freedom associated with the between-day component. Therefore, the minimum number required
is mostly determined only for the number of days $p_A$ (= $p_B$), which can be obtained by finding the
smallest value for $p_A$ that satisfies Eq. (4).
The equations above are only approximations, which could be further simplified by replacing
$(t_{\alpha/2} + t_{\beta})$ by a constant value. Indeed, for $\alpha = 0.05$ and $\beta = 0.05$,
$(t_{\alpha/2} + t_{\beta})$ varies between 3.6 ($\nu = \infty$) and 3.9 ($\nu = 14$, i.e.
m = q = p = 2) and therefore a constant value equal to 4 could be used. For $\alpha = 0.05$ and
$\beta = 0.2$, $(t_{\alpha/2} + t_{\beta})$ varies between 2.8 ($\nu = \infty$) and 3.0 ($\nu = 14$,
i.e. m = q = p = 2), thus a constant value of 3.0 could be applied. Eq. (4) then becomes

$\lambda \ge 4 \sqrt{\dfrac{2(s_{OIDA}^2 + s_{rA}^2/n_A)}{m_A q_A p_A}}$ when $\alpha = 0.05$ and $\beta = 0.05$;    (5)

$\lambda \ge 3 \sqrt{\dfrac{2(s_{OIDA}^2 + s_{rA}^2/n_A)}{m_A q_A p_A}}$ when $\alpha = 0.05$ and $\beta = 0.2$.    (6)
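Solving the simplified Eqs. (5) and (6) for the number of days gives $p_A \ge 2k^2(s_{OIDA}^2 + s_{rA}^2/n_A)/(m_A q_A \lambda^2)$, with k = 4 or k = 3. The Python sketch below applies this rearrangement; the function name and the numerical inputs are hypothetical, and the result is only as good as the constant-k approximation described above.

```python
import math


def min_days(lam, s2_oid, s2_r, m, q, n=2, k=4.0):
    """Smallest number of days p satisfying lam >= k * sqrt(2 * v / (m * q * p)),
    where v = s2_oid + s2_r / n is the variance of a day mean.

    k = 4 corresponds to alpha = 0.05, beta = 0.05 (Eq. (5));
    k = 3 corresponds to alpha = 0.05, beta = 0.2  (Eq. (6)).
    """
    var_day_mean = s2_oid + s2_r / n
    p_exact = 2.0 * k ** 2 * var_day_mean / (m * q * lam ** 2)
    return max(math.ceil(p_exact), 1)


# Hypothetical inputs: detectable difference 0.5, two operators, one instrument.
p_beta_005 = min_days(lam=0.5, s2_oid=0.05, s2_r=0.02, m=2, q=1, n=2, k=4.0)
p_beta_020 = min_days(lam=0.5, s2_oid=0.05, s2_r=0.02, m=2, q=1, n=2, k=3.0)
```

Relaxing the type II error probability from 0.05 to 0.2 lowers k and therefore the number of days required, which is the trade-off discussed in the text.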

2.4.2. Determination of the minimum number of measurements required for the detection of the
minimum ratio of precision parameters
In the ISO document [1], values of the minimum detectable ratio of the precision parameters
corresponding to the chosen degrees of freedom ($\nu_A$, $\nu_B$) are given for the significance level
$\alpha = 0.05$ and the power $(1 - \beta) = 0.95$. Since ISO applies a two-sided F-test to check
whether the two methods have different precision, these values are obtained based on a two-sided
F-test. In our approach, the objective is to demonstrate that the precision of the alternative method
B is at least as good as that of the reference method A. Therefore, a one-sided F-test is applied to
compare the precision of both methods. Consequently, the minimum ratio of precision parameters $\rho$
or $\varphi$ corresponding to the given values of ($\nu_A$, $\nu_B$, $\alpha$, $\beta$) can be
computed as

$\rho,\ \varphi_{I(T)}\ \text{or}\ \varphi_{I(OIT)} = \sqrt{F_{\alpha(\nu_B,\nu_A)} \cdot F_{\beta(\nu_A,\nu_B)}},$    (7)

where

$\nu_A = m_A q_A p_A (n_A - 1)$ and $\nu_B = m_B q_B p_B (n_B - 1)$ in case that $\rho$ is considered;    (8)

$\nu_A = m_A q_A (p_A - 1)$ and $\nu_B = m_B q_B (p_B - 1)$ in case that $\varphi_{I(T)}$ is considered;    (9)

$\nu_A = m_A q_A p_A - 1$ and $\nu_B = m_B q_B p_B - 1$ in case that $\varphi_{I(OIT)}$ is considered.    (10)

Tables 4 and 5 give the minimum ratios of precision parameters ($\rho$, $\varphi_{I(T)}$ or
$\varphi_{I(OIT)}$) as a function of the degrees of freedom $\nu_A$ and $\nu_B$ for
($\alpha = 0.05$, $\beta = 0.05$) and ($\alpha = 0.05$, $\beta = 0.2$), respectively. If the method
precision is known, $\nu = 200$ degrees of freedom can be used.
With $m_A = m_B$, $q_A = q_B$ and $n_A = n_B = 2$, the minimum numbers of days required for the
detection of the minimum ratio $\rho$, $\varphi_{I(T)}$ or $\varphi_{I(OIT)}$ can be obtained by first
finding the smallest values for the degrees of freedom ($\nu_A$ and $\nu_B$) that satisfy Eq. (7); the
associated minimum number of days can then be calculated from Eq. (8), Eq. (9) or Eq. (10), depending
on which precision parameter is considered. When the values of $\alpha$ and $\beta$ considered
correspond to those given in Table 4 or Table 5, the minimum values for the degrees of freedom are
directly obtained by looking for the tabulated $\rho$, $\varphi_{I(T)}$ or $\varphi_{I(OIT)}$ that is
closest to (preferably smaller than) the given detectable ratio and finding its associated numbers of
degrees of freedom ($\nu_A$, $\nu_B$).
The minimum number of measurements required is computed for the minimum difference $\lambda$, as well
as for the minimum ratios $\rho$, $\varphi_{I(T)}$ and $\varphi_{I(OIT)}$, and the largest value is
chosen to perform the method comparison.
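Once the required degrees of freedom have been read from Table 4 or Table 5, Eqs. (8)–(10) can be inverted to obtain the number of days. The following Python sketch does this by simple search; the function names and the `kind` labels are hypothetical conventions introduced for illustration only.

```python
def degrees_of_freedom(kind, m, q, p, n):
    """Degrees of freedom per method according to Eqs. (8)-(10).

    kind: "r"      repeatability, Eq. (8):  m*q*p*(n - 1)
          "I(T)"   time-different, Eq. (9): m*q*(p - 1)
          "I(OIT)" (operator+instrument+time)-different, Eq. (10): m*q*p - 1
    """
    if kind == "r":
        return m * q * p * (n - 1)
    if kind == "I(T)":
        return m * q * (p - 1)
    if kind == "I(OIT)":
        return m * q * p - 1
    raise ValueError("unknown precision parameter: " + kind)


def min_days_for_dof(kind, nu_required, m, q, n=2):
    """Smallest number of days p giving at least nu_required degrees of freedom."""
    if kind == "r" and n < 2:
        raise ValueError("repeatability needs at least 2 replicates per day")
    p = 1
    while degrees_of_freedom(kind, m, q, p, n) < nu_required:
        p += 1
    return p


# Hypothetical case: 2 operators, 1 instrument, duplicates, 14 degrees of freedom needed.
days_r = min_days_for_dof("r", 14, m=2, q=1, n=2)
days_it = min_days_for_dof("I(T)", 14, m=2, q=1, n=2)
days_ioit = min_days_for_dof("I(OIT)", 14, m=2, q=1, n=2)
```

The largest of the day counts obtained for the precision parameters of interest, together with the count obtained for the detectable difference between the means, would then determine the size of the study.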

Table 4
Values of $\rho(\nu_A, \nu_B, \alpha, \beta)$ or $\varphi_{I(T)}(\nu_A, \nu_B, \alpha, \beta)$ or $\varphi_{I(OIT)}(\nu_A, \nu_B, \alpha, \beta)$ for ($\alpha = 0.05$, $\beta = 0.05$)
Rows: $\nu_A$; columns: $\nu_B$

        5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    25    50   200
  5  5.05  4.66  4.40  4.22  4.08  3.97  3.88  3.81  3.75  3.70  3.66  3.62  3.59  3.56  3.54  3.52  3.43  3.27  3.15
  6  4.66  4.28  4.03  3.85  3.72  3.61  3.53  3.46  3.40  3.36  3.31  3.28  3.25  3.22  3.20  3.17  3.09  2.93  2.81
  7  4.40  4.03  3.79  3.61  3.48  3.38  3.29  3.23  3.17  3.12  3.08  3.05  3.02  2.99  2.96  2.94  2.86  2.70  2.59
  8  4.22  3.85  3.61  3.44  3.31  3.21  3.13  3.06  3.00  2.96  2.92  2.88  2.85  2.82  2.80  2.78  2.70  2.54  2.42
  9  4.08  3.72  3.48  3.31  3.18  3.08  3.00  2.93  2.88  2.83  2.79  2.75  2.72  2.70  2.67  2.65  2.57  2.41  2.29
 10  3.97  3.61  3.38  3.21  3.08  2.98  2.90  2.83  2.78  2.73  2.69  2.66  2.62  2.60  2.57  2.55  2.47  2.31  2.19
 11  3.88  3.53  3.29  3.13  3.00  2.90  2.82  2.75  2.70  2.65  2.61  2.58  2.55  2.52  2.49  2.47  2.39  2.23  2.11
 12  3.81  3.46  3.23  3.06  2.93  2.83  2.75  2.69  2.63  2.59  2.55  2.51  2.48  2.45  2.43  2.41  2.33  2.16  2.05
 13  3.75  3.40  3.17  3.00  2.88  2.78  2.70  2.63  2.58  2.53  2.49  2.46  2.42  2.40  2.37  2.35  2.27  2.11  1.99
 14  3.70  3.36  3.12  2.96  2.83  2.73  2.65  2.59  2.53  2.48  2.44  2.41  2.38  2.35  2.33  2.30  2.22  2.06  1.94
 15  3.66  3.31  3.08  2.92  2.79  2.69  2.61  2.55  2.49  2.44  2.40  2.37  2.34  2.31  2.29  2.26  2.18  2.02  1.90
 16  3.62  3.28  3.05  2.88  2.75  2.66  2.58  2.51  2.46  2.41  2.37  2.33  2.30  2.28  2.25  2.23  2.15  1.98  1.86
 17  3.59  3.25  3.02  2.85  2.72  2.62  2.55  2.48  2.42  2.38  2.34  2.30  2.27  2.24  2.22  2.20  2.12  1.95  1.83
 18  3.56  3.22  2.99  2.82  2.70  2.60  2.52  2.45  2.40  2.35  2.31  2.28  2.24  2.22  2.19  2.17  2.09  1.92  1.80
 19  3.54  3.20  2.96  2.80  2.67  2.57  2.49  2.43  2.37  2.33  2.29  2.25  2.22  2.19  2.17  2.15  2.06  1.90  1.77
 20  3.52  3.17  2.94  2.78  2.65  2.55  2.47  2.41  2.35  2.30  2.26  2.23  2.20  2.17  2.15  2.12  2.04  1.87  1.74
 25  3.43  3.09  2.86  2.70  2.57  2.47  2.39  2.33  2.27  2.22  2.18  2.15  2.12  2.09  2.06  2.04  1.96  1.78  1.65
 50  3.27  2.93  2.70  2.54  2.41  2.31  2.23  2.16  2.11  2.06  2.02  1.98  1.95  1.92  1.90  1.87  1.78  1.60  1.45
200  3.15  2.81  2.59  2.42  2.29  2.19  2.11  2.05  1.99  1.94  1.90  1.86  1.83  1.80  1.77  1.74  1.65  1.45  1.26


Table 5
Values of $\rho(\nu_A, \nu_B, \alpha, \beta)$ or $\varphi_{I(T)}(\nu_A, \nu_B, \alpha, \beta)$ or $\varphi_{I(OIT)}(\nu_A, \nu_B, \alpha, \beta)$ for ($\alpha = 0.05$, $\beta = 0.2$)
Rows: $\nu_A$; columns: $\nu_B$

        3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    25    50   200
  3  5.22  4.76  4.51  4.35  4.24  4.15  4.09  4.04  4.00  3.97  3.94  3.92  3.90  3.88  3.87  3.86  3.84  3.83  3.79  3.71  3.65
  4  4.41  3.98  3.74  3.59  3.49  3.41  3.35  3.30  3.26  3.23  3.21  3.18  3.17  3.15  3.13  3.12  3.11  3.10  3.06  2.98  2.92
  5  4.00  3.59  3.35  3.21  3.10  3.03  2.97  2.92  2.88  2.85  2.83  2.80  2.79  2.77  2.75  2.74  2.73  2.72  2.68  2.60  2.54
  6  3.76  3.35  3.12  2.97  2.87  2.79  2.74  2.69  2.65  2.62  2.60  2.57  2.55  2.54  2.52  2.51  2.50  2.49  2.45  2.37  2.31
  7  3.60  3.19  2.96  2.82  2.71  2.64  2.58  2.53  2.50  2.47  2.44  2.42  2.40  2.38  2.37  2.35  2.34  2.33  2.29  2.21  2.15
  8  3.48  3.08  2.85  2.70  2.60  2.53  2.47  2.42  2.38  2.35  2.33  2.30  2.28  2.27  2.25  2.24  2.23  2.22  2.17  2.09  2.03
  9  3.39  2.99  2.77  2.62  2.52  2.44  2.38  2.34  2.30  2.27  2.24  2.22  2.20  2.18  2.17  2.15  2.14  2.13  2.09  2.00  1.94
 10  3.32  2.92  2.70  2.55  2.45  2.38  2.32  2.27  2.23  2.20  2.17  2.15  2.13  2.11  2.10  2.08  2.07  2.06  2.02  1.93  1.87
 11  3.27  2.87  2.65  2.50  2.40  2.32  2.26  2.22  2.18  2.15  2.12  2.10  2.08  2.06  2.04  2.03  2.02  2.01  1.96  1.88  1.81
 12  3.23  2.83  2.60  2.46  2.36  2.28  2.22  2.17  2.14  2.10  2.08  2.05  2.03  2.01  2.00  1.98  1.97  1.96  1.92  1.83  1.76
 13  3.19  2.79  2.57  2.42  2.32  2.24  2.18  2.14  2.10  2.07  2.04  2.02  1.99  1.98  1.96  1.95  1.93  1.92  1.88  1.79  1.72
 14  3.16  2.76  2.54  2.39  2.29  2.21  2.15  2.11  2.07  2.03  2.01  1.98  1.96  1.94  1.93  1.91  1.90  1.89  1.85  1.76  1.69
 15  3.13  2.74  2.51  2.37  2.26  2.19  2.13  2.08  2.04  2.01  1.98  1.96  1.94  1.92  1.90  1.89  1.87  1.86  1.82  1.73  1.66
 16  3.11  2.71  2.49  2.34  2.24  2.16  2.10  2.06  2.02  1.98  1.96  1.93  1.91  1.89  1.88  1.86  1.85  1.84  1.79  1.70  1.63
 17  3.09  2.69  2.47  2.32  2.22  2.14  2.08  2.04  2.00  1.96  1.94  1.91  1.89  1.87  1.86  1.84  1.83  1.82  1.77  1.68  1.60
 18  3.07  2.68  2.45  2.31  2.20  2.13  2.07  2.02  1.98  1.95  1.92  1.89  1.87  1.85  1.84  1.82  1.81  1.80  1.75  1.66  1.58
 19  3.05  2.66  2.44  2.29  2.19  2.11  2.05  2.00  1.96  1.93  1.90  1.88  1.86  1.84  1.82  1.81  1.79  1.78  1.73  1.64  1.56
 20  3.04  2.65  2.42  2.28  2.17  2.10  2.04  1.99  1.95  1.91  1.89  1.86  1.84  1.82  1.80  1.79  1.78  1.76  1.72  1.62  1.55
 25  2.99  2.60  2.37  2.22  2.12  2.04  1.98  1.93  1.89  1.86  1.83  1.81  1.78  1.76  1.75  1.73  1.72  1.71  1.66  1.56  1.48
 50  2.89  2.49  2.27  2.12  2.02  1.94  1.88  1.83  1.78  1.75  1.72  1.69  1.67  1.65  1.63  1.62  1.60  1.59  1.54  1.43  1.33
200  2.81  2.42  2.20  2.05  1.94  1.86  1.80  1.75  1.70  1.67  1.64  1.61  1.59  1.56  1.55  1.53  1.51  1.50  1.44  1.32  1.19

2.5. Evaluation of test results


For each test sample, the following parameters are to be computed:
$s_{rA}^2$, $s_{rB}^2$: estimates of the repeatability variance for methods A and B, respectively;
$s_{DA}^2$, $s_{DB}^2$: estimates of the between-day variance component for methods A and B, respectively;
$s_{OIDA}^2$, $s_{OIDB}^2$: estimates of the (operator+instrument+day) variance component for methods A and B, respectively;
$s_{I(T)A}^2$, $s_{I(T)B}^2$: estimates of the time-different intermediate precision (variance) for methods A and B, respectively;
$s_{I(OIT)A}^2$, $s_{I(OIT)B}^2$: estimates of the (operator+instrument+time)-different intermediate precision (variance) for methods A and B, respectively;
$s_{\bar{y}_{ijk}A}^2$, $s_{\bar{y}_{ijk}B}^2$: estimates of the variance of the day means $\bar{y}_{ijk}$ for methods A and B, respectively;
$\bar{\bar{y}}_A$, $\bar{\bar{y}}_B$: grand means obtained from methods A and B, respectively.
The calculation of all these parameters is given in Tables 2 and 3.

2.5.1. Comparison of precision

As mentioned before, it is important to show that the precision of method B is at least as good as that of method A. Therefore a one-sided F-test is applied here instead of the two-sided test used in ISO [1]. The null hypothesis H0 is that the precision of the alternative method B is better than or equal to the precision of the reference method A ($H_0: \sigma^2_B \le \sigma^2_A$) and the alternative hypothesis H1 is that the precision of the alternative method B is worse than the precision of the reference method A ($H_1: \sigma^2_B > \sigma^2_A$).

2.5.1.1. Comparison of repeatability. To compare the repeatability of two methods, the sample statistic $F_r$ is calculated as follows:

$$F_r = \frac{s^2_{rB}}{s^2_{rA}} \qquad (11)$$

$F_r$ is then compared with $F_\alpha(\nu_{rB}, \nu_{rA})$, where $F_\alpha(\nu_{rB}, \nu_{rA})$ is the value of the F-distribution with degrees of freedom of the numerator $\nu_{rB} = (n_B - 1)\sum_{i=1}^{m_B}\sum_{j=1}^{q_B} p_{ijB}$ and the denominator $\nu_{rA} = (n_A - 1)\sum_{i=1}^{m_A}\sum_{j=1}^{q_A} p_{ijA}$; $\alpha$ represents the portion of the F-distribution to the right of the given value; $\alpha = 0.05$.
If $F_r > F_\alpha(\nu_{rB}, \nu_{rA})$, method B has worse repeatability than method A at the $(1-\alpha)100\%$ (i.e. 95%) confidence level and therefore the repeatability of the alternative method B is not acceptable.
If $F_r \le F_\alpha(\nu_{rB}, \nu_{rA})$, the repeatability of the alternative method B is acceptable, which means that it is at most a factor $\rho$ worse than that of method A.
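The repeatability comparison of Eq. (11) can be sketched in a few lines. This is a minimal illustration, not part of the original procedure: the function names are hypothetical, and the critical value is assumed to be read from an F-table (here the value 2.12 quoted in Example 1) rather than computed.

```python
def repeatability_df(n, days_per_cell):
    """nu_r = (n - 1) * sum of p_ij over all operator/instrument cells."""
    return (n - 1) * sum(sum(row) for row in days_per_cell)


def compare_repeatability(s2_r_b, s2_r_a, f_crit):
    """Eq. (11): F_r = s2_rB / s2_rA, judged against a tabulated one-sided F value."""
    f_r = s2_r_b / s2_r_a
    return f_r, f_r <= f_crit


# Example 1 figures: n = 2 replicates, 2 operators x 2 instruments, 5 days per cell.
nu_r = repeatability_df(2, [[5, 5], [5, 5]])
f_r, acceptable = compare_repeatability(1.4810, 1.2317, f_crit=2.12)
print(nu_r, round(f_r, 2), acceptable)
```

Passing the critical value in as an argument keeps the sketch free of any assumption about which statistical library is available.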
2.5.1.2. Comparison of time-different intermediate
precision. For the comparison of the time-different
intermediate precision, we need the number of degrees
of freedom associated with the precision estimates.
Since these estimates are not directly estimated from
the data but are calculated as a linear combination of
two mean squares, MSD and MSE (see Table 3), it is
not evident how to determine the number of degrees of
freedom associated with this compound variance. The
Satterthwaite approximation [11] (see further) can
then be used.
However, to avoid the complexity in the determination of the degrees of freedom associated with $s^2_{I(T)}$, the comparison of time-different intermediate precisions can be performed in an indirect way by comparing the day mean squares MSD [12]. Indeed, since (see Table 3)

$$MSD = n s^2_D + s^2_r \quad \text{and} \quad s^2_{I(T)} = s^2_D + s^2_r,$$

it follows that

$$MSD = n s^2_{I(T)} - n s^2_r + s^2_r = n s^2_{I(T)} - (n - 1) s^2_r.$$
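The identity above can be checked numerically. The sketch below is only an illustration with a hypothetical helper name; it uses the method A mean squares of Example 1 (MSD = 6.3345, MSE = 1.2317, n = 2) as input.

```python
def day_mean_square(n, s2_it, s2_r):
    """MSD rebuilt from the intermediate precision: n*s2_I(T) - (n - 1)*s2_r."""
    return n * s2_it - (n - 1) * s2_r


n, msd, mse = 2, 6.3345, 1.2317
s2_r = mse                 # repeatability variance
s2_d = (msd - mse) / n     # between-day variance component
s2_it = s2_d + s2_r        # time-different intermediate precision (variance)
rebuilt = day_mean_square(n, s2_it, s2_r)
print(round(s2_it, 4), round(rebuilt, 4))
```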
Therefore, provided that the repeatabilities of both methods are equal ($\sigma^2_{rA} = \sigma^2_{rB}$) and the number of replicates per day for both methods is equal ($n_A = n_B$), the day mean squares MSD are considered instead of $s^2_{I(T)}$. If $n_A = n_B$, the equality of the repeatabilities for both methods is first to be tested ($H_0: \sigma^2_{rA} = \sigma^2_{rB}$; $H_1: \sigma^2_{rA} \ne \sigma^2_{rB}$) by means of a two-sided F-test. The results obtained from the comparison of the repeatabilities in Section 2.5.1.1 cannot be used here, since a one-sided F-test has been considered to test the hypotheses $H_0: \sigma^2_{rB} \le \sigma^2_{rA}$; $H_1: \sigma^2_{rB} > \sigma^2_{rA}$. A non-significant test, which means that the repeatability of method B is acceptable, does not necessarily imply that the repeatabilities of both methods are equal; the repeatability of method B can be better (smaller) than the repeatability of method A. Therefore, the repeatabilities of both methods have to be compared again by applying a two-sided F-test. $F_r$ is obtained here as follows:

$$F_r = \frac{s^2_1}{s^2_2} \qquad (12)$$

with $s^2_1$ the largest of $s^2_{rA}$ and $s^2_{rB}$.

$F_r$ is then compared with $F_{\alpha/2}(\nu_{r1}, \nu_{r2})$, where $F_{\alpha/2}(\nu_{r1}, \nu_{r2})$ is the value of the F-distribution with degrees of freedom of the numerator $\nu_{r1}$ and the denominator $\nu_{r2}$; $\alpha/2$ represents the portion of the F-distribution to the right of the given value; $\alpha = 0.05$. Here $\nu_{r1} = \nu_{rA}$ and $\nu_{r2} = \nu_{rB}$ if $s^2_{rA} > s^2_{rB}$; $\nu_{r1} = \nu_{rB}$ and $\nu_{r2} = \nu_{rA}$ if $s^2_{rB} > s^2_{rA}$, with

$$\nu_{rB} = (n_B - 1)\sum_{i=1}^{m_B}\sum_{j=1}^{q_B} p_{ijB} \quad \text{and} \quad \nu_{rA} = (n_A - 1)\sum_{i=1}^{m_A}\sum_{j=1}^{q_A} p_{ijA}.$$

If $F_r \le F_{\alpha/2}(\nu_{r1}, \nu_{r2})$, there is no evidence that the two methods have different repeatabilities and therefore the equality of the repeatabilities for both methods can be assumed.
If $F_r > F_{\alpha/2}(\nu_{r1}, \nu_{r2})$, the repeatabilities of the two methods are significantly different at the significance level $\alpha$ (i.e. 5%). In that case the following simplified approach to compare the time-different intermediate precisions cannot be applied. The Satterthwaite approximation to estimate the number of degrees of freedom associated with $s^2_{I(T)}$ has then to be used (see further).
In the situation where the number of replicates per day for both methods is equal ($n_A = n_B$) and the equality of the repeatabilities for both methods can be assumed ($\sigma^2_{rA} = \sigma^2_{rB}$), the comparison of time-different intermediate precision is performed by calculating $F_{I(T)}$ as follows:

$$F_{I(T)} = \frac{MSD_B}{MSD_A} = \frac{n_B s^2_{DB} + s^2_{rB}}{n_A s^2_{DA} + s^2_{rA}} = \frac{n_B s^2_{I(T)B} - (n_B - 1) s^2_{rB}}{n_A s^2_{I(T)A} - (n_A - 1) s^2_{rA}}. \qquad (13)$$

$F_{I(T)}$ is compared with $F_\alpha(\nu_{I(T)B}, \nu_{I(T)A})$, where $F_\alpha(\nu_{I(T)B}, \nu_{I(T)A})$ is the value of the F-distribution with degrees of freedom of the numerator $\nu_{I(T)B} = \sum_{i=1}^{m_B}\sum_{j=1}^{q_B}(p_{ijB} - 1)$ and the denominator $\nu_{I(T)A} = \sum_{i=1}^{m_A}\sum_{j=1}^{q_A}(p_{ijA} - 1)$; $\alpha$ represents the portion of the F-distribution to the right of the given value; $\alpha = 0.05$.
If $F_{I(T)} > F_\alpha(\nu_{I(T)B}, \nu_{I(T)A})$, method B has worse time-different intermediate precision than method A at the $(1-\alpha)100\%$ (i.e. 95%) confidence level and therefore the time-different intermediate precision of the alternative method B is not acceptable.
If $F_{I(T)} \le F_\alpha(\nu_{I(T)B}, \nu_{I(T)A})$, the time-different intermediate precision of the alternative method B is acceptable, which means that it is at most a factor $\rho_{I(T)}$ worse than that of method A.
In the situation where the number of replicates per day for both methods is not equal ($n_A \ne n_B$) or the equality of the repeatabilities for both methods cannot be assumed ($\sigma^2_{rA} \ne \sigma^2_{rB}$), the comparison of the time-different intermediate precisions cannot be performed by Eq. (13) but must be investigated through the comparison of $s^2_{I(T)}$. Then $F_{I(T)}$ is calculated as follows:

$$F_{I(T)} = \frac{s^2_{I(T)B}}{s^2_{I(T)A}} \qquad (14)$$

As mentioned earlier, the number of degrees of freedom associated with $s^2_{I(T)}$ for both methods is obtained from the Satterthwaite approximation:

$$\nu_{I(T)B} = \frac{\left(s^2_{I(T)B}\right)^2}{\dfrac{(MSD_B/n_B)^2}{\sum_{i=1}^{m_B}\sum_{j=1}^{q_B}(p_{ijB} - 1)} + \dfrac{\left[(n_B - 1)MSE_B/n_B\right]^2}{(n_B - 1)\sum_{i=1}^{m_B}\sum_{j=1}^{q_B} p_{ijB}}} \qquad (15)$$

$$\nu_{I(T)A} = \frac{\left(s^2_{I(T)A}\right)^2}{\dfrac{(MSD_A/n_A)^2}{\sum_{i=1}^{m_A}\sum_{j=1}^{q_A}(p_{ijA} - 1)} + \dfrac{\left[(n_A - 1)MSE_A/n_A\right]^2}{(n_A - 1)\sum_{i=1}^{m_A}\sum_{j=1}^{q_A} p_{ijA}}} \qquad (16)$$

When a non-integer value is obtained for $\nu_{I(T)}$, round the number down to the nearest integer.
The comparison of $F_{I(T)}$ with $F_\alpha(\nu_{I(T)B}, \nu_{I(T)A})$, where $\nu_{I(T)B}$ and $\nu_{I(T)A}$ are computed from Eqs. (15) and (16), respectively, is then performed in the same way as mentioned earlier when Eq. (13) is applied.

2.5.1.3. Comparison of (operator+instrument+time)-different intermediate precision. In analogy with the comparison of the time-different intermediate precision, the comparison of (operator+instrument+time)-different intermediate precision can be performed in an indirect way by comparing the (operator+instrument+day)-mean squares MSOID. If the number of replicates per day for both methods is equal ($n_A = n_B$) and the comparison of the repeatabilities in Eq. (12) does not give evidence against the equality of the repeatabilities of both methods (i.e. $\sigma^2_{rA} = \sigma^2_{rB}$), $F_{I(O+I+T)}$ is calculated as follows:

$$F_{I(O+I+T)} = \frac{MSOID_B}{MSOID_A} = \frac{n_B s^2_{OIDB} + s^2_{rB}}{n_A s^2_{OIDA} + s^2_{rA}} = \frac{n_B s^2_{I(O+I+T)B} - (n_B - 1) s^2_{rB}}{n_A s^2_{I(O+I+T)A} - (n_A - 1) s^2_{rA}}. \qquad (17)$$

The comparison of $F_{I(O+I+T)}$ with $F_\alpha(\nu_{I(O+I+T)B}, \nu_{I(O+I+T)A})$, where $\nu_{I(O+I+T)B} = \sum_{i=1}^{m_B}\sum_{j=1}^{q_B} p_{ijB} - 1$ and $\nu_{I(O+I+T)A} = \sum_{i=1}^{m_A}\sum_{j=1}^{q_A} p_{ijA} - 1$, is then performed in analogy with the comparison of the time-different intermediate precision mentioned earlier when Eq. (13) is considered.
In the situation where the number of replicates per day for both methods is not equal ($n_A \ne n_B$) or the equality of the repeatabilities of both methods cannot be assumed ($\sigma^2_{rA} \ne \sigma^2_{rB}$), the comparison of (operator+instrument+time)-different intermediate precision must be investigated through the comparison of $s^2_{I(O+I+T)}$. Then $F_{I(O+I+T)}$ is calculated as follows:

$$F_{I(O+I+T)} = \frac{s^2_{I(O+I+T)B}}{s^2_{I(O+I+T)A}} \qquad (18)$$

Again, the further steps to compare $F_{I(O+I+T)}$ with $F_\alpha(\nu_{I(O+I+T)B}, \nu_{I(O+I+T)A})$, where $\nu_{I(O+I+T)B}$ and $\nu_{I(O+I+T)A}$ are computed from Eqs. (19) and (20), respectively, are in analogy with the comparison of the time-different intermediate precision mentioned earlier when Eq. (13) is considered.

$$\nu_{I(O+I+T)B} = \frac{\left(s^2_{I(O+I+T)B}\right)^2}{\dfrac{(MSOID_B/n_B)^2}{\sum_{i=1}^{m_B}\sum_{j=1}^{q_B} p_{ijB} - 1} + \dfrac{\left[(n_B - 1)MSE_B/n_B\right]^2}{(n_B - 1)\sum_{i=1}^{m_B}\sum_{j=1}^{q_B} p_{ijB}}}; \qquad (19)$$

$$\nu_{I(O+I+T)A} = \frac{\left(s^2_{I(O+I+T)A}\right)^2}{\dfrac{(MSOID_A/n_A)^2}{\sum_{i=1}^{m_A}\sum_{j=1}^{q_A} p_{ijA} - 1} + \dfrac{\left[(n_A - 1)MSE_A/n_A\right]^2}{(n_A - 1)\sum_{i=1}^{m_A}\sum_{j=1}^{q_A} p_{ijA}}}. \qquad (20)$$

2.5.1.4. Comment concerning the ISO 5725-6 procedure. In ISO [1], the comparison of the overall precision (a term which is not really explained) is performed in analogy with Eq. (17) without pre-evaluating the equality of the repeatabilities of both methods. The fact that the same number of replicates per laboratory for the two methods ($n_A = n_B$) is required is not taken into consideration either. If the overall precision refers to the reproducibility, the indirect comparison of the latter for both methods by a comparison of the variance of the laboratory means is only possible if the repeatability and the number of replicates per laboratory (n) are the same for both methods. If this is not the case, a direct comparison of the reproducibility obtained with the two methods should be performed, possibly in analogy with Eq. (18).
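The Satterthwaite approximation of Eqs. (15) and (16) can be sketched as follows. This is an illustrative helper with hypothetical names, not code from the original work; it takes the total number of days (the double sum of $p_{ij}$) and the number of operator/instrument cells as inputs, and rounds down as prescribed above. The inputs are the mean squares of Example 2.

```python
import math


def satterthwaite_df_it(msd, mse, n, total_days, n_cells=1):
    """Degrees of freedom of s2_I(T) per Eqs. (15)/(16), rounded down."""
    df_day = total_days - n_cells        # sum over cells of (p_ij - 1)
    df_res = (n - 1) * total_days        # degrees of freedom of MSE
    a = msd / n                          # first term of s2_I(T)
    b = (n - 1) * mse / n                # second term of s2_I(T)
    s2_it = a + b                        # equals s2_D + s2_r
    nu = s2_it ** 2 / (a ** 2 / df_day + b ** 2 / df_res)
    return math.floor(nu)


# Example 2: one operator, one instrument, 11 days, 2 replicates per day.
nu_b = satterthwaite_df_it(0.0397, 0.0024, n=2, total_days=11)
nu_a = satterthwaite_df_it(0.2050, 0.0239, n=2, total_days=11)
print(nu_b, nu_a)
```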

2.5.2. Evaluation of the bias

2.5.2.1. Comments concerning the ISO 5725-6 procedure. The evaluation of the bias is performed by comparing the grand means obtained with both methods. In ISO [1] the comparison is based on the z-test, since the sample statistic is compared with 2 (an approximation of the two-sided tabulated z-value of $z_\alpha = z_{0.05} = 1.96$). This implies that the estimated standard deviations used in the comparison are obtained from large samples and therefore that they are sufficiently good estimates of the true standard deviations. If the sample statistic is larger than 2, the difference between the means obtained with the two methods is statistically significant. In that case, to avoid the rejection of the method with an acceptable bias, it is further examined whether the estimated bias d can be considered acceptable. ISO concludes that the bias is significant, but acceptable if its absolute value is not larger than $\lambda/2$.
However, this approach is questionable for the following reasons. ISO specifies $\lambda$ to be four times $\sigma_d$, where $\sigma_d$ represents the standard deviation of the difference between the means of methods A and B. This is obtained from $\lambda = (z_{\alpha/2} + z_\beta)\sigma_d$ which with $\alpha = \beta = 0.05$ becomes $(1.96 + 1.645)\sigma_d \approx 4\sigma_d$. In the evaluation of the bias $\sigma_d$ is estimated from the experiments as $s_d$. If the estimated bias $d = |\bar{y}_A - \bar{y}_B| = \lambda/2$, the 95% confidence interval (CI) around d can be calculated as

$$\lambda/2 \pm 1.96 s_d \quad \text{or} \quad 2\sigma_d \pm 2 s_d.$$

If $s_d$ is exactly equal to $\sigma_d$, the lower limit of the confidence interval is equal to 0 and the upper limit is equal to $\lambda$ (see Fig. 2(a)). Since 0 is just included in the CI, the bias is not significantly different from zero. (Evaluation of the bias by means of the z-test, as done in ISO, would of course lead to the same conclusion since the test statistic $d/s_d = 2\sigma_d/s_d = 2$.) Moreover, the probability that the true absolute bias, as estimated by $d = \lambda/2$, exceeds $\lambda$ is only 2.5%. ISO considers a significant bias to be acceptable if d is smaller than $\lambda/2$. However, if $s_d$ equals $\sigma_d$ there is no point in comparing d with $\lambda/2$ since with d larger than $\lambda/2$ there is no chance that the significant difference can be acceptable.
If $s_d$ is different from $\sigma_d$, as is to be expected, the comparison can lead to wrong conclusions. Indeed if $s_d$ is smaller than $\sigma_d$, which will be the case, e.g., if the acceptance criteria for the precision measure are defining the number of measurements to be performed, considering the bias to be unacceptable if d is larger than $\lambda/2$ can lead to the rejection of a method with an acceptable bias (see Fig. 2(b)). On the other hand if $s_d$ is larger than $\sigma_d$, an unacceptable bias can lead to a non-significant test and therefore to acceptance of the method (see Fig. 2(c)).
Therefore, it is more appropriate to compare the one-sided upper 95% confidence limit around d with $\lambda$ to conclude on the acceptability of the method (see further).

Fig. 2. Different situations of a bias evaluation. d: the estimated absolute difference between the grand means obtained with the two methods; $\lambda$: the acceptable bias; ($\leftrightarrow$) 95% confidence interval. (a) $s_d = \sigma_d$ and $d = \lambda/2$; (b) $s_d < \sigma_d$ and $d > \lambda/2$; (c) $s_d > \sigma_d$ and $d < \lambda/2$.

2.5.2.2. Adapted approach. In our approach, which is intended for the intralaboratory situation, the standard deviations are generally estimated from a relatively small sample size, and therefore the t-test is more appropriate than the z-test. A two-sided test ($H_0: \mu_A = \mu_B$; $H_1: \mu_A \ne \mu_B$) is considered since the difference between the two means can be positive as well as negative. Therefore

$$t_{cal} = \frac{|\bar{y}_A - \bar{y}_B|}{s_d}, \qquad (21)$$

where $s_d$ represents the estimated standard deviation of the differences between the means obtained with the two methods.
The use of the t-test requires that the variances of the day means obtained with the two methods are equal (i.e. $\sigma^2_{\bar{y}_{ijk}A} = \sigma^2_{\bar{y}_{ijk}B}$). This equality must first be tested by applying a two-sided F-test. The degrees of freedom associated with $s^2_{\bar{y}_{ijk}A}$ and $s^2_{\bar{y}_{ijk}B}$ are $\sum_{i=1}^{m_A}\sum_{j=1}^{q_A} p_{ijA} - 1$ and $\sum_{i=1}^{m_B}\sum_{j=1}^{q_B} p_{ijB} - 1$, respectively.
If there is no evidence against the equality of $\sigma^2_{\bar{y}_{ijk}A}$ and $\sigma^2_{\bar{y}_{ijk}B}$, $s_d$ is calculated by applying the pooled variance $s^2_p$ as follows:

$$s_d = \sqrt{s^2_p\left(\frac{1}{\sum_{i=1}^{m_A}\sum_{j=1}^{q_A} p_{ijA}} + \frac{1}{\sum_{i=1}^{m_B}\sum_{j=1}^{q_B} p_{ijB}}\right)}, \qquad (22)$$

where

$$s^2_p = \frac{\left(\sum_{i=1}^{m_A}\sum_{j=1}^{q_A} p_{ijA} - 1\right)s^2_{\bar{y}_{ijk}A} + \left(\sum_{i=1}^{m_B}\sum_{j=1}^{q_B} p_{ijB} - 1\right)s^2_{\bar{y}_{ijk}B}}{\sum_{i=1}^{m_A}\sum_{j=1}^{q_A} p_{ijA} + \sum_{i=1}^{m_B}\sum_{j=1}^{q_B} p_{ijB} - 2} \qquad (23)$$

with

$$\nu_d = \sum_{i=1}^{m_A}\sum_{j=1}^{q_A} p_{ijA} + \sum_{i=1}^{m_B}\sum_{j=1}^{q_B} p_{ijB} - 2. \qquad (24)$$

When the equality of the variances of the day means, $\sigma^2_{\bar{y}_{ijk}A}$ and $\sigma^2_{\bar{y}_{ijk}B}$, cannot be assumed, the variances cannot be pooled and $s_d$ is calculated as follows [13]:

$$s_d = \sqrt{\frac{s^2_{\bar{y}_{ijk}A}}{\sum_{i=1}^{m_A}\sum_{j=1}^{q_A} p_{ijA}} + \frac{s^2_{\bar{y}_{ijk}B}}{\sum_{i=1}^{m_B}\sum_{j=1}^{q_B} p_{ijB}}}. \qquad (25)$$

The number of degrees of freedom $\nu_d$ associated with $s_d$ is then calculated by applying the Satterthwaite approximation:

$$\nu_d = \frac{\left(s^2_d\right)^2}{\dfrac{\left(s^2_{\bar{y}_{ijk}A}\big/\sum_{i=1}^{m_A}\sum_{j=1}^{q_A} p_{ijA}\right)^2}{\sum_{i=1}^{m_A}\sum_{j=1}^{q_A} p_{ijA} - 1} + \dfrac{\left(s^2_{\bar{y}_{ijk}B}\big/\sum_{i=1}^{m_B}\sum_{j=1}^{q_B} p_{ijB}\right)^2}{\sum_{i=1}^{m_B}\sum_{j=1}^{q_B} p_{ijB} - 1}} \qquad (26)$$

The $t_{cal}$ is compared with $t_{\alpha/2;\nu_d}$, where $t_{\alpha/2;\nu_d}$ is the two-sided tabulated t-value at the significance level $\alpha = 0.05$ and the degrees of freedom $\nu_d$ as indicated in Eq. (24) or Eq. (26).
If $t_{cal} > t_{\alpha/2;\nu_d}$, the difference between the means obtained with the two methods is statistically significant. Though the difference is significant it might not be relevant to the application. Therefore it could be further evaluated whether the difference found can be considered acceptable. The one-sided $(1-\alpha)100\%$ (i.e. 95% for $\alpha = 0.05$) upper confidence limit (UCL) around the absolute difference d is compared with the acceptance limit $\lambda$. The UCL is obtained as follows:

$$UCL = |\bar{y}_A - \bar{y}_B| + s_d \, t_{\alpha;\nu_d}, \qquad (27)$$

where $s_d$ is as shown in Eq. (22) or Eq. (25) and $t_{\alpha;\nu_d}$ is the one-sided tabulated t-value at the significance level $\alpha = 0.05$ and the degrees of freedom $\nu_d$ (as shown in Eq. (24) or Eq. (26)).
If $UCL \le \lambda$, the bias although statistically significant is acceptable since there is a smaller than (or at most) $\alpha$ probability that the true absolute difference as estimated by $|\bar{y}_A - \bar{y}_B|$ is larger than $\lambda$.
If $UCL > \lambda$, the bias is not acceptable since there is a larger than $\alpha$ probability that the true absolute difference as estimated by $|\bar{y}_A - \bar{y}_B|$ is larger than $\lambda$.
If $t_{cal} \le t_{\alpha/2;\nu_d}$, the difference between the means obtained with the two methods is statistically insignificant. However, if the precision estimates ($s^2_r$ and $s^2_{OID}$) of the two methods obtained experimentally are larger than those used for the calculation of the minimum number of measurements required for the detection of $\lambda$ in Eqs. (2)-(6), an unacceptable bias can lead to a non-significant test.
Therefore, to limit the risk of adopting a method with an unacceptable bias, the interval hypothesis testing as proposed by Hartmann et al. [2] should be more appropriate for the evaluation of the bias than the approach mentioned above (Eqs. (21)-(27)). The procedure is as follows.
Calculate the one-sided $(1-\alpha)100\%$ (i.e. 95% for $\alpha = 0.05$) upper confidence limit (UCL) around the absolute difference d:

$$UCL = |\bar{y}_A - \bar{y}_B| + s_d \, t_{\alpha;\nu_d}, \qquad (28)$$

where $\alpha$ is 0.05, $s_d$ is the same as in Eq. (22) or Eq. (25) and $\nu_d$ is the same as in Eq. (24) or Eq. (26).
Since in interval hypothesis testing the null and alternative hypotheses are reversed, the roles of $\alpha$ and $\beta$ are also reversed. Therefore, here $\alpha$ corresponds to the probability that a method that is biased to an unacceptably large extent will be accepted.
If the UCL is not larger than the acceptable bias $\lambda$, the difference between the grand means of method A and method B is considered acceptable at the $(1-\alpha)100\%$ confidence level and the bias of method B is acceptable. If the UCL is larger than the acceptable bias $\lambda$, the bias of method B is not acceptable. With this approach the probability of accepting a method that is too much biased is controlled at 5%.
The evaluation of the bias described above is based on nested designs performed separately for methods A and B. If it is possible to design a simultaneous experiment (e.g. same days and same operators) a paired comparison [6], for which a smaller $s_d$ is to be expected, could be preferable.
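The bias evaluation of Eqs. (21)-(28) can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the t-values are taken from tables and passed in as arguments, and the unpooled degrees of freedom are rounded down, as in the worked examples. The inputs are the variances of the day means and grand means reported for Examples 1 and 2.

```python
import math


def pooled_sd(s2_a, s2_b, days_a, days_b):
    """Eqs. (22)-(24): s_d from the pooled variance of the day means, and nu_d."""
    nu_d = days_a + days_b - 2
    s2_p = ((days_a - 1) * s2_a + (days_b - 1) * s2_b) / nu_d
    return math.sqrt(s2_p * (1 / days_a + 1 / days_b)), nu_d


def unpooled_sd(s2_a, s2_b, days_a, days_b):
    """Eqs. (25)-(26): s_d without pooling, nu_d by Satterthwaite (rounded down)."""
    va, vb = s2_a / days_a, s2_b / days_b
    nu_d = (va + vb) ** 2 / (va ** 2 / (days_a - 1) + vb ** 2 / (days_b - 1))
    return math.sqrt(va + vb), math.floor(nu_d)


def upper_confidence_limit(mean_a, mean_b, s_d, t_one_sided):
    """Eqs. (27)/(28): one-sided upper confidence limit around |mean_A - mean_B|."""
    return abs(mean_a - mean_b) + s_d * t_one_sided


# Example 1 (variances pooled), with the tabulated t(0.05; 38) = 1.69.
sd1, nu1 = pooled_sd(5.2730, 8.9794, 20, 20)
ucl1 = upper_confidence_limit(99.890, 101.438, sd1, 1.69)

# Example 2 (variances not pooled), with the tabulated t(0.05; 13) = 1.771.
sd2, nu2 = unpooled_sd(0.1025, 0.0199, 11, 11)
ucl2 = upper_confidence_limit(39.884, 39.486, sd2, 1.771)
print(nu1, round(ucl1, 2), nu2, round(ucl2, 3))
```

In both cases the method is accepted only if the returned limit does not exceed the acceptable bias.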
3. Examples
Two examples will illustrate the approach discussed. In the first example measurements are performed under (operator+instrument+time)-different intermediate precision conditions while in the second example only time-different intermediate precision conditions are considered.
3.1. Example 1: quantification of diazepam in diazepam tablets (the example is fictitious)
3.1.1. Background
Method A is an HPLC method, method B is a UV (second derivative) method for the quantification of diazepam in diazepam tablets. A laboratory uses method A but developed method B as an alternative. The laboratory wants to compare the performance of both methods. The results are expressed as percentage of the labelled amount (%). For method A an estimate of the precision ($s_r$ and $s_{OID}$) is available: $s_r = 1\%$, $s_{OID} = 2\%$.
3.1.2. Requirements
The acceptable bias $\lambda$ is 2%. The acceptable ratio of the standard deviations between the two methods, $\rho$, $\rho_{I(T)}$ or $\rho_{I(O+I+T)}$, is 2. The statistical tests are performed at the significance level $\alpha = 0.05$. The probability $\beta$ to wrongly adopt the method with an unacceptable performance is set at 0.2.
3.1.3. Experimental design
It is decided that the number of operators, instruments and replicates per day for each method is 2 and the number of days for both methods is equal ($p_A = p_B$).
From one batch of diazepam tablets, 300 tablets are randomly taken. They are powdered and kept in a cool, dry place, e.g. a desiccator. Each day during $p_A$ days, two replicates ($n_A = 2$) prepared from the powdered samples are analysed with method A on the first instrument by the first operator. The analysis is repeated independently in the same way during another $p_A$ days but on the second instrument. The second operator performs the procedures in the same way as the first operator does. The experiments for method B are designed in the same way as those for method A.
3.1.4. Determination of the minimum number of days
(1) For the detection of $\lambda$. Since $p_A = p_B$ and $n_A = n_B = 2$, Eq. (4) is used:

$$\lambda \ge (t_{\alpha/2} + t_\beta)\sqrt{\frac{2\left(s^2_{OIDA} + s^2_{rA}/n_A\right)}{m_A q_A p_A}},$$

$$2 \ge (t_{\alpha/2} + t_\beta)\sqrt{\frac{2\left(2^2 + 1^2/2\right)}{4 p_A}}.$$

With $p_A = 4$ and $\nu = [2(m_A q_A p_A) - 2] = [2(2 \times 2 \times 4) - 2] = 30$, $(t_{\alpha/2} + t_\beta) = 2.896$ and the right side of the equation above equals 2.172; with $p_A = 5$ and $\nu = [2(m_A q_A p_A) - 2] = [2(2 \times 2 \times 5) - 2] = 38$, $(t_{\alpha/2} + t_\beta) = 2.876$ and the right side of the equation above equals 1.929. Hence $p_A = p_B = 5$. (The use of a constant multiplication factor equal to 3 would yield $p_A = p_B = 6$.)

(2) For the comparison of precision measures. From Table 5 it can be seen that $\rho$ (or $\rho_{I(T)}$ or $\rho_{I(O+I+T)}$) $= 2$ is given by $\nu_A = \nu_B = 14$.
To compare repeatability, $\nu_A = m_A q_A p_A$ and $\nu_B = m_B q_B p_B$; so $p_A = p_B = 14/4 = 3.5 \approx 4$.
To compare time-different intermediate precision, $\nu_A = m_A q_A (p_A - 1)$ and $\nu_B = m_B q_B (p_B - 1)$; so $p_A = p_B = 14/4 + 1 = 4.5 \approx 5$.
To compare (operator+instrument+day)-different intermediate precision, $\nu_A = m_A q_A p_A - 1$ and $\nu_B = m_B q_B p_B - 1$; so $p_A = p_B = 15/4 = 3.75 \approx 4$.
(3) Conclusion. The minimum number of days required for both methods (with two operators, two instruments and two measurements per day) is 5.
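The search for the minimum number of days in step (1) can be sketched as a small loop. This is an illustration only: the helper names are hypothetical, and the $(t_{\alpha/2} + t_\beta)$ sums are the two values quoted in the text for $p_A = 4$ and $p_A = 5$, supplied as a lookup table instead of being computed from the t-distribution.

```python
import math

# (t_alpha/2 + t_beta) for each candidate number of days, as quoted for Example 1.
T_SUM = {4: 2.896, 5: 2.876}


def detectable_bias(p, t_sum, s2_oid, s2_r, n, m, q):
    """Right-hand side of the inequality used in step (1)."""
    return t_sum * math.sqrt(2 * (s2_oid + s2_r / n) / (m * q * p))


def minimum_days(lam, t_sums, **design):
    """Smallest tabulated p whose detectable bias does not exceed lam, else None."""
    for p in sorted(t_sums):
        if detectable_bias(p, t_sums[p], **design) <= lam:
            return p
    return None


design = dict(s2_oid=2.0 ** 2, s2_r=1.0 ** 2, n=2, m=2, q=2)
rhs_4 = detectable_bias(4, T_SUM[4], **design)
rhs_5 = detectable_bias(5, T_SUM[5], **design)
p_min = minimum_days(2.0, T_SUM, **design)
print(round(rhs_4, 3), round(rhs_5, 3), p_min)
```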
3.1.5. The data
The data for methods A and B are summarized in
Tables 6 and 7, respectively.
3.1.6. Investigation of outliers
Grubbs' tests were applied to the day means [8]. No
single or double stragglers or outliers were found for
both methods.

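Table 6
Data obtained with method A (example 1)

| Operator | Day | Instrument 1: $y_{i1k1}$ | $y_{i1k2}$ | $\bar{y}_{i1k}$ | $\bar{y}_{i1}$ | Instrument 2: $y_{i2k1}$ | $y_{i2k2}$ | $\bar{y}_{i2k}$ | $\bar{y}_{i2}$ |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 97.13 | 98.81 | 97.970 | 97.888 | 98.98 | 99.52 | 99.250 | 99.253 |
| 1 | 2 | 101.23 | 100.68 | 100.955 | | 102.84 | 100.93 | 101.885 | |
| 1 | 3 | 97.13 | 96.63 | 96.880 | | 99.65 | 99.29 | 99.470 | |
| 1 | 4 | 97.17 | 95.82 | 96.495 | | 98.67 | 98.46 | 98.565 | |
| 1 | 5 | 96.82 | 97.46 | 97.140 | | 97.08 | 97.11 | 97.095 | |
| 2 | 1 | 100.71 | 99.37 | 100.040 | 100.208 | 101.55 | 104.04 | 102.795 | 102.211 |
| 2 | 2 | 101.26 | 103.78 | 102.520 | | 99.66 | 98.70 | 99.180 | |
| 2 | 3 | 98.49 | 100.87 | 99.680 | | 102.54 | 100.60 | 101.570 | |
| 2 | 4 | 97.06 | 98.92 | 97.990 | | 104.95 | 102.61 | 103.780 | |
| 2 | 5 | 101.85 | 99.77 | 100.810 | | 103.10 | 104.36 | 103.730 | |

Grand mean = 99.890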


Table 7
Data obtained with method B (example 1)
Operator

Instrument 1
Day

Instrument 2
yi1k1

yi1k2

yi1k

yi1

Day

yi2k1

yi2k2

yi2k

yi2

1
2
3
4
5

97.86
93.73
98.12
101.45
99.68

98.40
95.43
98.42
99.61
99.54

98.130
94.580
98.270
100.530
99.610

98.224

1
2
3
4
5

100.86
105.23
100.66
98.44
100.75

103.98
102.35
99.98
97.32
101.58

102.420
103.790
100.320
97.880
101.165

101.115

1
2
3
4
5

101.61
99.54
101.45
104.38
103.42

103.85
97.34
102.28
103.84
104.58

102.730
98.440
101.865
104.110
104.000

102.229

1
2
3
4
5

104.73
102.94
105.64
103.24
103.78

107.77
103.58
107.14
101.28
101.73

106.250
103.260
106.390
102.260
102.755

104.183

Grand mean 101.438

3.1.7. Calculation of the variance estimates


Tables 8 and 9 summarize the calculation of the
variance estimates for methods A and B, respectively.
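The step from mean squares to variance estimates can be sketched as follows. This is a hedged illustration with hypothetical names, not the authors' own code; it simply applies the relations between the mean squares and the variance components to the method A mean squares of this example (MSOID = 10.5459, MSD = 6.3345, MSE = 1.2317, n = 2).

```python
def variance_estimates(ms_oid, ms_d, ms_e, n):
    """Variance estimates derived from the ANOVA mean squares of one method."""
    s2_r = ms_e                          # repeatability variance
    s2_d = (ms_d - ms_e) / n             # between-day component
    s2_oid = (ms_oid - ms_e) / n         # (operator+instrument+day) component
    return {
        "s2_r": s2_r,
        "s2_D": s2_d,
        "s2_OID": s2_oid,
        "s2_I(T)": s2_d + s2_r,
        "s2_I(O+I+T)": s2_oid + s2_r,
        "s2_day_means": s2_oid + s2_r / n,
    }


est_a = variance_estimates(10.5459, 6.3345, 1.2317, n=2)
for name, value in est_a.items():
    print(f"{name}: {value:.4f}")
```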
3.1.8. Comparison of precision
3.1.8.1. Repeatability. The repeatabilities are compared according to Eq. (11):

$$F_r = \frac{1.4810}{1.2317} = 1.20.$$

This is to be compared with $F_{0.05}(20,20) = 2.12$. Since $F_r < 2.12$ the repeatability of the alternative method B is acceptable, which means that it is at most a factor 2 worse than that of method A.
3.1.8.2. Time-different intermediate precision. (i) Check whether $\sigma^2_{rA} = \sigma^2_{rB}$ (see Eq. (12)):

$$F_r = \frac{1.4810}{1.2317} = 1.20.$$

This is to be compared with $F_{0.025}(20,20) = 2.46$. Since $F_r < 2.46$ there is no evidence that the repeatabilities of both methods are different.
(ii) Since the equality of the repeatabilities for both methods can be assumed ($\sigma^2_{rA} = \sigma^2_{rB}$) and the number

Table 8
Calculation of the variance estimates for method A (example 1) (ANOVA table)

| Source | Mean squares | Estimate of |
|---|---|---|
| Operator+instrument+day | MSOID = 10.5459 | $\sigma^2_{rA} + n_A \sigma^2_{OIDA}$ |
| Day | MSD = 6.3345 | $\sigma^2_{rA} + n_A \sigma^2_{DA}$ |
| Residual | MSE = 1.2317 | $\sigma^2_{rA}$ |

Calculation of the variance estimates

| Parameter | Estimate |
|---|---|
| The repeatability variance | $s^2_{rA} = 1.2317$; $\nu = 20$ |
| The between-day variance component | $s^2_{DA} = (6.3345 - 1.2317)/2 = 2.5514$ |
| The (operator+instrument+day) variance component | $s^2_{OIDA} = (10.5459 - 1.2317)/2 = 4.6571$ |
| Time-different intermediate precision (variance) | $s^2_{I(T)A} = s^2_{DA} + s^2_{rA} = 3.7831$ |
| (Operator+instrument+time)-different intermediate precision (variance) | $s^2_{I(O+I+T)A} = s^2_{OIDA} + s^2_{rA} = 5.8888$ |
| Variance of the day means $\bar{y}_{ijk}$ | $s^2_{\bar{y}_{ijk}A} = s^2_{OIDA} + s^2_{rA}/n_A = 5.2730$; $\nu = 19$ |

Table 9
Calculation of the variance estimates for method B (example 1) (ANOVA table)

| Source | Mean squares | Estimate of |
|---|---|---|
| Operator+instrument+day | MSOID = 17.9587 | $\sigma^2_{rB} + n_B \sigma^2_{OIDB}$ |
| Day | MSD = 9.7042 | $\sigma^2_{rB} + n_B \sigma^2_{DB}$ |
| Residual | MSE = 1.4810 | $\sigma^2_{rB}$ |

Calculation of the variance estimates

| Parameter | Estimate |
|---|---|
| The repeatability variance | $s^2_{rB} = 1.4810$; $\nu = 20$ |
| The between-day variance component | $s^2_{DB} = (9.7042 - 1.4810)/2 = 4.1116$ |
| The (operator+instrument+day) variance component | $s^2_{OIDB} = (17.9587 - 1.4810)/2 = 8.2389$ |
| Time-different intermediate precision (variance) | $s^2_{I(T)B} = s^2_{DB} + s^2_{rB} = 5.5926$ |
| (Operator+instrument+time)-different intermediate precision (variance) | $s^2_{I(O+I+T)B} = s^2_{OIDB} + s^2_{rB} = 9.7199$ |
| Variance of the day means $\bar{y}_{ijk}$ | $s^2_{\bar{y}_{ijk}B} = s^2_{OIDB} + s^2_{rB}/n_B = 8.9794$; $\nu = 19$ |

of replicates per day for both methods is equal ($n_A = n_B$), the comparison of time-different intermediate precision is performed according to Eq. (13):

$$F_{I(T)} = \frac{MSD_B}{MSD_A} = \frac{9.7042}{6.3345} = 1.53.$$

This is to be compared with $F_{0.05}(16,16) = 2.33$. Since $F_{I(T)} < 2.33$ the time-different intermediate precision of the alternative method B is acceptable, which means that it is at most a factor 2 worse than that of method A.
3.1.8.3. (Operator+instrument+time)-different intermediate precision. In analogy with the comparison of time-different intermediate precision, the comparison of (operator+instrument+time)-different intermediate precision is performed as follows (see Eq. (17)):

$$F_{I(O+I+T)} = \frac{MSOID_B}{MSOID_A} = \frac{17.9587}{10.5459} = 1.70.$$

This is to be compared with $F_{0.05}(19,19) = 2.17$. Since $F_{I(O+I+T)} < 2.17$ the (operator+instrument+time)-different intermediate precision of the alternative method B is acceptable, which means that it is at most a factor 2 worse than that of method A.
3.1.9. Evaluation of the bias
This is done by comparing the grand means of methods A and B by a t-test.
(i) Check whether $\sigma^2_{\bar{y}_{ijk}A} = \sigma^2_{\bar{y}_{ijk}B}$:

$$F = \frac{8.9794}{5.2730} = 1.70.$$

This is to be compared with $F_{0.025}(19,19) = 2.53$. Since $F < 2.53$ there is no evidence that the variances of the day means obtained with the two methods are different.
(ii) Therefore the variances can be pooled (Eq. (23)) and $s_d$ is obtained from Eq. (22):

$$s^2_p = \frac{19 \times 5.2730 + 19 \times 8.9794}{38} = 7.1262,$$

$$s_d = \sqrt{7.1262\left(\frac{1}{20} + \frac{1}{20}\right)} = 0.8442.$$

The test statistic $t_{cal}$ is obtained as given in Eq. (21):

$$t_{cal} = \frac{|\bar{y}_A - \bar{y}_B|}{s_d} = \frac{|99.890 - 101.438|}{0.8442} = 1.83.$$

This is to be compared with $t_{0.025;38} = 2.02$. Since $t_{cal} < 2.02$ the difference between the grand means of the two methods is statistically insignificant.
(iii) In the interval hypothesis testing approach, the one-sided 95% upper confidence limit (UCL) around the absolute difference between the grand means is calculated as follows (Eq. (28)):

$$UCL = |\bar{y}_A - \bar{y}_B| + s_d \times t_{0.05;38} = |99.890 - 101.438| + 0.8442 \times 1.69 = 2.97.$$

This is to be compared with the acceptable bias $\lambda = 2$. Since UCL > 2 the bias of method B is not acceptable. Notice that in this approach the probability of accepting a method that is too much biased is controlled at 5%.
3.2. Example 2: determination of moisture in cheese (the example is fictitious)
3.2.1. Background
Method A is a Karl Fischer method, method B is a vacuum oven method for the determination of moisture in cheese. A laboratory uses method A but developed method B as an alternative. The laboratory wants to compare the performance of both methods. The results are expressed as percentage moisture. For method A an estimate of the precision ($s^2_r$ and $s^2_D$) is available: $s^2_r = 0.023$; $s^2_D = 0.08$.
3.2.2. Requirements
The acceptable bias $\lambda$ is 0.50%. The acceptable ratio of the standard deviations between the two methods, $\rho$ or $\rho_{I(T)}$, is 3. The statistical tests are performed at the significance level $\alpha = 0.05$. The probability $\beta$ to wrongly adopt the method with an unacceptable performance is set at 0.05.

3.2.3. Experimental design
The material is a cheese, analysed with both methods.
It is decided that the number of replicates per day for each method is two and the number of days for both methods is equal ($p_A = p_B$). Each day during $p_A$ days, two independent samples ($n_A = 2$) from the cheese are analysed with method A by the same operator using the same instrument. Each day during $p_B$ days, two independent samples ($n_B = 2$) from the cheese are analysed with method B by the same operator using the same instrument.

3.2.4. Determination of the minimum number of days
(1) For the detection of $\lambda$. Since $p_A = p_B$ and $n_A = n_B = 2$, Eq. (4) is used. Since in the comparison only time-different intermediate precision conditions are considered, the number of operators $m_A = m_B = 1$ and the number of instruments $q_A = q_B = 1$. Consequently, as can be derived from Table 3, MSOID and $s^2_{OID}$ are equal to MSD and $s^2_D$, respectively.

$$0.5 \ge (t_{\alpha/2} + t_\beta)\sqrt{\frac{2\left(s^2_{DA} + s^2_{rA}/n_A\right)}{p_A}},$$

$$0.5 \ge (t_{\alpha/2} + t_\beta)\sqrt{\frac{2\left(0.08 + 0.023/2\right)}{p_A}}.$$

With $p_A = 10$ and $\nu = [2(m_A q_A p_A) - 2] = [2(1 \times 1 \times 10) - 2] = 18$, $(t_{\alpha/2} + t_\beta) = 3.835$ and the right side of the equation above equals 0.519; with $p_A = 11$ and $\nu = [2(m_A q_A p_A) - 2] = [2(1 \times 1 \times 11) - 2] = 20$, $(t_{\alpha/2} + t_\beta) = 3.811$ and the right side of the equation above equals 0.492. Hence $p_A = p_B = 11$. (The use of a constant multiplication factor equal to 4 would yield $p_A = p_B = 12$.)
(2) For the comparison of precision. From Table 4 it can be seen that $\rho = 3$ or $\rho_{I(T)} = 3$ is given by $\nu_A = \nu_B = 10$.
To compare repeatability standard deviations, $\nu_A = m_A q_A p_A$ and $\nu_B = m_B q_B p_B$; so $p_A = p_B = 10$.
To compare between-day mean squares, $\nu_A = m_A q_A (p_A - 1)$ and $\nu_B = m_B q_B (p_B - 1)$; so $p_A = p_B = 10 + 1 = 11$.
(3) Conclusion. The minimum number of days required (with two measurements per day) is 11.

3.2.5. The data
The data for methods A and B are summarized in Table 10.

3.2.6. Investigation of outliers
Grubbs' tests were applied to the day means [8]. No single or double stragglers or outliers were found for method A. For method B the single Grubbs' test applied on the mean of day 9 is significant at the 5% level but not at the 1% level. Indeed

$$G = \frac{39.845 - 39.486}{0.1411} = 2.544,$$

which is to be compared with Grubbs' critical values for p = 11 at 5% (2.355) and 1% (2.564). Therefore, since this observation is considered as a straggler, it is retained but indicated with a superscript "a" in Table 10.

3.2.7. Calculation of the variance estimates
Tables 11 and 12 summarize the calculation of the variance estimates for methods A and B, respectively.
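The single Grubbs' test of Section 3.2.6 can be sketched as follows. This is an illustration with a hypothetical helper name; the day means are recomputed from the method B replicates of Table 10, so the unrounded statistic differs slightly in the last digits from the value obtained with the rounded figures, and the critical values are those quoted in the text.

```python
import statistics


def grubbs_statistic(values, suspect_index):
    """Single Grubbs' statistic: |suspect - mean| / standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return abs(values[suspect_index] - mean) / s


# Method B replicates of Table 10, one pair per day.
replicates_b = [
    (39.29, 39.36), (39.51, 39.38), (39.45, 39.49), (39.59, 39.51),
    (39.41, 39.41), (39.45, 39.54), (39.55, 39.55), (39.29, 39.36),
    (39.82, 39.87), (39.44, 39.45), (39.45, 39.53),
]
day_means = [(y1 + y2) / 2 for y1, y2 in replicates_b]
g = grubbs_statistic(day_means, suspect_index=8)   # day 9

CRIT_5, CRIT_1 = 2.355, 2.564                      # critical values for p = 11
status = "outlier" if g > CRIT_1 else "straggler" if g > CRIT_5 else "ok"
print(round(g, 2), status)
```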

Table 10
Data for example 2

| Day | Method A: $y_{11k1}$ | $y_{11k2}$ | $\bar{y}_{11k}$ | Method B: $y_{11k1}$ | $y_{11k2}$ | $\bar{y}_{11k}$ |
|---|---|---|---|---|---|---|
| 1 | 39.68 | 39.77 | 39.725 | 39.29 | 39.36 | 39.325 |
| 2 | 39.08 | 39.38 | 39.230 | 39.51 | 39.38 | 39.445 |
| 3 | 40.39 | 40.33 | 40.360 | 39.45 | 39.49 | 39.470 |
| 4 | 39.87 | 39.98 | 39.925 | 39.59 | 39.51 | 39.550 |
| 5 | 39.70 | 39.95 | 39.825 | 39.41 | 39.41 | 39.410 |
| 6 | 39.93 | 39.95 | 39.940 | 39.45 | 39.54 | 39.495 |
| 7 | 39.78 | 39.97 | 39.875 | 39.55 | 39.55 | 39.550 |
| 8 | 39.92 | 40.20 | 40.060 | 39.29 | 39.36 | 39.325 |
| 9 | 40.34 | 39.89 | 40.115 | 39.82 | 39.87 | 39.845 (a) |
| 10 | 40.12 | 40.26 | 40.190 | 39.44 | 39.45 | 39.445 |
| 11 | 39.43 | 39.54 | 39.485 | 39.45 | 39.53 | 39.490 |

Grand mean (method A) = 39.884; grand mean (method B) = 39.486
(a) Straggler.
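The mean squares used for method A in this example can be reproduced from the replicates of Table 10. The sketch below is a minimal one-factor (days only) ANOVA with a hypothetical function name; it assumes the same number of replicates on every day.

```python
import statistics


def day_anova(replicates):
    """Return (MSD, MSE) for a design with days only (one operator, one instrument)."""
    n = len(replicates[0])                            # replicates per day
    day_means = [statistics.mean(day) for day in replicates]
    ms_d = n * statistics.variance(day_means)         # between-day mean square
    ss_e = sum((y - m) ** 2
               for day, m in zip(replicates, day_means) for y in day)
    ms_e = ss_e / (len(replicates) * (n - 1))         # residual mean square
    return ms_d, ms_e


# Method A replicates of Table 10, one pair per day.
replicates_a = [
    (39.68, 39.77), (39.08, 39.38), (40.39, 40.33), (39.87, 39.98),
    (39.70, 39.95), (39.93, 39.95), (39.78, 39.97), (39.92, 40.20),
    (40.34, 39.89), (40.12, 40.26), (39.43, 39.54),
]
ms_d_a, ms_e_a = day_anova(replicates_a)
print(round(ms_d_a, 4), round(ms_e_a, 4))
```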

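Table 11
Calculation of the variance estimates for method A (example 2) (ANOVA table)

| Source | Mean squares | Estimate of |
|---|---|---|
| Day | MSD = 0.2050 | $\sigma^2_{rA} + n_A \sigma^2_{DA}$ |
| Residual | MSE = 0.0239 | $\sigma^2_{rA}$ |

Calculation of the variance estimates

| Parameter | Estimate |
|---|---|
| The repeatability variance | $s^2_{rA} = 0.0239$; $\nu = 11$ |
| The between-day variance component | $s^2_{DA} = (0.2050 - 0.0239)/2 = 0.0906$ |
| Time-different intermediate precision (variance) | $s^2_{I(T)A} = s^2_{DA} + s^2_{rA} = 0.1145$ |
| Variance of the day means $\bar{y}_{ijk}$ | $s^2_{\bar{y}_{ijk}A} = s^2_{DA} + s^2_{rA}/n_A = 0.1025$; $\nu = 10$ |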

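Table 12
Calculation of the variance estimates for method B (example 2) (ANOVA table)

| Source | Mean squares | Estimate of |
|---|---|---|
| Day | MSD = 0.0397 | $\sigma^2_{rB} + n_B \sigma^2_{DB}$ |
| Residual | MSE = 0.0024 | $\sigma^2_{rB}$ |

Calculation of the variance estimates

| Parameter | Estimate |
|---|---|
| The repeatability variance | $s^2_{rB} = 0.0024$; $\nu = 11$ |
| The between-day variance component | $s^2_{DB} = (0.0397 - 0.0024)/2 = 0.0187$ |
| Time-different intermediate precision (variance) | $s^2_{I(T)B} = s^2_{DB} + s^2_{rB} = 0.0211$ |
| Variance of the day means $\bar{y}_{ijk}$ | $s^2_{\bar{y}_{ijk}B} = s^2_{DB} + s^2_{rB}/n_B = 0.0199$; $\nu = 10$ |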

3.2.8. Comparison of precision

3.2.8.1. Repeatability. The repeatabilities are compared according to Eq. (11):

$$F_r = \frac{0.0024}{0.0239} = 0.10.$$

This is to be compared with $F_{0.05(11,11)} = 2.82$. Since $F_r < 2.82$, the repeatability of the alternative method B is acceptable, which means that it is at most a factor 3 worse than that of method A.

3.2.8.2. Time-different intermediate precision. (i) Check whether $\sigma^2_{rA} = \sigma^2_{rB}$ (see Eq. (12)):

$$F_r = \frac{0.0239}{0.0024} = 10.$$

This is to be compared with $F_{0.025(11,11)} = 3.47$. Since $F_r > 3.47$ there is evidence that the repeatabilities of both methods are different (in fact the repeatability for method B is better than for method A).

(ii) Since the repeatabilities of both methods are different ($\sigma^2_{rA} \neq \sigma^2_{rB}$), the comparison of time-different intermediate precision is performed according to Eq. (14):

$$F_{I(T)} = \frac{s^2_{I(T)B}}{s^2_{I(T)A}} = \frac{0.0211}{0.1145} = 0.18.$$

The number of degrees of freedom associated with $s^2_{I(T)}$ for both methods is obtained from the Satterthwaite approximation (Eqs. (15) and (16)):

$$\nu_{I(T)B} = \frac{0.0211^2}{(0.0397/2)^2/10 + (0.0024/2)^2/11} \approx 11,$$

$$\nu_{I(T)A} = \frac{0.1145^2}{(0.2050/2)^2/10 + (0.0239/2)^2/11} \approx 12.$$

$F_{I(T)}$ is to be compared with $F_{0.05(11,12)} = 2.72$. Since $F_{I(T)} < 2.72$ the time-different intermediate precision of the alternative method B is acceptable, which means that it is at most a factor 3 worse than that of method A.

3.2.9. Evaluation of the bias

This is done by comparing the grand means of methods A and B by a t-test.

(i) Check whether $\sigma^2_{\bar{y}_{ijk},A} = \sigma^2_{\bar{y}_{ijk},B}$:

$$F = \frac{0.1025}{0.0199} = 5.15.$$

This is to be compared with $F_{0.025(10,10)} = 3.72$. Since $F > 3.72$ there is evidence that the variances of the day means obtained with the two methods are different. Therefore, the standard deviation $s_d$ and its associated degrees of freedom $\nu_d$ are calculated as follows (see Eqs. (25) and (26)):

$$s_d = \sqrt{\frac{0.1025}{11} + \frac{0.0199}{11}} = 0.1055,$$

$$\nu_d = \frac{0.0111^2}{(0.1025/11)^2/10 + (0.0199/11)^2/10} \approx 13.$$

The test statistic $t_{cal}$ is obtained as given in Eq. (21):

$$t_{cal} = \frac{|\bar{y}_A - \bar{y}_B|}{s_d} = \frac{|39.884 - 39.486|}{0.1055} = 3.77.$$

This is to be compared with $t_{0.025;13} = 2.16$. Since $t_{cal} > 2.16$ the difference between the grand means of the two methods is statistically significant at $\alpha = 0.05$.

To further evaluate whether the difference found can be acceptable, the UCL is calculated according to Eq. (27):

$$\mathrm{UCL} = |39.884 - 39.486| + 0.1055 \times t_{\beta;13} = 0.398 + 0.1055 \times t_{0.05;13} = 0.398 + 0.1055 \times 1.771 = 0.585.$$

Since the UCL is larger than the acceptable bias of 0.50, there is evidence that the difference between the means of the two methods is unacceptable. (If the β probability is allowed to be 0.2, the UCL would be 0.398 + (0.1055 × 0.870) = 0.490, which is smaller than 0.50, and the bias would be acceptable.)

In the interval hypothesis testing approach, the one-sided 95% upper confidence limit (UCL) around the absolute difference between the grand means is calculated as follows (Eq. (28)):

$$\mathrm{UCL} = |39.884 - 39.486| + 0.1055 \times t_{\alpha;13} = |39.884 - 39.486| + 0.1055 \times t_{0.05;13} = 0.398 + 0.1055 \times 1.771 = 0.585.$$

This is to be compared with the acceptable bias of 0.5. Since UCL > 0.5 the bias of method B is not acceptable. Notice that in this example, the interval hypothesis leads to the same conclusion as the point hypothesis (with the inclusion of the β-error consideration) does. This is because the point hypothesis test detects a significant difference between the means of both methods, and because the β probability used when further evaluating whether that difference is acceptable equals the α probability used in the interval hypothesis testing.
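The bias evaluation of this example can be retraced numerically. The sketch below starts from the rounded summary values quoted in the text (grand means, variances of the day means, 11 days per method). The function name `bias_evaluation` is an assumption of this example, and the critical t values are copied from the text and not computed.

```python
from math import sqrt

def bias_evaluation(mean_a, mean_b, var_a, var_b, p_a, p_b):
    """Difference of two grand means with its standard deviation, approximate
    degrees of freedom (Satterthwaite) and t statistic, for day-mean
    variances that differ between the methods."""
    u_a, u_b = var_a / p_a, var_b / p_b
    s_d = sqrt(u_a + u_b)
    nu_d = (u_a + u_b) ** 2 / (u_a ** 2 / (p_a - 1) + u_b ** 2 / (p_b - 1))
    diff = abs(mean_a - mean_b)
    return diff, s_d, nu_d, diff / s_d

diff, s_d, nu_d, t_cal = bias_evaluation(39.884, 39.486, 0.1025, 0.0199, 11, 11)

# Critical values quoted in the text for 13 degrees of freedom
T_0025_13 = 2.16         # two-sided point hypothesis test at alpha = 0.05
T_005_13 = 1.771         # one-sided 95% upper confidence limit
ACCEPTABLE_BIAS = 0.50

significant = t_cal > T_0025_13      # point hypothesis test
ucl = diff + s_d * T_005_13          # upper confidence limit on |difference|
acceptable = ucl < ACCEPTABLE_BIAS   # interval hypothesis decision

print(f"s_d = {s_d:.4f}, nu_d = {nu_d:.1f}, t_cal = {t_cal:.2f}")
print(f"significant = {significant}, UCL = {ucl:.3f}, acceptable = {acceptable}")
```

With a different acceptable bias or a different critical value (for example 0.870 for a probability of 0.2), only the constants need to change.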
4. Conclusion
An approach for the comparison of an alternative method to a reference method has been proposed for the intralaboratory situation. Instead of the reproducibility (as included in the ISO guidelines), the (operator + instrument + time)-different intermediate precision is considered in the comparison. The proposal includes:
1. the experimental design (i.e. the determination of
the number of measurements required to perform
the comparison),
2. the estimation of different precision parameters
and the comparison of these precision measures
for both methods, and
3. the statistical approach for the evaluation of the
bias in which the interval hypothesis testing has
also been proposed as an alternative.
The comparison of the bias and precision described in this article is performed at a single concentration level. If the alternative method is intended for use over a rather broad concentration range, the comparison should be performed at more concentration levels (e.g. low, middle and high). Due to the problem of multiple comparisons [5], the present approach is not recommended if the methods are to be compared at more than three levels. For trace analysis, an evaluation of the detection and quantification limits should also be performed.
The proposal is an optimal approach in the sense that it is based on sample size calculations. The number of measurements to be performed is such that there is a high probability (1 − β) that an alternative method with an unacceptable performance will not be adopted. This of course is of utmost importance but might require a number of measurements that the laboratory is not able (or not willing) to perform because of the time and cost involved. If this is the case, an alternative approach based on a number of measurements that is feasible in practice is required. Two approaches can be conceived. The first is to perform the comparison, based on a user-defined number of measurements, in the classical way using point hypothesis testing and to evaluate the β-error. In this way the laboratory would at least have an idea of the probability that an alternative method with an unacceptable performance has been accepted and thus of the risk that the method will not perform as expected during routine use.
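To give a feeling for what evaluating the β-error involves, the sketch below estimates the probability that a two-sided test at level α fails to detect a true bias equal to the acceptable bias. It relies on a normal approximation in place of the t-distribution, which is an assumption of this example and not the procedure of the article. The function name `beta_error` and its parameters are hypothetical.

```python
from statistics import NormalDist

def beta_error(acceptable_bias, s_d, alpha=0.05):
    """Approximate probability of not detecting a true bias equal to
    `acceptable_bias` in a two-sided test at level `alpha`.

    Normal approximation (assumption); the far rejection tail is
    negligible and is ignored.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(z_crit - acceptable_bias / s_d)

# Values of example 2: acceptable bias 0.50, s_d = 0.1055
beta = beta_error(acceptable_bias=0.50, s_d=0.1055)
print(f"approximate beta-error: {beta:.4f}")
```

With few degrees of freedom the normal approximation is optimistic, so the result should be read as an order of magnitude only.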
Another approach is to control the probability that a
method with unacceptable performance characteristics will be adopted by using interval hypothesis
testing. The latter was already included here as an
alternative for the evaluation of the bias but can also be
considered in the comparison of precision measures.
After it was proposed for the evaluation of the bias in
method validation studies by Hartmann et al. [2],
interval hypothesis testing has been considered by
the SFSTP in a guideline for the validation of bioanalytical methods [14].
Acknowledgements
This work has received financial support from the
European Commission (Standards, Measurements and
Testing Programme Contract SMT4-CT95-2031) and
the Belgian government (The Prime Minister Services
Federal Office for Scientific, Technical and Cultural
Affairs, Standardisation Programme Research Contract no/03/003).

References
[1] International Standard, Accuracy (Trueness and Precision) of
Measurement methods and results, ISO 5725-6, Geneva,
1994.
[2] C. Hartmann, J. Smeyers-Verbeke, W. Penninckx, Y. Vander
Heyden, P. Vankeerberghen, D.L. Massart, Anal. Chem. 67
(1995) 4491.
[3] International Standard, Accuracy (Trueness and Precision) of
Measurement methods and results, ISO 5725-3, Geneva,
1994.

[4] International Standard, Statistics (Vocabulary and symbols):
Design of experiments, ISO 3534-3, Geneva, 1985.
[5] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. De
Jong, P.J. Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics: Part A, Elsevier, Amsterdam, 1997.
[6] D.C. Montgomery, Design and Analysis of Experiments, 4th
ed., Wiley, New York, 1997.
[7] J. Mandel, The Statistical Analysis of Experimental Data,
Dover, New York, 1964, p. 359.
[8] International Standard, Accuracy (Trueness and Precision) of
Measurement methods and results, ISO 5725-2, Geneva,
1994.
[9] W. Gerisch, D. Abraham, Comput. Stat. Quarterly 4 (1989)
299.

[10] D.J. Schuirmann, J. Pharmacokinet. Biopharm. 15 (1987) 657.
[11] F.E. Satterthwaite, Biometrics Bull. 2 (1946) 110.
[12] G.T. Wernimont, Use of Statistics to Develop and Evaluate
Analytical Methods, in: W. Spendley (Ed.), AOAC, Arlington,
VA, 1985, p. 39.
[13] G.W. Snedecor, W.G. Cochran, Statistical Methods, 7th ed.,
The Iowa State University Press, Ames, Iowa, 1982, p. 96.
[14] E. Chapuzet, N. Mercier (Presidents), S. Bervoas-Martin, B.
Boulanger, P. Chevalier, P. Chiap, D. Grandjean, P. Hubert, P.
Lagorce, M. Lallier, M.C. Laparra, M. Laurentie, J.C. Nivet,
S.T.P. Pharma Prat. 7 (3) (1997) 169.
