
Br. J. clin. Pharmac. (1979), 8, 7-20

THE TWO-PERIOD CROSS-OVER CLINICAL TRIAL


M. HILLS
British Museum (Natural History), Cromwell Road, London SW7 5BD

P. ARMITAGE
Department of Biomathematics, Oxford University, Pusey Street, Oxford OX1 2JZ

Introduction
A cross-over clinical trial is one in which the effects of different treatments are compared on the same subject during different treatment periods. Such trials are useful when the treatments are intended to alleviate a condition, rather than effect a cure, so that after the first treatment is withdrawn the subject is in a position to receive a second. Two common examples are the comparison of anti-inflammatory drugs in arthritis and the comparison of hypotensive agents in essential hypertension. In both of these the symptom treated returns after the treatment is stopped, although the return may not be immediate or to the same degree. It would be most unwise to give the treatments in the same order for all the subjects, so order should always be chosen at random. Sometimes the choice is completely random but more commonly a restriction is introduced to ensure a balance between the treatments and orders.

The main advantages and disadvantages of the cross-over design in clinical trials are well known. On the positive side a comparison of treatments on the same subject is expected to be more precise than a comparison between subjects and therefore to require fewer subjects for the same precision. On the negative side the task of disentangling treatment effects from both time and carry-over effects from the previous treatment period can be difficult or even impossible. Although these points are essentially statistical they are rarely presented in precise statistical terms. In fact the published results of cross-over trials suggest that the appropriate statistical methods are less well known, or less well understood, than those for parallel group trials.

The substance of this paper is an explicit account of the models and methods for analysing a two-period cross-over design. It is not suggested that the results of all cross-over trials should be submitted to the whole range of statistical tests discussed. On the other hand any analysis should be clearly based on one of these models if it is to carry conviction. For a more mathematical treatment of the problems considered in this paper the reader should consult Grizzle (1965) and Gart (1969). A very careful discussion of the principles on which the design of cross-over trials is based is given in Cox (1958). The last reference should also be consulted for details of designs suitable for more than two periods.
1. The response to treatment

In a clinical trial the response to treatment will usually be measured in several different ways. Each measurement must be on a properly defined scale. Treatments are compared using the distributions of the observed values for each measurement. The problem of how best to reduce a mass of information on a subject's state of health to a few measurements of response which can be separately used to compare treatments is by no means negligible. Even a simple response such as blood pressure presents some choices. Should the actual blood pressure after treatment be used or should it be the fall in blood pressure from some pre-treatment base-line? If several readings are taken during a treatment period should the last be taken to measure the response or should an average over the whole (or part) of the period be used? These are largely statistical questions and will be discussed further in Section 5.

For the moment we shall assume that a single measure of response to treatment has been chosen. This measure is taken to summarize the response of a subject during the treatment period and may well be the difference or average of other measures obtained for that subject. In a cross-over trial the response is measured in this way for each treatment period and usually for an initial pre-treatment period as well.

The scale on which the basic response is measured is important when deciding on the kind of statistical analysis to carry out. For the purpose of this paper it is enough to distinguish three scales.
1. Quantitative - such as blood pressure or the number of incidents in a given period of time.
2. Ordinal - such as poor (1), fair (2), good (3).
3. Binary - such as yes/no or present/absent.
A binary scale is a special case of a qualitative scale, the scale having only two points.




Qualitative scales with three or more points do occur in clinical trials, particularly in psychiatric studies, but their use in cross-over trials is rather rare.

2. The basic model

The simplest kind of cross-over trial is one in which each patient is treated for two periods of time using a different treatment in each period. The design looks like this:

            Group A        Group B
Period 1    Treatment X    Treatment Y
Period 2    Treatment Y    Treatment X

Subjects are allocated randomly to the two groups A and B. Those in group A receive treatment X in the first period and treatment Y in the second, while in group B the order is reversed. Equal numbers of subjects may be initially allocated to the two groups although the numbers completing the trial in each group may well differ. Subjects are assessed during each period and the two periods should be equal in length, with the same length for all subjects. The method of analysis varies with the type of response, although the basic idea stays the same. Details for analysing a quantitative response are presented first since the underlying theory is clearer in this context than any other.
2.1 Analysis for a quantitative response

If a particular quantitative response, y, is being used to assess the subjects, then y may show considerable changes with time, quite regardless of any treatments which are given. These changes will vary from subject to subject but there may be an overall tendency for the trend to be in a particular direction. To illustrate this, suppose that subjects in a trial are given no treatment at all but are still assessed during periods 1 and 2, giving responses y1 and y2. For each subject the difference d = y1 - y2 represents the change from period 1 to period 2. If d were evaluated for a number of subjects it might turn out that the mean change was significantly greater than zero, indicating that the level of response does, on average, fall with time. It is worth noting at this point that there are usually consistent differences between subjects as far as the overall level of response is concerned, but these will not affect the time trends, which are based on changes within a subject.

The problem of the cross-over trial is to compare treatments whose effects are superimposed on the underlying time trend within each subject. In order to disentangle the treatment effect from the effect of time it is necessary to make certain assumptions about the way the two combine to make up the response. There are two crucial assumptions. The first, common to all statistical analyses of experiments, is

that of unit/treatment additivity. This means that the effect of treating a subject during a specific period is to change the response by a fixed amount which depends upon the treatment and is the same (apart from a random component) for all subjects. The second assumption is that the fixed amount is the same for both periods. It is this second assumption which is important for cross-over trials and it is quite a strong assumption. In particular it implies that the response to a treatment during the second period should not be influenced by the treatment which was given during the first period. Methods for testing the assumption, and possible reasons for its failure, are discussed in Section 3.

To show how the response is determined by these assumptions it is necessary to use a little algebra. We shall suppose that, in the absence of treatment, the response is e1 for period 1 and e2 for period 2. If the fixed amounts by which the treatments change the response are Tx and Ty then the outcome of the trial should look like this:

Group A subject          Group B subject
y1 = Tx + e1             y1 = Ty + e1
y2 = Ty + e2             y2 = Tx + e2

The values of y1, y2, e1, e2 change for each subject but the values of Tx, Ty are constant. For group A the X - Y treatment effect is assessed from the difference y1 - y2 and for group B from the difference y2 - y1. We have

dA = y1 - y2 = Tx - Ty + (e1 - e2)
dB = y2 - y1 = Tx - Ty - (e1 - e2)

The value of e1 - e2 represents the time trend for a subject, and if there is any consistent time trend the values of e1 - e2 over all subjects will have a non-zero mean δ. Because of the variation in time trends the individual values will vary about δ with a standard deviation of, say, σ. Now suppose that out of the subjects who complete the trial there are nA in group A and nB in group B. The mean difference for group A will estimate Tx - Ty + δ with standard error equal to σ/√nA and the mean difference for group B will estimate Tx - Ty - δ with a standard error equal to σ/√nB. In other words, writing the mean differences as d̄A and d̄B, we have the result:

½(d̄A + d̄B) estimates Tx - Ty
½(d̄A - d̄B) estimates δ.

The standard error for each estimate is ½√(σ²/nA + σ²/nB). The value of Tx - Ty is a measure of the relative efficacy of the two treatments and the value of δ is a measure of the time trend between periods. The expression for the standard error of the estimates is derived from


s.d.{½(d̄A ± d̄B)} = ½√{s.e.²(d̄A) + s.e.²(d̄B)} = ½√{σ²/nA + σ²/nB}

Example 1. In a trial of a new drug for the treatment of enuresis, each of 29 patients was given the drug for a period of 14 days and a placebo for a separate period of 14 days, the order of administration being chosen randomly for each patient. The data below show the number of dry nights out of 14 on the drug and on the placebo.

Group A (drug/placebo)
Serial number   Period 1, y1   Period 2, y2   Difference dA = y1 - y2
 1                     8              5              3
 3                    14             10              4
 4                     8              0              8
 6                     9              7              2
 7                    11              6              5
 9                     3              5             -2
11                     6              0              6
13                     0              0              0
16                    13             12              1
18                    10              2              8
19                     7              5              2
21                    13             13              0
22                     8             10             -2
24                     7              7              0
25                     9              0              9
27                    10              6              4
28                     2              2              0

Group B (placebo/drug)
Serial number   Period 1, y1   Period 2, y2   Difference dB = y2 - y1
 2                    12             11             -1
 5                     6              8              2
 8                    13              9             -4
10                     8              8              0
12                     8              9              1
14                     4              8              4
15                     8             14              6
17                     2              4              2
20                     8             13              5
23                     9              7             -2
26                     7             10              3
29                     7              6             -1

The first step in the analysis is to summarize the basic data, and the differences within each subject, by groups. An adequate summary for these data is provided by means and standard deviations.

Group A (drug/placebo)
             Period 1   Period 2   Difference 1-2
n                  17         17          17
mean             8.12       5.29        2.82
s.d.             3.84       4.25        3.47
s.e. mean                               0.84

Group B (placebo/drug)
             Period 1   Period 2   Difference 2-1
n                  12         12          12
mean             7.67       8.92        1.24
s.d.             2.99       2.81        2.99
s.e. mean                               0.86
The only parts of this summary used in comparing the drugs are the two columns based on differences within subjects. The summaries for period 1 and period 2 separately are useful in interpreting the results because they provide a level against which to assess the change. The standard deviations for the separate periods are not used in this analysis because they include a component due to differences in overall level of response from one subject to another. The standard deviation of the difference is calculated from the individual differences, thus excluding the subject component. The standard error of the mean (s.e. for short) is obtained from the standard deviation of a single observation (s.d.) by the formula s.e. mean = s.d./√n.

Referring back to the algebraic notation, d̄A = 2.82, d̄B = 1.24, so we may estimate the difference in treatment effects by ½(d̄A + d̄B) = 2.03 dry nights out of 14 and the period effect δ by ½(d̄A - d̄B) = 0.79 nights. The easiest way of finding the standard deviation of error in these estimates is from the standard errors, as ½√{0.84² + 0.86²} = 0.60 nights. The size of the treatment difference may be tested to see whether it is significantly different from zero by referring 2.03/0.60 = 3.38 to tables of the distribution of the normal deviate. The 10%, 5% and 1% significance points of this distribution are 1.64, 1.96 and 2.58, so that the difference in this case is significant at less than 1%, i.e. P < 0.01. The 95% confidence interval for the true difference is obtained from 2.03 ± 1.96 × 0.60 = 0.85 to 3.21 nights. The period effect may be tested in the same way, and for this example 0.79/0.60 = 1.32, which is not significant at the 10% level.


There is a slightly more refined way of carrying out this last part of the analysis which is useful when sample sizes are small. Because the standard deviations of the difference in each group are both estimating the same thing, namely σ, they may be pooled to give a single estimate. The pooled estimate (referred to as s) is given by the formula
s² = {(nA - 1)s.d.A² + (nB - 1)s.d.B²} / (nA + nB - 2)

For this example s² = (16 × 3.47² + 11 × 2.99²)/27 = 10.78. The standard error of the estimates is now

½√{s²/17 + s²/12} = 0.62. This is slightly higher than the first value of 0.60 because in forming the pooled estimate of σ the larger s.d. is weighted by 16 (after squaring) and the smaller by 11. The value of 2.03/0.62 = 3.27 is now referred to tables of the t distribution on 27 (= nA + nB - 2) degrees of freedom. As before P < 0.01, but the confidence interval is now 2.03 ± 2.05 × 0.62 = 0.76 to 3.30 nights.
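These calculations are easily checked by computer. The following sketch (in Python, assuming numpy and scipy are available; the variable names are ours and not part of the original analysis) reproduces the estimates, standard errors and both tests from the individual differences.

import numpy as np
from scipy import stats

# Within-subject differences for the enuresis data of Example 1
d_A = np.array([3, 4, 8, 2, 5, -2, 6, 0, 1, 8, 2, 0, -2, 0, 9, 4, 0])   # group A: y1 - y2
d_B = np.array([-1, 2, -4, 0, 1, 4, 6, 2, 5, -2, 3, -1])                # group B: y2 - y1
nA, nB = len(d_A), len(d_B)

treat  = 0.5 * (d_A.mean() + d_B.mean())    # estimate of Tx - Ty, about 2.03
period = 0.5 * (d_A.mean() - d_B.mean())    # estimate of delta, about 0.79

# Large-sample test: separate standard deviations in each group
se_sep = 0.5 * np.sqrt(d_A.var(ddof=1)/nA + d_B.var(ddof=1)/nB)   # about 0.60
z = treat / se_sep                                                # about 3.4

# Refined test: pooled standard deviation and the t distribution on nA + nB - 2 d.f.
s2 = ((nA - 1)*d_A.var(ddof=1) + (nB - 1)*d_B.var(ddof=1)) / (nA + nB - 2)   # about 10.8
se_pool = 0.5 * np.sqrt(s2/nA + s2/nB)                                        # about 0.62
t = treat / se_pool                                                           # about 3.3
p = 2 * stats.t.sf(abs(t), df=nA + nB - 2)
half_width = stats.t.ppf(0.975, df=nA + nB - 2) * se_pool
print(treat, period, z, t, p, treat - half_width, treat + half_width)

The same t statistic can also be obtained as an ordinary two-sample t-test comparing the period 1 minus period 2 differences in group A with those in group B (stats.ttest_ind(d_A, -d_B)); small discrepancies from the figures quoted in the text are due only to rounding of the published summaries.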
The first test, based on the distribution of the normal deviate, is a large sample test and gives the correct significance level no matter what the underlying distribution of the basic observations (in this case a difference) provided the sample sizes are large enough. The second test, based on the t distribution, is an exact test and gives the correct significance level no matter how small the samples are provided the underlying distribution of the basic observation is normal. The distinction between large sample and exact tests is not of great importance with quantitative responses provided sample sizes are 25 or more. It becomes more important with binary responses. For a moderate size of sample both the t-test and the normal deviate test are reasonably insensitive to non-normality, but occasionally this may be extreme, perhaps because of clustering at the
end of the scale or very distant outliers. In such cases the Wilcoxon 2 sample rank test can be used (Armitage, 1971, p. 398). For properly balanced cross-over trials (i.e. with equal numbers in groups A and B) the analysis of variance is mathematically equivalent to the method presented here. However, this method of doing the analysis cannot be recommended for clinical crossovers. One reason is that, because of drop-outs, clinical cross-over trials rarely end up balanced even though they may start that way. An analysis of variance can still be performed, and is equivalent to the t-test described here, but it is no longer as simple as for a balanced design (see Appendix). Another reason is that the analysis of variance can be rather

difficult to understand and there seems no point in using a confusing and difficult method of analysis when an explicit method exists. These remarks apply only to the two-period cross-over; with more than two periods there are advantages in using the analysis of variance although the difficulty of dealing with an unbalanced design remains.

There is one important point which has not yet been considered. What happens if the order is ignored when the data are analysed, so that the treatment contrast is estimated by d̄, the overall mean difference, ignoring the distinction between groups A and B? This is a legitimate thing to do provided the allocation to groups A and B was completely at random. However there is a penalty in accuracy. The error in the analysis just described has standard deviation σ, where σ is the common standard deviation of the difference d within groups A and B. The penalty for ignoring the order in which treatments are given is to increase the standard deviation of error to √(σ² + δ²). In the present example this makes very little difference - a change from 3.28 (the pooled estimate of σ) to √(3.28² + 0.69²) = 3.35 - but this will not always be the case. When δ is large compared to σ there will be a large increase, and without the analysis taking order into account there can be no evidence about the size of δ.
2.2 Analysis for a binary response

It is more difficult to envisage a time trend within a subject for a binary response than for a quantitative response. For example, if the response is whether or not relief from pain is obtained (coded 0 for no and 1 for yes), then one must envisage a trend in a variable which can only change from 0 to 1 and from 1 to 0. In fact the time trend is only evident in the changing probability of a particular response with time. It may be that, regardless of treatment, subjects are more likely to obtain relief in the first period than in the second. The method of analysis is very similar to that for a quantitative response but since the number of different responses is limited it is convenient to start by tabulating the data as below.

Response
X   Y   Preference    Group A   Group B
0   0   None          a00       b00
1   0   X             a10       b10
0   1   Y             a01       b01
1   1   None          a11       b11
        Total         nA        nB

In the absence of any treatment differences and of any time trend we would expect the proportion of


subjects in row 2 (prefers X) to be the same as the proportion in row 3 (prefers Y), but we would have no preconception about the proportion falling in rows 1 and 4. A trial might produce 5% in each of rows 2 and 3, leaving 90% for rows 1 and 4, or it might produce 40% for each of rows 2 and 3, leaving 20% for rows 1 and 4. In both cases the treatments would be regarded as equally effective. Very often rows 1 and 4 are pooled under the heading 'no preference' so that the three categories of interest are 'prefers X', 'no preference' and 'prefers Y'. Each subject must fall into one of these categories.
Example 2. In the comparison of two active drugs (X and Y) for the treatment of rheumatoid arthritis one of the responses was the severity of morning stiffness. When tabulated as preferences the data were:

                Group A   Group B   Total
Prefers X          8         6        14
No preference      6        11        17
Prefers Y          2         1         3
Total             16        18        34

If allocation to groups is random then we may concentrate on the third column, representing the combined experience of both groups, although by doing this we risk some loss of precision if the order effect is large. Under the hypothesis of no difference between treatments the percentage preferring X should be similar to that preferring Y, but this hypothesis tells us nothing about the percentage showing no preference. A way out of this difficulty is to consider just the subjects showing a definite preference and to test for a 1:1 ratio between X and Y. This is done using the sign test (often referred to as McNemar's test in this context). The observed ratio was 14:3 and it is fairly easy to calculate the probability of observing such a ratio, or one even further from 1:1, when the true ratio actually is 1:1. In this instance the probability is 0.012. In practice tables showing the ratios corresponding to P = 0.05 and P = 0.01 are used, so that all calculation is avoided. For a total of 17 subjects showing a preference the ratios corresponding to P = 0.05 and P = 0.01 are 13:4 and 15:2 (Documenta Geigy tables, 1962, p. 105), so that the value of P for 14:3 is between 0.05 and 0.01. The large sample equivalent of the sign test compares the observed proportion 14/17 with the value of 1/2 using the normal deviate ND = (14/17 - 1/2)/√{0.5 × 0.5/17}. A computationally simpler form for the normal deviate is ND = (14 - 3)/√(14 + 3) = 2.67, giving a probability of 0.008. Using a continuity correction of 1 gives

ND = ((14 - 3) - 1)/√(14 + 3) = 2.43, which gives a probability of 0.015, slightly closer to the result of the exact test (0.012) than 0.008. A confidence interval for the true probability of preferring X could be calculated but this is less useful than the corresponding interval in the quantitative case because the probability is conditional on a preference being expressed and is not necessarily a good measure of the difference between treatments.

If there is good reason to expect a time trend then it would be sensible to carry out a test which takes account of the order in which treatments are given. The first thing to notice is that if subjects have been allocated randomly to groups and if there is no difference between treatments then (apart from random variation) the proportions preferring period 1 (and period 2) should be the same for both groups. If the data are tabulated according to preferences for periods we get:

                    Group A   Group B
Prefers period 1       8         1
No preference          6        11
Prefers period 2       2         6
Total                 16        18

Note that the second column is in the reverse order from when it was tabulated by preference for treatment whereas the first column is in the same order. If there is no difference between treatments then column 1 expressed as proportions of 16 should be the same (apart from random variation) as column 2 expressed as proportions of 18. One possible test of this is the test for association in a 3 × 2 contingency table, but as this is a quite general test for any type of association it is likely to be rather insensitive to the particular association which will arise as a result of genuine differences between treatments. Once again the answer lies in ignoring the 'no preference' row and testing the resulting 2 × 2 table for association. In this example the 2 × 2 table is:

                    Group A   Group B
Prefers period 1       8         1
Prefers period 2       2         6
Total                 10         7

and we wish to test whether the difference between 8/10 and 1/7 is evidence of a difference between the corresponding true probabilities. This may be done using the exact test for a 2 × 2 contingency table (Documenta Geigy, 1962, p. 109), and for these data P = 0.05. The large sample test is χ² on 1 degree of freedom, or we can use the normal deviate for the comparison of two proportions. The pooled proportion is (8 + 1)/(10 + 7) = 53%, so the difference between the proportions, which is 80% - 14% = 66%, has s.e. = √{53 × 47 (1/10 + 1/7)} = 25%


giving a normal deviate of 66/25 = 2.64 and a P of 0.01. With such small sample sizes as 10 and 7 the exact test is advisable. The 2 × 2 test for a treatment difference taking account of order was first proposed by Mainland (1963, p. 237) and Gart (1969) developed the theory of the test. In a recent paper Prescott (1979) has discussed the use of the 3 × 2 test for association and proposed a test which is sensitive to the particular change we are interested in but which does not discard the no preference group. It is interesting to note that for large samples, Prescott's test is equivalent to the test for quantitative responses if differences of ±1 (for preferences) and 0 (for no preference) are used.
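If a computer is available the same answers can be obtained directly. The short Python sketch below (assuming scipy 1.7 or later for binomtest; earlier versions provide binom_test instead) applies the sign test and the order-adjusted exact test to the counts of Example 2; the two-sided exact P-value returned for the 2 × 2 table may differ slightly from the value read from the published tables.

import math
from scipy import stats

# Example 2: preferences for treatment, both groups combined
prefers_X, prefers_Y = 14, 3

# Sign test (McNemar's test): is a 14:3 split compatible with a true 1:1 ratio?
exact = stats.binomtest(prefers_X, n=prefers_X + prefers_Y, p=0.5)
print(exact.pvalue)                                                     # about 0.01

# Large-sample normal deviates, without and with a continuity correction
nd    = (prefers_X - prefers_Y) / math.sqrt(prefers_X + prefers_Y)             # 2.67
nd_cc = (abs(prefers_X - prefers_Y) - 1) / math.sqrt(prefers_X + prefers_Y)    # 2.43

# Test taking account of order: preference for periods, by group
table = [[8, 1],    # prefers period 1: group A, group B
         [2, 6]]    # prefers period 2: group A, group B
odds_ratio, p_exact = stats.fisher_exact(table)
print(p_exact)      # exact two-sided P for the 2 x 2 table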

2.3 Analysis for an ordinal response

Ordinal scales are those for which order has a meaning but magnitude does not. For example a scale of severity of a symptom from 0 (absent) to 3 (severe) is ordinal. Data on an ordinal scale look just like quantitative data and the first stage of the analysis is to take differences y1 - y2 for group A and y2 - y1 for group B. For a scale from 0 to 3 these differences could theoretically run from -3 to +3. In practice it is often the case that most of the differences take the values -1, 0 and +1, and if this is so then the data should be analysed as though they were preference data. If the scale is in the direction 0 (good) to 3 (bad) then a negative difference favours X and a positive difference favours Y. A zero difference corresponds to no preference. If the distribution of differences is wider, with appreciable numbers of ±2's or ±3's, it is probably best to analyse the differences as though they were quantitative. There is rather a dilemma here because using only the sign of the difference, as in preference data, obviously wastes some information, yet the differences are not, in fact, quantitative. On the assumption that scales are chosen so that the change from 0 to 1 is roughly equivalent to that from 1 to 2 and so on, it seems reasonable to treat the differences as quantitative, at least for the purpose of testing for significant differences in treatment.

3. Interactions between treatment and period

In Section 2.1 we stressed the importance of the assumption that the treatment effect, denoted there by Tx - Ty, is the same for both periods. The assumption may, of course, be false; if so, there is said to be an interaction between treatment and period. We now discuss ways of testing the assumption that there is no interaction, and in Section 4 we consider what to do if an interaction is found.

3.1 Testing for an interaction

In the notation of Section 2.1, suppose that there is no interaction. Consider two identical subjects with e1 = 20 and e2 = 25 (the units are quite arbitrary), and suppose Tx = 2 and Ty = -2. Then, if one subject is in group A and the other in group B, the observed responses will be:

Group A subject           Group B subject
y1 = 2 + 20 = 22          y1 = -2 + 20 = 18
y2 = -2 + 25 = 23         y2 = 2 + 25 = 27

Note that the mean responses for the two subjects are the same: ½(22 + 23) = ½(18 + 27) = 22.5. In general, the mean response for either subject will be m = ½(e1 + e2) + ½(Tx + Ty). In practice, of course, ½(e1 + e2) will vary from subject to subject, but if the subjects are allotted randomly to the two groups, the mean responses over the two groups should not differ significantly.

Suppose now that there is an interaction, so that the treatments change the responses in the first period by Tx1 and Ty1, and in the second period by Tx2 and Ty2. Then the mean response for the group A subject will be mA = ½(e1 + e2) + ½(Tx1 + Ty2) and that for the group B subject will be mB = ½(e1 + e2) + ½(Ty1 + Tx2). The two values mA and mB will not be equal unless

Tx1 + Ty2 = Tx2 + Ty1,  i.e.  Tx1 - Ty1 = Tx2 - Ty2,

which is the condition for no interaction. A test for the equality of the mean values of mA and mB is thus a sensible way to detect an interaction if it is present, although it might be a rather insensitive test as it is based on the variability of m between subjects. We describe the mechanism of such a test in Section 3.3, but we first consider how an interaction might arise, and how it should be interpreted.

3.2 Reasons for an interaction

An interaction might arise for various reasons:
(i) There may be an inadequate wash-out period, so that a drug used in period 1 may still be present in period 2. The response in period 2 will then be affected by the second treatment, but there will be an additional, residual, effect of the first treatment. If the residual effects of each drug are different then there will be an interaction.

Figure 1 Three hypothetical outcomes to a two-period cross-over trial: (a) no interaction; (b) interaction; (c) interaction. Subjects in group A receive treatment X followed by treatment Y, those in group B receive the treatments in the reverse order. The response to treatment is plotted on the vertical axis against period; see the text for the interpretation of each panel.

(ii) Even when the wash-out is completely effective, the physiological or psychological state induced by the first treatment may to some extent persist, so that the subjects are no longer comparable in their clinical state at the start of period 2. The effect may then be rather similar to that of (i).
(iii) The treatment effect may vary according to the general level of response; for example subjects with a high value of m may show a greater treatment difference, Tx - Ty, than those with smaller values of m. Suppose also that there is a period effect, so that on the average m2 is higher than m1. Then we should expect Tx2 - Ty2 to be higher than Tx1 - Ty1.

It is often difficult to distinguish between these possible mechanisms on the basis of the mean responses. Figure 1 shows three possible diagrams, in each of which there is a period effect, with period 2 giving the higher responses. In Figure 1(a) the mean responses show a pattern similar to that postulated at the beginning of Section 3.1: there is no interaction, the mean responses for the two groups coinciding, but there is a treatment effect, with X giving higher responses than Y. Figure 1(b) could be interpreted as in (i) and (ii) above, with X exerting a large effect in period 1 which persists in period 2. Alternatively, as in (iii), it may be that at the high level of response shown in period 2 one cannot normally expect much of a treatment difference: the response may have reached some natural upper limit. Figure 1(c) could be interpreted as in (iii): at the higher levels of response seen in period 2, a higher treatment effect is found. Alternatively, as in (i) or (ii), one could suggest that Y produces a residual effect in period 2 for group B; this is perhaps less plausible than the first

explanation because the direct effect of Y in period 1 is lower than that for X. If the explanation (iii) of an interaction is correct, that the treatment difference varies with the level of response, it should be possible to demonstrate this by studying variation within the groups. For each subject in group A we could calculate the change in response, dA = y1 - y2, and the mean response mA = ½(y1 + y2), and see whether there is any association between these two variables. This could be done by plotting a scatter diagram between the change and the mean response, and perhaps following this with a regression or correlation analysis to calculate the significance of the association. Similarly, in group B, the association between the change dB = y2 - y1 and the mean mB = ½(y1 + y2) would be studied. If there is no such association in either group, the evidence will favour an explanation in terms of a residual effect.

Suppose, though, there is association within the groups. In Figure 1(b), for instance, suppose that in group A the higher values of mA are associated with the lower values of dA. This might be explained, as in (iii), by a saturation effect, high responses not being susceptible to treatment changes. Alternatively a residual effect, as in (i) or (ii), might vary from subject to subject, being greater for the subjects with high responses where perhaps the direct effect is higher. The issue might then be resolved by group B, where the saturation hypothesis would lead to a negative association between dB = y2 - y1 and mB, whereas the residual effect hypothesis would not require an association (since no residual effect of Y is postulated). These arguments are clearly somewhat speculative,


largely because there are so many ways in which the simple model of Section 2 could break down. We take up these points again in Section 4.

3.3 Test for interaction: quantitative response

The test has already been outlined in Section 3.1. For each subject, calculate m = ½(y1 + y2), and the mean values of m for the two groups, denoted by m̄A and m̄B. The test is based on the difference m̄A - m̄B. The standard deviation of error is

s.e.(m̄A - m̄B) = √{s.e.²(m̄A) + s.e.²(m̄B)} = √{σm²/nA + σm²/nB}

if the m's can be assumed to have the same standard deviation σm in each group. As in the corresponding test for treatment effects in Section 2.1, σm can either be estimated separately for each group, leading to an approximate test using the normal distribution, or by using a pooled estimate s²m, leading to a t-test which is exact if the m's are normally distributed. In practice it will often be a little simpler to do the calculations in terms of the sum of the two readings rather than the mean. In the example of Section 2.1, concerning the treatment of enuresis, the totals are as follows:

Group A: drug/placebo
Serial number:   1   3   4   6   7   9  11  13  16  18  19  21  22  24  25  27  28
Total response: 13  24   8  16  17   8   6   0  25  12  12  26  18  14   9  16   4
n = 17, mean = 13.41, s.d. = 7.32, s.e. mean = 1.78

Group B: placebo/drug
Serial number:   2   5   8  10  12  14  15  17  20  23  26  29
Total response: 23  14  22  16  17  12  22   6  21  16  17  13
n = 12, mean = 16.58, s.d. = 4.98, s.e. mean = 1.44

The difference in means is 16.58 - 13.41 = 3.17 with s.e. = √{1.78² + 1.44²} = 2.29, giving a normal deviate of 3.17/2.29 = 1.38, for which 0.1 < P < 0.2. Alternatively the pooled estimate of standard deviation is 6.47, giving s.e. = 2.44, and the ratio 3.17/2.44 = 1.30 is referred to tables of the t-distribution on 27 (= nA + nB - 2) degrees of freedom. The value is not significant (0.1 < P < 0.2), and there is no good evidence of an interaction. An alternative and equivalent approach is by the analysis of variance (see Appendix). If extreme non-normality is suspected the Wilcoxon two-sample rank test can be used instead of the t-test.

3.4 Test for interaction: binary response

As in Section 2.2 the data can be summarized by tabulating the number of subjects who show the four possible combinations of responses to the two treatments. Following the general line of argument used in Section 3.1, the effect of an interaction would be to cause the two groups A and B to differ in the mean level of response, and this would be shown by the relative proportions of subjects falling into the categories (0, 0) and (1, 1). That is, an interaction should lead to the frequencies

a00   b00
a11   b11

being disproportionate, and this could be tested in the usual way by the exact test for a 2 × 2 table, or by the large sample tests using the standard error of a difference in proportions or χ² (cf. the 2 × 2 table in Section 2.2). The theory underlying this test for interaction is somewhat inexact. Strictly we should state whether the interaction is present on a probability scale or on a log-odds (or logit) scale (Armitage, 1971, p. 359). If the result of the test is significant on the rather small numbers usually available in cross-over trials, the disproportion in the frequencies will be quite large, and an interaction is likely to be present on either scale.

3.5 Test for interaction: ordinal response

The extension to ordinal responses is straightforward. The total score is obtained for each subject. If, for example, the ordinal scale runs from 0 to 3, the total score runs from 0 to 6. It will usually be adequate to apply the methods of Section 3.3, but again, if serious non-normality exists, particularly through clustering at one end of the scale, the Wilcoxon two-sample test may be used (with the appropriate adjustment for tied readings).
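As a companion to Sections 3.3-3.5, the quantitative interaction test is simply a two-sample comparison of the per-subject totals (or, equivalently, the means m). A minimal Python sketch for the enuresis data, under the same assumptions as before and with variable names of our own choosing, is:

import numpy as np
from scipy import stats

# Per-subject totals y1 + y2 for the enuresis trial (Section 3.3)
tot_A = np.array([13, 24, 8, 16, 17, 8, 6, 0, 25, 12, 12, 26, 18, 14, 9, 16, 4])
tot_B = np.array([23, 14, 22, 16, 17, 12, 22, 6, 21, 16, 17, 13])

# Two-sample t-test on the totals: working with totals rather than the means
# m = (y1 + y2)/2 changes nothing, since each total is just 2m.
t, p = stats.ttest_ind(tot_B, tot_A, equal_var=True)
print(t, p)    # t about 1.3 on 27 d.f., P about 0.2: no good evidence of an interaction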


4. The procedure if a treatment × period interaction is present

The tests for treatment effect described in Section 2 assume no interaction, and the effect is therefore estimated jointly from the results of period 1 and period 2. If an interaction is present, these methods are inappropriate. In this situation there is a qualitative difference between the reliability of the results from period 1 and period 2. In period 1 the responses for X and Y can be compared quite fairly because of the random allocation of subjects to the two treatments. In period 2, on the other hand, the subjects start in a potentially dissimilar state because of the different experience they have had in period 1. We should therefore place much more reliance on the comparison of treatments in period 1, and it seems sensible to study these alone.

The response in the two groups in period 1 may be compared by a two-sample t-test like that applied in Section 2.1 to the d's (or, again, by a Wilcoxon two-sample test if non-normality causes concern). Note, though, that the random variation between readings in the same group is between subjects, whereas the d's were calculated within subjects and are unaffected by variation in the general level of response for different subjects. The treatment effect may therefore be less precisely estimated when the calculations are based only on period 1 than when the methods of Section 2.1 are used; indeed this is one of the arguments in favour of the cross-over design. However, the presence of an interaction may force us to restrict the calculations to period 1.

If the response is measured during an initial pre-treatment period (period 0) then, denoting this by y0, we can calculate for each subject z = y1 - y0. Since subjects are allocated randomly to the two groups, the mean value of y0 should be the same for the two groups apart from random fluctuations, and the difference between the mean values of z, z̄A - z̄B, should be equal to Tx - Ty (again apart from random fluctuations). The treatment effect can therefore be estimated by z̄A - z̄B and its significance tested by a two-sample t-test on the z's. The advantage of this procedure over that using the y1's is that the z's are (like the d's) differences within subjects, and may therefore be expected to show less random variation than the y1's.

Similar arguments apply to binary and ordinal data. For binary data, for example, if an interaction has been established by the methods of Section 3.4, the first period responses alone could be used to compare the treatments. This would involve comparing the proportion of 1's in the two groups by the usual methods for 2 × 2 tables (the exact test, or the standard error, or χ² approximations; cf. Section 2.2). If period 0 readings are available the same form of comparison could be performed (a) for subjects scoring 0 in period 0, and (b) for subjects scoring 1 in period 0. If the differences between groups A and B in the proportion of 1's in period 1 are similar for (a) and (b) the contrasts can be combined by Cochran's test (Armitage, 1971, p. 370) or the similar method described by Hills (1974, p. 66). If not, some useful information will have been gained about the interaction between treatment effect and initial score.

Returning to quantitative responses, suppose the interaction is of type (iii) in Section 3.2; that is, the effect of treatment varies with the general level of response. It may be possible to transform the scale of measurement in such a way that this tendency is removed. For example, if, as in Figure 1(c), the treatment effect seems to be higher at higher responses, a transformation such as the logarithm or square root of the original response measurement may remove the interaction. A certain amount of trial and error may be needed to find a new scale on which the interaction is absent. This approach may give rise to difficulties because the investigator may not wish to report results in terms of the transformed scale. For example, if the original response is the duration of symptoms (in days) it may be preferable to report the presence of an interaction and to estimate the treatment effect from period 1, on this scale, rather than to report no interaction and do the calculations of Section 2.1 on, say, the square root of the time period.

5. The use of additional observations

The discussion so far has been in terms of a response which is supposed to summarize the state of a subject for the period under consideration. Very often, in practice, this response is based on an underlying variable which is observed at intervals throughout the trial. For example the variable (say blood pressure) might be observed when the subject enters the trial (initial value) and then at weekly intervals broken down as follows:
(i) weeks 1-2: run-in period
(ii) weeks 3-6: first treatment period
(iii) weeks 7-8: wash-out period
(iv) weeks 9-12: second treatment period.
The run-in period is intended to accustom the subject to being in a trial and to provide an opportunity to withdraw any additional medication which could interfere with the trial. The wash-out period is included to provide time for the effects of the first treatment to wear off by the beginning of the second treatment period. The usual times to take base-line readings are at the end of the run-in and the end of the wash-out. These are referred to as y0 and y'0 respectively. In the absence of a run-in period the initial value serves for y0 but


there can be no second base-line, y'0, without a wash-out period. The values at the ends of weeks 3-6 all contribute to y1, the response for period 1. Obvious choices for y1 are the value at week 6 and the average of all four readings. If the basic measurement shows a lot of variability within a subject (from time to time) then it will usually be better to take the average. If the treatment is expected to take a week or more to achieve its full effect then the average could be restricted to the last three or even the last two readings.

The first base-line value, y0, serves several purposes. It is useful for describing the actual sample of patients who entered the trial. It is helpful for assessing the size of any difference between the effects of treatments; and it may be used to stratify patients into groups with roughly similar base-line values within groups. In some trials the initial base-line is even used to select patients for the trial so that only those whose base-line value matches some criterion are entered. This practice can be rather misleading when the measurement displays a lot of variation within a subject and should, where possible, be avoided.

One important use of an initial base-line has already been mentioned in Section 4 where, because of interaction, the comparison is based on the results of period 1 only. Then the use of y1 - y0 will usually give a more precise comparison of treatments than y1 alone, although this will not always be the case. If y1 and y0 have the same standard deviation σ and have correlation coefficient ρ then

s.d.(y1 - y0) = σ√{2(1 - ρ)},   s.d.(y1) = σ.

Hence the use of y1 - y0 will be more precise than y1 provided ρ is greater than 0.5. A more general approach is to plot a scatter diagram of y1 versus y0 for each group. If the plots are roughly linear then they should have the same slope and a straight line with common slope b can be fitted to each. The treatments are then compared using y1 - by0 in place of y1, both for estimating treatment effects and for estimating error. The straight line is usually fitted by least squares (regression analysis) and at the cost of a little more trouble it is possible to carry out an analysis called the analysis of covariance which takes account of the error in estimating the slope of the line. Further details can be found in Armitage (1971, p. 288). This general technique includes y1 - y0 and y1 as special cases and it is interesting to note that when y1 and y0 have the same standard deviation σ then b equals ρ, the correlation between y1 and y0, and s.d.(y1 - ρy0) = σ√(1 - ρ²). This is equal to σ when ρ = 0 but is otherwise always less than both σ and σ√{2(1 - ρ)}. Note that ρ = 0 corresponds to using y1 and ρ = 1 corresponds to

using y1 - y0. For ρ = 0.5 both y1 - y0 and y1 have the same standard deviation σ, while y1 - 0.5y0 has s.d. = 0.87σ.

If a second base-line, y'0, is established between periods 1 and 2, then there are several ways in which it can be used. If the average values of y0 - y'0 for groups A and B differ from zero, but not from each other, then this suggests that there is a time trend, although it should be noted that there may well also be equal residual effects from the treatments which are indistinguishable from a time trend. If the average values of y0 - y'0 differ significantly between groups then this suggests unequal residual effects. This test for unequal residual effects is not an alternative to the test for interaction described in Section 3. The concept of interaction is more general than that of unequal residual effects, which is just one of its possible causes. The proper assessment of interaction must take account of the treatment effects during the second period.

If there is no interaction then the treatment effects should be estimated using both periods and we are faced with the problem of whether y0 or y'0 or both can be used to increase precision. There is no very clear answer to this. Using a simple model with a consistent subject effect and independent errors within a subject, it is clear that it is better to use the absolute level of response than the level measured from base-line, that is y1 - y2 is better than (y1 - y0) - (y2 - y'0). This is because the consistent subject effect is removed in both, yet the latter has additional errors from the y0 and y'0. However, the subject effect may not stay at the same level throughout the trial, in which case it might be better to measure from the base-lines. For example, any residual effects from the first period should disappear in the difference y2 - y'0. It is difficult to give guidance here apart from pointing out that the more cautious approach is to measure from the base-lines. The worst this can do is add extra variability but it could remove bias in the estimated treatment difference. These remarks apply mainly to measurements on a quantitative or ordinal scale. The concept of a base-line is not so useful with binary scales.

Cross-over trials are not necessarily restricted to two treatment periods. Additional observations can arise from additional treatment periods. These may be used for further observations of the treatments already compared or for the comparison of additional treatments. Cross-over trials with more than two periods are considerably less common than those with two because of the practical problems of keeping a trial going for long enough and ensuring that subjects follow the rules about treatment. There are more possibilities for allocating treatments in different orders and allocation is usually balanced using Latin squares. This ensures that each treatment appears the same number of times in each period over

the whole trial. More elaborate designs which balance for the number of times each treatment follows another are also available. A good discussion of the subject is given in Cox (1958).

6. The sensitivity of a trial

A trial which concludes that there is a highly significant difference between treatments (in the statistical sense) is obviously sufficiently sensitive for its purpose, but for a trial which concludes that there is no significant difference a further question arises: is there genuinely no difference or was the trial too small? This question must be phrased more precisely before it can be answered but some sort of answer is obviously essential to the proper assessment of the result of the trial. The power of a trial, often denoted by 1 - β, is defined to be the probability that the trial will produce a difference between treatments which is significantly different from zero at a certain level (usually taken to be 5%). The choice of 5% is arbitrary but it is convenient to work with a single level of significance and 5% is conventional. Power, defined in this way, is a probability, and its size depends on the size of the trial, the variability of the response and the size of the true difference which actually exists between treatments. A plot of the power against the true difference, for a given trial, is known as the power curve, or sometimes the sensitivity curve, of the trial.

Most statistical tests involve relating a normal deviate to tables of the normal distribution, i.e. declaring a difference to be significantly different from zero at the 5% level if the normal deviate is numerically greater than 2 (1.96 to be precise). The normal deviate takes the form ND = sample difference/standard error. For tests of this kind the power of the trial will depend only upon the value of D/s.e. = true difference/standard error. A plot of power versus values of D/s.e. provides the basis for all power calculations involving normal deviate tests. The plot is shown in Figure 2 and some useful values are listed below.

Figure 2 Power of a test based on a normal deviate. The true difference is D and the standard error is s.e. The power is plotted against the ratio D/s.e.

Power    D/s.e.
90%      3.24
85%      3.00
80%      2.80
75%      2.64
70%      2.48
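These values can be reproduced directly from the normal distribution; a minimal Python sketch (assuming scipy is available) is:

from scipy.stats import norm

def power(ratio, alpha=0.05):
    # Power of a two-sided normal-deviate test when (true difference)/s.e. = ratio
    z = norm.ppf(1 - alpha/2)                  # 1.96 for a 5% test
    return norm.cdf(ratio - z) + norm.cdf(-ratio - z)

for r in (3.24, 3.00, 2.80, 2.64, 2.48):
    print(r, round(power(r), 2))               # 0.9, 0.85, 0.8, 0.75, 0.7

# Conversely, the ratio needed for a given power is norm.ppf(1 - alpha/2) + norm.ppf(power);
# for 80% power this is 1.96 + 0.84 = 2.80.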

There are several general features of the plot in Figure 2 which are worth commenting on. The curve is symmetric about D = 0 because the difference between treatments can be taken in either order without altering the result of the (two-sided) significance test. The power at D = 0 is 5% because the significance test is at the 5% level. At values of D different from zero the power increases slowly at first, then more rapidly, then slowly again. If the size of a trial is increased this will decrease the standard error which in turn increases the value of D/s.e. and provides a higher power. A good power to aim at in general is about 80%. Much higher might require too large a trial to be practicable, and much lower might be too insensitive to be useful. A power curve for a trial which has already been carried out provides the answer to the question about sensitivity. If true differences between treatments of an appreciable size (i.e. clinically appreciable) could only have been detected with low power then the trial was too insensitive. Somebody, of course, still has to decide what is a 'clinically appreciable difference' and what is 'low power'. Both of these pose problems but at least the main question has been analysed into its component parts and it is clear what must be agreed before an answer is possible. It can be difficult to decide what is an appreciable difference when the response is a rating scale, without physical meaning, but previous experience with such scales can provide


a guide. Whether or not a power is too low depends on the context in a complicated way. The cost of running larger trials (both in terms of money and fewer new drugs), the cost of marketing ineffective drugs, and the cost of failing to market an effective drug, must all be taken into account.

A power curve for a trial which is yet to be carried out provides an answer to how large the trial should be (or whether the chosen size is large enough). The main difference between this and the case where the trial has already been carried out is that now there is no information about the variability of the response. Sometimes this can be roughly estimated from past experience and sometimes (as when comparing proportions) a lower limit is known on theoretical grounds, but there are times when the variability is totally unknown and must be guessed before the power of a trial can be calculated.

The use of power curves to assess sensitivity is just as important for cross-over trials as for parallel group trials but the relevant variability is within rather than between subjects. As an example consider the trial described in Example 1, which produced a treatment effect of 2.03 nights with s.e. = 0.60. This treatment effect is statistically significant so the trial was sufficiently sensitive to discriminate between the drug and placebo. Suppose though that the trial was yet to be carried out and that the number of patients was to be chosen to pick up a difference of 1 dry night out of 14 with a power of 80%. The first requirement is the s.d. of the change from one period to another. Let us suppose that data on enuresis for several weeks on a number of untreated subjects gave a rough estimate for this s.d. of 3.5 nights. Then the s.e. of a mean difference based on n subjects is 3.5/√n. In the notation of Example 1, if nA = nB = n, then s.e.(d̄A) = 3.5/√n, s.e.(d̄B) = 3.5/√n and

s.e.{½(d̄A + d̄B)} = ½√{3.5²/n + 3.5²/n} = 2.47/√n.

If D, the true treatment effect, is 1 night then D/s.e. = √n/2.47 and to achieve a power of 80% this must be 2.8 (from Figure 2). Thus n = (2.8 × 2.47)² = 48 and there should be 96 subjects in the trial. To achieve a power of 80% for a difference of 2 nights would require n = 12, giving 24 subjects in the trial. Note that n varies inversely as D², so decreasing the size of the difference to be picked up can increase the size of the trial considerably. With a quantitative measurement of response such as the one above there are no additional complications in working out power due to the cross-over design. The only problem when working out the power of a trial yet to be performed is to get an estimate of the standard deviation of change in response within a subject. At the worst, a very rough estimate can be obtained by equating 4 × s.d. to a guess for the likely range of possible values. In the enuresis example the change from one fortnight to another is unlikely to be more than 10 nights, giving a range of ±10 nights. Equating 4 s.d. to 20 gives s.d. = 5.

The discussion so far has assumed that the measurement of response is normally distributed. However, the results rely heavily on the fact that the error distribution for a mean value is likely to be nearly normal, whatever the distribution of the individual values might be. The methods described are thus entirely adequate for the purposes of planning the trial.

The situation for a binary response is more complicated because not all of the subjects contribute to the statistical test of significance. The 'n' of the power calculations is now the number of subjects who express a preference and the 'D' is now the difference between the probability of preferring X (given that a preference is expressed) and ½. Neither is easy to work with; the natural way of thinking about the trial is in terms of the total size and the overall success rates for X and Y separately. Because of this it is usually helpful to consider the very simple situation in which there is no period effect and each subject behaves quite independently in the two periods (i.e. like two separate subjects). If the overall success rates for X and Y are u and v then the failure rates are 1 - u and 1 - v and the results for N subjects would be expected to look like this:

X   Y   Number of subjects
0   0   N(1 - u)(1 - v)
1   0   Nu(1 - v)
0   1   N(1 - u)v
1   1   Nuv
Total   N

There is no separation into groups A and B as there is no period effect. The probability of preferring X, given that a preference is expressed, is now u(1 - v)/{u(1 - v) + (1 - u)v} and the 'n' of the power calculations is now

N{u(1 - v) + (1 - u)v}.

Consider a numerical example in which u is 60% and v is 80%. The expected results for 100 subjects are:

X   Y   Number of subjects
0   0   0.4 × 0.2 × 100 =  8
1   0   0.6 × 0.2 × 100 = 12
0   1   0.4 × 0.8 × 100 = 32
1   1   0.6 × 0.8 × 100 = 48
Total                     100


pr(preferring X) = 12/(12 + 32) = 0.27
D = 0.50 - 0.27 = 0.23
N = 100
n = 100 (0.6 × 0.2 + 0.4 × 0.8) = 44.

In this example a difference between u and v of 20% gives a D of 0.23 and n which is 44% of the total sample size. If this difference between u and v is to be picked up with a power of 80% then D/s.e. must equal 2.8. Since D = 0.23 and the s.e. is

√{0.5 × 0.5/n} = 0.5/√n, we have 0.23√n/0.5 = 2.8, i.e. n = 37 and N = 37/0.44 = 84. Thus 84 subjects would be enough to pick up a difference of 80% - 60% in overall success rates with a power of 80%. Although these calculations are based on a number of simplifying assumptions they do give some idea about sensitivity.
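The arithmetic of these sample-size calculations can be packaged in a few lines of Python. The sketch below makes the same assumptions as the text (a within-subject s.d. of change of 3.5 nights in the quantitative case; success rates u and v with no period effect and independent periods in the binary case); the function names are ours.

from math import sqrt

RATIO_80 = 2.8   # D/s.e. needed for 80% power with a two-sided 5% test (Figure 2)

def n_per_group(D, sd_change=3.5, ratio=RATIO_80):
    # Quantitative response: subjects per group when nA = nB = n.
    # The s.e. of the treatment estimate is 0.5*sqrt(2)*sd_change/sqrt(n).
    se1 = 0.5 * sqrt(2.0) * sd_change
    return (ratio * se1 / D) ** 2

print(n_per_group(1.0))   # about 48 per group, i.e. 96 subjects, as in the text
print(n_per_group(2.0))   # about 12 per group, i.e. 24 subjects

def total_subjects(u, v, ratio=RATIO_80):
    # Binary response: total number of subjects N needed.
    pref = u*(1 - v) + (1 - u)*v      # proportion expected to express a preference
    D = abs(u*(1 - v)/pref - 0.5)     # departure of pr(prefers X) from 1/2
    n = (ratio * 0.5 / D) ** 2        # number of preferences needed
    return n / pref                   # total subjects needed

print(total_subjects(0.6, 0.8))   # about 86; the text obtains 84 after rounding D to 0.23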
Discussion
To a physician it generally makes more sense to compare the same patient at two different times than to compare two different patients. To a statistician it makes for greater efficiency if each patient is used as two experimental units rather than one, particularly if within patient variability is less than between patient variability. Why, therefore, should the cross-over not be the automatic choice for any trial? The obvious answer is that it makes no sense in some clinical situations and the degree to which it does make sense in others is very difficult to determine. Accumulated experience of two treatments might establish that patients could be treated in a cross-over fashion without biasing the comparison between treatments, but accumulated experience is usually lacking with

clinical trials. Without such experience, to choose a cross-over is to take a chance. If the results of the trial suggest a definite interaction between treatments and periods then the chance has not come off and the treatment comparison should be based on the first period alone. This might seem to be no great loss, since this is just what would have happened in a parallel group study, but all too often the size of the trial, chosen with a cross-over in mind, is too small for a proper comparison between groups. A more serious potential hazard is that the results of the trial do not establish a definite interaction because the test, based on between-subject variability, is insensitive. If the size of the trial has been chosen with a cross-over in mind then it is quite likely that an appreciable interaction could be missed and a biased estimate of the treatment comparison obtained.

The logical conclusion from these two points is that, to be safe, a cross-over trial should always be large enough for it to be analysed as a parallel group study, and also for any appreciable interaction to be detected on a between-subject basis. To insist on this would largely cancel the statistical advantages of the cross-over. Where this leaves us is far from clear but two points emerge:
1. If an unequivocal result is required from a clinical trial and enough patients are available then a parallel group study should be carried out. The use of a base-line can improve precision but even with this at least twice as many patients will be required as for an equivalent cross-over.
2. If the number of patients is limited and a cross-over design is chosen, then the internal evidence that the basic assumptions of the cross-over are fulfilled must be presented and, if necessary, the conclusions should be based on the first period only.

References
ARMITAGE, P. (1971). Statistical methods in medical research. Oxford and Edinburgh: Blackwell Scientific Publications.
COX, D.R. (1958). Planning of experiments. New York: Wiley.
Documenta Geigy (1962). Scientific tables, 6th edition. Manchester: Geigy Pharmaceutical Company Limited.
GART, J.J. (1969). An exact test for comparing matched proportions in cross-over designs. Biometrika, 56, 75-80.
GRIZZLE, J.E. (1965). The two period change-over design and its use in clinical trials. Biometrics, 21, 467-468.
HILLS, M. (1974). Statistics for comparative studies. London: Chapman and Hall.
MAINLAND, D. (1963). Elementary medical statistics, 2nd edition. Philadelphia and London: W.B. Saunders Company.
PRESCOTT, R.J. (1979). The comparison of success rates in cross-over trials in the presence of an order effect. Applied Statistics (in press).

(Received September 14, 1978)


Appendix. Analysis of variance for a quantitative response

Since there are two sources of random variation, between and within subjects, the analysis of variance is of the split-plot type. The lay-out is shown in the following table:

                        Degrees of freedom   Sum of squares   Mean square
Between subjects
  T × P                 1                    C                C
  Residual              nA + nB - 2          E                2s²m
Within subjects
  T unadj.              1                    G'
  P adj. for T          1                    H
or
  T adj. for P          1                    G
  P unadj.              1                    H'
  Residual              nA + nB - 2          J                s²/2
Total                   2(nA + nB) - 1       K

Tests for treatments and periods are obtained from the 'within subjects' part of the analysis. In an unbalanced design, with nA not equal to nB, the test for treatments is based on the sum of squares for treatments adjusted for periods. Similarly the test for periods is based on the sum of squares for periods adjusted for treatments. There are thus two ways of splitting up the two degrees of freedom for treatments and periods jointly. The formulae for these sums of squares are:

G = nA nB (d̄A + d̄B)² / {2(nA + nB)},    G' = (nA d̄A + nB d̄B)² / {2(nA + nB)}
H = nA nB (d̄A - d̄B)² / {2(nA + nB)},    H' = (nA d̄A - nB d̄B)² / {2(nA + nB)}

The residual sum of squares within subjects is J = ½{(nA - 1)s²A + (nB - 1)s²B}, where s²A and s²B are the variances of the differences in the two groups. The residual mean square is thus s²/2 (cf. Section 2.1). The variance ratio G/(s²/2) is exactly the square of the t statistic derived in Section 2.1 as a test for treatment effects, and the two methods are entirely equivalent. Similarly, the period effect is tested by the variance ratio H/(s²/2), and this is equivalent to the corresponding t-test. When nA = nB, the two methods of subdivision are equivalent, since G = G' and H = H'; the two main effects, treatments and periods, are then said to be orthogonal.

The between subjects sums of squares are calculated from the formulae:

C = 2 nA nB (m̄A - m̄B)² / (nA + nB),
E = 2{(nA - 1)s²A + (nB - 1)s²B},

where s²A and s²B are now the variances of the means in the two groups. The residual mean square between subjects is 2s²m, where s²m is the pooled variance of the means (cf. Section 3.3), and the variance ratio C/(2s²m) is the square of the t statistic referred to in Section 3.3, i.e. the test for treatment × period interaction. The total sum of squares, K, is not needed except as a check, but is derived in the usual way as the sum of squares of deviations of all readings about the overall mean.
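As a numerical check, these sums of squares can be computed from the group summaries quoted in Sections 2.1 and 3.3. The short Python sketch below (using the published, rounded summaries; all variable names are ours) confirms that the variance ratios are the squares of the corresponding t statistics, apart from rounding error.

# Split-plot sums of squares for the enuresis example, from the rounded group summaries
nA, nB = 17, 12
dbarA, dbarB, sdA, sdB = 2.82, 1.24, 3.47, 2.99    # within-subject differences
mbarA, mbarB = 13.41/2, 16.58/2                    # means m = (y1 + y2)/2 = total/2
sdmA, sdmB = 7.32/2, 4.98/2                        # s.d. of the means = s.d. of totals / 2

G = nA*nB*(dbarA + dbarB)**2 / (2*(nA + nB))       # treatments adjusted for periods
H = nA*nB*(dbarA - dbarB)**2 / (2*(nA + nB))       # periods adjusted for treatments
s2 = ((nA - 1)*sdA**2 + (nB - 1)*sdB**2) / (nA + nB - 2)

C = 2*nA*nB*(mbarA - mbarB)**2 / (nA + nB)         # treatment x period interaction
s2m = ((nA - 1)*sdmA**2 + (nB - 1)*sdmB**2) / (nA + nB - 2)

print(G / (s2/2))     # about 3.27**2 = 10.7: square of the treatment t statistic
print(H / (s2/2))     # about 1.27**2 = 1.6: square of the t statistic for the period effect
print(C / (2*s2m))    # about 1.30**2 = 1.7: square of the interaction t statistic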
