Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at
http://www.jstor.org/page/info/about/policies/terms.jsp.
JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of
content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms
of scholarship. For more information about JSTOR, please contact support@jstor.org.
International Biometric Society is collaborating with JSTOR to digitize, preserve and extend access to
Biometrics.
http://www.jstor.org
This content downloaded by the authorized user from 192.168.82.209 on Mon, 19 Nov 2012 06:33:45 AM
All use subject to JSTOR Terms and Conditions
BIOMETRICS 33, 159-174
March 1977
The Measurement of Observer Agreement for Categorical Data
J. RICHARD LANDIS
Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109 U.S.A.
GARY G. KOCH
Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27514 U.S.A.
Summary
This paper presents a general statistical methodology for the analysis of multivariate
categorical data arising from observer reliability studies. The procedure essentially involves
the construction of functions of the observed proportions which are directed at the extent to
which the observers agree among themselves and the construction of test statistics for
hypotheses involving these functions. Tests for interobserver bias are presented in terms of
first-order marginal homogeneity and measures of interobserver agreement are developed as
generalized kappa-type statistics. These procedures are illustrated with a clinical diagnosis
example from the epidemiological literature.
1. Introduction
Researchers in many fields have become increasingly aware of the observer (rater or
interviewer) as an important source of measurement error. Consequently, reliability studies
are conducted in experimental or survey situations to assess the level of observer variability
in the measurement procedures to be used in data acquisition. When the data arising
from such studies are quantitative, tests for interobserver bias and measures of
interobserver agreement are usually obtained from standard ANOVA mixed models or random
effects models such as those discussed in Anderson and Bancroft [1952], Scheffé [1959],
and Searle [1971]. As a result, hypothesis tests of observer effects are used to investigate
interobserver bias, i.e., differences in the mean response among observers, and estimates
of intraclass correlation coefficients are used to measure interobserver reliability.
Modifications and extensions of these standard ANOVA models have been proposed by Grubbs
[1948, 1973], Mandel [1959], Fleiss [1966], Overall [1968], and Loewenson, Bearman and
Resch [1972] to evaluate the measurement error in various types of applications. Although
assumptions of normality for these models may not be warranted in certain cases, the
ANOVA procedures discussed in Searle [1971] and the symmetric square difference
procedure in Koch [1967, 1968] still permit the estimation of the appropriate components
of variance and the reliability coefficients.
On the other hand, many observer reliability studies involve categorical data in which
the response variable is classified into nominal (or possibly ordinal) multinomial categories.
As reviewed in Landis and Koch [1975a, 1975b], a wide variety of estimation and testing
procedures have been recommended for the assessment of observer variability in these
Key Words: Observer agreement; Multivariate categorical data; Kappa statistics; Repeated measurement
experiments; Weighted least squares.
160 BIOMETRICS, MARCH 1977
AGREEMENT MEASURES FOR CATEGORICAL DATA 161
Table 1
DIAGNOSTIC CLASSIFICATION REGARDING MULTIPLE SCLEROSIS

                     Class     1     2     3     4   Total   Proportion
                       4       3     7     3    10     23      0.154
                     Total    84    37    11    17    149

                       1       5     3     0     0      8      0.116
New Orleans            2       3    11     4     0     18      0.261
Neurologist (1)        3       2    13     3     4     22      0.319
                       4       1     2     4    14     21      0.304
                     Total    11    29    11    18     69
3. Methodology
Let $i = 1, 2, \ldots, s$ index a set of sub-populations from which random samples have
been selected. Suppose that the same response variable is measured separately by each
of $d$ observers using an $L$-point scale. Let the $r = L^d$ response profiles be indexed by a
vector subscript $\mathbf{j} = (j_1, j_2, \ldots, j_d)$, where $j_g = 1, 2, \ldots, L$ for
$g = 1, 2, \ldots, d$. Furthermore, let $\pi_{i\mathbf{j}} = \pi_{i, j_1 j_2 \cdots j_d}$ represent
the joint probability of response profile $\mathbf{j}$ for randomly
$$\kappa = 1 - \frac{\cdots}{\cdots} \qquad (3.2)$$
Moreover let $0 \le w_{h\mathbf{j}} \le 1$ for $h = 1, 2, \ldots, u$ over all $\mathbf{j}$, so that
the resulting estimates are interpretable as probabilities of agreement. Then the observational
probability of agreement associated with the $h$th set of weights in the $i$th sub-population is
the weighted sum

$$\lambda_{ih} = \sum_{\mathbf{j}} w_{h\mathbf{j}}\, \pi_{i\mathbf{j}} \quad \text{for } i = 1, 2, \ldots, s;\ h = 1, 2, \ldots, u. \qquad (3.3)$$
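The weighted sum in (3.3) can be illustrated with a minimal sketch, assuming sample proportions are substituted for the cell probabilities and using the complete New Orleans patient block of Table 1. The function name `observed_agreement` and the variable names are illustrative choices, not notation from the paper; the identity weights correspond to counting exact agreement only.

```python
from fractions import Fraction

# New Orleans patients from Table 1: rows are the New Orleans neurologist's
# diagnostic class (1..4), columns are the second neurologist's class (1..4).
counts = [
    [5, 3, 0, 0],
    [3, 11, 4, 0],
    [2, 13, 3, 4],
    [1, 2, 4, 14],
]

def observed_agreement(counts, weights):
    """Weighted sum of cell proportions, i.e. (3.3) with p in place of pi."""
    n = sum(sum(row) for row in counts)
    weighted_total = sum(
        Fraction(w) * c
        for w_row, c_row in zip(weights, counts)
        for w, c in zip(w_row, c_row)
    )
    return weighted_total / n

# Identity weights: credit is given only when both observers pick the same class.
size = len(counts)
identity = [[1 if j == k else 0 for k in range(size)] for j in range(size)]

lam = observed_agreement(counts, identity)
print(lam, float(lam))
```

Exact rational arithmetic is used here only so that the result can be checked by hand against the table (33 concordant patients out of 69).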
$$E:\quad \pi_{i, j_1 j_2 \cdots j_d} = \phi_{i1j_1}\, \phi_{i2j_2} \cdots \phi_{idj_d} = \prod_{k=1}^{d} \phi_{ikj_k} \quad \text{for } i = 1, 2, \ldots, s. \qquad (3.5)$$
$$= \prod_{g=1}^{d} \cdots \quad \text{for } i = 1, 2, \ldots, s. \qquad (3.7)$$
where HA denotes hierarchical agreement.
In order to maintain consistent nomenclature when describing the relative strength
of agreement associated with kappa statistics, the following labels will be assigned to the
corresponding ranges of kappa:
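A minimal sketch of such a labelling is given here, assuming the widely quoted benchmark ranges associated with this paper: below 0.00 Poor, 0.00-0.20 Slight, 0.21-0.40 Fair, 0.41-0.60 Moderate, 0.61-0.80 Substantial, 0.81-1.00 Almost Perfect. The function name and the decision to round to two decimals before comparing (so that a value such as 0.408 does not fall between two ranges) are illustrative assumptions.

```python
def strength_of_agreement(kappa):
    """Map a kappa value to a descriptive label (assumed benchmark ranges)."""
    k = round(kappa, 2)  # ranges are stated to two decimals
    if k > 1.0:
        raise ValueError("kappa cannot exceed 1")
    if k < 0.0:
        return "Poor"
    upper_bounds = [
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
        (1.00, "Almost Perfect"),
    ]
    for upper, label in upper_bounds:
        if k <= upper:
            return label

# Values 0.208, 0.328 and 0.408 are the estimates listed later in the example.
for value in (0.208, 0.328, 0.408):
    print(value, strength_of_agreement(value))
```

The labels are descriptive conveniences rather than test results; the cut-points carry no inferential meaning on their own.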
4.1 Marginal Homogeneity Tests
The functions required to test the hypotheses involving marginal distributions can
be generated in the formulation of (A.14) in Appendix 1 in Koch et al. [1977] with the
function vector $F' = (F_1', F_2')$ where
Table 2
HIERARCHICAL WEIGHTS FOR AGREEMENT MEASURES

                          Observer 2      Observer 2      Observer 2      Observer 2
Diagnostic Class          1  2  3  4      1  2  3  4      1  2  3  4      1  2  3  4
                  1       1  0  0  0      1  1  0  0      1  1  0  0      1  1  0  0
Observer 1        2       0  1  0  0      1  1  0  0      1  1  0  0      1  1  1  0
                  3       0  0  1  0      0  0  1  0      0  0  1  1      0  1  1  1
                  4       0  0  0  1      0  0  0  1      0  0  1  1      0  0  1  1
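The four weight sets in Table 2 are nested: each set gives credit to every cell credited by the previous one, plus some adjacent off-diagonal cells. The sketch below transcribes the weights, checks the nesting, and evaluates the observed agreement for the New Orleans patient block of Table 1. The variable and function names are illustrative assumptions, and sample proportions are used in place of probabilities.

```python
from fractions import Fraction

# Weight sets transcribed from Table 2 (rows: observer 1, columns: observer 2).
WEIGHTS = [
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
    [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
    [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]],
    [[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 1]],
]

# New Orleans patients from Table 1.
COUNTS = [
    [5, 3, 0, 0],
    [3, 11, 4, 0],
    [2, 13, 3, 4],
    [1, 2, 4, 14],
]

def observed_agreement(counts, weights):
    """Proportion of patients falling in cells that the weight set credits."""
    n = sum(sum(row) for row in counts)
    credited = sum(
        w * c
        for w_row, c_row in zip(weights, counts)
        for w, c in zip(w_row, c_row)
    )
    return Fraction(credited, n)

def is_nested(smaller, larger):
    """True if every cell credited by `smaller` is also credited by `larger`."""
    return all(
        a <= b
        for a_row, b_row in zip(smaller, larger)
        for a, b in zip(a_row, b_row)
    )

nested = all(is_nested(WEIGHTS[h], WEIGHTS[h + 1]) for h in range(len(WEIGHTS) - 1))
lams = [observed_agreement(COUNTS, w) for w in WEIGHTS]
print(nested, [float(x) for x in lams])
```

Because the sets are nested, the observed agreement can only increase from one set to the next; here it moves from 33/69 to 39/69, 47/69 and 64/69.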
which contain the marginal proportions for diagnostic classes "1," "2" and "3" for the
two observers within the two sub-populations. The test statistic for Hsm is Qc = 46.37
with d.f. = 6, which implies that there are significant (α = 0.01) differences in the
distributions of the observed response profiles between the Winnipeg and New Orleans patients.
The tests of this hypothesis within each of the observers also indicate statistically
significant (α = 0.01) differences between the two sub-populations, although the Winnipeg
neurologist represents the more dominant component. Similarly, the test statistic for HcMa
is Qc = 69.01 with d.f. = 6, which implies that there are significant (α = 0.01) differences
in the response profiles between the two neurologists within each sub-population. Moreover,
the dominant component of these observer differences is within the Winnipeg patient
group. These results suggest that significant interobserver bias exists between the two
neurologists in their overall usage of the diagnostic classification scale. In addition, the
goodness-of-fit statistic for testing the interaction hypothesis HAger is Q = 14.09 with
d.f. = 3. This significant (α = 0.01) observer × sub-population interaction is consistent
with the result that the observer differences are more substantial in the Winnipeg patient
group (Qc = 58.47) than in the New Orleans patient group (Qc = 10.54).
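The significance statements above can be checked with a short sketch, assuming each statistic is referred to a chi-square distribution with the stated degrees of freedom. The helper `chi2_sf` is a hypothetical name; it uses the closed-form upper tail for integer degrees of freedom so that only the standard library is needed. Only the three statistics whose degrees of freedom are stated in the text are included.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variate with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)             # Q(1, y) for even df
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y) for odd df
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

# (description, statistic, degrees of freedom) as reported in the text.
reported = [
    ("between sub-populations", 46.37, 6),
    ("between neurologists", 69.01, 6),
    ("observer x sub-population interaction", 14.09, 3),
]

p_values = {name: chi2_sf(q, df) for name, q, df in reported}
for name, p in p_values.items():
    print(f"{name}: p = {p:.2e}")
```

All three tail probabilities fall well below 0.01, in line with the significance level quoted in the text.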
Table 3
DESCRIPTION OF HIERARCHICAL WEIGHTS
1 0 00 01 0 0 0 1 0 0 0 1 0 00
1 000 0 1 00 00 1 0 000 1
1 1 0 0 1 1 0 0 0 0 1 0 0 0 0 1
_1 1 00 1 1 1 0 0 1 1 1 001 1
1 00 0 0 10 0000
1 000 00 1 0 0000
0 10 0 1 0 0 0 0 0 0 0
40X24 0 10 0 0 1 00 0000012; (4 3)
0 100 00 1 0 0000
0 10 0 0 001 0 0 0 0
0 01 0 1 0 0 0 0 0 0 0
00(10( 010 0000((
0 0 1 0 0 0 1 0 0 0 0 0
00 1 0 000 1 0000
000 1 01 00 0000
000 1 00 1 0 0000
-1 0 0 0 0 -1 0 0 0 -1 0 0 0 0 -1 10 0 0
-1 -1 0 0 -1 -1 0 0 0 0 -1 0 0 0 0 -1 0 10 0
-1 -100 -1 -1 00 0 0 -1 -1 00 -1 -1 0010
-1 -1 0 0 -1 -1 -1 0 0 -1 -1 -1 0 0 -1 -1 0 0 0 1
A3 = 012
1fX40 0 1 1 10 1 1 1 0 1 1 1 1 0 0 0 0 0 (4.4)
0 011 0 0 11 1 1 0 1 11 1 0 0000
0 011 0 0 1 1 1 1 0 0 1 1 0 0 0000
010 1 00011000110
0 1 0 0000
K11 0.208
K12 0.328
K13 0.408
Table 4
STATISTICAL TESTS FOR HIERARCHICAL HYPOTHESES
Hypothesis D.F. QC
K11 2 6.89*
K12 K22 K21
14 1A; 24 23 28.13**
13 =12 1 4.38*
KIc
0 0 1 0 0 ,
0 1 0 0 0 _5
0 0 1 0 0
Table 5
STATISTICAL TESTS BETWEEN PATIENT SUB-POPULATIONS

Hypothesis                           D.F.     Q
$\kappa_{11} = \kappa_{21}$           1      0.90
$\kappa_{12} = \kappa_{22}$           1      0.00
$\kappa_{13} = \kappa_{23}$           1      0.03
Table 6
STATISTICAL TESTS FOR MODEL X
K =K 1 5.40* Ka = 0 1 31.05**
2 11
K =3 K 1 4.92* K = 0 1 40.71**
22
K = K 1 12.33** K = 0 1 45.49**
K =
K4 1 4.88* K = 0 1 72.44**
K = 0 1 94.97**
5. Discussion
In some applications, one may also be interested in a set of weights which assign varying
degrees of partial credit to the off-diagonal cells depending on the extent of the
disagreement, rather than successively combining adjoining categories as shown in Table 2. For
Table 7
SMOOTHED ESTIMATES OF AGREEMENT UNDER MODEL X
Sub-population 1 2
Table 8
ALTERNATIVE WEIGHTS FOR OVERALL AGREEMENT MEASURES
Weights jlj 2j
Observer 2 2
Diagnostic 1 2 3 4 1 2 3 4
Class
1 1 0 0 0 1 i ?
0
2 0 1 0 0 ? 1 i
Observer 1
3 0 0 1 0 i i 1 ?
4 0 0 0 1 0 i1? 1
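The partial-credit idea can be sketched as follows, assuming linear weights $w_{jk} = 1 - |j - k|/(L - 1)$ for an $L$-point scale. These weights are an illustrative choice in the spirit of the weighted kappa of Cohen [1968] and are not necessarily the entries of Table 8; the chance-corrected form and the function names are likewise assumptions. The data are the New Orleans patient block of Table 1.

```python
from fractions import Fraction

# New Orleans patients from Table 1 (rows: New Orleans neurologist, columns: second neurologist).
counts = [
    [5, 3, 0, 0],
    [3, 11, 4, 0],
    [2, 13, 3, 4],
    [1, 2, 4, 14],
]

def linear_weights(size):
    """Assumed partial-credit weights: 1 on the diagonal, falling linearly to 0."""
    return [
        [1 - Fraction(abs(j - k), size - 1) for k in range(size)]
        for j in range(size)
    ]

def weighted_kappa(counts, weights):
    """(observed - expected) / (1 - expected) with an independence baseline."""
    n = sum(sum(row) for row in counts)
    size = len(counts)
    row_p = [Fraction(sum(row), n) for row in counts]
    col_p = [Fraction(sum(col), n) for col in zip(*counts)]
    observed = sum(
        weights[j][k] * Fraction(counts[j][k], n)
        for j in range(size) for k in range(size)
    )
    expected = sum(
        weights[j][k] * row_p[j] * col_p[k]
        for j in range(size) for k in range(size)
    )
    return (observed - expected) / (1 - expected)

size = len(counts)
identity = [[Fraction(int(j == k)) for k in range(size)] for j in range(size)]

k_exact = weighted_kappa(counts, identity)             # exact agreement only
k_partial = weighted_kappa(counts, linear_weights(size))  # partial credit
print(float(k_exact), float(k_partial))
```

Under these assumed weights the estimate rises from 349/1177 (about 0.297) to 21/44 (about 0.477), reflecting that most disagreements in this block are between adjacent diagnostic classes.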
Acknowledgments
This research was partially supported by Research Grants GM-00038-20 and GM-70004-05
from the National Institute of General Medical Sciences and by the U. S. Bureau
of the Census through Joint Statistical Agreements JSA 74-2 and JSA 75-2. The authors
would like to thank the referees for their helpful comments on an earlier draft of this paper.
In addition, the authors are grateful to Ms. Rebecca Wesson and Ms. Lynn Wilkinson
for their conscientious typing of previous drafts of this paper, and to Ms. Linda L. Blakley
and Ms. Connie Massey for their efficient typing of the final version of this manuscript.
The Measurement of Agreement Between Observers for Categorical Data
Summary
This article presents a general statistical methodology for the analysis of multivariate
categorical data arising from observer reliability studies. The procedure relies mainly on
the construction of functions of the observed proportions that express the agreement of the
observers among themselves, and on the construction of test statistics for hypotheses
involving these functions. Tests for bias between observers are presented in terms of
first-order marginal homogeneity, and measures of agreement between observers are constructed
as statistics generalizing those of the kappa type. These procedures are illustrated with a
clinical diagnosis example from the epidemiological literature.
References
Anderson, R. L. and Bancroft, T. A. [1952]. Statistical Theory in Research. McGraw-Hill, New York.
Bhapkar, V. P. [1966]. A note on the equivalence of two test criteria for hypotheses in categorical
data. Journal of the American Statistical Association 61, 228-235.
Bhapkar, V. P. [1968]. On the analysis of contingency tables with a quantitative response. Biometrics
24, 329-338.
Bhapkar, V. P. and Koch, G. G. [1968a]. Hypotheses of "no interaction" in multidimensional
contingency tables. Technometrics 10, 107-123.
Bhapkar, V. P. and Koch, G. G. [1968b]. On the hypotheses of "no interaction" in contingency tables.
Biometrics 24, 567-594.
Cicchetti, D. V. [1972]. A new measure of agreement between rank-ordered variables. Proceedings,
80th Annual Convention, APA, 17-18.
Cohen, J. [1960]. A coefficient of agreement for nominal scales. Educational and Psychological
Measurement 20, 37-46.
Cohen, J. [1968]. Weighted kappa: nominal scale agreement with provision for scaled disagreement
or partial credit. Psychological Bulletin 70, 213-220.
Fleiss, J. L. [1966]. Assessing the accuracy of multivariate observations. Journal of the American
Statistical Association 61, 403-412.
Fleiss, J. L., Cohen, J. and Everitt, B. S. [1969]. Large sample standard errors of kappa and weighted
kappa. Psychological Bulletin 72, 323-337.
Fleiss, J. L. [1971]. Measuring nominal scale agreement among many raters. Psychological Bulletin
76, 378-382.
Fleiss, J. L. and Cohen, J. [1973]. The equivalence of weighted kappa and the intraclass correlation
coefficient as measures of reliability. Educational and Psychological Measurement 33, 613-619.
Fleiss, J. L. [1975]. Measuring agreement between two judges on the presence or absence of a trait.
Biometrics 31, 651-659.
Forthofer, R. N. and Koch, G. G. [1973]. An analysis for compounded functions of categorical data.
Biometrics 29, 143-157.
Grizzle, J. E., Starmer, C. F. and Koch, G. G. [1969]. Analysis of categorical data by linear models.
Biometrics 25, 489-504.
Received April 1975, Revised November 1975