The Measurement of Observer Agreement for Categorical Data
Author(s): J. Richard Landis and Gary G. Koch

Source: Biometrics, Vol. 33, No. 1 (Mar., 1977), pp. 159-174
Published by: International Biometric Society
Stable URL: http://www.jstor.org/stable/2529310
J. RICHARD LANDIS
Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109 U.S.A.
GARY G. KOCH
Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27514 U.S.A.
Summary
This paper presents a general statistical methodology for the analysis of multivariate
categorical data arising from observer reliability studies. The procedure essentially involves
the construction of functions of the observed proportions which are directed at the extent to
which the observers agree among themselves and the construction of test statistics for
hypotheses involving these functions. Tests for interobserver bias are presented in terms of
first-order marginal homogeneity and measures of interobserver agreement are developed as
generalized kappa-type statistics. These procedures are illustrated with a clinical diagnosis
example from the epidemiological literature.
Key Words: Observer agreement; Multivariate categorical data; Kappa statistics; Repeated
measurement experiments; Weighted least squares.

1. Introduction
Researchers in many fields have become increasingly aware of the observer (rater or
interviewer) as an important source of measurement error. Consequently, reliability studies
are conducted in experimental or survey situations to assess the level of observer variability
in the measurement procedures to be used in data acquisition. When the data arising
from such studies are quantitative, tests for interobserver bias and measures of
interobserver agreement are usually obtained from standard ANOVA mixed models or random
effects models such as those discussed in Anderson and Bancroft [1952], Scheffe [1959],
and Searle [1971]. As a result, hypothesis tests of observer effects are used to investigate
interobserver bias, i.e., differences in the mean response among observers, and estimates
of intraclass correlation coefficients are used to measure interobserver reliability.
Modifications and extensions of these standard ANOVA models have been proposed by Grubbs
[1948, 1973], Mandel [1959], Fleiss [1966], Overall [1968], and Loewenson, Bearman and
Resch [1972] to evaluate the measurement error in various types of applications. Although
assumptions of normality for these models may not be warranted in certain cases, the
ANOVA procedures discussed in Searle [1971] and the symmetric square difference
procedure in Koch [1967, 1968] still permit the estimation of the appropriate components
of variance and the reliability coefficients.
On the other hand, many observer reliability studies involve categorical data in which
the response variable is classified into nominal (or possibly ordinal) multinomial categories.
As reviewed in Landis and Koch [1975a, 1975b], a wide variety of estimation and testing
procedures have been recommended for the assessment of observer variability in these
cases. In this paper we propose a unified approach to the evaluation of observer agreement
for categorical data by expressing the quantities which reflect the extent to which the
observers agree among themselves as functions of observed proportions obtained from
underlying multidimensional contingency tables. These functions are then used to produce
test statistics for the relevant hypotheses concerning interobserver bias in the overall
usage of the measurement scale and interobserver agreement on the classification of
individual subjects. For illustrative purposes, this general methodology is developed within
the context of a typical data set which resulted from an investigation of observer
variability in the clinical diagnosis of multiple sclerosis.
2. A Clinical Diagnosis Example
Let us consider the data arising from the diagnosis of multiple sclerosis reported in
Westlund and Kurland [1953]. Among other things, the investigators were interested in
comparing patient groups to study possible differences in the geographical distributions
of the disease. For this purpose, a series of patients in Winnipeg, Manitoba and a separate
series of patients in New Orleans, Louisiana were selected and were examined by a
neurologist in their respective locations. After the completion of all the examinations, each
neurologist was requested to review all the records without seeing his earlier summary
and diagnosis and to classify them into one of the following diagnostic classes:
1. Certain multiple sclerosis;
2. Probable multiple sclerosis;
3. Possible multiple sclerosis (odds 50:50);
4. Doubtful, unlikely, or definitely not multiple sclerosis.
In order to evaluate agreement between the diagnosticians, the Winnipeg neurologist then
reviewed and classified each of the New Orleans patient records, and vice versa. The
data resulting from these review diagnoses are presented in Table 1.

Table 1
DIAGNOSTIC CLASSIFICATION REGARDING MULTIPLE SCLEROSIS

Sub-population: Winnipeg Patients (1)
Rows: New Orleans Neurologist (1); columns: Winnipeg Neurologist (2)

Diagnostic Class       1       2       3       4     Total   Proportion
1                     38       5       0       1       44      0.295
2                     33      11       3       0       47      0.315
3                     10      14       5       6       35      0.235
4                      3       7       3      10       23      0.154
Total                 84      37      11      17      149
Proportion         0.564   0.248   0.074   0.114

Sub-population: New Orleans Patients (2)
Rows: New Orleans Neurologist (1); columns: Winnipeg Neurologist (2)

Diagnostic Class       1       2       3       4     Total   Proportion
1                      5       3       0       0        8      0.116
2                      3      11       4       0       18      0.261
3                      2      13       3       4       22      0.319
4                      1       2       4      14       21      0.304
Total                 11      29      11      18       69
Proportion         0.159   0.420   0.159   0.261

A preliminary inspection of the Winnipeg data indicates that the Winnipeg neurologist
tended to diagnose more of the patients as certain (1) or probable (2) multiple sclerosis
than did his counterpart in New Orleans. As a result, they agreed on the diagnosis of
only 64/149 (43 percent) of the patients. Although the differences in the overall crude
distributions of the diagnoses seem to be less prominent within the New Orleans patients,
the neurologists diagnosed only 33/69 (48 percent) of them into identically the same
category. The statistical issues concerning these differences in diagnosis can be summarized
within the framework of the following basic questions:
(1) Are there any differences between the two patient populations with respect to the
overall crude distribution of the diagnoses by each of the two neurologists?
(2) Are there any differences between the overall crude distributions of the diagnoses
by the two neurologists within each of the respective patient populations?
(3) Is there any neurologist × sub-population interaction in the overall crude
distribution of the diagnoses?
(4) Is there any difference between the two patient populations with respect to the
overall agreement of the two neurologists on the specific diagnosis of individual
patients?
(5) Is the agreement of the two neurologists on the specific diagnosis of individual
patients significantly different from chance agreement based on their overall crude
distributions of diagnoses?
(6) Are there certain patterns of disagreement which may reflect significant imprecision
in the diagnostic criteria?
As stated in Koch et al. [1977], questions (1)-(3) are directly analogous to the hypotheses
of "no whole-plot effects," "no split-plot effects," and "no whole-plot × split-plot
interaction" in standard split-plot experiments. In this context, question (1) addresses
differences among the sub-populations, question (2) involves the issue of interobserver bias, and
question (3) is concerned with the observer × sub-population interaction. Thus, the
first-order marginal distributions of response for each of the neurologists within each
sub-population contain the relevant information for dealing with these questions. In
contrast to overall crude differences, questions (4)-(6) are addressed at interobserver
agreement on a subject-to-subject basis; and, as such, they are directly analogous to
hypotheses concerning intraclass correlation coefficients in random effects models. Hence,
certain functions of the diagonal cells of various subtables are used to provide information
for estimating and testing the significance of agreement on the classification of individual
subjects.
In the following sections a general methodology for answering these questions is
developed in terms of specific hypotheses. These procedures are then illustrated with an
analysis of the data in Table 1.
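The crude agreement figures quoted above (64/149 and 33/69) and the marginal proportions of Table 1 can be checked with a few lines of Python. This is only an illustrative sketch: the names `WINNIPEG`, `NEW_ORLEANS`, `crude_agreement` and `marginal_proportions` are introduced here and are not part of the paper.

```python
# Table 1 counts: rows = New Orleans neurologist (observer 1),
# columns = Winnipeg neurologist (observer 2), diagnostic classes 1-4.
WINNIPEG = [
    [38, 5, 0, 1],
    [33, 11, 3, 0],
    [10, 14, 5, 6],
    [3, 7, 3, 10],
]
NEW_ORLEANS = [
    [5, 3, 0, 0],
    [3, 11, 4, 0],
    [2, 13, 3, 4],
    [1, 2, 4, 14],
]


def crude_agreement(table):
    """Return (agreeing subjects, total subjects, proportion on the main diagonal)."""
    n = sum(sum(row) for row in table)
    agree = sum(table[k][k] for k in range(len(table)))
    return agree, n, agree / n


def marginal_proportions(table):
    """Return (row proportions, column proportions) of a square count table."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) / n for row in table]
    cols = [sum(row[k] for row in table) / n for k in range(len(table))]
    return rows, cols


for name, table in (("Winnipeg", WINNIPEG), ("New Orleans", NEW_ORLEANS)):
    agree, n, prop = crude_agreement(table)
    print(f"{name}: {agree}/{n} = {prop:.2f}")
```

The row and column proportions returned by `marginal_proportions` are the first-order marginal distributions on which questions (1)-(3) are based.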
3. Methodology
Let $i = 1, 2, \ldots, s$ index a set of sub-populations from which random samples have
been selected. Suppose that the same response variable is measured separately by each
of $d$ observers using an $L$-point scale. Let the $r = L^d$ response profiles be indexed by a
vector subscript $j = (j_1, j_2, \ldots, j_d)$, where $j_g = 1, 2, \ldots, L$ for $g = 1, 2, \ldots, d$.
Furthermore, let $\pi_{ij} = \pi_{i j_1 j_2 \cdots j_d}$ represent the joint probability of response
profile $j$ for randomly selected subjects from the $i$th sub-population. Then let the first-order
marginal probability

$$\phi_{igk} = \sum_{j \text{ with } j_g = k} \pi_{i j_1 j_2 \cdots j_d} \quad \text{for } g = 1, 2, \ldots, d;\ k = 1, 2, \ldots, L \tag{3.1}$$

represent the probability of the $k$th response category for the $g$th observer in the $i$th
sub-population.
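As a small illustration of this notation, the sketch below enumerates the $r = L^d$ response profiles and accumulates the first-order marginal probabilities of (3.1) from a joint distribution. The function names and the toy distribution `pi` are hypothetical, chosen only for this example.

```python
from itertools import product


def response_profiles(L, d):
    """All r = L**d response profiles j = (j_1, ..., j_d), categories numbered 1..L."""
    return list(product(range(1, L + 1), repeat=d))


def first_order_marginals(pi, L, d):
    """phi[g][k]: probability that observer g+1 uses category k+1, as in (3.1).

    `pi` maps each profile tuple to its joint probability; profiles that are
    missing from the mapping are treated as having probability zero.
    """
    phi = [[0.0] * L for _ in range(d)]
    for profile, prob in pi.items():
        for g, category in enumerate(profile):
            phi[g][category - 1] += prob
    return phi


# Hypothetical joint distribution for d = 2 observers and L = 2 categories.
pi = {(1, 1): 0.5, (1, 2): 0.1, (2, 1): 0.2, (2, 2): 0.2}
phi = first_order_marginals(pi, L=2, d=2)
print(len(response_profiles(4, 2)))  # number of profiles for L = 4, d = 2
print(phi)
```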
3.1 Hypotheses Involving Marginal Distributions
Hypotheses directed at the questions of differences among sub-populations and
interobserver bias involve distributions of the response profiles and can be expressed in terms
of constraints on the first-order marginal probabilities $\{\phi_{igk}\}$. As a result, the specific
hypotheses associated with questions (1)-(3) are directly analogous to $H_{SM}$, $H_{CM}$, and
$H_{AM}$ outlined in Koch et al. [1977] in expressions (2.4), (2.5), and (2.9), respectively. In
particular, the $d$ observers correspond to the $d$ conditions, and thus the hypothesis of
first-order marginal symmetry (homogeneity) addresses the issue of interobserver bias. These
hypotheses can also be expressed in terms of constraints on mean score functions associated
with each observer such as the $\{\eta_{ig}\}$ summary indexes specified in (2.14) in Koch et al. [1977].
Further discussion of hypotheses involving marginal distributions within the context
of observer agreement studies is given in Landis [1975].
3.2 Hypotheses Involving Generalized Kappa-Type Measures
Whereas the previous hypotheses concerning differences among sub-populations and
interobserver bias involved only the first-order marginal probabilities, hypotheses directed
at the extent to which observers agree among themselves on the classification of individual
subjects must be formulated in terms of the internal elements of the table. For example,
the estimate of the crude proportion of agreement between two observers is simply the
sum of the observed proportions on the main diagonal of the corresponding two-way table.
In addition, if partial credit is permitted for certain types of disagreement, an estimate
of the weighted proportion of agreement will involve the weighted inclusion of the
off-diagonal cells.
As reviewed in Landis and Koch [1975a, 1975b], numerous measures of observer
agreement have been proposed for categorical data, e.g., Goodman and Kruskal [1954], Cohen
[1960, 1968], Fleiss [1971], Light [1971], and Cicchetti [1972]. Most of these quantities
are of the form

$$\kappa = \frac{\pi_o - \pi_e}{1 - \pi_e} \tag{3.2}$$

where $\pi_o$ is an observational probability of agreement and $\pi_e$ is a hypothetical expected
probability of agreement under an appropriate set of baseline constraints such as total
independence of observer classifications. Ranging from $-\pi_e/(1 - \pi_e)$ to $+1$, $\kappa$ indicates
the extent to which the observational probability of agreement is in excess of the
probability of agreement hypothetically expected under the baseline constraints. Furthermore,
as shown in Fleiss and Cohen [1973] and Fleiss [1975], $\kappa$ is directly analogous to the
intraclass correlation coefficient obtained from ANOVA models for quantitative measurements
and can be used as a measure of the reliability of multiple determinations on the same
subjects.
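To make (3.2) concrete, the following sketch computes the unweighted kappa for a two-observer table under the baseline of total independence and applies it to the Table 1 counts. The helper names (`kappa`, `kappa_lower_bound`, `cohen_kappa`) are illustrative choices, not notation from the paper.

```python
def kappa(pi_o, pi_e):
    """Kappa of (3.2): agreement in excess of the baseline, rescaled so the maximum is 1."""
    return (pi_o - pi_e) / (1.0 - pi_e)


def kappa_lower_bound(pi_e):
    """Smallest attainable value of kappa, reached when pi_o = 0."""
    return -pi_e / (1.0 - pi_e)


def cohen_kappa(table):
    """Unweighted kappa for a square two-observer count table under total independence."""
    n = sum(sum(row) for row in table)
    L = len(table)
    row = [sum(table[a]) / n for a in range(L)]
    col = [sum(table[a][b] for a in range(L)) / n for b in range(L)]
    pi_o = sum(table[k][k] for k in range(L)) / n      # observed agreement
    pi_e = sum(row[k] * col[k] for k in range(L))      # agreement expected by chance
    return kappa(pi_o, pi_e)


# Table 1 counts (rows: New Orleans neurologist, columns: Winnipeg neurologist).
winnipeg = [[38, 5, 0, 1], [33, 11, 3, 0], [10, 14, 5, 6], [3, 7, 3, 10]]
new_orleans = [[5, 3, 0, 0], [3, 11, 4, 0], [2, 13, 3, 4], [1, 2, 4, 14]]

print(round(cohen_kappa(winnipeg), 3), round(cohen_kappa(new_orleans), 3))
```

Both values fall well below the crude agreement proportions of 0.43 and 0.48, because a sizeable share of the diagonal mass is already expected under independence.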
Several kappa-type measures of interobserver agreement can be formulated to
investigate selected patterns of disagreement simultaneously by choosing corresponding
sets of weights which reflect the role of each response category in a given agreement index.
For example, a set of weights can be chosen so that the resulting agreement measure
indicates the combined performance of all the observers, such as majority or consensus
agreement, or sets of weights can be directed at subsets of observers, such as all possible
pairwise agreement measures. Alternatively, these weights can be chosen so that the
associated kappa measures indicate the increments in agreement which result by
successively combining relevant categories of the response variable. Such kappa measures are
said to be in a hierarchical relationship with each other. Thus, in general, let
$w_{1j}, w_{2j}, \ldots, w_{uj}$ be $u$ sets of weights assigned to the response profiles indexed by
$j = (j_1, j_2, \ldots, j_d)$. Moreover, let $0 \le w_{hj} \le 1$ for $h = 1, 2, \ldots, u$ over all $j$, so that
the resulting estimates are interpretable as probabilities of agreement. Then the observational
probability of agreement associated with the $h$th set of weights in the $i$th sub-population is
the weighted sum

$$\lambda_{ih} = \sum_j w_{hj}\,\pi_{ij} \quad \text{for } i = 1, 2, \ldots, s;\ h = 1, 2, \ldots, u. \tag{3.3}$$

Correspondingly, the expected proportion of agreement associated with (3.3) is the weighted
sum

$$\gamma_{ih} = \sum_j w_{hj}\,\pi_{ij}^{(e)} \quad \text{for } i = 1, 2, \ldots, s;\ h = 1, 2, \ldots, u, \tag{3.4}$$

where $\pi_{ij}^{(e)}$ represents the joint hypothetical expected probability of response profile $j$
for randomly selected subjects from the $i$th sub-population.
These expected probabilities are determined by the choice of a particular set of baseline
constraints assumed for the response profiles. For this purpose, let $E = \{E_1, E_2, \ldots\}$
represent such underlying constraints on the marginal probabilities $\{\phi_{igk}\}$ of (3.1). In
this context, the following sets of constraints are of interest in creating interobserver
agreement measures:
(i) Under the assumption of total independence among the response variables from
the $d$ observers, the $\{\pi_{ij}^{(e)}\}$ satisfy

$$E_1:\ \pi_{i j_1 j_2 \cdots j_d}^{(e)} = \phi_{i1j_1}\,\phi_{i2j_2} \cdots \phi_{idj_d} = \prod_{g=1}^{d} \phi_{igj_g} \quad \text{for } i = 1, 2, \ldots, s. \tag{3.5}$$

(ii) Under the assumption of "no interobserver bias" the hypothesis of first-order
marginal homogeneity ($H_{CM}$ in Koch et al. [1977]) holds. In this situation, let the
common probability of classification into the $k$th category be

$$\psi_{ik} = \phi_{i1k} = \phi_{i2k} = \cdots = \phi_{idk} \tag{3.6}$$

for $i = 1, 2, \ldots, s$ and $k = 1, 2, \ldots, L$. Then under the baseline constraints
of total independence and marginal homogeneity the $\{\pi_{ij}^{(e)}\}$ satisfy

$$E_2:\ \pi_{i j_1 j_2 \cdots j_d}^{(e)} = \psi_{ij_1}\,\psi_{ij_2} \cdots \psi_{ij_d} = \prod_{g=1}^{d} \psi_{ij_g} \quad \text{for } i = 1, 2, \ldots, s. \tag{3.7}$$
Consequently, a generalized kappa-type measure of agreement directly analogous to
(3.2) can be formulated by

$$\kappa_{ih} = \frac{\lambda_{ih} - \gamma_{ih}}{1 - \gamma_{ih}} \quad \text{for } i = 1, 2, \ldots, s;\ h = 1, 2, \ldots, u \tag{3.8}$$

under a set of specified constraints in $E$. Here $\kappa_{ih}$ represents an agreement measure among
the $d$ observers in the $i$th sub-population with respect to the $h$th set of weights.
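The definitions (3.3)-(3.8) can be sketched for the two-observer case as follows. The function `generalized_kappa` and its `baseline` argument are hypothetical names, and the estimate of the common probabilities $\psi_{ik}$ under $E_2$ as the simple average of the two observers' marginal proportions is an assumption made for this illustration; the paper itself obtains its estimates through the GSK formulation.

```python
def generalized_kappa(table, weights, baseline="E1"):
    """Weighted kappa (3.8) for d = 2 observers from a square count table.

    baseline "E1": total independence, expected cell = row marginal * column marginal.
    baseline "E2": independence plus marginal homogeneity; the common marginal is
    ASSUMED here to be estimated by averaging the two observers' marginals.
    """
    n = sum(sum(row) for row in table)
    L = len(table)
    p = [[table[a][b] / n for b in range(L)] for a in range(L)]
    row = [sum(p[a]) for a in range(L)]
    col = [sum(p[a][b] for a in range(L)) for b in range(L)]
    if baseline == "E2":
        common = [(row[k] + col[k]) / 2 for k in range(L)]
        row, col = common, common
    elif baseline != "E1":
        raise ValueError("baseline must be 'E1' or 'E2'")
    lam = sum(weights[a][b] * p[a][b] for a in range(L) for b in range(L))          # (3.3)
    gam = sum(weights[a][b] * row[a] * col[b] for a in range(L) for b in range(L))  # (3.4)
    return (lam - gam) / (1.0 - gam)


# Perfect-agreement weights applied to the Winnipeg patients of Table 1.
identity = [[1 if a == b else 0 for b in range(4)] for a in range(4)]
winnipeg = [[38, 5, 0, 1], [33, 11, 3, 0], [10, 14, 5, 6], [3, 7, 3, 10]]
k_e1 = generalized_kappa(winnipeg, identity, "E1")
k_e2 = generalized_kappa(winnipeg, identity, "E2")
print(round(k_e1, 3), round(k_e2, 3))
```

With identity weights the $E_1$ version is Cohen's kappa; the $E_2$ version is smaller for these data because the two neurologists' marginal distributions differ markedly.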
Within this framework, the specific hypotheses associated with questions (4)-(6) can
now be formulated as follows:
(4) If there are no differences among the $s$ sub-populations with respect to the measures
of overall specific agreement among the $d$ observers under $E$, then the $\{\kappa_{ih}\}$ satisfy
the hypothesis

$$H_{SA|E}:\ \kappa_{1h} = \kappa_{2h} = \cdots = \kappa_{sh} \quad \text{for } h = 1, 2, \ldots, u, \tag{3.9}$$

where SA denotes sub-population agreement.
(5) If the level of observed agreement is equal to that expected under $E$, then the
$\{\kappa_{ih}\}$ satisfy the hypothesis

$$H_{NA|E}:\ \kappa_{ih} = 0 \quad \text{for } i = 1, 2, \ldots, s;\ h = 1, 2, \ldots, u, \tag{3.10}$$

where NA denotes no agreement.
(6) In some cases the weights for the kappa measures are chosen to be in a hierarchical
relationship with each other in order to investigate specific disagreement patterns.
In these situations, if the extent of disagreement is the same for the categories
combined by the $(h + 1)$-st set of weights as for those combined by the $h$th set,
then the $\{\kappa_{ih}\}$ satisfy the hypothesis

$$H_{HA|E}:\ \kappa_{i,h+1} = \kappa_{i,h} \quad \text{for } i = 1, 2, \ldots, s, \tag{3.11}$$

where HA denotes hierarchical agreement.
In order to maintain consistent nomenclature when describing the relative strength
of agreement associated with kappa statistics, the following labels will be assigned to the
corresponding ranges of kappa:

Kappa Statistic    Strength of Agreement
< 0.00             Poor
0.00-0.20          Slight
0.21-0.40          Fair
0.41-0.60          Moderate
0.61-0.80          Substantial
0.81-1.00          Almost Perfect

Although these divisions are clearly arbitrary, they do provide useful "benchmarks" for
the discussion of the specific example in Table 1.
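These benchmarks translate directly into a lookup. The sketch below is one possible encoding; the function name is hypothetical, and because the printed ranges leave small gaps (for example between 0.20 and 0.21), it assumes that each upper bound is inclusive and that values in a gap belong to the next higher category.

```python
def strength_of_agreement(kappa):
    """Map a kappa statistic to the benchmark label of the table above.

    Assumption: upper bounds are inclusive, so 0.20 is "Slight" and any value
    strictly between 0.20 and 0.21 is labeled "Fair".
    """
    if kappa < 0.0:
        return "Poor"
    benchmarks = (
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
        (1.00, "Almost Perfect"),
    )
    for upper, label in benchmarks:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


for value in (-0.05, 0.15, 0.30, 0.596, 0.789, 0.90):
    print(value, strength_of_agreement(value))
```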
3.3 Estimation and Hypothesis Testing
Test statistics for the hypotheses considered in the previous sections as well as estimators
for corresponding model parameters can be obtained by using the general approach for
the analysis of multivariate categorical data proposed by Grizzle, Starmer and Koch [1969]
(hereafter abbreviated GSK) as outlined in Appendix 1 in Koch et al. [1977]. The hypotheses
in Section 3.1 involving constraints on the first-order marginal probabilities can be tested
by expressing the estimates of the $\{\phi_{igk}\}$ or the $\{\eta_{ig}\}$ as linear functions of the type given
in Appendix 1 (A.14) in Koch et al. [1977]. These particular matrix expressions have already
been discussed in considerable detail in Koch and Reinfurt [1971] and Koch et al. [1977],
and thus they will not be elaborated here. Otherwise, their specific construction for these
hypotheses in observer agreement studies is documented in Landis [1975].
In contrast to the linear functions which pertain to the hypotheses in Section 3.1,
all the hypotheses involving generalized kappa-type measures require the expression of
the ratio estimates of the $\{\kappa_{ih}\}$ as compounded logarithmic-exponential-linear functions
of the observed proportions as formulated in Appendix 1 (A.20) in Koch et al. [1977].
As a result, the test statistics for the hypotheses in Section 3.2 can also be generated by
the corresponding expression given in Appendix 1 (A.11) in Koch et al. [1977].
4. Analysis of Multiple Sclerosis Data
This section is concerned with the analysis of the multiple sclerosis data in Table 1
with primary emphasis given to illustrating the methodology in Section 3. Tests of
significance are used in a descriptive context to identify important sources of variation as
opposed to a rigorous inferential context; and thus issues pertaining to multiple
comparisons are ignored here. These, however, can be handled by the Scheffe type procedures
given in Grizzle, Starmer and Koch [1969]. The design for this example involves $s = 2$
sub-populations, $d = 2$ observers, and $L = 4$ response categories. Thus, there are $r = L^d = 16$
possible multivariate response profiles within each of the sub-populations.
4.1 Marginal Homogeneity Tests
The functions required to test the hypotheses involving marginal distributions can
be generated in the formulation of (A.14) in Appendix 1 in Koch et al. [1977] with the
function vector $F' = (F_1', F_2')$ where

$$\begin{aligned} F_1' &= (0.295,\ 0.315,\ 0.235,\ 0.564,\ 0.248,\ 0.074) \\ F_2' &= (0.116,\ 0.261,\ 0.319,\ 0.159,\ 0.420,\ 0.159), \end{aligned} \tag{4.1}$$

which contain the marginal proportions for diagnostic classes "1," "2" and "3" for the
two observers within the two sub-populations. The test statistic for $H_{SM}$ is $Q_C = 46.37$
with d.f. = 6, which implies that there are significant ($\alpha = 0.01$) differences in the
distributions of the observed response profiles between the Winnipeg and New Orleans patients.
The tests of this hypothesis within each of the observers also indicate statistically
significant ($\alpha = 0.01$) differences between the two sub-populations, although the Winnipeg
neurologist represents the more dominant component. Similarly, the test statistic for $H_{CM}$
is $Q_C = 69.01$ with d.f. = 6, which implies that there are significant ($\alpha = 0.01$) differences
in the response profiles between the two neurologists within each sub-population. Moreover,
the dominant component of these observer differences is within the Winnipeg patient
group. These results suggest that significant interobserver bias exists between the two
neurologists in their overall usage of the diagnostic classification scale. In addition, the
goodness-of-fit statistic for testing the interaction hypothesis $H_{AM}$ is $Q = 14.09$ with
d.f. = 3. This significant ($\alpha = 0.01$) observer × sub-population interaction is consistent
with the result that the observer differences are more substantial in the Winnipeg patient
group ($Q_C = 58.47$) than in the New Orleans patient group ($Q_C = 10.54$).

Table 2
HIERARCHICAL WEIGHTS FOR AGREEMENT MEASURES
Rows: diagnostic class assigned by Observer 1; columns: diagnostic class assigned by Observer 2

Weights               w_1j          w_2j          w_3j          w_4j
Observer 2 class     1 2 3 4       1 2 3 4       1 2 3 4       1 2 3 4
Observer 1 class
1                    1 0 0 0       1 1 0 0       1 1 0 0       1 1 0 0
2                    0 1 0 0       1 1 0 0       1 1 0 0       1 1 1 0
3                    0 0 1 0       0 0 1 0       0 0 1 1       0 1 1 1
4                    0 0 0 1       0 0 0 1       0 0 1 1       0 0 1 1
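The within-group observer statistics of Section 4.1 ($Q_C = 58.47$ and $Q_C = 10.54$, which sum to the overall 69.01) can be reproduced with a Wald-type test of first-order marginal homogeneity for each 4 × 4 table. The sketch below is an illustration rather than the paper's matrix formulation: it works with the differences between the two observers' marginal proportions for classes 1-3 and their estimated multinomial covariance matrix, and the function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def marginal_homogeneity_q(table):
    """Wald statistic (d.f. = L - 1) for equality of the two observers' marginals."""
    n = sum(sum(row) for row in table)
    L = len(table)
    p = [[table[a][b] / n for b in range(L)] for a in range(L)]
    row = [sum(p[a]) for a in range(L)]
    col = [sum(p[a][b] for a in range(L)) for b in range(L)]
    m = L - 1  # the last category is redundant because proportions sum to 1
    d = [row[k] - col[k] for k in range(m)]
    v = [[0.0] * m for _ in range(m)]
    for k in range(m):
        for l in range(m):
            if k == l:
                v[k][l] = (row[k] + col[k] - 2 * p[k][k] - d[k] ** 2) / n
            else:
                v[k][l] = (-(p[k][l] + p[l][k]) - d[k] * d[l]) / n
    x = solve(v, d)
    return sum(d[k] * x[k] for k in range(m))


winnipeg = [[38, 5, 0, 1], [33, 11, 3, 0], [10, 14, 5, 6], [3, 7, 3, 10]]
new_orleans = [[5, 3, 0, 0], [3, 11, 4, 0], [2, 13, 3, 4], [1, 2, 4, 14]]
q_w = marginal_homogeneity_q(winnipeg)
q_no = marginal_homogeneity_q(new_orleans)
print(round(q_w, 2), round(q_no, 2), round(q_w + q_no, 2))
```

Because the two patient samples are independent, the two 3 d.f. statistics add to the 6 d.f. statistic reported for $H_{CM}$.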
Table 3
DESCRIPTION OF HIERARCHICAL WEIGHTS

Set of Weights    Disagreement Permitted for Agreement Statistic
1                 None; requires perfect agreement.
2                 Certain (1) with Probable (2).
3                 Certain (1) with Probable (2);
                  Possible (3) with Doubtful (4).
4                 Certain (1) with Probable (2);
                  Probable (2) with Possible (3);
                  Possible (3) with Doubtful (4).
4.2 Hierarchical Kappa-Type Measures of Agreement
Specific patterns of disagreement between the neurologists on the diagnostic
classification of individual subjects can be investigated by selecting a hierarchy of weights which
successively combine adjoining categories of diagnosis in order to create potentially less
stringent reliability measures. For example, the four sets of weights in Table 2 can be
used to investigate the sources of imprecise diagnostic criteria. As indicated in Table 3,
these weights are chosen so that specific disagreement patterns are successively tolerated
in the corresponding estimates of agreement. In particular, $w_{1j}$ represents the set of weights
which generate the kappa measure of perfect agreement proposed in Cohen [1960].
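As an informal check on the hierarchy, the sketch below evaluates the generalized kappa of (3.8) under total independence for each of the four weight sets of Table 2, using the counts of Table 1. It computes the ratio directly rather than through the matrix formulation of Koch et al. [1977]; the names `W`, `TABLES` and `weighted_kappa` are illustrative.

```python
# Hierarchical weights of Table 2 (rows: observer 1 class, columns: observer 2 class).
W = {
    1: [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
    2: [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
    3: [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]],
    4: [[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 1]],
}
# Table 1 counts, keyed by sub-population (1 = Winnipeg, 2 = New Orleans patients).
TABLES = {
    1: [[38, 5, 0, 1], [33, 11, 3, 0], [10, 14, 5, 6], [3, 7, 3, 10]],
    2: [[5, 3, 0, 0], [3, 11, 4, 0], [2, 13, 3, 4], [1, 2, 4, 14]],
}


def weighted_kappa(table, weights):
    """Generalized kappa (3.8) for two observers under total independence (3.5)."""
    n = sum(sum(row) for row in table)
    L = len(table)
    row = [sum(table[a]) / n for a in range(L)]
    col = [sum(table[a][b] for a in range(L)) / n for b in range(L)]
    lam = sum(weights[a][b] * table[a][b] / n for a in range(L) for b in range(L))
    gam = sum(weights[a][b] * row[a] * col[b] for a in range(L) for b in range(L))
    return (lam - gam) / (1.0 - gam)


estimates = {(i, h): weighted_kappa(TABLES[i], W[h]) for i in (1, 2) for h in (1, 2, 3, 4)}
for (i, h), value in sorted(estimates.items()):
    print(f"kappa_{i}{h} = {value:.3f}")
```

Each successive weight set tolerates one more kind of disagreement, so the estimates increase along the hierarchy within each patient group.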
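The sequence of hierarchical kappa-type statistics within each of the two patient
populations associated with the weights given in Table 2 can be expressed in the
formulation (A.20) in Appendix 1 in Koch et al. [1977] under the baseline constraints of total
independence $E_1$ in (3.5) by letting

$$\underset{24 \times 32}{A_1} = \begin{bmatrix}
1111 & 0000 & 0000 & 0000 \\
0000 & 1111 & 0000 & 0000 \\
0000 & 0000 & 1111 & 0000 \\
0000 & 0000 & 0000 & 1111 \\
1000 & 1000 & 1000 & 1000 \\
0100 & 0100 & 0100 & 0100 \\
0010 & 0010 & 0010 & 0010 \\
0001 & 0001 & 0001 & 0001 \\
1000 & 0100 & 0010 & 0001 \\
1100 & 1100 & 0010 & 0001 \\
1100 & 1100 & 0011 & 0011 \\
1100 & 1110 & 0111 & 0011
\end{bmatrix} \otimes I_2; \tag{4.2}$$

$$\underset{40 \times 24}{A_2} = \begin{bmatrix}
1000 & 1000 & 0000 \\
1000 & 0100 & 0000 \\
1000 & 0010 & 0000 \\
1000 & 0001 & 0000 \\
0100 & 1000 & 0000 \\
0100 & 0100 & 0000 \\
0100 & 0010 & 0000 \\
0100 & 0001 & 0000 \\
0010 & 1000 & 0000 \\
0010 & 0100 & 0000 \\
0010 & 0010 & 0000 \\
0010 & 0001 & 0000 \\
0001 & 1000 & 0000 \\
0001 & 0100 & 0000 \\
0001 & 0010 & 0000 \\
0001 & 0001 & 0000 \\
0000 & 0000 & 1000 \\
0000 & 0000 & 0100 \\
0000 & 0000 & 0010 \\
0000 & 0000 & 0001
\end{bmatrix} \otimes I_2; \tag{4.3}$$

$$\underset{16 \times 40}{A_3} = \left[\begin{array}{rrrr|rrrr|rrrr|rrrr|rrrr}
-1 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 & 0 \\
-1 & -1 & 0 & 0 & -1 & -1 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & -1 & 0 & 1 & 0 & 0 \\
-1 & -1 & 0 & 0 & -1 & -1 & 0 & 0 & 0 & 0 & -1 & -1 & 0 & 0 & -1 & -1 & 0 & 0 & 1 & 0 \\
-1 & -1 & 0 & 0 & -1 & -1 & -1 & 0 & 0 & -1 & -1 & -1 & 0 & 0 & -1 & -1 & 0 & 0 & 0 & 1 \\
0 & 1 & 1 & 1 & 1 & 0 & 1 & 1 & 1 & 1 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}\right] \otimes I_2; \tag{4.4}$$

$$\underset{8 \times 16}{A_4} = \begin{bmatrix} I_4 & -I_4 \end{bmatrix} \otimes I_2. \tag{4.5}$$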
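For the data in Table 1, these estimates are given by

$$F = \begin{bmatrix} \hat\kappa_{11} \\ \hat\kappa_{12} \\ \hat\kappa_{13} \\ \hat\kappa_{14} \\ \hat\kappa_{21} \\ \hat\kappa_{22} \\ \hat\kappa_{23} \\ \hat\kappa_{24} \end{bmatrix} = \begin{bmatrix} 0.208 \\ 0.328 \\ 0.408 \\ 0.596 \\ 0.297 \\ 0.332 \\ 0.386 \\ 0.789 \end{bmatrix} \tag{4.6}$$

where $\hat\kappa_{ih}$ is the estimate of the agreement measure in the $i$th sub-population associated
with the $h$th set of weights shown in Table 2. In addition, the estimated covariance matrix
of $F$ is given by

$$V_F = \begin{bmatrix}
0.2546 & 0.2122 & 0.1868 & 0.1442 & 0 & 0 & 0 & 0 \\
0.2122 & 0.4005 & 0.3862 & 0.2912 & 0 & 0 & 0 & 0 \\
0.1868 & 0.3862 & 0.5200 & 0.3832 & 0 & 0 & 0 & 0 \\
0.1442 & 0.2912 & 0.3832 & 0.5700 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0.6163 & 0.5582 & 0.5046 & 0.2185 \\
0 & 0 & 0 & 0 & 0.5582 & 0.6879 & 0.6544 & 0.3010 \\
0 & 0 & 0 & 0 & 0.5046 & 0.6544 & 1.0030 & 0.4147 \\
0 & 0 & 0 & 0 & 0.2185 & 0.3010 & 0.4147 & 0.7720
\end{bmatrix} \times 10^{-2}. \tag{4.7}$$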
The test statistics for the hierarchical hypotheses in (3.11) are displayed in Table 4.

Table 4
STATISTICAL TESTS FOR HIERARCHICAL HYPOTHESES

Hypothesis                                                      D.F.     Q_C
Combined Patient Groups
  $\kappa_{12} = \kappa_{11}$; $\kappa_{22} = \kappa_{21}$         2       6.89*
  $\kappa_{13} = \kappa_{12}$; $\kappa_{23} = \kappa_{22}$         2       5.15
  $\kappa_{14} = \kappa_{13}$; $\kappa_{24} = \kappa_{23}$         2      28.13**
Winnipeg Patients (1)
  $\kappa_{12} = \kappa_{11}$                                      1       6.20**
  $\kappa_{13} = \kappa_{12}$                                      1       4.38*
  $\kappa_{14} = \kappa_{13}$                                      1      10.96**
New Orleans Patients (2)
  $\kappa_{22} = \kappa_{21}$                                      1       0.69
  $\kappa_{23} = \kappa_{22}$                                      1       0.76
  $\kappa_{24} = \kappa_{23}$                                      1      17.17**
* means significant at $\alpha = 0.05$;
** means significant at $\alpha = 0.01$.

These results indicate that all increases in successive agreement measures within the
Winnipeg patient group are significant ($\alpha = 0.05$); but for the New Orleans patient group,
the only significant ($\alpha = 0.05$) increase in agreement pertained to the final set of weights.
Thus, the neurologists are exhibiting significant disagreement between diagnoses (1,2),
(2,3) and (3,4) in the Winnipeg group and significant disagreement between diagnoses
(2,3) in the New Orleans group, as evidenced by the inflated frequencies in these off-diagonal
cells in Table 1. On the other hand, the estimates in (4.6) suggest that the hierarchical
kappa measures within both patient groups exhibit the same increasing trend. Since the
estimated variances of the kappa statistics are much larger for the New Orleans patient
group (due to the smaller sample size), the agreement patterns may indeed be essentially
the same in both patient groups.
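The single degree of freedom statistics in Table 4 are Wald statistics for the difference between two successive kappa estimates, so they can be approximated from the published values in (4.6) and (4.7). The sketch below uses those rounded inputs, which is why its results agree with Table 4 only to within a few hundredths; the function name is hypothetical.

```python
# Estimates (4.6) and covariance blocks (4.7), keyed by sub-population.
KAPPA = {
    1: [0.208, 0.328, 0.408, 0.596],
    2: [0.297, 0.332, 0.386, 0.789],
}
COV = {  # published entries are in units of 1e-2
    1: [[0.2546, 0.2122, 0.1868, 0.1442],
        [0.2122, 0.4005, 0.3862, 0.2912],
        [0.1868, 0.3862, 0.5200, 0.3832],
        [0.1442, 0.2912, 0.3832, 0.5700]],
    2: [[0.6163, 0.5582, 0.5046, 0.2185],
        [0.5582, 0.6879, 0.6544, 0.3010],
        [0.5046, 0.6544, 1.0030, 0.4147],
        [0.2185, 0.3010, 0.4147, 0.7720]],
}


def successive_difference_q(i, h):
    """1 d.f. Wald statistic for kappa_{i,h+1} = kappa_{i,h}, with h = 1, 2, 3."""
    k, v = KAPPA[i], COV[i]
    diff = k[h] - k[h - 1]
    var = (v[h][h] + v[h - 1][h - 1] - 2 * v[h][h - 1]) * 1e-2
    return diff ** 2 / var


q = {(i, h): successive_difference_q(i, h) for i in (1, 2) for h in (1, 2, 3)}
for (i, h), value in sorted(q.items()):
    print(f"sub-population {i}: kappa_{i}{h + 1} = kappa_{i}{h}  Q = {value:.2f}")
# The two sub-populations are independent, so the combined 2 d.f. statistic is the sum.
print("combined, final weight set:", round(q[(1, 3)] + q[(2, 3)], 2))
```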
If the two neurologists are indeed exhibiting the same agreement patterns with respect
to the weights given in Table 2 within the two groups of patients, then under (3.5) the
$\{\kappa_{ih}\}$ satisfy the following hypotheses from (3.9)

$$H_{SA|E_1}:\ \kappa_{1h} = \kappa_{2h} \quad \text{for } h = 1, 2, 3, 4. \tag{4.8}$$

Test statistics for these hypotheses both individually and jointly are presented in Table 5.

Table 5
STATISTICAL TESTS BETWEEN PATIENT SUB-POPULATIONS

Hypothesis                                            D.F.     Q_C
$\kappa_{1h} = \kappa_{2h}$ for h = 1, 2, 3, 4          4      7.15
$\kappa_{11} = \kappa_{21}$                             1      0.90
$\kappa_{12} = \kappa_{22}$                             1      0.00
$\kappa_{13} = \kappa_{23}$                             1      0.03
$\kappa_{14} = \kappa_{24}$                             1      2.77

The results in Tables 4 and 5 suggest that a reduced model can be used to combine
parameters which are essentially equivalent. For this purpose, the agreement statistics
in (4.6) can be modeled by

$$E_A\{F\} = X\beta = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix} \begin{bmatrix} \kappa_1 \\ \kappa_2 \\ \kappa_3 \\ \kappa_4 \\ \kappa_5 \end{bmatrix} \tag{4.9}$$

where "$E_A$" denotes "asymptotic expectation." For this model, the goodness-of-fit statistic
is $Q = 2.27$ with d.f. = 3. Thus, this reduced model provides a satisfactory characterization
of the variation among these agreement measures. Specific test statistics for the
corresponding hypotheses in (3.10) and (3.11) pertaining to the model $X$ in (4.9) are given in
Table 6. These results suggest that all the parameters are significantly ($\alpha = 0.01$) different
from zero, and moreover, are significantly ($\alpha = 0.05$) different from each other.
Furthermore, by reducing the model to these smoothed estimates, the marginally significant
($\alpha = 0.10$) difference between $\kappa_{14}$ and $\kappa_{24}$ in Table 5 is now significant ($\alpha = 0.05$) for the
comparison of $\kappa_4$ and $\kappa_5$ in this final model. Finally, the predicted values for the $\{\kappa_{ih}\}$ based
on the fitted model (4.9) are displayed in Table 7 together with their corresponding estimated
standard errors.

Table 6
STATISTICAL TESTS FOR MODEL X

Hypothesis                  D.F.     Q_C        Hypothesis        D.F.     Q_C
$\kappa_2 = \kappa_1$         1      5.40*      $\kappa_1 = 0$      1     31.05**
$\kappa_3 = \kappa_2$         1      4.92*      $\kappa_2 = 0$      1     40.71**
$\kappa_4 = \kappa_3$         1     12.33**     $\kappa_3 = 0$      1     45.49**
$\kappa_5 = \kappa_4$         1      4.88*      $\kappa_4 = 0$      1     72.44**
                                                $\kappa_5 = 0$      1     94.97**
* means significant at $\alpha = 0.05$;
** means significant at $\alpha = 0.01$.
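The reduced model (4.9) can be fitted by weighted least squares, $b = (X'V_F^{-1}X)^{-1}X'V_F^{-1}F$, with goodness of fit $Q = (F - Xb)'V_F^{-1}(F - Xb)$. The sketch below does this with the rounded published values of (4.6) and (4.7) and a small pure-Python linear solver, so its output should match the smoothed estimates, standard errors and fit statistic only approximately. All function and variable names are illustrative.

```python
from math import sqrt


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


F = [0.208, 0.328, 0.408, 0.596, 0.297, 0.332, 0.386, 0.789]
V1 = [[0.2546, 0.2122, 0.1868, 0.1442], [0.2122, 0.4005, 0.3862, 0.2912],
      [0.1868, 0.3862, 0.5200, 0.3832], [0.1442, 0.2912, 0.3832, 0.5700]]
V2 = [[0.6163, 0.5582, 0.5046, 0.2185], [0.5582, 0.6879, 0.6544, 0.3010],
      [0.5046, 0.6544, 1.0030, 0.4147], [0.2185, 0.3010, 0.4147, 0.7720]]
# Block-diagonal 8 x 8 covariance matrix; published entries are in units of 1e-2.
V = [[0.0] * 8 for _ in range(8)]
for i in range(4):
    for j in range(4):
        V[i][j] = V1[i][j] * 1e-2
        V[i + 4][j + 4] = V2[i][j] * 1e-2
# Design matrix of (4.9): common kappas for weight sets 1-3, separate ones for set 4.
X = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0],
     [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 0, 1]]


def weighted_least_squares(F, V, X):
    """Return (estimates, standard errors, goodness-of-fit Q) for E{F} = X beta."""
    n, t = len(F), len(X[0])
    # Columns of V^{-1} X, obtained by solving V z = (column of X).
    vinv_x = [solve(V, [X[r][c] for r in range(n)]) for c in range(t)]
    xtvx = [[sum(X[r][a] * vinv_x[c][r] for r in range(n)) for c in range(t)]
            for a in range(t)]
    xtvf = [sum(vinv_x[c][r] * F[r] for r in range(n)) for c in range(t)]
    beta = solve(xtvx, xtvf)
    resid = [F[r] - sum(X[r][c] * beta[c] for c in range(t)) for r in range(n)]
    q = sum(u * w for u, w in zip(resid, solve(V, resid)))
    # Standard errors: square roots of the diagonal of (X' V^{-1} X)^{-1}.
    se = [sqrt(solve(xtvx, [1.0 if k == c else 0.0 for k in range(t)])[c])
          for c in range(t)]
    return beta, se, q


beta, se, q = weighted_least_squares(F, V, X)
print([round(x, 3) for x in beta], [round(x, 3) for x in se], round(q, 2))
```

The residual degrees of freedom are 8 − 5 = 3, matching the goodness-of-fit test quoted for this model.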
Thus, these results suggest that the diagnostic criteria are not very distinct with respect
to their usage by these two neurologists. In addition to bias at the macro stage, i.e.,
considering only the overall marginal proportions, these observers exhibited significant
disagreement at the micro stage, i.e., considering each individual subject, in specifying a
diagnosis. Only with respect to the relatively relaxed criterion corresponding to the fourth
set of weights do the kappa statistics indicate a "moderate" to "substantial" level of
interobserver reliability.
Table 7
SMOOTHED ESTIMATES OF AGREEMENT UNDER MODEL X

                          Sub-population 1                Sub-population 2
            Agreement     Estimate    Estimated           Estimate    Estimated
Weights     Statistic     Under X     Standard Error      Under X     Standard Error
w_1j        κ_i1          0.236       0.042               0.236       0.042
w_2j        κ_i2          0.311       0.049               0.311       0.049
w_3j        κ_i3          0.383       0.057               0.383       0.057
w_4j        κ_i4          0.579       0.068               0.790       0.081

5. Discussion
In some applications, one may also be interested in a set of weights which assign varying
degrees of partial credit to the off-diagonal cells depending on the extent of the
disagreement, rather than successively combining adjoining categories as shown in Table 2. For
example, the weights $w_{2j}$ in Table 8 are directly analogous to those discussed in Cohen
[1968], Fleiss, Cohen and Everitt [1969] and Cicchetti [1972], which were used to generate
weighted kappa and C statistics. For the data in Table 1, these estimates are given by

$$F = \begin{bmatrix} \hat\kappa_{11} \\ \hat\kappa_{12} \\ \hat\kappa_{21} \\ \hat\kappa_{22} \end{bmatrix} = \begin{bmatrix} 0.208 \\ 0.315 \\ 0.297 \\ 0.407 \end{bmatrix} \tag{5.1}$$

where the $\{\hat\kappa_{i1}\}$ estimate the perfect agreement kappa measure and the $\{\hat\kappa_{i2}\}$ estimate
the partial-credit weighted kappa agreement measure between the two neurologists in the
two patient populations. A more extensive analysis of these data under the weights in
Table 8 is given in Landis [1975] and Landis et al. [1976].

Table 8
ALTERNATIVE WEIGHTS FOR OVERALL AGREEMENT MEASURES
Rows: diagnostic class assigned by Observer 1; columns: diagnostic class assigned by Observer 2

Weights                    w_1j                      w_2j
Observer 2 class       1    2    3    4          1    2    3    4
Observer 1 class
1                      1    0    0    0          1   1/2  1/4   0
2                      0    1    0    0         1/2   1   1/2  1/4
3                      0    0    1    0         1/4  1/2   1   1/2
4                      0    0    0    1          0   1/4  1/2   1
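The partial-credit estimates in (5.1) can be checked numerically. In the sketch below the off-diagonal weights are an assumption: they are taken to be 1/2, 1/4 and 0 for classifications that differ by one, two and three categories, a reading of the second weight set of Table 8 that is adopted here because it reproduces the published values 0.315 and 0.407 to three decimals under total independence. The names used are illustrative.

```python
# ASSUMED partial-credit weights: credit depends only on how far apart the two
# diagnoses are (0, 1, 2 or 3 categories).
CREDIT = {0: 1.0, 1: 0.5, 2: 0.25, 3: 0.0}
PARTIAL = [[CREDIT[abs(a - b)] for b in range(4)] for a in range(4)]

winnipeg = [[38, 5, 0, 1], [33, 11, 3, 0], [10, 14, 5, 6], [3, 7, 3, 10]]
new_orleans = [[5, 3, 0, 0], [3, 11, 4, 0], [2, 13, 3, 4], [1, 2, 4, 14]]


def weighted_kappa(table, weights):
    """Weighted kappa for two observers under total independence."""
    n = sum(sum(row) for row in table)
    L = len(table)
    row = [sum(table[a]) / n for a in range(L)]
    col = [sum(table[a][b] for a in range(L)) / n for b in range(L)]
    lam = sum(weights[a][b] * table[a][b] / n for a in range(L) for b in range(L))
    gam = sum(weights[a][b] * row[a] * col[b] for a in range(L) for b in range(L))
    return (lam - gam) / (1.0 - gam)


k12 = weighted_kappa(winnipeg, PARTIAL)
k22 = weighted_kappa(new_orleans, PARTIAL)
print(round(k12, 3), round(k22, 3))
```

Other common schemes, such as linear weights $1 - |a - b|/3$, give noticeably different values for these data, so the choice of weights should always be reported alongside a weighted kappa.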
Although the methodology for the assessment of observer agreement developed in this
paper is quite general, these procedures have been illustrated with an example involving
only two observers. However, for situations in which either the number of observers $d$
or the number of response categories $L$ is moderately large, the number of possible
multivariate response profiles $r = L^d$ becomes extremely large. Consequently, the matrices
required to implement the GSK procedures directly may be outside the scope of
computational feasibility. In addition, for each of the $s$ sub-populations many of the $r$ possible
response profiles will not necessarily be observed in the respective samples so that
corresponding cell frequencies are zero. Thus, in such cases, specialized computing procedures
are required to obtain the estimates of the pertinent functions.
One alternative approach for handling such very large contingency tables in which
most of the observed cell frequencies are zero is discussed in Landis and Koch [1977].
In this regard, the same estimators which would need to be obtained from the conceptual
multidimensional contingency table can be generated by first forming appropriate indicator
variables of the raw data from each subject and then computing the across-subject
arithmetic means. Subsequent to these preliminary steps, the usual matrix operations discussed
in Appendix 1 in Koch et al. [1977] can then be applied to these indicator variable means
to determine the required measures of observer agreement. These alternative
computations involving raw data, as well as the extended GSK procedures summarized in Appendix 1
in Koch et al. [1977] can all be performed by a recently developed computer program
(GENCAT) discussed in Landis et al. [1976].
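The indicator-variable idea can be sketched for two observers as follows: each subject contributes indicators for the category chosen by each observer and an agreement score given by the weight of the observed pair, and the kappa estimate is built from the across-subject means without ever forming the full contingency table. This is only an illustration of the principle, not the GENCAT implementation; the function names and the record layout are assumptions.

```python
def expand_to_subjects(table):
    """One (observer 1 class, observer 2 class) record per subject, classes from 1."""
    return [(a + 1, b + 1)
            for a, row in enumerate(table)
            for b, count in enumerate(row)
            for _ in range(count)]


def kappa_from_raw(ratings, weights):
    """Weighted kappa under total independence, computed from per-subject records."""
    L = len(weights)
    n = len(ratings)
    # Means of the indicators 1[observer g chose category k] estimate the marginals.
    row = [sum(1 for a, _ in ratings if a == k) / n for k in range(1, L + 1)]
    col = [sum(1 for _, b in ratings if b == k) / n for k in range(1, L + 1)]
    # Mean of the per-subject agreement weights estimates lambda of (3.3).
    lam = sum(weights[a - 1][b - 1] for a, b in ratings) / n
    gam = sum(weights[a][b] * row[a] * col[b] for a in range(L) for b in range(L))
    return (lam - gam) / (1.0 - gam)


winnipeg = [[38, 5, 0, 1], [33, 11, 3, 0], [10, 14, 5, 6], [3, 7, 3, 10]]
identity = [[1 if a == b else 0 for b in range(4)] for a in range(4)]
ratings = expand_to_subjects(winnipeg)
k_raw = kappa_from_raw(ratings, identity)
print(len(ratings), round(k_raw, 3))
```

For $d$ observers the same scheme needs only $dL$ marginal indicators and one agreement score per weight set for each subject, rather than all $L^d$ cells of the conceptual table.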
Acknowledgments
This research was partially supported by Research Grants GM-00038-20 and
GM-70004-05 from the National Institute of General Medical Sciences and by the U. S. Bureau
of the Census through Joint Statistical Agreements JSA 74-2 and JSA 75-2. The authors
would like to thank the referees for their helpful comments on an earlier draft of this paper.
In addition, the authors are grateful to Ms. Rebecca Wesson and Ms. Lynn Wilkinson
for their conscientious typing of previous drafts of this paper, and to Ms. Linda L. Blakley
and Ms. Connie Massey for their efficient typing of the final version of this manuscript.
References
Anderson, R. L. and Bancroft, T. A. [1952]. Statistical Theory in Research. McGraw-Hill, New York.
Bhapkar, V. P. [1966]. A note on the equivalence of two test criteria for hypotheses in categorical data. Journal of the American Statistical Association 61, 228-235.
Bhapkar, V. P. [1968]. On the analysis of contingency tables with a quantitative response. Biometrics 24, 329-338.
Bhapkar, V. P. and Koch, G. G. [1968a]. Hypotheses of "no interaction" in multidimensional contingency tables. Technometrics 10, 107-123.
Bhapkar, V. P. and Koch, G. G. [1968b]. On the hypotheses of "no interaction" in contingency tables. Biometrics 24, 567-594.
Cicchetti, D. V. [1972]. A new measure of agreement between rank-ordered variables. Proceedings, 80th Annual Convention, APA, 17-18.
Cohen, J. [1960]. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 37-46.
Cohen, J. [1968]. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 70, 213-220.
Fleiss, J. L. [1966]. Assessing the accuracy of multivariate observations. Journal of the American Statistical Association 61, 403-412.
Fleiss, J. L., Cohen, J. and Everitt, B. S. [1969]. Large sample standard errors of kappa and weighted kappa. Psychological Bulletin 72, 323-337.
Fleiss, J. L. [1971]. Measuring nominal scale agreement among many raters. Psychological Bulletin 76, 378-382.
Fleiss, J. L. and Cohen, J. [1973]. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement 33, 613-619.
Fleiss, J. L. [1975]. Measuring agreement between two judges on the presence or absence of a trait.
Forthofer, R. N. and Koch, G. G. [1973]. An analysis for compounded functions of categorical data. Biometrics 29, 143-157.
Grizzle, J. E., Starmer, C. F. and Koch, G. G. [1969]. Analysis of categorical data by linear models.
Grubbs, F. E. [1948]. On estimating precision of measuring instruments and product variability. Journal of the American Statistical Association 43, 243-264.
Grubbs, F. E. [1973]. Errors of measurement, precision, accuracy and the statistical comparison of measuring instruments. Technometrics 15, 53-66.
Goodman, L. A. and Kruskal, W. H. [1954]. Measures of association for cross classifications. Journal of the American Statistical Association 49, 732-764.
Koch, G. G. [1967]. A general approach to the estimation of variance components. Technometrics 9, 93-118.
Koch, G. G. [1968]. Some further remarks concerning "A general approach to the estimation of variance components." Technometrics 10, 551-558.
Koch, G. G. and Reinfurt, D. W. [1971]. The analysis of categorical data from mixed models. Biometrics 27, 157-173.
Koch, G. G., Landis, J. R., Freeman, J. L., Freeman, D. H., Jr. and Lehnen, R. G. [1977]. A general methodology for the analysis of experiments with repeated measurement of categorical data. Biometrics 33, 133-158.
Landis, J. R. [1975]. A general methodology for the measurement of observer agreement when the data are categorical. Ph.D. Dissertation, University of North Carolina, Institute of Statistics Mimeo Series No. 1022.
Landis, J. R. and Koch, G. G. [1975a]. A review of statistical methods in the analysis of data arising from observer reliability studies (Part I). Statistica Neerlandica 29, 101-123.
Landis, J. R. and Koch, G. G. [1975b]. A review of statistical methods in the analysis of data arising from observer reliability studies (Part II). Statistica Neerlandica 29, 151-161.
Landis, J. R. and Koch, G. G. [1977]. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Accepted for publication in Biometrics.
Landis, J. R., Stanish, W. M., Freeman, J. L. and Koch, G. G. [1976]. A computer program for the generalized chi-square analysis of categorical data using weighted least squares (GENCAT). University of Michigan Biostatistics Technical Report No. 8. Accepted for publication in Computer Programs in Biomedicine.
Light, R. J. [1971]. Measures of response agreement for qualitative data: some generalizations and alternatives. Psychological Bulletin 76, 365-377.
Loewenson, R. B., Bearman, J. E. and Resch, J. A. [1972]. Reliability of measurements for studies of cardiovascular atherosclerosis. Biometrics 28, 557-569.
Mandel, J. [1959]. The measuring process. Technometrics 1, 251-267.
Neyman, J. [1949]. Contribution to the theory of the χ² test. Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, Berkeley and Los Angeles, University of California Press, 239-272.
Overall, J. E. [1968]. Estimating individual rater reliabilities from analysis of treatment effects. Educational and Psychological Measurement 28, 255-264.
Scheffé, H. [1959]. The Analysis of Variance. Wiley, New York.
Searle, S. R. [1971]. Linear Models. Wiley, New York.
Wald, A. [1943]. Tests of statistical hypotheses concerning general parameters when the number of observations is large. Transactions of the American Mathematical Society 54, 426-482.
Westlund, K. B. and Kurland, L. T. [1953]. Studies on multiple sclerosis in Winnipeg, Manitoba and New Orleans, Louisiana. American Journal of Hygiene 57, 380-396.
Received April 1975, Revised November 1975
