
Psychological Bulletin
VOL. 52, NO. 4, JULY, 1955

CONSTRUCT VALIDITY IN PSYCHOLOGICAL TESTS
LEE J. CRONBACH, University of Illinois
AND
PAUL E. MEEHL,¹ University of Minnesota

Validation of psychological tests has not yet been adequately conceptualized, as the APA Committee on Psychological Tests learned when it undertook (1950-54) to specify what qualities should be investigated before a test is published. In order to make coherent recommendations the Committee found it necessary to distinguish four types of validity, established by different types of research and requiring different interpretation. The chief innovation in the Committee's report was the term construct validity.² This idea was first formulated by a subcommittee (Meehl and R. C. Challman) studying how proposed recommendations would apply to projective techniques, and later modified and clarified by the entire Committee (Bordin, Challman, Conrad, Humphreys, Super, and the present writers). The statements agreed upon by the Committee (and by committees of two other associations) were published in the Technical Recommendations (59). The present interpretation of construct validity is not "official" and deals with some areas where the Committee would probably not be unanimous. The present writers are solely responsible for this attempt to explain the concept and elaborate its implications.

Identification of construct validity was not an isolated development. Writers on validity during the preceding decade had shown a great deal of dissatisfaction with conventional notions of validity, and introduced new terms and ideas, but the resulting aggregation of types of validity seems only to have stirred the muddy waters. Portions of the distinctions we shall discuss are implicit in Jenkins' paper, "Validity for what?" (33), Gulliksen's "Intrinsic validity" (27), Goodenough's distinction between tests as "signs" and "samples" (22), Cronbach's separation of "logical" and "empirical" validity (11), Guilford's "factorial validity" (25), and Mosier's papers on "face validity" and "validity generalization" (49, 50). Helen Peak (52) comes close to an explicit statement of construct validity as we shall present it.

¹ The second author worked on this problem in connection with his appointment to the Minnesota Center for Philosophy of Science. We are indebted to the other members of the Center (Herbert Feigl, Michael Scriven, Wilfrid Sellars), and to D. L. Thistlethwaite of the University of Illinois, for their major contributions to our thinking and their suggestions for improving this paper.

² Referred to in a preliminary report (58) as congruent validity.

FOUR TYPES OF VALIDATION

The categories into which the Recommendations divide validity studies are: predictive validity, concurrent validity, content validity, and construct validity. The first two of these may be considered together as criterion-oriented validation procedures.

The pattern of a criterion-oriented
study is familiar. The investigator is primarily interested in some criterion which he wishes to predict. He administers the test, obtains an independent criterion measure on the same subjects, and computes a correlation. If the criterion is obtained some time after the test is given, he is studying predictive validity. If the test score and criterion score are determined at essentially the same time, he is studying concurrent validity. Concurrent validity is studied when one test is proposed as a substitute for another (for example, when a multiple-choice form of spelling test is substituted for taking dictation), or a test is shown to correlate with some contemporary criterion (e.g., psychiatric diagnosis).

Content validity is established by showing that the test items are a sample of a universe in which the investigator is interested. Content validity is ordinarily to be established deductively, by defining a universe of items and sampling systematically within this universe to establish the test.

Construct validation is involved whenever a test is to be interpreted as a measure of some attribute or quality which is not "operationally defined." The problem faced by the investigator is, "What constructs account for variance in test performance?" Construct validity calls for no new scientific approach. Much current research on tests of personality (9) is construct validation, usually without the benefit of a clear formulation of this process.

Construct validity is not to be identified solely by particular investigative procedures, but by the orientation of the investigator. Criterion-oriented validity, as Bechtoldt emphasizes (3, p. 1245), "involves the acceptance of a set of operations as an adequate definition of whatever is to be measured." When an investigator believes that no criterion available to him is fully valid, he perforce becomes interested in construct validity because this is the only way to avoid the "infinite frustration" of relating every criterion to some more ultimate standard (21). In content validation, acceptance of the universe of content as defining the variable to be measured is essential. Construct validity must be investigated whenever no criterion or universe of content is accepted as entirely adequate to define the quality to be measured. Determining what psychological constructs account for test performance is desirable for almost any test. Thus, although the MMPI was originally established on the basis of empirical discrimination between patient groups and so-called normals (concurrent validity), continuing research has tried to provide a basis for describing the personality associated with each score pattern. Such interpretations permit the clinician to predict performance with respect to criteria which have not yet been employed in empirical validation studies (cf. 46, pp. 49-50, 110-111).

We can distinguish among the four types of validity by noting that each involves a different emphasis on the criterion. In predictive or concurrent validity, the criterion behavior is of concern to the tester, and he may have no concern whatsoever with the type of behavior exhibited in the test. (An employer does not care if a worker can manipulate blocks, but the score on the block test may predict something he cares about.) Content validity is studied when the tester is concerned with the type of behavior involved in the test performance. Indeed, if the test is a work sample, the behavior represented in the test may be an end in itself. Construct validity is ordinarily studied when the tester has no definite criterion measure of the quality with which he is concerned, and must use indirect measures. Here the trait or quality underlying the test is of central importance, rather than either the test behavior or the scores on the criteria (59, p. 14).
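
As a rough illustration of the criterion-oriented pattern described above (administer the test, obtain an independent criterion measure on the same subjects, compute a correlation), the following Python sketch may help. It is not part of the original article. The function names, the `months_between` parameter, the rule that a zero interval stands for "essentially the same time," and the scores themselves are all assumptions made for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def criterion_oriented_study(test_scores, criterion_scores, months_between):
    """Label the study and return (kind, validity coefficient)."""
    # Assumption: an interval of zero months stands in for
    # "essentially the same time"; any later criterion is "predictive".
    kind = "concurrent" if months_between == 0 else "predictive"
    return kind, pearson_r(test_scores, criterion_scores)


# Made-up scores for six subjects (illustrative only, not data from the paper).
test_scores = [12, 15, 9, 20, 17, 11]
criterion_scores = [30, 34, 25, 41, 33, 29]

kind, r = criterion_oriented_study(test_scores, criterion_scores, months_between=6)
print(kind, round(r, 2))
```

The coefficient describes only how well the test predicts this one criterion. It says nothing about which attribute accounts for the scores, which is the question construct validation takes up.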
Construct validation is important at times for every sort of psychological test: aptitude, achievement, interests, and so on. Thurstone's statement is interesting in this connection:

In the field of intelligence tests, it used to be common to define validity as the correlation between a test score and some outside criterion. We have reached a stage of sophistication where the test-criterion correlation is too coarse. It is obsolete. If we attempted to ascertain the validity of a test for the second space-factor, for example, we would have to get judges [to] make reliable judgments about people as to this factor. Ordinarily their [the available judges'] ratings would be of no value as a criterion. Consequently, validity studies in the cognitive functions now depend on criteria of internal consistency . . . (60, p. 3).

Construct validity would be involved in answering such questions as: To what extent is this test of intelligence culture-free? Does this test of "interpretation of data" measure reading ability, quantitative reasoning, or response sets? How does a person with A in Strong Accountant, and B in Strong CPA, differ from a person who has these scores reversed?

Example of construct validation procedure. Suppose measure X correlates .50 with Y, the amount of palmar sweating induced when we tell a student that he has failed a Psychology I exam. Predictive validity of X for Y is adequately described by the coefficient, and a statement of the experimental and sampling conditions. If someone were to ask, "Isn't there perhaps another way to interpret this correlation?" or "What other kinds of evidence can you bring to support your interpretation?", we would hardly understand what he was asking because no interpretation has been made. These questions become relevant when the correlation is advanced as evidence that "test X measures anxiety proneness." Alternative interpretations are possible; e.g., perhaps the test measures "academic aspiration," in which case we will expect different results if we induce palmar sweating by economic threat. It is then reasonable to inquire about other kinds of evidence.

Add these facts from further studies: Test X correlates .45 with fraternity brothers' ratings on "tenseness." Test X correlates .55 with amount of intellectual inefficiency induced by painful electric shock, and .68 with the Taylor Anxiety scale. Mean X score decreases among four diagnosed groups in this order: anxiety state, reactive depression, "normal," and psychopathic personality. And palmar sweat under threat of failure in Psychology I correlates .60 with threat of failure in mathematics. Negative results eliminate competing explanations of the X score; thus, findings of negligible correlations between X and social class, vocational aim, and value-orientation make it fairly safe to reject the suggestion that X measures "academic aspiration." We can have substantial confidence that X does measure anxiety proneness if the current theory of anxiety can embrace the variates which yield positive correlations, and does not predict correlations where we found none.

KINDS OF CONSTRUCTS

At this point we should indicate summarily what we mean by a construct, recognizing that much of the remainder of the paper deals with this question. A construct is some postulated attribute of people, assumed to be reflected in test performance. In test validation the attribute about which we make statements in interpreting a test is a construct. We expect a person at any time to possess or not possess a qualitative attribute (amnesia) or structure, or to possess some degree of a quantitative attribute (cheerfulness). A construct has certain associated meanings carried in statements of this general character: Persons who possess this attribute will, in situation X, act in manner Y (with a stated probability). The logic of construct validation is invoked whether the construct is highly systematized or loose, used in ramified theory or a few simple propositions, used in absolute propositions or probability statements. We seek to specify how one is to defend a proposed interpretation of a test; we are not recommending any one type of interpretation.

The constructs in which tests are to be interpreted are certainly not likely to be physiological. Most often they will be traits such as "latent hostility" or "variable in mood," or descriptions in terms of an educational objective, as "ability to plan experiments." For the benefit of readers who may have been influenced by certain eisegeses of MacCorquodale and Meehl (40), let us here emphasize: Whether or not an interpretation of a test's properties or relations involves questions of construct validity is to be decided by examining the entire body of evidence offered, together with what is asserted about the test in the context of this evidence. Proposed identifications of constructs allegedly measured by the test with constructs of other sciences (e.g., genetics, neuroanatomy, biochemistry) make up only one class of construct-validity claims, and a rather minor one at present. Space does not permit full analysis of the relation of the present paper to the MacCorquodale-Meehl distinction between hypothetical constructs and intervening variables. The philosophy of science pertinent to the present paper is set forth later in the section entitled, "The nomological network."

THE RELATION OF CONSTRUCTS TO "CRITERIA"

Critical View of the Criterion Implied

An unquestionable criterion may be found in a practical operation, or may be established as a consequence of an operational definition. Typically, however, the psychologist is unwilling to use the directly operational approach because he is interested in building theory about a generalized construct. A theorist trying to relate behavior to "hunger" almost certainly invests that term with meanings other than the operation "elapsed-time-since-feeding." If he is concerned with hunger as a tissue need, he will not accept time lapse as equivalent to his construct because it fails to consider, among other things, energy expenditure of the animal.

In some situations the criterion is no more valid than the test. Suppose, for example, that we want to know if counting the dots on Bender-Gestalt figure five indicates "compulsive rigidity," and take psychiatric ratings on this trait as a criterion. Even a conventional report on the resulting correlation will say something about the extent and intensity of the psychiatrist's contacts and should describe his qualifications (e.g., diplomate status? analyzed?).

Why report these facts? Because data are needed to indicate whether the criterion is any good. "Compulsive rigidity" is not really intended to mean "social stimulus value to psychiatrists." The implied trait involves a range of behavior-dispositions which may be very imperfectly sampled by the psychiatrist. Suppose dot-counting does not occur in a particular patient and yet we find that the psychiatrist has rated him as "rigid." When questioned the psychiatrist tells us that the patient was a rather easy, free-wheeling sort;
however, the patient did lean over to straighten out a skewed desk blotter, and this, viewed against certain other facts, tipped the scale in favor of a "rigid" rating. On the face of it, counting Bender dots may be just as good (or poor) a sample of the compulsive-rigidity domain as straightening desk blotters is.

Suppose, to extend our example, we have four tests on the "predictor" side, over against the psychiatrist's "criterion," and find generally positive correlations among the five variables. Surely it is artificial and arbitrary to impose the "test-should-predict-criterion" pattern on such data. The psychiatrist samples verbal content, expressive pattern, voice, posture, etc. The psychologist samples verbal content, perception, expressive pattern, etc. Our proper conclusion is that, from this evidence, the four tests and the psychiatrist all assess some common factor.

The asymmetry between the "test" and the so-designated "criterion" arises only because the terminology of predictive validity has become a commonplace in test analysis. In this study where a construct is the central concern, any distinction between the merit of the test and criterion variables would be justified only if it had already been shown that the psychiatrist's theory and operations were excellent measures of the attribute.

INADEQUACY OF VALIDATION IN TERMS OF SPECIFIC CRITERIA

The proposal to validate constructual interpretations of tests runs counter to suggestions of some others. Spiker and McCandless (57) favor an operational approach. Validation is replaced by compiling statements as to how strongly the test predicts other observed variables of interest. To avoid requiring that each new variable be investigated completely by itself, they allow two variables to collapse into one whenever the properties of the operationally defined measures are the same: "If a new test is demonstrated to predict the scores on an older, well-established test, then an evaluation of the predictive power of the older test may be used for the new one." But accurate inferences are possible only if the two tests correlate so highly that there is negligible reliable variance in either test, independent of the other. Where the correspondence is less close, one must either retain all the separate variables operationally defined or embark on construct validation.

The practical user of tests must rely on constructs of some generality to make predictions about new situations. Test X could be used to predict palmar sweating in the face of failure without invoking any construct, but a counselor is more likely to be asked to forecast behavior in diverse or even unique situations for which the correlation of test X is unknown. Significant predictions rely on knowledge accumulated around the generalized construct of anxiety. The Technical Recommendations state:

It is ordinarily necessary to evaluate construct validity by integrating evidence from many different sources. The problem of construct validation becomes especially acute in the clinical field since for many of the constructs dealt with it is not a question of finding an imperfect criterion but of finding any criterion at all. The psychologist interested in construct validity for clinical devices is concerned with making an estimate of a hypothetical internal process, factor, system, structure, or state and cannot expect to find a clear unitary behavioral criterion. An attempt to identify any one criterion measure or any composite as the criterion aimed at is, however, usually unwarranted (59, pp. 14-15).

This appears to conflict with arguments for specific criteria prominent at places in the testing literature.
Thus Anastasi (2) makes many statements of the latter character: "It is only as a measure of a specifically defined criterion that a test can be objectively validated at all ... To claim that a test measures anything over and above its criterion is pure speculation" (p. 67). Yet elsewhere this article supports construct validation. Tests can be profitably interpreted if we "know the relationships between the tested behavior . . . and other behavior samples, none of these behavior samples necessarily occupying the preeminent position of a criterion" (p. 75). Factor analysis with several partial criteria might be used to study whether a test measures a postulated "general learning ability." If the data demonstrate specificity of ability instead, such specificity is "useful in its own right in advancing our knowledge of behavior; it should not be construed as a weakness of the tests" (p. 75).

We depart from Anastasi at two points. She writes, "The validity of a psychological test should not be confused with an analysis of the factors which determine the behavior under consideration." We, however, regard such analysis as a most important type of validation. Second, she refers to "the will-o'-the-wisp of psychological processes which are distinct from performance" (2, p. 77). While we agree that psychological processes are elusive, we are sympathetic to attempts to formulate and clarify constructs which are evidenced by performance but distinct from it. Surely an inductive inference based on a pattern of correlations cannot be dismissed as "pure speculation."

Specific Criteria Used Temporarily: The "Bootstraps" Effect

Even when a test is constructed on the basis of a specific criterion, it may ultimately be judged to have greater construct validity than the criterion. We start with a vague concept which we associate with certain observations. We then discover empirically that these observations covary with some other observation which possesses greater reliability or is more intimately correlated with relevant experimental changes than is the original measure, or both. For example, the notion of temperature arises because some objects feel hotter to the touch than others. The expansion of a mercury column does not have face validity as an index of hotness. But it turns out that (a) there is a statistical relation between expansion and sensed temperature; (b) observers employ the mercury method with good interobserver agreement; (c) the regularity of observed relations is increased by using the thermometer (e.g., melting points of samples of the same material vary little on the thermometer; we obtain nearly linear relations between mercury measures and pressure of a gas). Finally, (d) a theoretical structure involving unobservable microevents—the kinetic theory—is worked out which explains the relation of mercury expansion to heat. This whole process of conceptual enrichment begins with what in retrospect we see as an extremely fallible "criterion"—the human temperature sense. That original criterion has now been relegated to a peripheral position. We have lifted ourselves by our bootstraps, but in a legitimate and fruitful way.

Similarly, the Binet scale was first valued because children's scores tended to agree with judgments by schoolteachers. If it had not shown this agreement, it would have been discarded along with reaction time and the other measures of ability previously tried. Teacher judgments once constituted the criterion against
which the individual intelligence test was validated. But if today a child's IQ is 135 and three of his teachers complain about how stupid he is, we do not conclude that the test has failed. Quite to the contrary, if no error in test procedure can be argued, we treat the test score as a valid statement about an important quality, and define our task as that of finding out what other variables—personality, study skills, etc.—modify achievement or distort teacher judgment.

EXPERIMENTATION TO INVESTIGATE CONSTRUCT VALIDITY

Validation Procedures

We can use many methods in construct validation. Attention should particularly be drawn to Macfarlane's survey of these methods as they apply to projective devices (41).

Group differences. If our understanding of a construct leads us to expect two groups to differ on the test, this expectation may be tested directly. Thus Thurstone and Chave validated the Scale for Measuring Attitude Toward the Church by showing score differences between church members and nonchurchgoers. Churchgoing is not the criterion of attitude, for the purpose of the test is to measure something other than the crude sociological fact of church attendance; on the other hand, failure to find a difference would have seriously challenged the test.

Only coarse correspondence between test and group designation is expected. Too great a correspondence between the two would indicate that the test is to some degree invalid, because members of the groups are expected to overlap on the test. Intelligence test items are selected initially on the basis of a correspondence to age, but an item that correlates .95 with age in an elementary school sample would surely be suspect.

Correlation matrices and factor analysis. If two tests are presumed to measure the same construct, a correlation between them is predicted. (An exception is noted where some second attribute has positive loading in the first test and negative loading in the second test; then a low correlation is expected. This is a testable interpretation provided an external measure of either the first or the second variable exists.) If the obtained correlation departs from the expectation, however, there is no way to know whether the fault lies in test A, test B, or the formulation of the construct. A matrix of intercorrelations often points out profitable ways of dividing the construct into more meaningful parts, factor analysis being a useful computational method in such studies.

Guilford (26) has discussed the place of factor analysis in construct validation. His statements may be extracted as follows: "The personnel psychologist wishes to know 'why his tests are valid.' He can place tests and practical criteria in a matrix and factor it to identify 'real dimensions of human personality.' A factorial description is exact and stable; it is economical in explanation; it leads to the creation of pure tests which can be combined to predict complex behaviors." It is clear that factors here function as constructs. Eysenck, in his "criterion analysis" (18), goes farther than Guilford, and shows that factoring can be used explicitly to test hypotheses about constructs.

Factors may or may not be weighted with surplus meaning. Certainly when they are regarded as "real dimensions" a great deal of surplus meaning is implied, and the interpreter must shoulder a substantial burden of proof. The alternative view is to regard factors as defining a working reference frame, located in a convenient manner in the "space" defined by all behaviors of a given type. Which set of factors from a given matrix is "most useful" will depend partly on predilections, but in essence the best construct is the one around which we can build the greatest number of inferences, in the most direct fashion.

Studies of internal structure. For many constructs, evidence of homogeneity within the test is relevant in judging validity. If a trait such as dominance is hypothesized, and the items inquire about behaviors subsumed under this label, then the hypothesis appears to require that these items be generally intercorrelated. Even low correlations, if consistent, would support the argument that people may be fruitfully described in terms of a generalized tendency to dominate or not dominate. The general quality would have power to predict behavior in a variety of situations represented by the specific items. Item-test correlations and certain reliability formulas describe internal consistency.

It is unwise to list uninterpreted data of this sort under the heading "validity" in test manuals, as some authors have done. High internal consistency may lower validity. Only if the underlying theory of the trait being measured calls for high item intercorrelations do the correlations support construct validity. Negative item-test correlations may support construct validity, provided that the items with negative correlations are believed irrelevant to the postulated construct and serve as suppressor variables (31, pp. 431-436; 44).

Study of distinctive subgroups of items within a test may set an upper limit to construct validity by showing that irrelevant elements influence scores. Thus a study of the PMA space tests shows that variance can be partially accounted for by a response set, tendency to mark many figures as similar (12). An internal factor analysis of the PEA Interpretation of Data Test shows that in addition to measuring reasoning skills, the test score is strongly influenced by a tendency to say "probably true" rather than "certainly true," regardless of item content (17). On the other hand, a study of item groupings in the DAT Mechanical Comprehension Test permitted rejection of the hypothesis that knowledge about specific topics such as gears made a substantial contribution to scores (13).

Studies of change over occasions. The stability of test scores ("retest reliability," Cattell's "N-technique") may be relevant to construct validation. Whether a high degree of stability is encouraging or discouraging for the proposed interpretation depends upon the theory defining the construct.

More powerful than the retest after uncontrolled intervening experiences is the retest with experimental intervention. If a transient influence swings test scores over a wide range, there are definite limits on the extent to which a test result can be interpreted as reflecting the typical behavior of the individual. These are examples of experiments which have indicated upper limits to test validity: studies of differences associated with the examiner in projective testing, of change of score under alternative directions ("tell the truth" vs. "make yourself look good to an employer"), and of coachability of mental tests. We may recall Gulliksen's distinction (27): When the coaching is of a sort that improves the pupil's intellectual functioning in
school, the test which is affected by the coaching has validity as a measure of intellectual functioning; if the coaching improves test taking but not school performance, the test which responds to the coaching has poor validity as a measure of this construct.

Sometimes, where differences between individuals are difficult to assess by any means other than the test, the experimenter validates by determining whether the test can detect induced intra-individual differences. One might hypothesize that the Zeigarnik effect is a measure of ego involvement, i.e., that with ego involvement there is more recall of incomplete tasks. To support such an interpretation, the investigator will try to induce ego involvement on some task by appropriate directions and compare subjects' recall with their recall for tasks where there was a contrary induction. Sometimes the intervention is drastic. Porteus finds (53) that brain-operated patients show disruption of performance on his maze, but do not show impaired performance on conventional verbal tests and argues therefrom that his test is a better measure of planfulness.

Studies of process. One of the best ways of determining informally what accounts for variability on a test is the observation of the person's process of performance. If it is supposed, for example, that a test measures mathematical competence, and yet observation of students' errors shows that erroneous reading of the question is common, the implications of a low score are altered. Lucas in this way showed that the Navy Relative Movement Test, an aptitude test, actually involved two different abilities: spatial visualization and mathematical reasoning (39).

Mathematical analysis of scoring procedures may provide important negative evidence on construct validity. A recent analysis of "empathy" tests is perhaps worth citing (14). "Empathy" has been operationally defined in many studies by the ability of a judge to predict what responses will be given on some questionnaire by a subject he has observed briefly. A mathematical argument has shown, however, that the scores depend on several attributes of the judge which enter into his perception of any individual, and that they therefore cannot be interpreted as evidence of his ability to interpret cues offered by particular others, or his intuition.

The Numerical Estimate of Construct Validity

There is an understandable tendency to seek a "construct validity coefficient." A numerical statement of the degree of construct validity would be a statement of the proportion of the test score variance that is attributable to the construct variable. This numerical estimate can sometimes be arrived at by a factor analysis, but since present methods of factor analysis are based on linear relations, more general methods will ultimately be needed to deal with many quantitative problems of construct validation.

Rarely will it be possible to estimate definite "construct saturations," because no factor corresponding closely to the construct will be available. One can only hope to set upper and lower bounds to the "loading." If "creativity" is defined as something independent of knowledge, then a correlation of .40 between a presumed test of creativity and a test of arithmetic knowledge would indicate that at least 16 per cent of the reliable test variance is irrelevant to creativity as defined. Laboratory performance on problems such as Maier's "hatrack" would scarcely be
an ideal measure of creativity, but it would be somewhat relevant. If its correlation with the test is .60, this permits a tentative estimate of 36 per cent as a lower bound. (The estimate is tentative because the test might overlap with the irrelevant portion of the laboratory measure.) The saturation seems to lie between 36 and 84 per cent; a cumulation of studies would provide better limits.

It should be particularly noted that rejecting the null hypothesis does not finish the job of construct validation (35, p. 284). The problem is not to conclude that the test "is valid" for measuring the construct variable. The task is to state as definitely as possible the degree of validity the test is presumed to have.

THE LOGIC OF CONSTRUCT VALIDATION

Construct validation takes place when an investigator believes that his instrument reflects a particular construct, to which are attached certain meanings. The proposed interpretation generates specific testable hypotheses, which are a means of confirming or disconfirming the claim. The philosophy of science which we believe does most justice to actual scientific practice will now be briefly and dogmatically set forth. Readers interested in further study of the philosophical underpinning are referred to the works by Braithwaite (6, especially Chapter III), Carnap (7; 8, pp. 56-69), Pap (51), Sellars (55, 56), Feigl (19, 20), Beck (4), Kneale (37, pp. 92-110), Hempel (29; 30, Sec. 7).

The Nomological Net

The fundamental principles are these:

1. Scientifically speaking, to "make clear what something is" means to set forth the laws in which it occurs. We shall refer to the interlocking system of laws which constitute a theory as a nomological network.

2. The laws in a nomological network may relate (a) observable properties or quantities to each other; or (b) theoretical constructs to observables; or (c) different theoretical constructs to one another. These "laws" may be statistical or deterministic.

3. A necessary condition for a construct to be scientifically admissible is that it occur in a nomological net, at least some of whose laws involve observables. Admissible constructs may be remote from observation, i.e., a long derivation may intervene between the nomologicals which implicitly define the construct, and the (derived) nomologicals of type a. These latter propositions permit predictions about events. The construct is not "reduced" to the observations, but only combined with other constructs in the net to make predictions about observables.

4. "Learning more about" a theoretical construct is a matter of elaborating the nomological network in which it occurs, or of increasing the definiteness of the components. At least in the early history of a construct the network will be limited, and the construct will as yet have few connections.

5. An enrichment of the net such as adding a construct or a relation to theory is justified if it generates nomologicals that are confirmed by observation or if it reduces the number of nomologicals required to predict the same observations. When observations will not fit into the network as it stands, the scientist has a certain freedom in selecting where to modify the network. That is, there may be alternative constructs or ways of organizing the net which for the time being are equally defensible.

6. We can say that "operations"
which are qualitatively very different "overlap" or "measure the same thing" if their positions in the nomological net tie them to the same construct variable. Our confidence in this identification depends upon the amount of inductive support we have for the regions of the net involved. It is not necessary that a direct observational comparison of the two operations be made; we may be content with an intranetwork proof indicating that the two operations yield estimates of the same network-defined quantity. Thus, physicists are content to speak of the "temperature" of the sun and the "temperature" of a gas at room temperature even though the test operations are nonoverlapping because this identification makes theoretical sense.

With these statements of scientific methodology in mind, we return to the specific problem of construct validity as applied to psychological tests. The preceding guide rules should reassure the "toughminded," who fear that allowing construct validation opens the door to nonconfirmable test claims. The answer is that unless the network makes contact with observations, and exhibits explicit, public steps of inference, construct validation cannot be claimed. An admissible psychological construct must be behavior-relevant (59, p. 15). For most tests intended to measure constructs, adequate criteria do not exist. This being the case, many such tests have been left unvalidated, or a finespun network of rationalizations has been offered as if it were validation. Rationalization is not construct validation. One who claims that his test reflects a construct cannot maintain his claim in the face of recurrent negative results because these results show that his construct is too loosely defined to yield verifiable inferences.

A rigorous (though perhaps probabilistic) chain of inference is required to establish a test as a measure of a construct. To validate a claim that a test measures a construct, a nomological net surrounding the concept must exist. When a construct is fairly new, there may be few specifiable associations by which to pin down the concept. As research proceeds, the construct sends out roots in many directions, which attach it to more and more facts or other constructs. Thus the electron has more accepted properties than the neutrino; numerical ability has more than the second space factor.

"Acceptance," which was critical in criterion-oriented and content validities, has now appeared in construct validity. Unless substantially the same nomological net is accepted by the several users of the construct, public validation is impossible. If A uses aggressiveness to mean overt assault on others, and B's usage includes repressed hostile reactions, evidence which convinces B that a test measures aggressiveness convinces A that the test does not. Hence, the investigator who proposes to establish a test as a measure of a construct must specify his network or theory sufficiently clearly that others can accept or reject it (cf. 41, p. 406). A consumer of the test who rejects the author's theory cannot accept the author's validation. He must validate the test for himself, if he wishes to show that it represents the construct as he defines it.

Two general qualifications are in order with reference to the methodological principles 1-6 set forth at the beginning of this section. Both of them concern the amount of "theory," in any high-level sense of that word, which enters into a construct-defining network of laws or lawlike statements. We do not wish
to convey the impression that one always has a very elaborate theoretical network, rich in hypothetical processes or entities.

Constructs as inductive summaries. In the early stages of development of a construct or even at more advanced stages when our orientation is thoroughly practical, little or no theory in the usual sense of the word need be involved. In the extreme case the hypothesized laws are formulated entirely in terms of descriptive (observational) dimensions although not all of the relevant observations have actually been made.

The hypothesized network "goes beyond the data" only in the limited sense that it purports to characterize the behavior facets which belong to an observable but as yet only partially sampled cluster; hence, it generates predictions about hitherto unsampled regions of the phenotypic space. Even though no unobservables or high-order theoretical constructs are introduced, an element of inductive extrapolation appears in the claim that a cluster including some elements not-yet-observed has been identified. Since, as in any sorting or abstracting task involving a finite set of complex elements, several nonequivalent bases of categorization are available, the investigator may choose a hypothesis which generates erroneous predictions. The failure of a supposed, hitherto untried, member of the cluster to behave in the manner said to be characteristic of the group, or the finding that a nonmember of the postulated cluster does behave in this manner, may modify greatly our tentative construct.

For example, one might build an intelligence test on the basis of his background notions of "intellect," including vocabulary, arithmetic calculation, general information, similarities, two-point threshold, reaction time, and line bisection as subtests. The first four of these correlate, and he extracts a huge first factor. This becomes a second approximation of the intelligence construct, described by its pattern of loadings on the four tests. The other three tests have negligible loading on any common factor. On this evidence the investigator reinterprets intelligence as "manipulation of words." Subsequently it is discovered that test-stupid people are rated as unable to express their ideas, are easily taken in by fallacious arguments, and misread complex directions. These data support the "linguistic" definition of intelligence and the test's claim of validity for that construct. But then a block design test with pantomime instructions is found to be strongly saturated with the first factor. Immediately the purely "linguistic" interpretation of Factor I becomes suspect. This finding, taken together with our initial acceptance of the others as relevant to the background concept of intelligence, forces us to reinterpret the concept once again.

If we simply list the tests or traits which have been shown to be saturated with the "factor" or which belong to the cluster, no construct is employed. As soon as we even summarize the properties of this group of indicators, we are already making some guesses. Intensional characterization of a domain is hazardous since it selects (abstracts) properties and implies that new tests sharing those properties will behave as do the known tests in the cluster, and that tests not sharing them will not.

The difficulties in merely "characterizing the surface cluster" are strikingly exhibited by the use of certain special and extreme groups for purposes of construct validation. The Pd scale of MMPI was originally derived and cross-validated upon hospitalized patients diagnosed "Psychopathic personality, asocial and amoral type" (42). Further research shows the scale to have a limited degree of predictive and concurrent validity for "delinquency" more broadly defined (5, 28). Several studies show associations between Pd and very special "criterion" groups which it would be ludicrous to identify as "the criterion" in the traditional sense. If one lists these heterogeneous groups and tries to characterize them intensionally, he faces enormous conceptual difficulties. For example, a recent survey of hunting accidents in Minnesota showed that hunters who had "carelessly" shot someone were significantly elevated on Pd when compared with other hunters (48). This is in line with one's theoretical expectations; when you ask MMPI "experts" to predict for such a group they invariably predict Pd or Ma or both. The finding seems therefore to lend some slight support to the construct validity of the Pd scale. But of course it would be nonsense to define the Pd component "operationally" in terms of, say, accident proneness. We might try to subsume the original phenotype and the hunting-accident proneness under some broader category, such as "Disposition to violate society's rules, whether legal, moral, or just sensible." But now we have ceased to have a neat operational criterion, and are using instead a rather vague and wide-range class. Besides, there is worse to come. We want the class specification to cover a group trend that (nondelinquent) high school students judged by their peer group as least "responsible" score over a full sigma higher on Pd than those judged most "responsible" (23, p. 75). Most of the behaviors contributing to such sociometric choices fall well within the range of socially permissible action; the proffered criterion specification is still too restrictive. Again, any clinician familiar with MMPI lore would predict an elevated Pd on a sample of (nondelinquent) professional actors. Chyatte's confirmation of this prediction (10) tends to support both: (a) the theory sketch of "what the Pd factor is, psychologically"; and (b) the claim of the Pd scale to construct validity for this hypothetical factor. Let the reader try his hand at writing a brief phenotypic criterion specification that will cover both trigger-happy hunters and Broadway actors! And if he should be ingenious enough to achieve this, does his definition also encompass Hovey's report that high Pd predicts the judgments "not shy" and "unafraid of mental patients" made upon nurses by their supervisors (32, p. 143)? And then we have Gough's report that low Pd is associated with ratings as "good-natured" (24, p. 40), and Roessell's data showing that high Pd is predictive of "dropping out of high school" (54). The point is that all seven of these "criterion" dispositions would be readily guessed by any clinician having even superficial familiarity with MMPI interpretation; but to mediate these inferences explicitly requires quite a few hypotheses about dynamics, constituting an admittedly sketchy (but far from vacuous) network defining the genotype psychopathic deviate.

Vagueness of present psychological laws. This line of thought leads directly to our second important qualification upon the network schema. The idealized picture is one of a tidy set of postulates which jointly entail the desired theorems; since some of the theorems are coordinated to the observation base, the system constitutes an implicit definition of the
theoretical primitives and gives them an indirect empirical meaning. In practice, of course, even the most advanced physical sciences only approximate this ideal. Questions of "categoricalness" and the like, such as logicians raise about pure calculi, are hardly even statable for empirical networks. (What, for example, would be the desiderata of a "well-formed formula" in molar behavior theory?) Psychology works with crude, half-explicit formulations. We do not worry about such advanced formal questions as "whether all molar-behavior statements are decidable by appeal to the postulates" because we know that no existing theoretical network suffices to predict even the known descriptive laws. Nevertheless, the sketch of a network is there; if it were not, we would not be saying anything intelligible about our constructs. We do not have the rigorous implicit definitions of formal calculi (which still, be it noted, usually permit of a multiplicity of interpretations). Yet the vague, avowedly incomplete network still gives the constructs whatever meaning they do have. When the network is very incomplete, having many strands missing entirely and some constructs tied in only by tenuous threads, then the "implicit definition" of these constructs is disturbingly loose; one might say that the meaning of the constructs is underdetermined. Since the meaning of theoretical constructs is set forth by stating the laws in which they occur, our incomplete knowledge of the laws of nature produces a vagueness in our constructs (see Hempel, 30; Kaplan, 34; Pap, 51). We will be able to say "what anxiety is" when we know all of the laws involving it; meanwhile, since we are in the process of discovering these laws, we do not yet know precisely what anxiety is.

CONCLUSIONS REGARDING THE NETWORK AFTER EXPERIMENTATION

The proposition that x per cent of test variance is accounted for by the construct is inserted into the accepted network. The network then generates a testable prediction about the relation of the test scores to certain other variables, and the investigator gathers data. If prediction and result are in harmony, he can retain his belief that the test measures the construct. The construct is at best adopted, never demonstrated to be "correct."

We do not first "prove" the theory, and then validate the test, nor conversely. In any probable inductive type of inference from a pattern of observations, we examine the relation between the total network of theory and observations. The system involves propositions relating test to construct, construct to other constructs, and finally relating some of these constructs to observables. In ongoing research the chain of inference is very complicated. Kelly and Fiske (36, p. 124) give a complex diagram showing the numerous inferences required in validating a prediction from assessment techniques, where theories about the criterion situation are as integral a part of the prediction as are the test data. A predicted empirical relationship permits us to test all the propositions leading to that prediction. Traditionally the proposition claiming to interpret the test has been set apart as the hypothesis being tested, but actually the evidence is significant for all parts of the chain. If the prediction is not confirmed, any link in the chain may be wrong.

A theoretical network can be divided into subtheories used in making particular predictions. All the events successfully predicted through a subtheory are of course evidence in favor of that theory. Such a subtheory
may be so well confirmed by voluminous and diverse evidence that we can reasonably view a particular experiment as relevant only to the test's validity. If the theory, combined with a proposed test interpretation, mispredicts in this case, it is the latter which must be abandoned. On the other hand, the accumulated evidence for a test's construct validity may be so strong that an instance of misprediction will force us to modify the subtheory employing the construct rather than deny the claim that the test measures the construct.

Most cases in psychology today lie somewhere between these extremes. Thus, suppose we fail to find a greater incidence of "homosexual signs" in the Rorschach records of paranoid patients. Which is more strongly disconfirmed: the Rorschach signs or the orthodox theory of paranoia? The negative finding shows the bridge between the two to be undependable, but this is all we can say. The bridge cannot be used unless one end is placed on solider ground. The investigator must decide which end it is best to relocate.

Numerous successful predictions dealing with phenotypically diverse "criteria" give greater weight to the claim of construct validity than do fewer predictions, or predictions involving very similar behaviors. In arriving at diverse predictions, the hypothesis of test validity is connected each time to a subnetwork largely independent of the portion previously used. Success of these derivations testifies to the inductive power of the test-validity statement, and renders it unlikely that an equally effective alternative can be offered.

Implications of Negative Evidence

The investigator whose prediction and data are discordant must make strategic decisions. His result can be interpreted in three ways:

1. The test does not measure the construct variable.

2. The theoretical network which generated the hypothesis is incorrect.

3. The experimental design failed to test the hypothesis properly. (Strictly speaking this may be analyzed as a special case of 2, but in practice the distinction is worth making.)

For further research. If a specific fault of procedure makes the third a reasonable possibility, his proper response is to perform an adequate study, meanwhile making no report. When faced with the other two alternatives, he may decide that his test does not measure the construct adequately. Following that decision, he will perhaps prepare and validate a new test. Any rescoring or new interpretative procedure for the original instrument, like a new test, requires validation by means of a fresh body of data.

The investigator may regard interpretation 2 as more likely to lead to eventual advances. It is legitimate for the investigator to call the network defining the construct into question, if he has confidence in the test. Should the investigator decide that some step in the network is unsound, he may be able to invent an alternative network. Perhaps he modifies the network by splitting a concept into two or more portions, e.g., by designating types of anxiety, or perhaps he specifies added conditions under which a generalization holds. When an investigator modifies the theory in such a manner, he is now required to gather a fresh body of data to test the altered hypotheses. This step should normally precede publication of the modified theory. If the new data are consistent with the modified network, he is free from the fear that
his nomologicals were gerrymandered to fit the peculiarities of his first sample of observations. He can now trust his test to some extent, because his test results behave as predicted.

The choice among alternatives, like any strategic decision, is a gamble as to which course of action is the best investment of effort. Is it wise to modify the theory? That depends on how well the system is confirmed by prior data, and how well the modifications fit available observations. Is it worth while to modify the test in the hope that it will fit the construct? That depends on how much evidence there is, apart from this abortive experiment, to support the hope, and also on how much it is worth to the investigator's ego to salvage the test. The choice among alternatives is a matter of research planning.

For practical use of the test. The consumer can accept a test as a measure of a construct only when there is a strong positive fit between predictions and subsequent data. When the evidence from a proper investigation of a published test is essentially negative, it should be reported as a stop sign to discourage use of the test pending a reconciliation of test and construct, or final abandonment of the test. If the test has not been published, it should be restricted to research use until some degree of validity is established (1). The consumer can await the results of the investigator's gamble with confidence that proper application of the scientific method will ultimately tell whether the test has value. Until the evidence is in, he has no justification for employing the test as a basis for terminal decisions. The test may serve, at best, only as a source of suggestions about individuals to be confirmed by other evidence (15, 47).

There are two perspectives in test validation. From the viewpoint of the psychological practitioner, the burden of proof is on the test. A test should not be used to measure a trait until its proponent establishes that predictions made from such measures are consistent with the best available theory of the trait. In the view of the test developer, however, both the test and the theory are under scrutiny. He is free to say to himself privately, "If my test disagrees with the theory, so much the worse for the theory." This way lies delusion, unless he continues his research using a better theory.

Reporting of Positive Results

The test developer who finds positive correspondence between his proposed interpretation and data is expected to report the basis for his validity claim. Defending a claim of construct validity is a major task, not to be satisfied by a discourse without data. The Technical Recommendations have little to say on reporting of construct validity. Indeed, the only detailed suggestions under that heading refer to correlations of the test with other measures, together with a cross reference to some other sections of the report. The two key principles, however, call for the most comprehensive type of reporting. The manual for any test "should report all available information which will assist the user in determining what psychological attributes account for variance in test scores" (59, p. 27). And, "The manual for a test which is used primarily to assess postulated attributes of the individual should outline the theory on which the test is based and organize whatever partial validity data there are to show in what way they support the theory" (59, p. 28). It is recognized, by a classification as "very desirable" rather than "essential," that the latter recommendation goes beyond present practice of test authors.

The proper goals in reporting construct validation are to make clear (a) what interpretation is proposed, (b) how adequately the writer believes this interpretation is substantiated, and (c) what evidence and reasoning lead him to this belief. Without (a), the construct validity of the test is of no use to the consumer. Without (b), the consumer must carry the entire burden of evaluating the test research. Without (c), the consumer or reviewer is being asked to take (a) and (b) on faith. The test manual cannot always present an exhaustive statement on these points, but it should summarize and indicate where complete statements may be found.

To specify the interpretation, the writer must state what construct he has in mind, and what meaning he gives to that construct. For a construct which has a short history and has built up few connotations, it will be fairly easy to indicate the presumed properties of the construct, i.e., the nomologicals in which it appears. For a construct with a longer history, a summary of properties and references to previous theoretical discussions may be appropriate. It is especially critical to distinguish proposed interpretations from other meanings previously given the same construct. The validator faces no small task; he must somehow communicate a theory to his reader.

To evaluate his evidence calls for a statement like the conclusions from a program of research, noting what is well substantiated and what alternative interpretations have been considered and rejected. The writer must note what portions of his proposed interpretation are speculations, extrapolations, or conclusions from insufficient data. The author has an ethical responsibility to prevent unsubstantiated interpretations from appearing as truths. A claim is unsubstantiated unless the evidence for the claim is public, so that other scientists may review the evidence, criticize the conclusions, and offer alternative interpretations.

The report of evidence in a test manual must be as complete as any research report, except where adequate public reports can be cited. Reference to something "observed by the writer in many clinical cases" is worthless as evidence. Full case reports, on the other hand, may be a valuable source of evidence so long as these cases are representative and negative instances receive due attention. The report of evidence must be interpreted with reference to the theoretical network in such a manner that the reader sees why the author regards a particular correlation or experiment as confirming (or throwing doubt upon) the proposed interpretation. Evidence collected by others must be taken fairly into account.

VALIDATION OF A COMPLEX TEST "AS A WHOLE"

Special questions must be considered when we are investigating the validity of a test which is aimed to provide information about several constructs. In one sense, it is naive to inquire "Is this test valid?" One does not validate a test, but only a principle for making inferences. If a test yields many different types of inferences, some of them can be valid and others invalid (cf. Technical Recommendation C2: "The manual should report the validity of each type of inference for which a test is recommended"). From this point of view, every topic sentence in the typical book on Rorschach interpretation presents a hypothesis requiring validation, and one should validate inferences about each aspect of the personality separately and in turn, just as he would want information on the validity (concurrent or predictive) for each scale of MMPI.

There is, however, another defensible point of view. If a test is purely empirical, based strictly on observed connections between response to an item and some criterion, then of course the validity of one scoring key for the test does not make validation for its other scoring keys any less necessary. But a test may be developed on the basis of a theory which in itself provides a linkage between the various keys and the various criteria. Thus, while Strong's Vocational Interest Blank is developed empirically, it also rests on a "theory" that a youth can be expected to be satisfied in an occupation if he has interests common to men now happy in the occupation. When Strong finds that those with high Engineering interest scores in college are preponderantly in engineering careers 19 years later, he has partly validated the proposed use of the Engineer score (predictive validity). Since the evidence is consistent with the theory on which all the test keys were built, this evidence alone increases the presumption that the other keys have predictive validity. How strong is this presumption? Not very, from the viewpoint of the traditional skepticism of science. Engineering interests may stabilize early, while interests in art or management or social work are still unstable. A claim cannot be made that the whole Strong approach is valid just because one score shows predictive validity. But if thirty interest scores were investigated longitudinally and all of them showed the type of validity predicted by Strong's theory, we would indeed be caviling to say that this evidence gives no confidence in the long-range validity of the thirty-first score.

Confidence in a theory is increased as more relevant evidence confirms it, but it is always possible that tomorrow's investigation will render the theory obsolete. The Technical Recommendations suggest a rule of reason, and ask for evidence for each type of inference for which a test is recommended. It is stated that no test developer can present predictive validities for all possible criteria; similarly, no developer can run all possible experimental tests of his proposed interpretation. But the recommendation is more subtle than advice that a lot of validation is better than a little.

Consider the Rorschach test. It is used for many inferences, made by means of nomological networks at several levels. At a low level are the simple unrationalized correspondences presumed to exist between certain signs and psychiatric diagnoses. Validating such a sign does nothing to substantiate Rorschach theory. For other Rorschach formulas an explicit a priori rationale exists (for instance, high F% interpreted as implying rigid control of impulses). Each time such a sign shows correspondence with criteria, its rationale is supported just a little. At a still higher level of abstraction, a considerable body of theory surrounds the general area of outer control, interlacing many different constructs. As evidence cumulates, one should be able to decide what specific inference-making chains within this system can be depended upon. One should also be able to conclude (or deny) that so much of the system has stood up under test that one has some confidence in even the untested lines in the network.

In addition to relatively delimited nomological networks surrounding
control or aspiration, the Rorschach interpreter usually has an
overriding theory of the test as a whole. This may be a psychoanalytic
theory, a theory of perception and set, or a theory stated in terms of
learned habit patterns. Whatever the theory of the interpreter, whenever
he validates an inference from the system, he obtains some reason for
added confidence in his overriding system. His total theory is not
tested, however, by experiments dealing with only one limited set of
constructs. The test developer must investigate far-separated,
independent sections of the network. The more diversified the
predictions the system is required to make, the greater confidence we
can have that only minor parts of the system will later prove faulty.
Here we begin to glimpse a logic to defend the judgment that the test
and its whole interpretative system is valid at some level of
confidence.

There are enthusiasts who would conclude from the foregoing paragraphs
that since there is some evidence of correct, diverse predictions made
from the Rorschach, the test as a whole can now be accepted as
validated. This conclusion overlooks the negative evidence. Just one
finding contrary to expectation, based on sound research, is sufficient
to wash a whole theoretical structure away. Perhaps the remains can be
salvaged to form a new structure. But this structure now must be exposed
to fresh risks, and sound negative evidence will destroy it in turn.
There is sufficient negative evidence to prevent acceptance of the
Rorschach and its accompanying interpretative structures as a whole. So
long as any aspects of the overriding theory stated for the test have
been disconfirmed, this structure must be rebuilt.

Talk of areas and structures may seem not to recognize those who would
interpret the personality "globally." They may argue that a test is best
validated in matching studies. Without going into detailed questions of
matching methodology, we can ask whether such a study validates the
nomological network "as a whole." The judge does employ some network in
arriving at his conception of his subject, integrating specific
inferences from specific data. Matching studies, if successful,
demonstrate only that each judge's interpretative theory has some
validity, that it is not completely a fantasy. Very high consistency
between judges is required to show that they are using the same network,
and very high success in matching is required to show that the network
is dependable.

If inference is less than perfectly dependable, we must know which
aspects of the interpretative network are least dependable and which are
most dependable. Thus, even if one has considerable confidence in a test
"as a whole" because of frequent successful inferences, one still
returns as an ultimate aim to the request of the Technical
Recommendation for separate evidence on the validity of each type of
inference to be made.

RECAPITULATION

Construct validation was introduced in order to specify types of
research required in developing tests for which the conventional views
on validation are inappropriate. Personality tests, and some tests of
ability, are interpreted in terms of attributes for which there is no
adequate criterion. This paper indicates what sorts of evidence can
substantiate such an interpretation, and how such evidence is to be
interpreted. The following points made in the discussion are
particularly significant.

1. A construct is defined implicitly by a network of associations or
propositions in which it occurs. Constructs employed at different stages
of research vary in definiteness.

2. Construct validation is possible only when some of the statements in
the network lead to predicted relations among observables. While some
observables may be regarded as "criteria," the construct validity of the
criteria themselves is regarded as under investigation.

3. The network defining the construct, and the derivation leading to the
predicted observation, must be reasonably explicit so that validating
evidence may be properly interpreted.

4. Many types of evidence are relevant to construct validity, including
content validity, interitem correlations, intertest correlations,
test-"criterion" correlations, studies of stability over time, and
stability under experimental intervention. High correlations and high
stability may constitute either favorable or unfavorable evidence for
the proposed interpretation, depending on the theory surrounding the
construct.

5. When a predicted relation fails to occur, the fault may lie in the
proposed interpretation of the test or in the network. Altering the
network so that it can cope with the new observations is, in effect,
redefining the construct. Any such new interpretation of the test must
be validated by a fresh body of data before being advanced publicly.
Great care is required to avoid substituting a posteriori
rationalizations for proper validation.

6. Construct validity cannot generally be expressed in the form of a
single simple coefficient. The data often permit one to establish upper
and lower bounds for the proportion of test variance which can be
attributed to the construct. The integration of diverse data into a
proper interpretation cannot be an entirely quantitative process.

7. Constructs may vary in nature from those very close to "pure
description" (involving little more than extrapolation of relations
among observation-variables) to highly theoretical constructs involving
hypothesized entities and processes, or making identifications with
constructs of other sciences.

8. The investigation of a test's construct validity is not essentially
different from the general scientific procedures for developing and
confirming theories.

Without in the least advocating construct validity as preferable to the
other three kinds (concurrent, predictive, content), we do believe it
imperative that psychologists make a place for it in their
methodological thinking, so that its rationale, its scientific
legitimacy, and its dangers may become explicit and familiar. This would
be preferable to the widespread current tendency to engage in what
actually amounts to construct validation research and use of constructs
in practical testing, while talking an "operational" methodology which,
if adopted, would force research into a mold it does not fit.

REFERENCES
1. AMERICAN PSYCHOLOGICAL ASSOCIATION. Ethical standards of psychologists. Washington, D.C.: American Psychological Association, Inc., 1953.
2. ANASTASI, ANNE. The concept of validity in the interpretation of test scores. Educ. psychol. Measmt, 1950, 10, 67-78.
3. BECHTOLDT, H. P. Selection. In S. S. Stevens (Ed.), Handbook of experimental psychology. New York: Wiley, 1951. Pp. 1237-1267.
4. BECK, L. W. Constructions and inferred entities. Phil. Sci., 1950, 17. Reprinted in H. Feigl and M. Brodbeck (Eds.), Readings in the philosophy of science. New York: Appleton-Century-Crofts, 1953. Pp. 368-381.
5. BLAIR, W. R. N. A comparative study of disciplinary offenders and non-offenders in the Canadian Army. Canad. J. Psychol., 1950, 4, 49-62.
6. BRAITHWAITE, R. B. Scientific explanation. Cambridge: Cambridge Univer. Press, 1953.
7. CARNAP, R. Empiricism, semantics, and ontology. Rev. int. de Phil., 1950, II, 20-40. Reprinted in P. P. Wiener (Ed.), Readings in philosophy of science. New York: Scribner's, 1953. Pp. 509-521.
8. CARNAP, R. Foundations of logic and mathematics. International encyclopedia of unified science, I, No. 3. Pages 56-69 reprinted as "The interpretation of physics" in H. Feigl and M. Brodbeck (Eds.), Readings in the philosophy of science. New York: Appleton-Century-Crofts, 1953. Pp. 309-318.
9. CHILD, I. L. Personality. Annu. Rev. Psychol., 1954, 5, 149-171.
10. CHYATTE, C. Psychological characteristics of a group of professional actors. Occupations, 1949, 27, 245-250.
11. CRONBACH, L. J. Essentials of psychological testing. New York: Harper, 1949.
12. CRONBACH, L. J. Further evidence on response sets and test design. Educ. psychol. Measmt, 1950, 10, 3-31.
13. CRONBACH, L. J. Coefficient alpha and the internal structure of tests. Psychometrika, 1951, 16, 297-335.
14. CRONBACH, L. J. Processes affecting scores on "understanding of others" and "assumed similarity." Psychol. Bull., 1955, 52, 177-193.
15. CRONBACH, L. J. The counselor's problems from the perspective of communication theory. In Vivian H. Hewer (Ed.), New perspectives in counseling. Minneapolis: Univer. of Minnesota Press, 1955.
16. CURETON, E. E. Validity. In E. F. Lindquist (Ed.), Educational measurement. Washington, D. C.: American Council on Education, 1950. Pp. 621-695.
17. DAMRIN, DORA E. A comparative study of information derived from a diagnostic problem-solving test by logical and factorial methods of scoring. Unpublished doctor's dissertation, Univer. of Illinois, 1952.
18. EYSENCK, H. J. Criterion analysis--an application of the hypothetico-deductive method in factor analysis. Psychol. Rev., 1950, 57, 38-53.
19. FEIGL, H. Existential hypotheses. Phil. Sci., 1950, 17, 35-62.
20. FEIGL, H. Confirmability and confirmation. Rev. int. de Phil., 1951, 5, 1-12. Reprinted in P. P. Wiener (Ed.), Readings in philosophy of science. New York: Scribner's, 1953. Pp. 522-530.
21. GAYLORD, R. H. Conceptual consistency and criterion equivalence: a dual approach to criterion analysis. Unpublished manuscript (PRB Research Note No. 17). Copies obtainable from ASTIA-DSC, AD-21 440.
22. GOODENOUGH, FLORENCE L. Mental testing. New York: Rinehart, 1950.
23. GOUGH, H. G., McCLOSKY, H., & MEEHL, P. E. A personality scale for social responsibility. J. abnorm. soc. Psychol., 1952, 47, 73-80.
24. GOUGH, H. G., McKEE, M. G., & YANDELL, R. J. Adjective check list analyses of a number of selected psychometric and assessment variables. Unpublished manuscript. Berkeley: IPAR, 1953.
25. GUILFORD, J. P. New standards for test evaluation. Educ. psychol. Measmt, 1946, 6, 427-439.
26. GUILFORD, J. P. Factor analysis in a test-development program. Psychol. Rev., 1948, 55, 79-94.
27. GULLIKSEN, H. Intrinsic validity. Amer. Psychologist, 1950, 5, 511-517.
28. HATHAWAY, S. R., & MONACHESI, E. D. Analyzing and predicting juvenile delinquency with the MMPI. Minneapolis: Univer. of Minnesota Press, 1953.
29. HEMPEL, C. G. Problems and changes in the empiricist criterion of meaning. Rev. int. de Phil., 1950, 4, 41-63. Reprinted in L. Linsky, Semantics and the philosophy of language. Urbana: Univer. of Illinois Press, 1952. Pp. 163-185.
30. HEMPEL, C. G. Fundamentals of concept formation in empirical science. Chicago: Univer. of Chicago Press, 1952.
31. HORST, P. The prediction of personal adjustment. Soc. Sci. Res. Council Bull., 1941, No. 48.
32. HOVEY, H. B. MMPI profiles and personality characteristics. J. consult. Psychol., 1953, 17, 142-146.
33. JENKINS, J. G. Validity for what? J. consult. Psychol., 1946, 10, 93-98.
34. KAPLAN, A. Definition and specification of meaning. J. Phil., 1946, 43, 281-288.
35. KELLY, E. L. Theory and techniques of assessment. Annu. Rev. Psychol., 1954, 5, 281-311.
36. KELLY, E. L., & FISKE, D. W. The prediction of performance in clinical psychology. Ann Arbor: Univer. of Michigan Press, 1951.
37. KNEALE, W. Probability and induction. Oxford: Clarendon Press, 1949. Pages 92-110 reprinted as "Induction, explanation, and transcendent hypotheses" in H. Feigl and M. Brodbeck (Eds.), Readings in the philosophy of science. New York: Appleton-Century-Crofts, 1953. Pp. 353-367.
38. LINDQUIST, E. F. Educational measurement. Washington, D. C.: American Council on Education, 1950.
39. LUCAS, C. M. Analysis of the relative movement test by a method of individual interviews. Bur. Naval Personnel Res. Rep., Contract Nonr-694 (00), NR 151-13, Educational Testing Service, March 1953.
40. MACCORQUODALE, K., & MEEHL, P. E. On a distinction between hypothetical constructs and intervening variables. Psychol. Rev., 1948, 55, 95-107.
41. MACFARLANE, JEAN W. Problems of validation inherent in projective methods. Amer. J. Orthopsychiat., 1942, 12, 405-410.
42. McKINLEY, J. C., & HATHAWAY, S. R. The MMPI: V. Hysteria, hypomania, and psychopathic deviate. J. appl. Psychol., 1944, 28, 153-174.
43. McKINLEY, J. C., HATHAWAY, S. R., & MEEHL, P. E. The MMPI: VI. The K scale. J. consult. Psychol., 1948, 12, 20-31.
44. MEEHL, P. E. A simple algebraic development of Horst's suppressor variables. Amer. J. Psychol., 1945, 58, 550-554.
45. MEEHL, P. E. An investigation of a general normality or control factor in personality testing. Psychol. Monogr., 1945, 59, No. 4 (Whole No. 274).
46. MEEHL, P. E. Clinical vs. statistical prediction. Minneapolis: Univer. of Minnesota Press, 1954.
47. MEEHL, P. E., & ROSEN, A. Antecedent probability and the efficiency of psychometric signs, patterns or cutting scores. Psychol. Bull., 1955, 52, 194-216.
48. Minnesota Hunter Casualty Study. St. Paul: Jacob Schmidt Brewing Company, 1954.
49. MOSIER, C. I. A critical examination of the concepts of face validity. Educ. psychol. Measmt, 1947, 7, 191-205.
50. MOSIER, C. I. Problems and designs of cross-validation. Educ. psychol. Measmt, 1951, 11, 5-12.
51. PAP, A. Reduction-sentences and open concepts. Methodos, 1953, 5, 3-30.
52. PEAK, HELEN. Problems of objective observation. In L. Festinger and D. Katz (Eds.), Research methods in the behavioral sciences. New York: Dryden Press, 1953. Pp. 243-300.
53. PORTEUS, S. D. The Porteus maze test and intelligence. Palo Alto: Pacific Books, 1950.
54. ROESSEL, F. P. MMPI results for high school drop-outs and graduates. Unpublished doctor's dissertation, Univer. of Minnesota, 1954.
55. SELLARS, W. S. Concepts as involving laws and inconceivable without them. Phil. Sci., 1948, 15, 287-315.
56. SELLARS, W. S. Some reflections on language games. Phil. Sci., 1954, 21, 204-228.
57. SPIKER, C. C., & McCANDLESS, B. R. The concept of intelligence and the philosophy of science. Psychol. Rev., 1954, 61, 255-267.
58. Technical recommendations for psychological tests and diagnostic techniques: preliminary proposal. Amer. Psychologist, 1952, 7, 461-476.
59. Technical recommendations for psychological tests and diagnostic techniques. Psychol. Bull. Supplement, 1954, 51, 2, Part 2, 1-38.
60. THURSTONE, L. L. The criterion problem in personality research. Psychometric Lab. Rep., No. 78. Chicago: Univer. of Chicago, 1952.

Received for early publication February 18, 1955
