Form equivalence: two different forms of a test, based on the same content, given on
one occasion to the same examinees (alternate form)
Internal consistency: the coefficient of test scores obtained from a single test
or survey (Cronbach's alpha, KR-20, split-half). For instance, suppose
respondents are asked to rate statements in an attitude survey about computer
anxiety. One statement is "I feel very negative about computers in general."
Another statement is "I enjoy using computers." People who strongly agree
with the first statement should strongly disagree with the second statement,
and vice versa. If several respondents rate both statements high, or both low,
the responses are said to be inconsistent and patternless. The
same principle can be applied to a test. When no pattern is found in the
students' responses, the test is probably too difficult and the students are simply
guessing at random.
Reliability is a necessary but not sufficient condition for validity. For
instance, if the needle of the scale is five pounds away from zero, I always
over-report my weight by five pounds. Is the measurement consistent? Yes,
but it is consistently wrong! Is the measurement valid? No! (But if it
under-reports my weight by five pounds, I will consider it a valid measurement.)
Performance, portfolio, and responsive evaluations, where the tasks vary
substantially from student to student and where multiple tasks may be
evaluated simultaneously, are attacked for lacking reliability. One of the
difficulties is that there is more than one source of measurement error in
performance assessment. For example, the reliability of a writing-skill test score
is affected by the raters, the mode of discourse, and several other factors
(Parkes, 2000).
Face validity: Face validity simply means validity at face value. As a
check on face validity, test/survey items are sent to teachers to obtain
suggestions for modification. Because of its vagueness and subjectivity,
psychometricians abandoned this concept long ago. However,
outside the measurement arena, face validity has come back in another form.
While discussing the validity of a theory, Lacity and Jansen (1994) define
validity as making common sense, and being persuasive and seeming right to
the reader. For Polkinghorne (1988), validity of a theory refers to results that
have the appearance of truth or reality.
The internal structure of things may not concur with the appearance. Many
times professional knowledge runs counter to common sense. The criteria of
validity in research should go beyond "face," "appearance," and "common
sense."
When was the founder and CEO of Microsoft, William Gates III, born?
a. 1949
b. 1953
c. 1957
d. None of the above
Validity is not a property of the test or assessment...but rather of the meaning of the
test scores.
References
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (1985). Standards for educational
and psychological testing. Washington, DC: Authors.
Angoff, W. H. (1988). Validity: An evolving concept. In H. Wainer & H. I. Braun
(Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum.
Brennan, R. (2001). An essay on the history and future of reliability from the
perspective of replications. Journal of Educational Measurement, 38, 295-317.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational
measurement (2nd ed.). Washington, DC: American Council on Education.
Cronbach, L. J. (2004). My current thoughts on coefficient alpha and successor
procedures. Educational and Psychological Measurement, 64, 391-418.
Cronbach, L. J., & Quirk, T. J. (1976). Test validity. In International encyclopedia of
education. New York: McGraw-Hill.
Goodenough, F. L. (1949). Mental testing: Its history, principles, and applications.
New York: Rinehart.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error
and bias in research findings. Newbury Park, CA: Sage.
Kane, M. (2001). Current concerns in validity theory. Journal of Educational
Measurement, 38, 319-342.
MODULE R10
RELIABILITY AND VALIDITY
Warwick and Linninger (1975) point out that there are two basic goals in questionnaire
design.
1. To obtain information relevant to the purposes of the survey.
2. To collect this information with maximal reliability and validity.
How can a researcher be sure that the data gathering instrument being used will
measure what it is supposed to measure and will do this in a consistent manner? This is
a question that can only be answered by examining the definitions for and methods of
establishing the validity and reliability of a research instrument. These two very
important aspects of research design will be discussed in this module.
Validity
Validity can be defined as the degree to which a test measures what it is supposed to
measure. There are three basic approaches to the validity of tests and measures as shown
by Mason and Bramble (1989). These are content validity, construct validity, and
criterion-related validity.
Content Validity
This approach measures the degree to which the test items represent the domain or
universe of the trait or property being measured. In order to establish the content
validity of a measuring instrument, the researcher must identify the overall content to be
represented. Items must then be randomly chosen from this content that will accurately
represent the information in all areas. By using this method the researcher should obtain
a group of items which is representative of the content of the trait or property to be
measured.
Identifying the universe of content is not an easy task. It is, therefore, usually suggested
that a panel of experts in the field to be studied be used to identify a content area. For
example, in the case of researching the knowledge of teachers about a new curriculum, a
group of curriculum and teacher education experts might be asked to identify the content
of the test to be developed.
Construct Validity
Cronbach and Meehl (1955) indicated that, "Construct validity must be investigated
whenever no criterion or universe of content is accepted as entirely adequate to define the
quality to be measured" as quoted by Carmines and Zeller (1979). The term construct in
this instance is defined as a property that is offered to explain some aspect of human
behavior, such as mechanical ability, intelligence, or introversion (Van Dalen, 1979). The
construct validity approach concerns the degree to which the test measures the construct
it was designed to measure.
There are two parts to the evaluation of the construct validity of a test. First and most
important, the theory underlying the construct to be measured must be considered.
Second, the adequacy of the test in measuring the construct is evaluated (Mason and
Bramble, 1989). For example, suppose that a researcher is interested in measuring the
introverted nature of first year teachers. The researcher defines introverted as the overall
lack of social skills such as conversing, meeting and greeting people, and attending
faculty social functions. This definition is based upon the researcher's own observations.
A panel of experts is then asked to evaluate this construct of introversion. The panel
cannot agree that the qualities pointed out by the researcher adequately define the
construct of introversion. Furthermore, the researcher cannot find evidence in the
research literature supporting the introversion construct as defined here. Using this
information, the validity of the construct itself can be questioned. In this case the
researcher must reformulate the previous definition of the construct.
Once the researcher has developed a meaningful, useable construct, the adequacy of the
test used to measure it must be evaluated. First, data concerning the trait being
measured should be gathered and compared with data from the test being assessed. The
data from other sources should be similar or convergent. If convergence exists, construct
validity is supported.
After establishing convergence, the discriminant validity of the test must be determined.
This involves demonstrating that the construct can be differentiated from other
constructs that may be somewhat similar. In other words, the researcher must show that
the construct being measured is not the same as one that was measured under a different
name.
Criterion-Related Validity
This approach is concerned with detecting the presence or absence of one or more criteria
considered to represent traits or constructs of interest. One of the easiest ways to test for
criterion-related validity is to administer the instrument to a group that is known to
exhibit the trait to be measured. This group may be identified by a panel of experts. A
wide range of items should be developed for the test with invalid questions culled after
the control group has taken the test. Items should be omitted that are drastically
inconsistent with respect to the responses made among individual members of the group.
If the researcher has developed quality items for the instrument, the culling process
should leave only those items that will consistently measure the trait or construct being
studied. For example, suppose one wanted to develop an instrument that would identify
teachers who are good at dealing with abused children. First, a panel of unbiased experts
identifies 100 teachers out of a larger group that they judge to be best at handling abused
children. The researcher develops 400 yes/no items that will be administered to the whole
group of teachers, including those identified by the experts. The responses are analyzed,
and the items to which the expert-identified teachers and the other teachers respond
differently are taken to be the questions that will identify teachers who are good at
dealing with abused children.
Reliability
The reliability of a research instrument concerns the extent to which the instrument
yields the same results on repeated trials. Although unreliability is always present to a
certain extent, there will generally be a good deal of consistency in the results of a quality
instrument gathered at different times. The tendency toward consistency found in
repeated measurements is referred to as reliability (Carmines & Zeller, 1979).
In scientific research, accuracy in measurement is of great importance. Scientific
research normally measures physical attributes which can easily be assigned a precise
value. Many times numerical assessments of the mental attributes of human beings are
accepted as readily as numerical assessments of their physical attributes. Although we
may understand that the values assigned to mental attributes can never be completely
precise, the imprecision is often looked upon as being too small to be of any practical
concern. However, the magnitude of the imprecision is much greater in the measurement
of mental attributes than in that of physical attributes. This fact makes it very important
that the researcher in the social sciences and humanities determine the reliability of the
data gathering instrument to be used (Willmott & Nuttall, 1975).
Retest Method
One of the easiest ways to determine the reliability of empirical measurements is by the
retest method in which the same test is given to the same people after a period of time.
The reliability of the test (instrument) can be estimated by examining the consistency of
the responses between the two tests.
If the researcher obtains the same results on the two administrations of the instrument,
then the reliability coefficient will be 1.00. Normally, the correlation of measurements
across time will be less than perfect due to different experiences and attitudes that
respondents have encountered from the time of the first test.
The test-retest method is a simple, clear-cut way to determine reliability, but it can be
costly and impractical. Researchers are often only able to obtain measurements at a
single point in time or do not have the resources for multiple administrations.
Alternative Form Method
Like the retest method, this method also requires two testings with the same people.
However, the same test is not given each time. Each of the two tests must be designed to
measure the same thing and should not differ in any systematic way. One way to help
ensure this is to use random procedures to select items for the different tests.
The alternative form method is viewed as superior to the retest method because a
respondent's memory of test items is not as likely to play a role in the data received. One
drawback of this method is the practical difficulty of developing test items that are
consistent in the measurement of a specific phenomenon.
Split-Halves Method
This method is more practical in that it does not require two administrations of the same
or an alternative form test. In the split-halves method, the total number of items is
divided into halves, and a correlation taken between the two halves. This correlation only
estimates the reliability of each half of the test. It is necessary then to use a statistical
correction to estimate the reliability of the whole test. This correction is known as the
Spearman-Brown prophecy formula (Carmines & Zeller, 1979):
Pxx" = 2Pxx' / (1 + Pxx')
where Pxx" is the reliability coefficient for the whole test and Pxx' is the split-half
correlation.
Example
If the correlation between the halves is .75, the reliability for the total test is:
2(.75)/(1 + .75) = 1.50/1.75 = .857
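As a minimal sketch (pure arithmetic, no external libraries), the Spearman-Brown correction above can be expressed as a one-line function:

```python
def spearman_brown(split_half_r):
    """Step a split-half correlation up to an estimate of full-test
    reliability, using the prophecy formula above."""
    return 2 * split_half_r / (1 + split_half_r)

print(round(spearman_brown(0.75), 3))  # → 0.857, matching the example
```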
If one is using the correlation matrix rather than the variance-covariance matrix then
alpha reduces to the following:
alpha = Np/[1+p(N-1)]
where N equals the number of items and p equals the mean interitem correlation.
Example
If the average intercorrelation of a six-item scale is .5, then the alpha for the scale
would be:
alpha = 6(.5)/[1+.5(6-1)]
= 3/3.5 = .857
An example of how alpha can be calculated can be given by using the 10-item self-esteem
scale developed by Rosenberg (1965). (See table.) The 45 interitem correlations in the table
are first summed: .185 + .451 + .048 + . . . + .233 = 14.487. Then the mean interitem
correlation is found by dividing this sum by 45: 14.487/45 = .32. Now use this number to
calculate alpha:
alpha = 10(.32)/[1+.32(10-1)]
= 3.20/3.88
= .825
The coefficient alpha is an internal consistency index designed for use with tests
containing items that have no right answer. This is a very useful tool in educational and
social science research because instruments in these areas often ask respondents to rate
the degree to which they agree or disagree with a statement on a particular scale.
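Assuming nothing beyond the standardized-alpha formula given above (N items and mean interitem correlation p), the two worked examples can be sketched in Python:

```python
def standardized_alpha(n_items, mean_r):
    """Cronbach's alpha from the number of items (N) and the mean
    interitem correlation (p): alpha = Np / [1 + p(N - 1)]."""
    return n_items * mean_r / (1 + mean_r * (n_items - 1))

# The two worked examples from the text:
print(round(standardized_alpha(6, 0.5), 3))    # six-item scale, mean r = .5
print(round(standardized_alpha(10, 0.32), 3))  # Rosenberg scale, mean r = .32
```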
Cronbach's Alpha Example
[Table of ten respondents' item scores, total scores, means, and variances omitted.]
P = Mean Interitem Correlation
alpha = Np / [1+p(N-1)]
= (10)(.36)/[1+.36(10-1)]
= 3.6/4.24
= .849
Measurement
Reliability
Reliability refers to a condition where a measurement process yields consistent scores
(given an unchanged measured phenomenon) over repeated measurements. Perhaps the
most straightforward way to assess reliability is to check measures against the following
three criteria of reliability. Measures that are high in reliability should exhibit all
three.
Test-retest Reliability
When a researcher administers the same measurement tool multiple times - asks the
same question, follows the same research procedures, etc. - does he/she obtain
consistent results, assuming that there has been no change in whatever he/she is
measuring? This is really the simplest method for assessing reliability: when a
researcher asks the same person the same question twice ("What's your name?"), does
he/she get back the same results both times? If so, the measure has test-retest
reliability. Measurement of the piece of wood talked about earlier has high test-retest
reliability.
Inter-item reliability
This is a dimension that applies to cases where multiple items are used to measure a
single concept. In such cases, answers to a set of questions designed to measure some
single concept (e.g., altruism) should be associated with each other.
Interobserver reliability
Interobserver reliability concerns the extent to which different interviewers or
observers using the same measure get equivalent results. If different observers or
interviewers use the same instrument to score the same thing, their scores should
match. For example, the interobserver reliability of an observational assessment of
parent-child interaction is often evaluated by showing two observers a videotape of a
parent and child at play. These observers are asked to use an assessment tool to score
the interactions between parent and child on the tape. If the instrument has high
interobserver reliability, the scores of the two observers should match.
Validity
To reiterate, validity refers to the extent to which we are measuring what we hope to
measure (and what we think we are measuring). How do we assess the validity of a set
of measurements? A valid measure should satisfy four criteria.
Face Validity
This criterion is an assessment of whether a measure appears, on its face, to
measure the concept it is intended to measure. This is a minimal assessment: if
a measure cannot satisfy this criterion, then the other criteria are inconsequential. We
can think about observational measures of behavior that would have face validity. For
example, striking out at another person would have face validity for an indicator of
aggression. Similarly, offering assistance to a stranger would meet the criterion of
face validity for helping. However, asking people about their favorite movie to
measure racial prejudice has little face validity.
Content Validity
Content validity concerns the extent to which a measure adequately represents all
facets of a concept. Consider a series of questions that serve as indicators of
depression (don't feel like eating, lost interest in things usually enjoyed, etc.). If there
were other kinds of common behaviors that mark a person as depressed that were not
included in the index, then the index would have low content validity since it did not
adequately represent all facets of the concept.
Criterion-related Validity
Criterion-related validity applies to instruments that have been developed to be useful
as indicators of a specific trait or behavior, either now or in the future. For
example, think about the driving test as a social measurement that has pretty good
predictive validity. That is to say, an individual's performance on a driving test
correlates well with his/her driving ability.
Construct Validity
But for many things we want to measure, there is not necessarily a pertinent
criterion available. In this case, we turn to construct validity, which concerns the extent to
which a measure is related to other measures as specified by theory or previous
research. Does a measure stack up with other variables the way we expect it to? A
good example of this form of validity comes from early self-esteem studies; self-esteem
refers to a person's sense of self-worth or self-respect. Clinical observations in
psychology had shown that people who had low self-esteem often had depression.
Therefore, to establish the construct validity of the self-esteem measure, the
researchers showed that those with higher scores on the self-esteem measure had
lower depression scores, while those with low self-esteem had higher rates of
depression.
At best, we have a measure that has both high validity and high reliability. It yields
consistent results in repeated application and it accurately reflects what we hope to
represent.
It is possible to have a measure that has high reliability but low validity: one that is
consistent in getting bad information or consistent in missing the mark. It is also
possible to have one that has low reliability and low validity: inconsistent and not on
target.
Finally, it is not possible to have a measure that has low reliability and high validity:
you can't really get at what you want or what you're interested in if your measure
fluctuates wildly.
This web page was designed to provide you with basic information on an important
characteristic of a good measurement instrument: reliability. Prior to starting any
research project, it is important to determine how you are going to measure a
particular phenomenon. This process of measurement is important because it allows
you to know whether you are on the right track and whether you are measuring what
you intend to measure. Both reliability and validity are essential for good
measurement, because they are your first line of defense against forming inaccurate
conclusions (i.e., incorrectly accepting or rejecting your research hypotheses).
This tutorial will address only general issues of reliability.
What is Reliability?
I am sure you are familiar with terms such as consistency, predictability,
dependability, stability, and repeatability. Well, these are the terms that come to mind
when we talk about reliability. Broadly defined, reliability of a measurement refers to
the consistency or repeatability of the measurement of some phenomena. If a
measurement instrument is reliable, that means the instrument can measure the same
thing more than once or using more than one method and yield the same result. When
we speak of reliability, we are not speaking of individuals, we are actually talking
about scores.
The observed score is one of the major components of reliability. The observed score
is just that: the score you would observe in a research setting. The observed score
is composed of a true score and an error score. The true score is a theoretical concept.
Why is it theoretical? Because there is no way to really know what the true score is
(unless you're God). The true score reflects the true value of a variable. The error
score is the reason why the observed is different from the true score. The error score
is further broken down into method (or systematic) error and trait (or random) error.
Method error refers to anything that causes a difference between the observed score
and true score due to the testing situation. For example, any type of disruption (loud
music, talking, traffic) that occurs while students are taking a test may cause the
students to become distracted and may affect their scores on the test. On the other
hand, trait error is caused by any factors related to the characteristic of the person
taking the test that may randomly affect measurement. An example of trait error at
work is when individuals are tired, hungry, or unmotivated. These characteristics can
affect their performance on a test, making the scores seem worse than they would be
if the individuals were alert, well-fed, or motivated.
Reliability can be viewed as the ratio of the true score over the true score plus the
error score, or:
reliability = true score / (true score + error score)
Okay, now that you know what reliability is and what its components are, you're
probably wondering how to achieve reliability. Simply put, the degree of reliability
can be increased by decreasing the error score. So, if you want a reliable instrument,
you must decrease the error.
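The true-score/error-score ratio described above can be illustrated with a small simulation. The numbers here are purely hypothetical: each observed score is a fixed true score plus a random error, and shrinking the error drives the reliability ratio toward 1.

```python
import random

def estimated_reliability(error_sd, n=5000, seed=1):
    """Hypothetical simulation: observed = true + random error.
    Returns var(true) / var(observed), the reliability ratio above."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50, 10) for _ in range(n)]
    observed = [t + rng.gauss(0, error_sd) for t in true_scores]

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    return variance(true_scores) / variance(observed)

# Decreasing the error score increases the estimated reliability
for sd in (10, 5, 1):
    print(sd, round(estimated_reliability(sd), 2))
```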
As previously stated, you can never know the actual true score of a measurement.
Therefore, it is important to note that reliability cannot be calculated; it can only be
estimated. The best way to estimate reliability is to measure the degree of correlation
between the different forms of a measurement. The higher the correlation, the higher
the reliability.
3 Aspects of Reliability
Before going on to the types of reliability, I must briefly review 3 major aspects of
reliability: equivalence, stability, and homogeneity. Equivalence refers to the degree
of agreement between 2 or more measures administered at nearly the same time. To
assess stability, a distinction must be made between the repeatability of the
measurement and that of the phenomenon being measured; one way to achieve this is
to employ two raters. Lastly, homogeneity deals with assessing how well the
different items in a measure seem to reflect the attribute one is trying to measure. The
emphasis here is on internal relationships, or internal consistency.
Types of Reliability
Now back to the different types of reliability. The first type of reliability is parallel
forms reliability. This is a measure of equivalence, and it involves administering two
different forms to the same group of people and obtaining a correlation between the
two forms. The higher the correlation between the two forms, the more equivalent the
forms.
The second type of reliability, test-retest reliability, is a measure of stability which
examines reliability over time. The easiest way to measure stability is to administer
the same test at two different points in time (to the same group of people, of course)
and obtain a correlation between the two tests. The problem with test-retest reliability
is the amount of time you wait between testings. The longer you wait, the lower your
estimation of reliability.
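As an illustrative sketch, test-retest reliability is just the correlation between the two administrations. The examinees, scores, and two-week interval below are invented for illustration:

```python
def pearson_r(x, y):
    """Pearson correlation between scores from two administrations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for five examinees tested twice, two weeks apart
time1 = [74, 82, 65, 90, 71]
time2 = [76, 80, 68, 88, 70]
print(round(pearson_r(time1, time2), 2))  # → 0.98
```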
Finally, the third type of reliability is inter-rater reliability, a measure of
homogeneity. With inter-rater reliability, two people rate a behavior, object, or
phenomenon and determine the amount of agreement between them. To determine
inter-rater reliability, you take the number of agreements and divide them by the
number of total observations.
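The agreements-divided-by-observations rule just described can be sketched directly (the rater data below are hypothetical):

```python
def percent_agreement(rater_a, rater_b):
    """Inter-rater reliability as agreements divided by total observations."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

# Two hypothetical raters scoring the same ten observations
rater_1 = ["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
rater_2 = ["yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes", "yes"]
print(percent_agreement(rater_1, rater_2))  # → 0.8 (8 agreements / 10)
```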
Nominal
Ordinal
Interval
Ratio
The nominal level of measurement describes variables that are categorical in nature.
The characteristics of the data you're collecting fall into distinct categories. If there
are a limited number of distinct categories (usually only two), then you're dealing
with a dichotomous variable. If there are an unlimited or infinite number of distinct
categories, then you're dealing with a continuous variable. Nominal variables
include demographic characteristics like sex, race, and religion.
The ordinal level of measurement describes variables that can be ordered or ranked
in some order of importance. It describes most judgments about things, such as big or
little, strong or weak. Most opinion and attitude scales or indexes in the social
sciences are ordinal in nature.
The interval level of measurement describes variables that have more or less equal
intervals, or meaningful distances between their ranks. For example, if you were to
ask somebody if they were first, second, or third generation immigrant, the
assumption is that the distance, or number of years, between each generation is the
same. All crime rates in criminal justice are interval level measures, as is any kind of
rate.
The ratio level of measurement describes variables that have equal intervals and a
fixed zero (or reference) point. It is possible to have zero income, zero education, and
no involvement in crime, but rarely do we see ratio level variables in social science
since it's almost impossible to have zero attitudes on things, although "not at all",
"often", and "twice as often" might qualify as ratio level measurement.
Advanced statistics require at least interval level measurement, so the researcher
always strives for this level, accepting ordinal level (which is the most common) only
when they have to. Variables should be conceptually and operationally defined with
levels of measurement in mind since it's going to affect how well you can analyze
your data later on.
RELIABILITY AND VALIDITY
For a research study to be accurate, its findings must be both reliable and valid.
Reliability means that the findings would be consistently the same if the study were
done over again. It sounds easy, but think of a typical exam in college; if you scored a
74 on that exam, don't you think you would score differently if you took it over
again?
Validity: A valid measure is one that provides the information that it was intended to
provide. The purpose of a thermometer, for example, is to provide information on the
temperature, and if it works correctly, it is a valid thermometer.
A study can be reliable but not valid, and it cannot be valid without first being
reliable. You cannot assume validity no matter how reliable your measurements are.
There are many different threats to validity as well as reliability, but an important
early consideration is to ensure you have internal validity. This means that you are
using the most appropriate research design for what you're studying (experimental,
quasi-experimental, survey, qualitative, or historical), and it also means that you have
screened out spurious variables as well as thought out the possible contamination of
other variables creeping into your study. Anything you do to standardize or clarify
your measurement instrument to reduce user error will add to your reliability.
It's also important early on to consider the time frame that is appropriate for what
you're studying. Some social and psychological phenomena (most notably those
involving behavior or action) lend themselves to a snapshot in time. If so, your
research need only be carried out for a short period of time, perhaps a few weeks or a
couple of months. In such a case, your time frame is referred to as cross-sectional.
Sometimes, cross-sectional research is criticized as being unable to determine cause
and effect, and a longer time frame is called for, one that is called longitudinal,
which may add years onto carrying out your research. There are many different types
of longitudinal research, such as those that involve tracking a cohort of subjects (such
as schoolchildren across grade levels), or those that involve time-series (such as
tracking a third world nation's economic development over four years or so). The
general rule is to use longitudinal research the greater the number of variables you've
got operating in your study and the more confident you want to be about cause and
effect.
METHODS OF MEASURING RELIABILITY
There are four good methods of measuring reliability:
test-retest
multiple forms
inter-rater
split-half
The test-retest in the same group technique is to administer your test, instrument,
survey, or measure to the same group of people at different points in time. Most
researchers administer what is called a pretest for this, and to troubleshoot bugs at the
same time. All reliability estimates are usually in the form of a correlation coefficient,
so here, all you do is calculate the correlation coefficient between the two scores of
each group and report it as your reliability coefficient.
The multiple forms technique has other names, such as parallel forms and disguised
test-retest, but it's simply the scrambling or mixing up of questions on your survey,
for example, and giving it to the same group twice. The idea is that it's a more
rigorous test of reliability.
Inter-rater reliability is most appropriate when you use assistants to do interviewing
or content analysis for you. To calculate this kind of reliability, all you do is report the
percentage of agreement on the same subject between your raters, or assistants.
Split-half reliability is estimated by taking half of your test, instrument, or survey,
and analyzing that half as if it were the whole thing. Then, you compare the results of
this analysis with your overall analysis. There are different variations of this
technique, one of the most common being called Cronbach's alpha (a frequently
reported reliability statistic) which correlates performance on each item with overall
score. Another technique, closer to the split-half method, is the Kuder-Richardson
coefficient, or KR-20. Statistical packages on most computers will calculate these for
you, although in graduate school, you'll have to do them by hand and understand that
all test statistics are derived from the principle that every observed score consists of a
true score and an error score.
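A minimal split-half sketch, assuming a small hypothetical matrix of 0/1 item responses: correlate the odd-item and even-item half scores, then apply the standard Spearman-Brown correction to estimate full-test reliability:

```python
from math import sqrt

# Hypothetical item-response matrix (rows = people, columns = items, 0/1 scored).
responses = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

odd_half  = [sum(row[0::2]) for row in responses]  # items 1, 3, 5
even_half = [sum(row[1::2]) for row in responses]  # items 2, 4, 6

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

r_half = pearson_r(odd_half, even_half)
r_full = 2 * r_half / (1 + r_half)   # Spearman-Brown corrected estimate
print(round(r_full, 2))
```

The correction is needed because the half-test correlation understates the reliability of the full-length test.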
METHODS OF MEASURING VALIDITY
There are four good methods of estimating validity:
face
content
criterion
construct
Face validity is the least statistical estimate (validity overall is not as easily
quantified as reliability), as it's simply an assertion on the researcher's part,
claiming that they've reasonably measured what they intended to measure. It's essentially a
"take my word for it" kind of validity. Usually, a researcher asks a colleague or expert
in the field to vouch for the items measuring what they were intended to measure.
Content validity goes back to the ideas of conceptualization and operationalization.
If the researcher has focused in too closely on only one type or narrow dimension of a
construct or concept, then it's conceivable that other indicators were overlooked. In
such a case, the study lacks content validity. Content validity is making sure you've
covered all the conceptual space. There are different ways to estimate it, but one of
the most common is a reliability approach where you correlate scores on one domain
or dimension of a concept on your pretest with scores on that same domain or
dimension on the actual test. Another way is to simply look over your inter-item correlations.
Criterion validity is using some standard or benchmark that is known to be a good
indicator. A researcher might have devised a police cynicism scale, for example, and
compare its Cronbach's alpha to the known Cronbach's alpha of, say,
Niederhoffer's cynicism scale. There are different forms of criterion validity:
concurrent validity is how well something estimates actual day-by-day behavior;
predictive validity is how well something estimates some future event or
manifestation that hasn't happened yet. The latter type is commonly found in
criminology. Suppose you are creating a scale that predicts how and when juveniles
become mass murderers. To establish predictive validity, you would have to find at
least one mass murderer, and investigate if the predictive factors on your scale,
retrospectively, affected them earlier in life. With criterion validity, you're concerned
with how well your items are determining your dependent variable.
Construct validity is the extent to which your items are tapping into the underlying
theory or model of behavior. It's how well the items hang together (convergent
validity) or distinguish different people on certain traits or behaviors (discriminant
validity). It's the most difficult validity to achieve. You have to either do years and
years of research or find a group of people to test that have the exact opposite traits or
behaviors you're interested in measuring.
A LIST OF THREATS TO RELIABILITY AND VALIDITY
I. Overview
The goal of this set of notes is to explore issues of reliability and validity as they apply
to psychological measurement. The approach will be to look at these issues by
examining a particular scale, the PTSD-Interview (PTSD-I: Watson, Juba, Manifold,
Kucala, & Anderson, 1991). The issues to be discussed include:
(a) How would you go about developing a scale to measure posttraumatic stress
disorder?
(b) What items would you include in your scale and how would you determine the
content validity of the scale?
(c) How would you determine the reliability of the scale?
(d) How would you determine the validity of the scale?
This web page will focus on the first three issues. A companion web page will
look at the validity question.
B. Content Validity
Content validity asks the question, "Do the items on the scale adequately sample the
domain of interest?"
If you are developing a test to diagnose PTSD then the test must adequately reflect all
of the DSM diagnostic criteria.
See handout: The PTSD-I Scale
Do the items on the PTSD-I scale adequately reflect the DSM-IV criteria? How
might you come to a quantitative decision about the content validity of the scale?
The DSM criteria are all or nothing. Participants respond to the PTSD-I items on
7-point scales:

1 = No / Never
2 = Very little / Very rarely
3 = A little / Sometimes
4 = Somewhat / Commonly
5 = Quite a bit / Often
6 = Very much / Very often
7 = Extremely / Always
How do you determine the minimum score you need to fit the all-or-nothing criteria
of the DSM? Is 3 "A little sometimes" enough to meet the criterion? Is 7
"extremely always" necessary to meet the criterion? If you make the criteria too strict
then you will underdiagnose PTSD. On the other hand, if you make the criteria too
lenient, you will overdiagnose PTSD.
One approach would be to go back to the DSM criteria to see if they give any
guidance. In this instance, are the DSM-IV guidelines helpful in determining the
minimum score that should be used to indicate that a specific criterion was met?
An empirical approach is to examine diagnostic utility indices at each possible cutting
score. You can choose a cut score for the items that will maximize sensitivity,
maximize specificity, or maximize efficiency. Watson et al. (1991) determined that a
score of 4 or greater on an item indicated that the DSM-III-R criterion was met. They
chose that score because it produced an optimal sensitivity/specificity balance.
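The cut-score search can be sketched as follows, with hypothetical item ratings and gold-standard diagnoses:

```python
# Sensitivity: proportion of true cases the cut score catches.
# Specificity: proportion of non-cases it correctly rejects.
# Efficiency: overall proportion of correct classifications.
def diagnostic_indices(scores, diagnosis, cut):
    flagged = [int(s >= cut) for s in scores]
    tp = sum(f * d for f, d in zip(flagged, diagnosis))
    tn = sum((1 - f) * (1 - d) for f, d in zip(flagged, diagnosis))
    pos = sum(diagnosis)
    neg = len(diagnosis) - pos
    return tp / pos, tn / neg, (tp + tn) / len(scores)

# Hypothetical 1-7 item ratings and gold-standard diagnoses (1 = PTSD).
scores    = [2, 3, 3, 4, 5, 6, 7, 1, 2, 2, 3, 4, 5, 2]
diagnosis = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

for cut in range(1, 8):
    sens, spec, eff = diagnostic_indices(scores, diagnosis, cut)
    print(cut, round(sens, 2), round(spec, 2), round(eff, 2))
```

Scanning the printout for the cut with the best sensitivity/specificity balance mirrors what Watson et al. did when they settled on a cut of 4.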
What are the advantages and disadvantages of using this type of continuous scale
rather than an all-or-nothing (yes or no) response scale?
If you are developing a content test for general psychology, how would you determine
if the test had adequate content validity?
D. Face Validity
Face validity refers to the issue of whether or not the items are measuring what they
appear, on the face of it, to measure.
Does the PTSD-I have face validity?
Does the MMPI have face validity?
B. Reliability Estimates

Single measure, measured at one testing session: HOMOGENEITY
Cronbach's coefficient alpha, r11 = (n * rii) / [1 + (n - 1) * rii], where n is the
number of items in the test and rii is the average correlation between all of the test
items. Error variance is due to content sampling. See also the computational formula
for Cronbach's alpha (footnote 1). (A graphic representation of reliability as a
function of n and rii accompanies the original table.)

Single measure, measured at different testing sessions: TEST-RETEST RELIABILITY or
STABILITY
Measured as the correlation between the same test given at different times. Error
variance is due to time sampling.

Different forms of the measure, measured at one testing session: EQUIVALENCE
Measured as the correlation between different forms of the test given at the same
time. Error variance is due to content sampling.

Different forms of the measure, measured at different testing sessions: STABILITY
AND EQUIVALENCE
Measured as the correlation between different forms of the test given at different
times. Error variance is due to content sampling and time sampling.

C. Reliability Standards
A good rule of thumb for reliability is that if the test is going to be used to make
decisions about people's lives (e.g., the test is used as a diagnostic tool that will
determine treatment, hospitalization, or promotion), then the minimum acceptable
coefficient alpha is .90.
This rule of thumb can be substantially relaxed if the test is going to be used for
research purposes only.
Here are the reliability estimates for the PTSD-I and the Posttraumatic Stress
Diagnostic Scale (PDS; Foa, 1995). The PTSD-I is based on the DSM-III-R. The PDS
is based on the DSM-IV.
Reliability Estimates for the PTSD-I and PDS

Single measure, measured at one testing session: HOMOGENEITY (alpha)
PTSD-I: .92 [1]; .87 to .93 [2,3]
PDS: .92 [6]

Single measure, measured at different testing sessions: TEST-RETEST RELIABILITY
PTSD-I: 1 week = .95 [1]; 90 days = .76 [2,4] and .91 [2,5]
PDS: 10-22 days = .83 [7]

Different forms of the measure (EQUIVALENCE; STABILITY AND EQUIVALENCE): no
estimates are reported.
t' = r11 * X1 = .90 * X1

For a test with reliability r11 = .90 and an obtained-score standard deviation of 6.06,
the standard deviation of the estimated true scores is .90 * 6.06 = 5.45, the standard
error of measurement is 6.06 * sqrt(1 - .90) = 1.9163, and the 95% confidence
interval around t' is t' +/- 1.96 * 1.9163 = t' +/- 3.76. For obtained deviation scores X1:

X1       t' = .90 * X1    95% CI lower    95% CI upper
-8.00    -7.20            -10.96          -3.44
-3.00    -2.70            -6.46           1.06
-2.00    -1.80            -5.56           1.96
-1.00    -0.90            -4.66           2.86
0.00     0.00             -3.76           3.76
1.00     0.90             -2.86           4.66
5.00     4.50             0.74            8.26
8.00     7.20             3.44            10.96
9.00     8.10             4.34            11.86

(The first two rows correspond to raw scores of 12 and 17 on a test with a mean of 20.)
Introductory level measurement books typically say that the confidence interval for an
obtained score can be constructed around that obtained score rather than around the
true score. They are technically incorrect, but the confidence interval so constructed
will not be too far off as long as the reliability of the test is high.
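Using the worked values above (r11 = .90, SD = 6.06), the estimated true score and its 95% confidence interval can be computed as:

```python
from math import sqrt

r11, sd = 0.90, 6.06
sem = sd * sqrt(1 - r11)      # standard error of measurement, ~1.92
half_width = 1.96 * sem       # 95% CI half-width, ~3.76

def true_score_ci(x_deviation):
    """CI around the estimated true (deviation) score t' = r11 * x."""
    t = r11 * x_deviation
    return round(t, 2), round(t - half_width, 2), round(t + half_width, 2)

print(true_score_ci(-8.0))   # (-7.2, -10.96, -3.44), matching the -8.00 row above
```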
F. Interrater reliability
Interrater reliability is concerned with the consistency between the judgments of 2 or
more raters.
In the past, interrater reliability has been measured by having 2 (or more) people
make decisions (e.g. meets PTSD diagnosis criteria) or ratings across a number of
cases and then find the correlation between those two sets of decisions or ratings. That
method gives an overestimate of the interrater reliability so it is rarely used.
One of the more commonly used measures of interrater reliability is kappa. Kappa
takes into account the expected level of agreement between judges or raters. For that
reason it is considered to be a more appropriate measure of interrater reliability. For
a discussion of kappa and how to compute it using SPSS, see Crosstabs: Kappa.
The interrater reliability indices for the PTSD-I are shown below.
Interrater Reliability for the PTSD-I and PDS
PTSD-I: kappa = .61 [1]
PTSD-I: kappa > .84 [2]
PDS: kappa = .77 [3]
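A sketch of kappa computed by hand from a hypothetical 2x2 agreement table (observed agreement corrected for the agreement expected by chance):

```python
# Hypothetical counts of two raters' yes/no diagnostic decisions:
#                 rater B: yes   rater B: no
# rater A: yes         20            5
# rater A: no           5           70
a_yes_b_yes, a_yes_b_no, a_no_b_yes, a_no_b_no = 20, 5, 5, 70
n = a_yes_b_yes + a_yes_b_no + a_no_b_yes + a_no_b_no

p_observed = (a_yes_b_yes + a_no_b_no) / n            # raw agreement
p_yes = ((a_yes_b_yes + a_yes_b_no) / n) * ((a_yes_b_yes + a_no_b_yes) / n)
p_no  = ((a_no_b_yes + a_no_b_no) / n) * ((a_yes_b_no + a_no_b_no) / n)
p_chance = p_yes + p_no                               # agreement expected by chance

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 2))
```

Note how the raw agreement of .90 shrinks once chance agreement is removed, which is exactly why kappa is preferred over a simple correlation or percent agreement.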
Consider a test's classifications of 100 cases against the diagnostic criteria:

                                 Meets Criteria    Does not Meet Criteria    Total
Test: Meets Criteria             n = 20            n = 10                    n = 30
Test: Does not Meet Criteria     n = 5             n = 65                    n = 70
Total                            n = 25            n = 75                    n = 100

Sensitivity is the column percent among those who actually meet the criteria:
20/25 = .80. The missed cases (n = 5) are false negatives (Type II errors).
Specificity is the column percent among those who do not meet the criteria:
65/75 = .87. The incorrectly flagged cases (n = 10) are false positives (Type I errors).
Diagnostic utility indices:

               PTSD-I    PDS
Sensitivity    .89       .82
Specificity    .94       .77
Efficiency     .92       .79
References
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests.
Psychometrika, 16, 297-334.
Foa, E. B. (1995). Posttraumatic Stress Diagnostic Scale. Minneapolis: National
Computer Systems.
Robins, L. H., & Helzer, J. E. (1985). Diagnostic Interview Schedule (DIS Version
III-A). St. Louis, MO: Washington University, Department of Psychiatry.
Watson, C. G., Juba, M. P., Manifold, V., Kucala, T., & Anderson, P. E. D. (1991).
The PTSD interview: Rationale, description, reliability, and concurrent validity of a
DSM-III based technique. Journal of Clinical Psychology, 47, 179-188.
Williams, J. B. W., Gibbon, M., First, M. B., Spitzer, R. L., Davies, M., Borus, J.,
Howes, M. J., Kane, J., Pope, H. G., Rounsaville, B., & Wittchen, H. U. (1992). The
Structured Clinical Interview for DSM-III-R (SCID). Archives of General Psychiatry,
49, 630-636.
Wilson, S. A., Tinker, R. H., Becker, L. A., & Gillette, C. S. (1994, November). Using
the PTSD-I as an outcome measure. Poster presented at the annual meeting of the
International Society for Traumatic Stress Studies, Chicago, IL.
Footnotes
1. The computational formula for Cronbach's alpha uses the standard deviations of
each of the items and the standard deviation of the test as a whole rather than the
average intercorrelation between all of the items:

alpha = [n / (n - 1)] * [1 - (sum of the item variances) / (variance of the total test scores)]

where n is the number of items in the test, the sum of the item variances is the sum of
the squared standard deviations of each of the items, and the variance of the total test
scores is the squared standard deviation of the test as a whole.
Or, you could use a computer program to compute alpha. In SPSS for Windows,
Cronbach's alpha can be found under:
Statistics
Scale
Reliability analysis ...
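Outside SPSS, the computational formula in footnote 1 is easy to apply directly; a sketch with a hypothetical response matrix:

```python
# Cronbach's alpha via the computational formula:
# alpha = (n / (n - 1)) * (1 - sum of item variances / variance of total scores).
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Hypothetical responses (rows = people, columns = items, rated 1-5).
responses = [
    [3, 4, 3, 5],
    [2, 2, 3, 2],
    [4, 5, 5, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 4],
]
n_items = len(responses[0])
item_vars = [variance([row[i] for row in responses]) for i in range(n_items)]
total_var = variance([sum(row) for row in responses])

alpha = (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))
```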
V. Validity
Validity refers to the issue of whether the test measures what it is intending to
measure. Does a test of, say, mathematics ability measure that ability, or is reading
comprehension a part of what is measured by the test? The validity of a test is
constrained by its reliability. If a test does not consistently measure a construct or
domain, then it cannot be expected to have high validity coefficients.
A. Criterion-related validity
Predictive validity. Criterion related validity refers to how strongly the scores on the
test are related to other behaviors. A test may be used to predict future behavior, e.g.,
will people who score high on the "in basket" test be good supervisors if they are
promoted, or will people who score high on the GRE exam be successful in graduate
school? In order to find the relationship between the test and some behavior you need
to clearly specify the behavior that you want to predict, that is, you need to specify the
criterion. This is not often an easy task. What do you mean by being a "good
supervisor" or "success in graduate school" and how would you measure those
characteristics?
As far as I know there are no studies that have looked at the predictive validity of the
PTSD-I. How would you devise a study to explore the predictive validity of the
PTSD-I?
Concurrent validity. Rather than predicting future behavior you may be interested
in the relationship between your test and other tests that purport to measure the same
domain. For example, if you are creating a new, shorter, less costly test of
intelligence, you would be interested in the relationship between your test and a
standard test of intelligence, for example the WAIS. If you gave both tests at the
same time and found the correlation between the two you would be determining the
concurrent validity of your test.
Watson et al. (1991) examined the concurrent validity of the PTSD-I and the
posttraumatic stress disorder section of the Diagnostic Interview Schedule (DIS;
Robins & Helzer, 1985). As mentioned earlier, the DIS is a commonly used diagnostic
measure of PTSD. They computed the correlations between the individual items of
the PTSD-I and the corresponding items of the DIS and found that they ranged from
.57 to .99, with a median correlation of .77. They also looked at the correlation
between the total score of the PTSD-I and the diagnosis of PTSD based on the DIS;
it was .94. We
know from our earlier discussion that the reliability of a test depends upon the number
of items. So we would expect that the reliability of the total test score would be
higher than the reliability of the individual items on the test.
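That expectation follows from the standard Spearman-Brown prophecy formula, which gives the reliability of a test lengthened by a factor k; a sketch with a hypothetical single-item reliability:

```python
# Spearman-Brown prophecy formula: r_kk = k*r / (1 + (k - 1)*r), where r is
# the reliability of one item (or unit) and k is the lengthening factor.
def spearman_brown(r, k):
    return k * r / (1 + (k - 1) * r)

single_item_r = 0.40                      # hypothetical single-item reliability
for k in (1, 5, 10, 20):
    print(k, round(spearman_brown(single_item_r, k), 2))
```

Even modestly reliable items aggregate into a highly reliable total score, which is why the PTSD-I total correlates more strongly with the DIS diagnosis than its individual items do.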
Wilson et al. (1994) administered the PTSD-I, the Impact of Events Scale (IES;
Horowitz, Wilner, & Alvarez, 1979), and the Symptom Check List-90-Revised
(SCL-90-R; Derogatis, 1992) to 80 adult participants in a study of traumatic memory.
The IES scale measures the PTSD symptoms of intrusion and avoidance. It was not
designed to be used to diagnose PTSD. The SCL-90-R measures the occurrence of 9
psychological symptoms for psychiatric and medical patients. Many of the items on
the PTSD-I are phrased in terms of the lifetime incidence of PTSD symptoms. For
example, item C-5 reads, "Have you felt more cut off emotionally from other people
at some period than you did before (the stressor)?" This form of the PTSD-I was used
at the pretest, prior to the administration of the treatment. The concurrent validities of
the PTSD-I with the IES and the Global Severity Index (GSI) of the SCL-90-R were
.54 and .66, respectively. We belatedly recognized that questions of that form would
not be useful in measuring changes due to treatment, so we revised the time frame of
the scale to match that used by the IES and the SCL-90-R, which was one week.
Using the one-week time frame for the PTSD-I, the concurrent validities of the
PTSD-I with the IES and the GSI were substantially higher, .85 and .88 respectively;
see Table 1.
Why do you think that the two versions of the PTSD-I would have such different
concurrent validities? Which are the "best" concurrent validities?
Table 1. Concurrent Validities for the PTSD-I and the PDS

PTSD-I
  PTSD diagnosis made by the DIS: .94 [1]
  Impact of Events Scale (IES), PTSD-I symptoms as lifetime measure: .54 [2]
  Impact of Events Scale (IES), PTSD-I symptoms within the past week: .85 [2]
  Global Severity Index (GSI) of the SCL-90-R, PTSD-I symptoms as lifetime
  measure: .66 [2]
  Global Severity Index (GSI) of the SCL-90-R, PTSD-I symptoms within the past
  week: .88 [2]

PDS (symptom severity score)
  IES-I = .80 [3]
  IES-A = .66 [3]
  Beck Depression Inventory (BDI) = .79 [3]
  State-Trait Anxiety Scale (STAI-State) = .73 [3]
  State-Trait Anxiety Scale (STAI-Trait) = .74 [3]

1. Watson et al. (1991).
B. Construct validity
When you ask about construct validity you are taking a broad view of your test. Does
the test adequately measure the underlying construct? The question is asked both in
terms of convergent validity (are test scores related to the behaviors and tests they
should be related to?) and in terms of divergent validity (are test scores unrelated to
the behaviors and tests they should be unrelated to?).
There is no single measure of construct validity. Construct validity is based on the
accumulation of knowledge about the test and its relationship to other tests and
behaviors.
Convergent validity
Correlational approach. The concurrent validities between the test and other
measures of the same domain are correlational measures of convergent validity.
Contrasted groups. Another way of measuring convergent validity is to look at
the differences in test scores between groups of people who would be expected to
score differently on the test. For example, Watson et al. (1991) looked at the
differences in PTSD-I scores for those who were and were not diagnosed with PTSD
by the DIS. They found that the PTSD-I score for those diagnosed with PTSD was
higher (M = 58.2, SD = 14.5) than for those not diagnosed with PTSD (M = 28.0,
SD = 12.2), t(59) = 8.68, p < .0001.
Experimental. A meaningful treatment effect size demonstrates experimental
intervention validity for a measure. Wilson et al. (1994) computed effect sizes
between the EMDR treated group and the delayed treatment group by dividing the
difference between the two groups by the standard deviation of the delayed treatment
control group (Glass, McGaw, & Smith, 1981). The effect size for the 7-day version
of the PTSD-I was 1.28. The comparable effect sizes for the IES and GSI were 1.41
and 0.66, respectively. You might also report significance tests as measures of
experimental convergent validity, but effect sizes are more informative.
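A sketch of that effect-size computation (Glass, McGaw, & Smith, 1981), with hypothetical posttest symptom scores:

```python
from math import sqrt

def glass_delta(treated, control):
    """Mean difference divided by the control group's standard deviation."""
    m_t = sum(treated) / len(treated)
    m_c = sum(control) / len(control)
    sd_c = sqrt(sum((x - m_c) ** 2 for x in control) / (len(control) - 1))
    # Lower symptom scores after treatment yield a positive effect size.
    return (m_c - m_t) / sd_c

treated = [22, 25, 30, 18, 27]   # hypothetical posttest PTSD symptom scores
control = [40, 35, 45, 38, 42]
print(round(glass_delta(treated, control), 2))
```

Dividing by the control group's standard deviation (rather than a pooled one) keeps the metric in units unaffected by the treatment itself, which is the rationale Glass et al. give for this choice.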
Discriminant validity
Discriminant validity (are the test scores unrelated to tests and behaviors in different
domains?) seems to be assessed less often than convergent validity. But the question
of discriminant validity is important when you are trying to distinguish your theory
from another theory. The subtitle of Daniel Goleman's book Emotional Intelligence
is "Why it can matter more than IQ." His argument is that emotional IQ is different
from traditional IQ and so measures of emotional IQ should not correlate very highly
with measures of traditional IQ. This is a question of discriminant validity.
References
Derogatis, L. R. (1992). SCL-90-R: Administration, scoring and procedures manual-II. Baltimore, MD: Clinical Psychometric Research.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research.
Beverly Hills, CA: Sage Publications.
Goleman, D. (1995). Emotional intelligence: Why it can matter more than IQ. New
York: Bantam Books.
Horowitz, M. J., Wilner, N., & Alvarez, W. (1979). Impact of Events Scale: A
measure of subjective distress. Psychosomatic Medicine, 41, 209-218.
Robins, L. H., & Helzer, J. E. (1985). Diagnostic Interview Schedule (DIS Version
III-A). St. Louis, MO: Washington University, Department of Psychiatry.
Watson, C. G., Juba, M. P., Manifold, V., Kucala, T., & Anderson, P. E. D. (1991).
The PTSD interview: Rationale, description, reliability, and concurrent validity of a
DSM-III based technique. Journal of Clinical Psychology, 47, 179-188.
Wilson, S. A., Tinker, R. H., Becker, L. A., & Gillette, C. S. (1994, November). Using
the PTSD-I as an outcome measure. Poster presented at the annual meeting of the
International Society for Traumatic Stress Studies, Chicago, IL.