Form equivalence: two different forms of a test, based on the same content, given on
one occasion to the same examinees (alternate form)
Internal consistency: the coefficient of test scores obtained from a single test
or survey (Cronbach's alpha, KR-20, split-half). For instance, suppose
respondents are asked to rate statements in an attitude survey about computer
anxiety. One statement is "I feel very negative about computers in general."
Another statement is "I enjoy using computers." People who strongly agree
with the first statement should strongly disagree with the second statement,
and vice versa. If several respondents rate both statements high, or both low,
the responses are said to be inconsistent and patternless. The
same principle can be applied to a test. When no pattern is found in the
students' responses, the test is probably too difficult and the students are simply
guessing at random.
Reliability is a necessary but not sufficient condition for validity. For
instance, if the needle of the scale is five pounds away from zero, I always
over-report my weight by five pounds. Is the measurement consistent? Yes,
but it is consistently wrong! Is the measurement valid? No! (But if it
under-reports my weight by five pounds, I will consider it a valid measurement.)
Performance, portfolio, and responsive evaluations, where the tasks vary
substantially from student to student and where multiple tasks may be
evaluated simultaneously, are attacked for lacking reliability. One of the
difficulties is that there is more than one source of measurement error in
performance assessment. For example, the reliability of a writing-skill test score
is affected by the raters, the mode of discourse, and several other factors
(Parkes, 2000).
Face validity: Face validity simply means validity at face value. As a
check on face validity, test/survey items are sent to teachers to obtain
suggestions for modification. Because of its vagueness and subjectivity,
psychometricians abandoned this concept long ago. However,
outside the measurement arena, face validity has come back in another form.
While discussing the validity of a theory, Lacity and Jansen (1994) define
validity as making common sense, and being persuasive and seeming right to
the reader. For Polkinghorne (1988), validity of a theory refers to results that
have the appearance of truth or reality.
The internal structure of things may not concur with the appearance. Many
times professional knowledge runs counter to common sense. The criteria of
validity in research should go beyond "face," "appearance," and "common
sense."
When was the founder and CEO of Microsoft, William Gates III, born?
a. 1949
b. 1953
c. 1957
d. None of the above
Validity is not a property of the test or assessment...but rather of the meaning of the
test scores.
References
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (1985). Standards for educational
and psychological testing. Washington, DC: Authors.
Angoff, W. H. (1988). Validity: An evolving concept. In H. Wainer & H. I. Braun
(Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum.
Brennan, R. (2001). An essay on the history and future of reliability from the
perspective of replications. Journal of Educational Measurement, 38, 295-317.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational
measurement (2nd ed.). Washington, DC: American Council on Education.
Cronbach, L. J. (2004). My current thoughts on coefficient alpha and successor
procedures. Educational and Psychological Measurement, 64, 391-418.
Cronbach, L. J., & Quirk, T. J. (1976). Test validity. In International encyclopedia of
education. New York: McGraw-Hill.
Goodenough, F. L. (1949). Mental testing: Its history, principles, and applications.
New York: Rinehart.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error
and bias in research findings. Newbury Park, CA: Sage.
Kane, M. (2001). Current concerns in validity theory. Journal of Educational
Measurement, 38, 319-342.
MODULE R10
RELIABILITY AND VALIDITY
Warwick and Linninger (1975) point out that there are two basic goals in questionnaire
design.
1. To obtain information relevant to the purposes of the survey.
2. To collect this information with maximal reliability and validity.
How can a researcher be sure that the data gathering instrument being used will
measure what it is supposed to measure and will do this in a consistent manner? This is
a question that can only be answered by examining the definitions for and methods of
establishing the validity and reliability of a research instrument. These two very
important aspects of research design will be discussed in this module.
Validity
Validity can be defined as the degree to which a test measures what it is supposed to
measure. There are three basic approaches to the validity of tests and measures as shown
by Mason and Bramble (1989). These are content validity, construct validity, and
criterion-related validity.
Content Validity
This approach measures the degree to which the test items represent the domain or
universe of the trait or property being measured. In order to establish the content
validity of a measuring instrument, the researcher must identify the overall content to be
represented. Items must then be randomly chosen from this content that will accurately
represent the information in all areas. By using this method the researcher should obtain
a group of items which is representative of the content of the trait or property to be
measured.
Identifying the universe of content is not an easy task. It is, therefore, usually suggested
that a panel of experts in the field to be studied be used to identify a content area. For
example, in the case of researching the knowledge of teachers about a new curriculum, a
group of curriculum and teacher education experts might be asked to identify the content
of the test to be developed.
Construct Validity
Cronbach and Meehl (1955) indicated that, "Construct validity must be investigated
whenever no criterion or universe of content is accepted as entirely adequate to define the
quality to be measured" as quoted by Carmines and Zeller (1979). The term construct in
this instance is defined as a property that is offered to explain some aspect of human
behavior, such as mechanical ability, intelligence, or introversion (Van Dalen, 1979). The
construct validity approach concerns the degree to which the test measures the construct
it was designed to measure.
There are two parts to the evaluation of the construct validity of a test. First and most
important, the theory underlying the construct to be measured must be considered.
Second, the adequacy of the test in measuring the construct is evaluated (Mason and
Bramble, 1989). For example, suppose that a researcher is interested in measuring the
introverted nature of first year teachers. The researcher defines introverted as the overall
lack of social skills such as conversing, meeting and greeting people, and attending
faculty social functions. This definition is based upon the researcher's own observations.
A panel of experts is then asked to evaluate this construct of introversion. The panel
cannot agree that the qualities pointed out by the researcher adequately define the
construct of introversion. Furthermore, the researcher cannot find evidence in the
research literature supporting the introversion construct as defined here. Using this
information, the validity of the construct itself can be questioned. In this case the
researcher must reformulate the previous definition of the construct.
Once the researcher has developed a meaningful, useable construct, the adequacy of the
test used to measure it must be evaluated. First, data concerning the trait being
measured should be gathered and compared with data from the test being assessed. The
data from other sources should be similar or convergent. If convergence exists, construct
validity is supported.
After establishing convergence, the discriminant validity of the test must be determined.
This involves demonstrating that the construct can be differentiated from other
constructs that may be somewhat similar. In other words, the researcher must show that
the construct being measured is not the same as one that was measured under a different
name.
Criterion-Related Validity
This approach is concerned with detecting the presence or absence of one or more criteria
considered to represent traits or constructs of interest. One of the easiest ways to test for
criterion-related validity is to administer the instrument to a group that is known to
exhibit the trait to be measured. This group may be identified by a panel of experts. A
wide range of items should be developed for the test with invalid questions culled after
the control group has taken the test. Items should be omitted that are drastically
inconsistent with respect to the responses made among individual members of the group.
If the researcher has developed quality items for the instrument, the culling process
should leave only those items that will consistently measure the trait or construct being
studied. For example, suppose one wanted to develop an instrument that would identify
teachers who are good at dealing with abused children. First, a panel of unbiased experts
identifies 100 teachers out of a larger group that they judge to be best at handling abused
children. The researcher develops 400 yes/no items that will be administered to the whole
group of teachers, including those identified by the experts. The responses are analyzed,
and the items to which the expert-identified teachers and the other teachers respond
differently are taken to be the questions that will identify teachers who are good at
dealing with abused children.
Reliability
The reliability of a research instrument concerns the extent to which the instrument
yields the same results on repeated trials. Although unreliability is always present to a
certain extent, there will generally be a good deal of consistency in the results of a quality
instrument gathered at different times. The tendency toward consistency found in
repeated measurements is referred to as reliability (Carmines & Zeller, 1979).
In scientific research, accuracy in measurement is of great importance. Scientific
research normally measures physical attributes which can easily be assigned a precise
value. Many times numerical assessments of the mental attributes of human beings are
accepted as readily as numerical assessments of their physical attributes. Although we
may understand that the values assigned to mental attributes can never be completely
precise, the imprecision is often looked upon as being too small to be of any practical
concern. However, the magnitude of the imprecision is much greater in the measurement
of mental attributes than in that of physical attributes. This fact makes it very important
that the researcher in the social sciences and humanities determine the reliability of the
data gathering instrument to be used (Willmott & Nuttall, 1975).
Retest Method
One of the easiest ways to determine the reliability of empirical measurements is by the
retest method in which the same test is given to the same people after a period of time.
The reliability of the test (instrument) can be estimated by examining the consistency of
the responses between the two tests.
If the researcher obtains the same results on the two administrations of the instrument,
then the reliability coefficient will be 1.00. Normally, the correlation of measurements
across time will be less than perfect due to different experiences and attitudes that
respondents have encountered from the time of the first test.
The test-retest method is a simple, clear-cut way to determine reliability, but it can be
costly and impractical. Researchers are often only able to obtain measurements at a
single point in time or do not have the resources for multiple administrations.
Alternative Form Method
Like the retest method, this method also requires two testings with the same people.
However, the same test is not given each time. Each of the two tests must be designed to
measure the same thing and should not differ in any systematic way. One way to help
ensure this is to use random procedures to select items for the different tests.
The alternative form method is viewed as superior to the retest method because a
respondent's memory of test items is not as likely to play a role in the data received. One
drawback of this method is the practical difficulty of developing test items that are
consistent in the measurement of a specific phenomenon.
Split-Halves Method
This method is more practical in that it does not require two administrations of the same
or an alternative form test. In the split-halves method, the total number of items is
divided into halves, and a correlation taken between the two halves. This correlation only
estimates the reliability of each half of the test. It is necessary then to use a statistical
correction to estimate the reliability of the whole test. This correction is known as the
Spearman-Brown prophecy formula (Carmines & Zeller, 1979):
Pxx" = 2Pxx' / (1 + Pxx')
where Pxx" is the reliability coefficient for the whole test and Pxx' is the split-half
correlation.
Example
If the correlation between the halves is .75, the reliability for the total test is:
2(.75)/(1 + .75) = 1.50/1.75 = .857
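As a minimal sketch (pure arithmetic, no external libraries), the Spearman-Brown correction above can be expressed as a one-line function:

```python
def spearman_brown(split_half_r):
    """Step a split-half correlation up to an estimate of full-test
    reliability, using the prophecy formula above."""
    return 2 * split_half_r / (1 + split_half_r)

print(round(spearman_brown(0.75), 3))  # → 0.857, matching the example
```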
If one is using the correlation matrix rather than the variance-covariance matrix then
alpha reduces to the following:
alpha = Np/[1+p(N-1)]
where N equals the number of items and p equals the mean interitem correlation.
Example
If the average intercorrelation of a six-item scale is .5, then the alpha for the scale
would be:
alpha = 6(.5)/[1+.5(6-1)]
= 3/3.5 = .857
An example of how alpha can be calculated can be given by using the 10-item self-esteem
scale developed by Rosenberg (1965). (See table.) The 45 interitem correlations in the table
are first summed: .185 + .451 + .048 + . . . + .233 = 14.487. Then the mean interitem
correlation is found by dividing this sum by 45: 14.487/45 = .32. Now use this number to
calculate alpha:
alpha = 10(.32)/[1+.32(10-1)]
= 3.20/3.88
= .825
The coefficient alpha is an internal consistency index designed for use with tests
containing items that have no right answer. This is a very useful tool in educational and
social science research because instruments in these areas often ask respondents to rate
the degree to which they agree or disagree with a statement on a particular scale.
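Assuming nothing beyond the standardized-alpha formula given above (N items and mean interitem correlation p), the two worked examples can be sketched in Python:

```python
def standardized_alpha(n_items, mean_r):
    """Cronbach's alpha from the number of items (N) and the mean
    interitem correlation (p): alpha = Np / [1 + p(N - 1)]."""
    return n_items * mean_r / (1 + mean_r * (n_items - 1))

# The two worked examples from the text:
print(round(standardized_alpha(6, 0.5), 3))    # six-item scale, mean r = .5
print(round(standardized_alpha(10, 0.32), 3))  # Rosenberg scale, mean r = .32
```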
Cronbach's Alpha Example
[Table of ten respondents' item scores, total scores, means, and variances omitted.]
P = Mean Interitem Correlation
alpha = Np / [1+p(N-1)]
= (10)(.36)/[1+.36(10-1)]
= 3.6/4.24
= .849
Measurement
Reliability
Reliability refers to a condition where a measurement process yields consistent scores
(given an unchanged measured phenomenon) over repeated measurements. Perhaps the
most straightforward way to assess reliability is to check measures against the following
three criteria of reliability. Measures that are high in reliability should exhibit all
three.
Test-retest Reliability
When a researcher administers the same measurement tool multiple times - asks the
same question, follows the same research procedures, etc. - does he/she obtain
consistent results, assuming that there has been no change in whatever he/she is
measuring? This is really the simplest method for assessing reliability: when a
researcher asks the same person the same question twice ("What's your name?"), does
he/she get back the same results both times? If so, the measure has test-retest
reliability. Measurement of the piece of wood talked about earlier has high test-retest
reliability.
Inter-item reliability
This is a dimension that applies to cases where multiple items are used to measure a
single concept. In such cases, answers to a set of questions designed to measure some
single concept (e.g., altruism) should be associated with each other.
Interobserver reliability
Interobserver reliability concerns the extent to which different interviewers or
observers using the same measure get equivalent results. If different observers or
interviewers use the same instrument to score the same thing, their scores should
match. For example, the interobserver reliability of an observational assessment of
parent-child interaction is often evaluated by showing two observers a videotape of a
parent and child at play. These observers are asked to use an assessment tool to score
the interactions between parent and child on the tape. If the instrument has high
interobserver reliability, the scores of the two observers should match.
Validity
To reiterate, validity refers to the extent to which we are measuring what we hope to
measure (and what we think we are measuring). How do we assess the validity of a set
of measurements? A valid measure should satisfy four criteria.
Face Validity
This criterion is an assessment of whether a measure appears, on its face, to
measure the concept it is intended to measure. This is a minimal assessment: if
a measure cannot satisfy this criterion, then the other criteria are inconsequential. We
can think about observational measures of behavior that would have face validity. For
example, striking out at another person would have face validity for an indicator of
aggression. Similarly, offering assistance to a stranger would meet the criterion of
face validity for helping. However, asking people about their favorite movie to
measure racial prejudice has little face validity.
Content Validity
Content validity concerns the extent to which a measure adequately represents all
facets of a concept. Consider a series of questions that serve as indicators of
depression (don't feel like eating, lost interest in things usually enjoyed, etc.). If there
were other kinds of common behaviors that mark a person as depressed that were not
included in the index, then the index would have low content validity since it did not
adequately represent all facets of the concept.
Criterion-related Validity
Criterion-related validity applies to instruments that have been developed to be useful
as indicators of a specific trait or behavior, either now or in the future. For
example, think about the driving test as a social measurement that has pretty good
predictive validity. That is to say, an individual's performance on a driving test
correlates well with his/her driving ability.
Construct Validity
But for many things we want to measure, there is not necessarily a pertinent
criterion available. In this case, we turn to construct validity, which concerns the extent to
which a measure is related to other measures as specified by theory or previous
research. Does a measure stack up with other variables the way we expect it to? A
good example of this form of validity comes from early self-esteem studies; self-esteem
refers to a person's sense of self-worth or self-respect. Clinical observations in
psychology had shown that people who had low self-esteem often had depression.
Therefore, to establish the construct validity of the self-esteem measure, the
researchers showed that those with higher scores on the self-esteem measure had
lower depression scores, while those with low self-esteem had higher rates of
depression.
At best, we have a measure that has both high validity and high reliability. It yields
consistent results in repeated application and it accurately reflects what we hope to
represent.
It is possible to have a measure that has high reliability but low validity: one that is
consistent in getting bad information or consistent in missing the mark. It is also
possible to have one that has low reliability and low validity: inconsistent and not on
target.
Finally, it is not possible to have a measure that has low reliability and high validity:
you can't really get at what you want or what you're interested in if your measure
fluctuates wildly.
This web page was designed to provide you with basic information on an important
characteristic of a good measurement instrument: reliability. Prior to starting any
research project, it is important to determine how you are going to measure a
particular phenomenon. This process of measurement is important because it allows
you to know whether you are on the right track and whether you are measuring what
you intend to measure. Both reliability and validity are essential for good
measurement, because they are your first line of defense against forming inaccurate
conclusions (i.e., incorrectly accepting or rejecting your research hypotheses).
This tutorial will address only general issues of reliability.
What is Reliability?
I am sure you are familiar with terms such as consistency, predictability,
dependability, stability, and repeatability. Well, these are the terms that come to mind
when we talk about reliability. Broadly defined, reliability of a measurement refers to
the consistency or repeatability of the measurement of some phenomena. If a
measurement instrument is reliable, that means the instrument can measure the same
thing more than once or using more than one method and yield the same result. When
we speak of reliability, we are not speaking of individuals, we are actually talking
about scores.
The observed score is one of the major components of reliability. The observed score
is just that: the score you would observe in a research setting. The observed score
is composed of a true score and an error score. The true score is a theoretical concept.
Why is it theoretical? Because there is no way to really know what the true score is
(unless you're God). The true score reflects the true value of a variable. The error
score is the reason why the observed is different from the true score. The error score
is further broken down into method (or systematic) error and trait (or random) error.
Method error refers to anything that causes a difference between the observed score
and true score due to the testing situation. For example, any type of disruption (loud
music, talking, traffic) that occurs while students are taking a test may cause the
students to become distracted and may affect their scores on the test. On the other
hand, trait error is caused by any factors related to the characteristic of the person
taking the test that may randomly affect measurement. An example of trait error at
work is when individuals are tired, hungry, or unmotivated. These characteristics can
affect their performance on a test, making the scores seem worse than they would be
if the individuals were alert, well-fed, or motivated.
Reliability can be viewed as the ratio of the true score over the true score plus the
error score, or:
reliability = true score / (true score + error score)
Okay, now that you know what reliability is and what its components are, you're
probably wondering how to achieve reliability. Simply put, the degree of reliability
can be increased by decreasing the error score. So, if you want a reliable instrument,
you must decrease the error.
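The true-score/error-score ratio described above can be illustrated with a small simulation. The numbers here are purely hypothetical: each observed score is a fixed true score plus a random error, and shrinking the error drives the reliability ratio toward 1.

```python
import random

def estimated_reliability(error_sd, n=5000, seed=1):
    """Hypothetical simulation: observed = true + random error.
    Returns var(true) / var(observed), the reliability ratio above."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50, 10) for _ in range(n)]
    observed = [t + rng.gauss(0, error_sd) for t in true_scores]

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    return variance(true_scores) / variance(observed)

# Decreasing the error score increases the estimated reliability
for sd in (10, 5, 1):
    print(sd, round(estimated_reliability(sd), 2))
```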
As previously stated, you can never know the actual true score of a measurement.
Therefore, it is important to note that reliability cannot be calculated; it can only be
estimated. The best way to estimate reliability is to measure the degree of correlation
between the different forms of a measurement. The higher the correlation, the higher
the reliability.
3 Aspects of Reliability
Before going on to the types of reliability, I must briefly review 3 major aspects of
reliability: equivalence, stability, and homogeneity. Equivalence refers to the degree
of agreement between 2 or more measures administered at nearly the same time. To
assess stability, a distinction must be made between the repeatability of the
measurement and that of the phenomenon being measured; one way to achieve this is
to employ two raters. Lastly, homogeneity deals with assessing how well the
different items in a measure seem to reflect the attribute one is trying to measure. The
emphasis here is on internal relationships, or internal consistency.
Types of Reliability
Now back to the different types of reliability. The first type of reliability is parallel
forms reliability. This is a measure of equivalence, and it involves administering two
different forms to the same group of people and obtaining a correlation between the
two forms. The higher the correlation between the two forms, the more equivalent the
forms.
The second type of reliability, test-retest reliability, is a measure of stability which
examines reliability over time. The easiest way to measure stability is to administer
the same test at two different points in time (to the same group of people, of course)
and obtain a correlation between the two tests. The problem with test-retest reliability
is the amount of time you wait between testings. The longer you wait, the lower your
estimation of reliability.
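As an illustrative sketch, test-retest reliability is just the correlation between the two administrations. The examinees, scores, and two-week interval below are invented for illustration:

```python
def pearson_r(x, y):
    """Pearson correlation between scores from two administrations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for five examinees tested twice, two weeks apart
time1 = [74, 82, 65, 90, 71]
time2 = [76, 80, 68, 88, 70]
print(round(pearson_r(time1, time2), 2))  # → 0.98
```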
Finally, the third type of reliability is inter-rater reliability, a measure of
homogeneity. With inter-rater reliability, two people rate a behavior, object, or
phenomenon and determine the amount of agreement between them. To determine
inter-rater reliability, you take the number of agreements and divide them by the
number of total observations.
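The agreements-divided-by-observations rule just described can be sketched directly (the rater data below are hypothetical):

```python
def percent_agreement(rater_a, rater_b):
    """Inter-rater reliability as agreements divided by total observations."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

# Two hypothetical raters scoring the same ten observations
rater_1 = ["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
rater_2 = ["yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes", "yes"]
print(percent_agreement(rater_1, rater_2))  # → 0.8 (8 agreements / 10)
```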
Nominal
Ordinal
Interval
Ratio
The nominal level of measurement describes variables that are categorical in nature.
The characteristics of the data you're collecting fall into distinct categories. If there
are a limited number of distinct categories (usually only two), then you're dealing
with a dichotomous variable. If there are an unlimited or infinite number of distinct
categories, then you're dealing with a continuous variable. Nominal variables
include demographic characteristics like sex, race, and religion.
The ordinal level of measurement describes variables that can be ordered or ranked
in some order of importance. It describes most judgments about things, such as big or
little, strong or weak. Most opinion and attitude scales or indexes in the social
sciences are ordinal in nature.
The interval level of measurement describes variables that have more or less equal
intervals, or meaningful distances between their ranks. For example, if you were to
ask somebody if they were first, second, or third generation immigrant, the
assumption is that the distance, or number of years, between each generation is the
same. All crime rates in criminal justice are interval level measures, as is any kind of
rate.
The ratio level of measurement describes variables that have equal intervals and a
fixed zero (or reference) point. It is possible to have zero income, zero education, and
no involvement in crime, but rarely do we see ratio level variables in social science
since it's almost impossible to have zero attitudes on things, although "not at all",
"often", and "twice as often" might qualify as ratio level measurement.
Advanced statistics require at least interval level measurement, so the researcher
always strives for this level, accepting ordinal level (which is the most common) only
when they have to. Variables should be conceptually and operationally defined with
levels of measurement in mind since it's going to affect how well you can analyze
your data later on.
RELIABILITY AND VALIDITY
For a research study to be accurate, its findings must be both reliable and valid.
Reliability means that the findings would be consistently the same if the study were
done over again. It sounds easy, but think of a typical exam in college; if you scored a
74 on that exam, don't you think you would score differently if you took it over
again?
Validity: A valid measure is one that provides the information that it was intended to
provide. The purpose of a thermometer, for example, is to provide information on the
temperature, and if it works correctly, it is a valid thermometer.
A study can be reliable but not valid, and it cannot be valid without first being
reliable. You cannot assume validity no matter how reliable your measurements are.
There are many different threats to validity as well as reliability, but an important
early consideration is to ensure you have internal validity. This means that you are
using the most appropriate research design for what you're studying (experimental,
quasi-experimental, survey, qualitative, or historical), and it also means that you have
screened out spurious variables as well as thought out the possible contamination of
other variables creeping into your study. Anything you do to standardize or clarify
your measurement instrument to reduce user error will add to your reliability.
It's also important early on to consider the time frame that is appropriate for what
you're studying. Some social and psychological phenomena (most notably those
involving behavior or action) lend themselves to a snapshot in time. If so, your
research need only be carried out for a short period of time, perhaps a few weeks or a
couple of months. In such a case, your time frame is referred to as cross-sectional.
Sometimes, cross-sectional research is criticized as being unable to determine cause
and effect, and a longer time frame is called for, one that is called longitudinal,
which may add years onto carrying out your research. There are many different types
of longitudinal research, such as those that involve tracking a cohort of subjects (such
as schoolchildren across grade levels), or those that involve time-series (such as
tracking a third world nation's economic development over four years or so). The
general rule is to use longitudinal research the greater the number of variables you've
got operating in your study and the more confident you want to be about cause and
effect.
METHODS OF MEASURING RELIABILITY
There are four good methods of measuring reliability:
test-retest
multiple forms
inter-rater
split-half
The test-retest in the same group technique is to administer your test, instrument,
survey, or measure to the same group of people at different points in time. Most
researchers administer what is called a pretest for this, and to troubleshoot bugs at the
same time. All reliability estimates are usually in the form of a correlation coefficient,
so here, all you do is calculate the correlation coefficient between the two scores of
each group and report it as your reliability coefficient.
The multiple forms technique has other names, such as parallel forms and disguised
test-retest, but it's simply the scrambling or mixing up of questions on your survey,
for example, and giving it to the same group twice. The idea is that it's a more
rigorous test of reliability.
Inter-rater reliability is most appropriate when you use assistants to do interviewing
or content analysis for you. To calculate this kind of reliability, all you do is report the
percentage of agreement on the same subject between your raters, or assistants.
Split-half reliability is estimated by taking half of your test, instrument, or survey,
and analyzing that half as if it were the whole thing. Then, you compare the results of
this analysis with your overall analysis. There are different variations of this
technique, one of the most common being called Cronbach's alpha (a frequently
reported reliability statistic) which correlates performance on each item with overall
score. Another technique, closer to the split-half method, is the Kuder-Richardson
coefficient, or KR-20. Statistical packages on most computers will calculate these for
you, although in graduate school, you'll have to do them by hand and understand that
all test statistics are derived from the principle that every observed score consists of a
true score and an error score.
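A minimal split-half sketch, assuming a small hypothetical matrix of 0/1 item responses: correlate the odd-item and even-item half scores, then apply the standard Spearman-Brown correction to estimate full-test reliability:

```python
from math import sqrt

# Hypothetical item-response matrix (rows = people, columns = items, 0/1 scored).
responses = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

odd_half  = [sum(row[0::2]) for row in responses]  # items 1, 3, 5
even_half = [sum(row[1::2]) for row in responses]  # items 2, 4, 6

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

r_half = pearson_r(odd_half, even_half)
r_full = 2 * r_half / (1 + r_half)   # Spearman-Brown corrected estimate
print(round(r_full, 2))
```

The correction is needed because the half-test correlation understates the reliability of the full-length test.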
METHODS OF MEASURING VALIDITY
There are four good methods of estimating validity:
face
content
criterion
construct
Face validity is the least statistical estimate (validity overall is not as easily
quantified as reliability), as it's simply an assertion on the researcher's part,
claiming that they've reasonably measured what they intended to measure. It's essentially a
"take my word for it" kind of validity. Usually, a researcher asks a colleague or expert
in the field to vouch for the items measuring what they were intended to measure.
Content validity goes back to the ideas of conceptualization and operationalization.
If the researcher has focused in too closely on only one type or narrow dimension of a
construct or concept, then it's conceivable that other indicators were overlooked. In
such a case, the study lacks content validity. Content validity is making sure you've
covered all the conceptual space. There are different ways to estimate it, but one of
the most common is a reliability approach where you correlate scores on one domain
or dimension of a concept on your pretest with scores on that same domain or
dimension on the actual test. Another way is to simply look over your inter-item correlations.
Criterion validity is using some standard or benchmark that is known to be a good
indicator. A researcher might have devised a police cynicism scale, for example, and
compare its Cronbach's alpha to the known Cronbach's alpha of, say,
Niederhoffer's cynicism scale. There are different forms of criterion validity:
concurrent validity is how well something estimates actual day-by-day behavior;
predictive validity is how well something estimates some future event or
manifestation that hasn't happened yet. The latter type is commonly found in
criminology. Suppose you are creating a scale that predicts how and when juveniles
become mass murderers. To establish predictive validity, you would have to find at
least one mass murderer, and investigate if the predictive factors on your scale,
retrospectively, affected them earlier in life. With criterion validity, you're concerned
with how well your items are determining your dependent variable.
Construct validity is the extent to which your items are tapping into the underlying
theory or model of behavior. It's how well the items hang together (convergent
validity) or distinguish different people on certain traits or behaviors (discriminant
validity). It's the most difficult validity to achieve. You have to either do years and
years of research or find a group of people to test that have the exact opposite traits or
behaviors you're interested in measuring.
A LIST OF THREATS TO RELIABILITY AND VALIDITY
I. Overview
The goal of this set of notes is to explore issues of reliability and validity as they apply
to psychological measurement. The approach will be to look at these issues by
examining a particular scale, the PTSD-Interview (PTSD-I: Watson, Juba, Manifold,
Kucala, & Anderson, 1991). The issues to be discussed include:
(a) How would you go about developing a scale to measure posttraumatic stress
disorder?
(b) What items would you include in your scale and how would you determine the
content validity of the scale?
(c) How would you determine the reliability of the scale?
(d) How would you determine the validity of the scale?
This web page will focus on the first three issues. A companion web page will
look at the validity question.
B. Content Validity
Content validity asks the question, "Do the items on the scale adequately sample the
domain of interest?"
If you are developing a test to diagnose PTSD then the test must adequately reflect all
of the DSM diagnostic criteria.
See handout: The PTSD-I Scale
Do the items on the PTSD-I scale adequately reflect the DSM-IV criteria? How
might you come to a quantitative decision about the content validity of the scale?
The DSM criteria are all or nothing. Participants respond to the PTSD-I items on
7-point scales:

1 = No / Never
2 = Very little / Very rarely
3 = A little / Sometimes
4 = Somewhat / Commonly
5 = Quite a bit / Often
6 = Very much / Very often
7 = Extremely / Always
How do you determine the minimum score you need to fit the all-or-nothing criteria
of the DSM? Is 3 "A little sometimes" enough to meet the criterion? Is 7
"extremely always" necessary to meet the criterion? If you make the criteria too strict
then you will underdiagnose PTSD. On the other hand, if you make the criteria too
lenient, you will overdiagnose PTSD.
One approach would be to go back to the DSM criteria to see if they give any
guidance. In this instance, are the DSM-IV guidelines helpful in determining the
minimum score that should be used to indicate that a specific criterion was met?
An empirical approach is to examine diagnostic utility indices at each possible cutting
score. You can choose a cut score for the items that will maximize sensitivity,
maximize specificity, or maximize efficiency. Watson et al. (1991) determined that a
score of 4 or greater on an item indicated that the DSM-III-R criterion was met. They
chose that score because it produced an optimal sensitivity/specificity balance.
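The cut-score search can be sketched as follows, with hypothetical item ratings and gold-standard diagnoses:

```python
# Sensitivity: proportion of true cases the cut score catches.
# Specificity: proportion of non-cases it correctly rejects.
# Efficiency: overall proportion of correct classifications.
def diagnostic_indices(scores, diagnosis, cut):
    flagged = [int(s >= cut) for s in scores]
    tp = sum(f * d for f, d in zip(flagged, diagnosis))
    tn = sum((1 - f) * (1 - d) for f, d in zip(flagged, diagnosis))
    pos = sum(diagnosis)
    neg = len(diagnosis) - pos
    return tp / pos, tn / neg, (tp + tn) / len(scores)

# Hypothetical 1-7 item ratings and gold-standard diagnoses (1 = PTSD).
scores    = [2, 3, 3, 4, 5, 6, 7, 1, 2, 2, 3, 4, 5, 2]
diagnosis = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

for cut in range(1, 8):
    sens, spec, eff = diagnostic_indices(scores, diagnosis, cut)
    print(cut, round(sens, 2), round(spec, 2), round(eff, 2))
```

Scanning the printout for the cut with the best sensitivity/specificity balance mirrors what Watson et al. did when they settled on a cut of 4.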
What are the advantages and disadvantages of using this type of continuous scale
rather than an all-or-nothing (yes or no) response scale?
If you are developing a content test for general psychology, how would you determine
if the test had adequate content validity?
D. Face Validity
Face validity refers to the issue of whether or not the items are measuring what they
appear, on the face of it, to measure.
Does the PTSD-I have face validity?
Does the MMPI have face validity?
B. Reliability Estimates

Single measure, measured at one testing session: HOMOGENEITY
Cronbach's coefficient alpha, r11 = (n * rii) / [1 + (n - 1) * rii], where n is the
number of items in the test and rii is the average correlation between all of the test
items. Error variance is due to content sampling. See also the computational formula
for Cronbach's alpha (footnote 1). (A graphic representation of reliability as a
function of n and rii accompanies the original table.)

Single measure, measured at different testing sessions: TEST-RETEST RELIABILITY or
STABILITY
Measured as the correlation between the same test given at different times. Error
variance is due to time sampling.

Different forms of the measure, measured at one testing session: EQUIVALENCE
Measured as the correlation between different forms of the test given at the same
time. Error variance is due to content sampling.

Different forms of the measure, measured at different testing sessions: STABILITY
AND EQUIVALENCE
Measured as the correlation between different forms of the test given at different
times. Error variance is due to content sampling and time sampling.

C. Reliability Standards
A good rule of thumb for reliability is that if the test is going to be used to make
decisions about people's lives (e.g., the test is used as a diagnostic tool that will
determine treatment, hospitalization, or promotion), then the minimum acceptable
coefficient alpha is .90.
This rule of thumb can be substantially relaxed if the test is going to be used for
research purposes only.
Here are the reliability estimates for the PTSD-I and the Posttraumatic Stress
Diagnostic Scale (PDS; Foa, 1995). The PTSD-I is based on the DSM-III-R. The PDS
is based on the DSM-IV.
Reliability Estimates for the PTSD-I and PDS

Single measure, measured at one testing session: HOMOGENEITY (alpha)
PTSD-I: .92 [1]; .87 to .93 [2,3]
PDS: .92 [6]

Single measure, measured at different testing sessions: TEST-RETEST RELIABILITY
PTSD-I: 1 week = .95 [1]; 90 days = .76 [2,4] and .91 [2,5]
PDS: 10-22 days = .83 [7]

Different forms of the measure (EQUIVALENCE; STABILITY AND EQUIVALENCE): no
estimates are reported.
t' = r11 * X1 = .90 * X1

For a test with reliability r11 = .90 and an obtained-score standard deviation of 6.06,
the standard deviation of the estimated true scores is .90 * 6.06 = 5.45, the standard
error of measurement is 6.06 * sqrt(1 - .90) = 1.9163, and the 95% confidence
interval around t' is t' +/- 1.96 * 1.9163 = t' +/- 3.76. For obtained deviation scores X1:

X1       t' = .90 * X1    95% CI lower    95% CI upper
-8.00    -7.20            -10.96          -3.44
-3.00    -2.70            -6.46           1.06
-2.00    -1.80            -5.56           1.96
-1.00    -0.90            -4.66           2.86
0.00     0.00             -3.76           3.76
1.00     0.90             -2.86           4.66
5.00     4.50             0.74            8.26
8.00     7.20             3.44            10.96
9.00     8.10             4.34            11.86

(The first two rows correspond to raw scores of 12 and 17 on a test with a mean of 20.)
Introductory level measurement books typically say that the confidence interval for an
obtained score can be constructed around that obtained score rather than around the
true score. They are technically incorrect, but the confidence interval so constructed
will not be too far off as long as the reliability of the test is high.
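Using the worked values above (r11 = .90, SD = 6.06), the estimated true score and its 95% confidence interval can be computed as:

```python
from math import sqrt

r11, sd = 0.90, 6.06
sem = sd * sqrt(1 - r11)      # standard error of measurement, ~1.92
half_width = 1.96 * sem       # 95% CI half-width, ~3.76

def true_score_ci(x_deviation):
    """CI around the estimated true (deviation) score t' = r11 * x."""
    t = r11 * x_deviation
    return round(t, 2), round(t - half_width, 2), round(t + half_width, 2)

print(true_score_ci(-8.0))   # (-7.2, -10.96, -3.44), matching the -8.00 row above
```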
F. Interrater reliability
Interrater reliability is concerned with the consistency between the judgments of 2 or
more raters.
In the past, interrater reliability has been measured by having 2 (or more) people
make decisions (e.g. meets PTSD diagnosis criteria) or ratings across a number of
cases and then find the correlation between those two sets of decisions or ratings. That
method gives an overestimate of the interrater reliability so it is rarely used.
One of the more commonly used measures of interrater reliability is kappa. Kappa
takes into account the expected level of agreement between judges or raters. For that
reason it is considered to be a more appropriate measure of interrater reliability. For
a discussion of kappa and how to compute it using SPSS, see Crosstabs: Kappa.
The interrater reliability indices for the PTSD-I are shown below.
Interrater Reliability for the PTSD-I and PDS
PTSD-I: kappa = .61 [1]
PTSD-I: kappa > .84 [2]
PDS: kappa = .77 [3]
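A sketch of kappa computed by hand from a hypothetical 2x2 agreement table (observed agreement corrected for the agreement expected by chance):

```python
# Hypothetical counts of two raters' yes/no diagnostic decisions:
#                 rater B: yes   rater B: no
# rater A: yes         20            5
# rater A: no           5           70
a_yes_b_yes, a_yes_b_no, a_no_b_yes, a_no_b_no = 20, 5, 5, 70
n = a_yes_b_yes + a_yes_b_no + a_no_b_yes + a_no_b_no

p_observed = (a_yes_b_yes + a_no_b_no) / n            # raw agreement
p_yes = ((a_yes_b_yes + a_yes_b_no) / n) * ((a_yes_b_yes + a_no_b_yes) / n)
p_no  = ((a_no_b_yes + a_no_b_no) / n) * ((a_yes_b_no + a_no_b_no) / n)
p_chance = p_yes + p_no                               # agreement expected by chance

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 2))
```

Note how the raw agreement of .90 shrinks once chance agreement is removed, which is exactly why kappa is preferred over a simple correlation or percent agreement.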
Consider a test's classifications of 100 cases against the diagnostic criteria:

                                 Meets Criteria    Does not Meet Criteria    Total
Test: Meets Criteria             n = 20            n = 10                    n = 30
Test: Does not Meet Criteria     n = 5             n = 65                    n = 70
Total                            n = 25            n = 75                    n = 100

Sensitivity is the column percent among those who actually meet the criteria:
20/25 = .80. The missed cases (n = 5) are false negatives (Type II errors).
Specificity is the column percent among those who do not meet the criteria:
65/75 = .87. The incorrectly flagged cases (n = 10) are false positives (Type I errors).
Diagnostic utility indices:

               PTSD-I    PDS
Sensitivity    .89       .82
Specificity    .94       .77
Efficiency     .92       .79
References
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests.
Psychometrika, 16, 297-334.
Foa, E. B. (1995). Posttraumatic Stress Diagnostic Scale. Minneapolis: National
Computer Systems.
Robins, L. H., & Helzer, J. E. (1985). Diagnostic Interview Schedule (DIS Version
III-A). St. Louis, MO: Washington University, Department of Psychiatry.
Watson, C. G., Juba, M. P., Manifold, V., Kucala, T., & Anderson, P. E. D. (1991).
The PTSD interview: Rationale, description, reliability, and concurrent validity of a
DSM-III based technique. Journal of Clinical Psychology, 47, 179-188.
Williams, J. B. W., Gibbon, M., First, M. B., Spitzer, R. L., Davies, M., Borus, J.,
Howes, M. J., Kane, J., Pope, H. G., Rounsaville, B., & Wittchen, H. U. (1992). The
Structured Clinical Interview for DSM-III-R (SCID). Archives of General Psychiatry,
49, 630-636.
Wilson, S. A., Tinker, R. H., Becker, L. A., & Gillette, C. S. (1994, November). Using
the PTSD-I as an outcome measure. Poster presented at the annual meeting of the
International Society for Traumatic Stress Studies, Chicago, IL.
Footnotes
1. The computational formula for Cronbach's alpha uses the standard deviations of
each of the items and the standard deviation of the test as a whole rather than the
average intercorrelation between all of the items:

alpha = [n / (n - 1)] * [1 - (sum of the item variances) / (variance of the total test scores)]

where n is the number of items in the test, the sum of the item variances is the sum of
the squared standard deviations of each of the items, and the variance of the total test
scores is the squared standard deviation of the test as a whole.
Or, you could use a computer program to compute alpha. In SPSS for Windows,
Cronbach's alpha can be found under:
Statistics
Scale
Reliability analysis ...
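Outside SPSS, the computational formula in footnote 1 is easy to apply directly; a sketch with a hypothetical response matrix:

```python
# Cronbach's alpha via the computational formula:
# alpha = (n / (n - 1)) * (1 - sum of item variances / variance of total scores).
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Hypothetical responses (rows = people, columns = items, rated 1-5).
responses = [
    [3, 4, 3, 5],
    [2, 2, 3, 2],
    [4, 5, 5, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 4],
]
n_items = len(responses[0])
item_vars = [variance([row[i] for row in responses]) for i in range(n_items)]
total_var = variance([sum(row) for row in responses])

alpha = (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))
```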
V. Validity
Validity refers to the issue of whether the test measures what it is intending to
measure. Does a test of, say, mathematics ability measure that ability, or is reading
comprehension a part of what is measured by the test? The validity of a test is
constrained by its reliability. If a test does not consistently measure a construct or
domain, then it cannot be expected to have high validity coefficients.
A. Criterion-related validity
Predictive validity. Criterion related validity refers to how strongly the scores on the
test are related to other behaviors. A test may be used to predict future behavior, e.g.,
will people who score high on the "in basket" test be good supervisors if they are
promoted, or will people who score high on the GRE exam be successful in graduate
school? In order to find the relationship between the test and some behavior you need
to clearly specify the behavior that you want to predict, that is, you need to specify the
criterion. This is not often an easy task. What do you mean by being a "good
supervisor" or "success in graduate school" and how would you measure those
characteristics?
As far as I know there are no studies that have looked at the predictive validity of the
PTSD-I. How would you devise a study to explore the predictive validity of the
PTSD-I?
Concurrent validity. Rather than predicting future behavior you may be interested
in the relationship between your test and other tests that purport to measure the same
domain. For example, if you are creating a new, shorter, less costly test of
intelligence, you would be interested in the relationship between your test and a
standard test of intelligence, for example the WAIS. If you gave both tests at the
same time and found the correlation between the two you would be determining the
concurrent validity of your test.
Watson et al. (1991) examined the concurrent validity of the PTSD-I and the
posttraumatic stress disorder section of the Diagnostic Interview Schedule (DIS;
Robins & Helzer, 1985). As mentioned earlier, the DIS is a commonly used diagnostic
measure of PTSD. They computed the correlations between the individual items of
the PTSD-I and the corresponding items of the DIS and found that they ranged from
.57 to .99, with a median correlation of .77. They also looked at the correlation
between the total score of the PTSD-I and the diagnosis of PTSD based on the DIS;
it was .94. We
know from our earlier discussion that the reliability of a test depends upon the number
of items. So we would expect that the reliability of the total test score would be
higher than the reliability of the individual items on the test.
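That expectation follows from the standard Spearman-Brown prophecy formula, which gives the reliability of a test lengthened by a factor k; a sketch with a hypothetical single-item reliability:

```python
# Spearman-Brown prophecy formula: r_kk = k*r / (1 + (k - 1)*r), where r is
# the reliability of one item (or unit) and k is the lengthening factor.
def spearman_brown(r, k):
    return k * r / (1 + (k - 1) * r)

single_item_r = 0.40                      # hypothetical single-item reliability
for k in (1, 5, 10, 20):
    print(k, round(spearman_brown(single_item_r, k), 2))
```

Even modestly reliable items aggregate into a highly reliable total score, which is why the PTSD-I total correlates more strongly with the DIS diagnosis than its individual items do.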
Wilson et al. (1994) administered the PTSD-I, the Impact of Events Scale (IES;
Horowitz, Wilner, & Alvarez, 1979), and the Symptom Check List-90-Revised
(SCL-90-R; Derogatis, 1992) to 80 adult participants in a study of traumatic memory.
The IES scale measures the PTSD symptoms of intrusion and avoidance. It was not
designed to be used to diagnose PTSD. The SCL-90-R measures the occurrence of 9
psychological symptoms for psychiatric and medical patients. Many of the items on
the PTSD-I are phrased in terms of the lifetime incidence of PTSD symptoms. For
example, item C-5 reads, "Have you felt more cut off emotionally from other people
at some period than you did before (the stressor)?" This form of the PTSD-I was used
at the pretest, prior to the administration of the treatment. The concurrent validities of
the PTSD-I with the IES and the Global Severity Index (GSI) of the SCL-90-R were
.54 and .66, respectively. We belatedly recognized that questions of that form would
not be useful in measuring changes due to treatment, so we revised the time frame of
the scale to match that used by the IES and the SCL-90-R, which was one week.
Using the one-week time frame for the PTSD-I, the concurrent validities of the
PTSD-I with the IES and the GSI were substantially higher, .85 and .88 respectively;
see Table 1.
Why do you think that the two versions of the PTSD-I would have such different
concurrent validities? Which are the "best" concurrent validities?
Table 1. Concurrent Validities for the PTSD-I and the PDS

PTSD-I
  PTSD diagnosis made by the DIS: .94 [1]
  Impact of Events Scale (IES), PTSD-I symptoms as lifetime measure: .54 [2]
  Impact of Events Scale (IES), PTSD-I symptoms within the past week: .85 [2]
  Global Severity Index (GSI) of the SCL-90-R, PTSD-I symptoms as lifetime
  measure: .66 [2]
  Global Severity Index (GSI) of the SCL-90-R, PTSD-I symptoms within the past
  week: .88 [2]

PDS (symptom severity score)
  IES-I = .80 [3]
  IES-A = .66 [3]
  Beck Depression Inventory (BDI) = .79 [3]
  State-Trait Anxiety Scale (STAI-State) = .73 [3]
  State-Trait Anxiety Scale (STAI-Trait) = .74 [3]

1. Watson et al. (1991).
B. Construct validity
When you ask about construct validity you are taking a broad view of your test. Does
the test adequately measure the underlying construct? The question is asked both in
terms of convergent validity (are test scores related to the behaviors and tests they
should be related to?) and in terms of divergent validity (are test scores unrelated to
the behaviors and tests they should be unrelated to?).
There is no single measure of construct validity. Construct validity is based on the
accumulation of knowledge about the test and its relationship to other tests and
behaviors.
Convergent validity
Correlational approach. The concurrent validities between the test and other
measures of the same domain are correlational measures of convergent validity.
Contrasted groups. Another way of measuring convergent validity is to look at
the differences in test scores between groups of people who would be expected to
score differently on the test. For example, Watson et al. (1991) looked at the
differences in PTSD-I scores for those who were and were not diagnosed with PTSD
by the DIS. They found that the PTSD-I score for those diagnosed with PTSD was
higher (M = 58.2, SD = 14.5) than for those not diagnosed with PTSD (M = 28.0,
SD = 12.2), t(59) = 8.68, p < .0001.
Experimental. A meaningful treatment effect size demonstrates experimental
intervention validity for a measure. Wilson et al. (1994) computed effect sizes
between the EMDR treated group and the delayed treatment group by dividing the
difference between the two groups by the standard deviation of the delayed treatment
control group (Glass, McGaw, & Smith, 1981). The effect size for the 7-day version
of the PTSD-I was 1.28. The comparable effect sizes for the IES and GSI were 1.41
and 0.66, respectively. You might also report significance tests as measures of
experimental convergent validity, but effect sizes are more informative.
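A sketch of that effect-size computation (Glass, McGaw, & Smith, 1981), with hypothetical posttest symptom scores:

```python
from math import sqrt

def glass_delta(treated, control):
    """Mean difference divided by the control group's standard deviation."""
    m_t = sum(treated) / len(treated)
    m_c = sum(control) / len(control)
    sd_c = sqrt(sum((x - m_c) ** 2 for x in control) / (len(control) - 1))
    # Lower symptom scores after treatment yield a positive effect size.
    return (m_c - m_t) / sd_c

treated = [22, 25, 30, 18, 27]   # hypothetical posttest PTSD symptom scores
control = [40, 35, 45, 38, 42]
print(round(glass_delta(treated, control), 2))
```

Dividing by the control group's standard deviation (rather than a pooled one) keeps the metric in units unaffected by the treatment itself, which is the rationale Glass et al. give for this choice.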
Discriminant validity
Discriminant validity (are the test scores unrelated to tests and behaviors in different
domains?) seems to be assessed less often than convergent validity. But the question
of discriminant validity is important when you are trying to distinguish your theory
from another theory. The subtitle of Daniel Goleman's book Emotional Intelligence
is "Why it can matter more than IQ." His argument is that emotional IQ is different
from traditional IQ and so measures of emotional IQ should not correlate very highly
with measures of traditional IQ. This is a question of discriminant validity.
References
Derogatis, L. R. (1992). SCL-90-R: Administration, scoring and procedures manual-II. Baltimore, MD: Clinical Psychometric Research.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research.
Beverly Hills, CA: Sage Publications.
Goleman, D. (1995). Emotional intelligence: Why it can matter more than IQ. New
York: Bantam Books.
Horowitz, M. J., Wilner, N., & Alvarez, W. (1979). Impact of Events Scale: A
measure of subjective distress. Psychosomatic Medicine, 41, 209-218.
Robins, L. H., & Helzer, J. E. (1985). Diagnostic Interview Schedule (DIS Version
III-A). St. Louis, MO: Washington University, Department of Psychiatry.
Watson, C. G., Juba, M. P., Manifold, V., Kucala, T., & Anderson, P. E. D. (1991).
The PTSD interview: Rationale, description, reliability, and concurrent validity of a
DSM-III based technique. Journal of Clinical Psychology, 47, 179-188.
Wilson, S. A., Tinker, R. H., Becker, L. A., & Gillette, C. S. (1994, November). Using
the PTSD-I as an outcome measure. Poster presented at the annual meeting of the
International Society for Traumatic Stress Studies, Chicago, IL.