
Item Analysis
By Susan Eaves | Bradley Erford
Updated on Dec 23, 2009

Item analysis is a general term for the methods used in education to evaluate test items, typically for the purposes of test construction and revision. Regarded as one of the most important aspects of test construction, and one receiving increasing attention, it is an approach incorporated into item response theory (IRT), which serves as an alternative to classical measurement theory (CMT), also known as classical test theory (CTT). Classical measurement theory considers an observed score to be the direct result of a person's true score plus error. This error is of particular interest because earlier measurement theories have been unable to specify its source. Item response theory, by contrast, uses item analysis to differentiate among types of error in order to gain a clearer understanding of any existing deficiencies. Particular attention is given to individual test items, item characteristics, the probability of answering items correctly, the overall ability of the test taker, and the degrees or levels of knowledge being assessed.
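This classical model is conventionally written as X = T + E, where X is the observed score, T is the true score, and E is the error component; item response theory, by contrast, models the probability of a correct response on each item as a function of the test taker's ability and the characteristics of the item.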

THE PURPOSE OF ITEM ANALYSIS


There must be a match between what is taught and what is assessed. However, there must also be
an effort to test for more complex levels of understanding, with care taken to avoid over-sampling
items that assess only basic levels of knowledge. Tests that are too difficult (and have an insufficient floor) tend to frustrate students and deflate scores, whereas tests that are too easy (and have an insufficient ceiling) tend to undermine motivation and inflate scores.
Tests can be improved by maintaining and developing a pool of valid items from which future tests
can be drawn and that cover a reasonable span of difficulty levels.
Item analysis helps improve test items and identify unfair or biased items. Results should be used to
refine test item wording. In addition, closer examination of items will also reveal which questions
were most difficult, perhaps indicating a concept that needs to be taught more thoroughly. If a
particular distracter (that is, an incorrect answer choice) is the most often chosen answer, and
especially if that distracter positively correlates with a high total score, the item must be examined
more closely for correctness. This situation also provides an opportunity to identify and examine
common misconceptions among students about a particular concept.

In general, once test items have been created, their value can be systematically assessed using three methods representative of item analysis: a) a test item's level of difficulty, b) an item's capacity to discriminate, and c) the item characteristic curve. Difficulty is assessed by examining the number of persons who endorse the correct answer. Discrimination is examined by comparing how persons who get a particular item correct perform on the test as a whole. Finally, the item characteristic curve plots the likelihood of answering an item correctly against the level of success on the overall test.

ITEM DIFFICULTY
In test construction, item difficulty is determined by the number of people who answer a particular
test item correctly. For example, if the first question on a test was answered correctly by 76% of the
class, then the difficulty level (p or percentage passing) for that question is p = .76. If the second
question on a test was answered correctly by only 48% of the class, then the difficulty level for that
question is p = .48. The higher the percentage of people who answer correctly, the easier the item,
so that a difficulty level of .48 indicates that question two was more difficult than question one,
which had a difficulty level of .76.
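As a minimal sketch (the response data and function name below are illustrative, not from the article), item difficulty can be computed in Python by taking, for each item, the proportion of test takers who answered it correctly:

# Item difficulty p = proportion of test takers who answer the item correctly.
# Each row is one student; each column is one item (1 = correct, 0 = incorrect).
responses = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
]

def item_difficulty(responses):
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[item] for row in responses) / n_students for item in range(n_items)]

print(item_difficulty(responses))  # [0.75, 0.5, 0.75]: the second item is the most difficult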
Many educators find themselves wondering how difficult a good test item should be. Several things
must be taken into consideration in order to determine appropriate difficulty level. The first task of
any test maker should be to determine the probability of answering an item correctly by chance
alone, also referred to as guessing or luck. For example, a true-false item, because it has only two
choices, could be answered correctly by chance half of the time. Therefore, a true-false item with a
demonstrated difficulty level of only p = .50 would not be a good test item because that level of
success could be achieved through guessing alone and would not be an actual indication of
knowledge or ability level. Similarly, a multiple-choice item with five alternatives could be answered
correctly by chance 20% of the time. Therefore, an item difficulty greater than .20 would be
necessary in order to discriminate between respondents' ability to guess correctly and
respondents' level of knowledge. Desirable difficulty levels usually can be estimated as halfway
between 100 percent and the percentage of success expected by guessing. So, the desirable
difficulty level for a true-false item, for example, should be around p = .75, which is halfway between
100% and 50% correct.
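A brief sketch of this halfway rule (the function name is illustrative) makes the arithmetic explicit:

def desirable_difficulty(chance_level):
    # Halfway between 100% (1.0) and the proportion expected by guessing alone.
    return (1.0 + chance_level) / 2

print(desirable_difficulty(0.50))  # true-false item: 0.75
print(desirable_difficulty(0.20))  # five-option multiple-choice item: 0.60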
In most instances, it is desirable for a test to contain items of various difficulty levels in order to
distinguish between students who are not prepared at all, students who are fairly prepared, and
students who are well prepared. In other words, educators do not want the same level of success
for those students who did not study as for those who studied a fair amount, or for those who
studied a fair amount and those who studied exceptionally hard. Therefore, it is necessary for a test
to be composed of items of varying levels of difficulty. As a general rule for norm-referenced tests,
items in the difficulty range of .30 to .70 yield important differences between individuals' level of
knowledge, ability, and preparedness. There are a few exceptions to this, however, with regard to
the purpose of the test and the characteristics of the test takers. For instance, if the test is to help
determine entrance into graduate school, the items should be more difficult to be able to make
finer distinctions between test takers. For a criterion-referenced test, most of the item difficulties
should be clustered around the criterion cut-off score or higher. For example, if a passing score is
70%, the vast majority of items should have percentage passing values of p = .60 or higher, with a number of items in the p > .90 range to enhance motivation and test for mastery of certain essential concepts.

Figure 1. Illustration by GGS Information Services. Cengage Learning, Gale.

DISCRIMINATION INDEX
According to Wilson (2005), item difficulty is the most essential component of item analysis.
However, it is not the only way to evaluate test items. Discrimination goes beyond determining the
proportion of people who answer correctly and looks more specifically at who answers correctly. In
other words, item discrimination determines whether those who did well on the entire test did well
on a particular item. An item should in fact be able to discriminate between upper and lower
scoring groups. Membership in these groups is usually determined based on their total test score,
and it is expected that those scoring higher on the overall test will also be more likely to endorse
the correct response on a particular item. Sometimes an item discriminates negatively; that is, a larger proportion of the lower scoring group selects the correct response than of the higher scoring group. Such an item should be revised or discarded.
One way to determine an item's power to discriminate is to compare those who have done very
well with those who have done very poorly, known as the extreme group method. First, identify the
students who scored in the top one-third as well as those in the bottom one-third of the class. Next,
calculate the proportion of each group that answered a particular test item correctly (i.e.,
percentage passing for the high and low groups on each item). Finally, subtract the p of the bottom
performing group from the p for the top performing group to yield an item discrimination index (D).
Item discriminations of D = .50 or higher are considered excellent. D = 0 means the item has no
discrimination ability, while D = 1.00 means the item has perfect discrimination ability.
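A short sketch of the extreme group method described above (the names are illustrative; some texts use the top and bottom 27% rather than thirds):

def discrimination_index(total_scores, item_scores):
    # D = p(top third) - p(bottom third), where p is the proportion answering the item correctly.
    ranked = sorted(zip(total_scores, item_scores), key=lambda pair: pair[0])
    group_size = max(1, len(ranked) // 3)
    bottom = [correct for _, correct in ranked[:group_size]]
    top = [correct for _, correct in ranked[-group_size:]]
    return sum(top) / len(top) - sum(bottom) / len(bottom)

# For the article's Item 1, p(top) = .92 and p(bottom) = .40 would yield D = .52.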
In Figure 1, it can be seen that Item 1 discriminates well, with those in the top performing group obtaining the correct response far more often (p = .92) than those in the low performing group (p = .40), resulting in an index of .52 (i.e., .92 - .40 = .52). Next, Item 2 is not difficult enough; with a discrimination index of only .04, it was not useful in discriminating between the high and low scoring individuals. Finally, Item 3 is in need of revision or discarding because it discriminates negatively, meaning low performing group members actually obtained the correct keyed answer more often than high performing group members.

Figure 2. Illustration by GGS Information Services. Cengage Learning, Gale.
Another way to determine the discriminability of an item is to determine the correlation coefficient
between performance on an item and performance on a test, or the tendency of students selecting
the correct answer to have high overall scores. This coefficient is reported as the item
discrimination coefficient, or the point-biserial correlation between item score (usually scored right
or wrong) and total test score. This coefficient should be positive, indicating that students
answering correctly tend to have higher overall scores or that students answering incorrectly tend
to have lower overall scores. Also, the higher the magnitude, the better the item discriminates. The
point-biserial correlation can be computed with procedures outlined in Figure 2.
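Figure 2's worked procedure is not reproduced here, but a standard form of the point-biserial correlation can be sketched as follows (the function and variable names are illustrative):

import math

def point_biserial(item_scores, total_scores):
    # Correlation between a dichotomous item (1 = correct, 0 = incorrect) and total test score.
    n = len(item_scores)
    p = sum(item_scores) / n                      # proportion answering the item correctly
    q = 1 - p
    mean_total = sum(total_scores) / n
    sd_total = math.sqrt(sum((x - mean_total) ** 2 for x in total_scores) / n)
    mean_correct = sum(t for i, t in zip(item_scores, total_scores) if i == 1) / sum(item_scores)
    return (mean_correct - mean_total) / sd_total * math.sqrt(p / q)

In practice, the item being analyzed is sometimes removed from the total score before the correlation is computed (a corrected item-total correlation) so that the item does not inflate its own coefficient.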
In Figure 2, the point-biserial correlation between item score and total score is evaluated similarly
to the extreme group discrimination index. If the resulting value is negative or low, the item should
be revised or discarded. The closer the value is to 1.0, the stronger the item's discrimination power; the closer the value is to 0, the weaker the power. Items that are very easy and answered correctly by the majority of respondents will have poor point-biserial correlations.

Figure 3. Illustration by GGS Information Services. Cengage Learning, Gale.

CHARACTERISTIC CURVE

A third parameter used to conduct item analysis is known as the item characteristic curve (ICC). This is a graphical depiction of the characteristics of a particular item; taken collectively, such curves can represent the entire test. In an item characteristic curve, the total test score is represented on the horizontal axis, and the proportion of test takers within that range of test scores who pass the item is scaled along the vertical axis.
For Figure 3, three separate item characteristic curves are shown. Line A is considered a flat curve
and indicates that test takers at all score levels were equally likely to get the item correct. This item
was therefore not a useful discriminating item. Line B demonstrates a troublesome item as it
gradually rises and then drops for those scoring highest on the overall test. Though this is unusual,
it can sometimes result from those who studied most having ruled out the answer that was keyed
as correct. Finally, Line C shows the item characteristic curve for a good test item. The gradual and
consistent positive slope shows that the proportion of people passing the item gradually increases
as test scores increase. Though it is not depicted here, an ICC in the shape of a backward S would indicate negative item discrimination, meaning that those who scored lowest were most likely to endorse the correct response on the item.
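One simple way to approximate an empirical item characteristic curve (the binning scheme and names below are illustrative, not from the article) is to band test takers by total score and compute the proportion passing the item in each band:

def empirical_icc(total_scores, item_scores, n_bands=5):
    # Returns (band midpoint, proportion passing) pairs; plotting them approximates the ICC.
    lo, hi = min(total_scores), max(total_scores)
    width = (hi - lo) / n_bands or 1          # guard against a zero-width band
    bands = [[] for _ in range(n_bands)]
    for total, correct in zip(total_scores, item_scores):
        index = min(int((total - lo) / width), n_bands - 1)
        bands[index].append(correct)
    return [(lo + width * (band + 0.5), sum(items) / len(items))
            for band, items in enumerate(bands) if items]

# A roughly flat set of proportions corresponds to Line A; a steadily rising set to Line C.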
See also: Item Response Theory

BIBLIOGRAPHY
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice
Hall.
Brown, F. (1983). Principles of educational and psychological testing (3rd ed.). New York: Holt, Rinehart, & Winston.
DeVellis, R. (2003). Scale development: Theory and applications (2nd ed.). Thousand Oaks, CA:
Sage.
Gronlund, N. (1993). How to make achievement tests and assessments (5th ed.). Boston: Allyn and Bacon.
Kaplan, R., & Saccuzzo, D. (2004). Psychological testing: Principles, applications, and issues (6th ed.). Pacific Grove, CA: Brooks/Cole.
Kehoe, J. (1995). Basic item analysis for multiple-choice tests. Practical Assessment, Research &
Evaluation, 4(10), retrieved April 1, 2008, from http://pareonline.net/getvn.asp?v=4&n=10.
Patten, M. (2001). Questionnaire research: A practical guide (2nd ed.). Los Angeles: Pyrczak.
Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, NJ:
Lawrence Erlbaum.
Copyright 2003-2009 The Gale Group, Inc. All rights reserved.