
Information Retrieval

Chap 3. Retrieval Evaluation


Retrieval Performance Evaluation

Measures of retrieval system performance


Recall and Precision

How completely the relevant documents are retrieved (recall)

How precise the answer set is (precision)

Measures of system performance
Time and Space

The shorter the response time and the smaller the space used,
the better the system is

2
Experiment Settings

Real-life experiments

Became a lot more common in the 1990s

Laboratory experiments
Still dominant

Two main reasons

Repeatability
Scalability

3
Recall and Precision

Recall = |Ra| / |R|        Precision = |Ra| / |A|

R = the set of relevant docs


|R| = the number of relevant docs

A = the docs retrieved (answer set)


|A| = the number of retrieved docs

|Ra| = the number of docs in the intersection of R and A

4
Recall and Precision

Recall = |Ra| / |R|        Precision = |Ra| / |A|

Recall is the fraction of the relevant docs that is retrieved

Precision is the fraction of the retrieved docs that is relevant

5
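To make the two formulas concrete, here is a minimal sketch in plain Python; the document ids and variable names are illustrative, not from the slides.

    # Illustrative sketch of the definitions above (hypothetical document ids).
    R = {"d1", "d2", "d3", "d4"}          # relevant documents
    A = {"d1", "d3", "d7", "d9", "d10"}   # answer set (retrieved documents)

    Ra = R & A                            # relevant documents that were retrieved
    recall = len(Ra) / len(R)             # |Ra| / |R|  -> 0.5
    precision = len(Ra) / len(A)          # |Ra| / |A|  -> 0.4
    print(recall, precision)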
High-Recall and High-Precision

6
Recall-Precision Curve: Example

Rq contains the relevant documents for query q


Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
Ranking of the documents retrieved for query q:
1. d123    6. d9      11. d38
2. d84     7. d511    12. d48
3. d56     8. d129    13. d250
4. d6      9. d187    14. d113
5. d8      10. d25    15. d3

In the original figure, documents relevant to query q (the documents
in Rq) are marked with a bullet after the document id.

7
Ranked Retrieval: Example

Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}

Ranked documents:
1. d123    6. d9      11. d38
2. d84     7. d511    12. d48
3. d56     8. d129    13. d250
4. d6      9. d187    14. d113
5. d8      10. d25    15. d3

 n   relevant   recall       precision
 1   yes        1/10 = 0.1   1/1  = 1.00
 2   no         1/10 = 0.1   1/2  = 0.50
 3   yes        2/10 = 0.2   2/3  = 0.67
 4   no         2/10 = 0.2   2/4  = 0.50
 5   no         2/10 = 0.2   2/5  = 0.40
 6   yes        3/10 = 0.3   3/6  = 0.50
 7   no         3/10 = 0.3   3/7  = 0.43
 8   no         3/10 = 0.3   3/8  = 0.38
 9   no         3/10 = 0.3   3/9  = 0.33
10   yes        4/10 = 0.4   4/10 = 0.40
11   no         4/10 = 0.4   4/11 = 0.36
12   no         4/10 = 0.4   4/12 = 0.33
13   no         4/10 = 0.4   4/13 = 0.31
14   no         4/10 = 0.4   4/14 = 0.29
15   yes        5/10 = 0.5   5/15 = 0.33

8
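The recall and precision columns above can be reproduced with a short loop; this is a sketch in plain Python (the loop structure is mine, the data are taken from the example).

    # Recompute recall and precision at each rank of the example ranking.
    rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
    ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
               "d187", "d25", "d38", "d48", "d250", "d113", "d3"]

    hits = 0
    for n, doc in enumerate(ranking, start=1):
        if doc in rq:
            hits += 1
        # recall = relevant seen so far / |Rq|; precision = relevant seen so far / n
        print(n, doc in rq, hits / len(rq), hits / n)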
Recall-Precision Curve: Example

The precision at recall levels higher than 50% drops to 0

because not all relevant documents are retrieved
(only 5 of the 10 relevant documents appear in the ranking)
9
Average Precision

The average precision at recall level r:

P(r) = (1/Nq) * Σ_{i=1..Nq} Pi(r)

Nq = the number of queries used

Pi(r) = the precision at recall level r for the i-th query

10
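A sketch of the averaging above, assuming the per-query precision values at a given recall level are already available; the numbers are hypothetical.

    # Average precision at recall level r over Nq queries (hypothetical values).
    precisions_at_r = [0.67, 0.50, 0.40]          # Pi(r) for i = 1..Nq
    nq = len(precisions_at_r)                     # Nq
    avg_p_at_r = sum(precisions_at_r) / nq        # P(r) = (1/Nq) * sum_i Pi(r)
    print(round(avg_p_at_r, 2))                   # 0.52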
Interpolated Precision

The interpolated precision at the j-th standard recall level is

the maximum precision between the j-th and the (j+1)-th recall level:

P(rj) = max P(r)   for rj <= r <= rj+1

rj = the j-th standard recall level, j ∈ {0, 1, 2, ..., 10}

e.g., r3 = the recall level 30%

If the interpolated precision values are used,
the curve is non-increasing

11
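A sketch of 11-point interpolation in plain Python. It uses the common convention of taking the maximum precision at any recall level at or above rj, which yields the non-increasing curve described above; the helper name is mine.

    # 11-point interpolated precision (sketch).
    def interpolated_11pt(points):
        # points: list of (recall, precision) pairs measured at the relevant documents
        levels = [j / 10 for j in range(11)]
        interp = []
        for rj in levels:
            candidates = [p for r, p in points if r >= rj]
            interp.append(max(candidates) if candidates else 0.0)
        return interp

    # (recall, precision) at the relevant documents of the example on slide 8
    points = [(0.1, 1.00), (0.2, 0.67), (0.3, 0.50), (0.4, 0.40), (0.5, 0.33)]
    print(interpolated_11pt(points))
    # [1.0, 1.0, 0.67, 0.5, 0.4, 0.33, 0.0, 0.0, 0.0, 0.0, 0.0]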
Ranked Retrieval: Example

Rq = {d3, d56, d129}

Ranked documents:
1. d123    6. d9      11. d38
2. d84     7. d511    12. d48
3. d56     8. d129    13. d250
4. d6      9. d187    14. d113
5. d8      10. d25    15. d3

 n   relevant   recall   precision
 1   no         ?        ?
 2   no         ?        ?
 3   yes        ?        ?
 4   no         ?        ?
 5   no         ?        ?
 6   no         ?        ?
 7   no         ?        ?
 8   yes        ?        ?
 9   no         ?        ?
10   no         ?        ?
11   no         ?        ?
12   no         ?        ?
13   no         ?        ?
14   no         ?        ?
15   yes        ?        ?

12
Interpolated Precision: Example

Rq = {d3, d56, d129}

Ranked documents:
1. d123    6. d9      11. d38
2. d84     7. d511    12. d48
3. d56     8. d129    13. d250
4. d6      9. d187    14. d113
5. d8      10. d25    15. d3

recall   precision   interpolated precision
  0%     -           ?
 10%     ?           ?
 20%     ?           ?
 30%     ?           ?
 40%     ?           ?
 50%     ?           ?
 60%     ?           ?
 70%     ?           ?
 80%     ?           ?
 90%     ?           ?
100%     ?           ?
13
Recall and Precision Curve: Example
Interpolated Precision

Interpolated precision at the 11 standard recall levels relative to
Rq = {d3, d56, d129}

14
Performance of Distinct Retrieval Algorithms

One algorithm has higher precision at lower recall levels

The other algorithm is superior at higher recall levels

15
Horizontal Recall-Precision Line

If a perfect ranking algorithm could be developed:

All relevant documents would be ranked ahead of all irrelevant documents
Precision would be 100 percent at all recall levels
The recall-precision curve would be a horizontal line at 100 percent

16
Compare Ranking Algorithms

Plot the recall-precision curves


If one curve lies completely above the other,
that algorithm is better

[Figure: recall plotted against precision for two systems, red and black]
- The red system appears better than the black
- In practice, the curves usually intersect, perhaps several times

17
Recall and Precision: Example

Collection of 10,000 documents, 50 on a specific topic

An ideal search finds these 50 documents and rejects all others
The actual search retrieves 25 documents; 20 are relevant
but 5 are on other topics
Precision: ?
Recall: ?

18
Recall and Precision: Example

Collection of 10,000 documents, 50 on a specific topic

An ideal search finds these 50 documents and rejects all others
The actual search retrieves 25 documents; 20 are relevant
but 5 are on other topics
Precision: 20/25 = 0.8
Recall: 20/50 = 0.4

19
Measuring Precision and Recall

Precision is easy to measure:


A knowledgeable person looks at each document that is
retrieved and decides whether it is relevant
In the example, only the 25 documents that are found
need to be examined
Recall is difficult to measure:
To know all relevant items, a knowledgeable person
must go through the entire collection, looking at every
object to decide if it fits the criteria
In the example, all 10,000 documents must be examined
20
Relevance as a set comparison

D = the set of documents

A = the set of documents that satisfy some user-based criterion

B = the set of documents identified by the search system

21
Measures based on relevance

recall = (retrieved and relevant) / relevant = |A ∩ B| / |A|

precision = (retrieved and relevant) / retrieved = |A ∩ B| / |B|

fallout = (retrieved and not relevant) / not relevant = |B - (A ∩ B)| / |D - A|

22
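A sketch of the three set-based measures in plain Python, using this slide's notation (D = all documents, A = relevant, B = retrieved); the sets are hypothetical.

    # Set-based recall, precision, and fallout (hypothetical sets).
    D = {f"d{i}" for i in range(1, 10001)}     # collection
    A = {"d1", "d2", "d3", "d4", "d5"}         # relevant documents
    B = {"d1", "d2", "d9", "d10"}              # retrieved documents

    recall    = len(A & B) / len(A)            # |A ∩ B| / |A|
    precision = len(A & B) / len(B)            # |A ∩ B| / |B|
    fallout   = len(B - A) / len(D - A)        # |B - (A ∩ B)| / |D - A|
    print(recall, precision, fallout)          # 0.4 0.5 ≈ 0.0002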
Ranked Retrieval: Example
SMART system using Cranfield data: 200 documents in aeronautics,
of which 5 are relevant.

 n   relevant   recall      precision
 1   yes        1/5 = 0.2   1/1 = 1.00
 2   yes        2/5 = 0.4   2/2 = 1.00
 3   no         0.4         0.67
 4   yes        0.6         0.75
 5   no         0.6         0.60
 6   yes        0.8         0.67
 7   no         0.8         0.57
 8   no         0.8         0.50
 9   no         0.8         0.44
10   no         0.8         0.40
11   no         0.8         0.36
12   no         0.8         0.33
13   yes        1.0         0.38
14   no         1.0         0.36

23
Precision-recall graph

[Figure: precision plotted against recall for the SMART/Cranfield
example, with points labeled by rank position]

Note: Some authors plot recall against precision.

24
11 Point Interpolated Precision
(Recall Cut Off)

p(n) is the precision at the point where recall has first reached n

Define 11 standard recall points p(r0), p(r1), ..., p(r10),
where p(rj) = p(j/10), j ∈ {0, 1, 2, ..., 10}
Note: if p(rj) is not an exact data point, use interpolation

25
Recall Cutoff Graph:
Choice of Interpolation Points

[Figure: the same precision-recall points, with the recall cutoff
graph shown as a blue line]

26
Example: SMART System on Cranfield Data

Recall   Precision
0.0      1.00   (interpolated)
0.1      1.00   (interpolated)
0.2      1.00   (actual data point)
0.3      1.00   (interpolated)
0.4      1.00   (actual data point)
0.5      0.75   (interpolated)
0.6      0.75   (actual data point)
0.7      0.67   (interpolated)
0.8      0.67   (actual data point)
0.9      0.38   (interpolated)
1.0      0.38   (actual data point)

In the original slide, actual data points are shown in blue and
interpolated values (by convention equal to the next actual data
value) in red.
27
Average precision

Average precision for a single topic is the mean of the precision
values obtained after each relevant document is retrieved.

Example (SMART/Cranfield ranking above):
p = (1.0 + 1.0 + 0.75 + 0.67 + 0.38) / 5 = 0.76

Mean average precision (MAP) for a run consisting of many topics
is the mean of the average precision scores of the individual
topics in the run.

Definitions from TREC-8
28
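A sketch of both definitions in plain Python; the first topic reuses the worked example above, the second topic's numbers are hypothetical.

    def average_precision(precisions_at_relevant_ranks):
        # Mean of the precision values observed where relevant documents occur.
        return sum(precisions_at_relevant_ranks) / len(precisions_at_relevant_ranks)

    # SMART/Cranfield example above: relevant documents at ranks 1, 2, 4, 6, 13
    ap_topic1 = average_precision([1.0, 1.0, 0.75, 0.67, 0.38])   # ≈ 0.76
    ap_topic2 = average_precision([0.5, 0.4])                     # hypothetical topic

    # Mean average precision (MAP): mean of the per-topic average precision scores
    map_score = (ap_topic1 + ap_topic2) / 2
    print(round(ap_topic1, 2), round(map_score, 2))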
Statistical Tests Superior System

Suppose that a search is carried out on systems i and j

System i is superior to system j if, for all test cases,
recall(i) >= recall(j)
precision(i) >= precision(j)

29
Single Value Summaries

R-Precision: the idea is to generate a single-value summary of the
ranking by computing the precision at the R-th position in the
ranking, where R is the total number of relevant documents for the
current query (i.e., the number of documents in the set Rq).

For instance, consider the two examples above (Figures 3.2 and 3.3).
The value of R-precision is 0.4 for the first example (R = 10) and
0.33 for the second example (R = 3).

30
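A sketch of R-precision in plain Python, applied to the ranking used in both examples; the function name is mine.

    # R-precision (sketch): precision at position R, where R = |Rq|.
    def r_precision(ranking, rq):
        r = len(rq)
        return sum(1 for d in ranking[:r] if d in rq) / r

    ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
               "d187", "d25", "d38", "d48", "d250", "d113", "d3"]

    print(r_precision(ranking, {"d3", "d5", "d9", "d25", "d39",
                                "d44", "d56", "d71", "d89", "d123"}))   # 0.4 (first example)
    print(r_precision(ranking, {"d3", "d56", "d129"}))                  # ≈ 0.33 (second example)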
Single Value Summaries (Cont.)

Precision Histograms: the R-precision measures for several queries
can be used to compare the retrieval performance of two algorithms
as follows. Let RPA(i) and RPB(i) be the R-precision values of the
retrieval algorithms A and B for the i-th query. Define, for
instance, the difference

RPA/B(i) = RPA(i) - RPB(i)

A value of RPA/B(i) equal to 0 indicates that both algorithms have
equivalent performance for the i-th query. A positive value of
RPA/B(i) indicates better retrieval performance by algorithm A,
while a negative value indicates better retrieval performance by
algorithm B.

31
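A small sketch of the per-query difference that a precision histogram like Figure 3.5 would plot; the R-precision values are hypothetical.

    # R-precision difference per query (hypothetical values).
    rp_a = [0.4, 0.5, 0.6, 0.2, 0.1]      # RPA(i) for queries i = 1..5
    rp_b = [0.3, 0.5, 0.4, 0.5, 0.3]      # RPB(i)

    rp_diff = [round(a - b, 2) for a, b in zip(rp_a, rp_b)]   # RPA/B(i) = RPA(i) - RPB(i)
    # Positive: A better on query i; zero: equivalent; negative: B better
    print(rp_diff)                         # [0.1, 0.0, 0.2, -0.3, -0.2]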
Single Value Summaries (Cont.)

Figure 3.5 illustrates the RPA/B(i) values for two hypothetical
retrieval algorithms over ten example queries. Algorithm A is
superior for eight queries, while algorithm B performs better for
the other two queries (numbered 4 and 5).

32
Single Value Summaries (Cont.)

Summary Table Statistics: single-value measures can also be stored
in a table to provide a statistical summary for the set of all
queries in a retrieval task.

For instance, these summary table statistics could


include: the number of queries used in the task, the
total number of documents retrieved by all queries, the
total number of relevant documents which were
effectively retrieved when all queries are considered, the
total number of relevant documents which could have
been retrieved by all queries, etc.

33
Precision and Recall Appropriateness
Precision and recall have been used extensively to evaluate the
retrieval performance of retrieval algorithms. However, more
careful reflection reveals problems with these two measures:

The proper estimation of maximum recall for a query requires
detailed knowledge of all the documents in the collection.

Recall and precision are related measures which capture different
aspects of the set of retrieved documents.

Recall and precision measure the effectiveness over a set of
queries processed in batch mode.

34
Precision and Recall Appropriateness (Cont.)

Measures which quantify the informativeness of the retrieval
process might now be more appropriate.

Recall and precision are easy to define when a linear ordering
of the retrieved documents is enforced.

35
Experimental Evaluation Measures

Precision and Recall


Precision is the proportion of retrieved material that is relevant
Recall is the proportion of relevant material retrieved
Percentage Accuracy
The fraction of positive user feedback in the top n items
recommended by the system
Normalized Distance-based Performance Measure
Distance between the user's ranking and the system's ranking of
the same set of documents
Normalized to be between 0 and 1

36
Experimental Evaluation Methodology

Training data set and User ratings


            |  Test examples  |       Training examples        |
            |  d1   ...  d50  |  d51   d52   ...  d1999  d2000 |
  User 1    |  ..        ..   |  ..    ..         ..     ..    |
  User 2    |  ..        ..   |  ..    ..         ..     ..    |
  ...       |                 |                                |
  User n    |  ..        ..   |  ..    ..         ..     ..    |

The training data set will be split into 40 sessions


One session will be used as the test examples and the remaining 39
sessions as the training examples
Each session will be used as the test examples in one of the 40
experiments
The results of these 40 experiments will be averaged
The same training data set is used to evaluate the other filtering
methods
37
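A sketch of the leave-one-session-out protocol described above, in plain Python; evaluate_filter is a hypothetical placeholder for training a filtering method on 39 sessions and scoring it on the held-out one.

    def evaluate_filter(train_sessions, test_session):
        # Hypothetical stand-in: train the filtering method on train_sessions
        # and return its score (e.g., precision) on test_session.
        return 0.0

    sessions = [f"session_{k}" for k in range(40)]   # the 40 sessions of the training data
    scores = []
    for k, held_out in enumerate(sessions):
        train = sessions[:k] + sessions[k + 1:]      # the remaining 39 sessions
        scores.append(evaluate_filter(train, held_out))

    average_score = sum(scores) / len(scores)        # average over the 40 experiments
    print(average_score)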
Alternative Measures

Since recall and precision, despite their popularity, are


not always the most appropriate measures for
evaluating retrieval performance, alternative measures
have been proposed over the years. A brief review of
some of them is as follows.

- The Harmonic Mean


- The E Measure
- User-Oriented Measures
- Other Measures

38
The Harmonic Mean

As discussed above, a single measure which combines recall and
precision might be of interest. One such measure is the harmonic
mean F of recall and precision, which is computed as

F(j) = 2 / ( 1/r(j) + 1/P(j) )

where r(j) is the recall for the j-th document in the ranking,
P(j) is the precision for the j-th document in the ranking, and
F(j) is the harmonic mean of r(j) and P(j).

39
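A minimal sketch of the harmonic mean in plain Python, evaluated at rank 10 of the slide-8 example (recall 0.4, precision 0.40).

    # Harmonic mean of recall and precision at rank j (sketch).
    def f_measure(recall_j, precision_j):
        if recall_j == 0 or precision_j == 0:
            return 0.0
        return 2 / (1 / recall_j + 1 / precision_j)

    print(round(f_measure(0.4, 0.40), 2))   # 0.4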
The E Measure
Another measure which combines recall and precision was proposed by
van Rijsbergen and is called the E evaluation measure. The idea is
to allow the user to specify whether he is more interested in recall
or in precision. The E measure is defined as follows:

E(j) = 1 - (1 + b^2) / ( b^2/r(j) + 1/P(j) )

where r(j) is the recall for the j-th document in the ranking, P(j)
is the precision for the j-th document in the ranking, E(j) is the E
evaluation measure relative to r(j) and P(j), and b is a
user-specified parameter which reflects the relative importance of
recall and precision.
40
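A matching sketch of the E measure; with b = 1 it reduces to the complement of the harmonic mean F above.

    # E measure at rank j with user parameter b (sketch).
    def e_measure(recall_j, precision_j, b):
        return 1 - (1 + b * b) / (b * b / recall_j + 1 / precision_j)

    # With b = 1 the E measure equals 1 - F
    print(round(e_measure(0.4, 0.40, 1.0), 2))   # 0.6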
User-Oriented Measures
Recall and precision are based on the assumption that the set of
relevant documents for a query is the same, independent of the
user. However, different users might have a different
interpretation of which documents are relevant and which are not.
To cope with this problem, user-oriented measures have been
proposed, such as coverage ratio, novelty ratio, relative recall,
and recall effort.

Let R be the set of relevant documents for an information request I
and A be the answer set retrieved. Also, let U be the subset of R
which is known to the user. The number of documents in U is |U|.

41
User-Oriented Measures (Cont.)

Let Rk be the set of relevant documents known to the user which were
retrieved (i.e., the documents in the intersection of A and U), and
let |Rk| be the number of documents in this set. Further, let |Ru|
be the number of relevant documents previously unknown to the user
which were retrieved.

The coverage ratio is the fraction of the documents known to the
user to be relevant which has been retrieved:

coverage = |Rk| / |U|

The novelty ratio is the fraction of the relevant documents
retrieved which was unknown to the user:

novelty = |Ru| / ( |Ru| + |Rk| )
42
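A sketch of both ratios in plain Python; the sets are hypothetical and chosen only to exercise the formulas.

    # Coverage and novelty ratios (hypothetical sets).
    U = {"d1", "d2", "d3", "d4"}                 # relevant docs already known to the user
    A = {"d1", "d2", "d7", "d9"}                 # answer set retrieved
    R = {"d1", "d2", "d3", "d4", "d7", "d9"}     # all relevant docs

    Rk = A & U                                   # relevant docs known to the user that were retrieved
    Ru = (A & R) - U                             # relevant docs retrieved that were unknown to the user

    coverage = len(Rk) / len(U)                  # |Rk| / |U|
    novelty = len(Ru) / (len(Ru) + len(Rk))      # |Ru| / (|Ru| + |Rk|)
    print(coverage, novelty)                     # 0.5 0.5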
User-Oriented Measures (Cont.)

Figure 3.6 Coverage and novelty ratio for a given example
information request.

43
Other Measures

Other measures which might be of interest include the expected
search length, which is good for dealing with sets of documents
that are weakly ordered; the satisfaction, which takes into account
only the relevant documents; and the frustration, which takes into
account only the non-relevant documents.

44
History

Measuring the effectiveness of information discovery is an


unresolved research topic.
Much of the work on the evaluation of information retrieval derives
from the ASLIB Cranfield projects led by Cyril Cleverdon, which
began in 1957.
Recent work on evaluation has centered around the Text REtrieval
Conference (TREC), which began in 1992, led by Donna Harman and
Ellen Voorhees of NIST.

45
Reference Collections

We first discuss the TIPSTER/TREC collection which, due


to its large size and thorough experimentation, is usually
considered to be the reference test collection in
information retrieval nowadays

Following that, we cover the CACM and ISI collections


due to their historical importance in the area of
information retrieval

TREC - http://trec.nist.gov/
CACM - http://www.cs.hope.edu/csci481/articles.html

46
The TREC Collection

Experimentation in information retrieval has frequently been
criticized on two fronts.

First, that it lacks a solid formal framework as a basic foundation.
This criticism is difficult to dismiss entirely due to the inherent
degree of psychological subjectiveness associated with the task of
deciding on the relevance of a given document.

Second, that it lacks robust and consistent testbeds and benchmarks.
This criticism, however, can be acted upon. For three decades,
experimentation in information retrieval was based on relatively
small test collections which did not reflect the main issues present
in a large bibliographical environment.
47
TREC - (Text REtrieval Conference)

Initiated under the National Institute of Standards and
Technology (NIST)

Goals:
Providing a large test collection
Uniform scoring procedures
A forum

7th TREC conference in 1998:

Document collection: test collections, example information
requests (topics), relevant docs
The benchmark tasks

48
The Document Collection

Table 3.1 Document collection used at TREC-6.

Stopwords are not removed and no stemming is performed.

49
The Document Collection (Cont.)

Documents from all subcollections are tagged with SGML to allow
easy parsing.

<doc>
<docno>WSJ880406-0090</docno>
<hl>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan</hl>
<author>Janet Guyon (WSJ Staff)</author>
<dateline>New York</dateline>
<text>
American Telephone & Telegraph Co. introduced the first of a new
generation of phone service with broad ..
</text>
</doc>

50
The Example Information Requests (Topics)

The TREC collection includes a set of example information requests
which can be used for testing a new ranking algorithm. Each request
is a description of an information need in natural language.

<top>
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description:
..
<nar> Narrative: A ..
</top>

Figure 3.8 Topic numbered 168 in the TREC collection.

51


The Example Information Requests (Topics)

The task of converting an information request into a system query
must be done by the system itself and is considered to be an
integral part of the evaluation procedure.

52
The Relevant Documents for
Each Example Information Request

At the TREC conferences, the set of relevant documents for each
example information request is obtained from a pool of possible
relevant documents.

This pool is created by taking the top K documents in the rankings
generated by the various participating retrieval systems.

This technique for assessing relevance is called the pooling method
and is based on two assumptions. First, that the vast majority of
the relevant documents is collected in the assembled pool. Second,
that the documents which are not in the pool can be considered to
be not relevant.

53
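A small sketch of how such a pool is assembled, in plain Python; the rankings and the value of K are hypothetical.

    # Pooling method sketch: the pool is the union of the top K documents
    # from each participating system's ranking (hypothetical rankings, K = 3).
    K = 3
    rankings = {
        "system_1": ["d9", "d2", "d7", "d1"],
        "system_2": ["d2", "d5", "d9", "d8"],
    }
    pool = set()
    for run in rankings.values():
        pool.update(run[:K])
    # Assessors judge only the pooled documents; unpooled documents are assumed non-relevant.
    print(sorted(pool))   # ['d2', 'd5', 'd7', 'd9']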
The Tasks at the TREC Conference

Ad hoc task:
Receive new requests and execute them on a pre-specified
document collection

Routing task:
Receive test information requests and two document collections
First collection: training and tuning the retrieval algorithm
Second collection: testing the tuned retrieval algorithm

54
The Tasks at the TREC Conference (Cont.)

Other tasks:
Chinese
Filtering
Interactive
NLP (natural language processing)
Cross languages
High precision
Spoken document retrieval
Query task (TREC-7)

55
Evaluation Measures at the
TREC Conferences

Summary table statistics


Recall-precision
Document level averages
Average precision histogram

56
The CACM Collections
A small collection about computer science literature
Text of the documents
Structured subfields:
Word stems from the title and abstract sections
Categories
Direct references between articles: a list of pairs of documents [da, db]
Bibliographic coupling connections: a list of triples [d1, d2, ncited]
Number of co-citations for each pair of articles: [d1, d2, nciting]

A unique environment for testing retrieval algorithms which are based
on information derived from cross-citing patterns
57
The ISI Collections

The 1460 documents in the ISI test collection were selected from a
previous collection assembled by Small at the Institute for
Scientific Information.

The documents selected were those most cited in a


cross-citation study done by Small. The main purpose of
the ISI collection is to support investigation of
similarities based on terms and on cross-citation
patterns.

58
The ISI Collections (Cont.)

The documents in the ISI collection include three types


of subfields as follows.
author names.
word stems from the title and abstract sections.
number of co-citations for each pair of articles.

59
The Cystic Fibrosis Collection
The cystic fibrosis (CF) collection is composed of 1239 documents
indexed with the term "cystic fibrosis" in the National Library of
Medicine's MEDLINE database.

The collection also includes 100 information requests (generated by
an expert with two decades of clinical and research experience with
cystic fibrosis) and the documents relevant to each query.

Relevance scores:
0: non-relevance
1: marginal relevance
2: high relevance
60
Trends and Research Issues
A major trend today is research on interactive user interfaces. The
motivation is a general belief that effective retrieval is highly
dependent on obtaining proper feedback from the user. Thus,
evaluation studies of interactive interfaces will tend to become
more common in the near future. The main issues revolve around
deciding which evaluation measures are most appropriate in this
scenario. A typical example is the informativeness measure
introduced in 1992. Furthermore, the proposal, the study, and the
characterization of alternative measures to recall and precision,
such as the harmonic mean and the E measure, continue to be of
interest.

61
