
Information Retrieval

Chap 3. Retrieval Evaluation


Retrieval Performance Evaluation

Measures of retrieval system performance


Recall and Precision

How completely the relevant documents are retrieved (recall)

How precise the answer set is (precision)

Measures of system performance
Time and Space

The shorter the response time and the smaller the space used,
the better the system is

2
Experiment Settings

Real-life experiments

Became a lot more common in the 1990s

Laboratory experiments
Still dominant

Two main reasons

Repeatability
Scalability

3
Recall and Precision

Recall = |Ra| / |R|        Precision = |Ra| / |A|

R = the set of relevant docs


|R| = the number of relevant docs

A = the docs retrieved (answer set)


|A| = the number of retrieved docs

|Ra| = the number of docs in the intersection of R and A

4
Recall and Precision

Recall = |Ra| / |R|        Precision = |Ra| / |A|

Recall is the fraction of the relevant docs that is retrieved

Precision is the fraction of the retrieved docs that is relevant

5
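To make the two formulas concrete, here is a minimal sketch in plain Python; the document ids and variable names are illustrative, not from the slides.

    # Illustrative sketch of the definitions above (hypothetical document ids).
    R = {"d1", "d2", "d3", "d4"}          # relevant documents
    A = {"d1", "d3", "d7", "d9", "d10"}   # answer set (retrieved documents)

    Ra = R & A                            # relevant documents that were retrieved
    recall = len(Ra) / len(R)             # |Ra| / |R|  -> 0.5
    precision = len(Ra) / len(A)          # |Ra| / |A|  -> 0.4
    print(recall, precision)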
High-Recall and High-Precision

6
Recall-Precision Curve: Example

Rq contains the relevant documents for query q


Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
Ranking of the documents retrieved for query q:
1. d123    6. d9      11. d38
2. d84     7. d511    12. d48
3. d56     8. d129    13. d250
4. d6      9. d187    14. d113
5. d8      10. d25    15. d3

In the original figure, documents relevant to query q (the documents
in Rq) are marked with a bullet after the document id.

7
Ranked Retrieval: Example

Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}

Ranked documents:
1. d123    6. d9      11. d38
2. d84     7. d511    12. d48
3. d56     8. d129    13. d250
4. d6      9. d187    14. d113
5. d8      10. d25    15. d3

 n   relevant   recall       precision
 1   yes        1/10 = 0.1   1/1  = 1.00
 2   no         1/10 = 0.1   1/2  = 0.50
 3   yes        2/10 = 0.2   2/3  = 0.67
 4   no         2/10 = 0.2   2/4  = 0.50
 5   no         2/10 = 0.2   2/5  = 0.40
 6   yes        3/10 = 0.3   3/6  = 0.50
 7   no         3/10 = 0.3   3/7  = 0.43
 8   no         3/10 = 0.3   3/8  = 0.38
 9   no         3/10 = 0.3   3/9  = 0.33
10   yes        4/10 = 0.4   4/10 = 0.40
11   no         4/10 = 0.4   4/11 = 0.36
12   no         4/10 = 0.4   4/12 = 0.33
13   no         4/10 = 0.4   4/13 = 0.31
14   no         4/10 = 0.4   4/14 = 0.29
15   yes        5/10 = 0.5   5/15 = 0.33

8
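The recall and precision columns above can be reproduced with a short loop; this is a sketch in plain Python (the loop structure is mine, the data are taken from the example).

    # Recompute recall and precision at each rank of the example ranking.
    rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
    ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
               "d187", "d25", "d38", "d48", "d250", "d113", "d3"]

    hits = 0
    for n, doc in enumerate(ranking, start=1):
        if doc in rq:
            hits += 1
        # recall = relevant seen so far / |Rq|; precision = relevant seen so far / n
        print(n, doc in rq, hits / len(rq), hits / n)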
Recall-Precision Curve: Example

The precision at recall levels higher than 50% drops to 0

because not all relevant documents are retrieved
(only 5 of the 10 relevant documents appear in the ranking)
9
Average Precision

The average precision at recall level r:

P(r) = (1/Nq) * Σ_{i=1..Nq} Pi(r)

Nq = the number of queries used

Pi(r) = the precision at recall level r for the i-th query

10
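A sketch of the averaging above, assuming the per-query precision values at a given recall level are already available; the numbers are hypothetical.

    # Average precision at recall level r over Nq queries (hypothetical values).
    precisions_at_r = [0.67, 0.50, 0.40]          # Pi(r) for i = 1..Nq
    nq = len(precisions_at_r)                     # Nq
    avg_p_at_r = sum(precisions_at_r) / nq        # P(r) = (1/Nq) * sum_i Pi(r)
    print(round(avg_p_at_r, 2))                   # 0.52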
Interpolated Precision

The interpolated precision at the j-th standard recall level is

the maximum precision between the j-th and the (j+1)-th recall level:

P(rj) = max P(r)   for rj <= r <= rj+1

rj = the j-th standard recall level, j ∈ {0, 1, 2, ..., 10}

e.g., r3 = the recall level 30%

If the interpolated precision values are used,
the curve is non-increasing

11
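A sketch of 11-point interpolation in plain Python. It uses the common convention of taking the maximum precision at any recall level at or above rj, which yields the non-increasing curve described above; the helper name is mine.

    # 11-point interpolated precision (sketch).
    def interpolated_11pt(points):
        # points: list of (recall, precision) pairs measured at the relevant documents
        levels = [j / 10 for j in range(11)]
        interp = []
        for rj in levels:
            candidates = [p for r, p in points if r >= rj]
            interp.append(max(candidates) if candidates else 0.0)
        return interp

    # (recall, precision) at the relevant documents of the example on slide 8
    points = [(0.1, 1.00), (0.2, 0.67), (0.3, 0.50), (0.4, 0.40), (0.5, 0.33)]
    print(interpolated_11pt(points))
    # [1.0, 1.0, 0.67, 0.5, 0.4, 0.33, 0.0, 0.0, 0.0, 0.0, 0.0]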
Ranked Retrieval: Example

Rq = {d3, d56, d129}

Ranked documents:
1. d123    6. d9      11. d38
2. d84     7. d511    12. d48
3. d56     8. d129    13. d250
4. d6      9. d187    14. d113
5. d8      10. d25    15. d3

 n   relevant   recall   precision
 1   no         ?        ?
 2   no         ?        ?
 3   yes        ?        ?
 4   no         ?        ?
 5   no         ?        ?
 6   no         ?        ?
 7   no         ?        ?
 8   yes        ?        ?
 9   no         ?        ?
10   no         ?        ?
11   no         ?        ?
12   no         ?        ?
13   no         ?        ?
14   no         ?        ?
15   yes        ?        ?

12
Interpolated Precision: Example

Rq = {d3, d56, d129}

Ranked documents:
1. d123    6. d9      11. d38
2. d84     7. d511    12. d48
3. d56     8. d129    13. d250
4. d6      9. d187    14. d113
5. d8      10. d25    15. d3

recall   precision   interpolated precision
  0%     -           ?
 10%     ?           ?
 20%     ?           ?
 30%     ?           ?
 40%     ?           ?
 50%     ?           ?
 60%     ?           ?
 70%     ?           ?
 80%     ?           ?
 90%     ?           ?
100%     ?           ?
13
Recall and Precision Curve: Example
Interpolated Precision

Interpolated precision at the 11 standard recall levels relative to
Rq = {d3, d56, d129}

14
Performance of Distinct Retrieval Algorithms

One algorithm has higher precision at lower recall levels

The other algorithm is superior at higher recall levels

15
Horizontal Recall-Precision Line

If a perfect ranking algorithm could be developed:

All relevant documents would be ranked ahead of all irrelevant documents
Precision would be 100 percent at all recall levels
The recall-precision curve would be a horizontal line at 100 percent

16
Compare Ranking Algorithms

Plot the recall-precision curves


If one curve lies completely above the other,
that algorithm is better

[Figure: recall plotted against precision for two systems, red and black]
- The red system appears better than the black
- In practice, the curves usually intersect, perhaps several times

17
Recall and Precision: Example

Collection of 10,000 documents, 50 on a specific topic

An ideal search finds these 50 documents and rejects all others
The actual search retrieves 25 documents; 20 are relevant
but 5 are on other topics
Precision: ?
Recall: ?

18
Recall and Precision: Example

Collection of 10,000 documents, 50 on a specific topic

An ideal search finds these 50 documents and rejects all others
The actual search retrieves 25 documents; 20 are relevant
but 5 are on other topics
Precision: 20/25 = 0.8
Recall: 20/50 = 0.4

19
Measuring Precision and Recall

Precision is easy to measure:


A knowledgeable person looks at each document that is
retrieved and decides whether it is relevant
In the example, only the 25 documents that are found
need to be examined
Recall is difficult to measure:
To know all relevant items, a knowledgeable person
must go through the entire collection, looking at every
object to decide if it fits the criteria
In the example, all 10,000 documents must be examined
20
Relevance as a set comparison

D = the set of documents

A = the set of documents that satisfy some user-based criterion

B = the set of documents identified by the search system

21
Measures based on relevance

recall = (retrieved and relevant) / relevant = |A ∩ B| / |A|

precision = (retrieved and relevant) / retrieved = |A ∩ B| / |B|

fallout = (retrieved and not relevant) / not relevant = |B - (A ∩ B)| / |D - A|

22
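A sketch of the three set-based measures in plain Python, using this slide's notation (D = all documents, A = relevant, B = retrieved); the sets are hypothetical.

    # Set-based recall, precision, and fallout (hypothetical sets).
    D = {f"d{i}" for i in range(1, 10001)}     # collection
    A = {"d1", "d2", "d3", "d4", "d5"}         # relevant documents
    B = {"d1", "d2", "d9", "d10"}              # retrieved documents

    recall    = len(A & B) / len(A)            # |A ∩ B| / |A|
    precision = len(A & B) / len(B)            # |A ∩ B| / |B|
    fallout   = len(B - A) / len(D - A)        # |B - (A ∩ B)| / |D - A|
    print(recall, precision, fallout)          # 0.4 0.5 ≈ 0.0002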
Ranked Retrieval: Example
SMART system using Cranfield data: 200 documents in aeronautics,
of which 5 are relevant.

 n   relevant   recall      precision
 1   yes        1/5 = 0.2   1/1 = 1.00
 2   yes        2/5 = 0.4   2/2 = 1.00
 3   no         0.4         0.67
 4   yes        0.6         0.75
 5   no         0.6         0.60
 6   yes        0.8         0.67
 7   no         0.8         0.57
 8   no         0.8         0.50
 9   no         0.8         0.44
10   no         0.8         0.40
11   no         0.8         0.36
12   no         0.8         0.33
13   yes        1.0         0.38
14   no         1.0         0.36

23
Precision-recall graph

[Figure: precision plotted against recall for the SMART/Cranfield
example, with points labeled by rank position]

Note: Some authors plot recall against precision.

24
11 Point Interpolated Precision
(Recall Cut Off)

p(n) is the precision at the point where recall has first reached n

Define 11 standard recall points p(r0), p(r1), ..., p(r10),
where p(rj) = p(j/10), j ∈ {0, 1, 2, ..., 10}
Note: if p(rj) is not an exact data point, use interpolation

25
Recall Cutoff Graph:
Choice of Interpolation Points

[Figure: the same precision-recall points, with the recall cutoff
graph shown as a blue line]

26
Example: SMART System on Cranfield Data

Recall   Precision
0.0      1.00   (interpolated)
0.1      1.00   (interpolated)
0.2      1.00   (actual data point)
0.3      1.00   (interpolated)
0.4      1.00   (actual data point)
0.5      0.75   (interpolated)
0.6      0.75   (actual data point)
0.7      0.67   (interpolated)
0.8      0.67   (actual data point)
0.9      0.38   (interpolated)
1.0      0.38   (actual data point)

In the original slide, actual data points are shown in blue and
interpolated values (by convention equal to the next actual data
value) in red.
27
Average precision

Average precision for a single topic is the mean of the precision
values obtained after each relevant document is retrieved.

Example (SMART/Cranfield ranking above):
p = (1.0 + 1.0 + 0.75 + 0.67 + 0.38) / 5 = 0.76

Mean average precision (MAP) for a run consisting of many topics
is the mean of the average precision scores of the individual
topics in the run.

Definitions from TREC-8
28
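A sketch of both definitions in plain Python; the first topic reuses the worked example above, the second topic's numbers are hypothetical.

    def average_precision(precisions_at_relevant_ranks):
        # Mean of the precision values observed where relevant documents occur.
        return sum(precisions_at_relevant_ranks) / len(precisions_at_relevant_ranks)

    # SMART/Cranfield example above: relevant documents at ranks 1, 2, 4, 6, 13
    ap_topic1 = average_precision([1.0, 1.0, 0.75, 0.67, 0.38])   # ≈ 0.76
    ap_topic2 = average_precision([0.5, 0.4])                     # hypothetical topic

    # Mean average precision (MAP): mean of the per-topic average precision scores
    map_score = (ap_topic1 + ap_topic2) / 2
    print(round(ap_topic1, 2), round(map_score, 2))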
Statistical Tests Superior System

Suppose that a search is carried out on systems i and j

System i is superior to system j if, for all test cases,
recall(i) >= recall(j)
precision(i) >= precision(j)

29
Single Value Summaries

R-Precision: the idea is to generate a single-value summary of the
ranking by computing the precision at the R-th position in the
ranking, where R is the total number of relevant documents for the
current query (i.e., the number of documents in the set Rq).

For instance, consider the two examples above (Figures 3.2 and 3.3).
The value of R-precision is 0.4 for the first example (R = 10) and
0.33 for the second example (R = 3).

30
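A sketch of R-precision in plain Python, applied to the ranking used in both examples; the function name is mine.

    # R-precision (sketch): precision at position R, where R = |Rq|.
    def r_precision(ranking, rq):
        r = len(rq)
        return sum(1 for d in ranking[:r] if d in rq) / r

    ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
               "d187", "d25", "d38", "d48", "d250", "d113", "d3"]

    print(r_precision(ranking, {"d3", "d5", "d9", "d25", "d39",
                                "d44", "d56", "d71", "d89", "d123"}))   # 0.4 (first example)
    print(r_precision(ranking, {"d3", "d56", "d129"}))                  # ≈ 0.33 (second example)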
Single Value Summaries (Cont.)

Precision Histograms: the R-precision measures for several queries
can be used to compare the retrieval performance of two algorithms
as follows. Let RPA(i) and RPB(i) be the R-precision values of the
retrieval algorithms A and B for the i-th query. Define, for
instance, the difference

RPA/B(i) = RPA(i) - RPB(i)

A value of RPA/B(i) equal to 0 indicates that both algorithms have
equivalent performance for the i-th query. A positive value of
RPA/B(i) indicates better retrieval performance by algorithm A,
while a negative value indicates better retrieval performance by
algorithm B.

31
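A small sketch of the per-query difference that a precision histogram like Figure 3.5 would plot; the R-precision values are hypothetical.

    # R-precision difference per query (hypothetical values).
    rp_a = [0.4, 0.5, 0.6, 0.2, 0.1]      # RPA(i) for queries i = 1..5
    rp_b = [0.3, 0.5, 0.4, 0.5, 0.3]      # RPB(i)

    rp_diff = [round(a - b, 2) for a, b in zip(rp_a, rp_b)]   # RPA/B(i) = RPA(i) - RPB(i)
    # Positive: A better on query i; zero: equivalent; negative: B better
    print(rp_diff)                         # [0.1, 0.0, 0.2, -0.3, -0.2]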
Single Value Summaries (Cont.)

Figure 3.5 illustrates the RPA/B(i) values for two hypothetical
retrieval algorithms over ten example queries. Algorithm A is
superior for eight queries, while algorithm B performs better for
the other two queries (numbered 4 and 5).

32
Single Value Summaries (Cont.)

Summary Table Statistics: single-value measures can also be stored
in a table to provide a statistical summary for the set of all
queries in a retrieval task.

For instance, these summary table statistics could


include: the number of queries used in the task, the
total number of documents retrieved by all queries, the
total number of relevant documents which were
effectively retrieved when all queries are considered, the
total number of relevant documents which could have
been retrieved by all queries, etc.

33
Precision and Recall Appropriateness
Precision and recall have been used extensively to evaluate the
retrieval performance of retrieval algorithms. However, more
careful reflection reveals problems with these two measures:

The proper estimation of maximum recall for a query requires
detailed knowledge of all the documents in the collection.

Recall and precision are related measures which capture different
aspects of the set of retrieved documents.

Recall and precision measure the effectiveness over a set of
queries processed in batch mode.

34
Precision and Recall Appropriateness (Cont.)

Measures which quantify the informativeness of the retrieval
process might now be more appropriate.

Recall and precision are easy to define when a linear ordering
of the retrieved documents is enforced.

35
Experimental Evaluation Measures

Precision and Recall


Precision is the proportion of retrieved material that is relevant
Recall is the proportion of relevant material retrieved
Percentage Accuracy
The fraction of positive user feedback in the top n items
recommended by the system
Normalized Distance-based Performance Measure
Distance between the user's ranking and the system's ranking of
the same set of documents
Normalized to be between 0 and 1

36
Experimental Evaluation Methodology

Training data set and User ratings


            |  Test examples  |       Training examples        |
            |  d1   ...  d50  |  d51   d52   ...  d1999  d2000 |
  User 1    |  ..        ..   |  ..    ..         ..     ..    |
  User 2    |  ..        ..   |  ..    ..         ..     ..    |
  ...       |                 |                                |
  User n    |  ..        ..   |  ..    ..         ..     ..    |

The training data set will be split into 40 sessions


One session will be used as the test examples and the remaining 39
sessions as the training examples
Each session will be used as the test examples in one of the 40
experiments
The results of these 40 experiments will be averaged
The same training data set is used to evaluate the other filtering
methods
37
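A sketch of the leave-one-session-out protocol described above, in plain Python; evaluate_filter is a hypothetical placeholder for training a filtering method on 39 sessions and scoring it on the held-out one.

    def evaluate_filter(train_sessions, test_session):
        # Hypothetical stand-in: train the filtering method on train_sessions
        # and return its score (e.g., precision) on test_session.
        return 0.0

    sessions = [f"session_{k}" for k in range(40)]   # the 40 sessions of the training data
    scores = []
    for k, held_out in enumerate(sessions):
        train = sessions[:k] + sessions[k + 1:]      # the remaining 39 sessions
        scores.append(evaluate_filter(train, held_out))

    average_score = sum(scores) / len(scores)        # average over the 40 experiments
    print(average_score)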
Alternative Measures

Since recall and precision, despite their popularity, are


not always the most appropriate measures for
evaluating retrieval performance, alternative measures
have been proposed over the years. A brief review of
some of them is as follows.

- The Harmonic Mean


- The E Measure
- User-Oriented Measures
- Other Measures

38
The Harmonic Mean

As discussed above, a single measure which combines recall and
precision might be of interest. One such measure is the harmonic
mean F of recall and precision, which is computed as

F(j) = 2 / ( 1/r(j) + 1/P(j) )

where r(j) is the recall for the j-th document in the ranking,
P(j) is the precision for the j-th document in the ranking, and
F(j) is the harmonic mean of r(j) and P(j).

39
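A minimal sketch of the harmonic mean in plain Python, evaluated at rank 10 of the slide-8 example (recall 0.4, precision 0.40).

    # Harmonic mean of recall and precision at rank j (sketch).
    def f_measure(recall_j, precision_j):
        if recall_j == 0 or precision_j == 0:
            return 0.0
        return 2 / (1 / recall_j + 1 / precision_j)

    print(round(f_measure(0.4, 0.40), 2))   # 0.4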
The E Measure
Another measure which combines recall and precision was proposed by
van Rijsbergen and is called the E evaluation measure. The idea is
to allow the user to specify whether he is more interested in recall
or in precision. The E measure is defined as follows:

E(j) = 1 - (1 + b^2) / ( b^2/r(j) + 1/P(j) )

where r(j) is the recall for the j-th document in the ranking, P(j)
is the precision for the j-th document in the ranking, E(j) is the E
evaluation measure relative to r(j) and P(j), and b is a
user-specified parameter which reflects the relative importance of
recall and precision.
40
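A matching sketch of the E measure; with b = 1 it reduces to the complement of the harmonic mean F above.

    # E measure at rank j with user parameter b (sketch).
    def e_measure(recall_j, precision_j, b):
        return 1 - (1 + b * b) / (b * b / recall_j + 1 / precision_j)

    # With b = 1 the E measure equals 1 - F
    print(round(e_measure(0.4, 0.40, 1.0), 2))   # 0.6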
User-Oriented Measures
Recall and precision are based on the assumption that the set of
relevant documents for a query is the same, independent of the
user. However, different users might have a different
interpretation of which documents are relevant and which are not.
To cope with this problem, user-oriented measures have been
proposed, such as coverage ratio, novelty ratio, relative recall,
and recall effort.

Let R be the set of relevant documents for an information request I
and A be the answer set retrieved. Also, let U be the subset of R
which is known to the user. The number of documents in U is |U|.

41
User-Oriented Measures (Cont.)

Let Rk be the set of relevant documents known to the user which were
retrieved (i.e., the documents in the intersection of A and U), and
let |Rk| be the number of documents in this set. Further, let |Ru|
be the number of relevant documents previously unknown to the user
which were retrieved.

The coverage ratio is the fraction of the documents known to the
user to be relevant which has been retrieved:

coverage = |Rk| / |U|

The novelty ratio is the fraction of the relevant documents
retrieved which was unknown to the user:

novelty = |Ru| / ( |Ru| + |Rk| )
42
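A sketch of both ratios in plain Python; the sets are hypothetical and chosen only to exercise the formulas.

    # Coverage and novelty ratios (hypothetical sets).
    U = {"d1", "d2", "d3", "d4"}                 # relevant docs already known to the user
    A = {"d1", "d2", "d7", "d9"}                 # answer set retrieved
    R = {"d1", "d2", "d3", "d4", "d7", "d9"}     # all relevant docs

    Rk = A & U                                   # relevant docs known to the user that were retrieved
    Ru = (A & R) - U                             # relevant docs retrieved that were unknown to the user

    coverage = len(Rk) / len(U)                  # |Rk| / |U|
    novelty = len(Ru) / (len(Ru) + len(Rk))      # |Ru| / (|Ru| + |Rk|)
    print(coverage, novelty)                     # 0.5 0.5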
User-Oriented Measures (Cont.)

Figure 3.6 Coverage and novelty ratio for a given example
information request.

43
Other Measures

Other measures which might be of interest include the expected
search length, which is good for dealing with sets of documents
that are weakly ordered; the satisfaction, which takes into account
only the relevant documents; and the frustration, which takes into
account only the non-relevant documents.

44
History

Measuring the effectiveness of information discovery is an


unresolved research topic.
Much of the work on the evaluation of information retrieval derives
from the ASLIB Cranfield projects led by Cyril Cleverdon, which
began in 1957.
Recent work on evaluation has centered around the Text REtrieval
Conference (TREC), which began in 1992, led by Donna Harman and
Ellen Voorhees of NIST.

45
Reference Collections

We first discuss the TIPSTER/TREC collection which, due


to its large size and thorough experimentation, is usually
considered to be the reference test collection in
information retrieval nowadays

Following that, we cover the CACM and ISI collections


due to their historical importance in the area of
information retrieval

TREC - http://trec.nist.gov/
CACM - http://www.cs.hope.edu/csci481/articles.html

46
The TREC Collection

Experimentation in information retrieval has frequently been
criticized on two fronts.

First, that it lacks a solid formal framework as a basic foundation.
This criticism is difficult to dismiss entirely due to the inherent
degree of psychological subjectiveness associated with the task of
deciding on the relevance of a given document.

Second, that it lacks robust and consistent testbeds and benchmarks.
This criticism, however, can be acted upon. For three decades,
experimentation in information retrieval was based on relatively
small test collections which did not reflect the main issues present
in a large bibliographical environment.
47
TREC - (Text REtrieval Conference)

Initiated under the National Institute of Standards and
Technology (NIST)

Goals:
Providing a large test collection
Uniform scoring procedures
A forum

7th TREC conference in 1998:

Document collection: test collections, example information
requests (topics), relevant docs
The benchmark tasks

48
The Document Collection

Table 3.1 Document collection used at TREC-6.

Stopwords are not removed and no stemming is performed.

49
The Document Collection (Cont.)

Documents from all subcollections are tagged with SGML to allow
easy parsing.

<doc>
<docno>WSJ880406-0090</docno>
<hl>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan</hl>
<author>Janet Guyon (WSJ Staff)</author>
<dateline>New York</dateline>
<text>
American Telephone & Telegraph Co. introduced the first of a new
generation of phone service with broad ..
</text>
</doc>

50
The Example Information Requests (Topics)

The TREC collection includes a set of example information requests
which can be used for testing a new ranking algorithm. Each request
is a description of an information need in natural language.

<top>
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description:
..
<nar> Narrative: A ..
</top>

Figure 3.8 Topic numbered 168 in the TREC collection.

51


The Example Information Requests (Topics)

The task of converting an information request into a system query
must be done by the system itself and is considered to be an
integral part of the evaluation procedure.

52
The Relevant Documents for
Each Example Information Request

At the TREC conferences, the set of relevant documents for each
example information request is obtained from a pool of possible
relevant documents.

This pool is created by taking the top K documents in the rankings
generated by the various participating retrieval systems.

This technique for assessing relevance is called the pooling method
and is based on two assumptions. First, that the vast majority of
the relevant documents is collected in the assembled pool. Second,
that the documents which are not in the pool can be considered to
be not relevant.

53
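A small sketch of how such a pool is assembled, in plain Python; the rankings and the value of K are hypothetical.

    # Pooling method sketch: the pool is the union of the top K documents
    # from each participating system's ranking (hypothetical rankings, K = 3).
    K = 3
    rankings = {
        "system_1": ["d9", "d2", "d7", "d1"],
        "system_2": ["d2", "d5", "d9", "d8"],
    }
    pool = set()
    for run in rankings.values():
        pool.update(run[:K])
    # Assessors judge only the pooled documents; unpooled documents are assumed non-relevant.
    print(sorted(pool))   # ['d2', 'd5', 'd7', 'd9']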
The Tasks at the TREC Conference

Ad hoc task:
Receive new requests and execute them on a pre-specified
document collection

Routing task:
Receive test information requests and two document collections
First collection: training and tuning the retrieval algorithm
Second collection: testing the tuned retrieval algorithm

54
The Tasks at the TREC Conference (Cont.)

Other tasks:
Chinese
Filtering
Interactive
NLP (natural language processing)
Cross languages
High precision
Spoken document retrieval
Query task (TREC-7)

55
Evaluation Measures at the
TREC Conferences

Summary table statistics


Recall-precision
Document level averages
Average precision histogram

56
The CACM Collections
A small collection about computer science literature
Text of the documents
Structured subfields:
Word stems from the title and abstract sections
Categories
Direct references between articles: a list of pairs of documents [da, db]
Bibliographic coupling connections: a list of triples [d1, d2, ncited]
Number of co-citations for each pair of articles: [d1, d2, nciting]

A unique environment for testing retrieval algorithms which are based
on information derived from cross-citing patterns
57
The ISI Collections

The 1460 documents in the ISI test collection were selected from a
previous collection assembled by Small at the Institute for
Scientific Information.

The documents selected were those most cited in a


cross-citation study done by Small. The main purpose of
the ISI collection is to support investigation of
similarities based on terms and on cross-citation
patterns.

58
The ISI Collections (Cont.)

The documents in the ISI collection include three types


of subfields as follows.
author names.
word stems from the title and abstract sections.
number of co-citations for each pair of articles.

59
The Cystic Fibrosis Collection
The cystic fibrosis (CF) collection is composed of 1239 documents
indexed with the term "cystic fibrosis" in the National Library of
Medicine's MEDLINE database.

The collection also includes 100 information requests (generated by
an expert with two decades of clinical and research experience with
cystic fibrosis) and the documents relevant to each query.

Relevance scores:
0: non-relevance
1: marginal relevance
2: high relevance
60
Trends and Research Issues
A major trend today is research on interactive user interfaces. The
motivation is a general belief that effective retrieval is highly
dependent on obtaining proper feedback from the user. Thus,
evaluation studies of interactive interfaces will tend to become
more common in the near future. The main issues revolve around
deciding which evaluation measures are most appropriate in this
scenario. A typical example is the informativeness measure
introduced in 1992. Furthermore, the proposal, the study, and the
characterization of alternative measures to recall and precision,
such as the harmonic mean and the E measure, continue to be of
interest.

61
