The shorter the response time and the smaller the space used, the better the system is.
Experiment Settings
Laboratory experiments
Still dominant, because they allow for:
Repeatability
Scalability
Recall and Precision
Recall = |R_a| / |R|
Precision = |R_a| / |A|

where R is the set of documents relevant to the query, A is the answer set retrieved by the system, and R_a = R ∩ A is the set of relevant documents in the answer set.
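A minimal sketch of these definitions in Python (the set contents below are taken from the ranked example later in this chapter; the function name is illustrative):

def recall_precision(R, A):
    # Ra: relevant documents that were actually retrieved (R intersect A)
    Ra = R & A
    return len(Ra) / len(R), len(Ra) / len(A)

# Example: 10 relevant documents, 15 retrieved, 5 of the retrieved are relevant
R = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
A = {"d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187",
     "d25", "d38", "d48", "d250", "d113", "d3"}
print(recall_precision(R, A))   # (0.5, 0.333...) -> recall 0.5, precision 1/3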
High-Recall and High-Precision
Recall-Precision Curve: Example
Ranked Retrieval: Example
Relevant documents: Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}

Ranked documents:
1. d123    6. d9      11. d38
2. d84     7. d511    12. d48
3. d56     8. d129    13. d250
4. d6      9. d187    14. d113
5. d8     10. d25     15. d3

n    relevant   recall        precision
1    yes        1/10 = 0.1    1/1 = 1.00
2    no         1/10 = 0.1    1/2 = 0.50
3    yes        2/10 = 0.2    2/3 = 0.67
4    no         2/10 = 0.2    2/4 = 0.50
5    no         2/10 = 0.2    2/5 = 0.40
6    yes        3/10 = 0.3    3/6 = 0.50
7    no         3/10 = 0.3    3/7 = 0.43
8    no         3/10 = 0.3    3/8 = 0.38
9    no         3/10 = 0.3    3/9 = 0.33
10   yes        4/10 = 0.4    4/10 = 0.40
11   no         4/10 = 0.4    4/11 = 0.36
12   no         4/10 = 0.4    4/12 = 0.33
13   no         4/10 = 0.4    4/13 = 0.31
14   no         4/10 = 0.4    4/14 = 0.29
15   yes        5/10 = 0.5    5/15 = 0.33
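The table can be recomputed by walking down the ranking; a short illustrative sketch:

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}

hits = 0
for n, d in enumerate(ranking, start=1):
    rel = d in Rq
    hits += rel
    # recall and precision after the top-n documents have been examined
    print(n, "yes" if rel else "no",
          f"{hits}/{len(Rq)} = {hits / len(Rq):.1f}",
          f"{hits}/{n} = {hits / n:.2f}")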
Recall-Precision Curve: Example
Average Precision
\bar{P}(r) = \frac{1}{N_q} \sum_{i=1}^{N_q} P_i(r)

where N_q is the number of queries used and P_i(r) is the precision at recall level r for the i-th query.
Interpolated Precision
P(r_j) = \max_{r_j \le r \le r_{j+1}} P(r)

where r_j, j ∈ {0, 1, ..., 10}, are the standard recall levels.
Ranked Retrieval: Example
Performance of Distinct Retrieval Algorithms
Compare Ranking Algorithms
[Figure: two recall-precision curves compared; recall on the vertical axis (0 to 1.0), precision on the horizontal axis (0 to 1.0)]
The red system appears better than the black.
In practice, the curves usually intersect, perhaps several times.
Recall and Precision: Example
Measuring Precision and Recall
D = set of documents
A = set of documents that satisfy some user-based criterion
B = set of documents identified by the search system
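In this notation, recall and precision become (a standard restatement of the definitions above, not spelled out on the slide):

recall = |A ∩ B| / |A|
precision = |A ∩ B| / |B|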
Measures based on relevance
Ranked Retrieval: Example
SMART system using Cranfield data: 200 documents in aeronautics, of which 5 are relevant.

n    relevant   recall      precision
1    yes        1/5 = 0.2   1/1 = 1.00
2    yes        2/5 = 0.4   2/2 = 1.00
3    no         0.4         0.67
4    yes        0.6         0.75
5    no         0.6         0.60
6    yes        0.8         0.67
7    no         0.8         0.57
8    no         0.8         0.50
9    no         0.8         0.44
10   no         0.8         0.40
11   no         0.8         0.36
12   no         0.8         0.33
13   yes        1.0         0.38
14   no         1.0         0.36
Precision-recall graph
[Figure: precision (vertical axis, 0 to 1.0) vs. recall (horizontal axis, 0 to 1.0) for the example above; the points are labeled by rank n = 1, 2, 3, 4, 5, 6, 12, 13, and 200]
Note: Some authors plot recall against precision.
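A minimal matplotlib sketch (illustrative, not part of the original slides) that reproduces this kind of plot from the table above:

import matplotlib.pyplot as plt

# (recall, precision) pairs from the SMART/Cranfield table, keyed by rank n
points = {1: (0.2, 1.00), 2: (0.4, 1.00), 3: (0.4, 0.67), 4: (0.6, 0.75),
          5: (0.6, 0.60), 6: (0.8, 0.67), 12: (0.8, 0.33), 13: (1.0, 0.38)}
for n, (r, p) in points.items():
    plt.plot(r, p, "ko")            # one point per rank
    plt.annotate(str(n), (r, p))    # label the point with its rank
plt.xlabel("recall")
plt.ylabel("precision")
plt.xlim(0, 1.05)
plt.ylim(0, 1.05)
plt.show()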
11-Point Interpolated Precision
(Recall Cutoff)
Recall Cutoff Graph:
Choice of Interpolation Points
[Figure: the same precision-recall points as above; the blue line, drawn through the chosen interpolation points, is the recall cutoff graph]
Example: SMART System on Cranfield Data
Recall   Precision
0.0      1.00 *
0.1      1.00 *
0.2      1.00
0.3      1.00 *
0.4      1.00
0.5      0.75 *
0.6      0.75
0.7      0.67 *
0.8      0.67
0.9      0.38 *
1.0      0.38

Precision values without a mark are actual data; values marked * are obtained by interpolation (by convention, equal to the next actual data value).
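A small sketch of this interpolation convention (names illustrative); it reproduces the table above from the five actual data points:

# Actual (recall, precision) data points from the example above
actual = [(0.2, 1.00), (0.4, 1.00), (0.6, 0.75), (0.8, 0.67), (1.0, 0.38)]

# Interpolated precision at recall level r: the best precision achievable
# at any actual recall level >= r (here, the next actual data value)
def interpolate(r):
    return max(p for rec, p in actual if rec >= r)

for j in range(11):                 # the 11 standard recall levels 0.0 .. 1.0
    print(f"{j / 10:.1f}  {interpolate(j / 10):.2f}")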
Average Precision
Single Value Summaries
Single Value Summaries (Cont.)
Precision and Recall Appropriateness
Precision and recall have been used extensively to evaluate the retrieval performance of retrieval algorithms. However, a more careful reflection reveals problems with these two measures.
Experimental Evaluation Measures
Experimental Evaluation Methodology
The Harmonic Mean
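For reference, the harmonic mean F of recall and precision at the j-th document position is usually defined (this formula is the standard one, in the same notation as the E measure below) as:

F(j) = 2 / (1/r(j) + 1/P(j))

where r(j) is the recall and P(j) the precision at position j. F is 0 when no relevant documents have been retrieved and 1 only when both recall and precision are 1.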
The E Measure
Another measure which combines recall and precision
was proposed by van Rijsbergen and is called the E
evaluation measure. The idea is to allow the user to
specify whether he is more interested in recall or in
precision. The E measure is defined as follows.
E(j) = 1 - \frac{1 + b^2}{\frac{b^2}{r(j)} + \frac{1}{P(j)}}

where b ≥ 0 encodes the user's relative interest in recall versus precision.
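A small sketch (illustrative) computing both combined measures; note that with b = 1 the E measure reduces to 1 - F:

def f_measure(r, p):
    # harmonic mean of recall r and precision p
    return 2 / (1 / r + 1 / p)

def e_measure(r, p, b):
    # with this formula, b > 1 weights recall more heavily, b < 1 precision
    return 1 - (1 + b**2) / (b**2 / r + 1 / p)

print(f_measure(0.5, 0.33))         # ~0.40
print(e_measure(0.5, 0.33, 1.0))    # ~0.60 = 1 - F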
User-Oriented Measures

coverage = |Rk| / |U|
novelty = |Ru| / (|Ru| + |Rk|)

where U is the set of relevant documents previously known to the user, Rk is the set of retrieved documents the user already knew to be relevant, and Ru is the set of retrieved relevant documents previously unknown to the user.
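A corresponding sketch (set names as above; the example values are illustrative):

def coverage(Rk, U):
    # fraction of the relevant documents the user already knew about
    # that the system actually retrieved
    return len(Rk) / len(U)

def novelty(Ru, Rk):
    # fraction of the retrieved relevant documents that were new to the user
    return len(Ru) / (len(Ru) + len(Rk))

# e.g. the user knew 4 relevant documents; 3 were retrieved, plus 2 new ones
print(coverage(Rk={"d1", "d2", "d3"}, U={"d1", "d2", "d3", "d4"}))  # 0.75
print(novelty(Ru={"d7", "d9"}, Rk={"d1", "d2", "d3"}))              # 0.4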
User-Oriented Measures (Cont.)
History
Reference Collections
TREC - http://trec.nist.gov/
CACM - http://www.cs.hope.edu/csci481/articles.html
The TREC Collection
The collection has been criticized on two main grounds: first, that relevance is an inherently subjective notion; second, that it lacks a solid formal framework as a basic foundation. The first of these criticisms is difficult to dismiss entirely, due to the inherent degree of psychological subjectiveness associated with the task of deciding on the relevance of a given document.
Initiated under the National Institute of Standards and Technology (NIST).
Goals:
Providing a large test collection
Uniform scoring procedures
A forum for organizations interested in comparing their results
The Document Collection
The Document Collection (Cont.)
<doc>
<docno>WSJ880406-0090</docno>
<hl>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan</hl>
<author>Janet Guyon (WSJ Staff)</author>
<dateline>New York</dateline>
<text>
American Telephone & Telegraph Co. introduced the first of a new generation of phone service with broad ...
</text>
</doc>
The Example Information Requests (Topics)
<top>
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description:
..
<nar> Narrative: A ..
</top>
The Relevant Documents for Each Example Information Request

The Tasks at the TREC Conference
Ad hoc task:
Receive new information requests and execute them on a pre-specified document collection
Routing task:
Receive test information requests and two distinct document collections, the first for training and the second for testing
The Tasks at the TREC Conference (Cont.)
Other Tasks
Chinese
Filtering
Interactive
NLP (natural language processing)
Cross languages
High precision
Spoken document retrieval
Query task (TREC-7)
Evaluation Measures at the
TREC Conferences
The CACM Collection
A small collection of computer science literature, containing:
Text of the documents
Structured subfields:
word stems from the title and abstract sections
categories
direct references between documents: a list of pairs [da, db]
bibliographic coupling connections: a list of triples [d1, d2, ncited]
number of co-citations for each pair of articles: [d1, d2, nciting]
The ISI Collections (Cont.)
The Cystic Fibrosis Collection
The cystic fibrosis (CF) collection is composed of 1,239 documents indexed with the term "cystic fibrosis" in the National Library of Medicine's MEDLINE database.
Relevance scores:
0: non-relevance
1: marginal relevance
2: high relevance
Trends and Research Issues
A major trend today is research in interactive user interfaces. The motivation is a general belief that effective retrieval is highly dependent on obtaining proper feedback from the user. Thus, evaluation studies of interactive interfaces will tend to become more common in the near future. The main issues revolve around deciding which evaluation measures are most appropriate in this scenario. A typical example is the informativeness measure introduced in 1992. Furthermore, the proposal, the study, and the characterization of alternative measures to recall and precision, such as the harmonic mean and the E measure, continue to be of interest.