DD2476 Search Engines and Information Retrieval Systems
Lecture 4: Scoring, Weighting, Vector Space Model
Hedvig Kjellström, hedvig@kth.se, www.csc.kth.se/DD2476
Ranked Retrieval
! Boolean queries are good for:
-! Expert users with a precise understanding of their needs and the collection
-! Computer applications
! But it takes a lot of skill to come up with a Boolean query that produces a manageable number of hits
-! AND gives too few; OR gives too many
Today
! Tf-idf and the vector space model (Manning Chapter 6.2-4)
-! Term frequency, collection statistics
-! Vector space scoring and ranking
! Wish to return, in order, the documents most likely to be useful to the searcher
! Rank-order documents with respect to a query
! Assign a score, say in [0, 1], to each document
-! Measures how well document and query match
Binary Term-Document Incidence Matrix
! Each document is represented by a binary vector in {0,1}^|V|:

term        Antony and  Julius   The       Hamlet  Othello  Macbeth
            Cleopatra   Caesar   Tempest
ANTONY          1          1        0         0       0        0
BRUTUS          1          1        0         1       0        0
CAESAR          1          1        0         1       1        1
CALPURNIA       0          1        0         0       0        0
CLEOPATRA       1          0        0         0       0        0
MERCY           1          0        1         1       1        1
WORSER          1          0        1         1       1        0

! Term not in document: score 0
! More appearances of the term in a document: higher score
Term-Document Count Matrix
! Each document is represented by a count vector:

term        Antony and  Julius   The       Hamlet  Othello  Macbeth
            Cleopatra   Caesar   Tempest
ANTONY         157         73       0         0       0        0
BRUTUS           4        157       0         1       0        0
CAESAR         232        227       0         2       1        1
CALPURNIA        0         10       0         0       0        0
CLEOPATRA       57          0       0         0       0        0
MERCY            2          0       3         5       5        1
WORSER           2          0       1         1       1        0

! This is called the bag of words model
! In a sense, a step back: the positional index was able to distinguish documents containing the same words in a different order (e.g. JOHN IS QUICKER THAN MARY vs. MARY IS QUICKER THAN JOHN)
! Task 2.1 Ranked Retrieval: back to bag-of-words
Term Frequency tf
! The term frequency tf_t,d of term t in document d: the number of times that t occurs in d (frequency = count in IR)
! How do we use tf for query-document match scores?
! Raw term frequency is not what we want:
-! A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term
-! But not 10 times more relevant

Log-Frequency Weighting
! Log-frequency weight of term t in document d:
   w_t,d = 1 + log10(tf_t,d) if tf_t,d > 0, otherwise 0
! Score for a document-query pair: sum over terms t appearing in both q and d:
   score(q,d) = sum_{t in q and d} (1 + log10 tf_t,d)
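A minimal Python sketch of log-frequency weighting and the resulting overlap score; the function names and example counts are illustrative, not part of the course code:

    import math

    def log_tf_weight(tf):
        # w_t,d = 1 + log10(tf_t,d) if tf_t,d > 0, otherwise 0
        return 1 + math.log10(tf) if tf > 0 else 0.0

    def overlap_score(query_terms, doc_tf):
        # score(q,d) = sum over terms t in both q and d of (1 + log10 tf_t,d)
        return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

    doc = {"antony": 157, "brutus": 4, "caesar": 232}    # term counts in one document
    print(overlap_score(["antony", "calpurnia"], doc))   # only ANTONY contributes: ~3.20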
Exercise 2 Minutes
! Document d with the terms AIRPLANE, SHAKESPEARE, CALPURNIA, UNDER, THE
! Compute the score for d and a query q using the formula above
Document Frequency
! Rare terms are more informative than frequent terms
! Example: rare word ARACHNOCENTRIC
-! A document containing this term is very likely to be relevant to the query ARACHNOCENTRIC
-! → We want a high weight for rare terms like ARACHNOCENTRIC
idf Weight
! df_t is the document frequency of term t: the number of documents that contain t
-! df_t is an inverse measure of the informativeness of t
-! df_t <= N
! The idf (inverse document frequency) weight of t:
   idf_t = log10(N / df_t)
-! log is used instead of N/df_t to dampen the effect of idf
Exercise 2 Minutes
! Suppose N = 1,000,000; compute idf_t for each term:

term        df_t
CALPURNIA           1
ANIMAL            100
SUNDAY          1,000
FLY            10,000
UNDER         100,000
THE         1,000,000
! Does idf have an effect on ranking for one-term queries, like IPHONE?
-! No: idf only affects the ranking for queries with at least two terms
-! Query CAPRICIOUS PERSON: idf puts more weight on CAPRICIOUS than on PERSON
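As a check on the exercise above, a small Python sketch (assuming base-10 logarithms, as in the formula) computing idf_t for the listed document frequencies:

    import math

    N = 1_000_000          # documents in the collection

    df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
          "fly": 10_000, "under": 100_000, "the": 1_000_000}

    for term, df_t in df.items():
        idf_t = math.log10(N / df_t)          # idf_t = log10(N / df_t)
        print(f"{term:10s} df={df_t:>9} idf={idf_t:g}")
    # calpurnia -> 6, animal -> 4, sunday -> 3, fly -> 2, under -> 1, the -> 0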
tf-idf Weighting
! tf-idf weight of a term: the product of its tf weight and its idf weight
   w_t,d = (1 + log10 tf_t,d) * log10(N / df_t)
! Increases with the number of occurrences within a document
! Increases with the rarity of the term in the collection

! Collection frequency vs. document frequency of two words:

word        collection frequency   document frequency
INSURANCE         10440                  3997
TRY               10422                  8760

! Which word is a better search term (and should get a higher weight)?
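A small sketch combining the two weights into tf-idf; the helper name and the example numbers are illustrative only:

    import math

    def tf_idf(tf, df, N):
        # w_t,d = (1 + log10 tf_t,d) * log10(N / df_t); 0 if the term is absent
        if tf == 0 or df == 0:
            return 0.0
        return (1 + math.log10(tf)) * math.log10(N / df)

    print(tf_idf(tf=10, df=100, N=1_000_000))       # frequent in doc, rare in collection: 2 * 4 = 8
    print(tf_idf(tf=10, df=100_000, N=1_000_000))   # frequent in doc, common in collection: 2 * 1 = 2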
Binary → Count → Weight Matrix
! Each document is now represented by a real-valued vector of tf-idf weights
[Table: tf-idf weight matrix for ANTONY, BRUTUS, CAESAR, CALPURNIA, CLEOPATRA, MERCY, WORSER in the six plays; e.g. the column for The Tempest is 0, 0, 0, 0, 0, 1.9, 0.11]

Documents as Vectors
! So we have a |V|-dimensional vector space
-! Terms are axes/dimensions
-! Documents are points/vectors in this space
! Very high-dimensional
-! On the order of 10^7 dimensions for a web search engine
Queries as Vectors
! Key idea 1: Represent queries as vectors in the same space
! Key idea 2: Rank documents according to their proximity to the query in this space
-! proximity = similarity of vectors
-! proximity ≈ inverse of distance
! Recall:
-! Get away from the Boolean model
-! Rank more relevant documents higher than less relevant documents
! Euclidean distance is a bad idea...
! ...because Euclidean distance is large for vectors of different lengths
-! What determines length here?
Exercise 5 Minutes
! Euclidean distance is bad for
-! vectors of different length (documents with different numbers of words)
-! high-dimensional vectors (large dictionaries)
! Discuss in pairs:
-! Can you come up with a better difference measure?
! Any ideas?
! Use the angle between query and document vectors instead of distance
-! Ranking documents in increasing order of the angle is equivalent to ranking them in decreasing order of cos(angle)
-! cos is monotonically decreasing on the interval [0°, 180°]
Length Normalization
! Computing cosine similarity involves length-normalizing the document and query vectors
! L2 norm:
   ||x||_2 = sqrt( sum_{i=1}^{|V|} x_i^2 )
! Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere)
! Recall:
-! Length is unimportant
-! Rank documents according to their angle from the query

Cosine Similarity
! cos(q, d) = (q . d) / (||q||_2 ||d||_2) = sum_{i=1}^{|V|} q_i d_i / ( sqrt(sum_{i=1}^{|V|} q_i^2) * sqrt(sum_{i=1}^{|V|} d_i^2) )
-! q . d is the dot (scalar, inner) product of q and d
! q_i is the tf-idf weight of term i in the query
! d_i is the tf-idf weight of term i in the document
! cos(q, d) is the cosine similarity of q and d = the cosine of the angle between q and d
Cosine Similarity
! In reality:
-! Length-normalize each document when it is added to the index: d' = d / ||d||_2
-! Length-normalize the query: q' = q / ||q||_2
-! Cosine similarity is then just a dot product: cos(q, d) = q' . d' = sum_{i=1}^{|V|} q'_i d'_i

Example: three novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), WH (Wuthering Heights)
! Term counts (log-frequency weighted and length-normalized below; no idf weighting):

term        SaS   PaP   WH
AFFECTION   115    58   20
JEALOUS      10     7   11
GOSSIP        2     0    6
WUTHERING     0     0   38

cos(SaS,PaP) ≈ 0.789*0.832 + 0.515*0.555 + 0.335*0 + 0*0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
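A short sketch that recomputes the cosine values above from the raw counts (log-frequency weighting, length normalization, no idf):

    import math

    def log_tf(tf):
        return 1 + math.log10(tf) if tf > 0 else 0.0

    def normalize(vec):
        norm = math.sqrt(sum(w * w for w in vec))     # L2 norm
        return [w / norm for w in vec]

    def cosine(u, v):
        return sum(a * b for a, b in zip(u, v))       # dot product of normalized vectors

    counts = {"SaS": [115, 10, 2, 0],                 # AFFECTION, JEALOUS, GOSSIP, WUTHERING
              "PaP": [58, 7, 0, 0],
              "WH":  [20, 11, 6, 38]}
    vecs = {n: normalize([log_tf(c) for c in cnt]) for n, cnt in counts.items()}

    print(round(cosine(vecs["SaS"], vecs["PaP"]), 2))  # ~0.94
    print(round(cosine(vecs["SaS"], vecs["WH"]), 2))   # ~0.79
    print(round(cosine(vecs["PaP"], vecs["WH"]), 2))   # ~0.69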
Efficient Cosine Ranking
! Do selection:
-! Avoid visiting all documents
! We already do some selection:
-! Sparse term-document incidence matrix, |d| << N
-! Many cosine scores = 0
-! Only visit documents with nonzero cosine scores, i.e. at least one term in common with the query (see the term-at-a-time sketch below)

Generic Approach
! Find a set A of contenders, with K < |A| << N, and return the top K documents in A
! Think of A as pruning non-contenders
! The same approach can be used for any scoring function
! We will look at several schemes following this approach
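A rough term-at-a-time scoring sketch over a toy inverted index, illustrating that only documents sharing at least one term with the query are ever touched; the data layout and names are assumptions, not the course's framework:

    from collections import defaultdict

    # toy inverted index: term -> list of (docID, weight w_t,d); weights and IDs are made up
    index = {"brutus":    [(1, 1.2), (2, 0.4)],
             "caesar":    [(1, 0.9), (2, 1.1), (4, 0.3)],
             "calpurnia": [(2, 2.0)]}
    doc_length = {1: 1.5, 2: 2.4, 4: 0.3}             # precomputed vector lengths ||d||

    def cosine_scores(query_weights, k=2):
        scores = defaultdict(float)
        for term, w_tq in query_weights.items():       # one query term at a time
            for doc_id, w_td in index.get(term, []):   # walk its postings list only
                scores[doc_id] += w_tq * w_td          # accumulate partial dot products
        for doc_id in scores:                          # normalize by document length
            scores[doc_id] /= doc_length[doc_id]
        return sorted(scores.items(), key=lambda x: -x[1])[:k]

    print(cosine_scores({"brutus": 1.0, "calpurnia": 1.0}))   # doc 4 is never visited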
Index Elimination
[Figure: postings lists of the query terms (docIDs); only documents appearing in several of the lists are considered]
! Benefit:
-! Postings lists of low-idf terms have many documents → these documents are eliminated from the set A of contenders
Exercise 5 Minutes
! Index Elimination: consider only high-idf query terms, and only documents containing many of the query terms
! Champion Lists: for each term t, consider only the r documents with the highest tf-idf_t,d values (see the sketch below)
! Benefit:
-! At query time, only compute scores for documents in the champion lists → fast
! Issue:
-! r is chosen at index build time
-! Too large: slow
-! Too small: r < K, so we may not find K documents to return
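A minimal champion-list sketch; r is fixed at build time, and the postings data here are made up for illustration:

    import heapq
    from collections import defaultdict

    def build_champion_lists(index, r):
        # keep, for each term, only the r postings with the highest weight (done at index build time)
        return {t: heapq.nlargest(r, postings, key=lambda p: p[1])
                for t, postings in index.items()}

    def score_with_champions(champions, query_terms):
        scores = defaultdict(float)
        for t in query_terms:                          # only champion documents are ever scored
            for doc_id, w in champions.get(t, []):
                scores[doc_id] += w
        return sorted(scores.items(), key=lambda x: -x[1])

    index = {"capricious": [(1, 3.1), (5, 0.2), (9, 2.7)],
             "person":     [(1, 0.4), (5, 0.6), (7, 0.1)]}
    champions = build_champion_lists(index, r=2)       # if r < K we may return fewer than K documents
    print(score_with_champions(champions, ["capricious", "person"]))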
Static Quality Scores
! More in Lecture 5

Cluster Pruning
! Preprocessing: pick sqrt(N) documents at random as leaders; attach every other document (follower) to its nearest leader
! Query processing: find the leader closest to the query, then rank only that leader's followers (see the sketch below)
! Variant:
-! More than one leader for each document
! More in Lecture 11
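A rough cluster-pruning sketch along the lines above (random sqrt(N) leaders, followers attached to the nearest leader, query routed to the nearest leader's cluster); the toy vectors stand in for length-normalized tf-idf vectors:

    import math, random

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def nearest(vec, candidates, docs):
        # the candidate docID with the highest similarity to vec (vectors assumed length-normalized)
        return max(candidates, key=lambda d: dot(vec, docs[d]))

    def build_clusters(docs):
        leaders = random.sample(list(docs), max(1, int(math.sqrt(len(docs)))))
        clusters = {l: [] for l in leaders}
        for d, vec in docs.items():                    # attach every document to its nearest leader
            clusters[nearest(vec, leaders, docs)].append(d)
        return clusters

    def query_clusters(q, clusters, docs, k=2):
        leader = nearest(q, clusters, docs)            # 1. find the leader closest to the query
        followers = clusters[leader]                   # 2. rank only that leader's cluster
        return sorted(followers, key=lambda d: -dot(q, docs[d]))[:k]

    docs = {i: [random.random(), random.random()] for i in range(16)}   # toy 2-d document vectors
    print(query_clusters([1.0, 0.0], build_clusters(docs), docs))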
Zones
! Zone: a part of the document, e.g. title or body, that can contain an arbitrary amount of text
! Zone index: postings for the terms in each zone

Query Term Proximity and Order
! Ranking:
-! Term proximity in documents: higher score for query terms that appear close together
-! Term order: higher score for query terms that appear in the right order
Query Parser
! Query phrase: RISING INTEREST RATES
! Sequence (a sketch follows below):
-! Run it as a phrase query
-! If fewer than K documents contain the phrase RISING INTEREST RATES, run the phrase queries RISING INTEREST and INTEREST RATES
-! If still fewer than K docs, run the vector space query RISING INTEREST RATES
-! Rank the matching docs by vector space scoring
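A sketch of this fallback sequence; phrase_search and vector_space_search are hypothetical stand-ins for the real index operations:

    def parse_and_run(query_terms, K, phrase_search, vector_space_search):
        docs = set(phrase_search(query_terms))                 # 1. full phrase, e.g. RISING INTEREST RATES
        if len(docs) < K:
            for i in range(len(query_terms) - 1):              # 2. shorter phrases: RISING INTEREST, INTEREST RATES
                docs |= set(phrase_search(query_terms[i:i + 2]))
        if len(docs) < K:
            docs |= set(vector_space_search(query_terms, K))   # 3. plain vector space query
        return vector_space_search(query_terms, K, restrict_to=docs)   # 4. rank matches by vector space scoring

    # toy stand-ins: a tiny phrase "index" and a scorer that just returns docIDs
    phrases = {("rising", "interest", "rates"): {3}, ("interest", "rates"): {3, 7, 9}}
    ps = lambda terms: phrases.get(tuple(terms), set())
    vs = lambda terms, K, restrict_to=None: sorted(restrict_to or set())[:K]
    print(parse_and_run(["rising", "interest", "rates"], K=3, phrase_search=ps, vector_space_search=vs))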
Next
! Assignment 1 left? Email Johan or Hedvig
! Lecture 5 (February 8, 10.15-12.00)
-! D3
-! Readings: Manning Chapter 21; Avrachenkov Sections 1-2