
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Christopher Manning and Prabhakar Raghavan
Lecture 6: Scoring, Term Weighting and the Vector Space Model

Recap of lecture 5

Collection and vocabulary statistics: Heaps' and Zipf's laws
Dictionary compression for Boolean indexes: dictionary as a string, blocking, front coding
Postings compression: gap encoding, prefix-unique codes, variable byte and gamma codes

                                         size in MB
collection (text, xml markup etc)         3,600.0
collection (text)                           960.0
term-document incidence matrix           40,000.0
postings, uncompressed (32-bit words)       400.0
postings, uncompressed (20 bits)            250.0
postings, variable byte encoded             116.0
postings, gamma encoded                     101.0

This lecture; IIR Sections 6.2-6.4.3

Ranked retrieval
Scoring documents
Term frequency
Collection statistics
Weighting schemes
Vector space scoring

Ranked retrieval (Ch. 6)

Thus far, our queries have all been Boolean: documents either match or don't.
Good for expert users with a precise understanding of their needs and the collection.
Also good for applications: applications can easily consume 1000s of results.
Not good for the majority of users.
Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
Most users don't want to wade through 1000s of results.
This is particularly true of web search.

Problem with Boolean search: feast or famine (Ch. 6)

Boolean queries often result in either too few (=0) or too many (1000s) results.
Query 1: "standard user dlink 650" -> 200,000 hits
Query 2: "standard user dlink 650 no card found" -> 0 hits
It takes a lot of skill to come up with a query that produces a manageable number of hits.
AND gives too few; OR gives too many.

Ranked retrieval models

Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query.
Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language.
In principle, these are two separate choices, but in practice, ranked retrieval models have normally been associated with free text queries and vice versa.

Feast or famine: not a problem in ranked retrieval (Ch. 6)

When a system produces a ranked result set, large result sets are not an issue.
Indeed, the size of the result set is not an issue.
We just show the top k (about 10) results.
We don't overwhelm the user.
Premise: the ranking algorithm works.

Scoring as the basis of ranked retrieval (Ch. 6)

We wish to return, in order, the documents most likely to be useful to the searcher.
How can we rank-order the documents in the collection with respect to a query?
Assign a score, say in [0, 1], to each document.
This score measures how well document and query match.

Query-document matching scores (Ch. 6)

We need a way of assigning a score to a query/document pair.
Let's start with a one-term query.
If the query term does not occur in the document: the score should be 0.
The more frequent the query term in the document, the higher the score (should be).
We will look at a number of alternatives for this.

Take 1: Jaccard coefficient (Ch. 6)

Recall from Lecture 3: a commonly used measure of the overlap of two sets A and B:
jaccard(A, B) = |A ∩ B| / |A ∪ B|
jaccard(A, A) = 1
jaccard(A, B) = 0 if A ∩ B = ∅
A and B don't have to be the same size.
Always assigns a number between 0 and 1.

Jaccard coefficient: scoring example (Ch. 6)

What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
Query: ides of march
Document 1: caesar died in march
Document 2: the long march
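
A minimal sketch in Python of the computation behind this example, treating each text as a set of word types (the function name is illustrative, not from the slides):

    # Jaccard coefficient on word sets: |A ∩ B| / |A ∪ B|
    def jaccard(a, b):
        a, b = set(a.split()), set(b.split())
        return len(a & b) / len(a | b)

    query = "ides of march"
    print(jaccard(query, "caesar died in march"))  # 1/6 ≈ 0.167 ("march" is the only shared term)
    print(jaccard(query, "the long march"))        # 1/5 = 0.2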

Issues with Jaccard for scoring (Ch. 6)

It doesn't consider term frequency (how many times a term occurs in a document).
Rare terms in a collection are more informative than frequent terms; Jaccard doesn't consider this information.
We need a more sophisticated way of normalizing for length.
Later in this lecture, we'll use |A ∩ B| / √|A ∪ B| instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.

Recall (Lecture 1): Binary term-document incidence matrix (Sec. 6.2)

term        Antony and   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra    Caesar   Tempest
Antony          1           1        0         0         0         1
Brutus          1           1        0         1         0         0
Caesar          1           1        0         1         1         1
Calpurnia       0           1        0         0         0         0
Cleopatra       1           0        0         0         0         0
mercy           1           0        1         1         1         1
worser          1           0        1         1         1         0

Each document is represented by a binary vector ∈ {0,1}^|V|.

Term-document count matrices (Sec. 6.2)

Consider the number of occurrences of a term in a document:
Each document is a count vector in ℕ^|V|: a column below.

term        Antony and   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra    Caesar   Tempest
Antony         157          73       0         0         0         0
Brutus           4         157       0         1         0         0
Caesar         232         227       0         2         1         1
Calpurnia        0          10       0         0         0         0
Cleopatra       57           0       0         0         0         0
mercy            2           0       3         5         5         1
worser           2           0       1         1         1         0

Bag of words model

Vector representation doesn't consider the ordering of words in a document.
"John is quicker than Mary" and "Mary is quicker than John" have the same vectors.
This is called the bag of words model.
In a sense, this is a step back: the positional index was able to distinguish these two documents.
We will look at recovering positional information later in this course.
For now: bag of words model.

Term frequency tf

The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d.
We want to use tf when computing query-document match scores. But how?
Raw term frequency is not what we want:
A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
But not 10 times more relevant.
Relevance does not increase proportionally with term frequency.

NB: frequency = count in IR

Log-frequency weighting (Sec. 6.2)

The log frequency weight of term t in d is:

w_t,d = 1 + log10(tf_t,d)   if tf_t,d > 0
w_t,d = 0                   otherwise

0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4, etc.

Score for a document-query pair: sum over terms t in both q and d:

score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_t,d)

The score is 0 if none of the query terms is present in the document.
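
A minimal Python sketch of this overlap score; term frequencies are supplied as plain dicts and the names are illustrative:

    import math

    def log_tf(tf):
        # 1 + log10(tf) for tf > 0, else 0
        return 1 + math.log10(tf) if tf > 0 else 0

    def overlap_score(query_terms, doc_tf):
        # sum of log-tf weights over terms appearing in both query and document
        return sum(log_tf(doc_tf.get(t, 0)) for t in query_terms)

    doc = {"caesar": 2, "march": 10, "died": 1}
    print(overlap_score({"ides", "of", "march"}, doc))  # 1 + log10(10) = 2.0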

Document frequency (Sec. 6.2.1)

Rare terms are more informative than frequent terms.
Recall stop words.
Consider a term in the query that is rare in the collection (e.g., arachnocentric).
A document containing this term is very likely to be relevant to the query arachnocentric.
We want a high weight for rare terms like arachnocentric.

Document frequency, continued (Sec. 6.2.1)

Frequent terms are less informative than rare terms.
Consider a query term that is frequent in the collection (e.g., high, increase, line).
A document containing such a term is more likely to be relevant than a document that doesn't.
But it's not a sure indicator of relevance.
For frequent terms, we want high positive weights for words like high, increase, and line,
but lower weights than for rare terms.
We will use document frequency (df) to capture this.

idf weight (Sec. 6.2.1)

df_t is the document frequency of t: the number of documents that contain t.
df_t is an inverse measure of the informativeness of t.
df_t ≤ N.
We define the idf (inverse document frequency) of t by:

idf_t = log10(N / df_t)

We use log(N/df_t) instead of N/df_t to dampen the effect of idf.
It will turn out that the base of the log is immaterial.

idf example, suppose N = 1 million (Sec. 6.2.1)

term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0

idf_t = log10(N / df_t)

There is one idf value for each term t in a collection.
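
The idf column follows directly from the formula; a one-liner sketch in Python:

    import math

    N = 1_000_000
    for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                     ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
        print(term, math.log10(N / df))  # 6, 4, 3, 2, 1, 0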

Effect of idf on ranking

Does idf have an effect on ranking for one-term queries, like "iPhone"?
idf has no effect on ranking one-term queries.
idf affects the ranking of documents for queries with at least two terms.
For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

Collection vs. Document frequency (Sec. 6.2.1)

The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
Example:

word        collection frequency   document frequency
insurance   10440                  3997
try         10422                  8760

Which word is a better search term (and should get a higher weight)?

tf-idf weighting (Sec. 6.2.2)

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_t,d = (1 + log10 tf_t,d) × log10(N / df_t)

Best known weighting scheme in information retrieval.
Note: the "-" in tf-idf is a hyphen, not a minus sign!
Alternative names: tf.idf, tf x idf.
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.
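
Putting tf and idf together, a minimal sketch (function and parameter names are illustrative):

    import math

    def tfidf(tf, df, N):
        # (1 + log10 tf) * log10(N / df); zero if the term is absent from the doc
        if tf == 0:
            return 0.0
        return (1 + math.log10(tf)) * math.log10(N / df)

    print(tfidf(tf=10, df=1_000, N=1_000_000))  # 2.0 * 3.0 = 6.0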

Final ranking of documents for a query (Sec. 6.2.2)

Score(q, d) = Σ_{t ∈ q ∩ d} tf.idf_t,d

Binary -> count -> weight matrix (Sec. 6.3)

term        Antony and   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra    Caesar   Tempest
Antony         5.25        3.18     0         0         0        0.35
Brutus         1.21        6.1      0         1         0        0
Caesar         8.59        2.54     0         1.51      0.25     0
Calpurnia      0           1.54     0         0         0        0
Cleopatra      2.85        0        0         0         0        0
mercy          1.51        0        1.9       0.12      5.25     0.88
worser         1.37        0        0.11      4.15      0.25     1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.

Documents as vectors (Sec. 6.3)

So we have a |V|-dimensional vector space.
Terms are axes of the space.
Documents are points or vectors in this space.
Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
These are very sparse vectors; most entries are zero.

Queries as vectors (Sec. 6.3)

Key idea 1: do the same for queries: represent them as vectors in the space.
Key idea 2: rank documents according to their proximity to the query in this space.
proximity = similarity of vectors
proximity ≈ inverse of distance
Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model.
Instead: rank more relevant documents higher than less relevant documents.

Formalizing vector space proximity (Sec. 6.3)

First cut: distance between two points
(= distance between the end points of the two vectors)
Euclidean distance?
Euclidean distance is a bad idea...
...because Euclidean distance is large for vectors of different lengths.

Why distance is a bad idea (Sec. 6.3)

[Figure: a query q and documents d1, d2, d3 plotted in a two-term space.] The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Use angle instead of distance (Sec. 6.3)

Thought experiment: take a document d and append it to itself. Call this document d′.
Semantically, d and d′ have the same content.
The Euclidean distance between the two documents can be quite large.
The angle between the two documents is 0, corresponding to maximal similarity.
Key idea: rank documents according to angle with query.

From angles to cosines (Sec. 6.3)

The following two notions are equivalent:
Rank documents in decreasing order of the angle between query and document.
Rank documents in increasing order of cosine(query, document).
Cosine is a monotonically decreasing function for the interval [0°, 180°].

From angles to cosines (Sec. 6.3)

But how, and why, should we be computing cosines?

Length normalization (Sec. 6.3)

A vector can be (length-) normalized by dividing each of its components by its length; for this we use the L2 norm:

||x||_2 = sqrt(Σ_i x_i²)

Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere).
Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length normalization.
Long and short documents now have comparable weights.
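
A small sketch of the d vs. d′ thought experiment, using raw counts for simplicity (appending d to itself doubles every raw count; names are illustrative):

    import math

    def l2_normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    d = [3.0, 4.0]
    d_doubled = [2 * x for x in d]  # d appended to itself: every raw count doubles
    print(l2_normalize(d))          # [0.6, 0.8]
    print(l2_normalize(d_doubled))  # [0.6, 0.8]: identical after normalization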

cosine(query, document) (Sec. 6.3)

cos(q, d) = (q · d) / (|q| |d|) = (q / |q|) · (d / |d|)
          = Σ_{i=1}^{|V|} q_i d_i / ( sqrt(Σ_{i=1}^{|V|} q_i²) × sqrt(Σ_{i=1}^{|V|} d_i²) )

q · d is the dot product; q/|q| and d/|d| are unit vectors.
q_i is the tf-idf weight of term i in the query.
d_i is the tf-idf weight of term i in the document.
cos(q, d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.
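
A direct transcription of the formula, as a sketch:

    import math

    def cosine(q, d):
        # cos(q, d) = q·d / (|q| |d|)
        dot = sum(qi * di for qi, di in zip(q, d))
        return dot / (math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d)))

    print(cosine([1, 1, 0], [2, 2, 0]))  # 1.0: same direction, different length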

Cosine for length-normalized vectors

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σ_{i=1}^{|V|} q_i d_i

for q, d length-normalized.

Cosine similarity illustrated

[Figure: unit vectors for a query and several documents on the unit circle in a two-term space; similarity is the cosine of the angle between a document vector and the query vector.]

Cosine similarity amongst 3 documents (Sec. 6.3)

How similar are the novels
SaS: Sense and Sensibility
PaP: Pride and Prejudice, and
WH: Wuthering Heights?

Term frequencies (counts):

term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38

Note: to simplify this example, we don't do idf weighting.

3 documents example contd. (Sec. 6.3)

Log frequency weighting:

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0 + 0 × 0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69

Why do we have cos(SaS, PaP) > cos(SaS, WH)?
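
A sketch reproducing the whole pipeline for this example (log-tf, L2-normalize, dot product; the dict layout is an assumption):

    import math

    counts = {
        "SaS": [115, 10, 2, 0],   # affection, jealous, gossip, wuthering
        "PaP": [58, 7, 0, 0],
        "WH":  [20, 11, 6, 38],
    }

    def log_tf_vec(v):
        return [1 + math.log10(x) if x > 0 else 0 for x in v]

    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    vecs = {k: normalize(log_tf_vec(v)) for k, v in counts.items()}
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    print(round(dot(vecs["SaS"], vecs["PaP"]), 2))  # ≈ 0.94
    print(round(dot(vecs["SaS"], vecs["WH"]), 2))   # ≈ 0.79
    print(round(dot(vecs["PaP"], vecs["WH"]), 2))   # ≈ 0.69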

Computing cosine scores (Sec. 6.3)

[Figure: the CosineScore(q) algorithm (IIR Figure 6.14), which accumulates per-document scores term-at-a-time over the postings lists, divides each score by the document length, and returns the top K components.]
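
A Python sketch of that term-at-a-time algorithm over a toy in-memory inverted index; the data layout and names here are assumptions, not the slide's pseudocode:

    import heapq

    def cosine_score(query_wts, postings, length, k=10):
        """query_wts: {term: query weight}; postings: {term: [(doc_id, w_td), ...]};
        length: {doc_id: L2 norm of the document vector}."""
        scores = {}
        for t, wq in query_wts.items():
            for d, wtd in postings.get(t, []):
                scores[d] = scores.get(d, 0.0) + wtd * wq  # accumulate dot products
        for d in scores:
            scores[d] /= length[d]  # cosine normalization by document length
        return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])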

tf-idf weighting has many variants (Sec. 6.4)

[Table: SMART notation components, IIR Figure 6.15.]

Term frequency:
  n (natural)    tf_t,d
  l (logarithm)  1 + log(tf_t,d)
  a (augmented)  0.5 + 0.5 × tf_t,d / max_t(tf_t,d)
  b (boolean)    1 if tf_t,d > 0, else 0
  L (log ave)    (1 + log tf_t,d) / (1 + log(ave_{t∈d} tf_t,d))

Document frequency:
  n (no)        1
  t (idf)       log(N / df_t)
  p (prob idf)  max{0, log((N - df_t) / df_t)}

Normalization:
  n (none)            1
  c (cosine)          1 / sqrt(w_1² + w_2² + ... + w_M²)
  u (pivoted unique)  1/u
  b (byte size)       1/CharLength^α, α < 1

Columns headed "n" are acronyms for weight schemes.
Why is the base of the log in idf immaterial?

Weighting may differ in queries vs documents (Sec. 6.4)

Many search engines allow for different weightings for queries vs. documents.
SMART notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table.
A very standard weighting scheme is: lnc.ltc
Document: logarithmic tf (l as first character), no idf, and cosine normalization.
(No idf for documents: a bad idea?)
Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization.

tf-idf example: lnc.ltc (Sec. 6.4)

Document: car insurance auto insurance
Query: best car insurance

                 Query                                    Document                     Prod
term        tf-raw  tf-wt  df      idf  wt    n'lized     tf-raw  tf-wt  wt    n'lized
auto        0       0      5000    2.3  0     0           1       1      1     0.52     0
best        1       1      50000   1.3  1.3   0.34        0       0      0     0        0
car         1       1      10000   2.0  2.0   0.52        1       1      1     0.52     0.27
insurance   1       1      1000    3.0  3.0   0.78        2       1.3    1.3   0.68     0.53

Exercise: what is N, the number of docs?
Doc length = sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
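
A sketch verifying the document side and the final score; the normalized query weights are taken straight from the table above:

    import math

    # lnc document: log tf, no idf, cosine-normalize
    doc_tf = {"car": 1, "insurance": 2, "auto": 1}
    doc_wt = {t: 1 + math.log10(tf) for t, tf in doc_tf.items()}
    length = math.sqrt(sum(w * w for w in doc_wt.values()))   # ≈ 1.92
    doc_nlz = {t: w / length for t, w in doc_wt.items()}

    # ltc query weights after normalization, as in the table
    query_nlz = {"best": 0.34, "car": 0.52, "insurance": 0.78}

    score = sum(query_nlz.get(t, 0) * w for t, w in doc_nlz.items())
    print(round(score, 2))  # ≈ 0.8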

Summary: vector space ranking

Represent the query as a weighted tf-idf vector.
Represent each document as a weighted tf-idf vector.
Compute the cosine similarity score for the query vector and each document vector.
Rank documents with respect to the query by score.
Return the top K (e.g., K = 10) to the user.

Resources for today's lecture (Ch. 6)

IIR 6.2-6.4.3
http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
Term weighting and cosine similarity tutorial for SEO folk!
