
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Christopher Manning and Prabhakar Raghavan
Lecture 6: Scoring, Term Weighting and the Vector Space Model

Recap of lecture 5

Collection and vocabulary statistics: Heaps' and Zipf's laws
Dictionary compression for Boolean indexes: dictionary as a string, blocking, front coding
Postings compression: gap encoding, prefix-unique codes, variable byte and gamma codes

                                         size in MB
collection (text, xml markup etc)         3,600.0
collection (text)                           960.0
term-document incidence matrix           40,000.0
postings, uncompressed (32-bit words)       400.0
postings, uncompressed (20 bits)            250.0
postings, variable byte encoded             116.0
postings, gamma encoded                     101.0

This lecture; IIR Sections 6.2-6.4.3

Ranked retrieval
Scoring documents
Term frequency
Collection statistics
Weighting schemes
Vector space scoring

Ranked retrieval (Ch. 6)

Thus far, our queries have all been Boolean: documents either match or don't.
Good for expert users with a precise understanding of their needs and the collection.
Also good for applications: applications can easily consume 1000s of results.
Not good for the majority of users.
Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
Most users don't want to wade through 1000s of results.
This is particularly true of web search.

Problem with Boolean search: feast or famine (Ch. 6)

Boolean queries often result in either too few (=0) or too many (1000s) results.
Query 1: "standard user dlink 650" -> 200,000 hits
Query 2: "standard user dlink 650 no card found" -> 0 hits
It takes a lot of skill to come up with a query that produces a manageable number of hits.
AND gives too few; OR gives too many.

Ranked retrieval models

Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query.
Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language.
In principle, these are two separate choices, but in practice, ranked retrieval models have normally been associated with free text queries and vice versa.

Feast or famine: not a problem in ranked retrieval (Ch. 6)

When a system produces a ranked result set, large result sets are not an issue.
Indeed, the size of the result set is not an issue.
We just show the top k (about 10) results.
We don't overwhelm the user.
Premise: the ranking algorithm works.

Scoring as the basis of ranked retrieval (Ch. 6)

We wish to return, in order, the documents most likely to be useful to the searcher.
How can we rank-order the documents in the collection with respect to a query?
Assign a score, say in [0, 1], to each document.
This score measures how well document and query match.

Query-document matching scores (Ch. 6)

We need a way of assigning a score to a query/document pair.
Let's start with a one-term query.
If the query term does not occur in the document: the score should be 0.
The more frequent the query term in the document, the higher the score (should be).
We will look at a number of alternatives for this.

Take 1: Jaccard coefficient (Ch. 6)

Recall from Lecture 3: a commonly used measure of the overlap of two sets A and B:
jaccard(A, B) = |A ∩ B| / |A ∪ B|
jaccard(A, A) = 1
jaccard(A, B) = 0 if A ∩ B = ∅
A and B don't have to be the same size.
Always assigns a number between 0 and 1.

Jaccard coefficient: scoring example (Ch. 6)

What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
Query: ides of march
Document 1: caesar died in march
Document 2: the long march
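
A minimal sketch in Python of the computation behind this example, treating each text as a set of word types (the function name is illustrative, not from the slides):

    # Jaccard coefficient on word sets: |A ∩ B| / |A ∪ B|
    def jaccard(a, b):
        a, b = set(a.split()), set(b.split())
        return len(a & b) / len(a | b)

    query = "ides of march"
    print(jaccard(query, "caesar died in march"))  # 1/6 ≈ 0.167 ("march" is the only shared term)
    print(jaccard(query, "the long march"))        # 1/5 = 0.2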

Issues with Jaccard for scoring (Ch. 6)

It doesn't consider term frequency (how many times a term occurs in a document).
Rare terms in a collection are more informative than frequent terms; Jaccard doesn't consider this information.
We need a more sophisticated way of normalizing for length.
Later in this lecture, we'll use |A ∩ B| / √|A ∪ B| instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.

Recall (Lecture 1): Binary term-document incidence matrix (Sec. 6.2)

term        Antony and   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra    Caesar   Tempest
Antony          1           1        0         0         0         1
Brutus          1           1        0         1         0         0
Caesar          1           1        0         1         1         1
Calpurnia       0           1        0         0         0         0
Cleopatra       1           0        0         0         0         0
mercy           1           0        1         1         1         1
worser          1           0        1         1         1         0

Each document is represented by a binary vector ∈ {0,1}^|V|.

Term-document count matrices (Sec. 6.2)

Consider the number of occurrences of a term in a document:
Each document is a count vector in ℕ^|V|: a column below.

term        Antony and   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra    Caesar   Tempest
Antony         157          73       0         0         0         0
Brutus           4         157       0         1         0         0
Caesar         232         227       0         2         1         1
Calpurnia        0          10       0         0         0         0
Cleopatra       57           0       0         0         0         0
mercy            2           0       3         5         5         1
worser           2           0       1         1         1         0

Bag of words model

Vector representation doesn't consider the ordering of words in a document.
"John is quicker than Mary" and "Mary is quicker than John" have the same vectors.
This is called the bag of words model.
In a sense, this is a step back: the positional index was able to distinguish these two documents.
We will look at recovering positional information later in this course.
For now: bag of words model.

Term frequency tf

The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d.
We want to use tf when computing query-document match scores. But how?
Raw term frequency is not what we want:
A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
But not 10 times more relevant.
Relevance does not increase proportionally with term frequency.

NB: frequency = count in IR

Log-frequency weighting (Sec. 6.2)

The log frequency weight of term t in d is:

w_t,d = 1 + log10(tf_t,d)   if tf_t,d > 0
w_t,d = 0                   otherwise

0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4, etc.

Score for a document-query pair: sum over terms t in both q and d:

score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_t,d)

The score is 0 if none of the query terms is present in the document.
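
A minimal Python sketch of this overlap score; term frequencies are supplied as plain dicts and the names are illustrative:

    import math

    def log_tf(tf):
        # 1 + log10(tf) for tf > 0, else 0
        return 1 + math.log10(tf) if tf > 0 else 0

    def overlap_score(query_terms, doc_tf):
        # sum of log-tf weights over terms appearing in both query and document
        return sum(log_tf(doc_tf.get(t, 0)) for t in query_terms)

    doc = {"caesar": 2, "march": 10, "died": 1}
    print(overlap_score({"ides", "of", "march"}, doc))  # 1 + log10(10) = 2.0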

Document frequency (Sec. 6.2.1)

Rare terms are more informative than frequent terms.
Recall stop words.
Consider a term in the query that is rare in the collection (e.g., arachnocentric).
A document containing this term is very likely to be relevant to the query arachnocentric.
We want a high weight for rare terms like arachnocentric.

Document frequency, continued (Sec. 6.2.1)

Frequent terms are less informative than rare terms.
Consider a query term that is frequent in the collection (e.g., high, increase, line).
A document containing such a term is more likely to be relevant than a document that doesn't.
But it's not a sure indicator of relevance.
For frequent terms, we want high positive weights for words like high, increase, and line,
but lower weights than for rare terms.
We will use document frequency (df) to capture this.

idf weight (Sec. 6.2.1)

df_t is the document frequency of t: the number of documents that contain t.
df_t is an inverse measure of the informativeness of t.
df_t ≤ N.
We define the idf (inverse document frequency) of t by:

idf_t = log10(N / df_t)

We use log(N/df_t) instead of N/df_t to dampen the effect of idf.
It will turn out that the base of the log is immaterial.

idf example, suppose N = 1 million (Sec. 6.2.1)

term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0

idf_t = log10(N / df_t)

There is one idf value for each term t in a collection.
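
The idf column follows directly from the formula; a one-liner sketch in Python:

    import math

    N = 1_000_000
    for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                     ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
        print(term, math.log10(N / df))  # 6, 4, 3, 2, 1, 0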

Effect of idf on ranking

Does idf have an effect on ranking for one-term queries, like "iPhone"?
idf has no effect on ranking one-term queries.
idf affects the ranking of documents for queries with at least two terms.
For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

Collection vs. Document frequency (Sec. 6.2.1)

The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
Example:

word        collection frequency   document frequency
insurance   10440                  3997
try         10422                  8760

Which word is a better search term (and should get a higher weight)?

tf-idf weighting (Sec. 6.2.2)

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_t,d = (1 + log10 tf_t,d) × log10(N / df_t)

Best known weighting scheme in information retrieval.
Note: the "-" in tf-idf is a hyphen, not a minus sign!
Alternative names: tf.idf, tf x idf.
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.
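
Putting tf and idf together, a minimal sketch (function and parameter names are illustrative):

    import math

    def tfidf(tf, df, N):
        # (1 + log10 tf) * log10(N / df); zero if the term is absent from the doc
        if tf == 0:
            return 0.0
        return (1 + math.log10(tf)) * math.log10(N / df)

    print(tfidf(tf=10, df=1_000, N=1_000_000))  # 2.0 * 3.0 = 6.0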

Final ranking of documents for a query (Sec. 6.2.2)

Score(q, d) = Σ_{t ∈ q ∩ d} tf.idf_t,d

Binary -> count -> weight matrix (Sec. 6.3)

term        Antony and   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra    Caesar   Tempest
Antony         5.25        3.18     0         0         0        0.35
Brutus         1.21        6.1      0         1         0        0
Caesar         8.59        2.54     0         1.51      0.25     0
Calpurnia      0           1.54     0         0         0        0
Cleopatra      2.85        0        0         0         0        0
mercy          1.51        0        1.9       0.12      5.25     0.88
worser         1.37        0        0.11      4.15      0.25     1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.

Documents as vectors (Sec. 6.3)

So we have a |V|-dimensional vector space.
Terms are axes of the space.
Documents are points or vectors in this space.
Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
These are very sparse vectors; most entries are zero.

Queries as vectors (Sec. 6.3)

Key idea 1: do the same for queries: represent them as vectors in the space.
Key idea 2: rank documents according to their proximity to the query in this space.
proximity = similarity of vectors
proximity ≈ inverse of distance
Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model.
Instead: rank more relevant documents higher than less relevant documents.

Formalizing vector space proximity (Sec. 6.3)

First cut: distance between two points
(= distance between the end points of the two vectors)
Euclidean distance?
Euclidean distance is a bad idea...
...because Euclidean distance is large for vectors of different lengths.

Why distance is a bad idea (Sec. 6.3)

[Figure: a query q and documents d1, d2, d3 plotted in a two-term space.] The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Use angle instead of distance (Sec. 6.3)

Thought experiment: take a document d and append it to itself. Call this document d′.
Semantically, d and d′ have the same content.
The Euclidean distance between the two documents can be quite large.
The angle between the two documents is 0, corresponding to maximal similarity.
Key idea: rank documents according to angle with query.

From angles to cosines (Sec. 6.3)

The following two notions are equivalent:
Rank documents in decreasing order of the angle between query and document.
Rank documents in increasing order of cosine(query, document).
Cosine is a monotonically decreasing function for the interval [0°, 180°].

From angles to cosines (Sec. 6.3)

But how, and why, should we be computing cosines?

Length normalization (Sec. 6.3)

A vector can be (length-) normalized by dividing each of its components by its length; for this we use the L2 norm:

||x||_2 = sqrt(Σ_i x_i²)

Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere).
Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length normalization.
Long and short documents now have comparable weights.
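
A small sketch of the d vs. d′ thought experiment, using raw counts for simplicity (appending d to itself doubles every raw count; names are illustrative):

    import math

    def l2_normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    d = [3.0, 4.0]
    d_doubled = [2 * x for x in d]  # d appended to itself: every raw count doubles
    print(l2_normalize(d))          # [0.6, 0.8]
    print(l2_normalize(d_doubled))  # [0.6, 0.8]: identical after normalization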

cosine(query, document) (Sec. 6.3)

cos(q, d) = (q · d) / (|q| |d|) = (q / |q|) · (d / |d|)
          = Σ_{i=1}^{|V|} q_i d_i / ( sqrt(Σ_{i=1}^{|V|} q_i²) × sqrt(Σ_{i=1}^{|V|} d_i²) )

q · d is the dot product; q/|q| and d/|d| are unit vectors.
q_i is the tf-idf weight of term i in the query.
d_i is the tf-idf weight of term i in the document.
cos(q, d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.
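
A direct transcription of the formula, as a sketch:

    import math

    def cosine(q, d):
        # cos(q, d) = q·d / (|q| |d|)
        dot = sum(qi * di for qi, di in zip(q, d))
        return dot / (math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d)))

    print(cosine([1, 1, 0], [2, 2, 0]))  # 1.0: same direction, different length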

Cosine for length-normalized vectors

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σ_{i=1}^{|V|} q_i d_i

for q, d length-normalized.

Cosine similarity illustrated

[Figure: unit vectors for a query and several documents on the unit circle in a two-term space; similarity is the cosine of the angle between a document vector and the query vector.]

Cosine similarity amongst 3 documents (Sec. 6.3)

How similar are the novels
SaS: Sense and Sensibility
PaP: Pride and Prejudice, and
WH: Wuthering Heights?

Term frequencies (counts):

term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38

Note: to simplify this example, we don't do idf weighting.

3 documents example contd. (Sec. 6.3)

Log frequency weighting:

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0 + 0 × 0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69

Why do we have cos(SaS, PaP) > cos(SaS, WH)?
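
A sketch reproducing the whole pipeline for this example (log-tf, L2-normalize, dot product; the dict layout is an assumption):

    import math

    counts = {
        "SaS": [115, 10, 2, 0],   # affection, jealous, gossip, wuthering
        "PaP": [58, 7, 0, 0],
        "WH":  [20, 11, 6, 38],
    }

    def log_tf_vec(v):
        return [1 + math.log10(x) if x > 0 else 0 for x in v]

    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    vecs = {k: normalize(log_tf_vec(v)) for k, v in counts.items()}
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    print(round(dot(vecs["SaS"], vecs["PaP"]), 2))  # ≈ 0.94
    print(round(dot(vecs["SaS"], vecs["WH"]), 2))   # ≈ 0.79
    print(round(dot(vecs["PaP"], vecs["WH"]), 2))   # ≈ 0.69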

Computing cosine scores (Sec. 6.3)

[Figure: the CosineScore(q) algorithm (IIR Figure 6.14), which accumulates per-document scores term-at-a-time over the postings lists, divides each score by the document length, and returns the top K components.]
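
A Python sketch of that term-at-a-time algorithm over a toy in-memory inverted index; the data layout and names here are assumptions, not the slide's pseudocode:

    import heapq

    def cosine_score(query_wts, postings, length, k=10):
        """query_wts: {term: query weight}; postings: {term: [(doc_id, w_td), ...]};
        length: {doc_id: L2 norm of the document vector}."""
        scores = {}
        for t, wq in query_wts.items():
            for d, wtd in postings.get(t, []):
                scores[d] = scores.get(d, 0.0) + wtd * wq  # accumulate dot products
        for d in scores:
            scores[d] /= length[d]  # cosine normalization by document length
        return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])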

tf-idf weighting has many variants (Sec. 6.4)

[Table: SMART notation components, IIR Figure 6.15.]

Term frequency:
  n (natural)    tf_t,d
  l (logarithm)  1 + log(tf_t,d)
  a (augmented)  0.5 + 0.5 × tf_t,d / max_t(tf_t,d)
  b (boolean)    1 if tf_t,d > 0, else 0
  L (log ave)    (1 + log tf_t,d) / (1 + log(ave_{t∈d} tf_t,d))

Document frequency:
  n (no)        1
  t (idf)       log(N / df_t)
  p (prob idf)  max{0, log((N - df_t) / df_t)}

Normalization:
  n (none)            1
  c (cosine)          1 / sqrt(w_1² + w_2² + ... + w_M²)
  u (pivoted unique)  1/u
  b (byte size)       1/CharLength^α, α < 1

Columns headed "n" are acronyms for weight schemes.
Why is the base of the log in idf immaterial?

Weighting may differ in queries vs documents (Sec. 6.4)

Many search engines allow for different weightings for queries vs. documents.
SMART notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table.
A very standard weighting scheme is: lnc.ltc
Document: logarithmic tf (l as first character), no idf, and cosine normalization.
(No idf for documents: a bad idea?)
Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization.

tf-idf example: lnc.ltc (Sec. 6.4)

Document: car insurance auto insurance
Query: best car insurance

                 Query                                    Document                     Prod
term        tf-raw  tf-wt  df      idf  wt    n'lized     tf-raw  tf-wt  wt    n'lized
auto        0       0      5000    2.3  0     0           1       1      1     0.52     0
best        1       1      50000   1.3  1.3   0.34        0       0      0     0        0
car         1       1      10000   2.0  2.0   0.52        1       1      1     0.52     0.27
insurance   1       1      1000    3.0  3.0   0.78        2       1.3    1.3   0.68     0.53

Exercise: what is N, the number of docs?
Doc length = sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
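
A sketch verifying the document side and the final score; the normalized query weights are taken straight from the table above:

    import math

    # lnc document: log tf, no idf, cosine-normalize
    doc_tf = {"car": 1, "insurance": 2, "auto": 1}
    doc_wt = {t: 1 + math.log10(tf) for t, tf in doc_tf.items()}
    length = math.sqrt(sum(w * w for w in doc_wt.values()))   # ≈ 1.92
    doc_nlz = {t: w / length for t, w in doc_wt.items()}

    # ltc query weights after normalization, as in the table
    query_nlz = {"best": 0.34, "car": 0.52, "insurance": 0.78}

    score = sum(query_nlz.get(t, 0) * w for t, w in doc_nlz.items())
    print(round(score, 2))  # ≈ 0.8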

Summary: vector space ranking

Represent the query as a weighted tf-idf vector.
Represent each document as a weighted tf-idf vector.
Compute the cosine similarity score for the query vector and each document vector.
Rank documents with respect to the query by score.
Return the top K (e.g., K = 10) to the user.

Resources for today's lecture (Ch. 6)

IIR 6.2-6.4.3
http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
Term weighting and cosine similarity tutorial for SEO folk!
