
Birla Institute of Technology & Science, Pilani

Work-Integrated Learning Programs Division


Second Semester 2016-2017

Mid-Semester Test Solutions


(EC-2 Regular)

Course No. : SS ZG537


Course Title : INFORMATION RETRIEVAL
Nature of Exam : Closed Book
Weightage : 30%          No. of Pages = 4
Duration : 2 Hours       No. of Questions = 7
Date of Exam : 25/02/2017 (FN)
Note:
1. Please follow all the Instructions to Candidates given on the cover page of the answer book.
2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.
3. Assumptions made if any, should be stated clearly at the beginning of your answer.

Q1. Discuss in brief the limitations of the Boolean retrieval model. [2 Marks]
Ans. (2 marks for any 4 out of 5)
Not tolerant of spelling mistakes.
Phrase search ("Stanford University") and proximity search (Gates /s Microsoft) require the index
to be augmented.
Term frequency is ignored: documents containing more instances of a query term get no extra weight.
Positional information of terms in a document is not considered.
Returned results are not ranked, so documents cannot be ordered by degree of relevance.

Q2. Give the name of the index we need to use if [1 + 1 + 2 = 4 marks]


a. We want to consider word order in the queries and the documents for an arbitrary number of words?
Ans. Positional Index (1 mark)
b. What kind of Index can we use if we assume that word order is only important for two consecutive
terms?
Ans. Biword Index (1 mark)
c. What is the Soundex code for the following two names, Robert and Rupert? Assume that the
letters are mapped to digits as follows: (B, F, P, V -> 1), (C, G, J, K, Q, S, X, Z -> 2), (D, T -> 3),
(L -> 4), (M, N -> 5) and (R -> 6).
Ans. Both produce the same Soundex code, R163. (1 mark each)
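
The code R163 can be checked with a short Python sketch. This is a simplified textbook variant of Soundex, not a reference implementation: it assumes the first letter is kept as-is and that only the remaining letters are mapped, collapsed and stripped of zeros (some variants also compare the first letter's digit with the one that follows). The name `soundex` and the `CODES` table are illustrative.

```python
# Digit classes exactly as given in the question; all other letters
# (vowels and H, W, Y) are treated as "0".
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a 4-character Soundex code (simplified variant, non-empty input assumed)."""
    name = name.upper()
    first, rest = name[0], name[1:]
    # Map the remaining letters to digits; unmapped letters become "0".
    digits = [CODES.get(ch, "0") for ch in rest if ch.isalpha()]
    # Collapse runs of identical consecutive digits.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Drop zeros, pad with trailing zeros, keep four positions.
    code = first + "".join(d for d in collapsed if d != "0")
    return (code + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Robert maps to R, 0, 1, 0, 6, 3 and Rupert to R, 0, 1, 0, 6, 3 as well, so both collapse to R163.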

Q3. Discuss briefly the index construction algorithm used in Distributed Indexing with a suitable diagram.
[5 marks]

Ans. Map Phase (Parsers) -- 2 marks


Master assigns a split to an idle parser machine.
Each parser writes its output to local intermediate files, the segment files.
Parser writes (term, doc) pairs into j partitions.
Each partition is for a range of terms' first letters
i. (e.g., a-f, g-p, q-z); here j = 3.
ii. Each term partition thus corresponds to r segment files, where r is the number of parsers.
Reduce phase (Inverters) -- 1 mark
An inverter collects all (term, doc) pairs (= postings) for one term partition.
It sorts them and writes the result to postings lists.
Functional diagram -- 2 marks
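
For illustration only (this is not part of the marking scheme), the map and reduce phases described above can be simulated in a single Python process. The names `partition_of`, `parse` and `invert` and the toy collection are assumptions made for this sketch; a real system runs parsers and inverters on separate machines and writes the segment files to disk.

```python
from collections import defaultdict

# Term partitions by first letter, as in the example above (j = 3).
PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]


def partition_of(term):
    """Index of the partition whose letter range covers the term's first letter."""
    for j, (lo, hi) in enumerate(PARTITIONS):
        if lo <= term[0] <= hi:
            return j
    raise ValueError(f"no partition for term {term!r}")


def parse(split):
    """Map phase: one parser turns its split into j segment files of (term, doc) pairs."""
    segments = [[] for _ in PARTITIONS]
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments


def invert(segment_files):
    """Reduce phase: one inverter merges the r segment files of a single partition."""
    postings = defaultdict(set)
    for segment in segment_files:
        for term, doc_id in segment:
            postings[term].add(doc_id)
    # Sort terms and docIDs to obtain the final postings lists.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Hypothetical toy collection divided into two splits, so r = 2 parsers.
splits = [
    [(1, "catholic church in brisbane"), (2, "garden city church brisbane")],
    [(3, "brisbane courier garden city"), (4, "where in brisbane catholic church")],
]
parsed = [parse(split) for split in splits]  # the master would hand these to idle parsers
index = {}
for j in range(len(PARTITIONS)):
    index.update(invert([segments[j] for segments in parsed]))

print(index["brisbane"])  # [1, 2, 3, 4]
```

Each inverter sees exactly r = 2 segment files for its partition, one from each parser, which is the correspondence stated in point ii.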

Q4. a. An IR system returns 8 relevant documents, and 10 non-relevant documents. There are a total of 20
relevant documents in the collection. What is the precision of the system on this search, and what is
its recall? [2 Marks]
Ans. Precision = 8/18 ≈ 0.44; Recall = 8/20 = 0.4

(1 mark each)
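
The arithmetic can be checked with a minimal sketch; the function name `precision_recall` and its parameter names are illustrative, not taken from the course material.

```python
def precision_recall(relevant_retrieved, nonrelevant_retrieved, total_relevant):
    """Precision = relevant retrieved / all retrieved; recall = relevant retrieved / all relevant."""
    retrieved = relevant_retrieved + nonrelevant_retrieved
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / total_relevant
    return precision, recall


p, r = precision_recall(8, 10, 20)
print(f"precision = {p:.2f}, recall = {r:.2f}")  # precision = 0.44, recall = 0.40
```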

b. What is the likely effect of Stemming and Lemmatization on [3 Marks]


a. Vocabulary size: Increase, Decrease, Unpredictable? Ans. Decrease
b. Precision: Increase, Decrease, Unpredictable? Ans. Decrease
c. Recall: Increase, Decrease, Unpredictable? Ans. Increase
(1 mark each for the correct answer)

Q5. Consider the following documents: [1 + 2 = 3 Marks]

Doc1: catholic church in brisbane


Doc2: garden city church brisbane
Doc3: brisbane courier garden city
Doc4: where in brisbane catholic church

a. Draw a term-document incidence matrix for this document collection.


Ans. (1 mark)
          Doc1  Doc2  Doc3  Doc4
catholic    1     0     0     1
church      1     1     0     1
in          1     0     0     1
brisbane    1     1     1     1
garden      0     1     1     0
city        0     1     1     0
courier     0     0     1     0
where       0     0     0     1
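
The matrix above can be reproduced with a few lines of Python. This is a minimal sketch; the variable names `docs`, `vocab` and `matrix` are illustrative, and terms are listed in order of first appearance to match the table.

```python
docs = {
    "Doc1": "catholic church in brisbane",
    "Doc2": "garden city church brisbane",
    "Doc3": "brisbane courier garden city",
    "Doc4": "where in brisbane catholic church",
}

# Vocabulary in order of first appearance across the collection.
vocab = []
for text in docs.values():
    for term in text.split():
        if term not in vocab:
            vocab.append(term)

# One row per term: 1 if the term occurs in the document, 0 otherwise.
matrix = {
    term: [int(term in text.split()) for text in docs.values()]
    for term in vocab
}

print(f"{'':<10}", *docs)
for term, row in matrix.items():
    print(f"{term:<10}", *row)
```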

b. Draw the positional inverted index representation for this collection.


Ans. (2 marks)

catholic -> 1: <1>, 4: <4>
church -> 1: <2>, 2: <3>, 4: <5>
in -> 1: <3>, 4: <2>
brisbane -> 1: <4>, 2: <4>, 3: <1>, 4: <3>
garden -> 2: <1>, 3: <3>
city -> 2: <2>, 3: <4>
courier -> 3: <2>
where -> 4: <1>
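
The positional index above can be built mechanically. The following is a minimal sketch with 1-based positions, as in the answer; the helper `phrase_docs` is an illustrative addition showing why positions are stored, not something the question asks for.

```python
from collections import defaultdict

docs = {
    1: "catholic church in brisbane",
    2: "garden city church brisbane",
    3: "brisbane courier garden city",
    4: "where in brisbane catholic church",
}

# term -> {docID -> [positions]}, positions counted from 1.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split(), start=1):
        index[term].setdefault(doc_id, []).append(pos)

for term, postings in index.items():
    entries = [f"{doc_id}: <{','.join(map(str, positions))}>"
               for doc_id, positions in postings.items()]
    print(f"{term} -> {', '.join(entries)}")


def phrase_docs(first, second):
    """Documents in which `second` occurs immediately after `first`."""
    hits = []
    for doc_id, positions in index[first].items():
        following = index[second].get(doc_id, [])
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return hits


print(phrase_docs("catholic", "church"))  # [1, 4]
```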

Q6. Consider the following document: The universe contains many different universities
[1 + 2 + 3 + 2 = 8 Marks]
a. How many entries would a bigram index contain? Ans: 50 (1 mark)
b. If a Boolean query is processed on this index for the initial query uni*, which terms would
you search for in the permuterm index?
Ans. (2 marks)

c. How do you process queries such as univ*, uni*rse and uni*e*se using the permuterm index? Show
which terms you would search for and how.
Ans. (1 mark for each query)
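
As a hedged illustration of permuterm lookups (a sketch, not the official answer key): every rotation of term + "$" is indexed, and a wildcard query is rotated so that the * ends up at the end, i.e. X*Y becomes the prefix lookup Y$X. For a query with more than one *, the sketch assumes the common approach of looking up the outer parts and filtering the candidates afterwards. The names `lookup_key` and `wildcard_search` are illustrative, and the sketch assumes the query contains at least one *.

```python
from fnmatch import fnmatchcase

vocabulary = ["the", "universe", "contains", "many", "different", "universities"]

# Permuterm index: every rotation of term + "$" points back to the term.
permuterm = {}
for term in vocabulary:
    augmented = term + "$"
    for i in range(len(augmented)):
        permuterm[augmented[i:] + augmented[:i]] = term


def lookup_key(query):
    """Rotate a wildcard query into a prefix key: X*...*Y -> Y$X."""
    head, *_, tail = query.split("*")
    return tail + "$" + head


def wildcard_search(query):
    key = lookup_key(query)
    # A real index would do a B-tree prefix scan; a linear scan keeps the sketch short.
    candidates = {term for rotation, term in permuterm.items()
                  if rotation.startswith(key)}
    # Post-filter, needed when the query has more than one * (e.g. uni*e*se).
    return sorted(term for term in candidates if fnmatchcase(term, query))


for query in ["uni*", "univ*", "uni*rse", "uni*e*se"]:
    print(query, "->", lookup_key(query) + "*", wildcard_search(query))
```

Under these assumptions the lookup keys are $uni*, $univ*, rse$uni* and se$uni* respectively.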

d. Use the 2-gram index and the 3-gram index to process the wildcard queries tol* and rea*.
Is "tool" a result for the wildcard query tol*? If the answer is yes, explain how to solve this problem.
Ans. (1 mark for each part)
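
The k-gram parts of this question can be explored with a short sketch, assuming the usual convention that "$" marks word boundaries and that every occurrence counts as one entry. It reproduces the count of 50 for part (a) and shows, for part (d), that "tool" passes the 2-gram test for tol* but is rejected by the 3-gram test and by a final string match (post-filtering). The helper names are illustrative.

```python
from fnmatch import fnmatchcase


def kgrams(term, k):
    """All k-grams of a term, with $ marking both word boundaries."""
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]


def query_kgrams(query, k):
    """k-grams of a trailing-wildcard query such as tol* (start boundary only)."""
    prefix = "$" + query.rstrip("*")
    return [prefix[i:i + k] for i in range(len(prefix) - k + 1)]


document = "The universe contains many different universities"
total = sum(len(kgrams(word, 2)) for word in document.lower().split())
print(total)  # 50

for k in (2, 3):
    grams = query_kgrams("tol*", k)
    # "tool" is a candidate if it contains every k-gram of the query.
    is_candidate = set(grams) <= set(kgrams("tool", k))
    print(k, grams, is_candidate)

# Post-filtering against the original query removes the false positive.
print(fnmatchcase("tool", "tol*"))  # False
```

With k = 2 the query becomes $t AND to AND ol, which "tool" satisfies; with k = 3 it becomes $to AND tol, which "tool" does not.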

Q7. Assume that simple term frequency weights are used (with no IDF factor), and that the stop words
"is", "am" and "are" are removed. Compute the cosine similarity of the following two documents. [Show
the term frequency matrix] [3 marks]
Doc1: Precision is very very high
Doc2: high precision is very very very important
Ans. (1 mark for matrix and 2 marks for calculations.)
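
A minimal sketch of the calculation, assuming raw term counts as weights and case-insensitive matching (so "Precision" and "precision" are the same term). The names `tf_vector` and `cosine` are illustrative.

```python
import math
from collections import Counter

STOP_WORDS = {"is", "am", "are"}


def tf_vector(text):
    """Raw term-frequency vector with the stop words removed."""
    return Counter(t for t in text.lower().split() if t not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)  # a missing term counts as 0 in a Counter
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


d1 = tf_vector("Precision is very very high")
d2 = tf_vector("high precision is very very very important")

# Term frequency matrix: one row per term, columns Doc1 and Doc2.
for term in sorted(set(d1) | set(d2)):
    print(f"{term:<10} {d1[term]} {d2[term]}")

print(round(cosine(d1, d2), 3))  # 0.943
```

The dot product is 1*1 + 2*3 + 1*1 = 8, the vector lengths are sqrt(6) and sqrt(12), and 8 / sqrt(72) is approximately 0.943.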
***********
