
MUFFAKHAM JAH COLLEGE OF ENGINEERING & TECHNOLOGY
Department of Information Technology
B.E IV/IV II SEM (A & B) SECTION -- BIT-452: Information Retrieval System


Max Marks: 20 CLASS TEST 1: KEY (with stepwise marks distribution) Faculty: Ms S. Fouzia Sayeedunnisa (A-section) Mr. Md. Riyazuddin (B-section)

PART-A (ANSWER ALL QUESTIONS) (6 MARKS)

1) Define tf-idf weights of term t in document d.

ANS The term frequency and inverse document frequency of a term in a document are defined as follows. Let t be the number of distinct terms in the document collection and d the total number of documents in the collection.
- tf_ij is the term frequency: the number of occurrences of term tj in document Di. (1Mark)
- df_j is the document frequency: the number of documents which contain the term tj.
- idf_j is the inverse document frequency, which is equal to log(d/df_j). (1Mark)
The tf-idf weight of term tj in document Di is the product tf_ij x idf_j.
Evaluation Criteria: Definition must be written to score 1 mark.
Course outcome: Maps to Course Outcome - 1

2) Differentiate between browsing and searching

Searching | Browsing
Searching is done for a specific task by giving a query. (1Mark) | Browsing is defined as casual looking in the material with no specific task.
Enter keywords/operators in the search engine box. (1 Mark) | Enter a URL in the address box.
The essential tasks involved in searching are: | The essential tasks involved in browsing are:
Use keywords in queries | Use focused concepts to guide your reading
Scan results for relevant information | Scan for relevant information
Select information from snippets in order to: | Select information from web pages in order to:
- conclude the search by scanning, skimming or reading | - conclude the search by scanning, skimming or reading
- enter new keywords in a query | - click a link to further information
- click a link to further information |

Evaluation Criteria: At least 2 differences must be written to score 2 marks.
Course outcome: Maps to Course Outcome - 1

3) Describe Levenshtein edit distance with example (2Marks)
ANS Levenshtein edit distance between two strings is the minimum number of character insertions, deletions and replacements needed to make them equal. (1Mark)
Example
1) A typing error splits "flower" into "flo wer". The edit distance here is 1: deleting the space makes the two strings equal. (1Mark)
2) The Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
1. kitten -> sitten (substitution of "s" for "k")


2. sitten -> sittin (substitution of "i" for "e")
3. sittin -> sitting (insertion of "g" at the end).
Evaluation Criteria: Definition must be written to score 1 mark and at least one example should be written to score another mark.
Course outcome: Maps to Course Outcome - 2

Part B (Marks 2*7=14)

4) Explain the following
4 a) Structural Queries (3 Marks)
Mixing contents and structure in queries
- contents: words, phrases, or patterns
- structural constraints: containment, proximity, or other restrictions on structural elements
Three main structures
- Fixed structure
- Hypertext structure
- Hierarchical structure
Fixed Structure (1Mark)
Document: a fixed set of fields
EX: a mail has a sender, a receiver, a date, a subject and a body field
Search for the mails sent to a given person with "football" in the Subject field
Hypertext (1Mark)
A hypertext is a directed graph where nodes hold some text (text contents) and the links represent connections between nodes or between positions inside nodes (structural connectivity)

The Web is a gigantic hypertext-like database spread around the world.
Hierarchical Structure (1Mark)
An intermediate structuring model which lies between fixed structure and hypertext
Represents a recursive decomposition of the text and is a natural model for any text collection.


Sample Hierarchical Models
- PAT Expressions
- Overlapped Lists
- Lists of References
- Proximal Nodes
- Tree Matching

b) Keyword based querying (4 Marks):
A query is the formulation of a user information need
Keyword-based queries are popular
1. Single-Word Queries
2. Context Queries
3. Boolean Queries
4. Natural Language
Single-Word Queries (1 Mark)
A query is formulated by a word
A document is formulated by long sequences of words
A word is a sequence of letters surrounded by separators
What are letters and separators? e.g., on-line
The division of the text into words is not arbitrary
Context Queries (1 Mark)
Definition - Search words in a given context
Types
Phrase
>a sequence of single-word queries
>e.g., enhance retrieval
Proximity
>a sequence of single words or phrases, and a maximum allowed distance between them are specified
>e.g., within distance (enhance, retrieval, 4) will match "enhance the power of retrieval"
Boolean Queries (1 Mark)
Definition


A syntax composed of atoms that retrieve documents, and of Boolean operators which work on their operands
e.g., translation AND syntax OR syntactic
Fuzzy Boolean
Retrieve documents appearing in some operands (the AND may require it to appear in more operands than the OR)
Pattern Matching (1 Mark)
A pattern is a set of syntactic features that must occur in a text segment
Types
- Words
- Prefixes, e.g., comput -> computer, computation, computing, etc.
- Suffixes, e.g., ters -> computers, testers, painters, etc.
- Substrings, e.g., tal -> coastal, talk, metallic, etc.
- Ranges, e.g., between held and hold -> hoax and hissing
- Regular expressions
  union: if e1 and e2 are regular expressions, then (e1|e2) matches what e1 or e2 matches
  concatenation: if e1 and e2 are regular expressions, the occurrences of (e1e2) are formed by the occurrences of e1 immediately followed by those of e2
  repetition: if e is a regular expression, then (e*) matches a sequence of zero or more contiguous occurrences of e
  e.g., pro(blem|tein)(s|)(0|1|2)* -> problem2 and proteins
Evaluation criteria: Each kind of query type carries 1 mark. For each query type a one-line description must be written to get the 1 mark.
Course Outcome: Maps to Course Outcome 2
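The pattern types listed under Pattern Matching in Question 4 can be tried out with a short Python sketch. This is only an illustration, not part of the marking scheme: the word list VOCABULARY and the helper names are assumptions made for the example, and Python's re module is used as one possible regular-expression engine.

```python
import re

# The regular expression from the key, written in Python's re syntax.
# (s|) means "an s or nothing"; (0|1|2)* means zero or more of the digits 0, 1, 2.
PATTERN = re.compile(r"pro(blem|tein)(s|)(0|1|2)*")

# Hypothetical vocabulary, chosen only to exercise the examples in the key.
VOCABULARY = ["computer", "computation", "testers", "coastal",
              "talk", "metallic", "painters"]


def matches_pattern(word):
    """Return True if the whole word is an occurrence of PATTERN."""
    return PATTERN.fullmatch(word) is not None


def prefix_matches(prefix, words):
    """Words that start with the given prefix."""
    return [w for w in words if w.startswith(prefix)]


def suffix_matches(suffix, words):
    """Words that end with the given suffix."""
    return [w for w in words if w.endswith(suffix)]


def substring_matches(fragment, words):
    """Words that contain the given fragment anywhere."""
    return [w for w in words if fragment in w]


if __name__ == "__main__":
    for candidate in ["problem2", "proteins", "protein3"]:
        print(candidate, matches_pattern(candidate))
    print(prefix_matches("comput", VOCABULARY))
    print(suffix_matches("ters", VOCABULARY))
    print(substring_matches("tal", VOCABULARY))
```

fullmatch is used so that the whole word has to be an occurrence of the pattern. A search over running text would use PATTERN.search or PATTERN.finditer instead.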

5) Explain the following Classic IR Models
a) Boolean and Vector (5 Marks)
ANS
Boolean model (2 Marks)
Binary decision criterion (1 Mark)
Data retrieval model
Advantage
- clean formalism, simplicity
Disadvantage
- It is not simple to translate an information need into a Boolean expression.
- Exact matching may lead to retrieval of too few or too many documents.
Example: A query can be represented as a disjunction of conjunctive vectors (in DNF). (1 Mark)
Q = qa AND (qb OR NOT qc) = (1,1,1) OR (1,1,0) OR (1,0,0)
Vector model (3Marks)
Assign non-binary weights to index terms in queries and in documents. => TFxIDF (1 Mark)
Compute the similarity between documents and query. => Sim(Dj, Q)
More precise than Boolean model.


Idea for TFxIDF
TF: intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj. Such term frequency is usually referred to as the tf factor and provides one measure of how well that term describes the document contents.
IDF: inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection. This factor is often referred to as the inverse document frequency.
Index terms are assigned positive and non-binary weights. (1 Mark)
The index terms in the query are also weighted.
Term weights are used to compute the degree of similarity between documents and the user query.
Then, retrieved documents are sorted in decreasing order of similarity.
Degree of similarity (1 Mark)
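The TFxIDF weighting and the degree of similarity can be sketched in Python as follows. This is a sketch under stated assumptions, not the formula from the original key. The degree of similarity is taken to be the cosine of the angle between the document and query vectors, Sim(Dj, Q) = (Dj . Q) / (|Dj| |Q|), which is the usual choice for the vector model. The logarithm is taken to base 10, and the three-document toy collection and all function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build tf x idf weight vectors for a list of tokenised documents.

    Returns (vocab, idf, vectors), where each vector is ordered like vocab.
    """
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # df[t] = number of documents that contain term t
    df = Counter(term for doc in docs for term in set(doc))
    # Assumption: base-10 logarithm; the key only writes log(d/df).
    idf = {term: math.log10(n_docs / df[term]) for term in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency inside this document
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, idf, vectors


def query_vector(query, vocab, idf):
    """Weight the query terms with the same tf x idf scheme."""
    tf = Counter(query)
    return [tf[term] * idf[term] for term in vocab]


def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(docs, query):
    """Return document indices sorted by decreasing similarity to the query."""
    vocab, idf, vectors = tf_idf_vectors(docs)
    q = query_vector(query, vocab, idf)
    scores = [(cosine_similarity(vec, q), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


# Hypothetical toy collection, used only for illustration.
DOCS = [
    "shipment of gold damaged in a fire".split(),
    "delivery of silver arrived in a silver truck".split(),
    "shipment of gold arrived in a truck".split(),
]
QUERY = "gold silver truck".split()

if __name__ == "__main__":
    print(rank(DOCS, QUERY))
```

Terms that occur in every document (such as "of" here) get an idf of zero and so contribute nothing to the similarity, which is the intended effect of the idf factor.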

b) What is the Logical View of Documents? Explain with a Diagram (2Marks)
Logical View of Documents
- Full text representation (1 Mark)
- A set of index terms
  Elimination of stop-words
  The use of stemming
  The identification of noun groups
From full text to a set of index terms (1 Mark)

Evaluation criteria: For each model type, a description must be written to get 1 mark and the formula for similarity to get 1 mark. The logical view diagram carries 1 mark and the description carries 1 mark.

Course Outcome: Maps to Course Outcome 1.

6) Define Precision and Recall
Precision: Fraction of retrieved documents which is relevant. (1Mark)
|R| : Number of relevant documents
|A| : Number of documents in the answer set
|Ra| : Number of documents in the intersection of R and A
Precision = |Ra|/|A|
Recall: Fraction of relevant documents which has been retrieved. (1Mark)
Recall = |Ra|/|R|
The top 8 ranked result is [+,+,+,-,-,-,-,+]
+ indicates a non-relevant document
- indicates a relevant document
There are 10 relevant documents in total, i.e. |R| = 10, |A| = 8, |Ra| = 4 {all the - are relevant, so of the 10 relevant documents only 4 have appeared in the answer set, i.e. Ra}
i) Precision = |Ra|/|A| = 4/8 = 50% (1Mark)
ii) Recall = |Ra|/|R| = 4/10 = 40% (1Mark)
iii) Precision at 5 documents = |Ra|/|A| = 2/5 = 40% (1Mark)

b) Query Protocols (2Mark)
Query Protocols
- Z39.50
- WAIS (Wide Area Information Service)
Z39.50 (1Mark)
American National Standard Information Retrieval Application Service Definition
Can be implemented on any platform
Query bibliographical information using a standard interface between the client and the host database manager
Z39.50 protocol is part of WAIS
WAIS (Wide Area Information Service) (1Mark)
Beginning in the 1990s
Goal was to be a Network Publishing protocol
Query databases through the Internet
Evaluation criteria: Definition must be written to get the 1 mark and each bit solved correctly is given 1 mark.
Course Outcome: Maps to Course Outcome 2
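The precision and recall figures worked out in Question 6 can be reproduced with a minimal Python sketch. The function names are assumptions made for illustration. Following the key's convention, "-" marks a relevant document and "+" a non-relevant one.

```python
def precision(ranking, relevant_mark="-"):
    """Fraction of the retrieved documents that are relevant: |Ra| / |A|."""
    if not ranking:
        return 0.0
    hits = sum(1 for mark in ranking if mark == relevant_mark)
    return hits / len(ranking)


def recall(ranking, total_relevant, relevant_mark="-"):
    """Fraction of all relevant documents that were retrieved: |Ra| / |R|."""
    if total_relevant == 0:
        return 0.0
    hits = sum(1 for mark in ranking if mark == relevant_mark)
    return hits / total_relevant


def precision_at_k(ranking, k, relevant_mark="-"):
    """Precision computed over only the top k ranked documents."""
    return precision(ranking[:k], relevant_mark)


# Data from Question 6: top 8 results, 10 relevant documents overall.
RANKING = ["+", "+", "+", "-", "-", "-", "-", "+"]
TOTAL_RELEVANT = 10

if __name__ == "__main__":
    print("Precision:", precision(RANKING))
    print("Recall:", recall(RANKING, TOTAL_RELEVANT))
    print("Precision at 5:", precision_at_k(RANKING, 5))
```

Precision divides by the size of the answer set, while recall divides by the number of relevant documents in the whole collection, so the two differ even though both count the same |Ra|.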
