
AUTOMATIC EXTRACTION OF USEFUL FACET HIERARCHIES

ABSTRACT

Databases of text and text-annotated data constitute a significant fraction of the information available in electronic form. Searching and browsing are the typical ways that users locate items of interest in such databases. [1] Faceted interfaces represent a powerful new paradigm that has proved to be a successful complement to keyword searching. Thus far, the identification of the facets was either a manual procedure or relied on a priori knowledge of the facets that can potentially appear in the underlying collection. In particular, we observe, through a pilot study, that facet terms rarely appear in text documents, showing that we need external resources to identify useful facet terms. Our extensive user studies, using the Amazon Mechanical Turk service, show that our techniques produce facets with high precision and recall that are superior to existing approaches and help users locate interesting items faster.

TABLE OF CONTENTS

1. INTRODUCTION
1.1 ABOUT THE PROJECT
1.2 EXISTING SYSTEM
1.3 DRAWBACKS OF EXISTING SYSTEM
1.4 KEYWORDS
2. SYSTEM ANALYSIS
2.1 PROBLEM DEFINITION
2.2 PROPOSED SYSTEM
2.3 SELECTION OF SOFTWARE STRUCTURE
2.4 SELECTION OF HARDWARE STRUCTURE
3. SYSTEM DESIGN
3.1 DESIGN METHODOLOGY
3.2 SOFTWARE STRUCTURE
3.3 DFD/UML DIAGRAMS
3.4 DATABASE DESIGN
3.5 CODE DESIGN
3.6 REPORT DESIGN
4. SYSTEM IMPLEMENTATION
4.1 TYPES OF TESTING
4.2 TEST CASES
5. DOCUMENTATION
5.1 HOW TO OPERATE PACKAGE
5.2 SCREEN SHOTS
6. CONCLUSION AND SCOPE
7. REFERENCES

INTRODUCTION

1.1 ABOUT THE PROJECT


The main aim of this project is to provide an interface for extracting useful documents from a large database using facets, with high recall, so that the end user gets a fast response when searching the database. We present an unsupervised technique for the automatic extraction of facets useful for browsing text databases. In particular, we observe, through a pilot study, that facet terms rarely appear in text documents, showing that we need external resources to identify useful documents using facet terms. To do this, the system first identifies the given phrase in each document. The whole database is then searched using this facet, and useful documents are retrieved with high precision, so that users have a flexible interface for retrieving documents.[1] Finally, we compare the term distributions in the original database and the expanded database to identify the terms that can be used to construct browsing facets. Our extensive user studies, using the Amazon Mechanical Turk service, show that our techniques produce facets with high precision and recall that are superior to existing approaches and help users locate interesting items faster. We collected datasets from Google.

1.2 EXISTING SYSTEM


Earlier, users had to search the entire database manually to get the required information from volumes of data. Each document had to be searched using the facet, which takes more processing time. As a result, the system responds slowly: waiting time and searching time are high, and system throughput is low. Precision is high but recall is low.

1.3 DRAWBACKS OF EXISTING SYSTEM


The existing system requires more searching time. The user has to search the entire database manually. Precision is low. Waiting time is longer.

1.4 KEYWORDS
Frequency, Faceted interface, Facet, Hierarchy

SYSTEM ANALYSIS

2.1 PROBLEM DEFINITION


We implemented this system for the purpose of fast retrieval. Users are busy with their work, so the system retrieves documents quickly from large volumes of data using facet terms. A valid user enters a facet or key term to search the entire database and get the required documents.

2.2 PROPOSED SYSTEM


The proposed system is an application that provides a user-friendly interface and maximizes operational efficiency. The project creates a database for storing data. It provides a browser-accessible or client-resident graphics rendering component for the Graphical User Interface, which includes a registration component and a search database component. The registration component permits entering user details and uploading data to, or calling data from, the database, or both, relating to a particular user login. The search database component retrieves matched documents from large volumes of data with high recall.[3] The user has to register with full details such as first name, second name, and city, and has to log in each time to use the application. The database then checks whether the user is valid. If the login is valid, the user can enter a keyword to search large volumes of data, so that waiting time and searching time are reduced. The database is checked against the facet or key term to retrieve the required documents.

Frequency Algorithm [1]


Input: original database D, term extractors E1, ..., Ek
Output: annotated database I(D)
foreach document d in D do
    Extract all terms from d
    /* Compute term frequencies */
    foreach term t in d do
        FreqO(t) = FreqO(t) + 1
    end
    /* Identify important terms */
    I(d) = ∅
    foreach term extractor Ei do
        Use the extractor Ei to identify the important terms Ei(d) in document d
        Add Ei(d) to I(d)
    end
end
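The pseudocode above can be sketched in Java roughly as follows. This is a minimal illustration, not the project's implementation: the class name FrequencySketch, the simple tokenizer, and the toy "long word" extractor are all assumptions made for the example, and the count is incremented for every occurrence of a term (use a set of distinct terms per document if document frequency is intended).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Sketch of the frequency algorithm: count terms and annotate each document. */
class FrequencySketch {

    /** FreqO: occurrences of each term across the original database D. */
    final Map<String, Integer> freqO = new HashMap<>();

    /** I(D): for each document, in input order, its set of important terms I(d). */
    final List<Set<String>> annotated = new ArrayList<>();

    /** Hypothetical tokenizer: lower-case the text and split on non-letters. */
    static List<String> extractTerms(String document) {
        List<String> terms = new ArrayList<>();
        for (String token : document.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    /** Toy extractor (assumption): words longer than six letters are "important". */
    static Set<String> longWords(String document) {
        Set<String> result = new LinkedHashSet<>();
        for (String term : extractTerms(document)) {
            if (term.length() > 6) {
                result.add(term);
            }
        }
        return result;
    }

    /** One pass over D: update FreqO, then collect I(d) from every extractor Ei. */
    void annotate(List<String> database, List<Function<String, Set<String>>> extractors) {
        for (String d : database) {
            for (String t : extractTerms(d)) {
                freqO.merge(t, 1, Integer::sum); // FreqO(t) = FreqO(t) + 1
            }
            Set<String> important = new LinkedHashSet<>(); // I(d) starts empty
            for (Function<String, Set<String>> extractor : extractors) {
                important.addAll(extractor.apply(d)); // add Ei(d) to I(d)
            }
            annotated.add(important);
        }
    }

    /** Runs the sketch on a two-document toy database (illustrative data only). */
    static FrequencySketch demo() {
        List<String> database = new ArrayList<>();
        database.add("Faceted search helps users");
        database.add("Users browse faceted collections");
        List<Function<String, Set<String>>> extractors = new ArrayList<>();
        extractors.add(FrequencySketch::longWords);
        FrequencySketch sketch = new FrequencySketch();
        sketch.annotate(database, extractors);
        return sketch;
    }

    public static void main(String[] args) {
        FrequencySketch sketch = demo();
        System.out.println("FreqO(faceted) = " + sketch.freqO.get("faceted"));
        System.out.println("I(d) per document = " + sketch.annotated);
    }
}
```

Modelling each extractor as a function from document text to a set of terms keeps the loop over E1, ..., Ek identical to the pseudocode, so further extractors can be added to the list without changing annotate.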

2.3 SELECTION OF SOFTWARE STRUCTURE

Front End:
Operating System : Windows 95 / 98 / 2000 / XP.
Language : HTML & JSP.

Back End:
Database : MS Access and Folder.
Database Connectivity : JDBC-ODBC driver.
IDE : MyEclipse 8.0.

2.4 SELECTION OF HARDWARE STRUCTURE


Main Memory : 64 MB.
Microprocessor : Pentium III/IV.
Hard Disk : 4.3 GB.
Cache Memory : 512 KB.

SYSTEM DESIGN

3.1 DESIGN PRINCIPLES & METHODOLOGY


Producing the design for a large module can be an extremely complex task. Design principles are used to handle the complexity of the design process effectively; doing so not only reduces the effort needed for design but also reduces the scope for introducing errors during design. In design, the most important quality criteria are simplicity and understandability. Proper partitioning makes the system easier to maintain because it makes the design easier to understand; problem partitioning also aids design verification. The abstraction of a component describes the external behavior of that component without considering its internal behavior. Abstraction is essential for problem partitioning, and the abstraction of existing components plays an important role in the maintenance phase. The system is a collection of modules, that is, components. To design this system, we first followed the top-down approach to divide the problem into modules. The system itself is the main module; it is modular because it consists of discrete components such that each component supports a well-defined abstraction and a change to one component has minimal impact on the other components. The modules are highly cohesive and coupling is reduced in the system, because the relationships among elements in different modules are minimized.

3.2 SOFTWARE STRUCTURE


Hyper Text Markup Language (HTML)

HTML is a static markup language used to create hypertext documents, such as web pages, that have hyperlinks embedded in them. It is only a formatting language, not a programming language. Hyperlinks are underlined or emphasized words or locations on a screen that lead to other documents. The World Wide Web (WWW) is a global, interactive, graphical hypertext information system.

Java Server Pages (JSP)

JSP is a high-level extension of servlets that enables developers to embed Java code in HTML pages. The JSP engine compiles each JSP file into a servlet, and the engine uses the compiled servlet to serve requests. The JSP page interfaces (JspPage and HttpJspPage in javax.servlet.jsp) define three methods for the compiled JSP page. These methods are:

jspInit()
jspDestroy()
_jspService(HttpServletRequest request, HttpServletResponse response)

All three methods are present in the compiled JSP file. The page author can define the jspInit() and jspDestroy() methods, but the _jspService(HttpServletRequest request, HttpServletResponse response) method is generated by the JSP engine.

JSP, or Java Server Pages, was developed by Sun Microsystems. JSP is a server-side technology based on the object-oriented Java language. This section describes JSP and some of its important features.

Usage of JSP:

JSP is widely used for developing dynamic web sites. It is used for creating database-driven web applications because it provides strong server-side scripting support. Its advantages are a simpler development process, portability, efficiency, and reusability.

MS Access JDBC Driver

Figure: Schematic of the JDBC-ODBC bridge

The JDBC type 1 driver, also known as the JDBC-ODBC bridge, is a database driver implementation that employs the ODBC driver to connect to the database. The driver converts JDBC method calls into ODBC function calls.
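As a rough illustration of how an application reaches MS Access through the bridge, the sketch below builds the bridge URL for an ODBC data source name (DSN) and runs a parameterized lookup against the Login table described in the database design. The class name OdbcBridgeSketch and the helper names are assumptions for the example. Note that the bridge driver class sun.jdbc.odbc.JdbcOdbcDriver was removed in Java 8, so the open method only works on Java 7 or earlier, on a Windows machine where the DSN has been configured.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Sketch of connecting to MS Access through the JDBC-ODBC bridge (type 1 driver). */
class OdbcBridgeSketch {

    /** Builds the bridge URL for an ODBC data source name, e.g. jdbc:odbc:myDsn. */
    static String bridgeUrl(String dsn) {
        return "jdbc:odbc:" + dsn;
    }

    /**
     * Opens a connection through the bridge.
     * Assumption: Java 7 or earlier, with the DSN registered under Data Sources (ODBC).
     */
    static Connection open(String dsn) throws SQLException, ClassNotFoundException {
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver"); // bridge driver, removed in Java 8
        return DriverManager.getConnection(bridgeUrl(dsn));
    }

    /** Returns true if a row with the given Username exists in the Login table. */
    static boolean userExists(Connection connection, String username) throws SQLException {
        String sql = "SELECT Username FROM Login WHERE Username = ?";
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, username); // bound as a parameter, never concatenated
            try (ResultSet rows = statement.executeQuery()) {
                return rows.next();
            }
        }
    }
}
```

Binding the username as a statement parameter rather than concatenating it into the SQL string avoids SQL injection through the login form.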

3.3 USECASE DIAGRAMS

A usecase diagram is a diagram that shows a set of usecases and actors and their relationships.

Usecase diagrams commonly contain:
Usecases.
Actors.
Dependency, generalization, and association relationships.

3.3.1 Usecase Diagram to See Results

3.3.2 Class Diagram for Automatic Extraction of Useful Facet Hierarchies

3.3.3 Activity Diagram for Automatic Extraction of Useful Facet Hierarchies

3.3.4 Sequence Diagram for Automatic Extraction of Useful Facet Hierarchies

3.4 DATABASE DESIGN


A database is a collection of interrelated data stored with minimum redundancy to serve many users quickly and efficiently. The general objective is to make information access quick, inexpensive, and flexible for the user. The tables are organized:
To reduce data duplication and hence inconsistency.
To enable efficient storage and retrieval.

3.4.1 TABLES

Login TABLE
NAME       DATATYPE      CONSTRAINT   DOMAIN    DESCRIPTION
Username   Varchar2(20)  Primary key  A-Z, a-z  Username
FirstName  Varchar2(11)  Not null     A-Z, a-z  FirstName
LastName   Varchar2      Not null     A-Z, a-z  LastName
City       Varchar(20)   Not null     a-z, A-Z  Native place

Table 3.4.1.2

3.4.2 Data Values

Login TABLE

Username  FirstName  LastName    City
pvpsit    pvp        sit         kanuru
srividya  srividya   jyothirmai  kakinada
vijaya    vijaya     lakshmi     autonagar
padmaja   palleti    padmaja     vuyyuru
durga     durga      lakshmi     tadanki

Table 3.4.1.2

3.5 CODE DESIGN: TextClassification.java

import java.io.File;
import java.io.FileInputStream;
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class TextClassification {
    int maxcnt = 0;       // highest number of search words matched in one document
    String path = "";     // path of the best-matching document
    String finalstr = ""; // text of the best-matching document

    public TextClassification() {
    }

    // Returns the matching facet folder and its .doc files, or "No Facet"
    // followed by every .doc file whose text contains a search word.
    public Vector<String> getData(String path, String srchstr) {
        maxcnt = 0;
        Vector<String> retV = new Vector<String>();
        POIFSFileSystem fs = null;
        File file = new File(path);
        String basept = file.getAbsolutePath();
        String[] filelist = file.list();
        int count = srchstr.split(" ").length;
        System.out.println("srchstr:" + srchstr);
        System.out.println("count:" + count);
        boolean flag = true;
        // First pass: look for a folder whose name matches a search word (the facet).
        loop:
        for (int i = 0; i < filelist.length; i++) {
            File infile = new File(basept + "/" + filelist[i]);
            if (infile.isDirectory())
                if (compareRegularEx(srchstr, filelist[i])) {
                    String[] infilelist = infile.list();
                    retV.add(infile.getName());
                    for (int j = 0; j < infilelist.length; j++) {
                        if (infilelist[j].endsWith(".doc"))
                            retV.add(infile.getAbsolutePath() + "/" + infilelist[j]);
                        flag = false;
                    }
                    break loop;
                }
        }
        // Second pass: no facet folder matched, so search the text of every .doc file.
        if (flag) {
            retV.add("No Facet");
            for (int i = 0; i < filelist.length; i++) {
                File infile = new File(basept + "/" + filelist[i]);
                if (infile.isDirectory()) {
                    String[] infilelist = infile.list();
                    for (int j = 0; j < infilelist.length; j++) {
                        if (infilelist[j].endsWith(".doc")) {
                            try {
                                File f = new File(infile.getAbsoluteFile() + "/" + infilelist[j]);
                                fs = new POIFSFileSystem(new FileInputStream(f));
                                HWPFDocument doc = new HWPFDocument(fs);
                                WordExtractor we = new WordExtractor(doc);
                                String abpath = f.getAbsolutePath();
                                if (compareRegularEx(srchstr, we.getText(), abpath)) {
                                    retV.add(abpath);
                                }
                            } catch (Exception e) {
                                e.printStackTrace();
                                continue;
                            }
                        }
                    }
                }
            }
        }
        System.out.println("path:" + this.path);
        System.out.println("finalstr:" + this.finalstr);
        System.out.println("retV:" + retV);
        System.out.println("maxcnt:" + maxcnt);
        return retV;
    }

    // Matches each search word against the document text and remembers the
    // document that matched the most words.
    private boolean compareRegularEx(String srchstr, String text, String path) {
        String[] srch = srchstr.split(" ");
        boolean result = false;
        int cnt = 0;
        for (int i = 0; i < srch.length; i++) {
            Pattern p = Pattern.compile(srch[i]);
            Matcher m = p.matcher(text);
            if (m.find()) {
                result = true;
                cnt++;
            }
        }
        System.out.println("cnt:" + cnt);
        if (maxcnt < cnt) {
            maxcnt = cnt;
            this.path = path;
            this.finalstr = text;
            System.out.println("cPath:" + path);
        }
        return result;
    }

    // Returns true if any search word occurs in the given text.
    private boolean compareRegularEx(String srchstr, String text) {
        String[] srch = srchstr.split(" ");
        boolean result = false;
        for (int i = 0; i < srch.length; i++) {
            Pattern p = Pattern.compile(srch[i]);
            Matcher m = p.matcher(text);
            if (m.find()) {
                result = true;
            }
        }
        return result;
    }
}
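One point worth noting about the listing above is that each search word is passed directly to Pattern.compile, so it is interpreted as a regular expression: a word such as "(draft" throws a PatternSyntaxException, and matching is case-sensitive. The following sketch shows one possible alternative, assuming literal, case-insensitive matching is what is wanted; the class LiteralTermMatcher and its method are hypothetical and not part of the project code.

```java
import java.util.regex.Pattern;

/** Sketch: count how many search words occur literally in a text. */
class LiteralTermMatcher {

    /** Returns the number of whitespace-separated search words found in text. */
    static int countMatches(String searchString, String text) {
        int count = 0;
        for (String word : searchString.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // an empty search string yields one empty token
            }
            // Pattern.quote turns the word into a literal, so regex
            // metacharacters such as '(' or '+' are matched as plain text.
            Pattern pattern = Pattern.compile(Pattern.quote(word), Pattern.CASE_INSENSITIVE);
            if (pattern.matcher(text).find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("facet (draft", "Facet hierarchies (draft copy)"));
    }
}
```

The returned count plays the same role as cnt in compareRegularEx, so it could be compared against maxcnt in the same way to track the best-matching document.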

3.6 REPORT DESIGN


3.6.1 FEASIBILITY STUDY

The prime focus of the feasibility study is evaluating the practicality of the proposed system, keeping in mind a number of factors. The following factors are taken into account before deciding in favor of the new system.

3.6.2 TECHNICAL FEASIBILITY

We proposed to use advanced Java web technologies to implement the project. Knowledge of text parsers is required. We proposed to use Java servlet technology and Java Server Pages (JSP) technology, along with JavaScript, to generate an RIA (Rich Internet Application). MyEclipse IDE 8.0 is used for this application rather than Eclipse, because:
1. Eclipse needs the MyEclipse plug-in, but MyEclipse does not require any additional plug-in.
2. MyEclipse is a licensed product built on Eclipse.
3. Eclipse alone provides less functionality.
4. MyEclipse provides all the advanced features.
5. MyEclipse is the base for many other IDEs, such as IBM RAD and Rational Rose.

The system must handle and manage the stored text. The system must have a search provision so that it can search all the details. The major processes of the system are as follows.

Identifying important terms within each document: The inputs for this process are the original database, which consists of a set of documents, and a set of term extractors. The process computes the term frequencies and identifies the important terms, and each document is annotated with its important terms. The output is the annotated database.

Deriving context terms using external resources: The inputs for this process are the annotated database and the external resources. The process identifies the context terms for each document. The output is the contextualized database.[2]

Identifying important facet terms by comparing the term distributions in the original and contextualized databases: The inputs for this process are the original database and the contextualized database. The process computes the term frequencies in both databases and then computes the shift in rank and frequency of each term. The outputs are the useful facet terms.
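The third process can be illustrated with a small sketch. The report only states that a shift in rank and frequency is computed, so the exact formulas below are assumptions: the frequency shift is taken as the contextualized frequency minus the original frequency, the rank shift as the original rank minus the contextualized rank, and a term is kept as a candidate facet term when both shifts are positive. The class FacetTermShift and the sample terms are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: pick facet-term candidates by comparing two term distributions. */
class FacetTermShift {

    /** Rank by descending frequency (1 = most frequent); ties broken alphabetically. */
    static Map<String, Integer> ranks(Map<String, Integer> freq) {
        List<String> terms = new ArrayList<>(freq.keySet());
        terms.sort((a, b) -> {
            int byFrequency = Integer.compare(freq.get(b), freq.get(a));
            return byFrequency != 0 ? byFrequency : a.compareTo(b);
        });
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < terms.size(); i++) {
            rank.put(terms.get(i), i + 1);
        }
        return rank;
    }

    /** Terms that became both more frequent and better ranked after expansion. */
    static List<String> candidateFacetTerms(Map<String, Integer> freqO,
                                            Map<String, Integer> freqC) {
        Map<String, Integer> rankO = ranks(freqO);
        Map<String, Integer> rankC = ranks(freqC);
        int worstRankO = freqO.size() + 1; // assumed rank for terms absent from the original
        List<String> result = new ArrayList<>();
        for (String term : freqC.keySet()) {
            int frequencyShift = freqC.get(term) - freqO.getOrDefault(term, 0);
            int rankShift = rankO.getOrDefault(term, worstRankO) - rankC.get(term);
            if (frequencyShift > 0 && rankShift > 0) {
                result.add(term);
            }
        }
        Collections.sort(result); // alphabetical order for a stable result
        return result;
    }

    /** Toy distributions (illustrative only): context terms appear after expansion. */
    static List<String> demo() {
        Map<String, Integer> freqO = new HashMap<>();
        freqO.put("chirac", 5);
        freqO.put("election", 3);
        freqO.put("paris", 2);
        Map<String, Integer> freqC = new HashMap<>(freqO);
        freqC.put("politics", 8);
        freqC.put("france", 6);
        return candidateFacetTerms(freqO, freqC);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this toy run the terms added by the external resources rise in both frequency and rank, while the terms already present in the original documents do not, which is the behavior the comparison step relies on.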

SYSTEM IMPLEMENTATION

4.1 TYPES OF TESTING

System testing consists of the following steps:
1. Modular Testing
2. Integrated Testing
3. User Acceptance Testing

Modular Testing: A module represents a logical element of a system. For a module to run satisfactorily, it must compile, process test data correctly, and tie in properly with other modules. Modular testing checks for two types of errors: syntax and logic. A syntax error is a program statement that violates one or more rules of the language in which it is written.

Integrated Testing: Individual modules are invariably related to one another and interact in a total system. Each portion of the system is tested against the entire module with both test data and live data before the entire system is ready to be implemented. Once the individual modules were found to work satisfactorily, the system integration test was carried out. A complete test was made using these data, and all outputs were generated. Different users were allowed to work on the system to check its performance.

Software Testing:
Software testing is the process used to help identify the correctness, completeness, security, and quality of developed computer software.[5] Testing is a process of technical investigation, performed on behalf of stakeholders, that is intended to reveal quality-related information about the product with respect to the context in which it is intended to operate. In general, software engineers distinguish software faults from software failures. Our project, Automatic Extraction of Useful Facet Hierarchies, is tested with the following testing methodologies.

4.2 TEST CASES:

1. Test case ID : 1
2. Precondition : Enter username, enter password.
3. Description : If the username and password match the user details, the home page has to be displayed.
4. Test Steps : 1. Enter username. 2. Enter password. 3. Click Login button.
5. Expected output : The home page of the particular user has to be displayed.
6. Actual output : Home page of the logged-in user is displayed.
7. Status : Pass.
8. Remarks :

Table 4.1.1

Figure 4.1.1

1. Test case ID : 2
2. Precondition : Enter either username or password.
3. Description : If the username is entered but the password is not entered, then an alert box will be displayed.
4. Test Steps : 1. Enter username or password. 2. Click Login button.
5. Expected output : The alert box has to be displayed.
6. Actual output : Alert box is displayed asking the user to enter username and password.
7. Status : Pass.
8. Remarks :

Table 4.1.2

Figure 4.1.2


1. Test case ID : 3
2. Precondition : Enter username.
3. Description : If the username is entered but the other details aren't entered, then an alert box will be displayed.
4. Test Steps : 1. Enter username. 2. Click Submit button.
5. Expected output : The alert box has to be displayed.
6. Actual output : Alert box is displayed asking the user to enter the missing fields.
7. Status : Pass.
8. Remarks :

Table 4.1.3

Figure 4.1.3

1. Test case ID : 4
2. Precondition : Enter registration details.
3. Description : If the password length is less than 6, then an alert box will be displayed.
4. Test Steps : 1. Enter registration details. 2. Enter password with length less than 6. 3. Click Submit button.
5. Expected output : The alert box has to be displayed.
6. Actual output : Alert box is displayed specifying that the password min length is 6.
7. Status : Pass.
8. Remarks :

Table 4.1.4

Figure 4.1.4


1. Test case ID : 5
2. Precondition : Enter registration details, with the password and confirm password different.
3. Description : If the password and confirm password don't match, then an alert box will be displayed.
4. Test Steps : 1. Enter registration details. 2. Enter password and confirm password, making the two different. 3. Click Submit button.
5. Expected output : The alert box has to be displayed.
6. Actual output : Alert box is displayed specifying that the password and confirm password are mismatched.
7. Status : Pass.
8. Remarks :

Table 4.1.5

Figure 4.1.5

1. Test case ID : 6
2. Precondition : Enter same registration details.
3. Description : If the username and password are already present in the database, then an alert box will be displayed.
4. Test Steps : 1. Enter registration details. 2. Enter same username and password. 3. Click Submit button.
5. Expected output : The alert box has to be displayed.
6. Actual output : Alert box is displayed specifying that the username already exists. Please choose a different name.
7. Status : Pass.
8. Remarks :

Table 4.1.6

Figure 4.1.6
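The field checks exercised by test cases 3 to 5 can be sketched as a single validation routine. This is only an illustration under stated assumptions: the report does not show how the alert boxes are produced, so the class RegistrationValidator, the order of the checks, and the exact message strings are hypothetical.

```java
/** Sketch of the registration checks covered by test cases 3, 4, and 5. */
class RegistrationValidator {

    static final int MIN_PASSWORD_LENGTH = 6;

    /** Returns the alert message for the first failed check, or null if all pass. */
    static String validate(String username, String password, String confirmPassword) {
        if (isBlank(username) || isBlank(password)) {
            return "Enter the missing fields.";
        }
        if (password.length() < MIN_PASSWORD_LENGTH) {
            return "Password min length is 6.";
        }
        if (!password.equals(confirmPassword)) {
            return "Password and confirm password mismatched.";
        }
        return null;
    }

    /** True for null, empty, or whitespace-only input. */
    static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(validate("srividya", "abc", "abc"));
    }
}
```

Returning the message of the first failed check mirrors the one-alert-per-submission behavior recorded in the test cases; the duplicate-username check of test case 6 needs a database lookup and is left out here.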


DOCUMENTATION

5.1 How to Operate Package:


1. Download and install MyEclipse 8.0.
2. File -> Import -> Project Folder.
3. Right-click on the project in Project Explorer.
4. Properties -> Java Build Path -> Libraries.
5. Add the 3 jar files from the poi folder.
6. Put the Facet database folder of the project into C:/facet.
7. Connect the JDBC-ODBC driver for MS Access:
   Control Panel -> Administrative Tools -> Data Sources (ODBC).
   Click on Add.
   Add the MS Access file name.
   Specify the MS Access file location.

5.2 SCREEN SHOTS:

Startpage:

CONCLUSION & FUTURE SCOPE

The conventional system requires more waiting and searching time to extract useful information from large volumes of data. Implementing this project would ultimately improve the efficiency of the application by providing a user-friendly interface, so that we can better serve the neediest section of society. Some enhancements can be made to decrease I/O access. Importing and adding files through the application is a further development that would increase the efficiency of the system and thereby its throughput. The distributional analysis step of our technique automatically identifies which concepts are important for the underlying database and generates the appropriate facet terms. [1] We have to perform more experiments in this direction and examine the performance of our techniques for a larger variety of text databases and external resources.

REFERENCES

[1] W. Dakka and P. G. Ipeirotis, "Automatic Extraction of Useful Facet Hierarchies from Text Databases," in ACM SIGIR 2009 Workshop on Faceted Search, 2009.
[2] A. G. Taylor, Wynar's Introduction to Cataloging and Classification, 10th ed. Libraries Unlimited, Inc., 2006.
[3] A. S. Pollitt, "The key role of classification and indexing in view-based searching," in Proceedings of the 63rd International Federation of Library Associations and Institutions General Conference (IFLA'97), 1997.
[4] R. Snow, D. Jurafsky, and A. Y. Ng, "Semantic taxonomy induction from heterogenous evidence," in Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL 2006), 2006.
[5] W. Dakka, R. Dayal, and P. Ipeirotis, "Automatic discovery of useful facet terms," in ACM SIGIR 2006 Workshop on Faceted Search, 2006.
