
International Journal of Advanced Computer Science, Vol. 3, No. 4, Pp. 191-197, Apr., 2013.

Extracting Information Science Concepts Based on JAPE Regular Expression


Ahlam Sawsaa & Joan Lu
Manuscript
Received: 3 Jun. 2012; Revised: 25 Jun. 2012; Accepted: 30 Jan. 2013; Published: 15 Mar. 2013

Keywords
Ontology, regular expression, information extraction, Natural Language Processing (NLP)

Abstract Recently, unstructured data on the World Wide Web has generated significant interest in the extraction of text, emails, web pages, reports and research papers in their raw form. Far more interestingly, extracting information from a specific domain using distributed corpora from the World Wide Web is a vital step towards creating corpus annotation. This paper describes a method of annotation, based on concepts from Information Science, to build a domain ontology using Natural Language Processing (NLP) technology. We used Java Annotation Patterns Engine (JAPE) grammars to support regular expression matching and thus annotate IS concepts using the GATE developer tool. This speeds up the time-consuming development of the ontology, which is important for experts in the domain facing time constraints and high workloads. The rules provide significant results: the pattern matching of IS concepts based on the lookup list produced 403 correct concepts, and accuracy was generally high, with no partially correct, missing or false-positive results. Using NLP techniques is a good approach to reducing the domain experts' workload, leaving them free to evaluate the results.

1. Introduction

Recently, Information Extraction (IE) has received significant interest due to the number of web pages emerging on the internet that contain unstructured data. Given the amount of information available on the internet, it is necessary to have tools for extracting it. Many specialists in the field of IE have worked to find suitable tools, such as wrappers, which classify interesting data and map them onto appropriate formats such as XML or a relational database. Furthermore, some HTML-aware tools rely on the structural features of documents in order to extract the data. On the other hand, Natural Language Processing (NLP) is a technique used by many tools to extract data from natural-language documents. Tools such as GATE use techniques such as part-of-speech tagging, filtering, or lexical semantic tagging to link relevant information and identify relationships among phrases and sentence elements within text. In fact, each of these tools has advantages and disadvantages, and a comparative analysis of the existing tools for data extraction is needed to assess their capabilities; this is done in the next section.

In this paper, we first provide a brief background on IE tools to justify why we feel the NLP technique should be used to speed up the building of an Ontology of Information Science (OIS). To extract concepts in the field, we used CREOLE plug-ins from GATE in the IE system. We also show how the JAPE grammar has been implemented by detailing the rules we use to annotate IS concepts. The paper is structured as follows: in Section 2, we discuss the background of IE; in Section 3, we discuss the methods used to extract Information Science (IS) concepts and how they were constructed; in Section 4, we present how the domain knowledge was acquired for creating the corpus and Gazetteer, and how the JAPE rules were implemented; our discussion and evaluation are in Section 5; finally, we draw conclusions and make suggestions for future work.

Ahlam Sawsaa, Joan Lu. Informatics, University of Huddersfield (swsa2004@yahoo.com; J.lu@hud.ac.uk)

2. Background

It is a shared belief that ontology receives a lot of recognition from various research fields. Although there are some well-known domain ontologies, such as CYC, the Standardized Nomenclature for Medicine (SNOMED, a clinical terminology), the Toronto Virtual Enterprise (TOVE) and the GENE ontology (GO), study of the ontology area is still immature and improvements are needed [6]. IS is a multidisciplinary field, including branches such as Library Science, Archival Science and Computer Science, and therefore lacks a unified model of domain knowledge. The inconsistencies in the structure of the domain make it difficult to use and share data at the syntactic and semantic levels. It is thus necessary to develop an OIS to represent the domain knowledge [9].

The growing amount of unstructured data appearing on the internet makes it extremely difficult to extract knowledge from it. The IS domain includes a huge number of documents made up of web-based knowledge that is inaccessible. Building an OIS thus requires us to set up an appropriate knowledge description module for the intended ontology [11]. A number of studies have shown that applications of IE can be used to annotate documents written in natural language. Certainly, the growing number of IE


tools that can be used to annotate concepts, such as SHOE, Annota, Annozilla, MnM, Ontomat, COHSE, Melita, and GATE, makes it easy to process machine-readable text. A comparison of these tools shows that they provide distinct methods of IE [1, 12], as illustrated in Table 1. Table 1 shows that the GATE Developer provides semi-automatic and automatic annotation, as does the MnM ontology editor. Comparing these two, GATE can extract text from different formats such as XML, HTML, XHTML, emails and PDF files, while MnM annotates HTML formats only.

Our annotation of IS concepts is based on the GATE Developer, a General Architecture for Text Engineering. It is a free, open-source tool developed by a team at Sheffield University starting in the early 1990s; the first version was released in 1995, the second in 2002 and the latest in 2010. GATE can run on any platform and supports Java 5.0, and it has been developed and tested on Linux, Windows and Mac OS X. It has a user interface to enable user editing, visualization and quick application development. Furthermore, it provides support for manual, semi-automatic and semantic annotation as well as ontology management. Moreover, GATE uses CREOLE plug-ins as objects for language engineering, all of which are packaged as Java Archives with XML configuration data [4]. GATE is used to take unseen texts and convert them into a fixed format such as XML or HTML; these data can then be displayed for users or stored in a database for analysis.

Before talking about GATE in more detail, we should clarify the difference between Information Retrieval (IR) and IE [3]. IE helps the user to extract information from a huge amount of text for the purpose of fact analysis, whereas IR just pulls out documents containing relevant information according to a keyword search. IE can identify queries in a structural way and provides knowledge at a deeper level, while IR uses a normal query engine, which makes it hard to gain accurate answers, and provides knowledge at the typical level. Consider an enquiry such as "which UK airports are currently closed due to severe weather conditions?", or one about where an event took place and who it involved, such as "where was Gordon Brown's last visit as prime minister?" [10, 2]. IR would just provide a webpage containing the relevant information, and the user would then need to search that webpage using various terms or concepts so as to analyze the information. IE, on the other hand, provides specific information about the enquiry; even if some of the extracted information is inaccurate, only the correct information need be kept [7]. IE has been used for applications such as text mining, semantic annotation, question answering, opinion mining, decision support, rich information retrieval and exploration.

GATE has a comprehensive set of plug-ins, including Alignment, ANNIE, Annotation_Merging, Copy_Annots_Between_Docs, Gazetteer_LKB, Gazetteer_Ontology_Based, Information_Retrieval, Keyphrase_Extraction_Algorithm, Language_Identification, Ontology_Tools, and WordNet. GATE is based on ANNIE (A Nearly-New Information Extraction system), which provides the core processing resources. ANNIE relies on a finite state algorithm and the JAPE grammar, and combines a Tokeniser, Sentence splitter, POS tagger, Gazetteer, named entity tagger (JAPE transducer), Orthomatcher (co-references), and NP and VP chunkers. Among these modules, we used the Tokeniser, Sentence splitter, Gazetteer, and JAPE transducer [5]. GATE includes automatic and semi-automatic semantic annotation as well as manual annotation, which enables users to create their own annotations. As a result, the GATE Developer can be used to extract terms and concepts from a specific text effectively and efficiently.
Table 1 Information extraction tools (tool; type; annotation based on; degree of automation; ease of use; language written in; advantages and disadvantages)
SHOE: knowledge annotation; W3C schema; automatic; ease of use +; written in Java; allows users to mark up pages in SHOE, guided by ontologies or a URL.
Annota: annotation; RDF mark-up; automatic; does not support IE; acts like an ontology server; makes annotations publicly available.
Annozilla: Mozilla-based annotation; XML, XHTML, CSS and XPointer; automatic; ease of use ++; written in C, available for Windows, Unix and Mac; similar to Melita.
MnM: ontology editor; HTML; semi-automatic and automatic; ease of use +; used to create and maintain ontologies; uses OntoBroker as a server.
Ontomat: annotation tool; OWL; automatic; ease of use ++; uses an ontology server to mark up pages in DAML+OIL, reused as RDF.
COHSE: integration of text-processing components; DAML+OIL; automatic; retrieves structured and semi-structured annotations.
Melita: annotation interface; semi-automatic; ease of use ++; written in Java.
KIM (Ontotext): semantic annotation platform; RDF; automatic; ease of use ++; semantic annotation, indexing and retrieval of unstructured and semi-structured content.
GATE: annotation tool; XML, HTML, XHTML, emails; semi-automatic and automatic; ease of use +++; Java (version 5); comprises an architecture and framework, based on the Sheffield NLP group.


For this work, we annotate text belonging to members of Ontocop. Ontocop is a virtual community of practice within the IS domain, designed to support group interaction and communication across diverse destinations. It was intended as a tool for creating the OIS ontology by extracting the main concepts from its outputs, namely the document discussions. Professional bodies can be good resources for the process of building a conceptualization of Information Science for the OIS [8].
3. Methods Employed

Our method is based on creating a corpus of documents and a Gazetteer of Information Science, with JAPE rules used to extract IS concepts. GATE provides facilities for loading corpora for annotation from a URL or uploading them from a file. The process generally started as follows. We compiled IS knowledge from different resources, such as the Ontocop website forum and various publications on the web by members of Ontocop. We analyzed the data to ensure it covered all branches of the field. We transferred the information resources into XML files to form the corpus. We then uploaded the corpus into the GATE software so as to start running ANNIE, annotated the concepts based on the JAPE grammar run within ANNIE, and carried out testing and evaluation, as illustrated in Fig. 1.

Fig. 1 Annotation workflow: documents of Information Science from the Ontocop forum, analysis process, transfer to XML, corpus, upload to the GATE framework, running ANNIE, annotation of concepts and evaluation

4. Implementation

A. Knowledge acquisition
Before creating the ontology, we had to collect the IS knowledge for the domain model. Our approach consisted of annotating IS concepts based on the JAPE grammar, using the GATE software. The annotation process began as follows: we collected discussion threads from the Ontocop website into a MySQL database, using the URL of the Ontocop website.

List 1 Transfer of the obtained discussion data to XML files

Figure 2 shows the methods we used to annotate concepts from the embedded knowledge that had emerged from experts in the field on the Ontocop forum. The discussion topics were collected in the MySQL database before being transferred to XML files. For example, the concept of Information science was annotated and defined in the OWL ontology language to start the lifecycle of the ontology process. Next, we collected IS publications by Ontocop members to speed up the process, and then


transferred these to the XML format as well. This is shown in List 1. Creating the corpus: as a result of the above steps, we had a corpus of 300 files in XML format, containing text relevant to the IS field. Creating the IS Gazetteer: this is a list of IS terms, each identified by a majorType, minorType, etc. For example:
Acquisition policy: majorType = concept
Computer aided design: minorType = term
Data analysis: majorType = concept
A sketch of how such a gazetteer is defined for GATE is given below. Next, we apply the JAPE rules.
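As an illustrative sketch only (the list file names are ours, not taken from the paper), a GATE gazetteer is defined by a lists.def index file in which each line maps a plain-text term list to its majorType and optional minorType, and each list file holds one term per line:

lists.def
  is_concepts.lst:concept
  is_terms.lst:term

is_concepts.lst
  acquisition policy
  data analysis

is_terms.lst
  computer aided design

The ANNIE Gazetteer then produces a Lookup annotation carrying the corresponding majorType (and minorType) feature wherever one of these terms occurs in the text.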

Using JAPE rules allows us to extract concepts by identifying tokens that contain the concepts in the correct order and then looking the concepts up in the Gazetteer. JAPE rules are grouped into phases, each of which defines a specific grammar and is compiled via Java. Each JAPE rule consists of a left-hand side (LHS) that contains the pattern to be matched and a right-hand side (RHS) that details the annotations to be created [4]; a minimal skeleton of such a rule is sketched below. We used the JAPE grammar to support regular expression matching, as this is how annotation is achieved in GATE. Annotation can also be carried out using other CREOLE plug-ins such as the Gazetteer, for which it is necessary to create a list of concepts to be annotated. By clicking on the ANNIE Gazetteer, all the lists appear, including the IS list, as shown in Fig. 3. The sub-list presents the main concepts, such as Abstract, Access list, Access to information, and Computer aided design. The next step was to upload the corpus to the application framework, using the JAPE grammar and Gazetteer to annotate concepts from the corpus.

Fig. 3 Screenshot of IS Gazetteer
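To make the LHS/RHS structure described above concrete, the following is a minimal sketch of a JAPE phase (our own illustration, not the grammar actually used in this work) that creates a Concept annotation over any gazetteer hit whose majorType is concept:

Phase: ISConceptLookup
Input: Lookup
Options: control = appelt

Rule: ConceptFromGazetteer
(
  {Lookup.majorType == "concept"}
):match
-->
:match.Concept = {rule = "ConceptFromGazetteer"}

The bracketed pattern on the left-hand side is matched against the Lookup annotations produced by the gazetteer, and the right-hand side creates a new Concept annotation over the span bound to the :match label.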

B. NLP technique used to extract IS concepts
We present an automatic extraction method based on ANNIE, using a JAPE grammar that extracts concepts from XML files and HTML text. Our JAPE rules extract concepts as follows. The first entity detected is Information service {Type=Token, start=867, end=837, id=4210, majorType=concept}, labelled as information service.concept:

Phase: one
Input: Lookup Token
Options: control = appelt

Rule: concept1
(
  ({Token.string == "information"})
  ({Token.string == "service"})
  ({Lookup.minorType == "region"}):regionName
):service
-->
:regionName.Location = {},
:service.concept = {}

Fig. 2 Process used to annotate IS concepts drawn from Ontocop


Phase: Two
Input: Lookup Token
Options: control = all

Rule: concept2
Priority: 20
(
  ({Token.string == "information"})
  ({Token.string == "service"})
  ({Lookup.majorType == "concept"})
):information
-->
:information.concept = {rule = "concept2"}

More precisely, we apply a regular expression to match strings of text:

Phase: Concept
Input: Lookup Token
Options: control = appelt

Rule: Glossary
(
  ({Token.string == "catalog(ue)?"})
):concept
-->
:concept.concept = {rule = "Glossary"}

In these rules we specify a string of text ({Token.string == ...}) that must be matched, specify the attributes of the annotation by using operators such as ==, and then annotate the entities with the correct labels. Furthermore, using a control option such as all, appelt or brill gives the right results. The next example shows how regular expression metacharacters (dot, *, [ ], |) could be used to annotate concepts related to abstract: the pattern {Token.string == "abstract(ing|or)?"} could capture the words abstract, abstracting or abstractor. If we want to annotate the acquisition concept followed by another word, we would use the dot and star metacharacters, as in {Token.string == "acquisition.*"}; this could annotate Acquisition policy and Acquisition service. The pattern {Token.string == "archival *"} would annotate archival library, archival journal, archival processing, archival software and archival studies, for example. We could also choose one term from a choice of two. For example, we could match either Data or mining from the phrase Data mining, Data or processing from Data processing, Data or storage from Data storage, and Data or representation from Data representation by applying the pipe symbol (|) operator:
{Token.string == "Data | mining"}
{Token.string == "Data | processing"}
{Token.string == "Data | storage"}
{Token.string == "Data | representation"}
{Token.string == "Book | art"}
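As a point of clarification, and as an assumption about how such patterns are written in the JAPE shipped with recent GATE releases rather than a description of the grammar actually used here, string-level regular expressions are normally expressed with JAPE's regex operators (=~ matches anywhere in the string, ==~ must match the whole string) rather than with ==. For example:

Phase: RegexConcepts
Input: Token
Options: control = appelt

// matches catalog, catalogue, Catalog or Catalogue
Rule: CatalogueVariants
(
  {Token.string ==~ "[Cc]atalog(ue)?"}
):concept
-->
:concept.concept = {rule = "CatalogueVariants"}

// matches Data mining, Data processing, Data storage or Data representation
Rule: DataCompounds
(
  {Token.string == "Data"}
  {Token.string ==~ "mining|processing|storage|representation"}
):concept
-->
:concept.concept = {rule = "DataCompounds"}

Under this reading, a pattern such as {Token.string == "Data | mining"} is expressed as an alternation over the second token, as in the DataCompounds rule above.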

5. Discussion and evaluation

Our extraction of IS concepts using the JAPE grammar and regular expressions, based on the GATE Developer for automated IE, provides significant results. The main idea behind using JAPE and regular expressions is to identify IS terminology as tokens, for example Computing, Libraries and Information technology, in a large text. The term identification relies on looking up a list of IS terms from the Gazetteer. For example, we could look up book art, book card, book guidance or book catalogue, or computer application, computer science, computer experts, computer file, or computer image. These concepts can be collected to form the main component of an IS glossary and structured into a semi-formal hierarchy before creating the computational model of the OIS ontology. We extracted the IS concepts from a corpus of 300 documents, obtained specifically for this purpose. We ran the ANNIE application using the Document Reset, Tokeniser, Sentence Splitter, Gazetteer, POS tagger, JAPE transducer and Orthomatcher processing resources. The resulting annotation set appeared in the display pane, with the matched concepts highlighted in the default annotation set, each annotation type in a different colour, as can be seen in Fig. 4.
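For completeness, the JAPE transducer in such a pipeline is pointed at a main grammar file that simply chains the individual phases. Assuming the three phases shown earlier are stored as one.jape, Two.jape and Concept.jape alongside it (the file names are our assumption), the main file would look like:

MultiPhase: ISConcepts
Phases:
  one
  Two
  Concept

Each named phase is loaded from the file of the same name in the same directory and applied to the document in the order listed.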

Fig. 4 Annotation concepts in GATE

Figure 4 presents the results of annotating the IS concepts after running ANNIE and highlighting the matching concepts. The results show that our approach successfully annotates concepts. We recalled 541 instances of the Knowledge concept, 275 of the Information concept and 35 of the Organization concept (see Fig. 5). Each annotation starts at a specific point and ends at a different point, based on how many tokens it spans. The Knowledge concept starts at point (557) and ends at (566), while the Organization concept starts at (624) and ends at (636), with its features {majorType=concept}.


Fig. 5 Result of the annotation of the IS domain

The data were evaluated using the Annotation Diff tool, which is based on the evaluation metrics of precision, recall and the F-measure. Annotation Diff is used to compare two different annotation sets over the same document; we compared the key annotation set against the response annotation set. The tests showed that our accuracy rates are equal to the manual output of IS experts. The statistics of the corpus show that pattern matching of IS concepts based on the IS lookup list produced 403 correct concepts; accuracy was generally high, and there were no partially correct, missing or false-positive results. We used GATE due to its benefits as open-source software and because it contains multi-language NLP models that can be reused to develop other resources.
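For clarity, and using GATE's usual Annotation Diff definitions as an assumption (the paper itself does not spell the formulas out), the reported counts of 403 correct, 0 partially correct, 0 missing and 0 spurious annotations correspond to

\[
P = \frac{\text{correct} + \tfrac{1}{2}\,\text{partial}}{\text{correct} + \text{partial} + \text{spurious}} = \frac{403}{403} = 1.0, \qquad
R = \frac{\text{correct} + \tfrac{1}{2}\,\text{partial}}{\text{correct} + \text{partial} + \text{missing}} = \frac{403}{403} = 1.0, \qquad
F_1 = \frac{2PR}{P + R} = 1.0
\]

so precision, recall and the F-measure are all 1.0 on this corpus.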

Fig. 6 Accuracy results

6. Conclusion

A. Achievement
This paper has described a method using NLP techniques to extract concepts for the purpose of speeding up the development of an OIS. Furthermore, the development of the IE system should save domain experts time and effort in labelling the most common concepts. In total we extracted 664 concepts that are classes of the OIS, and 650 subclasses, making up the main components of the ontology skeleton. The IE technique can be applied to many different formats, such as XML and HTML documents, URLs or emails.

B. Future work
Ontology is at the heart of the semantic web: it defines the concepts and relations that make global interoperability possible. In future work, we plan to enhance these concepts so as to develop an OIS and create a taxonomy of IS as a domain. The next step will be to code the process using Protégé as the ontology editor. Additionally, a generic model of the OIS will be evaluated.

References

[1] Laender, A. H. F. & Ribeiro-Neto, B. A. (2002) A brief survey of web data extraction tools. SIGMOD Record. http://annotation.semanticweb.org/tools/
[2] Chang, C.-H., Kayed, M., Girgis, M. R. & Shaalan, K. (2006) A Survey of Web Information Extraction Systems. IEEE Transactions on Knowledge and Data Engineering, 18.
[3] Crescenzi, V. & Mecca, G. (2004) Automatic Information Extraction from Large Websites. Journal of the ACM, 51, pp. 731-779.
[4] GATE (2010) Developing Language Processing Components with GATE Version 6 (a User Guide). http://gate.ac.uk/sale/tao/splitch13.html#x18-32300013.2
[5] Handschuh, S. & Staab, S. (2002) Authoring and Annotation of Web Pages in CREAM. In: Proceedings of the 11th International World Wide Web Conference (WWW 2002), Honolulu, Hawaii, USA.
[6] Enterprise Integration Laboratory (2011) TOVE Ontology Project. University of Toronto. http://www.eil.utoronto.ca/enterprise-modelling/tove/
[7] Moens, M.-F. (2006) Information Extraction: Algorithms and Prospects in a Retrieval Context. Springer.
[8] Sawsaa, A. & Lu, J. (2010a) Ontocop: A virtual community of practice to create an ontology of Information Science. In: ICOMP'10, Las Vegas.
[9] Sawsaa, A. & Lu, J. (2010b) Ontology of Information Science Based on OWL for the Semantic Web. In: International Arab Conference on Information Technology (ACIT'2010), University of Garyounis, Benghazi, Libya.
[10] Srihari, R. & Li, W. (2002) Information Extraction Supported Question Answering. In: Proceedings of the Eighth Text REtrieval Conference (TREC-8).
[11] Sui, Z., Zhao, J., Kang, W. & Zhao, Q. (2008) The Building of a CBD-Based Domain Ontology in Chinese. In: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.
[12] Turmo, J., Ageno, A. & Català, N. (2006) Adaptive Information Extraction. ACM Computing Surveys, 38.



Ahlam Sawsaa was born in Libya. She received her B.S. and M.S. degrees in Library and Information Science from Garyounis University. She serves as a lecturer in the Department of Library and Information Science at Garyounis University and has supervised many graduate-degree projects. She is the author of a book and more than seven international publications, and a reviewer for international conferences. She is currently a PhD researcher in ontologies and the semantic web at the University of Huddersfield.

Joan Lu is a Professor in the Department of Informatics. She was a team leader of the IT department in an industrial company before she joined the university. Her research interests include XML technology, object-oriented system development, agent technology, data management systems, information access/retrieval/visualization/representation, security issues and Internet computing. She serves as Editor in Chief of the International Journal of Information Retrieval Research.

