
Addis Ababa University College of Natural Science

School of Information Science

Course: Data and Web mining


Course Code: INSC 636

Documentation on Text Mining

Prepared By:

1. Eden Getachew ---------------------------GSE/0368/08


2. Henos Demeke ---------------------------GSE/0380/08

Instructor: Dr. Million Meshesha

Dec 3, 2016
Contents
Introduction
    Text mining VS data mining
Methods and Models Used In Text Mining
    Term Based Method
    Phrase Based Method
    Concept Based Method
    Pattern Taxonomy Method
Techniques Used In Text Mining
    Information Extraction
    Categorization
    Clustering
    Visualization
    Summarization
Advantages and Challenges of Text Mining
    Advantages of text mining
    Challenges of text mining
Applications of Text Mining
    Text mining in Human Resource Management
    Text mining in Customer relationship management and market analysis
Conclusion
Recommendation
Reference
Introduction
Data mining works on structured data (data arranged in labelled rows and columns that data
mining tools can easily order and process), such as transaction data, relational (SQL)
databases, and data warehouses. However, Seth Grimes published an article stating that 80% of
business-relevant information originates in unstructured form, primarily text [4]. This text
consists of large collections of documents from various sources, such as news articles, research
papers, books, digital libraries, e-mail messages, and Web pages. Text databases are growing
rapidly because of the large amount of information available in electronic form, such as
electronic documents, electronic publications, e-mail, and the World Wide Web, and because
most organizational and business information is stored electronically in the form of text
databases [1] [6].

Data stored in most text databases are semi-structured: they are neither completely
unstructured nor completely structured. For example, a document may contain a few
structured fields, such as category, authors, title, and publication date, but also some
largely unstructured text components, such as the abstract and contents, which are not easily
broken down and categorized. Information retrieval techniques, such as text indexing methods,
have been developed to handle unstructured documents [1] [6].

Text mining is a relatively new, multidisciplinary area of computer science with connections
to natural language processing, data mining, machine learning, information retrieval, and
knowledge management. Text mining obtains useful information from unstructured textual data
by identifying and exploring interesting patterns, automatically extracting information from
different written resources [5] [10].

Text mining VS data mining
Both seek novel and useful patterns, and both are semi-automated processes. They differ,
however, in the nature of the data they use: data mining requires structured data, whereas
text mining can be applied to semi-structured and unstructured data.

Data mining is the discovery of knowledge from structured data, whereas text mining finds
useful information in sources that contain semi-structured or unstructured data. The key
difference between the two is therefore that in text mining the data is unstructured or
semi-structured [5] [10].

Databases are designed for programs to process automatically; text is written for people to
read. Many researchers think it will require a full simulation of how the mind works before
we can write programs that read the way people do. However, the field of computational
linguistics (also known as natural language processing) is making a lot of progress on small
subtasks in text analysis.

Methods and Models Used In Text Mining


Text mining methods are classified according to how text documents are analyzed. From the
information retrieval perspective, there are essentially four methods [2]:

1) Term Based Method (TBM)
2) Phrase Based Method (PBM)
3) Concept Based Method (CBM)
4) Pattern Taxonomy Method (PTM)

Term Based Method

In this method the term is the unit that carries the semantic meaning of the text. Each term
has a weight that measures its importance. Term based methods suffer from the problems of
polysemy and synonymy: polysemy means that one word has multiple meanings, and synonymy
means that multiple words have the same meaning. As a result, the semantic meaning of many
discovered terms is too uncertain to answer what users want. The weight of a term is
calculated from its distribution in the documents: the term frequency TF(t, d) indicates how
many times term t occurs in document d, the document frequency DF(t) is the number of
documents in which term t occurs, and the inverse document frequency (IDF) measures how the
term is distributed across the documents [2].
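
The weighting described above can be sketched as follows. This is a minimal illustration and
not the formulation used in [2]: the whitespace tokenizer, the function name tf_idf, the sample
documents, and the choice of log(N / DF(t)) as the IDF formula are all assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (raw TF times log(N / DF))."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed: whitespace tokens
    n_docs = len(tokenized)
    # DF(t): number of documents that contain term t at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # TF(t, d): occurrences of t in d
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical sample corpus.
docs = ["text mining finds patterns in text",
        "data mining needs structured data",
        "text is written for people"]
weights = tf_idf(docs)
```

With this choice of IDF, a term that occurs in every document receives weight 0, while a term
concentrated in a single document receives the largest weight for its frequency.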

Phrase Based Method

In this method the document is analyzed on a phrase basis, where a phrase is a sequence of
terms. Because a phrase is a collection of semantically related terms, it is less ambiguous
and more discriminative than a separate term. Sequential pattern mining algorithms obtain
phrases by extracting frequent sequential patterns, that is, frequently occurring sequences
of terms. However, it is difficult to use phrases to answer user needs because phrases occur
less often in documents [2].
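
A simplified sketch of phrase discovery is given below. It counts only contiguous n-term
phrases, which is a narrower technique than general sequential pattern mining (which also
allows gaps between terms). The function name frequent_phrases, the parameters n and
min_support, and the sample documents are assumptions made for illustration.

```python
from collections import Counter

def frequent_phrases(docs, n=2, min_support=2):
    """Return n-term phrases that occur in at least min_support documents."""
    support = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        # A set counts each phrase once per document (document-level support).
        phrases = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        support.update(phrases)
    return {" ".join(p): c for p, c in support.items() if c >= min_support}

# Hypothetical sample corpus.
docs = ["text mining extracts patterns",
        "text mining uses information retrieval",
        "information retrieval indexes documents"]
phrases = frequent_phrases(docs)
```

Even in this small corpus most two-term phrases occur only once, which illustrates why phrases
have low occurrence counts compared with single terms.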

Concept Based Method


In the term based method, statistical analysis captures only the importance of a term within
a document. In the concept based method, each term that contributes to the semantics of a
sentence is analyzed with respect to its importance at both the sentence and document levels.
The conceptual term frequency (ctf) counts the occurrences of a concept in a sentence, and
the term frequency (tf) counts the occurrences of a concept in the original document [2].

Every concept in a new document is matched with the concepts in the previously processed
documents. This matching is accomplished by keeping a concept list L that holds an entry for
each previous document that shares a concept with the current document. After the document
is processed, L contains all the matching concepts between the current document and any
previous document that shares at least one concept with it. Finally, L is output as the list
of documents with the matching concepts and the necessary information about them. The
concept-based term analyzer algorithm is capable of matching each concept in a new document
d with all the previously processed documents in O(m) time, where m is the number of concepts
in d [2].
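
One way to keep the matching cost proportional to the number of concepts in the new document
is an inverted index from concepts to documents, so that each concept needs a single
dictionary lookup. The sketch below assumes that concepts have already been extracted as
strings; the class name ConceptMatcher and the document ids are hypothetical, and this is not
the algorithm given in [2].

```python
from collections import defaultdict

class ConceptMatcher:
    def __init__(self):
        # concept -> ids of previously processed documents that contain it
        self.index = defaultdict(set)

    def process(self, doc_id, concepts):
        """Return L as {previous_doc_id: shared concepts}, then index the new document."""
        unique = set(concepts)
        matches = defaultdict(set)
        for concept in unique:  # one dictionary lookup per concept in d
            for prev_id in self.index.get(concept, ()):
                matches[prev_id].add(concept)
        for concept in unique:
            self.index[concept].add(doc_id)
        return dict(matches)

matcher = ConceptMatcher()
matcher.process("d1", ["text mining", "clustering"])
matcher.process("d2", ["data warehouse"])
L = matcher.process("d3", ["clustering", "text mining", "summarization"])
```

Here L lists only d1, because d2 shares no concept with the new document.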

Pattern Taxonomy Method


This method represents a text document on a pattern basis. A pattern taxonomy is a tree-like
structure that shows the patterns extracted from text data. Instead of a term based document
representation, the pattern based model uses frequent sequential patterns (of a single term
or multiple terms) to perform the same task. All documents are split into paragraphs,
sequential patterns are discovered from the corpus, and the pattern taxonomy model then shows
the relationships between the patterns found in the documents [2].
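
The idea can be sketched in two steps: find term sequences that are frequent across
paragraphs, then link each pattern to the longer patterns that contain it. The brute-force
enumeration below is only practical for short paragraphs; the function names, the max_len
and min_support parameters, and the sample paragraphs are assumptions and are not taken
from [2].

```python
from collections import Counter
from itertools import combinations

def frequent_patterns(paragraphs, min_support=2, max_len=2):
    """Return sequential patterns found in at least min_support paragraphs."""
    support = Counter()
    for para in paragraphs:
        tokens = para.lower().split()
        seen = set()
        for k in range(1, max_len + 1):
            # combinations() preserves order, so each tuple is a subsequence.
            seen.update(combinations(tokens, k))
        support.update(seen)  # count each pattern once per paragraph
    return {p: c for p, c in support.items() if c >= min_support}

def taxonomy(patterns):
    """Map each pattern to the longer frequent patterns that contain it."""
    def is_subseq(short, longer):
        it = iter(longer)
        return all(term in it for term in short)
    return {p: [q for q in patterns if len(q) > len(p) and is_subseq(p, q)]
            for p in patterns}

# Hypothetical paragraphs from one document.
paragraphs = ["text mining finds patterns",
              "text mining needs language tools",
              "data mining finds patterns"]
patterns = frequent_patterns(paragraphs)
tree = taxonomy(patterns)
```

In the resulting mapping, single-term patterns sit at the bottom and point to the multi-term
patterns that extend them, which gives the tree-like structure described above.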

Techniques Used In Text Mining

Information Extraction
Information extraction is the initial step for a computer to analyze unstructured text. It
uses pattern matching to look for predefined sequences in text, and thereby identifies key
information about entities of interest in the application domain or relations between such
entities, usually in the form of events. Information extraction tasks include tokenization,
identification of named entities, sentence segmentation, and part-of-speech assignment. After
the entity terms and events have been extracted and semantically interpreted, the required
pieces of information are entered into a database for further processing. Information
extraction thus solves the difficulty of transforming a collection of textual documents into
a more structured database. The general information extraction process is shown in Fig. 1
[2] [8].

Fig. 1 Information Extraction
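
A toy version of this process, using regular expressions for sentence segmentation,
tokenization, and pattern matching, might look as follows. The event pattern ("<person>
joined <organization> in <year>"), the names in the sample text, and all function names are
hypothetical; real systems rely on far richer linguistic processing.

```python
import re

def split_sentences(text):
    # Naive segmentation: break after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    return re.findall(r"\w+", sentence)

# Hypothetical predefined sequence describing a hiring event.
HIRE = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) joined (?P<org>[A-Z]\w+) in (?P<year>\d{4})"
)

def extract_events(text):
    """Return one structured record per sentence that matches the pattern."""
    records = []
    for sentence in split_sentences(text):
        match = HIRE.search(sentence)
        if match:
            records.append(match.groupdict())
    return records

text = ("Alice Smith joined Acme in 2015. The weather was fine. "
        "Bob Jones joined Initech in 2016.")
records = extract_events(text)
tokens = tokenize("The weather was fine.")
```

Each record is a small dictionary with person, org, and year fields, which is the kind of
structured row that could then be entered into a database.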

Categorization
Categorization assigns predefined classes to text documents based on their content; that is,
it automatically assigns one or more categories to each document. The categorization process
consists of pre-processing, indexing, dimensionality reduction, and classification [2] [8].

Using machine learning, the objective is to learn a classifier from examples so that category
assignments can be performed automatically. This is a supervised learning problem, because
the classifier is learned from input-output examples and then used to categorize new
documents [2] [8].

Categorization identifies the main topics that a document covers by counting the terms that
appear in it. It depends on the vocabulary of the topic, and relationships are recognized by
looking for broad terms, narrower terms, synonyms, and related terms [2] [8].
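
As an illustration of learning a classifier from labelled examples, the sketch below
implements a small multinomial Naive Bayes classifier with add-one smoothing. Naive Bayes is
chosen here only because it is compact; the text above does not name a specific classifier,
and the class name, training documents, and category labels are assumptions.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per category
        self.term_counts = defaultdict(Counter)   # term counts per category
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc.lower().split())
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, doc):
        tokens = doc.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total_docs)      # log prior
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)  # add-one smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training examples (input-output pairs).
train_docs = ["bank loan interest rate", "insurance premium bank account",
              "patient drug clinical trial", "drug dosage patient care"]
train_labels = ["finance", "finance", "medical", "medical"]
clf = NaiveBayesText().fit(train_docs, train_labels)
prediction = clf.predict("patient drug dosage")
```

The classifier is trained once on the labelled examples and can then assign a category to
any new document, which is the supervised setting described above.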

Clustering
Clustering groups documents with similar contents. A cluster is a group of documents whose
features are more similar to each other than to the features of documents in any other group;
in other words, the documents in one cluster share features that separate them from the
documents in other clusters. Clustering algorithms are applied to discover structure within
the corpus and to create subsets of it. Since a document can appear in multiple subtopics,
clustering helps ensure that a useful document is not overlooked [2].

Although clustering groups similar documents, it differs from categorization because
documents are clustered on the fly instead of through predefined topics. Clustering does not
require any predefined categories in order to group the documents [2].

The clustering algorithm varies with the data collection and the task to be accomplished.
Hierarchical clustering, K-means, and binary relational clustering are the ones used most
frequently. A basic clustering algorithm creates a vector of topics for each document and
measures the weights of how well the document fits into each cluster [2].

In hierarchical clustering, clusters are organized in cluster trees, where similar clusters
appear in the same branch of the tree. This approach is used to give an overview of the
contents of a large document collection and to identify hidden structures within groups of
objects; as a result, related information becomes easy to find and duplicate documents in
the corpus can be easily identified [2] [8].
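
A minimal sketch of bottom-up (agglomerative) hierarchical clustering is shown below. It
represents each document as a term-count vector and repeatedly merges the two most similar
clusters. The use of cosine similarity, average linkage, a fixed target of k clusters, and
the sample documents are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def agglomerative(docs, k):
    """Merge the two most similar clusters until only k clusters remain."""
    vectors = [Counter(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]  # start with one cluster per document
    while len(clusters) > max(k, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean pairwise similarity between the clusters.
                sims = [cosine(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters

# Hypothetical sample corpus.
docs = ["bank loan interest", "bank interest rate",
        "patient drug trial", "drug trial results"]
clusters = agglomerative(docs, k=2)
```

No categories are supplied in advance: the two groups emerge only from the similarity between
the documents. Recording the order of the merges would give the cluster tree.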

Visualization
Visualization represents a large number of documents in a visual hierarchy or map, which
makes it easy for users to discover relevant information. Users can interact with the map by
zooming and creating sub-maps. This technique is useful when the user wants to narrow down a
range of documents and discover other related documents, and it can be applied to tracing
terrorists and crimes [5].

Visualization is carried out in three steps. The first step is data preparation, which
decides on and obtains the data to visualize and creates the original data space. The second
step is data analysis and extraction, which analyzes and extracts the visualization data
from the original data and forms the visualization data space. The last step is visualization
mapping, which applies a mapping algorithm to map the visualization data space to the
visualization target [5].

Summarization
Summarization can be defined as reducing the size of a text document without losing its
important content. Even though computers have become powerful, they still struggle to
understand the semantics of documents. Despite this problem, there are tools that summarize
using techniques such as sentence extraction and position information, which look for clues
and subtopics. Humans, in contrast, first read the document and then try to summarize its
important points. Text summarization helps users decide whether a specific document satisfies
their need before they read the whole document [5].

Summarization has three steps. The first is the preprocessing step, in which a structured
representation of the text document is created. The second is the processing step, which
converts the structured text into a summary structure. The final step is the generation step,
in which the final summary is extracted from the summary structure [5].

In the preprocessing stage, sentences are identified and every word in each sentence is
listed. If different sentences contain the same word, those sentences are considered related.
The rank of each sentence is then calculated, and the top X sentences are taken as the
summary.
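
The sentence-ranking idea can be sketched as follows. The scoring rule (number of words a
sentence shares with all other sentences), the small stop-word list, the function name
summarize, and the sample text are assumptions; the text above does not specify how the rank
is computed.

```python
import re

# Assumed stop-word list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it", "for"}

def summarize(text, top_x=2):
    """Return the top_x highest-ranked sentences in their original order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = [set(re.findall(r"\w+", s.lower())) - STOPWORDS for s in sentences]
    # Score of a sentence: total number of words shared with every other sentence.
    scores = [sum(len(w & other) for j, other in enumerate(words) if j != i)
              for i, w in enumerate(words)]
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:top_x]
    return [sentences[i] for i in sorted(ranked)]

text = ("Text mining extracts knowledge from documents. "
        "Clustering groups similar documents. "
        "The weather is nice today. "
        "Text mining also supports clustering of documents.")
summary = summarize(text, top_x=2)
```

Sentences that share no words with the rest of the document, such as the one about the
weather, receive a score of 0 and are left out of the summary.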

Advantages and Challenges of Text Mining

Advantages of text mining


Relationships between different entities and their names can easily be identified in a
collection of documents by applying techniques such as information extraction.

Text mining eases the difficulty of managing and extracting useful knowledge from
unstructured documents [2].

Challenges of text mining


The major challenge of text mining comes from natural language itself, which is ambiguous:
one word may have different meanings, and different words may have the same meaning. This
ambiguity creates noise. A lot of research has been done to avoid this problem, but it is
still in progress [2].

Applications of Text Mining


Text mining can be applied in different fields, such as the financial sector (banks and
insurance), the medical and pharmaceutical sectors, the media and communication sector,
political institutions and public administration, and research. Some of its applications are
reviewed in more detail here:

Text mining in Human Resource Management


Text mining is widely used in knowledge management (KM) and human resource management (HRM).
One application of text mining in KM and HRM is as follows:

Human resource management: Text mining can be used in HRM to manage and analyze employees'
comments and documents, to select new staff, and to measure their performance. In general,
text mining helps a company measure its performance by analyzing informal or unstructured
documents [5].

Text mining in Customer relationship management and market analysis
Text mining is widely applied in customer relationship management to handle customers'
messages, especially frequently asked questions, forwarding questions to the appropriate
department and providing the right answer. Market analysis is another field that benefits
greatly from text mining. Its application in market analysis goes beyond analyzing
competitors and customer opinions: it also focuses on building the company's image by
examining press reviews and other documents [5].

Conclusion
In conclusion, text mining is a growing field with many benefits for different organizations.
According to text mining experts, most of an organization's documents are in unstructured
text format, and text mining is well suited to managing and analyzing texts that are
completely unstructured or semi-structured. As stated above, it has many advantages in human
resource management, where it helps manage and analyze employees' documents and opinions.
These are unstructured and difficult to handle with methods such as data mining, which need
a structured database that is not available for all kinds of organizational files. Moreover,
text mining can be applied in customer relationship management and market analysis. Despite
all these advantages, it also faces challenges that stem from the ambiguous nature of
natural language.

Recommendation
Since text mining makes people and organizations more competent and profitable, it deserves
appropriate research attention so that it can be improved and the full benefits of its
application obtained. As noted earlier, there are challenges such as the ambiguity of natural
language; these may be reduced by using the pattern taxonomy method, since this method uses
patterns to extract knowledge from text data. In addition, the field needs further extensive
research.

Reference

1. Cios, K. J., Pedrycz, W., & Swiniarski, R. W. (1998). Data Mining and Knowledge
Discovery. In Data Mining Methods for Knowledge Discovery (pp. 1-26). Springer US.
2. Gaikwad, S. V., Chaugule, A., & Patil, P. (2014). Text mining methods and techniques.
International Journal of Computer Applications, 85(17).
3. Gaikwad, S. V., & Chaugule, A. Performance Enhancement of Effective Pattern Discovery
in Text Mining for Medical database.
4. Gupta, A. P. P., & Mishra, A. High Performance Side Information Mining using Enhanced
COATES Algorithm.
5. Gupta, V., & Lehal, G. S. (2009). A survey of text mining techniques and
applications. Journal of Emerging Technologies in Web Intelligence, 1(1), 60-76.
6. Han, J., Pei, J., & Kamber, M. (2011). Data mining: concepts and techniques. Elsevier.
7. Joachims, T. (1998, April). Text categorization with support vector machines: Learning
with many relevant features. In European conference on machine learning (pp. 137-142).
Springer Berlin Heidelberg.
8. Karanikas, H., Tjortjis, C., & Theodoulidis, B. (2000, September). An approach to text
mining using information extraction. In Proc. Knowledge Management Theory
Applications Workshop (KMTA 2000).
9. Krzysztof, J. C. (2002). Data Mining: A Knowledge Discovery Approach/Krzysztof J. Cios.
10. Radovanović, M., & Ivanović, M. (2008). Text mining: Approaches and applications. Novi
Sad J. Math, 38(3), 227-234.
11. Witten, I. H. (2005). Text mining. Practical handbook of Internet computing, 14-1.
