
Algorithm Analysis for Twitter Sentiment Classification

Cardenas, Jonathan
De La Salle University aitsukihime@gmail.com

Jimenez, Nickleus
De La Salle University njr 528@yahoo.com.ph

King, Mark Kevin
De La Salle University king.mark.kevin@gmail.com

Panuelos, Kevin Matthew
De La Salle University kevin.panuelos@gmail.com

Abstract
In the pursuit of classifying continuously streaming messages, called tweets, by their emotional value, the researchers went through various data mining algorithms to see which of them fit the task best.

Categories and Subject Descriptors
E.0 [Data]: General

Keywords
Algorithms, Data Mining, Social Media, Sentiment Analysis

1. Introduction

Social media holds value for the internet advertising space [1], especially because advertising there is targeted to users and their interests, sometimes even at a granular level (e.g., location-based) [3]. The emergence of this type of advertising has led to various approaches, both intrusive and non-intrusive, that enable advertisers to get a better return on investment on their ad placements based on interest [2]. One of the developments in the online and social spaces comes in the form of sentiment analysis, which involves assessing the emotional state of a certain status update or post in a blog. This technique is still at an experimentation stage and has not necessarily seen mainstream usage in the larger web services (including advertising platforms) [?], but it may eventually see increased usage as data mining and text mining techniques are further improved and algorithms are investigated and researched.

2. Methodology

2.1 Data Gathering

Given the task of collecting text data and classifying them according to a certain sentiment (in this case happy, sad, angry and disgusted), social media platforms are one of the perfect places to engage in this type of research, especially due to the challenges posed by crowdsourced information. Given that Twitter opens up its database of status updates, or tweets, to developers interested in utilizing the publicly available text data that users send out, Twitter serves as a very large repository of brief messages that can be mined, cleaned and analyzed for sentiment.

However, sentiment analysis on a generic scale gives way to ambiguity and a bevy of issues beyond what is covered for the purposes of this research. The search was therefore narrowed down to the term Harry Potter, and the data for the model was collected by accompanying the term with emoticons in a specialized search string that can be used to filter through Twitter. An example of such a search string would be in this form:

(Harry Potter) ( :-) OR :} OR :) OR :-} )

A query like this would be acknowledged by Twitter's Search API, which looks approximately one week back on the record of tweets under a certain topic. Twitter's Search API allows the use of operators, quotes and, albeit undocumented, parentheses to granularize a search and make it specific. It is by no means perfect, especially given the search interface's potential to repeat certain results due to sponsorship deals, but this is handled in the data preprocessing phase.

That said, searching Twitter from a historical perspective can be limited by the number of requests that can be made to its servers and the number of results it returns, meaning that developers have a hard limit on the number of times they can call Twitter's search feature in a given time period, so as to avoid distributed denial of service attacks. The internal data gathering program then falls back to Twitter's other application programming interface, called the Streaming API, whenever the results of a historical search have been exhausted.

The difference between the Streaming API and the Search API is that the former catches new tweets as they come in, meaning that it does not perform a historical search, but waits for new tweets to arrive.
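The emoticon-qualified search string used for data gathering, of the form (Harry Potter) ( :-) OR :} OR :) OR :-} ), can be assembled programmatically. The following is a minimal sketch only: the function name build_search_query and the constant HAPPY_EMOTICONS are hypothetical names introduced for illustration, and the code builds the string without contacting Twitter.

```python
def build_search_query(topic, emoticons):
    """Return a query of the form '(topic) ( e1 OR e2 OR ... )'.

    Hypothetical helper for illustration; it only formats the string
    and makes no call to any Twitter API.
    """
    if not emoticons:
        raise ValueError("at least one emoticon is required")
    # Emoticons are alternatives, so they are joined with the OR operator.
    alternatives = " OR ".join(emoticons)
    return f"({topic}) ( {alternatives} )"


# Emoticons taken from the example search string for happy tweets.
HAPPY_EMOTICONS = [":-)", ":}", ":)", ":-}"]

query = build_search_query("Harry Potter", HAPPY_EMOTICONS)
print(query)
```

One such query per emotion class, each with its own emoticon list, would label the collected tweets by the emoticons they contain.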

2.2 Data Preprocessing

Preprocessing

2.3 Dataset

In building a corpus of tweets, the bag of words format was used, wherein each token in a tweet that hasn't been filtered out by the preprocessor is part of the entire set of special words that may come with a certain type of tweet. For example, if a tweet is classified as happy, each of the tokens that remain in that tweet is dedicated one column in the data set. Say the tweet contained the words

I'm having fun tonight. :)

The data preprocessor would likely eliminate the useless words and theoretically leave the tweet to contain only fun. Thus, the bag of words may look like this:

Attribute      Value
happy-c        happy
sad-c          others
surprised-c    others
disgusted-c    others
fun            1
...            0
...            0

Each word is assigned a value denoting whether it is present or not: 0 means not present, while 1 means present, no matter how many times the word appears.

2.4 Assessment Techniques

10-fold cross validation, supplied test set, percentage split

3. Classification Algorithms

The classification algorithms yielded different accuracies in categorizing the tweets. Various implementations of the three main algorithms were tested: J48, Support Vector Machines and Naive Bayes.

3.1 J48 Tree Algorithm

The J48 algorithm, better known as the C4.5 decision tree algorithm developed by Ross Quinlan, builds decision trees from training data sets the same way ID3 does [citation]. Like ID3, J48 uses the concept of information entropy, but in an improved form. J48 also has two types of pruning: post-pruning and on-line pruning. The idea of pruning is to compare how much error a decision tree suffers before and after the pruning. When J48 uses the so-called on-line pruning, it passes the data set through 10-fold cross validation.

3.2 Support Vector Machines

The SVM, or support vector machine, is a form of supervised learning model with associated learning algorithms that analyze and recognize data patterns, usually used in classification and regression analysis. The basic SVM works by taking an input and attempting to predict which of two possible classes the input belongs to. Given a set of training examples, an SVM training algorithm builds a model that assigns test examples to one class or the other in accordance with the class data it was trained with. The SVM model itself represents these examples as points in space, mapped so that examples of different categories are separated by a clear gap that is as wide as possible. Further examples are then mapped into the same space and classified based on which side of the gap they fall on.

3.2.1 SMO Algorithm

The SMO algorithm was invented by John Platt in 1998 at Microsoft Research. It aims to reduce the time taken to train SVMs by solving a series of very small quadratic programming problems as opposed to one really big one (the big problem is broken down). This is done iteratively, and allows the algorithm to handle very large training sets within a reasonable span of time.

3.2.2 LibSVM Algorithm

The nu-SVM classification is a type of training model that minimizes the error function by utilizing the nu parameter not to control the tradeoff between training error and generalisation error, but to place an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. This was implemented using the Weka LibSVM tool created by Yasser El-Manzalawy and Vasant Honavar. They claim that this implementation of the SVM algorithm runs much faster than the implementation of SMO in Weka; in our tests, however, SMO proved to be more accurate overall.

3.3 Naive Bayes

The Naive Bayes classifier, named after Thomas Bayes, is a statistical supervised learning algorithm that uses probabilities to determine the outcomes of anything uncertain in the model (University of Craiova, n.d.). Using probability, a given input has different certainties on whether or not it belongs to a certain class (Sahami, 1998). Predictive and diagnostic problems can be solved using this algorithm, including problems involving text documents. It is in wide use for spam filtering, recommendations, and other such online applications (Sahami, et al., 1998). Using Bayes' theorem, the algorithm computes the probabilities that a word can appear or not appear (Bernoulli model) or can be frequent or infrequent (Multinomial model) (Sanguinetti, 2012). In choosing which implementation of the Naive Bayes algorithm would be picked for the comparison of the classifiers, both Bernoulli and Multinomial implementations were tested for accuracy, which yielded the following learning curves.

The table indicates that a high percentage split for training data is needed for yielding more accurate results, but too much training data also reduced accuracy when it came to the relatively small training set. The J48 and Naive Bayes algorithms seem to yield close accuracies to each other, but from the get-go, SMO seems to be the most ideal way going forward. This hypothesis is further supported by the percentages yielded by the other emotions between the algorithms.
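The Bernoulli variant of Naive Bayes, in which each word is only recorded as present or absent, can be illustrated with a small sketch. This is not the Weka implementation used in the experiments: the function names, the add-one (Laplace) smoothing, and the four-tweet toy data set are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(docs, labels, vocab):
    """Estimate class priors and P(word present | class).

    docs is a list of token sets, labels the parallel list of class names.
    Add-one smoothing is an assumption of this sketch.
    """
    class_docs = defaultdict(list)
    for tokens, label in zip(docs, labels):
        class_docs[label].append(tokens)
    model = {}
    for label, members in class_docs.items():
        prior = len(members) / len(docs)
        present = {
            w: (sum(w in tokens for tokens in members) + 1) / (len(members) + 2)
            for w in vocab
        }
        model[label] = (prior, present)
    return model


def predict(model, tokens, vocab):
    """Return the class with the highest log posterior for a token set."""
    best_label, best_score = None, -math.inf
    for label, (prior, present) in model.items():
        score = math.log(prior)
        for w in vocab:
            p = present[w]
            # Absent words also contribute evidence in the Bernoulli model.
            score += math.log(p if w in tokens else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, not taken from the Harry Potter data set.
vocab = ["fun", "love", "boring", "hate"]
docs = [{"fun"}, {"love", "fun"}, {"boring"}, {"hate", "boring"}]
labels = ["happy", "happy", "others", "others"]

model = train_bernoulli_nb(docs, labels, vocab)
print(predict(model, {"fun"}, vocab))
```

A Multinomial variant would replace the present/absent test with word counts, which matters little for tweets where most words occur at most once.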

4. Clustering Algorithms

Three clustering algorithms were used for testing: the Cobweb algorithm, the expectation-maximization algorithm (EM algorithm) and the k-Means algorithm. Mixed results were gleaned from these tests, but this should be expected, as most clustering algorithms are not usually used for sentiment analysis of text.

4.0.1 Cobweb Algorithm

The Cobweb algorithm incrementally organizes the output data structure. The data structure is a decision tree that classifies a message to a certain leaf node. The leaf nodes are the final classifications of the argument. The Cobweb clustering algorithm required the dataset to be run through the Numeric to Nominal filter. The results were able to distinguish the nodes. An output that ignored all attributes (consisting of the key words) besides the emotions surprised, sad, disgust and happy classified the nodes equally, as it should. The leaf nodes represented the emotions as they are classified in the dataset. Node 0 contains all the entries or original attributes. The dataset has 500 entries for each emotion, and the algorithm was able to properly cluster the 2000 total entries into the four constituent parts. The visualization, shown below the tree, represented how far apart the characteristics of tweets with different emotions are. Words of the same emotion were clustered very densely, and the four clusters were far apart and equidistant, which meant that the algorithm was able to properly cluster the initial dataset.

If Weka is not allowed to ignore the attributes as part of its clustering criteria, it shows many different attributes from the dataset. This leads to the output tree looking very cluttered, as many attributes are considered as fixed points for the cluster. This means more leaves than expected, and the result is very difficult to read and interpret properly. The clustering, therefore, ended up failing to properly split the initial dataset into the four classifications. The algorithm only works effectively if it is given the main nominal classifiers as arguments. Including the other attributes only introduces noise and other distorting interference with the clustering. However, it does work properly when done correctly.

4.0.2 Expectation-Maximization Algorithm

Clusters are found by determining a mixture of Gaussians that fit a given data set. Each Gaussian has an associated mean and covariance matrix. The parameters can be initialized using the means of randomly selected Gaussians. The algorithm approaches an optimum by incrementally updating the mean values and variances. What this effectively means is that this clustering algorithm is best used with Gaussian-distributed values for data, which this particular dataset is not. Therefore, it is very much expected that the resulting clusters are not properly analyzable. Processing for this algorithm was done by taking the Harry Potter tweet dataset and eliminating three of the emotion classifiers, leaving only one as the basis. For this particular test, only the happy classifier was tested.

The following result did not look very promising. There are two visible clusters, but they are large clusters. While it is very evident that the cluster which is far from the big cluster is most likely the one representing the happy classifier, the clusters themselves are not very dense. Furthermore, the colors that are clustered close together are still very hazy; it is hard to determine any sort of conclusive evidence from this clustering algorithm, unlike the Cobweb algorithm.

4.0.3 K-Means Algorithm

The algorithm starts by choosing k data points at random as initial cluster centers. Each data point is then assigned to the cluster with the nearest center. Each cluster center is then replaced by the mean of all data points assigned to it, and the assignment and update steps are repeated until no reassignment changes the means.

The following image shows the results of running the dataset through the algorithm, which was processed the same way as for the previous algorithm (EM). Again, only two clusters are present; the layout is very similar to the EM algorithm's output, except the clusters are a lot more defined. In the bottom left cluster, it can be observed that all the data points are at least very close in relation to each other. Even in the large cluster, the red data points are very close together and neatly split the blue data points. While this does mean one can observe that one of the classifiers is present, it doesn't say anything about the other three classifiers.
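The k-Means procedure can be sketched in a few lines: pick k points as initial centers, assign every point to its nearest center, move each center to the mean of its points, and repeat until the assignments stop changing. The code below is an illustrative sketch rather than Weka's implementation; the function names, the fixed random seed, and the six two-dimensional toy points are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) after iterating to a stable assignment."""
    rng = random.Random(seed)
    # Initial centers are k data points chosen at random.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the means are stable
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = mean_point(members)
    return centers, assignment


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = k_means(points, k=2)
print(assignment)
```

On binary bag-of-words rows the same loop applies unchanged, with each tweet treated as a point whose coordinates are the 0/1 word indicators.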

5. Association Algorithms

5.0.4 Tertius Algorithm

With the Tertius algorithm, association rules are built using key-value pairs in the training data, and each rule is given a statistical rating stating how reliable the key-value pairs are. This rating is simply how many times the key-value pair occurs in the data. Using an A* search algorithm, the entire space is searched in a way that rules are formed. A rule is then composed of a body and a head, where the body is the set of conditions (or events) that are required to happen or be met before whatever is in the head happens. The algorithm records the number of times the rule is true, and the number of times the rule isn't true (Weka Tertius Explanation, 2010).

Unfortunately, this makes the algorithm less than ideal for this type of classification task, because not many of the tweet instances share the same words. A partial output of rules for classifying surprised tweets is available; unfortunately, as the output states (partially shown in the figure below), the memory was not enough to yield a truly finished rule set. It is apparent that the rules that aren't classified as true for the surprised class are marked false, which is to be expected.

5.0.5 Predictive Apriori Algorithm

The Predictive Apriori algorithm resembles the Apriori algorithm, except that the support threshold increases as the algorithm progresses, so the pruning gets more discriminative during execution. The best n rules are produced based on whether a rule has a confidence value that passes the confidence threshold. Unfortunately, the data set (called harry-potter-micro) had to be altered in order for the system not to break down while processing. Even then, the system wasn't able to find any itemsets of rules, even when starting out with a low confidence threshold. Again, this is likely due to the fact that the already low support counts for any of the attributes are further pruned due to the increasing support and, consequently, confidence thresholds.

5.0.6 FP Growth Algorithm

The association algorithm Frequent Pattern Growth, or FP Growth, yielded results from our dataset of tweets, which is identified as harry-potter-tweets. It yielded a set of association rules, the top ten of which were displayed by Weka. It did not take as long as the other association algorithms since it did not generate sets of candidate tables; it just found patterns to be associated. The divide-and-conquer approach of the algorithm was able to generate the rules since it decomposes the mining task. Hence it is less costly than the Apriori algorithm. In the end, the rules just show which attribute is associated with the emotion. Below are the 10 rules.

Figure 5.1. The rules generated by the FP Growth algorithm.

It can be seen that not every rule is confident in its association, but certain emotions were confidently associated with attributes. However, in human context the associations do not make total sense; for example, the word harry does not indicate an emotion.

6. Inter-algorithm Comparison

From the various results of each algorithm, and through a simple, logical process of elimination, association is not very ideal for the supervised classification of the tweets, but can actually help produce n-grams that can be potentially useful for a different type of data set. Clustering is likewise not very ideal for a strictly supervised classification task, which logically leads to classification algorithms being the most ideal for this task.

7. Conclusion and Recommendations

References
[1] Newman, A. A. Brands now direct their followers to social media, Aug. 2011.
[2] Qing, L. Y. Social ads offer better targeting, ROI, 2012.
[3] Tam, D. Twitter offers location targeted advertisements, 2012.