
Twitter Sentiment Analysis

Suman Sourav (BE/6030/15)
Mekala Keerthi Niveditha (BE/6034/15)

Project Supervisor
Balaram Mandal
Assistant Professor, CSE Dept, BIT Mesra, Off Campus Deoghar
INTRODUCTION
What is sentiment analysis?
Sentiment Analysis is the process of ‘computationally’ determining whether a piece of writing is positive,
negative or neutral. It’s also known as opinion mining, deriving the opinion or attitude of a speaker.
Natural Language Processing (NLP) is a hotbed of research in data science these days and one of the most
common applications of NLP is sentiment analysis. From opinion polls to creating entire marketing strategies,
this domain has completely reshaped the way businesses work, which is why this is an area every data
scientist must be familiar with.

Why sentiment analysis?

• Business: In marketing, companies use it to develop their strategies, to understand customers'
feelings towards a product or brand, to see how people respond to their campaigns or product launches,
and to learn why consumers don't buy some products.
• Politics: In politics, it is used to keep track of political views and to detect consistency and inconsistency
between statements and actions at the government level. It can also be used to predict election results.
• Public Actions: Sentiment analysis is also used to monitor and analyze social phenomena, to spot
potentially dangerous situations and to determine the general mood of the blogosphere.
Problem Statement
In this report, we conduct sentiment analysis on tweets using TextBlob, Tweepy and the
support vector machine algorithm. We try to determine the polarity of each statement, that is, whether it is
positive or negative. The most dominant sentiment is picked as the final label.

We used data from Kaggle which was labeled positive, negative and neutral. The data contains
emoticons, usernames and hashtags, which must be removed during preprocessing before the data is used.
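
The cleaning step described above can be sketched with the standard `re` module. This is a minimal sketch, not the preprocessing actually used in the project: the function name `clean_tweet`, the emoticon pattern and the choice to drop every non-ASCII character (which removes emoji but also accented letters) are assumptions made for illustration.

```python
import re

# Patterns are illustrative assumptions, not the project's actual rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
# A few common ASCII emoticons such as :) ;( =D, matched only as standalone tokens.
EMOTICON_RE = re.compile(r"(?<!\S)[:;=][-']?[\)\(\]\[DPp/\\|](?!\S)")
# Emoji are outside ASCII; this also strips accented letters (English-only assumption).
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")


def clean_tweet(text):
    """Remove URLs, usernames, hashtags, emoticons and emoji, then normalize spacing."""
    text = URL_RE.sub(" ", text)
    text = MENTION_OR_HASHTAG_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    # Lowercase and collapse runs of whitespace into single spaces.
    return " ".join(text.lower().split())
```

URLs are removed first so that the later patterns do not leave fragments of a link behind.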

The second data set is collected directly from a live application created on Twitter. This data set is
highly complex, as it contains likes, pictures, retweets and all other data related to the
tweets.
Dataset Description
The dataset is collected from Kaggle.com.
The dataset consists of 17 rows and 1483 columns. It concerns airlines and is
labeled as positive, negative and neutral for training and test data. The relevant columns are ID,
AIRLINE_SENTIMENT and TWEET.
The data is a mixture of comments, emoji, URLs, sentiments, user locations and other details.
Number of NEUTRAL comments: 3099
Number of NEGATIVE comments: 9178
Number of POSITIVE comments: 2363
As the data is skewed towards negative comments, the training and test data are chosen such that the
model is not biased towards any particular sentiment.
The @VirginAmerica data set is taken for the analysis.
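
The report does not say how the balanced selection was made. One common option, shown here purely as an assumption, is random undersampling: every class is cut down to the size of the smallest one. The helper name `undersample` and its parameters are hypothetical.

```python
import random
from collections import defaultdict

# Class counts as reported in the dataset description.
counts = {"neutral": 3099, "negative": 9178, "positive": 2363}
total = sum(counts.values())
# Share of each class, e.g. negative comments make up roughly 63% of the data.
shares = {label: n / total for label, n in counts.items()}


def undersample(rows, label_of, seed=0):
    """Keep an equal number of rows per class (the size of the smallest class)."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced
```

Undersampling discards data from the larger classes; class weighting in the classifier is an alternative that keeps every row.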

For the live model, which uses the tweepy module and the Twitter API, the data is collected directly
from Twitter using the authentication tokens and access codes provided by the Twitter application. The
collected data is very complex. The tweepy module extracts the tweets, retweets, likes,
comments and other relevant data from the collected data.
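
A minimal sketch of this collection step is given below. It assumes the classic Tweepy OAuth 1 flow (`OAuthHandler`, `API.user_timeline`) and an application whose credentials still have access to the timeline endpoint. The function names `tweet_to_row` and `fetch_timeline` are hypothetical, and the credentials are placeholders that the caller must supply.

```python
def tweet_to_row(status):
    """Flatten one Tweepy status object into a plain dict of analysis fields."""
    return {
        "text": status.text,
        "length": len(status.text),
        "date": status.created_at,
        "source": status.source,
        "likes": status.favorite_count,
        "retweets": status.retweet_count,
    }


def fetch_timeline(screen_name, consumer_key, consumer_secret,
                   access_token, access_token_secret, count=200):
    """Download recent tweets of one account (needs network access and valid keys)."""
    import tweepy  # third-party; imported here so tweet_to_row works without it

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    statuses = api.user_timeline(screen_name=screen_name, count=count)
    return [tweet_to_row(status) for status in statuses]
```

The resulting list of dicts can be passed straight to `pandas.DataFrame` for plotting.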
Packages Used
• NumPy
• Matplotlib
• Seaborn
• Tweepy
• TextBlob
• pandas
• NLTK
• scikit-learn (sklearn)
• scikit-learn SVC
Algorithms Used
• TextBlob package - TextBlob uses three parameters to analyze the sentiment of a
statement:
a. Polarity - Sentiment polarity indicates whether the sentiment is negative, neutral or
positive. TextBlob reports it as a number from -1.0 (most negative) to 1.0 (most positive).
b. Subjectivity - A subjective sentence expresses personal feelings, views or beliefs.
TextBlob reports it as a number from 0.0 (objective) to 1.0 (subjective).
c. Intensity - Modifiers such as "very" and "greatly" act as intensifiers. These change
the strength of the sentiment.

• Logistic Regression - Logistic regression is the appropriate regression analysis to conduct when the
dependent variable is binary. Like all regression analyses, it is used for predictive analysis. Logistic
regression is used to describe data and to explain the relationship between one dependent binary
variable and one or more nominal, ordinal, interval or ratio-level independent variables.

• Support Vector Machines - A support vector machine can also be used for regression while keeping all the
main features that characterize the algorithm. Support vector regression uses the same principle
as the SVM for classification, with a few adjustments. Overall, the idea is to minimize the
error by identifying the hyperplane that maximizes the margin, keeping in mind that part of the
error is tolerated.
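
To make the difference between the last two algorithms concrete, the sketch below applies both to the same linear score. Logistic regression turns the score into a probability, while a linear SVM classifier penalizes it with the hinge loss, which is zero once a point lies beyond the margin. The function names are illustrative only. The code covers classification, not the regression variant mentioned above.

```python
import math


def linear_score(weights, bias, features):
    """Signed score w.x + b; its sign is the side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def logistic_probability(score):
    """Probability of the positive class under logistic regression."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    # Equivalent form that avoids overflow for large negative scores.
    e = math.exp(score)
    return e / (1.0 + e)


def hinge_loss(score, label):
    """SVM loss for a label of +1 or -1; zero when the margin is at least 1."""
    return max(0.0, 1.0 - label * score)
```

A score of 0 gives a probability of 0.5, which is the usual decision boundary for a binary label.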
Model 1
Using Tweepy And TextBlob Package

Steps :
1. Making a Twitter application for extracting the live tweets. To stream tweets
directly from Twitter to our module, we first need to create a Twitter application. The application
provides the credentials for logging in to our Twitter account and the
authority to stream tweets from the specified Twitter handle.
2. In Jupyter Notebook, we import all the necessary packages for the module.
3. Then we write all the necessary functions to stream the data and bring the important analysis details
into the program:
   • Friend list
   • Likes
   • Retweets
   • Length of Tweet
   • Source of Tweet
4. Visualization of data

Fig 1: Analyzed data

Fig 2: Length Vs Date Plot


Fig 3: Likes, Retweets Vs Time

Fig 4: Length Vs Date


Result of Model 1
Analyzing the above results, we can see that the model reports the sentiment as 0 if neutral, 1 if
positive and -1 if negative.
It also provides the likes, retweets, date and the actual tweet.
TextBlob is an effective package for textual data analysis.
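
The mapping to -1, 0 and 1 described above can be sketched as follows. It assumes the label is taken from the sign of TextBlob's polarity score, which is a common choice but is not confirmed by the report. The function names are hypothetical.

```python
def polarity_to_label(polarity):
    """Map a polarity score in [-1.0, 1.0] to 1 (positive), 0 (neutral) or -1 (negative)."""
    if polarity > 0:
        return 1
    if polarity < 0:
        return -1
    return 0


def analyze_sentiment(text):
    """Label one tweet using TextBlob's default sentiment analyzer."""
    from textblob import TextBlob  # third-party; imported only when needed

    return polarity_to_label(TextBlob(text).sentiment.polarity)
```

In practice the text would first be cleaned of links and special characters before it is passed to `analyze_sentiment`.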

Fig 5: Result of Model 1


Model 2
Using SVM and Logistic Regression
Steps:
1. First we obtain the dataset required for the model. The data was collected from
Kaggle.com. The dataset contains the data for the @VirginAmerica airline. The
data set is described in an earlier part of this report.
2. Importing all the necessary packages for training and testing the
model.
3. Preprocessing the dataset to make it fit for analysis.

Fig 6: Dataset Suitable for Analysis


5. The data set is then divided into training and test sets. We take 80% of the data for training
the model and the rest for cross-validation.

6. Next, we fit our data to the predictive model using an SVM pipeline, which takes advantage of
the system's multiple cores. We first set the folds for the pipeline and then fit it into the
griv_Svm core.
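
Steps 5 and 6 might look roughly like the sketch below. It assumes scikit-learn's `Pipeline`, `GridSearchCV` and `StratifiedKFold`, with `n_jobs=-1` to use all cores. The bag-of-words vectorizer, the linear kernel, the values tried for `C`, the five folds and the ROC AUC scoring are all illustrative assumptions, and `split_80_20` and `build_grid_svm` are hypothetical names.

```python
import random


def split_80_20(rows, seed=0):
    """Shuffle the rows and return (train, test) with an 80/20 split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


def build_grid_svm(n_folds=5):
    """Build a cross-validated grid search over an SVM text pipeline (not yet fitted)."""
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    pipeline = Pipeline([
        ("vectorizer", CountVectorizer(stop_words="english")),
        ("svc", SVC(kernel="linear", probability=True)),
    ])
    return GridSearchCV(
        pipeline,
        param_grid={"svc__C": [0.01, 0.1, 1.0]},  # candidate values, chosen arbitrarily
        cv=StratifiedKFold(n_splits=n_folds),     # the "folds" of step 6
        scoring="roc_auc",                        # assumes binary 0/1 labels
        n_jobs=-1,                                # run the folds on all available cores
    )
```

Calling `build_grid_svm().fit(train_texts, train_labels)` would then run the search, where `train_texts` and `train_labels` stand for the training portion of the cleaned tweets.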
Result of Model 2

Fig 7: Result of Comparison Parameters


ROC Curve

Fig 22: ROC Curve


Learning Curve
Score vs Training Score:

Clearly, we can see that the training score and the cross-validation score increase as the size of the
training set increases. The result for the data set is given with the above accuracy.
We also predict some tweets of our own. The result will be either 1 or 0, depending
on whether the statement is true or false.

Fig 25: Some examples of Prediction


Future Work
1. As we used logistic regression for the analysis, we can only predict binary results, i.e. we can
only say whether a comment is positive or negative. In the real world the values are not so
discrete and are often fuzzy. So, we need a model that can provide a fuzzy value
for a statement in terms of positivity and negativity.
2. The analysis can also be done with other machine learning algorithms, such as neural networks and
random forests, and the results can be compared to find the best analysis parameters and
sentiments.
3. The size of the dataset can be increased for better results. Outlier detection can also be
implemented in the model for better performance.
4. The model is only effective for English-language tweets, as the stop words used for
tokenization and feature extraction are English. We can extend this project to predict
the sentiment for other languages as well.
References

1. Kaggle
(www.kaggle.com)
2. freeCodeCamp
(https://medium.freecodecamp.org/how-to-build-a-twitter-sentiments-analyzer-in-python-using-textblob-948e1e8aae14)
3. GeeksforGeeks
(https://www.geeksforgeeks.org/twitter-sentiment-analysis-using-python/)
4. Siraj Raval YouTube Classes
(https://www.youtube.com/channel/UCWN3xxRkmTPmbKwht9FuE5A)
5. GitHub
(https://github.com)
