
CITIZEN SOCIAL MEDIA SENTIMENT ANALYSIS: BUILDING A
MODEL TO MEASURE OPINIONS OF CITIZENS ON UK'S
PLANNED HIGH SPEED 2 RAILWAY LINE
by
FAREED IDDRISU IBRAHIM
(w144913642)
Supervised by
PHILIP WORRALL
Submitted in partial fulfilment of the requirements of
the Dept of Business Information Systems
of the University of Westminster
for award of the Master of Science
SEPTEMBER 2014


II W144913642


DECLARATION
I, (FAREED IDDRISU IBRAHIM) declare
that I am the sole author of this Project; that all references cited have been consulted;
that I have conducted all work of which this is a record, and that the finished work
lies within the prescribed word limits.
This has not previously been accepted as part of any other degree submission.
Signed :
Date :




FORM OF CONSENT
I, (FAREED IDDRISU IBRAHIM) hereby
consent that this Project, submitted in partial fulfilment of the requirements for the
award of the MSc degree, if successful, may be made available in paper or electronic
format for inter-library loan or photocopying (subject to the law of copyright), and that
the title and abstract may be made available to outside organisations.
Signed :
Date :





ABSTRACT

Sentiment Analysis (SA) has been widely used as a text mining tool to determine the sentiment
polarity of a given corpus. In this research a comprehensive study of sentiment analysis is
undertaken with a view to applying it to the emerging field of government and citizen
interaction via the social media platform Twitter, as a case study.
The case study uses the UK government's proposal for a planned high speed rail line, which
has become a subject of public debate in the UK. The study therefore analyses the sentiments
or opinions of a section of the citizens who are expressing their views on Twitter. SA is then
applied to the collected data with the aim of determining the general polarity of users' views
about the project.
The study details the collection of the relevant data and the analyses applied to it: term
frequency, Latent Dirichlet Allocation (LDA) and a pre-built naive Bayes classifier used to
classify the sentiment into three main polarities: positive, negative and neutral.
Interesting results are presented and a case is made for the application of SA by governments
as a way of finding out the sentiment of their citizenry.














ACKNOWLEDGEMENT

All praise is due to God Almighty for giving me the opportunity to undertake a dissertation of
this kind and for keeping me alive and healthy during this period.
A special thank you to my parents, Mr and Mrs Iddrisu, as well as my siblings Asmau, Raqiba
and Hajira, for their love and support, for deciding to fund my master's course solely from
their own finances, and for placing so much trust and confidence in my ability to undertake
this academic endeavour.
I would not have been in a position to write such a dissertation without the training I
received from the various lecturers at the University of Westminster during the course of study.
I would therefore like to acknowledge the contribution of all the lecturers and staff of the
university who in one way or another contributed to my successful stay at the university.
My decision to undertake this particular topic was due to the interesting approach my lecturer
and supervisor Philip Worrall took during his course Web and Social Media Analytics. I would
also like to acknowledge the effort Philip put into guiding my work and ensuring that I
produce a good dissertation.
Finally, to all who have in one way or another provided assistance to me during my course at
the university, I would like to thank you all.















Table of Contents
DECLARATION
FORM OF CONSENT
ABSTRACT
ACKNOWLEDGEMENT
LIST OF FIGURES AND TABLES
LIST OF ABBREVIATIONS
CHAPTER ONE
1. INTRODUCTION
1.1 BACKGROUND
1.1.1 GOVERNMENT CITIZEN RELATIONSHIP
1.1.2 SOCIAL MEDIA DATA FOR ANALYSIS
1.1.3 SOCIAL NETWORKS
1.1.4 SOCIAL MEDIA VS SURVEY
1.1.5 TWITTER SOCIAL NETWORK AND DATA
1.1.6 SENTIMENT ANALYSIS /OPINION MINING
1.1.7 BUSINESS INTELLIGENCE
1.1.8 HIGH SPEED RAIL
1.1.9 PROJECT SCOPE AND OBJECTIVES
1.1.10 JUSTIFICATION AND CONTRIBUTION
CHAPTER TWO
2. LITERATURE REVIEW
2.1 GENERAL SENTIMENT ANALYSIS
2.2 THE OBJECTIVE/TASK OF SENTIMENT ANALYSIS
2.3 THE CHALLENGE OF SENTIMENT ANALYSIS
2.4 METHODOLOGIES USED IN SENTIMENT ANALYSIS
2.5 IDENTIFYING THE SEMANTIC ORIENTATION OF WORDS
2.5.1 THE LEXICONS APPROACH
2.5.2 USING TRAINING DOCUMENTS
2.5.3 IDENTIFYING SEMANTIC ORIENTATION OF SENTENCES AND PHRASES
2.5.4 IDENTIFYING THE SEMANTIC ORIENTATION OF DOCUMENTS
2.5.5 OBJECT FEATURE EXTRACTION
2.5.6 COMPARATIVE SENTENCE IDENTIFICATION
2.6 SENTIMENT ANALYSIS USING TWITTER DATA
2.7 GOVERNMENT CITIZEN SENTIMENT ANALYSIS
2.8 AN OVERVIEW OF DATA MINING (STRUCTURED) AND TEXT MINING (UNSTRUCTURED DATA)
2.9 SENTIMENT ANALYSIS AND MODELLING TECHNIQUES
2.10 OVERVIEW AND WAY FORWARD FROM THE LITERATURE REVIEW
CHAPTER THREE
3 PROBLEM SPECIFICATIONS
3.1 METHODOLOGY
3.2 METHODOLOGICAL JUSTIFICATION
3.3 SOFTWARE USE JUSTIFICATION
CHAPTER FOUR
4.0 PROJECT IMPLEMENTATION
4.1 UNDERTAKING SENTIMENT ANALYSIS
4.1.1 TERM FREQUENCY ANALYSIS
4.1.2 LATENT DIRICHLET ALLOCATION TOPIC MODELLING
4.1.3 SENTIMENT ANALYSIS SCORE
4.2 CHALLENGES AND ADJUSTMENTS
CHAPTER FIVE
5. RESULTS AND ANALYSIS
5.1 TERM FREQUENCY RESULTS AND ANALYSIS
5.2 LDA TOPIC MODELLING RESULTS AND ANALYSIS
5.2.1 MODEL EVALUATION
5.3 SUMMARY OF FINDINGS
5.4 FUTURE WORK
CHAPTER SIX
6 CONCLUSION
BIBLIOGRAPHY
APPENDIX






LIST OF FIGURES AND TABLES

FIGURE 1.1 WORD CLOUD FOR SENTIMENT ANALYSIS
TABLE 1.0 TYPES OF SOCIAL MEDIA
FIGURE 1.2 DEPICTING THE INTERCONNECTIVITY OF TWITTER
FIGURE 1.3 A SCREEN SHOT OF SENTIMENT RELATING TO HS2 FROM TWITTER
FIGURE 1.4 MAP OF PROPOSED ROUTE FOR HS2 (BBC)
FIGURE 2.1 AN EXAMPLE OF A SYSTEM ARCHITECTURE FOR SENTIMENT ANALYSER
FIGURE 3.1 METHODOLOGICAL STEPS
FIGURE 4 PROCESS DIAGRAM SHOWING THE STAGES OF ANALYSIS
FIGURE 4.1 LDA GRAPHICAL REPRESENTATION
FIGURE 5.1 DISPLAY OF TERMS FOR X = 100
TABLE 5.1 TERMS AND THEIR FREQUENT ASSOCIATED TERMS
FIGURE 5.2 LDA TOPIC MODELLING RESULTS
FIGURE 5.2 PERPLEXITY VALUE FOR LDA TOPIC MODELLING
FIGURE 5.3 SCATTER PLOT FOR LDA TOPIC MODELLING
FIGURE 5.3 RESULTS OF SENTIMENT ANALYSIS
FIGURE 5.4 A PLOT OF SENTIMENT SCORE


















LIST OF ABBREVIATIONS

API : Application Programming Interface
CSV : Comma Separated Values
HS2 : High Speed Rail Network Phase 2
IBM : International Business Machines
LDA : Latent Dirichlet Allocation
NLP : Natural Language Processing
NLTK : Natural Language Toolkit
POS : Part of Speech Tagging
SA : Sentiment Analysis
UNPACS : United Nations Public Administration Country Studies
UK : United Kingdom



CHAPTER ONE

1. INTRODUCTION

Citizen social media sentiment analysis is a relatively new dimension of the broader field of
social media sentiment analysis. It involves the engagement of governments, public institutions
and the citizenry using social media as the common platform. Governments across the world are
increasingly facing the challenge of serving the interest of their citizens rather than their
own interest. Citizens across the world are demanding a greater say in the governance process
now than ever before.
A 2012 UNPACS report states that citizens are increasingly getting involved in the governance
processes of their communities and countries. This engagement implies the involvement of
citizens in the decision-making process of the state through measures and institutional
arrangements, so as to increase their influence on public policies and programming and ensure
a more positive impact on their social and economic lives. (UN, 2012)
The effects of citizen marginalization were manifested in the events of the now infamous Arab
Spring in 2010, which saw popular uprisings by citizens of some Arab countries against their
governments. Governments of these countries were toppled by mass protests and demonstrations
of their own citizens amidst the loss of lives and property. It is widely reported that these
uprisings, protests and demonstrations were coordinated using social media such as Facebook
and Twitter. (Arunachalam & Sarkar, 2013) These events seem to have redefined the role of
citizens in modern day governance.
New disruptive technologies such as mobile applications, cloud computing and social media have
emerged as tools and conduits that governments and citizens can use as an effective medium of
communication. Through them the two sides can forge a closer relationship in which the
opinions of citizens are taken into account and those of the government are explained to the
citizens. Recent trends have seen social media transform itself into a rich repository of data
that can be analysed using data mining and analytic techniques to gain insights into the
trends the data contains.
The goal of any data mining process is to support critical decision making using a scientific
approach rather than intuition. This gives credence to the application of data/text mining in
analysing the opinions of citizens, which can help governments make better informed choices
based on the points of view of their citizens.
Sentiment analysis or opinion mining, a subset of data mining, can provide the vehicle by
which thousands of opinions can be analysed. This non-trivial technology has been successfully
used as a business intelligence tool by many businesses in areas such as customer relationship
management, targeted marketing, political campaigns, mass movements, disaster and crisis
response, news reporting etc. (Gundecha & Liu, 2012) The success of this technology is what
has informed its application to the area of governments' relationships with their citizens,
using a similar approach to support better decision making by governments.
This research therefore seeks to explore the application of sentiment analysis to the relatively
new dimension of government sentiment analysis. The research work involves finding how
governments can put in place a system to mine data from social media. The focus of the
research will be to build a model that can be used to analyse sentiments of citizens concerning
government policies, programs and projects.
In order to demonstrate the importance of the topic, a case study is undertaken using the
proposed construction of a high speed railway line (HS2) in the UK. HS2 is a planned modern
high speed rail network that seeks to link up major cities of the UK from London through
Birmingham to Manchester and Leeds, with a possible expansion to Scotland. The proposed
project has, however, generated a lot of controversy in British society about its
implementation. Various sentiments expressed in this debate will be used to build a model for
the sentiment analysis. The goal is to classify these sentiments as positive, negative or
neutral based on data from the social media site Twitter. From the classification, it should
be possible to measure the sentiments/opinions of UK citizens concerning the implementation of
the HS2 project.
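
To make the classification goal concrete, the following is a minimal sketch in Python of labelling short texts as positive, negative or neutral and then measuring the share of each label. It is an illustration only and not the model built in this study: the word lists, the function names `classify` and `polarity_share`, and the example tweets are all invented for this sketch, and a simple word-count rule stands in for a trained classifier.

```python
from collections import Counter

# Hypothetical, hand-picked word lists for illustration only; a real study
# would use a published sentiment lexicon or a trained classifier.
POSITIVE = {"good", "great", "support", "benefit", "faster", "jobs"}
NEGATIVE = {"bad", "waste", "oppose", "destroy", "expensive", "stop"}


def classify(text):
    """Label a text 'positive', 'negative' or 'neutral' by counting lexicon hits."""
    words = [w.strip(".,!?#@:;\"'").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def polarity_share(texts):
    """Return the fraction of texts that fall into each polarity class."""
    labels = ("positive", "negative", "neutral")
    if not texts:
        return {label: 0.0 for label in labels}
    counts = Counter(classify(t) for t in texts)
    return {label: counts[label] / len(texts) for label in labels}


# Invented example tweets, not data collected for this study.
tweets = [
    "HS2 will bring jobs and faster journeys, great news",
    "HS2 is a waste of money and will destroy our village",
    "HS2 phase two route announced today",
    "Stop HS2! Too expensive",
]
print(polarity_share(tweets))
```

The aggregate shares, rather than any single label, are what give a measure of overall opinion; the quality of that measure depends entirely on how good the underlying classifier is.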

Figure 1.1 Word cloud for sentiment analysis








1.1 BACKGROUND

1.1.1 GOVERNMENT CITIZEN RELATIONSHIP

Government can be defined as the political system by which a country is administered and
regulated. (Encyclopedia Britannica, 2014) Governments therefore play a vital role in the lives
of citizens by ensuring that their welfare is cared for. It is therefore imperative that
governments build a relationship with their citizens in which the sentiment of citizens can be
taken into account when taking decisions that affect their lives. (Schellong, 2008) According
to Schellong (2008), sentiments offer policy makers information to:
Understand and establish public needs,
Develop, communicate and distribute public services,
Assess the degree of public service satisfaction.


1.1.2 SOCIAL MEDIA DATA FOR ANALYSIS

Social media can be defined as "a group of Internet-based applications that build on the
ideological and technological foundations of Web 2.0 and that allow the creation and exchange
of user-generated content." (Kaplan & Haenlein, 2010). Social media involves interaction
amongst people in which they create, share or exchange information and ideas in virtual
communities and networks. Social media has gained worldwide acclaim and popularity and is
transforming the way people communicate, form relationships, connect to each other, live and
work. (Arunachalam & Sarkar, 2013)
According to Search Engine Watch (2014), the following statistics give an idea of the fast
pace at which the social media landscape is growing: almost a billion people are signed up to
at least one social media type; 1.43 billion people worldwide visited a social networking site
in 2012; nearly 1 in 8 people worldwide have their own Facebook page; 3 million new blogs come
online every month; and 65% of social media users said they use it to learn more about brands,
products and services. This explosive growth can be attributed to the ubiquity of the internet
and of communication devices such as computers and smartphones.



The term social media is used to distinguish these platforms from traditional forms of media
such as TV, radio and the press. Traditional media follow a unidirectional delivery paradigm
from business to consumer: the information is produced by media sources or advertisers and
transmitted to media consumers. In contrast, Web 2.0 technologies resemble consumer-to-consumer
services. They allow users to interact and collaborate with each other in a social media
dialogue of user-generated content in a virtual community.
Social media is a generic term that encompasses the various platforms that engage people
socially on the internet. As Table 1.0 shows, the various social media sites offer different
types of service and thus create different formats of data, mainly text, image and video. For
example Twitter, Facebook and YouTube provide text, image and video services respectively, and
though there may be overlap, each is better specialised for its particular service. (Hu & Liu,
2012) Our focus in this research, however, is on text data.


Table 1.0 Types of Social Media

CATEGORY               REPRESENTATIVE WEBSITE
Blogging               Blogger, LiveJournal, WordPress, Huffington Post
Wiki                   Wikipedia, Wikihow, WikiTravel, Scholarpedia
Social News            Digg, Mixx, Slashdot, Reddit
Micro blogging         Twitter, Google Buzz, Tumblr, Jaiku, Plurk
Opinion & Reviews      Epinions, Yelp
Question Answering     Yahoo! Answers, Quora
Academic Networking    ResearchGate, Academia.edu, Slideshare
Media Sharing          Flickr, YouTube, Vimeo
Social Bookmarking     Delicious, CiteULike
Social Networking      Facebook, LinkedIn, Myspace, Google+

1.1.3 SOCIAL NETWORKS
One of the most popular subsets of social media is the online social network. Online social
networks are defined as a network of interactions or relationships, where the nodes consist of
actors, and the edges consist of the relationship or interaction between these actors.
(Aggarwal, 2011).
If we compare the definitions of social media and social network, we see that they vary to
some degree. This indicates a difference between the two terms, though they are often used
interchangeably. Examples of popular social networks include Facebook, Twitter, LinkedIn,
Google+ etc.
An Aberdeen Group benchmark report shows that more than 84% of best-in-class companies
improved their overall performance, customer satisfaction, risk management and actionable
insights through social media monitoring and analysis. (Zabin & Jefferies, 2008)
The way corporate businesses are involving social media in their operations provides a pathway
for governments to follow in engaging the people they serve. A novel aspect of the use of
social networks is the ability to analyse discussions and posts for the purpose of gaining
valuable insights. These insights can then be used for decision making. Around 2007,
researchers and analysts started to take notice of the importance and value of social media
monitoring and sentiment analysis. (Arunachalam & Sarkar, 2013)
Analysing this data is a non-trivial process that involves using software to collect thousands
of sentiments from social media platforms. After collection, various text and data mining
algorithms are used to analyse the collected data.

1.1.4 SOCIAL MEDIA VS SURVEY
Compared to traditional survey polls, running an analysis on social media is attractive for a
number of reasons. First, social media analysis is cheaper and faster than traditional
surveys, and enables continuous monitoring of public opinion by performing real-time analysis.
(Xin, Gallagher, Cao, Luo, & Han, 2010) Offline surveys, by contrast, are by definition more
static. With social media we are therefore able to capture the reaction of public opinion in
almost real time. Analysing social media also allows us to observe trends and breaking points.
(Ceron, Curini, Mlacos, & Porro, 2013)
In addition, traditional surveys pose solicited questions, and it is well known that this
approach might inflate the share of strategic answers. Conversely, sentiment analysis does not
use questionnaires and focuses only on listening to the stream of unsolicited opinions freely
expressed on the Internet. In other words, sentiment analysis adopts a bottom-up approach, at
least when compared with the more traditional top-down approach of offline surveys. (Ceron,
Curini, Mlacos, & Porro, 2013) Without claiming that all of the comments posted on social
networks contain the sincere opinion of the author, we can argue that the Internet may
represent, to a large extent, an arena in which users are free to express themselves.
(Savigny, 2002) Thus, social networks should be less affected by the spiral of silence.
Moreover, while web analysis must contend with the problem of silent users, surveys face the
problem of low response rates. (Ceron, Curini, Mlacos, & Porro, 2013)





1.1.5 TWITTER SOCIAL NETWORK AND DATA
Twitter is a micro blogging service that allows communication with short 140 character
messages, which roughly correspond to thoughts or ideas. Twitter is akin to a free, high-speed,
global text-messaging service, which enables rapid and easy communication. What differentiates
Twitter is its asymmetric following model, which satisfies human curiosity. It is the
asymmetric following model that casts Twitter as more of an interest graph than a social
network, and the application programming interfaces (APIs) that provide just enough of a
framework for structure and self-organising behaviour to emerge from the chaos. What this
means is that whereas some social websites like Facebook and LinkedIn require the mutual
acceptance of a connection between users, Twitter's relationship model allows you to keep up
with the latest happenings of any other user, even though that other user may not choose to
follow you back or even know you exist. Because Twitter enables you to create, connect, and
explore a community of interest for an arbitrary topic, the power of Twitter and the insights
you can gain from mining its data become much more obvious. (Russell, 2013)
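
The asymmetric following model described above can be pictured as a directed graph, in which an edge from one user to another means "follows" and need not be reciprocated. The sketch below is a hedged illustration in Python; the class `FollowGraph`, its method names and the account names are hypothetical and do not correspond to any Twitter API.

```python
from collections import defaultdict


class FollowGraph:
    """Directed graph in which an edge u -> v means 'u follows v'."""

    def __init__(self):
        self._following = defaultdict(set)  # user -> accounts the user follows
        self._followers = defaultdict(set)  # user -> accounts following the user

    def follow(self, user, target):
        """Record a one-way follow; the target does not have to accept."""
        self._following[user].add(target)
        self._followers[target].add(user)

    def following(self, user):
        return set(self._following.get(user, set()))

    def followers(self, user):
        return set(self._followers.get(user, set()))

    def is_mutual(self, a, b):
        """True only when both users follow each other."""
        return b in self._following.get(a, set()) and a in self._following.get(b, set())


# Hypothetical accounts for illustration.
g = FollowGraph()
g.follow("alice", "transport_news")  # one-way: an interest, not a friendship
g.follow("bob", "transport_news")
g.follow("alice", "bob")
g.follow("bob", "alice")             # reciprocated, as a mutual-acceptance network would require

print(g.followers("transport_news"))
print(g.is_mutual("alice", "transport_news"))
print(g.is_mutual("alice", "bob"))
```

A network that requires mutual acceptance would only ever contain edges for which `is_mutual` is true; allowing one-way edges is what lets the graph express interests as well as acquaintances.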
In June 2012 Twitter reported 340 million tweets from 140 million active users (Twitter, 2012),
and in 2013 more than 400 million tweets per day were reported. (Wickre, 2013) Inherent in
these data are discussions about the like or dislike of products and services, the breaking
and updating of news, and the use of Twitter as a public relations medium by some businesses,
politicians and celebrities. (Zhao, et al.) Twitter can therefore be considered a rich source
of social data due to its inherent openness for public consumption as well as the ease of
access to the data using APIs.
This has led to very high interest among researchers. Research work done includes the
topological characteristics of Twitter (Kwak, Lee, Park, & Moon, 2009), tweets as social
sensors of real-time events (Sakaki, Okazaki, & Matsuo, 2010), and the forecasting of
box-office revenues for movies (Asur & Huberman, 2010).
In this project Twitter is the primary social network from which data is sourced for the case
study, for the reasons explained above.




Figure 1.2 Depicting the interconnectivity of Twitter


1.1.6 SENTIMENT ANALYSIS /OPINION MINING

Sentiment analysis or opinion mining is defined as the computational study of people's
opinions, appraisals, attitudes, and emotions toward entities, individuals, issues, events,
topics and their attributes. (Liu & Zhang, 2012). It is a field of machine learning that
employs computational power and well-designed software such as the Natural Language Toolkit
(NLTK) to process large amounts of text data with a view to analysing the sentiments or
opinions expressed in these text corpora.
Humans have always made decisions based on the sentiments of others. For example, a
prospective student applying to a university would base his/her choice on the positive
sentiments about the university, and the choice not to buy a product might be borne out of
negative reviews the product has generated. These and many other similar situations illustrate
the importance of sentiment-based decision making. (Liu & Zhang, 2012)
However, decisions that are irreversible once made require not just a few sentiments but
hundreds or thousands of varying opinions in order to make the best decision, and this is the
situation governments face daily in the governance process. Analysing thousands of sentiments
is beyond human capability and requires computational processes. With computational resources
now easily available and relatively cheap, the application of sentiment analysis is more
practicable and easier to undertake now than ever before. (Pang & Lee, 2008)
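
As a rough illustration of the machine-learning approach to sentiment analysis, the sketch below trains a very small multinomial naive Bayes classifier with add-one smoothing in plain Python. It is not the pre-built classifier used in this study and does not use NLTK; the class `NaiveBayes`, the helper `tokenize` and the six training sentences are invented for this example, and a usable model would need thousands of labelled texts.

```python
import math
from collections import Counter, defaultdict

PUNCT = ".,!?#@:;\"'"


def tokenize(text):
    """Lower-case a text and split it into words, dropping surrounding punctuation."""
    return [w.strip(PUNCT).lower() for w in text.split() if w.strip(PUNCT)]


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # number of documents per class
        self.word_counts = defaultdict(Counter)   # class -> word -> count
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior of the class
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # words never seen in training carry no evidence
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data, two examples per class.
train_texts = [
    "great project more jobs", "support faster trains",
    "waste of money", "stop destroying villages",
    "route announced today", "consultation opens monday",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("what a waste"))
print(model.predict("more jobs"))
```

Working in log probabilities avoids numerical underflow when many word probabilities are multiplied, and the smoothing term keeps a word that is absent from one class from forcing that class's probability to zero.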



Sentiment analysis is a prominent and active area of research, spurred particularly by the
rapid growth of web social media and the opportunity to access the valuable opinions of
numerous participants on various business and social issues. (Ghiassi, Skinner, & Zimbra, 2013)
The field of sentiment analysis forms part of the wider discipline of business intelligence.

Figure 1.3 A screenshot of sentiment relating to HS2 from Twitter



1.1.7 BUSINESS INTELLIGENCE

Since 2004 business intelligence has been a top three key information systems (IS)
management issue and application development area. (Luftman & Mclean, 2012). The term
business intelligence (BI) was first used by IBM researcher Hans Peter Luhn (Azevedo & Santos,
2012), who drew on the Webster dictionary for his definition. Since then there have been a
number of definitions of BI.
According to Golfarelli et al., BI is the process by which businesses transform relatively
meaningless data into useful, actionable information and then into knowledge. (Golfarelli,
Dario, & Rizzi, 2004) This knowledge can be used to guide the business in its day-to-day
activities, as well as serving as a basis on which strategic planning and decision-making
processes can be efficiently and effectively carried out. Lonnqvist & Pirttimaki (2006) also
define it as an organised and systematic process by which organisations acquire, analyse, and
disseminate information from both internal and external information sources significant for
their business activities and for decision making.
BI, as an integral part of decision support systems (DSS) (Azevedo & Santos, 2012), has
attracted a great deal of interest from both industry and research (Arnott & Pervan, 2005)
because of the critical role it can play in helping organisations and businesses derive value
out of their data.
BI is a broad term or concept, and there are also many other similar and partially overlapping
terms such as Competitive Intelligence, Customer Intelligence, Market Intelligence, and
Strategic Intelligence (Lonnqvist & Pirttimaki, 2006).



1.1.8 HIGH SPEED RAIL

High speed rail is the world standard for long distance inter-city rail travel. The standard
speed for high speed rail today is 250 km/h. High speed rail operates in about 13 countries,
including Japan, China, Italy, France, Germany and the UK. The UK currently has an existing
high speed line (HS1) connecting St. Pancras International station in London with Kent, and a
second channel connects Ebbsfleet station with St. Pancras. The full HS1 service was connected
in December 2009.
Like many other countries, Britain is investing in high speed rail to create space on
overcrowded networks and enable large numbers of people to move efficiently. This would be the
biggest transport project undertaken for a generation. It will fundamentally improve rail
infrastructure in Britain, breaking the 21st century railway thinking and practices. (Hs2
story, 2014)
On 28th January 2013 the UK government announced phase two of HS2. (IPSOS, 2013) This is a new
rail line that has been proposed to be built from London to as far as Leeds and beyond. Phase
one is from London to Birmingham, while phase two would continue from the West Midlands to
Manchester and Leeds. (Hs2 story, 2014)
The proposal to undertake the HS2 project by the UK government has been both welcomed and met
with some resistance by the UK public. According to the BBC, the £32 billion project is the
subject of heated debate, from the political class to the everyday citizen (BBC, 2014).
Proponents of the project have hailed it as a move to, among other things, bridge the
north-south economic and developmental gap, provide jobs, and reduce the travel time between
the major cities of the UK, while catering for the increasing commuter population and handling
more freight across the cities (BBC, 2014). A formidable coalition of cities has united to
press for the delivery of the once-in-a-century promise of HS2. The City Council Leaders of
Sheffield, Leeds, York, Newcastle, Nottingham, Liverpool, Derby and Manchester have launched a
new Connected Cities campaign group, speaking with one clear voice to press key national
decision makers to commit to delivering HS2 to the North. Connected Cities is built upon a
strong local consensus amongst key business and political leaders from each city (City of
York, 2014).
Opponents, on the other hand, have cited concerns such as negative environmental impact, waste
of money, and the risk of a white elephant, especially in an age of telecommunications where
people can communicate over long distances. Some inhabitants of villages that the rail would
pass through have also voiced concerns over the project destroying the picturesque character
of their villages (BBC, 2014). The biggest opponent of the project is the campaign group Stop
HS2. The civil society group's mission is:
To stop HS2 by persuading the Government to scrap the HS2 proposal.
To facilitate local and national campaigning against High Speed Two.

Figure 1.4 Map of proposed route for HS2 (BBC)


Even within the political class, though in principle all political parties support the
project, there are different opinions on how it should be implemented and which routes the
line should take (Butcher, 2014; BBC, 2014).
From this research it is evident that citizens across the UK have divided opinions about the
UK government undertaking the project. This makes it a good basis for a case study in this
project.




1.1.9 PROJECT SCOPE AND OBJECTIVES

This research work falls under the academic discipline of business intelligence and analytics.
Business intelligence is a vast field of study of which data mining, text mining and sentiment
analysis are part. This research aims to apply traditional sentiment analysis to the relatively
new area of citizen sentiment analysis. The application of sentiment analysis using Twitter
data as a source is also a young field of research, as most sentiment analysis in the past has
been applied to movie and product reviews and blog data. The objectives of this research
therefore are:
Apply sentiment analysis to the emerging field of citizen sentiment analysis.

Develop a proposed sentiment analysis model to generate analytical insights into the
sentiments of citizens about the HS2 rail project.



1.1.10 JUSTIFICATION AND CONTRIBUTION

In the information age, social media is transforming the way people communicate, connect and
form relationships, and even the way we live and work (Arunachalam & Sarkar, 2013). The
opportunity this provides is the vast amount of textual data that is generated on a daily
basis.

Advances in computational processing have made it easier to apply data and text mining to huge
data sets, so the opportunities to generate meaningful insights from these data sets are
enormous. Given the successes of sentiment analysis in mostly commercial businesses (Pang &
Lee, 2008), it may be possible to replicate such success in public administration and
governance. Social media presents itself as a big data source of citizen voice and opinion,
providing a deep insight into what citizens want (Arunachalam & Sarkar, 2013).

Analysing these data contributes valuable information that can enhance decision making. For
example, in this research, gaining an insight into the various topics that are discussed
around the HS2 project should give an indication of which aspects of the project are most
talked about. Further analysis of the overall sentiment polarity can also give an indication
as to whether citizens are in favour of or against the project.

Making decisions based on proven scientific methodologies is much better than making such
decisions out of intuition, and that is what this research attempts to do by applying these
methodologies to the vast amount of data that is generated daily concerning the HS2 project.

This research work adds to the body of knowledge, and future work can address any shortfall
that this research has not covered. Government and public institutions can also benefit from
the work, especially because it is aimed at how they can interact more with their citizens.

Finally, undertaking this research work will contribute immensely to my own understanding of
the academic discipline of sentiment analysis, and of how to apply it to a real-world problem
such as the sentiments about the HS2 project.



















CHAPTER TWO

2. LITERATURE REVIEW

In this chapter we review the literature connected to the aim of our research. This takes the
form of summarising the parts of each work under review that are relevant to this study. The
review is centred on the discipline of sentiment analysis as a field of study and how it can
be applied to the general objective of government-citizen relationships. Literature on
Twitter-specific sentiment analysis is also reviewed.
Literature was found mainly through the University's online library search and library books
that documented journals relevant to this study. A few sources were taken from the Web of
Science and Google Scholar/Books. The references used are mostly journals, articles,
conference proceedings and reports contained in the ACM Digital Library and ScienceDirect.
Some reading was also done on a few white paper reports, books and research work carried out
in this field.
KEYWORDS USED IN SEARCH: Sentiment Analysis, Social media Analytics, Twitter, unstructured
data

2.1 GENERAL SENTIMENT ANALYSIS

As a field of research, sentiment analysis can be said to be part of computational linguistics,
natural language processing (NLP), and text mining. It is also called opinion mining,
subjectivity analysis and appraisal extraction (Mejova, 2009). Generally speaking, sentiment
analysis aims to uncover an author's view towards a subject or the overall contextual polarity
of a text (Mejova, 2009).
Sentiment analysis has been applied to many corpora, such as news blogs (Bautin, Ward, Patil,
& Skiena, 2010), movie reviews (Pang, Lee, & Vaithyanathan, 2002) and citizen political
preference (Ceron, Curini, Mlacos, & Porro, 2013). Research in this field has largely focused
on two things: identifying whether a given textual entity is subjective (i.e. a sentence that
expresses a personal view) or objective (i.e. a sentence that presents factual information),
and identifying the polarity of subjective text (Pang & Lee, 2008).




2.2 THE OBJECTIVE/TASK OF SENTIMENT ANALYSIS

The objectives of sentiment analysis as described by Liu & Zhang (2012) include the following:
entity extraction and grouping, aspect extraction and grouping, opinion holder and time
extraction, aspect sentiment classification, and opinion quintuple generation. A widely
researched task is sentiment or opinion detection, which is viewed as the classification of
text as objective or subjective. Usually opinion detection is based on the examination of
adjectives in sentences. For example, the polarity of the sentence "this is a nice car" can be
determined easily by looking at the adjective. Hatzivassiloglou & Wiebe (2000) examined the
effects of adjectives in sentiment subjectivity. Later studies (Benamara, Cesarano, Picariello,
Reforgiato, & Subrahmanian, 2007) have shown that adverbs may be used for a similar purpose.
The second task is polarity classification. Given an opinionated piece of text, the goal is to
classify the opinion as belonging to one of two opposing sentiment polarities, or to locate its
position on the continuum between these two polarities (Pang & Lee, 2008).
When viewed as a binary feature, polarity classification is the binary classification task of
labelling an opinionated document as expressing either an overall positive or an overall
negative opinion. Most of this research was done on product reviews, where the definitions of
positive and negative are clear. Other tasks, such as classifying news as "good" or "bad",
present some difficulty. A news article may contain "bad" news without actually using any
subjective terms. Furthermore, these classes usually appear intermixed when a document
expresses both positive and negative sentiments. The task can then be to identify the main
sentiment of the document (Mejova, 2009).
To distinguish between different mixtures of the two opposites, polarity classification can
use a multi-point scale (such as the number of stars for a movie review). The task then becomes
a multi-class text categorization problem. But unlike topic-based multi-class classification
problems, where vocabularies differ for each class (or overlap slightly), the vocabularies for
positive, neutral, and negative classes can be very much alike and differ only in a few crucial
words. Since many documents have a mixed opinion, the neutral class is often in effect a
combination of positive and negative. Negations, which tend to be disregarded in much of text
analysis as unimportant, play an important role in sentiment, flipping an originally positive
term into a negative one, and vice versa. The above two tasks can be done at several levels:
term, phrase, sentence, or document level. It is common to use the output of one level as the
input for the higher layers (Dave, Lawrence, & Pennock, 2003). For instance, we may apply
sentiment analysis to phrases, and then use this information to evaluate sentences, then
paragraphs, etc. Different techniques are suitable for different levels. Techniques using
n-gram classifiers or lexicons usually work on the term level, whereas Part-Of-Speech tagging
is used for phrase and sentence analysis. Heuristics are often used to generalize the
sentiment to the document level (Mejova, 2009).
A third task that is complementary to sentiment identification is the discovery of the
opinion's target. The difficulty of this task depends largely on the domain of the analysis
(Mejova, 2009). It is usually safe to assume that product reviews talk about the specified
product. On the other hand, general writing such as webpages and blogs does not always have a
pre-defined topic, and often mentions many objects. Another lively area of research is feature
extraction, given an object or topic of the text (Liu, Hu, & Cheng, 2005). Liu defines
features as either components or attributes of an object, which is the definition mostly used
in practice (Liu B., 2006).
Sometimes there is more than one target in a sentiment sentence, which is the case in
comparative sentences. A subjective comparative sentence orders objects in order of
preference, for example, "this camera is better than my old one". These sentences can be
identified using comparative adjectives and adverbs (more, less, better, longer), superlative
adjectives (most, least, best) and other words such as same, differ, win, prefer, etc. (Liu
B., 2006).
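
As an illustration of the keyword-based identification described above, the following is a minimal Python sketch that flags candidate comparative sentences. The cue words are the ones listed in the text; the function name and the grouping into three sets are assumptions made for this example, and a practical system would need a much larger cue list and part-of-speech information.

```python
import re

# Cue words taken from the list in the text; the grouping and names are
# illustrative assumptions, not part of any published implementation.
COMPARATIVE_CUES = {"more", "less", "better", "longer"}
SUPERLATIVE_CUES = {"most", "least", "best"}
OTHER_CUES = {"same", "differ", "win", "prefer"}


def find_comparative_cues(sentence):
    """Return the cue words found in a sentence, in order of appearance."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    cues = COMPARATIVE_CUES | SUPERLATIVE_CUES | OTHER_CUES
    return [tok for tok in tokens if tok in cues]


print(find_comparative_cues("This camera is better than my old one"))
print(find_comparative_cues("The train arrived at noon"))
```

A sentence with at least one cue is only a candidate: a word such as "more" also appears in many non-comparative sentences, so a further filtering step would be required.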


2.3 THE CHALLENGE OF SENTIMENT ANALYSIS

Research indicates that sentiment analysis presents a much more complex challenge than
traditional topic modelling (Pang, Lee, & Vaithyanathan, 2002). This is despite the fact that
sentiment analysis classifies text into three main classes, while topic modelling involves an
arbitrary number of topics (Pang & Lee, 2008).
Sentiment classification classifies an opinion document, e.g. a product review, as expressing
a positive, negative or neutral sentiment. The task is also commonly known as document-level
sentiment classification because it considers the whole document as the basic information
unit (Liu & Zhang, 2012).
The main reason why sentiment analysis is more difficult than topic-based classification is
that topic-based classification can be done with the use of keywords, while this does not work
well in sentiment analysis (Turney, 2002).
Other reasons that make sentiment analysis difficult include the difficulty of determining
whether a given text is objective or subjective (there is always a thin line between the two)
and the difficulty of determining the opinion holder. Sentiment can also be expressed in
subtle ways without any ostensible use of negative words. For example, "how could anyone sit
through this movie?" contains no single word that is obviously negative, yet it would be
classified as a negative review of a movie. Thus sentiment requires more understanding than
the usual topic-based classification. Other factors include dependency on domain and on other
words (Pang & Lee, 2008), and opinions expressed with sarcasm, irony, and negation.

2.4 METHODOLOGIES USED IN SENTIMENT ANALYSIS

A wide range of tools and techniques can be employed to tackle the goals described in the
previous section. This section therefore describes some of the most common and widely used
ones.

Classification: Many of the tasks in Sentiment Analysis can be thought of as classification
(Mejova, 2009). Machine Learning offers many algorithms designed for this purpose, but the
task of classifying text according to its sentiment presents many unique challenges. These
can be formulated in one question: what kinds of features do we use?

Term Frequency or Presence: Traditional Information Retrieval systems have long emphasized
the importance of term frequency. The TF-IDF (Term Frequency - Inverse Document Frequency)
measure is widely used in modelling documents (Jones, 1972). TF-IDF is a measure of how
concentrated into relatively few documents the occurrences of a given word are (Rajaraman &
Ullman, 2011). The intuition is that terms that appear often in the document but seldom in
the whole collection are more informative as to what the document is about than terms
mentioned just once (Mejova, 2009). TF-IDF has been shown to be quite effective in sentiment
classification (Liu & Zhang, 2012). In the field of Sentiment Analysis, however, we find that
instead of paying attention to the most frequent terms, it is more beneficial to seek out the
most unique ones. Pang et al. improved the performance of their system using term presence
instead of frequency (Pang, Lee, & Vaithyanathan, 2002). Wiebe and Hoffmann state in their
paper that "apparently people are creative when they are being opinionated", implying the
importance of low-frequency terms in opinionated texts (Wiebe & Hoffmann, 2005).
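
To make the difference between term frequency weighting and term presence concrete, the sketch below computes both for a toy corpus. It assumes one common TF-IDF variant (relative term frequency multiplied by the natural log of N/df); other variants exist, and the corpus, function names and whitespace tokenisation are all illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus invented for illustration only.
docs = [
    "the train is fast and the train is clean",
    "the ticket is expensive",
    "the station is clean",
]
tokenised = [d.split() for d in docs]


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF using relative term frequency and idf = ln(N / df)."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf


def presence_vector(doc_tokens, vocabulary):
    """Binary features: 1 if the term occurs at all, regardless of count."""
    return [1 if term in doc_tokens else 0 for term in vocabulary]


first = tokenised[0]
for term in ("the", "train", "clean"):
    print(term, round(tf_idf(term, first, tokenised), 3))
print(presence_vector(first, ["train", "ticket", "clean"]))
```

In this toy corpus "the" occurs in every document, so its weight is zero, while "train" is concentrated in a single document and receives the highest weight. The presence vector ignores the fact that "train" occurs twice, which is the representation Pang et al. found more useful for sentiment.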

n-grams: Term positions are also important in document representation for Sentiment
Analysis. The position of terms determines, and sometimes reverses, the polarity of the
phrase, so position information is sometimes encoded into the feature vector (Pang, Lee, &
Vaithyanathan, 2002). Wiebe and Hoffmann select n-grams (n = 1, 2, 3, 4) based on precision
calculated using annotated documents (Wiebe & Hoffmann, 2005). Their n-grams are sequences of
(word-stem, part-of-speech) pairs; for instance, (in-prep the-det can-noun) is a 3-gram.
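
The n-gram construction can be sketched in a few lines of Python. The helper below is a generic sliding-window extractor, not the selection procedure of Wiebe and Hoffmann; the function name and the example inputs are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Plain word n-grams.
words = "i do not like this car".split()
print(ngrams(words, 2))

# The same helper works on (word-stem, part-of-speech) pairs, as in the
# 3-gram example from the text.
tagged = [("in", "prep"), ("the", "det"), ("can", "noun")]
print(ngrams(tagged, 3))
```

Bigrams such as ("not", "like") keep a negation next to the word it modifies, which a unigram representation would lose.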





Part-of-Speech: Adjectives are a good indicator of sentiment in text, and in the past decade
they have been commonly exploited in Sentiment Analysis (Whitelaw, Garg, & Argamon, 2005).
This is true for other fields in textual analysis, since part-of-speech tags can be
considered to be a crude form of word sense disambiguation (Wilks & Stevenson, 1998). In his
work, Turney used part-of-speech patterns that include an adjective, and went further to use
adverbs as well, for sentiment detection at the document level (Turney, 2002). Syntax
information has also been used in feature sets, though there is still discussion about the
advantages of this information in sentiment classification (Pang & Lee, 2008). This
information, however, may include important text features such as negation, intensifiers,
and diminishers (Kennedy & Inkpen, 2006). A subtree-based boosting algorithm with dependency
tree-based features has also been used for polarity classification and shown to outperform
the bag-of-words baseline.

Negations: Negations have long been known to be integral to Sentiment Analysis. The usual
bag-of-words representation of text disconnects all of the words, and considers sentences
like "I like this car" and "I don't like this car" very similar, since only one word
distinguishes one from the other. But when talking about sentiment, a negation changes the
polarity of a whole phrase. Negations are often considered in post-processing of results,
while the original representation of text ignores them (Hu & Liu, 2005). Alternatively, one
could explicitly include the negation in the document representation by appending it to the
terms that are close to negations; for example, the term "like-NOT" would be extracted from
"I don't like this book" (Pang & Lee, 2008). Using co-location alone may be too crude a
technique, though: it would be incorrect to negate the sentiment in a sentence such as "No
wonder everyone loves this car". To handle such cases, Na, Sui, Khoo, Chan, & Zhou (2004) use
specific part-of-speech tag patterns to identify the negations relevant to the sentiment
polarity of a phrase.
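
The idea of folding negation into the document representation can be sketched as follows. This is a simplified illustration, not the exact procedure of any cited work: it assumes that every token after a negation word, up to the next punctuation mark, receives a "-NOT" suffix. The negation list, the scope rule and the function name are all assumptions.

```python
import re

# Small illustrative negation list (an assumption, not exhaustive).
NEGATIONS = {"not", "no", "never", "don't", "doesn't", "didn't", "isn't"}
PUNCTUATION = set(".,!?;")


def mark_negation(text):
    """Append '-NOT' to tokens between a negation word and the next punctuation."""
    tokens = re.findall(r"[a-z']+|[.,!?;]", text.lower())
    marked, negated = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negated = False          # punctuation closes the negation scope
            marked.append(tok)
        elif tok in NEGATIONS:
            negated = True           # open a negation scope
            marked.append(tok)
        else:
            marked.append(tok + "-NOT" if negated else tok)
    return marked


print(mark_negation("I don't like this book."))
print(mark_negation("No wonder everyone loves this car."))
```

The second call shows the weakness noted in the text: "loves" is tagged as negated although the sentence is positive, which is why approaches based on part-of-speech patterns have been proposed.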


2.5 IDENTIFYING THE SEMANTIC ORIENTATION OF WORDS

According to Mejova (2009), one of the most basic tasks in Sentiment Analysis is identifying
the semantic orientation (the polarity and objectivity) of a word. A variety of techniques
have been used, which can be roughly categorized as follows:
using a lexicon, constructed manually or automatically.
using statistical techniques, such as looking at the co-occurrence of a word with a word of
known polarity.
using training documents, labelled or unlabelled, as a source of knowledge about the polarity
of terms within the collection.
using a hybrid approach.

Each of the techniques cited above has its advantages and difficulties, which will be
reviewed here.

2.5.1 THE LEXICONS APPROACH

Extended lexicons are a fundamental part of Sentiment Analysis, but not all of them are
alike. The simplest ones have a binary classification of words into positive vs. negative
polarities or objective vs. subjective. A finer distinction between the classes can be made
with fuzzy lexicons, where each label has a score associated with it conveying the strength
of the label (Mejova, 2009).
A variety of lexicons have been created for use in Sentiment Analysis, often by extending
existing general-purpose lexicons. For example, Subasic and Huettner (2001) manually
constructed a lexicon associating words with affect categories, specifying an intensity
(strength of affect level) and centrality (degree of relatedness to the category). Besides
manual annotation, other resources can be used to build lexicons. Existing lexicons can be
augmented to include sentiment information. Princeton University's WordNet lexicon has been
one of the most popular to be used for Sentiment Analysis. As described on
http://wordnet.princeton.edu/, WordNet is a large lexical database of English (Mejova, 2009).
Taboada et al. used a semantic orientation calculator (SO-CAL). It uses dictionaries of words
annotated with their semantic orientation (polarity and strength), and incorporates
intensification and negation. SO-CAL is applied to the polarity classification task, the
process of assigning a positive or negative label to a text that captures the text's opinion
towards its main subject matter (Taboada, Brooke, Tofiloski, Voll, & Stede, 2011).
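
A lexicon with polarity and strength, combined with intensification and negation, can be illustrated with a small Python sketch. This is not SO-CAL itself: the word scores, the multiplier values, the rule that negation simply flips the sign, and all names are assumptions chosen to keep the example short.

```python
# Toy lexicon: word -> semantic orientation (sign = polarity, size = strength).
LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "terrible": -3.0, "slow": -1.0}
# Intensifiers and diminishers scale the word that follows them.
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}
NEGATIONS = {"not", "never", "no"}


def lexicon_score(text):
    """Sum the adjusted orientation of every lexicon word in the text."""
    tokens = text.lower().split()
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        value = LEXICON[tok]
        j = i - 1
        if j >= 0 and tokens[j] in INTENSIFIERS:
            value *= INTENSIFIERS[tokens[j]]   # e.g. "very good"
            j -= 1
        if j >= 0 and tokens[j] in NEGATIONS:
            value = -value                     # e.g. "not very good"
        total += value
    return total


def polarity_label(text):
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for example in ("the service is very good", "the service is not very good", "not bad"):
    print(example, "->", lexicon_score(example), polarity_label(example))
```

Flipping the sign on negation is the simplest possible choice; it treats "not terrible" as strongly positive, which is one reason more refined lexicon methods handle negation differently.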


2.5.2 USING TRAINING DOCUMENTS

It is possible to perform sentiment classification using statistical analysis and machine
learning tools that take advantage of the vast resources of labelled documents available
(labelled manually by annotators or through a star/point system) (Mejova, 2009). Product
review websites like C-NET, Ebay, RottenTomatoes and the Internet Movie Database (IMDB) have
all been extensively used as sources of annotated data. The star (or tomato, as it were)
system provides an explicit label of the overall polarity of the review, and it is often
taken as a gold standard in algorithm evaluation (Mejova, 2009).
Manually labelled data is available through evaluation efforts such as the Text REtrieval
Conference (TREC), NII Test Collection for IR Systems (NTCIR), and Cross Language Evaluation
Forum (CLEF). The datasets produced often serve as a standard in the Information Retrieval
community, including for Sentiment Analysis researchers. Individual researchers and research
groups have also produced many interesting data sets. An example is the Congressional
floor-debate transcripts published by Thomas and Pang, which contain political speeches that
are labelled to indicate whether the speaker supported or opposed the legislation discussed
(Thomas & Pang, 2006).
Once a desirable data set has been obtained, a variety of machine learning algorithms can be
used to train sentiment classifiers. Some of the most popular algorithms are Support Vector
Machines, Naive Bayes, and maximum entropy-based classifiers (Mejova, 2009).
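
To show what training a classifier on labelled documents involves, the following is a minimal multinomial Naive Bayes sketch written in plain Python. It assumes whitespace tokenisation and add-one (Laplace) smoothing, and the four training sentences are invented for illustration; a real study would use a library implementation and a far larger labelled corpus.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log prior plus log likelihood."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing avoids zero probabilities for unseen pairs.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training examples, for illustration only.
training = [
    ("great fast comfortable journey", "positive"),
    ("love the new line", "positive"),
    ("waste of money", "negative"),
    ("terrible noisy and expensive", "negative"),
]
model = train_naive_bayes(training)
print(classify("fast and comfortable", model))
print(classify("expensive waste", model))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once documents contain more than a handful of words.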



2.5.3 IDENTIFYING SEMANTIC ORIENTATION OF SENTENCES AND
PHRASES

Given the semantic orientation of individual words, it is often desirable to extend this to
the phrase or sentence the word appears in. One of the most straightforward ways to
accomplish this is to take an average of the polarities of words in the sentence. Hu and Liu
write: "if positive/negative opinion prevails, the opinion sentence is regarded as a
positive/negative one". In the case that the number of positive and negative opinion words is
the same, they take the orientation of the closest opinion sentence (Hu & Liu, 2005).
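
The majority rule quoted above can be sketched as follows. The opinion word lists are invented, and the tie-breaking rule is simplified: this sketch assumes that the "closest opinion sentence" is the previous sentence, and it also falls back to the previous label when a sentence has no opinion words at all. Both choices are assumptions for illustration.

```python
# Invented opinion word lists, for illustration only.
POSITIVE_WORDS = {"good", "great", "fast", "love"}
NEGATIVE_WORDS = {"bad", "slow", "expensive", "hate"}


def sentence_orientation(sentence, previous="neutral"):
    """Majority vote over opinion words; ties fall back to the previous label."""
    tokens = sentence.lower().split()
    positives = sum(tok in POSITIVE_WORDS for tok in tokens)
    negatives = sum(tok in NEGATIVE_WORDS for tok in tokens)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return previous


def label_sentences(sentences):
    """Label sentences in order, carrying the last orientation forward on ties."""
    labels, previous = [], "neutral"
    for sentence in sentences:
        previous = sentence_orientation(sentence, previous)
        labels.append(previous)
    return labels


review = [
    "the line is fast and good",
    "it is fast but expensive",
    "tickets are expensive and slow to buy",
]
print(label_sentences(review))
```

The second sentence contains one positive and one negative word, so it inherits the orientation of the sentence before it.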

Another common way is to train a Naive Bayes classifier using sentences and documents
labelled as opinionated or factual as examples of the two categories (Yu & Hatzivassiloglou,
2003). The authors used features including words, bigrams, and trigrams, as well as the parts
of speech in each sentence. They also use the presence of words with known polarities in a
sentence as an indication that the sentence is subjective, and they take into consideration
the effect of negation words such as "no", "not", and "yet" appearing in a window of 5 words
around the word in question. Although simplistic, this heuristic has been shown to work for
most cases. A more sophisticated combination of sentiment labels is possible by taking
advantage of syntactic relationships between words. For example, Popescu and Etzioni use an
unsupervised classification technique, Relaxation Labelling, that extends the label
attributed to the word to the sentence it appears in. This approach takes into account, among
other things, negation modifiers (Popescu & Etzioni, 2005).






2.5.4 IDENTIFYING THE SEMANTIC ORIENTATION OF DOCUMENTS

While most of the research work done is in determining the semantic orientation of words and
phrases, some tasks like summarization and text retrieval may require semantic labelling of
the whole document (Mejova, 2009). It may not make much sense to do this for long documents
such as articles or books, which have been a key form in traditional Information Retrieval.
However, in the age of social networking and internet commerce, there is a vastly increasing
number and variety of short documents, often containing only a few sentences. These may be
product reviews, emails, blog posts, etc. (Mejova, 2009). Much like approaches for
identifying the semantic orientation of words, those for documents range from simple
statistical ones to ones using elaborate knowledge structures to guide the process. One of
the most popular, and simplest, methods is a linear combination of all polarities. Dave et
al. use averaging to determine the polarity of documents (Dave, Lawrence, & Pennock, 2003).


2.5.5 OBJECT FEATURE EXTRACTION

Object feature extraction deals with finding the particular features of an entity that
opinions refer to. It is another important part of sentiment analysis. In shorter, more
focused documents it is often safe to assume that the author is only talking about the topic
of the document. Product reviews, for example, usually contain opinions about that product,
and movie reviews talk about the movies in question. Yet it is often not enough to know the
general topic of the writing (Mejova, 2009). A company making a product would certainly want
to know not only what people think about the product in general, but also which features they
like or dislike in particular. Thus, the task of feature extraction (where a feature can be
any target of an opinionated statement) has been gaining popularity in the field of
Sentiment Analysis.
A common approach is to use part-of-speech (POS) tags to construct templates of how
sentiment is applied to objects (Hu & Liu, 2005).


2.5.6 COMPARATIVE SENTENCE IDENTIFICATION

Another important research area in Sentiment Analysis is the study of comparative sentences.
Liu defines a comparative sentence as a sentence that expresses a relation based on
similarities or differences of more than one object (Liu B., 2006).
These can be classified into types, such as gradable and non-gradable comparisons. A gradable
comparison is based on a relationship of greater than, equal to, or less than. For example,
"Intel chip is faster than the AMD one" ranks objects in quality. In a non-gradable
comparison the features are compared, but not ranked in order of preference: "Coke tastes
differently from Pepsi". Both types of sentences tell us something about the relationships
between different objects. Thus, one of the outputs of a comparative sentence analysis
system could be a ranking of products, as determined by the opinion holders. So far though,
identification of comparative sentences has been the primary focus of the computational
linguistics community (Mejova, 2009).


2.6 SENTIMENT ANALYSIS USING TWITTER DATA

Undertaking sentiment analysis using Twitter data presents its own opportunities and
challenges compared with other data sources such as blogs and articles, and with general
sentiment analysis studies. Twitter is unique in the following ways. Twitter posts are short:
the maximum number of characters allowed is 140. This makes users very efficient in their
participation in social media discussions (Hu & Liu, 2012). However, Twitter messages are
full of misspelled words and slang (Go, Bhayani, & Huang, 2009). Notwithstanding this,
Twitter still presents itself as a good source of data for machine learning and sentiment
analysis, due to the availability of huge amounts of Twitter data for training and testing
and the easy access to that data through APIs (Pak & Paroubek, 2010). Twitter also enables
users to use the "#" symbol, called a hashtag, to mark keywords or topics in a tweet (tag
information) (Hu & Liu, 2012).
A review of some of the work done on Twitter is presented below.
Go et al. carried out one of the earliest works on Twitter data, undertaking sentiment
classification. The authors focused on using emoticons to help construct large corpora of
texts, labelling the tweets according to the emoticons they contain. The authors built models
using Naive Bayes, MaxEnt and Support Vector Machine (SVM) classifiers, with a mutual
information measure for feature selection. This approach showed high performance for the
two-class classification problem, but unsatisfactory results with three classes (negative,
positive and neutral) (Go, Bhayani, & Huang, 2009).
Pak and Paroubek used Twitter, as a popular microblogging platform, to conduct sentiment
analysis on an extensive collection of tweets. A corpus of tweets was labelled as positive,
negative or neutral. The authors labelled a tweet as positive if the message included a happy
emoticon (":-)", ":)", "=)", ":D", etc.) and as negative if it included a sad emoticon
(":-(", ":(", "=(", ";(", etc.). For the objective tweets, however, they retrieved posts from
the Twitter accounts of popular newspapers and magazines. After the data collection, they
carried out some linguistic analysis on the dataset, using POS tagging, with the aim of
finding any differences between subjective (positive and negative) and objective sentences.
The authors noted that there were differences between the POS tags of subjective and
objective Twitter posts. They also noted that there are differences in the POS tags of
positive and negative posts. Data was cleaned by removing URL links, user names (those that
are marked by @), RT (for retweet), the emoticons, and stop words. Finally they tokenized the
dataset and constructed n-grams. Then they experimented with several classifiers including
SVM, but Naive Bayes was found to give the best result. They trained two Naive Bayes
classifiers. One of them uses n-gram presence, and the other, POS tag presence. The
log-likelihood of a sentiment (positive, negative, neutral) of a Twitter post is obtained as
the sum of the log-probabilities of the n-grams present and the log-probabilities of their
POS tags, using the formula derived for Naive Bayes:

$$L(s \mid M) = \sum_{g \in G} \log P(g \mid s) + \sum_{t \in T} \log P(t \mid s)$$

where G is the set of n-grams of the tweet, T is the set of POS tags of the n-grams, M is the
tweet and s is the sentiment (one of positive, negative, and neutral). The sentiment with the
highest likelihood L(s|M) becomes the sentiment of the new tweet. The authors achieved the
best result (highest accuracy) with bigram presence. Their explanation for this is that
bigrams provide a good balance between coverage (unigrams) and capturing sentiment expression
patterns (trigrams). Negation ("not" and "no") is handled by attaching it to the words that
precede and follow it during tokenization. The handling of negation is found to improve
accuracy. Moreover, they report that removing n-grams that are evenly distributed across the
sentiment classes improves accuracy. Evaluation was done on the same test data used by Go,
Bhayani, & Huang (2009). However, they do not state their accuracy as a number, showing it
only in a graph (Pak & Paroubek, 2010).
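
The emoticon labelling and cleaning steps described for Pak and Paroubek can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the regular expressions, the function names and the decision to discard tweets with mixed emoticons are assumptions, and stop-word removal is omitted for brevity.

```python
import re

# Emoticon lists as given in the text.
HAPPY = [":-)", ":)", "=)", ":D"]
SAD = [":-(", ":(", "=(", ";("]


def emoticon_label(tweet):
    """Return 'positive', 'negative', or None when there is no clear signal."""
    has_happy = any(e in tweet for e in HAPPY)
    has_sad = any(e in tweet for e in SAD)
    if has_happy and not has_sad:
        return "positive"
    if has_sad and not has_happy:
        return "negative"
    return None  # no emoticon, or mixed emoticons


def clean_tweet(tweet):
    """Remove URLs, @user names, the RT marker and emoticons, then tokenise."""
    text = re.sub(r"https?://\S+", " ", tweet)   # URL links
    text = re.sub(r"@\w+", " ", text)            # user names
    text = re.sub(r"\bRT\b", " ", text)          # retweet marker
    for emoticon in HAPPY + SAD:
        text = text.replace(emoticon, " ")
    return text.lower().split()


tweet = "RT @user HS2 will be great :) http://example.com/x"
print(emoticon_label(tweet), clean_tweet(tweet))
```

The emoticon is used only to assign the training label and is then removed, so that the classifier has to learn from the remaining words rather than from the emoticon itself.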
Another study, by Barbosa and Feng, employed a two-phase approach to Twitter sentiment
analysis. The phases are: classifying the dataset into objective and subjective classes
(subjectivity detection) and classifying subjective sentences into positive and negative
classes (polarity detection). The authors felt that the use of n-grams for Twitter sentiment
analysis might not be a good strategy since Twitter messages are short. Instead they opted
for two other kinds of features: meta-information about tweets and the syntax of tweets. For
meta-information, they use POS tags (some tags are likely to show sentiment, e.g. adjectives
and interjections) and a mapping of words to prior subjectivity (strong and weak) and prior
polarity, where prior polarity is reversed when a negative expression precedes the word. For
tweet syntax features, they use # (hashtag), @ (reply), RT (retweet), links, punctuation,
emoticons, capitalized words, etc. They create a feature set from both kinds of features and
experiment with the machine learning techniques available in WEKA; SVM performs best. For
test data, 1000 tweets were manually annotated as positive, negative and neutral. The highest
accuracy obtained was 81.9% on subjectivity detection, followed by 81.3% on polarity
detection (Barbosa & Feng, 2010).



In contrast to the machine learning approach, Bollen et al. performed sentiment analysis
using a psychometric instrument, the Profile of Mood States, rather than a machine learning
process. The psychometric instrument extracts six moods (tension, depression, anger, vigour,
fatigue, confusion). The dataset comprised 9,664,952 tweets posted between 1 August and 20
December 2008. The tweets covered political, cultural, social, economic and natural events.
Each tweet was then measured according to the six different moods.
The results were compared with a timeline of notable events that took place in that period.
In concluding, the authors stated: "We find that social, political, cultural and economic
events are correlated with significant, even if delayed fluctuations of public mood levels
along a range of different mood dimensions. To conclude, we bring about the following
methodological contribution: we argue that sentiment analysis of minute text corpora (such as
tweets) is efficiently obtained via a syntactic, term-based approach that requires no
training or machine learning. Sentiment analysis techniques rooted in machine learning yield
accurate classification results when sufficiently large data is available for testing and
training. However, minute texts such as microblogs may pose particular challenges for this
approach." (Bollen, Mao, & Pepe, 2011)
The aggregate of millions of tweets submitted to Twitter at any given time may provide an
accurate representation of public mood and sentiment. This has led to the development of
real-time sentiment tracking such as Northeastern University and Harvard University's "Pulse
of the Nation", which used over 300 million tweets. (Mislove, Lehmann, Ahn, Onnela, &
Rosenquisk, 2010)
Another real-time study was carried out by Sakaki et al., who investigated the real-time
interaction of events during the occurrence of an earthquake. They considered each user as a
sensor and monitored tweets posted about the earthquake. To detect a target event, the work
was carried out as follows. First, a classifier is trained using keywords, message length and
the corresponding context as features to classify tweets into positive or negative cases.
Second, they build a probabilistic spatio-temporal model of the target event to identify the
location of the event. As an application, the authors constructed an earthquake-reporting
system in Japan, where earthquakes are relatively frequent. (Sakaki, Okazaki, & Matsuo, 2010)



2.7 GOVERNMENT CITIZEN SENTIMENT ANALYSIS

There are not many publications on the specific context of applying sentiment analysis to
government-citizen relations. Among the few is Abbasi, who proposed an affect analysis
approach for measuring the presence of hate and violence, and the resulting propaganda
dissemination, across extremist group forums. (Abbasi, 2007) In a similar application,
Bermingham et al. proposed crawling and analysing social media sites, such as YouTube, to
detect radicalism. (Bermingham, Conway, McInerney, O'Hare, & Smeaton, 2009) One of the first
publications the project reviewed was (Arunachalam & Sarkar, 2013). The authors make a strong
case for the importance of applying sentiment analysis in this context. The authors' approach
was to use topic modelling, applying the TF-IDF model.
Sentiment analysis was performed for one of the major social benefits organisations in the
USA. Data was sourced from several social media sites, e.g. Twitter, Facebook and Flickr. A
hotword affinity analysis was carried out within the topic model approach. Hotwords are
parameters that are common across defined topics of interest. They can provide additional
insights into how sentiments around a particular concept are perceived in the context of
different hotwords. A tag cloud was also used to determine which words came up the most.
After their analysis they were able to find out which social benefits programmes and services
received positive and negative sentiments.
The sentiment results were classified into four polarities (positive, negative, neutral and
ambivalent) and visualised using a bar chart. From the chart it was easy to see which social
benefit programmes received which polarity of sentiment.
Governments' analysis of the sentiments of their people should not be a one-time event but, if
possible, an ongoing daily process, so that governments can monitor the sentiments of
citizens on a regular basis.
Arunachalam & Sarkar (2013) note that in 2010 Gartner instituted the Open Government Maturity
Model, and that Gartner proposed sentiment analysis as a means for governments to achieve
collaboration in line with that model. Forrester Research observed that the US federal
government was monitoring citizen sentiment on Twitter. Gartner also called for governments
to use social media for collaborative budgeting and pattern discovery, where analysis of
citizens' sentiment on social media can play a significant role.
It is therefore worth looking at the architecture proposed by these authors as a possible
system for analysing sentiments on a regular basis.

2.3.7.1 SYSTEM ARCHITECTURE FOR CITIZEN SENTIMENT ANALYSIS
The architecture used by (Arunachalam & Sarkar, 2013) is adopted from what IBM has proposed
as an effective approach to both topic modelling and sentiment analysis. The various
components of such a system are described below:
GPFS: The IBM General Parallel File System is a specialised file system targeted at high
performance applications such as big data analytics.
HADOOP: Apache Hadoop is an open source software framework for running data-intensive
applications in a distributed fashion over commodity hardware.
SYSTEMT: A rule-based information extraction (IE) system that was proposed in the works of
(). It uses a declarative rule language, AQL, to define the Natural Language Processing (NLP)
rules for information extraction from documents.

FLOWMANAGER: Based on the rules and configurations, this component orchestrates the execution
of different tasks across the different components of the system.

LUCENE: Apache Lucene is an open-source framework for information retrieval applications.

AdminUI: The user interface used by administrators to configure the system and define AQL
rules using simple interfaces.

ANALYSISUI: The user interface component that enables sentiment analysis execution and
rendering using the Lucene component.

DATA FETCHER: The social media interfacing component that interacts with diverse sources,
fetches information in different formats, produces a JSON representation of it and saves it
into GPFS.

TOPICEXTRACTOR: With the help of the natural language processing rules in SystemT, this
component extracts information from the JSON data created by DataFetcher. It computes the
term frequency and inverse document frequency values and produces the matrix X. This
component runs as a Hadoop job.
TOPICMODELLER: This component computes the estimated matrices W and H. It employs the
proximal Rank-One Residue Iterations (Proximal RRI) optimization algorithm as proposed by ().
It also produces JSON documents annotated with topic information. This component also runs as
a Hadoop job.
Uploader: This component picks up the annotated JSON documents produced by TopicModeller and
uploads them into a staging area. Lucene indexes these documents so that they can be searched
and analysed based on the extracted information, using traditional sentiment analysis
techniques for subjectivity detection and sentiment classification.
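
To make the term frequency and inverse document frequency values computed by TopicExtractor more concrete, the following is a minimal Python sketch of one common TF-IDF weighting. It is not the IBM implementation; the function name tf_idf, the example documents and the particular variant (tf = count / document length, idf = log(N / df)) are assumptions for illustration only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute a TF-IDF weight for every (document, term) pair.

    docs: list of token lists. Returns one {term: weight} dict per document.
    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    df = Counter()                      # number of documents containing each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical example documents
docs = [
    "benefits claim delayed".split(),
    "benefits office helpful".split(),
    "claim rejected".split(),
]
X = tf_idf(docs)
print(X[0])
```

Each dictionary in X corresponds to one row of the document-term matrix; a term that occurs in every document receives a weight of zero under this variant.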


Figure 2.1 An example of a system architecture for Sentiment Analyser





2.8 AN OVERVIEW OF DATA MINING (STRUCTURED) AND TEXT MINING (UNSTRUCTURED DATA)

Data mining is a field which has seen rapid advances in recent years (Han & Kamber, 2005) due
to advances in hardware and software technology, which have led to the availability of
different kinds of data. One such kind is text data, which resides in large repositories such
as the web and, more specifically, social networks. (Aggarwal & Zhai, 2012)
Unstructured data refers to information that either does not have a pre-defined data model or
is not organised in a pre-defined manner. (Nemschoff, 2014) Examples are books, emails and
social media.
Structured data is simple data that can be easily organised. (Nemschoff, 2014) Structured
data is normally ready for seamless integration into a database or a well-structured file
format such as XML. (Johnson, 2012) Examples of such data are sensory data, point-of-sale
data, web server data and recorded data entries such as gender, age and postcode.
Structured data is generally less noisy and is managed with a database system; text data, on
the other hand, is relatively noisier and is typically managed via a search engine due to its
lack of structure. (Gundecha & Liu, 2012)
Because text data is different, the mining techniques which can be employed are also
different. A key characteristic of text data is that it is sparse and high-dimensional
(Aggarwal & Zhai, 2012): e.g. a corpus may be drawn from a lexicon of 100 000 words, while a
given text document may contain only a few hundred words. Thus a corpus of text documents can
be represented as a sparse term-document matrix of size n × d, where n is the number of
documents and d is the size of the lexicon vocabulary. The (i, j)th entry of this matrix is
the normalised frequency of the jth word in the lexicon in document i. (Aggarwal & Zhai,
2012) It is therefore desirable to transform text data into a structured format prior to
applying traditional data mining tasks such as clustering and classification. (Feinerer,
Hornik, & Meyer, 2008)
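
As an illustration of the sparse term-document matrix described above, the following Python sketch stores only the non-zero normalised frequencies of each document. The function name term_document_matrix and the two example documents are assumptions for illustration, and normalising by document length is one possible choice of normalisation.

```python
from collections import Counter

def term_document_matrix(docs):
    """Build a sparse n x d matrix of normalised term frequencies.

    docs: list of token lists. Returns (vocab, rows), where vocab maps each
    word to a column index j and rows[i] is a dict {j: frequency} holding
    only the non-zero entries of document i.
    """
    vocab = {w: j for j, w in enumerate(sorted({w for d in docs for w in d}))}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        rows.append({vocab[w]: c / total for w, c in counts.items()})
    return vocab, rows

# Hypothetical example documents
docs = ["hs2 is too costly".split(), "hs2 will create jobs jobs".split()]
vocab, rows = term_document_matrix(docs)
print(len(docs), "x", len(vocab))        # n x d
print(rows[1][vocab["jobs"]])            # entry for 'jobs' in the second document
```

Words that do not occur in a document have no entry in its row, which is what keeps the representation compact when d is large.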

Text data can be analysed at different levels of representation. For example, text data can
be treated as a bag of words, or it can be treated as a string of words. However, in most
applications it would be desirable to represent text information semantically so that more
meaningful analysis and mining can be done. (Aggarwal & Zhai, 2012)


2.9 SENTIMENT ANALYSIS AND MODELLING TECHNIQUES.

In the field of data mining and analytics, a model is defined simply as an algorithm or set
of rules that connects a collection of inputs (often in the form of fields in a corporate
database) to a particular target or outcome. (Berry & Linoff, 2004)

A number of different techniques can be used in modelling sentiment analysis. These modelling
approaches are machine learning techniques, and they can be supervised, unsupervised or a
combination of the two.

In the supervised technique the task is to build a classifier. The classifier requires
labelled training data from which the model is built and trained. Algorithms used here include
support vector machines (SVM), the Naïve Bayes classifier and Multinomial Naïve Bayes.
According to (Pang, Lee, & Vaithyanathan, 2002), supervised techniques can use one or a
combination of the approaches stated earlier. E.g. a supervised technique can use a
relationship-based approach, a language model approach or a combination of them. For
supervised techniques, the text to be analysed must be represented as a feature vector.
(Pang, Lee, & Vaithyanathan, 2002) state that supervised techniques outperform unsupervised
techniques.

In the unsupervised technique, classification is done by a function which compares the
features of a given text against discriminatory-word lexicons whose polarity is determined
prior to their use. E.g. starting with positive and negative word lexicons, one can look for
their entries in the text whose sentiment is being sought and count the matches. If the
document has more positive lexicon words it is positive, otherwise it is negative. (Turney,
2002) uses a slightly different approach, employing a simple unsupervised technique to
classify reviews as recommended (thumbs up) or not recommended (thumbs down) based on the
semantic information of phrases containing an adjective or adverb. He computes the semantic
orientation of a phrase as the mutual information between the phrase and the word
"excellent" minus the mutual information between the phrase and the word "poor". From the
semantic orientations of the individual phrases, an average semantic orientation of the
review is computed. A review is recommended if the average semantic orientation is positive,
and not recommended otherwise.
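
Turney's semantic orientation can be written compactly as follows. This is a restatement of the method in (Turney, 2002), in which the reference words are "excellent" and "poor"; the symbol names PMI and SO and the notation for the review R with its extracted phrases are chosen here for illustration.

```latex
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1 \,\&\, w_2)}{p(w_1)\, p(w_2)}

\mathrm{SO}(\mathit{phrase}) = \mathrm{PMI}(\mathit{phrase}, \text{``excellent''})
                             - \mathrm{PMI}(\mathit{phrase}, \text{``poor''})

\overline{\mathrm{SO}}(R) = \frac{1}{|R|} \sum_{\mathit{phrase} \in R} \mathrm{SO}(\mathit{phrase})
```

Here p(w_1 & w_2) is the probability that the two terms co-occur, and R is the set of phrases extracted from a review. The review is classified as recommended when the average is positive.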

DISCOURSE APPROACH: In this approach, the discourse relations between text components
are used to guide the classification. According to (Pang, Lee, & Vaithyanathan, 2002),
in movie reviews the overall sentiment is usually expressed at the end of the text.
The approach to sentiment analysis in this case is therefore discourse-driven: the
sentiment of the whole review is obtained as a function of the sentiment of the
different discourse components in the review and the discourse relations that exist
between them. In such an approach, the sentiment of a paragraph at the end of the
review might be given more weight in determining the sentiment of the whole review.
RELATIONSHIP-DRIVEN APPROACH: In this approach the classification task deals with the
different relationships that may exist within or between features and components.
These include relationships between discourse participants or between product
features. E.g. to know the sentiment of customers concerning a brand, one can compute
it as a function of the sentiment on its different features or components.
LANGUAGE MODEL DRIVEN APPROACH: Classification in this approach is carried out by
building n-gram language models. Either the presence or the frequency of n-grams may be
used. In traditional information retrieval and topic-oriented classification, the
frequency of n-grams is shown to deliver better results. Usually, the frequency is
converted to TF-IDF to take a term's importance for a document into account. (Pang,
Lee, & Vaithyanathan, 2002), in their movie review classification, found that term
presence gives better results than term frequency. They indicated that unigram
presence is more suited to sentiment analysis. With product reviews, however, the
authors concluded that bigrams and trigrams worked better than unigrams in sentiment
analysis.
KEYWORD/KNOWLEDGE MODEL APPROACH: This approach sees sentiment as a function of
certain keywords. The main task is the construction of sentiment discriminatory-word
lexicons that indicate a particular class, such as the positive class or the negative
class. The polarity of the words in the lexicon is determined prior to the sentiment
analysis work. There are variations in how the lexicon is created. Lexicons can be
created by starting with some seed words and then using linguistic heuristics to add
more words to them, or by starting with some seed words and adding other words to them
based on frequency in a text (Turney, 2002).
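
The difference between n-gram presence and n-gram frequency mentioned under the language model approach can be sketched in Python as follows. The function names and the example sentence are assumptions made for illustration and are not taken from any of the cited studies.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequency_features(tokens, n):
    """Map each n-gram to the number of times it occurs."""
    return dict(Counter(ngrams(tokens, n)))

def presence_features(tokens, n):
    """Map each n-gram to 1 if it occurs at all (binary presence)."""
    return {gram: 1 for gram in set(ngrams(tokens, n))}

tokens = "good good film but not a good ending".split()
print(frequency_features(tokens, 1)[("good",)])   # how often the unigram occurs
print(presence_features(tokens, 1)[("good",)])    # whether the unigram occurs
print(ngrams(tokens, 2)[:3])                      # first three bigrams
```

With presence features a repeated word contributes no more than a word used once, which is the representation Pang et al. found to work better for movie reviews.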





2.10 OVERVIEW AND WAY FORWARD FROM THE LITERATURE REVIEW

An overview of similar works on sentiment analysis in general, and Twitter-specific sentiment
analysis in particular, has been presented in the preceding sections. From the review it can
be noted that there are various ways of approaching sentiment analysis depending on the
nature of the document to be analysed, e.g. the approach to a movie review would be different
from the approach to a product review. In the same way, Twitter sentiment analysis comes with
its own approach and challenges.

In approaching Twitter sentiment analysis certain factors must be considered. One factor is
that Twitter posts are short, usually 140 characters at most. This means that certain
classification models, such as the discourse-driven model, cannot be applied successfully to
Twitter. Similarly, the relationship-based model becomes irrelevant because there is no such
thing as a whole-component relationship in tweets. This leaves us with the other two
approaches, which are the language model and the knowledge-based model. These two approaches
are what most of the reviewed studies have implemented, as the previous two are largely
theoretical approaches and therefore hardly used.
The choice between these two determines what technique to use. While the knowledge-based
approach mostly involves unsupervised techniques, language models use supervised machine
learning techniques.
Most of the Twitter-specific sentiment analysis studies reviewed above used supervised
techniques or achieved better results with them. (Pang, Lee, & Vaithyanathan, 2002) stated in
their work that supervised techniques outperform unsupervised techniques.

From our review we have identified various algorithms and features which have been applied,
examples of which include TF-IDF, LDA, POS tags, n-grams, Naïve Bayes and SVM.

This research aims to classify Twitter-based sentiment by polarity, which indicates that the
language model approach should be adopted.
However, since our general aim is to perform analysis, it makes sense to also perform an
exploratory analysis using the additional dimension of the knowledge model. This should give
an idea of the various topics contained in our data set at the time of collection. A fair
idea of these topics would help in the classification effort. However, our primary focus is
still to classify the data into one of three polarities.

After a careful study of the various models, this research carries out the sentiment analysis
in three different phases or stages:

The first stage will be a term frequency analysis to find the most commonly occurring
terms in the data. This will give us an insight into which terms are associated with
the HS2 project.
The second stage will be topic modelling using the LDA algorithm. Topic modelling is
one of the common approaches that some researchers have used, e.g. (Arunachalam &
Sarkar, 2013). Grouping frequently occurring terms into topics should give us an
additional dimension into which topics are inherent within the data set. Looking at
the polarity of these topics can give an indication of the polarity of the entire
corpus.
The final stage will be to apply sentiment analysis to the data (corpus). From the
review a number of ways have been suggested; however, this project will use an inbuilt
classifier to classify the tweets into the three polarities stated already.


















CHAPTER THREE
3 PROBLEM SPECIFICATIONS

With advances in data mining and text mining algorithms, coupled with the huge volume of text
data being generated daily on many different social media platforms (in this context,
Twitter), it is more feasible than ever before for governments to analyse opinions and
sentiments. However, undertaking such analysis is challenging because applying sentiment
analysis to Twitter data is a relatively young field. The challenges include data retrieval
(information extraction) (Arunachalam & Sarkar, 2013), the unstructured and noisy nature of
the text data, linguistic semantics arising from the informal and specialised language used,
the ambiguity of the languages used to create content (Gundecha & Liu, 2012), and identifying
which classification approach suits the particular domain and type of textual data to be
classified.
The problem/research question is to find a suitable model which would best classify
sentiments into three classes, i.e. positive, negative and neutral, with high accuracy, taking
into account the nature of the data, i.e. Twitter data. Based on a high-accuracy model,
conclusions could be drawn about general opinion towards the HS2 project.
The solution to the stated problem is to employ the knowledge discovery from data (KDD)
approach used in traditional data mining for similar problem specifications. This process is
discussed in more detail in the methodology.


3.1 METHODOLOGY

A case study is used as the primary method of applying sentiment analysis to public
administration. In this case study a mixed method, combining qualitative and quantitative
approaches, is used in carrying out the research design.
The qualitative approach involves comprehensive research on the relevance of sentiment
analysis in our lives and on why governments should apply this area of business intelligence
in the governance process. A study of the evolving paradigm of social media would be useful
in understanding why social media provides a good platform for bringing the governing and the
governed closer, as well as being a rich source of data.
The qualitative approach would be carried out by way of an extensive literature review
(chapter 2), studying and citing sources from academic journals, reports and publications
and, in a few instances, textbooks. This would be carried out using the university library,
online libraries and other online sources. This study would serve as a guide in implementing
a model for the task.
The quantitative approach, which is central to the research, would involve the detailed
process of data/information extraction and collection, data storage, data processing, data
analysis and data visualisation. These processes would culminate in the eventual model for
this research (chapter 4).

Figure 3.1 Methodological steps
[Literature review -> Model building -> Analysis of results -> Interpretation of results ->
Conclusion]
Model building: build a suitable model using a classification algorithm to classify
sentiments as positive, negative or neutral. In this stage the same tools used in
conventional opinion mining, such as the Natural Language Toolkit, would also be applied.
Model building is an entire process that consists of the steps illustrated below.
Figure 3.2 Modelling process
[Define goals -> Collect Twitter data -> Process data -> Exploratory data analysis -> Build
an appropriate model -> Interpret results]
The only source of data for this research is Twitter. The data is obtained by requesting it
through Twitter's Application Programming Interface (API). The data used in this research is
therefore based on empirical primary data collection rather than on a publicly available
dataset.
The model building process hinges on the use of computational intelligence enabled by
software. The software used in this study is Python and R.

The models chosen for the analysis are a topic model (unsupervised) and a classification
model (supervised) that would classify sentiments into three major classes, positive,
negative and neutral, so that we can measure public opinion on the HS2 project.

Analysis of results: The focus of our analysis would be to measure the general polarity of
the processed corpus (tweets). This would be done by looking at the proportions of the three
polarities. The polarity with the highest proportion or percentage will give an indication of
the general feeling of the public towards the project. Attempts would be made to compare this
with existing polls or surveys on the HS2 project.
Interpretation of results: In order to interpret the results, appropriate visualisation
methods would be used in presenting them. The visualisation would be done using R.
Conclusions: The conclusion would provide a summary of how successful the project has been
and of the objectives met, the challenges faced during the course of the work, and
recommendations for further areas of research.

3.2 METHODOLOGICAL JUSTIFICATION

This research uses a case study to demonstrate the relevance of sentiment analysis to
government-citizen relations, using the real-life case of the HS2 project. Yin defines the
case study research method as "an empirical inquiry that investigates a contemporary
phenomenon within its real-life context; when the boundaries between phenomenon and context
are not clearly evident; and in which multiple sources of evidence are used" (Yin, 1984). Yin
further states that the importance of a case study is, among other things, to bring an
understanding of a complex issue or object and to extend or add strength to what is already
known through previous research. Case studies also emphasise detailed contextual analysis of
a limited number of events. (Yin, 1984) Most case studies involve the steps already mentioned
earlier, i.e. data collection, analysis and reporting.
This methodology is a widely used and accepted form of research. Its use in this research is
therefore justified, following its methods of inquiry or investigation.






3.3 SOFTWARE USE JUSTIFICATION

A number of software packages are available for use in this research, each with its merits
and demerits. This research chooses two widely used open source packages: Python and R.
Python is a scripting language used to write quick and small programs or scripts. Among
interpreted languages, Python is distinguished by its large and active scientific computing
community. In recent years Python's improved library support has made it a strong alternative
for data manipulation tasks. Together with Python's strength in general purpose programming,
this makes it an excellent choice as a single language for building data-centric
applications. (McKinney, 2013)
R is a free functional language and software environment for statistical computing and
graphics. R has extensive and powerful graphics abilities that are tightly linked with its
analytic abilities. The R system is developing rapidly, with new features and abilities
appearing every few months. It is widely used among statisticians and data miners for data
analysis. (Maindonald, 2008)
Both packages are capable of collecting tweets through the Twitter API, and both can
pre-process as well as analyse text data for the purpose of sentiment analysis. R also has
good tools for visualising results.
Since both are widely used within the research community, both would serve the purpose of
sentiment analysis.














CHAPTER FOUR

4.0 PROJECT IMPLEMENTATION

This section describes the practical implementation steps used in the text mining/sentiment
analysis process.

DATA COLLECTION: The first step was to collect the data relevant to our interest. As
stated, the source of data is tweets on the Twitter network that discuss the HS2
project. This process is carried out using an API. An API can be defined as a set of
routines by which an application program allows another application program to work
directly with it. (Horak, 2007) Twitter has developed its RESTful API so that users
can access tweet data. Tweets were collected using ActiveState Python version 2.7.6.9.
There are two ways of obtaining data from Twitter: the historic/search data and the
streaming data. The historic data returns a finite number of tweets from the last 7
days prior to making a request, while the streaming data returns a requested number of
tweets in real time, i.e. just as users tweet.

In order to obtain tweets that are relevant to the subject of study, the query must
indicate either the subject name or the name preceded by the @ or # symbol, e.g. (HS2,
@HS2 and #HS2). Another consideration is the number of data instances required for the
task at hand. Data mining tasks thrive on a large number of data instances/records.
Generally, larger datasets are better for training a given model, because models learn
better and are more accurate with more data than with less. Most mining tasks use tens
of thousands of data instances.
A total of 10833 tweets was successfully collected using both the historic (search)
and streaming API methods: 2870 historic tweets were collected on 16/07/2014, and the
remaining 8013 streaming tweets were collected between 20/07/2014 and 11/08/2014.
The data was stored in MS Excel in order to quantify the number of tweets. It was
later saved as a CSV file for processing. (See Appendix for the code used to collect
the tweets.)

UNDERSTANDING YOUR DATA: It is important that the researcher understands and gets to
know the data being collected. This is essential for making meaningful analysis.
Therefore, as an important step, the collected data was studied in the context of the
public debate in order to understand the various issues underlying the project.


DATA PRE-PROCESSING: As stated earlier in chapter 3, text data is unstructured and
relatively noisier than numeric data. It is therefore essential that the data
undergoes a pre-processing stage in which it is cleaned and transformed into a more
structured format. This step was embedded in the analysis stage in R, for both term
frequency and LDA topic modelling. Most sentiment analysis tasks do little
pre-processing, so as to preserve as much of the data as possible, because some
pre-processing tasks such as stemming and stopword removal can affect the polarity of
sentences, e.g. stopword removal may take out "not" and "no", which are mainly used to
negate adjectives in sentences.
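
The point about stopword removal and negation can be illustrated with a small Python sketch. The pre-processing in this project was done in R, so this is only an illustrative equivalent; the stopword list, the KEEP set and the function name preprocess are assumptions and not the ones used in the study.

```python
import re

# Small illustrative stopword list (an assumption, not the list used in the study).
STOPWORDS = {"the", "is", "a", "an", "to", "of", "and", "not", "no", "it"}
# Negation words are kept even though they appear in the stopword list.
KEEP = {"not", "no"}

def preprocess(tweet):
    """Lower-case a tweet, strip URLs and punctuation, drop stopwords."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)      # remove links
    text = re.sub(r"[^a-z0-9@#\s]", " ", text)     # keep letters, digits, @ and #
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS or t in KEEP]

print(preprocess("The #HS2 line is NOT good value! http://example.com"))
```

Keeping "not" in the output means that "not good" can still be distinguished from "good" at the classification stage.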


4.1 UNDERTAKING SENTIMENT ANALYSIS


Figure 4 Process diagram showing the stages of analysis
[Stage 1: Term frequency analysis -> Stage 2: LDA topic modelling -> Stage 3: Sentiment
analysis classifier and scoring]

4.1.1 TERM FREQUENCY ANALYSIS

Term frequency, also referred to as word probability, is a method of using the frequency of
input words as an indicator of importance. (Nenkova & McKeown, 2012) The probability of a
word w, P(w), is calculated from an input as the number of occurrences of the word, C(w),
divided by the number of all words in the input, N:

$$P(w) = \frac{C(w)}{N}$$

The aim of applying this method is to find the frequently used words, and to see whether
these words give an insight into the terms that are associated with the HS2 debate.

Since this is a debate among the public, the view taken in this research is that frequently
used terms should be given more priority than less used terms. E.g. if people feel the
economic benefits of HS2 are enormous, then one would expect the terms "economic" and
"benefit" to feature prominently in the frequency analysis. This view is taken in preference
to using TF-IDF, which gives preference to rare words. The term frequency analysis was
carried out in R using the text mining library. A term association analysis was also carried
out to find which other terms were associated with the most frequent terms.
In carrying out the term frequency process, a number X is specified; X can be considered a
minimum frequency threshold, i.e. a lower bound on the number of occurrences C(w) of a term in
the entire corpus and hence on P(w). All terms with C(w) ≥ X are displayed. The values used
for X were 20, 100, 500, 1000 and 1500. The aim of increasing the value of X was to determine
the single most used term in the corpus. However, the analysis of the terms was done using
X = 100.
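
The term frequency step can be sketched in Python as follows. The analysis in this project was carried out in R, so this is only an illustrative equivalent; the function names, the example tokens and the interpretation of X as a minimum number of occurrences (min_count) are assumptions.

```python
from collections import Counter

def word_probabilities(tokens):
    """Return P(w) = C(w) / N for every word w in the input."""
    counts = Counter(tokens)
    n = len(tokens)
    return {w: c / n for w, c in counts.items()}

def frequent_terms(tokens, min_count):
    """Return the words occurring at least min_count times, most frequent first."""
    counts = Counter(tokens)
    return [w for w, c in counts.most_common() if c >= min_count]

# Hypothetical example corpus
tokens = "hs2 cost hs2 jobs cost hs2 route".split()
print(word_probabilities(tokens)["hs2"])   # C(w) / N for 'hs2'
print(frequent_terms(tokens, 2))           # terms with at least 2 occurrences
```

Raising min_count step by step shrinks the list until only the most used term remains, which mirrors the use of increasing values of X.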


4.1.2 LATENT DIRICHLET ALLOCATION TOPIC MODELLING

LDA is a generative probabilistic model of a corpus. The basic idea is that documents are rep-
resented as random mixtures over latent topics, where each topic is characterized by a distri-
bution over words. (Blei, Ng, & Jordan, 2003)
LDA provides the mechanism for finding patterns of term co-occurrence and using those pat-
terns to identify coherent topics. LDA results in topics in which the terms that are most proba-
ble frequent co-occur with each other in documents. (Crain, Zhou, Yang, & Zha, 2012) LDA
model is usually illustrated graphically.

Figure 4.1 LDA graphical representation
The LDA model is represented as a probabilistic graphical model in Fig. 4.1. As the figure
makes clear, there are three levels to the LDA representation. The parameters $\alpha$ and
$\beta$ are corpus-level parameters, assumed to be sampled once in the process of generating a
corpus. The variables $\theta_d$ are document-level variables, sampled once per document.
Finally, the variables $z_{dn}$ and $w_{dn}$ are word-level variables and are sampled once for
each word in each document (Blei, Ng, & Jordan, 2003). LDA is widely used for topic modelling in
many research studies, hence employing it is appropriate for the task of finding interesting
topics and term co-occurrence in our corpus.
Models with 20 and 50 topics were chosen, with each topic containing 20 terms (20x20 and
50x20). Our aim is to determine whether we can identify interesting topics within the corpus.
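The three levels of the generative process described by Blei, Ng and Jordan (2003) can be sketched in a few lines of Python. This is only an illustration of how LDA assumes a document is generated, not the fitting code used in this study; the two topics, the four-word vocabulary and all parameter values are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, beta, vocab, n_words, rng):
    """Generate one document following the LDA generative story."""
    # Document level: topic mixture theta_d, sampled once per document.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Word level: topic z_dn drawn from theta_d, then word w_dn from that topic.
        z = rng.choices(range(len(alpha)), weights=theta)[0]
        w = rng.choices(vocab, weights=beta[z])[0]
        words.append(w)
    return theta, words

# Corpus level (hypothetical values): alpha is the Dirichlet prior over topics,
# beta holds one word distribution per topic.
vocab = ["stophs2", "cost", "jobs", "london"]
alpha = [0.5, 0.5]
beta = [
    [0.60, 0.30, 0.05, 0.05],  # a topic dominated by opposition and cost terms
    [0.05, 0.05, 0.50, 0.40],  # a topic dominated by jobs and London
]
rng = random.Random(0)
theta, words = generate_document(alpha, beta, vocab, 8, rng)
print(theta, words)
```

Fitting LDA reverses this process: given only the observed words, it estimates the topic mixtures and the per-topic word distributions.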


4.1.3 SENTIMENT ANALYSIS SCORE

The sentiment analysis is undertaken in two steps. The first is to use an inbuilt Naïve Bayes
sentiment analyser in Python to classify our corpus into 3 different polarities. The Naïve Bayes
classifier is a commonly used generative classifier based on Bayes' theorem. It models the
distribution of the documents in each class using a probabilistic model with independence
assumptions about the distribution of different terms. It computes the posterior probability of a
class, based on the distribution of the words in the document, and works under the bag-of-words
assumption (Aggarwal & Zhai, 2012).
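To make the posterior computation concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing. It is not the inbuilt analyser used in this study; the function names and the four labelled training sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """Collect class counts and per-class word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def posterior(text, model):
    """Return P(class | text) for every class under the bag-of-words assumption."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # log prior
        for token in text.lower().split():
            if token in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothed likelihood of the word given the class.
                score += math.log((word_counts[label][token] + 1)
                                  / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalise the log scores into probabilities that sum to one.
    top = max(log_scores.values())
    unnormalised = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(unnormalised.values())
    return {k: v / norm for k, v in unnormalised.items()}

# Hypothetical training examples, not taken from the collected tweets.
training = [
    ("hs2 will create jobs", "positive"),
    ("great economic benefit", "positive"),
    ("hs2 is a waste", "negative"),
    ("stop this waste of money", "negative"),
]
model = train_naive_bayes(training)
probs = posterior("waste of money", model)
print(probs)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.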
The second step was to develop a dictionary based on the term frequency results. The words in
the dictionary were manually scored from -5 to 5, with the following polarity bands: (-5 and -4:
very negative), (-3 and -2: negative), (-1 to 1: neutral), (2 and 3: positive), (4 and 5: very
positive). This dictionary, together with a similar one developed by Finn Årup Nielsen
(www.opendatacommons.org), will be used to score the entire corpus. The second method, though
similar to the first, is expected to classify better than the first because it uses the specific
lexicon that was used in the HS2 discussions.
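A dictionary-based scorer of this kind can be sketched as follows. The lexicon entries and their scores are hypothetical examples, and scoring a tweet by the rounded mean of its matched word scores is an assumption made for illustration; the polarity bands are the ones given above.

```python
def polarity_label(score):
    """Map an integer score on the -5..5 scale to one of the five polarity bands."""
    if score <= -4:
        return "very negative"
    if score <= -2:
        return "negative"
    if score <= 1:
        return "neutral"
    if score <= 3:
        return "positive"
    return "very positive"

def score_tweet(tweet, lexicon):
    """Score a tweet as the rounded mean of the scores of its words found in the lexicon."""
    hits = [lexicon[token] for token in tweet.lower().split() if token in lexicon]
    if not hits:
        return 0  # no scored words: treated as neutral
    # round() returns an int; exact halves go to the nearest even integer.
    return round(sum(hits) / len(hits))

# Hypothetical manually scored lexicon entries (not the dictionary built in this study).
lexicon = {"waste": -4, "scathing": -3, "stophs2": -4,
           "jobs": 3, "benefit": 3, "affordable": 2}

for tweet in ["hs2 is a waste", "hs2 will create jobs and benefit", "hs2 update today"]:
    print(tweet, "->", polarity_label(score_tweet(tweet, lexicon)))
```

A domain-specific entry such as stophs2 would not appear in a general-purpose sentiment dictionary, which is the motivation for building one from the corpus itself.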


4.2 CHALLENGES AND ADJUSTMENTS

A number of challenges were faced in the analysis process. The first was getting more data than
was used. The aim was to collect at least 15,000 tweets. However, the streaming method delivers
tweets only as and when users tweet, which can make collection slow if not many people are
tweeting at a given time. The average rate of collection was about 250 tweets a day. It was
therefore decided to scale down the target to at least 10,000 tweets in order to keep to the
project timelines.
The major challenge was applying the dictionary-based approach to scoring the sentiment.
The process applied in R produced coding errors and did not generate any results. A decision to
use a widely known sentiment analysis method by Jeffrey Breen also did not produce any meaningful
result.
Another challenge was the time constraint in manually labelling thousands of individual tweets
into the three polarities in order to train a classifier with good accuracy.
The adjustments made were therefore to focus on the results of the topic modelling and on the
inbuilt Naïve Bayes sentiment analyser for the classification task.





















CHAPTER FIVE


5. RESULTS AND ANALYSIS

5.1 TERM FREQUENCY RESULTS AND ANALYSIS

Figure 5.1 shows the output for terms with X = 100. 162 terms are returned as output from the
entire corpus (in alphabetical order). With a good understanding of the issues and debates, we
can make some inferences from the terms displayed.

Figure 5.1 Display of terms for X = 100
I. Terms that imply economic viability and benefits feature very well. Terms such as
affordable, benefit(s), billion, cost, economic, taxpayer, jobs, waste and welfare
indicate that the issues of economic impact are well debated. While terms like waste
can clearly be considered negative, jobs can be seen as a positive term.

II. A number of cities where the project would take place also feature. London and
Birmingham are the two main cities that phase one of HS2 is expected to connect. An
interesting city that comes up is Liverpool, since Liverpool is not originally among
the cities to be connected. This raises the question of why Liverpool features. One
way to find out is to carry out a term frequency association analysis.





III. Some notable politicians also feature as terms: Prime Minister David Cameron and
finance secretary George Osborne. The political parties uklabour and ukip are the only
parties that feature among the several parties, including the parties of the coalition
government.

IV. Other significant terms that appear are hs2aintgreen, hs2aa, hs2facts, stophs2, anti
and support.

V. Also featuring in the terms are the people who contribute to the debate the most,
whom we can term authors. 5 different authors feature with their twitter names.
The most used term was found to be stophs2, at X = 1500. Stophs2 is one of the many terms used
to voice displeasure with the project, and there are numerous groups with the @stophs2 or #hs2
campaign.

5.1.1 TERM ASSOCIATION
A term association analysis for selected terms was carried out, with 10% as the minimum
threshold for other terms associated with the main terms we used as queries. The results for the
top terms and the tweets that give rise to those terms are displayed below.
From Table 5.1 one can say that frequent term association gives a deeper insight into the most
frequent terms. It can also help identify which tweets (in this case, which issues) are being
tweeted the most. Using the term Liverpool as an example, we can now infer that there is a
strong campaign to include the city in the HS2 network.
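The idea behind term association can be sketched in Python as a correlation between term occurrence patterns across tweets. This assumes that association is measured as the Pearson correlation between term vectors, which is how the `findAssocs` function of the R tm package works; using binary presence or absence rather than counts, and the four toy tweets, are simplifications made for illustration.

```python
import math

def occurrence_vector(term, tweets):
    """1.0 for each tweet that contains the term, 0.0 otherwise."""
    return [1.0 if term in tweet.lower().split() else 0.0 for tweet in tweets]

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def associated_terms(query, tweets, min_corr=0.10):
    """Terms whose correlation with the query term is at least min_corr, strongest first."""
    vocab = {t for tweet in tweets for t in tweet.lower().split()} - {query}
    q = occurrence_vector(query, tweets)
    scored = ((t, round(pearson(q, occurrence_vector(t, tweets)), 2)) for t in vocab)
    return sorted((p for p in scored if p[1] >= min_corr), key=lambda p: (-p[1], p[0]))

# Hypothetical toy tweets, not the collected corpus.
tweets = [
    "liverpool hs2 link tourist destination",
    "liverpool needs an hs2 link",
    "hs2 is a waste stophs2",
    "stophs2 report blocking",
]
result = associated_terms("liverpool", tweets)
print(result)
```

A heavily retweeted message makes all of its words co-occur in the same tweets, which is why a single tweet can account for a whole group of equally associated terms.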











| TERM | ASSOCIATED TERMS (10%) | TWEET |
| stophs2 | Hs2facts 37%, blocking 24%, illegally 24%, publication 24%, report 24%, scathing 24% | The UK government is blocking the publication of a scathing hs2 report |
| hs2facts | blocking 42%, illegally 42%, publication 42%, report 42%, scathing 42%, stophs2 37% | The UK government is blocking the publication of a scathing hs2 report |
| Liverpool | destination 51%, link 51%, tourist 51%, uks 51%, visitors/year 48%, fastest 43%, growing 44% | Liverpool is UKs fastest growing destination an hs2 link would add 700k+ extra visitors/year |
| Government | Hs2facts 54%, blocking 54%, illegally 54%, publication 54%, report 54%, scathing 54% | The UK government is blocking the publication of a scathing hs2 report |
| Economy | 26k 38%, 8.3bn 36%, Kent 36%, Transform 32%, Adding 32%, #osborne 26% | No single tweet makes up these terms but a combination of several. |
| Jobs | 73% 64%, Created 62%, London 33% | 73% of all jobs created by hs2 would be in London |
Table 5.1 Terms and their frequent associated terms



5.2 LDA TOPIC MODELLING RESULTS AND ANALYSIS

Figure 5.2 LDA topic modelling results

From Figure 5.2 a number of interesting facts can be deduced about the various topics contained
in the corpus:
I. The term stophs2 is grouped 3 times, in topics 4, 6 and 7. It also appears as a
term in all the remaining 17 topics. Just as in the term frequency analysis,
stophs2 is the dominant term that appears most in the corpus. This result
supports a number of inferences about the data:
- Most of the tweets use either @hs2 or #hs2 as the subject of discussion
concerning the project.
- The stophs2 campaigners are actively using twitter as one of the platforms
to argue their point of view.


II. Again, as in the case of term frequency, another interesting consideration is
the appearance of Liverpool and London in topics 15 and 17 respectively. The
explanations in Table 5.1 can be extended to why these two feature as topics in
the corpus.

III. Another interesting result is the display of the twitter names of two authors in
topics 11 and 14. Though this aspect is not an objective of the study, it suggests
that much of the discourse is driven by these two authors.






5.2.1 MODEL EVALUATION

Topic models are evaluated using perplexity (Blei, Ng, & Jordan, 2003). Perplexity measures
how well a language model fits the word distribution of a corpus and is defined by:

$$\mathrm{Perplexity}(l) = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 p_l(x_i)}$$

where $p_l(x_i)$ is the probability of the occurrence of word $x_i$ estimated by the language
model $l$ and $N$ is the number of words in the document.
Lower perplexity values are generally desired, and the best values are taken here to be those
close to the chosen number of topics. A perplexity value of 19.64 was recorded, which can be
considered a good fit for the topic modelling.
The scatter plot of the distribution of words in Figure 5.3 shows a relatively uniform
distribution of words across the corpus and hence further suggests that LDA performed well in
modelling the words into topics within the corpus.
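As a small worked illustration of the measure, perplexity can be computed as 2 raised to the negative mean log2 probability that the model assigns to each word. The sketch below uses made-up word probabilities rather than output from the fitted model.

```python
import math

def perplexity(word_probs):
    """Perplexity of a word sequence, given the model's probability for each word."""
    n = len(word_probs)
    mean_log2 = sum(math.log2(p) for p in word_probs) / n
    return 2 ** (-mean_log2)

# A model that spreads probability evenly over 20 choices has perplexity 20.
print(perplexity([1 / 20] * 50))
# Assigning higher probability to the observed words lowers the perplexity.
print(perplexity([0.5, 0.25, 0.5, 0.25]))
```

Perplexity can therefore be read as the effective number of equally likely choices the model faces for each word.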


Figure 5.2 perplexity value for LDA topic modelling


Figure 5.3 scatter plot for LDA topic modelling







5.2 SENTIMENT ANALYSIS RESULTS
| Probability value | 0.55 | 0.60 | 0.65 | 0.70 | 0.75 |
| Positive (%) | 76.6 | 71.4 | 66.9 | 59 | 53.1 |
| Negative (%) | 14.7 | 11.8 | 9.2 | 7.5 | 5.3 |
| Neutral (%) | 8.7 | 16.8 | 23.9 | 33.5 | 41.4 |

Figure 5.3 Results of sentiment analysis


Figure 5.4 A plot of sentiment score
Using a prebuilt analyser, each individual tweet in the corpus is classified into one of the 3
polarities stated. This works by setting a probability threshold in the range 0 to 1. For
example, if 0.6 is chosen, a tweet is classified as positive only when the classifier gives it a
probability of at least 0.6 of being positive rather than negative, and vice versa for negative.
Results are reported for the values 0.55, 0.6, 0.65, 0.7 and 0.75.
From the results in Figure 5.4, most of the tweets are classified as positive, compared with both
negative and neutral. However, as we increase the probability from 0.55 to 0.75 the neutral share
increases, which is consistent with tweets that reach the threshold for neither polarity being
treated as neutral.
Using a prebuilt classifier presents us with some challenges, which include our inability to
determine the accuracy of the classifier and to train the classifier on the specific lexicon used
in our corpus. Unlike topic modelling, where results are based on the terms used in the corpus,
sentiment analysis classifies each tweet according to how the classifier model or algorithm
works. It is therefore difficult to validate the accuracy of the classifier used for this task.
The results, however, present a generally positive polarity for the corpus.
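The effect of the probability threshold can be sketched as follows. This is an assumed reading of the procedure, in which a tweet that reaches the threshold for neither the positive nor the negative class is labelled neutral; the positive-class probabilities are hypothetical and do not come from the analyser used in this study.

```python
from collections import Counter

def classify(p_pos, threshold):
    """Label a tweet from its positive-class probability (negative is 1 - p_pos)."""
    if p_pos >= threshold:
        return "positive"
    if 1.0 - p_pos >= threshold:
        return "negative"
    return "neutral"  # assumption: neither class is confident enough

def polarity_shares(p_pos_values, threshold):
    """Percentage of tweets assigned to each polarity at the given threshold."""
    labels = Counter(classify(p, threshold) for p in p_pos_values)
    n = len(p_pos_values)
    return {label: 100.0 * labels[label] / n
            for label in ("positive", "negative", "neutral")}

# Hypothetical positive-class probabilities for eight tweets.
p_pos_values = [0.9, 0.8, 0.62, 0.58, 0.5, 0.3, 0.1, 0.72]
for threshold in (0.55, 0.75):
    print(threshold, polarity_shares(p_pos_values, threshold))
```

Under this reading, raising the threshold can only move tweets out of the positive and negative classes and into neutral, which matches the direction of the trend in the reported results.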

5.3 SUMMARY OF FINDINGS

The analysis was conducted in three stages, and while the term frequency and LDA models showed
some interesting results, not much can be said about the sentiment analysis. In contrast to the
seemingly negative terms and topics, e.g. stophs2, which featured prominently in the first two
analyses, the general sentiment of the corpus is shown to be positive. This makes it quite
challenging to draw a general conclusion about the overall sentiment expressed in the data we
have. Based on the result of the prebuilt classifier, we can only conclude that the general
sentiment expressed in the corpus is positive. This, however, cannot be extended into a
conclusive statement that the citizenry are in favour of the project rather than against it.
Aside from these points, we must bear in mind that the data collection was time bound, so the
data will only represent the sentiment at the time of collection.

5.4 FUTURE WORK

A major limitation to the implementation of the sentiment analysis was the lack of a manually
labelled corpus with which to train a classifier specific to our domain. Domain-specific lexicons
are important in improving the accuracy of the classifier. Developing such a corpus is therefore
a possible way to improve upon this work in future. Other possible considerations are using
other features such as part-of-speech (POS) tags and n-grams.
A novel area for future research is to implement a real-time sentiment analyser using a mobile
application. This involves building an app that connects government agencies to their citizens
so as to enable citizens to send their sentiments to the various government agencies. The app
could then classify the sentiment.
Other case studies that involve public debates, such as fracking and immigration, can also be
carried out.







CHAPTER SIX

6 CONCLUSION

Extensive research in the field of sentiment analysis has been presented. This research sought
to apply this important field of learning to an emerging discipline. SA has traditionally been
researched and applied in product and movie reviews, and there are many publicly available
datasets that make undertaking such SA tasks somewhat easier. In the application of SA in this
research, although the same methods can be applied, the nature of the data makes some
difference, owing to the domain-specific lexicons contained in the data.
Most of the objectives set out were largely achieved, i.e. the general research on SA and both
the term frequency analysis and topic modelling. The challenge, however, was using either a
lexicon dictionary or building our own classifier to score and classify our corpus instead of
using a general prebuilt classifier.
The research has presented a good understanding of SA and its importance as a business
intelligence tool. For this particular subject area, it offers public administrators a very
interesting tool for both decision making and customer relationship management. Social media has
created a platform which needs to be taken advantage of by analysing the various public
sentiments expressed.
With respect to the case study, though interesting results were obtained from the data, no
conclusive statements were made about what the UK citizenry feels towards HS2; there were rather
only interesting results, such as the stophs2 campaign and the campaign to extend HS2 to
Liverpool. Also, from the topic modelling there seems to be much concern over the cost of the
project. Another interesting revelation, from the term frequency analysis, was excitement about
the prospect of job creation, particularly in London.
A very good insight into undertaking sentiment analysis has been gained. The work presented is
a good basis for further work to be carried out, either in this same case study or by extending
to other areas.
In conclusion, this research has served as a useful process for better understanding how best
to apply SA in practice to a given task, especially in domain-specific areas. SA is a beneficial
tool which governments should use in their public administration duties, so as to remain truly
responsive to the concerns of citizens.





BIBLIOGRAPHY

Barbosa, L., & Feng, J. (2010). Robust Sentiment Detection on Twitter from Biased and
Noisy Data. Proceedings of the 23rd International Conference on Computational
Linguistics, (pp. 36-44).
Bermingham, A., Conway, M., McInerney, L., O'Hare, N., & Smeaton, A. F. (2009).
Combining Social Network Analysis and Sentiment Analysis to Explore the Potential
for Online Radicalisation. International Conference on Advances in Social Network
Analysis and Mining.
Feinerer, I., Hornik, K., & Meyer, D. (2008). Text Mining Infrastructure in R. Journal of
Statistical Software.
Maindonald, J. (2008). Using R for Data Analysis and Graphics Introduction, Code and
Commentary.
Mislove, A., Lehmann, S., Ahn, Y.-Y., Onnela, J.-P., & Rosenquist, J. (2010). Pulse of the
Nation. Retrieved 8 15, 2014, from Pulse of the Nation:
http://www.ccs.neu.edu/home/amislove/twittermood/
Nemschoff, M. (2014, June 28). A Quick Guide to Structured and Unstructured Data.
Retrieved 08 04, 2014, from Smart Data Collective:
http://smartdatacollective.com/michelenemschoff/206391/quick-guide-structured-and-
unstructured-data
Pak, A., & Paroubek, P. (2010). Twitter as a Corpus for Sentiment Analysis and Opinion
Mining.
Taboada, M., Brooke, J., Tofiloski, M., Voll, K., & Stede, M. (2011). Lexicon-Based Methods
for Sentiment Analysis. Computational Linguistics, 267-307.
Whitelaw, C., Garg, N., & Argamon, S. (2005). Using appraisal groups for sentiment analysis.
Proceedings of the ACM SIGIR Conference on Information and Knowledge, (pp. 625-
631).
Data preprocessing. (2014, April 18). Retrieved April 18, 2014, from TechTarget:
http://searchsqlserver.techtarget.com/definition/data-preprocessing
dictionary.com. (2014, April 5). Retrieved April 5, 2014, from dictionary.com:
http://dictionary.reference.com/browse/algorithm
Hs2 story. (2014, 06 16). Retrieved 06 16, 2014, from HS2 engine for growth:
http://www.hs2.org.uk/about-hs2/high-speed-rail-hs2/hs2-story
Abbasi, A. (2007). Affect intensity analysis of dark web forum. Proceedings of Intelligence
and Security informatics, (pp. 282-288).



Aggarwal, C. C. (2011). Social Network Data Analytics. New York, London: Springer.
Aggarwal, C. C., & Zhai, C. (2012). A survey of Text Classification Algorithms. In C. C.
Aggarwal, & C. Zhai, Mining Text Data (pp. 163-213). Springer.
Aggarwal, C. C., & Zhai, C. (2012). Mining Text Data. New York: Springer.
Antonio do Prado, H., & Ferneda, E. (2008). Emerging Technologies of text Mining:
Techniques and Applications. London, Hershey PA : Information Science Reference.
Arnott, D., & Pervan, G. (2005). A critical analysis of Decision Support Systems research.
Journal of Information Technology, 67-87.
Arunachalam, R., & Sarkar, S. (2013). The New Eye of Government: Citizen Sentiment
Analysis in Social Media. IJCNLP 2013 Workshop on Natural Language Processing
for Social Media (SocialNLP), (pp. 23-28). Nagoya.
Asur, S., & Huberman, B. (2010). Predicting the future with social media.
Azevedo, A., & Santos, M. F. (2012). Closing the Gap between Data Mining and Business
Users of Business Intelligence Systems: A Design Science Approach. International
Journal of Business Intelligence Research, 14-53.
Bautin, M., Ward, C. B., Patil, A., & Skiena, S. S. (2010). Access: News and Blog Analysis for
the Social Sciences, (pp. 1229-1232).
BBC. (2014, 06 25). Reaction as HS2's second phase details unveiled. Retrieved June 25,
2014, from BBC news UK: http://www.bbc.co.uk/news/uk-21229602
Benamara, F., Cesarano, C., Picariello, A., Reforgiato, D., & Subrahmanian, V. (2007).
Sentiment analysis: Adjectives and adverbs are better than adjectives alone.
International Conference on Weblogs and Social Media.
Berry, M. J., & Linoff, G. S. (2004). Data Mining Techniques for marketing, Sales, and
Customer Relationship Management. In Data Mining Techniques for marketing,
Sales, and Customer Relationship Management (p. 8). Indianapolis, Indiana: Wiley
Publishing.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine
Learning Research, 996.
Bollen, J., Mao, H., & Pepe, A. (2011). Modeling Public Mood and Emotion: Twitter Sentiment
and Socio-Economic Phenomena. Proceedings of the Fifth International AAAI
Conference on Weblogs and Social Media, (pp. 450-453).
Butcher, L. (2014). Standard Notes House of Commons. Library House of Commons.
Ceron, A., Curini, L., Iacus, S. M., & Porro, G. (2013). Every tweet counts? How sentiment
analysis of social media can improve our knowledge of citizens' political preferences
with an application to Italy and France. Italy: Sage.
City of York. (2014, April 3). City of York Council. Retrieved 08 07, 2014, from City of York
Council:
http://www.york.gov.uk/news/article/476/connected_cities_cities_united_voice_on_hs2
Crain, S. P., Zhou, K., Yang, S.-H., & Zha, H. (2012). Dimensionality Reduction And Topic
Modelling: From Latent Semantic Indexing To Latent Dirichlet Allocation And Beyond.
In Mining Text Data (pp. 129-156). Springer.
Dave, K., Lawrence, S., & Pennock, D. M. (2003). Mining the peanut gallery: Opinion
extraction and semantic classification of product review. Proceedings of the 12th
International World Wide Web Conference, (pp. 519-528).
Encyclopedia Britannica. (2014, June 9). Government: Encyclopedia Britannica. Retrieved
June 9, 2014, from Encyclopedia Britannica:
http://www.britannica.com/EBchecked/topic/240105/government
Fang, Y., Si, L., Somasundaram, N., & Yu, Z. (2012). Mining Contrastive Opinions on Political
Texts using. Retrieved April 23, 2014, from Purdue.edu:
https://www.cs.purdue.edu/homes/lsi/WSDM_2012.pdf
Ghiassi, M., Skinner, J., & Zimbra, D. (2013). Twitter Brand Sentiment Analysis. A hybrid
system using n-gram analysis and dynamic atific. Expert Systems with Applications.
Go, A., Bhayani, R., & Huang, L. (2009). Twitter Sentiment Classification using Distant
Supervision.
Golfarelli, M., Dario, M., & Rizzi, S. (2004). The dimensional fact model: a conceptual model
for data warehouses. International Journal of Cooperative Information Systems, 215-
247.
Grimes, S. (2014, May 15). Break Through analysis. Retrieved May 15, 2014,
from breakthroughanalysis:
http://breakthroughanalysis.com/2012/09/10/typesofsentimentanalysis/
Gundecha, P., & Liu, H. (2012). Mining social media: a brief introduction. Tutorials in
Operations Research, 1-17.
Gundecha, P., & Liu, H. (2014, April Monday). http://www.public.asu.edu/. Retrieved
AprilMonday 2014, from http://www.public.asu.edu/:
http://www.public.asu.edu/~pgundech/book_chapter/smm.pdf
Han, J., & Kamber, M. (2005). Data Mining Concepts and Techniques. Morgan Kaufmann.
Hatzivassiloglou, V., & Wiebe, J. (2000). Effects of adjective orientation and gradability on
sentence subjectivity. International Conference on Computational Linguistics.
Horak, R. (2007). Telecom Dictionary, A comprehensive reference for telecommunications
terminology. Indianapolis: Wiley Publications.
Hu, M., & Liu, B. (2005). Mining and summarizing customer reviews. Proceedings of the
conference on Human Language Technology and Empirical Methods in Natural.
Hu, X., & Liu, H. (2012). Text Analytics in Social Media. In C. C. Aggarwal, & C. Zhai, Mining
Text Data (pp. 385-408). New York, London: Springer.
IPSOS. (2013). High Speed Two: Exceptional Hardship scheme for phase two. Social
Research Institute.



Johnson, J. (2012, November 14). Structured Data vs. Unstructured Data. Retrieved 08 04,
2014, from KPI Partners: http://www.kpipartners.com/blog/bid/137981/Structured-
Data-vs-Unstructured-Data
Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval.
Journal of Documentation, 11-21.
Kaplan, A. M., & Haenlein, M. (2010). "Users of the world, unite! The challenges and
opportunities of social media". Business Horizons, 53 (1), p. 61.
Kennedy, A., & Inkpen, D. (2006). Sentiment classification of movie reviews using.
Computational Intelligence, 22, 110-125.
Kim, S. M., & Hovy, E. (2004). Determining the sentiment of opinions.
Proceedings of the 20th International Conference on Computational Linguistics.
Kwak, H., Lee, C., Park, H., & Moon, S. (2009). What is Twitter, a social network or a
news media?
Liu, B. (2006). Web Data Mining Chapter Opinion Mining. Springer.
Liu, B., & Zhang, L. (2012). A Survey Of Opinion Mining And Sentiment Analysis. In C. C.
Aggarwal, & C. Zhai, Mining Text Data (pp. 415-452). New York: Springer.
Liu, B., Hu, M., & Cheng, J. (2005). Opinion Observer: analysing and comparing opinions on
the web. Proceedings of the international conference on World Wide Web.
Lonnqvist, A., & Pirttimaki, V. (2006). The Measurement of Business Intelligence. Information
Systems Management Journal, 32-40.
Luftman, J., & McLean, E. R. (2012). Key issues for IT executives. MISQ Executive, 89-104.
McKinney, W. (2013). Python for Data Analysis. Sebastopol: O'Reilly.
Mejova, Y. (2009). Sentiment Analysis: An Overview. Iowa.
Morstatter, F., Kumar, S., Liu, H., & Maciejewski, R. (n.d.). Public.asu.edu. Retrieved June 23,
2014, from public.asu.edu:
http://www.public.asu.edu/~huanliu/papers/kdd2013demoFM.pdf
Na, C. J., Sui, H., Khoo, C., Chan, S., & Zhou, Y. (2004). Effectiveness of simple linguistic
processing in automatic sentiment classification of product reviews. Conference of
the International Society of Knowledge Organization, (pp. 49-54).
Nenkova, A., & McKeown, K. (2012). A survey of Text Summarization Techniques. In Mining
Text Data (pp. 43-66). Springer.
Pang, B., & Lee, L. (2008). Opinion Mining and Sentiment Analysis.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment Classification using
Machine Learning. Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP) (pp. 79-86). Philadelphia: Association for
Computational Linguistics.



Popescu, M. A., & Etzioni, O. (2005). Extracting product features and opinions from reviews.
Proceedings of the conference on Human Language Technology and Empirical.
Rajaraman, A., & Ullman, J. D. (2011). Mining of Massive Datasets. Cambridge University
Press.
Rehfeld, A. (2005). Towards a General Theory of Political Representation. Journal of Politics,
1-53.
Russell, M. A. (2013). Mining the social web. Sebastopol: O'Reilly Media.
Sakaki, T., Okazaki, M., & Matsuo, Y. (2010). Earthquake shakes Twitter Users: Real-Time
Event Detection by social sensors.
Savigny, H. (2002). Public opinion, political communication and the Internet. pp. 1-8.
Schellong, A. (2008). Citizen Relationship Management. Frankfurt: Peter Lang.
Search Engine watch. (2014, July 3). Worldwide Social Media Usage Trends in 2012.
Retrieved July 3, 2014, from Search Engine watch:
http://searchenginewatch.com/article/2167518/Worldwide-Social-Media-Usage-
Trends-in-2012
Subasic, P., & Huettner, A. (2001). Affect analysis of text using fuzzy semantic typing.
IEEE-FS, 483-496.
Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data mining. In Introduction to
Data mining (p. 2). Boston: Pearson Education Inc.
Thomas, M., Pang, B., & Lee, L. (2006). Get out the vote: Determining support or opposition from
congressional floor-debate transcripts. Proceedings of the Conference on Empirical
Methods in Natural Language Processing, (pp. 327-335).
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley Publishing Company.
Turney, P. (2002). Thumbs up or thumbs down?: Semantic orientation applied to unsupervised
classification of reviews. In Proceedings of the 40th Annual Meeting on Association
for Computational Linguistics, (pp. 417-424).
Twitter. (2012, March 21). Blog.twitter. Retrieved June 24, 2014, from Blog.twitter:
https://blog.twitter.com/2012/twitter-turns-six
UN. (2012, December 05). UN public administration programme. Retrieved June 04, 2014,
from unpan1.un.org:
http://unpan1.un.org/intradoc/groups/public/documents/un/unpan050896.pdf
Wickre, K. (2013, March). Celebrating Twitter 7. Retrieved 06 3, 2014, from Twitter.com:
http://blogtwitter.com/2013/03/celebrating-twitter7.html
Wiebe, W. T., & Hoffmann, J. (2005). Recognising contextual Polarity in Phrase-level
sentiment analysis. Proceedings of HLT-EMNLP.
Wilks, Y., & Stevenson, M. (1998). The grammar of sense: Using part-of-speech tags as a
first step in semantic disambiguation. Journal of Natural Language Engineering,
135-144.



Xin, J., Gallagher, A., Cao, L., Luo, J., & Han, J. (2010). The wisdom of social multimedia.
Proceedings of ACM multimedia 2010 international conference, (pp. 1235-1244).
Firenze: ACM.
Yin, R. K. (1984). Case study research. Newbury Park, CA: Sage.
Yu, H., & Hatzivassiloglou, V. (2003). Towards answering opinion questions: Separating
facts from opinions and identifying the polarity of opinion sentences. Proceedings of
the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Zabin, J., & Jefferies, A. (2008). Social Media Monitoring and Analysis: Generating consumer
insights from online conversation. Aberdeen group Benchmark Report.
Zarrella, D. (2010). The Social Media Marketing book. North Sebastopol CA: O'Reilly media.
Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., et al. Comparing Twitter and
Traditional Media Using.




























APPENDIX

1. CODE FOR EXTRACTING STREAMING TWEETS


2. LDA TOPIC MODELLING CODES
