1.1 Motivation
Online customer reviews have become today's word of mouth for the current
generation of buyers and sellers. They influence both product sales, through
consumer decision-making, and quality improvement by business firms. It is
estimated that the question "Was this review helpful to you?" brings in about $2.7
billion of additional revenue to Amazon.com (Spool, 2009). Since thousands of
reviews are constantly posted even for a moderately popular product, not every
review gets a fair chance of being viewed.
Hence we decided to identify the features which make a review helpful, so that
when reviews are sorted based on these feature values, the best reviews get a fair
chance of being viewed by customers.
1.2 Scope
This project can be useful to e-commerce websites which hold a large volume of
reviews for their products and need to surface the best reviews for their
customers. Our model determines how much a particular factor influences the
helpfulness of a review, so that reviews can be sorted based on those results.
Many e-commerce websites, for example Amazon, sort reviews by helpfulness as
measured by the question "Was this review helpful to you?", and also by time.
Using our approach, such sites can sort the best reviews to the top while giving
every review a fair chance to be viewed.
1.3 Problem definition
2. Literature Survey
Some research papers, survey papers and articles related to predicting the
helpfulness of online reviews have been studied, and their respective advantages
and disadvantages are described in Table No. 1. Based on the literature survey, a
suitable implementation has been selected as the main motivation for the
project.
Literature Survey Table (Table No. 1)
Title | Year | Source | Author | Approach used | Dataset used
Cleaning the data of garbage and misplaced values to avoid noise in results
Extracting data from the datasets into spreadsheets, i.e. from JSON objects to
CSV files
Finding text polarity using lexical analysis based on SentiWordNet, a lexical
resource for opinion mining
Extracting features such as adjectives, nouns, average word length and average
sentence length using the NLTK library in Python
Extracting the remaining features, such as rating and helpfulness values, directly
from the data
Computing feature values for Dale-Chall and Flesch readability using their
formulas
Implementing a Gradient Boosting classifier to get results for all the extracted
feature values
Implementing a Random Forest classifier to get results
Comparing and analysing the results from the above two techniques
Pipeline: Source Data → Feature Extraction → Training Classifier → Testing
Classifier → Analysing Results
3.3 Data Collection :
We have collected online consumer review data (electronics products) from
Amazon.com.
Source : http://jmcauley.ucsd.edu/data/amazon/
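The dataset is distributed as JSON objects, one review per line. A minimal sketch of the extraction step described above (JSON objects to CSV); the field names reviewText, summary, overall and helpful follow the dataset's schema, and the single synthetic record is only for illustration:

```python
import csv
import json

def reviews_to_csv(json_lines, csv_path):
    """Convert one-review-per-line JSON records to a CSV file,
    keeping only the fields used later for feature extraction."""
    fields = ["reviewText", "summary", "overall", "helpful"]
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(fields)
        for line in json_lines:
            review = json.loads(line)
            writer.writerow([review.get(f) for f in fields])

# Example with one synthetic record in the dataset's format:
record = '{"reviewText": "Works great.", "summary": "Good", "overall": 5.0, "helpful": [8, 10]}'
reviews_to_csv([record], "reviews.csv")
```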
2) Title Polarity :
Title polarity is calculated in the same way as review polarity is calculated.
3) Ratings :
Ratings of each review are directly obtained from the dataset.
4) Dale-Chall Readability :
The Dale-Chall score signifies how difficult a text is to read and
understand. Dale-Chall uses a list of about 3,000 familiar words, and any
word not on the list is counted as a difficult word.
The Flesch value indicates how difficult a text is to read. The higher the
value, the easier the text is to read.
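The Flesch reading-ease score follows the standard formula 206.835 - 1.015 (words/sentences) - 84.6 (syllables/words). A sketch of computing it is below; note the syllable counter is a naive vowel-group heuristic, not the dictionary-based count the formula assumes:

```python
import re

def count_syllables(word):
    # Naive heuristic: count groups of consecutive vowels (min. 1 per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Split on sentence-ending punctuation; keep non-empty sentences.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

For the one-syllable sentence "The cat sat on the mat." this gives 206.835 - 1.015 * 6 - 84.6 * 1 = 116.145, near the top of the scale (very easy to read).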
10) Capital words :
The number of fully capitalised words in the review is counted using the
isupper() method of Python strings.
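A minimal sketch of this feature (the function name is ours for illustration):

```python
def count_capital_words(review):
    """Count fully-capitalised words (e.g. "GREAT") in a review text."""
    return sum(1 for word in review.split() if word.isupper())

count_capital_words("This phone is GREAT but the BATTERY dies fast")  # 2
```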
11) Helpfulness :
If a review is helpful it is labelled 1, and if it is not helpful it is
labelled -1. Helpfulness is computed from the helpful column of each review.
This column holds data of the form [a, b], where a is the number of
helpfulness votes and b is the total number of votes. If the ratio of a to b
is greater than 0.7 the review is considered helpful (1); otherwise it is
unhelpful (-1).
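The labelling rule above can be sketched as follows; the handling of reviews with zero votes is our assumption, since the text does not specify it:

```python
def helpfulness_label(helpful, threshold=0.7):
    """Map a [helpful_votes, total_votes] pair to a class label:
    1 (helpful) when the vote ratio exceeds the threshold, -1 otherwise."""
    helpful_votes, total_votes = helpful
    if total_votes == 0:
        return -1  # assumption: unvoted reviews are treated as unhelpful
    return 1 if helpful_votes / total_votes > threshold else -1

helpfulness_label([8, 10])  # 1
helpfulness_label([3, 10])  # -1
```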
3.5 Training and testing classifier :
Total features extracted : 14
Total data samples : 6500
Helpful reviews : 4500
Unhelpful reviews : 2000
Ensemble Learning :
In ensemble learning, different weak learners are combined to predict the
output, which is more accurate than the predictions of the individual weak
models.
Boosting :
Boosting is a technique in which an initial classification is made by the model;
the examples misclassified by that model are given more weight, the correctly
classified ones less weight, and the next classifier is trained, and this repeats.
Finally, all classifiers are combined with weights to predict the output.
Gradient Boosting :
Gradient boosting is an ensembling technique, which means that prediction
is done by an ensemble of simpler estimators. The aim of gradient boosting is to
create (or "train") an ensemble of trees, given that we know how to train a single
decision tree. This technique is called boosting because we expect an ensemble to
work much better than a single estimator.
We used the Gradient Boosting classifier to find the influence of each variable on
the helpfulness of a review. To implement it we used scikit-learn, a Python
library of machine learning algorithms.
sklearn.ensemble.GradientBoostingClassifier(loss='deviance', learning_rate=0.1,
n_estimators=100, min_samples_split=2, min_weight_fraction_leaf=0.0,
max_depth=3, random_state=None, max_features=None,
max_leaf_nodes=None)
Parameters :
n_estimators : the number of boosting iterations in the gradient boosting
classifier.
We have trained the classifier with different parameter values. We have fixed the
parameter values such that when we test our classifier with the same training
data, it gives 100 percent accuracy.
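A hedged sketch of this step: training the scikit-learn classifier and reading each feature's influence from its feature_importances_ attribute. The synthetic 3-column data below stands in for the 14 extracted review features; the label is driven only by the first column, so its importance should dominate:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 3)                 # stand-in for the 14 review features
y = (X[:, 0] > 0.5).astype(int)      # label depends only on the first feature

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=0)
clf.fit(X, y)

# feature_importances_ sums to 1; larger values mean a feature has more
# influence on the predicted helpfulness.
importances = clf.feature_importances_
```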
Collection of data sets and selection of suitable techniques for feature
extraction and classification.
6. Future Work
Our future scope is to use a better technique for the sentiment analysis behind
text polarity, as our lexicon-based approach cannot handle sarcastic sentences.
Our work is also limited to English words. If a word from another language is
transliterated into English, such as "bakwas" or "mast", the model is unable to
score it because it has no knowledge of that language. Words of other languages
can be handled if they are written in their own scripts, since their polarity can
then be looked up directly in the corresponding dictionary. In future work we
will therefore look for a way to find the polarity of such words that are written
in English but belong to another language.
7. References
[1] Singh, J.P., et al., Predicting the "helpfulness" of online consumer reviews,
Journal of Business Research (2016),
http://dx.doi.org/10.1016/j.jbusres.2016.08.008
[2] Mohammad Salehan, Dan J. Kim, Predicting the performance of online
consumer reviews: A sentiment mining approach to big data analytics (2015),
http://dx.doi.org/10.1016/j.dss.2015.10.006
[3] Alaa Hamouda, Mohamed Rohaim, Reviews Classification Using SentiWordNet
Lexicon (2012), www.academia.edu/download/8292326/123.pdf
[4] http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
[5] https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/
[6] http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
[7] https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/
[8] Spool, J. (2009). The magic behind Amazon's 2.7 billion dollar question.
Available online at http://www.uie.com/articles/magicbehindamazon/ (Accessed
on 15th May 2016)