Using Machine Learning on User Mouse Tracking Data

Sparsh Gupta
Pembroke College | Computing Laboratory
University of Oxford

Submitted in partial fulfillment of the requirements for the degree of
Master of Science in Computer Science

September 2009
ABSTRACT
Websites are becoming more and more dynamic, but not intelligent. Based on certain mouse clicks or user choices, today's dynamic websites can mold themselves, but they cannot intelligently predict relevant data. The data contained in today's websites is growing, and the number of users demanding unique, different information is also ever increasing. This has created the challenging problem of delivering the right content to every user. This thesis is an original work concentrating on solving this problem of generating relevant content for each individual user. One of the primary inputs used by the project is the mouse movement behavior of the user. If the website capturing mouse movements is built in such a way that the mouse pointer stays mostly close to the user's point of gaze, then tracking the mouse movement behavior would, in theory, amount to tracking the user's eyes. Based on this mouse movement data, further content can be predicted and personalized for each user using one or more machine learning models. This thesis proposes a complete methodology for building and implementing such a system. As a proof of concept, an online shopping website has been built, and tests have been conducted which gave a remarkable accuracy of 84.09% when compared with the actual needs of the user. A working demonstration of the project, along with its description, is available online at http://sparshgupta.name/MSc/Project
Keywords: adaptive web, machine learning, mouse movement, gaze point
ACKNOWLEDGEMENT
I am heartily thankful to my supervisor, Dr. Vasile Palade, whose encouragement, guidance, confidence in my idea and support from the initial to the final level enabled me to develop this project and understand the subject. I am thankful to the Computing Laboratory, University of Oxford, for accepting my proposal and giving me an opportunity to work on this idea. I gratefully acknowledge the support and help of all the volunteers who helped me collect the data for my work. I would like to thank Prof. Luke Ong and Pembroke College for their co-operation and readiness to help me whenever needed. I would also like to acknowledge the efforts and facilities provided by the staff of the Computing Laboratory Library, the Radcliffe Science Library and the Pembroke College Library. Lastly, I offer my regards to my parents, my sister and my friends, who supported me in all respects throughout the completion of this project.
Sparsh Gupta
TABLE OF CONTENTS

Abstract ................................................................................. ii
Acknowledgement ................................................................. iii
Table of Contents .................................................................. iv
Table of Figures ..................................................................... ix
Introduction ........................................................................... 1
1.1 A Primer ........................................................................... 1
1.1.1 The World Wide Web .................................................... 1
1.2 Motivation ........................................................................ 4
1.3 Objectives ........................................................................ 5
Background, Literature review and Project overview ............. 8
2.1 Coordination of mouse and eye movements ..................... 8
2.4 Discussion ...................................................................... 11
2.5 Project overview ............................................................. 12
Data Collection and Pre-processing ...................................... 15
3.1 The initial website .......................................................... 15
3.1.1 Specifications .............................................................. 15
3.1.2.2 Database Design ...................................................... 20
3.1.2.3 Implementing mouse tracking .................................. 22
3.1.2.4 Final product bought by the user .............................. 25
3.3.2 Implementation ........................................................... 28
3.3.2.1 Data compilation ...................................................... 28
3.3.2.2 Data cleaning ........................................................... 30
Building machine learning models ....................................... 34
4.1 Machine Learning ........................................................... 34
4.1.1 WEKA .......................................................................... 35
4.2 Methods evaluated ......................................................... 35
4.2.1 Decision Tree .............................................................. 36
4.2.2 Neural Network ........................................................... 36
4.3 Implemented algorithms ................................................. 37
4.3.1 Decision Tree (C4.5) .................................................... 38
4.3.2 Neural Network (Multilayer Perceptron) ...................... 39
4.4 Model building ............................................................... 39
4.4.1 Decision Tree .............................................................. 40
4.4.1.2 Testing the decision tree .......................................... 45
4.4.1.2.1 Testing on Training Data ....................................... 45
4.4.2.2.1 Testing on Training Data ....................................... 52
4.4.2.2.2 Testing by Cross-Validation (folds 10) ................... 52
4.4.2.2.3 Discussion ............................................................ 53
Embedding the machine learning models in the website ...... 56
5.1 What and Why? ............................................................... 56
5.3.2 Implementing the Neural Network model .................... 59
5.4 Using model outputs ...................................................... 60
Testing and Results .............................................................. 64
6.1 Testing methodology ...................................................... 64
6.2 Testing for model accuracy ............................................. 64
6.2.1 Testing data collection ................................................ 65
6.2.2.1 Decision Tree model ................................................ 67
6.3 Testing time performance of the models ........................ 69
6.3.1 Decision Tree model ................................................... 70
Bibliography ......................................................................... 78
Appendix: Source Code ........................................................ 82
HTML final webpage ............................................................. 82
The JavaScript file ................................................................ 87
The PHP scripts .................................................................... 92
data.php ............................................................................... 92
connect.php ......................................................................... 92
bought.php .......................................................................... 92
alignData.php ....................................................................... 92
predict.php .......................................................................... 93

TABLE OF FIGURES

Figure 1: Project outline ....................................................... 14
Figure 2: Screenshot of the top half of the developed webpage ... 17
Figure 3: Screenshot of the developed webpage ................... 18
Figure 4: Code given to each section of the webpage ........... 19
Figure 6: Database table 'data' ............................................. 21
Figure 7: Database table 'bought' ........................................ 21
1 INTRODUCTION
This chapter includes a brief overview of a few terms. It then discusses the coordination between eye and mouse movement and how mouse movement data can be used as pseudo eye tracking data. Later, this chapter talks about the motivation behind this project and clarifies the objectives of the research and the structure of this document.
1.1 A Primer
This section of the chapter will discuss a brief history of the World Wide Web (WWW), the use of a computer mouse and the current eye tracking technology. It will later explain how the WWW can be improved by using eye tracking data and how a mouse pointer can be used to collect pseudo eye tracking data.
1.1.1 The World Wide Web
In 1990, CERN launched the world's first website¹, which was only a few lines of text and hyperlinks. In the nineteen years since, websites have been completely revolutionized. Plain text is now accompanied by all sorts of rich media, including images, music, videos, animations, colours, etc. Dynamic data from ever-increasing databases is rapidly replacing the static content of websites. Web servers are now capable of more real-time computing. Data cannot only be shown to a user, but can also be collected from him easily. Recently, the success of AJAX¹ has completely changed the web experience by making it much more interactive and more data driven. Today, the Internet has changed everything: how we do business, how we study, how we connect with friends and, in general, how we live.

¹ CERN, Welcome to info.cern.ch/, http://info.cern.ch/.
1.1.2 The computer mouse device
Most people in the world use a computer-pointing device (generally a mouse) to navigate through a website. They click hyperlinks spread across different sections of a webpage, select text or scroll through a long page using a computer mouse. The mouse can safely be called a personal assistant while working on a computer, and especially while browsing a website.
1.1.3 Eye tracking
Eye tracking, or gaze tracking, is the process of measuring the gaze, i.e., keeping track of the point at which a user is looking. Most websites carry visual information in the form of text, images, graphics, etc., and almost all the information a user obtains from a website is perceived through his eyes. Eye tracking, when applied to a website, can be imagined as a method of determining the portion of the screen at which the user is looking. This information can potentially give a fair idea of the sections most relevant to him. The more time a user spends looking at a particular section, reading it or simply viewing it, the more interested he is in that section compared to the others on the same page.

¹ W3Schools, Ajax, http://www.w3schools.com/Ajax/.
1.1.4 WWW and the missing gap
Websites have started becoming dynamic by accepting inputs from a user, which are then used to select relevant content or information for him. The kinds of input that current websites primarily employ are mouse clicks, key presses, text entered, and choices made by the user in the form elements of the page. This, conversely, means that if the user is not interested in giving any data as input, the website ends up static, without any information on user needs.

Eye tracking data, if captured for a general user, can be used extensively to make today's websites more adaptive and intelligent by harnessing knowledge of the user's interests and of the information he is most interested in. Without seeking any external data from the user, his interests and needs can be determined from his eye movements, and he can be served the data he is most interested in.
1.1.5 Tracking mouse pointer to track user’s eyes
There has been a lot of research into improving the computing experience for a user by tracking his eye location, but there are a few drawbacks associated with it. Firstly, the tracking equipment is expensive and the user needs to physically wear the tracking gadget. Not everyone using the Internet would want to, or can, wear the tracking equipment, and hence general public websites cannot be made dependent on it. There is also ongoing research into determining the movement of the eyes using a camera device but, as of now, the accuracy of determining the gaze position is low; it depends on the movements of the user and the lighting conditions and, most importantly, the user needs to download external software. Because of these limitations of the eye tracking methods, there has been research into finding other alternatives.

Recently, Googlers Kerry Rodden and Xin Fu proposed in their paper (Rodden, et al. 2008) that mouse movements show potential as a way to estimate where the user has looked before deciding where to click. Other studies have provided a reasonable estimate of the coordination of mouse and eye, especially on a page on which a click is likely to happen. Hence, tracking a user's mouse movements can sometimes serve as pseudo eye tracking data.

There are several interface design techniques in Human Computer Interaction with which a website can make sure that, in most cases, the user's mouse pointer is close to his point of gaze. One of the techniques employed in this project is mouse-over cell highlighting. If the content at the current location of the mouse pointer is highlighted to make it stand out from the rest of the page, then this can almost always ensure that the movement of the mouse pointer is synchronized with the area the user is currently reading or gazing at.
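The cell-highlighting technique can be sketched as a small state machine: at any moment exactly one section is emphasized, and entering a new section resets the previous one. The sketch below models sections as plain objects so that it is self-contained; on a real page the same logic would live in a mouseover handler that toggles a CSS class. The section names are illustrative, not the ones used in the project.

```javascript
// Sketch of mouse-over cell highlighting: the section under the pointer
// is emphasized, and the previously highlighted section is reset.
// Sections are plain objects here; on a real page these would be table
// cells and `highlighted` would be a CSS class toggle.
function createHighlighter(sections) {
  let active = null; // id of the currently highlighted section
  return function highlight(id) {
    if (active !== null) sections[active].highlighted = false;
    sections[id].highlighted = true;
    active = id;
  };
}

const sections = { ram: { highlighted: false }, hdd: { highlighted: false } };
const highlight = createHighlighter(sections);
highlight('ram'); // pointer enters the RAM cell
highlight('hdd'); // pointer moves on: RAM is reset, HDD is highlighted
```

Because only the area under the pointer stands out, the user's gaze is drawn toward the pointer, which is what keeps the two synchronized.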
1.2 Motivation
Many websites do not ask for any explicit input from the user but can still adapt themselves. They primarily use either some geographical information (which can be obtained from the user's IP address) or the browser/operating system specifications to adapt the web content for the user. This adaptation is, of course, not targeted at an individual user; it is only a broad adaptation to cater to a group of users having similar demographics or preferences. The adaptation of a website can be based on even the smallest bit of information from the user. The more information the website obtains about the user, the better it is capable of adapting to his needs.
The primary medium of interaction of a user with a website is the mouse device, and it produces a huge amount of data in the form of mouse movement behaviour. The motivation behind this thesis and project is the existing gap between the demand for more user data to make a website adaptive and the availability of ample data from the user in the form of his mouse movements. Further, if a website is designed in such a way that, more often than not, the user's mouse pointer movement is synchronized with his point of gaze, as discussed in Section 1.1.5, then this data can also loosely be called eye tracking data.
1.3 Objectives
The objective of this project is to effectively utilize the mouse movement data of a user to make the web content more adaptive for him, by dynamically predicting further relevant content for him. In order to achieve this main objective, the following sub-objectives need to be met:
• Collecting the initial training dataset of mouse movement behavior from a large set of users in order to train and build a model. This will involve developing a website with well-defined areas, sections or elements where mouse movements can be tracked. The website needs to be built in such a way that the user's mouse pointer synchronizes with his point of gaze.
• Asking volunteering users to visit this site and choose or select content for themselves, as they do on any other website. The required data is the time spent at each section/element of the page while the user is browsing it. The target (predicted or dependent) variable is the relevant content for him, and hence, in order to train the model, this data point (as collected explicitly from the user) also needs to be saved in the databases.
• Building machine learning models using the collected mouse movement data as the training and initial testing dataset. The distribution of normalized time spent by the user at each section of the webpage would be the independent variables, and hence the input attributes of the model; further content for the user will be the output of the model, as the dependent variable.
• Embedding the machine learning models back into the website so that the models can be put to use. The website would continue tracking users' mouse movements and would use the built model to compute further content for them in real time.
• Testing the accuracy of the implementation. To do this, the predicted content needs to be compared with the actual content desired by the user.
To demonstrate the objectives, a sample shopping webpage has been developed. This webpage contains a comparison of the specifications of five laptop models. Based on the mouse movement behavior of a user across this page, the best laptop is recommended to him.
This can be visualized as follows: if a user has a browsing pattern that signifies that he is spending, say, 40% of his time reading about the RAM of the laptops (with a further distribution of time spent on the different RAM sizes of the different models), 30% of his time reading about the processors, 20% about the Hard Disk Drives and the remaining 10% similarly reading about other specifications, then, based on this data and the developed machine learning model, the most suitable laptop can be recommended to him. The accuracy of the recommendation can be checked by comparing the product finally bought by the user with the product recommended by the website.
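A percentage distribution of this kind is exactly the feature vector the models consume: raw dwell times per section, normalized so that they sum to one. A minimal sketch of that normalization step, using hypothetical section names rather than the project's actual section codes:

```javascript
// Sketch: turn raw per-section dwell times (in milliseconds) into the
// normalized time distribution used as the model's input attributes.
// The section names are hypothetical illustrations.
function normalizeDwellTimes(dwellMs) {
  const total = Object.values(dwellMs).reduce((sum, t) => sum + t, 0);
  const features = {};
  for (const [section, t] of Object.entries(dwellMs)) {
    features[section] = total > 0 ? t / total : 0;
  }
  return features;
}

// The distribution discussed in the text (40% RAM, 30% processor, ...):
const features = normalizeDwellTimes({
  ram: 40000, processor: 30000, hdd: 20000, other: 10000,
});
// features.ram === 0.4, features.processor === 0.3, ...
```

Normalizing makes users comparable regardless of how long they browse in absolute terms: a fast reader and a slow reader with the same relative interests produce the same feature vector.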
1.4 Structure of the dissertation
This document starts by giving an idea of the related research being done across the globe. It then explains the complete implementation outline as the big picture of the project. In Chapter 3, the thesis discusses the methodology for collecting the initial training data, which also involves a complete description of the development procedure of the initial website. It explains the process of collecting data, along with the structures of the databases and the data cleaning procedure. Chapter 4 gives the details of the machine learning models built and the procedure involved, along with the testing results of the models obtained on the training data. Chapter 5 explains the procedure adopted to implement the built models in the website and the details of the AJAX communication link between the model, the data and the website. The thesis then explains the methodology used to collect the testing data, followed by the testing methodology and the results obtained for the models. The thesis closes with some conclusions and the author's view on the possibility of future work. The attached appendix contains all the source code. A working demonstration of the project, along with its documentation and the GNU General Public License source code, is available online at http://sparshgupta.name/MSc/Project
2 BACKGROUND, LITERATURE REVIEW AND PROJECT OVERVIEW
This chapter explains the previous work related to the problem, already going on around the world. The chapter is divided into different sections explaining independent and combined work going on, or already done, under each heading. The chapter later summarizes the ongoing work and also presents an overview of the project carried out by the author. The work done in the project is an original idea, and there is no record of any work using the same methodology. The problem has been tackled to some extent and has been considered by a few research groups, but their methodologies and final conclusions were very different from what has been proposed in this thesis. The following parts of the chapter highlight some of the recent developments and work done in related fields.
2.1 Coordination of mouse and eye movements
The prime question, of whether mouse tracking can substitute for, or at least partially replicate, eye tracking, is an active one. (Chen, Anderson and Sohn 2001) studied the relationship between the gaze position of a user and his cursor position on a computer screen during web browsing. They conducted tests on several websites, recorded the eye and mouse movements of the users and studied them separately. They concluded that there is a strong relationship between gaze position and cursor position, and also that there are regular patterns in the coordination. They have also argued that a mouse could provide more information than just x and y coordinates, which could be used to design better interfaces for human computer interaction. They wrote in their conclusion that "Our data show that the dwell time of cursor among different regions has strong correlation to how likely a user will look at that region. Also, in over 75% of chances, a mouse saccade will move to a meaningful region and, in these cases, it is quite likely that the eye gaze is very close to the cursor. This result implies that, by predicting the users' interests on web pages, mouse device could be a very good alternative to an eye-tracker as a tool for usability evaluation."
According to the work done at Google labs (Rodden, et al. 2008), several different patterns of coordination between the eye and the mouse pointer were observed on a web search results page. The behavior patterns identified as indicating active usage were: following the eye horizontally, following the eye vertically, and marking a particular result. This work was done entirely on a search results page, but it clearly concludes that coordination between a user's eye and his mouse pointer exists.
Studying the coordination between eye movements and mouse movements on the web, they found that some users use the mouse pointer to help them read the page, or to help them make a decision about where to click. It was concluded that, given an intent or opportunity to click in the current user activity, the mouse is much more likely to be close to the eye.
Eye tracking can provide insights into users' behavior on a search results page, but eye-tracking equipment is expensive and can only be used for studies where the user is physically present. The equipment also requires calibration, adding overhead to studies. In contrast, the coordinates of mouse movements on a web page can be collected accurately and easily, in a way that is transparent to the user. This means that mouse tracking can be used in studies involving a number of participants working simultaneously, or remotely through client-side implementations, greatly increasing the volume and variety of the data available.
There is a basic rationale that states: "If I might click, I might as well keep the mouse close to my eyes." Where there is no potential to click, either because the user is in an evaluative mode or because the content of interest is devoid of links, the mouse and eye diverge.
2.2 Capturing mouse movements
There can be several different methodologies for capturing the mouse movement behavior of a user over a webpage. The choice primarily depends upon the type of data required and the mouse movement expected. (Arroyo, Selker and Wei 2006) proposed a tool that needs no installation and is capable of tracking a user's mouse movements. This mouse movement data can be visualized in an inbuilt system and can be used to further refine the usability of the webpage. They have, however, not proposed any methodology to automatically refine the webpage.
(Edmonds, et al. 2007) talks about techniques and uses of mouse tracking on a website, but entirely from a usability point of view. It handles the capturing of a user's mouse movement data in a more detailed way, recording the coordinates and the row and column IDs along with many other parameters. This methodology was found to be effective, but showed no particular significance from the point of view of the current problem.
The paper (Torres and Hernando, Real time mouse tracking registration and visualization tool for usability evaluation on websites n.d.) proposes a methodology to track mouse movements on a webpage and visualize them in a tool that the authors have developed. They have used HTML and AJAX and have proposed a method to link the mouse movements with the server logs and web-stat data to get additional information on the user's behaviour.

There was a famous project named 'Cheese' done at MIT (Mueller and Lockerd 2001), which extended the conventional web interface user model (based on responses to mouse clicks only) to account for all mouse movements on a page as an additional layer of information for inferring user interest. They developed a straightforward way to record all mouse movements on a page, and conducted a user study to analyze and investigate mouse behavior trends; they found certain mouse behaviors common across many users. They also proposed that there are certain categories of mouse behavior and that, after tracking them, the website could be molded accordingly.
2.4 Discussion
The literature review showed that a lot of work has been done to prove and support the coordination of the eye and mouse movements of a user on a website. Eye tracking data has been used by Google to improve the usability of their search pages. There are several ongoing discussions on the effective use of eye or mouse tracking data to manually refine the content and usability design of a webpage. It was, however, found that no work has been done on using mouse tracking data in a machine learning model to automatically refine or predict the content of a website for a user based on his mouse movement or eye movement behavior.
2.5 Project overview
The project undertaken can be stated as a proposed method to automatically refine or predict the contents of a webpage for a user, based on his mouse movement behavior. From earlier studies, as stated in Section 2.1, it has been assumed that there is certainly some coordination between a user's eye movement and his mouse movement. Based on the mouse movements of an individual user, his preferences for content and his needs can be predicted, and this information can further be used by the owners of the website. If not the owners, this information can definitely help the user in finding the right content for him.
To do this, the first task was to devise a methodology to track a user's mouse movements on a webpage. There are several ways in which tracking could be done and, further, several different data points that could be saved for a user based on his mouse movements. This thesis proposes a method to track the time spent by a user in every section of a webpage. Several JavaScript functions were written, and modifications were made to a standard website, to enable mouse tracking in a hidden layer. AJAX was used to connect the JavaScript functions with the server-side PHP scripts, which were in turn connected to MySQL databases for storing the data.
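The bookkeeping behind such tracking can be sketched, in simplified form, as the following state machine. This is not the project's actual code (which appears in the Appendix); DOM events and the AJAX calls to the PHP scripts are omitted, and the names are illustrative.

```javascript
// Simplified sketch of dwell-time bookkeeping for mouse tracking.
// On the real page, enter() would be driven by mouseover events on each
// section, and snapshot() would be flushed to the server via AJAX.
function createTracker() {
  let current = null;   // section currently under the pointer
  let enteredAt = null; // timestamp (ms) when the pointer entered it
  const totals = {};    // accumulated milliseconds per section
  return {
    // Called when the pointer moves over a new section.
    enter(sectionId, timestampMs) {
      if (current !== null) {
        totals[current] = (totals[current] || 0) + (timestampMs - enteredAt);
      }
      current = sectionId;
      enteredAt = timestampMs;
    },
    // Called periodically (or on page unload) to read the totals so far.
    snapshot(timestampMs) {
      if (current !== null) {
        totals[current] = (totals[current] || 0) + (timestampMs - enteredAt);
        enteredAt = timestampMs;
      }
      return { ...totals };
    },
  };
}

const tracker = createTracker();
tracker.enter('ram', 0);
tracker.enter('processor', 500);       // 500 ms were spent on 'ram'
tracker.enter('ram', 800);             // 300 ms were spent on 'processor'
const totals = tracker.snapshot(1000); // 200 ms more on 'ram'
// totals: { ram: 700, processor: 300 }
```

Keeping only per-section totals, rather than raw coordinate streams, keeps the payload sent to the server small and maps directly onto the model's input attributes.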
To demonstrate all this, a new dummy website imitating a shopping portal was developed. Once the website was developed with mouse tracking capabilities, it was made available to the public for two weeks. This was done to collect some initial data on users' mouse movement behavior. The data collected was processed and cleaned before being analyzed and modeled. This complete step of initial website development and data collection is explained in detail in Chapter 3 (Data Collection and Pre-processing).
It was then required to study and analyze the collected data and build a model on it, so that it could be used in the future for new visitors. To do this, WEKA was used and different types of models were built. The models took as independent variables the time spent by the mouse pointer in the different sections of the webpage, and predicted the relevant content for the user as the dependent variable. They were all built and trained on the initially collected data and were tested on the same training data. After several iterations, two models, one based on a Decision Tree and the other on a Neural Network, were obtained that gave significant accuracy on the training data. The complete model-building phase of the project, along with the test results obtained, is explained in Chapter 4 (Building machine learning models).
Once the two models (one Decision Tree and one Neural Network) were obtained, the task was to embed them both into the initial website. This was necessary so that the built models could be used for future visitors, for whom the relevant content could be predicted based on their mouse movement activities. The models were coded in PHP on an Apache server and were connected with the front-end HTML page using AJAX. The PHP scripts were made to read the real time mouse movement data of a given user directly from the MySQL databases and execute the model on it to predict further content for him. The whole procedure is explained in detail in Chapter 5 (Embedding the machine learning models in the website).
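To illustrate what embedding a model amounts to: a learned decision tree reduces to plain nested conditionals that can be ported to any server-side language, as was done here in PHP. The sketch below uses JavaScript with invented splits; the thresholds and labels are not the ones actually learned in Chapter 4.

```javascript
// Sketch of a decision tree "embedded" as nested conditionals, analogous
// to the project's PHP port of its C4.5 model. The splits and labels are
// invented for illustration, not learned from the collected data.
function predictLaptop(features) {
  // features: normalized share of dwell time per section (sums to 1).
  if (features.ram > 0.35) {
    return features.processor > 0.25 ? 'laptop-A' : 'laptop-B';
  }
  if (features.hdd > 0.4) {
    return 'laptop-C';
  }
  return 'laptop-D';
}

const recommendation = predictLaptop({
  ram: 0.4, processor: 0.3, hdd: 0.2, other: 0.1,
});
// recommendation === 'laptop-A'
```

Hand-porting the tree this way avoids running WEKA on the server at prediction time: evaluating a few comparisons per request is cheap enough for real-time use.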
After embedding the two models into the website, volunteers were again asked to visit the website. This time, not only were the user's mouse movements captured, but he was also recommended appropriate content based on one of the two machine learning models. The mouse movement data was saved in the MySQL databases to be analyzed for accuracy later. This step is explained in Chapter 6 (Testing and Results).
The collected data was used as the test dataset, and the two models were evaluated on their accuracy as well as their time performance. It was found that, under the present limitation of a lack of data, the Decision Tree model edged out the Neural Network model on both the accuracy and the time performance fronts. The details of this step are given in Chapter 7 (Conclusion).
The whole project can be outlined as follows:
[Figure 1: Project outline — a flowchart of the project steps: building the initial website capable of tracking the mouse movements of visitors; asking volunteering users to visit the website and capturing their mouse movements; cleaning and compiling the collected data; building and training machine learning models using the captured mouse movement data of the users; coding the obtained machine learning models back into the website; collecting a test dataset from the final website, which is now capable of recommending the appropriate content for a user based on his mouse movement behavior; and testing the accuracy of the built models using the collected test data, while also evaluating the time performance of the models on the web server.]

Figure 1: Project outline
3 DATA COLLECTION AND PRE-PROCESSING
This chapter will explain the complete training dataset collection steps: the details of the initial website developed and the steps followed to obtain the required training data from it. Later, this chapter will explain the data compilation and cleaning steps performed on the initially collected data.
3.1 The initial website
To analyze the mouse movement behavior of users on a webpage, the first step is the development of the website under consideration. Since the proposed method of analyzing and modeling the data is machine learning, some initial training data is also required. To cater to both needs, a dummy website capable of tracking the user's mouse movements was built and made public. The website was kept live until the required data was obtained. The specifications and details of the implementation are as follows:
3.1.1 Specifications
The functionalities, requirements and specifications of the initial webpage built are:
• The user interface design of the initial webpage needs to be exactly the same as that of the required final website. This is important because users' mouse movements depend on the interface of the webpage. It is necessary that the data collected to build and train the machine-learning model comes from the same webpage where the model is finally required to be implemented.
• The mouse tracking needs to be implemented in a hidden layer, so that the user can experience the web in the same rich way without any compromise on speed or performance. The user should not be asked for any explicit information at any time.
• As stated in section 1.3, the webpage developed was a dummy shopping portal showing five laptop models and comparing them on their configurations.
• There were 5 laptops with 22 attributes each. There was an empty (no laptop) specification heading information space on the left hand side of the page. The total number of sections in the built page was (5 + 1) × 22 = 132, where 5 is the number of laptops, 1 is for the specification heading category (no laptop space) and 22 is the count of attributes per laptop.
• Each of these 132 sections of the webpage gets highlighted as soon as the mouse pointer reaches it. This ensured that the user is most likely to read the highlighted section of the webpage, and hence that the user's mouse pointer stays close to his point of gaze. This step ensured that the mouse movement data provides pseudo eye tracking data of the user. The cell-highlighting feature was implemented using Cascading Style Sheets, where the cell color was changed as soon as the mouse pointer entered the cell.
• A MySQL database was connected for recording the mouse pointer time on each section of the webpage. The final product bought by each user was also saved in the database.
3.1.2 Implementation
The webpage was developed in HTML using PHP as the server side scripting language. JavaScript and AJAX were used to dynamically transfer data from the HTML fields to the PHP scripts. The database was designed in MySQL, and PHP scripts were written to connect and transfer data between MySQL and the Apache server.
3.1.2.1 Webpage Design
The webpage was designed in HTML in a tabular format with 6 columns and 22 rows. Column 1 had the headings of the specifications, the remaining 5 columns had the specifications of each laptop, and every row had one specification. Each of the 132 cells thus obtained in the table corresponded to an independent (input) variable for the model. A screenshot of the top half of the developed page is shown in Figure 2 and a screenshot of the complete webpage is shown in Figure 3.
Figure 2: Screenshot of the top half of the developed webpage
It can be seen clearly that there are 6 columns and 22 rows on the webpage, and hence 132 cells. Since each of these cells is an input variable to the model, they were all given a code. Each laptop was given a number from 1 to 5 and the specification heading space was given the code 0. Each specification was given an alphabetic code from 'a' to 'v'. Hence, each of the 132 sections of the webpage got as its code the combination of the alphabet of the specification and the number of the laptop, like a0, a1, a2, a3, a4, a5, b0, b1, b2, …, v3, v4, v5. The coding methodology for the first few cells is shown in Figure 4. These codes were not added anywhere on the webpage but were only used while calling the mouse tracking functions, as will be explained in subsequent sections.
Figure 3: Screenshot of the developed webpage
Figure 4: Code given to each section of the webpage
To make sure that in most cases the user's mouse pointer is close to his point of gaze, a Cascading Style Sheet was attached to the HTML webpage. The CSS file had two different style formats that could be applied to each cell. One of the styles was the normal white background, whereas the other had a blue background to enable cell highlighting. As soon as the mouse enters a cell, the normal style is replaced by the highlighting style for that cell. This is reset as soon as the mouse leaves the highlighted cell. Similarly, the row and the column in which the mouse pointer is currently present are also highlighted in a light shade of blue. The CSS code of the different styles is available in the appendix of this thesis. A screenshot with cell 'g2' highlighted is shown in Figure 5.
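The style swap described above can be sketched as follows. The class names 'cellNormal' and 'cellHighlighted' are assumptions for illustration; the actual class names appear only in the CSS code in the appendix.

```javascript
// Sketch of the described highlighting logic: swap the cell's CSS class
// when the mouse enters (on = true) or leaves (on = false) it.
// The class names are assumed, not the thesis's actual ones.
function setHighlight(cell, on) {
  cell.className = on ? "cellHighlighted" : "cellNormal";
}
```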
For every visitor of the website, a unique user id was generated as soon as the page loaded. To keep the user id simple, it was set to the current JavaScript Time value at page load. The JavaScript time function returns the current time in milliseconds since January 1, 1970. This ensured that, within the current scope of the project, all visiting users would have a unique user id. The JavaScript code to generate the user id is:
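A minimal version of such a statement, consistent with the description above, would be:

```javascript
// The user id is the number of milliseconds elapsed since
// January 1, 1970 at the moment the page loads.
var userId = new Date().getTime();
```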
A JavaScript file named 'mouseover.js' was associated with this webpage, with several JavaScript variables and functions required to track and record mouse movements. The HTML code of the website was also given an 'onload' event to call a JavaScript function named 'start_It()', which triggers the mouse tracking functionality of the website.
<body onload="start_It();">
The algorithms of mouse tracking and the complete implementation are explained later, after the details of the database design.
Figure 5: Screenshot with a cell highlighted
3.1.2.2 Database Design
A database was created in MySQL with two tables, namely 'data' and 'bought'. The attributes of the two tables are:
Figure 6: Database table 'data'
Figure 7: Database table 'bought'
Table: data
• userID To record the user id of the user
• cellID To save the cell ID that was assigned to each sub section of the webpage
• time To record the time in milliseconds spent in the cell with that cellID
Table: bought
• userID To record the user id of the user
• bought To save the code of the final product bought by the user
The table 'data' saves the time spent by a user in each cell, i.e. each section of the webpage. There can be 132 different sections / cellIDs for each user, and they can all appear multiple times. The time spent in each section by a user is an independent variable for the model. The table 'bought' records the final product selected by the user. The attribute 'userID' in both tables is the foreign key, and it is the primary key in the 'bought' table.
The rationale behind such a design was to implement database normalization, so that all data repetition could be avoided. Also, the insert queries would be simple and short, and hence efficient, and would not slow the webpage while tracking the mouse and interacting with the database simultaneously. The only drawback of such a design is that the data needs merging before it can be used for training the model.
3.1.2.3 Implementing mouse tracking
Each of the 132 cells of the webpage had JavaScript 'onmouseover' and 'onmouseout' event statements. OnMouseOver specifies that the 'movement_in()' JavaScript function be called every time the mouse comes over that cell. OnMouseOut similarly specifies that the 'movement_out('cellID')' JavaScript function be called when the mouse pointer leaves the cell. The code snippet demonstrating these function calls is:
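A minimal reconstruction of such a cell declaration, consistent with the description above (the cell code 'a0' is only an example), would be:

```html
<td onmouseover="movement_in();" onmouseout="movement_out('a0');">…</td>
```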
As soon as the mouse pointer enters a cell, the current DateTime is recorded in a temporary variable named 'cellEntryDate' in the function 'movement_in()'. This function is not passed any attribute. As soon as the mouse pointer exits a cell, the time spent in that cell in milliseconds is calculated by subtracting 'cellEntryDate' from the current DateTime in the function 'movement_out('cellID')'. The movement_out() function is also passed the unique 2-letter cell code to record the cell ID. The time spent in the cell, along with the cell ID, is concatenated into the data queue variable named 'queue1' or 'queue2'. The JavaScript function definitions are as follows:
function movement_in() {
cellEntryDate = new Date();
}
function movement_out(cell) {
cellExitDate = new Date();
time = cellExitDate.getTime()-cellEntryDate.getTime();
if(done==0)
{
if(flag==0) {
queue1 = queue1+cell+":"+time+"_";
}
else {
queue2 = queue2+cell+":"+time+"_";
}
}
}
The 'done' variable in the above code checks whether the current user is still active and has not already bought a product. 'flag' is a variable to check which queue variable is currently available.
Two instances of the queue variable were made to ensure that while one queue's data is being transmitted to the server via AJAX, the other queue variable can record the cell movements. This is of great importance especially when the Internet bandwidth is low and data transfer in the worst case can take a lot of time. This step also ensures that the interaction experience of the user is not affected while mouse tracking is going on in the background.
As stated above, the built website had an 'onload' JavaScript event calling a function named 'start_It()'. The start_It() function is a recursive function which calls the 'sendData()' function every 2 seconds. The sendData() function contains the AJAX statement to transfer the generated user ID (variable 'userId') and one of the queue variables, 'queue1' or 'queue2', to the 'data.php' file at the backend server. The self-explanatory JavaScript function definitions are as follows:
function start_It(){
if(done==0) {
setTimeout("sendData()",2000);
}
}
function sendData(){
var query_string;
if(flag==0) {
queue2="";
flag=1;
query_string = "data.php?userId="+userId+"&queue="+queue1;
queue1="";
}
else {
queue1="";
flag=0;
query_string = "data.php?userId="+userId+"&queue="+queue2;
queue2="";
}
http.open("GET", query_string, true);
http.onreadystatechange = handleHttpResponse;
http.send(null);
}
The 'sendData()' JavaScript function uses standard AJAX calls and the standard 'http' open, onreadystatechange and send functions. The 'query_string' variable contains the PHP file to which the arguments are passed via the GET method. The 'data.php' file was coded such that it takes the queue variable as sent by the JavaScript 'sendData()' function and explodes the string to extract the various cell IDs and the time values associated with them. It then opens a connection with the MySQL database and inserts records with the cell information into the 'data' table using the received user ID. The complete code of the 'data.php' file is available in the appendix of the thesis.
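The queue string format built by movement_out() can be illustrated with a short sketch. The parsing below is written in JavaScript for illustration only; the actual extraction happens server-side in 'data.php' using PHP's explode.

```javascript
// Parse a queue string of the form "a0:123_b2:456_" into a map of
// cellID -> total milliseconds, mirroring what 'data.php' does in PHP.
function parseQueue(queue) {
  var totals = {};
  var entries = queue.split("_");
  for (var i = 0; i < entries.length; i++) {
    if (entries[i] === "") continue;           // the trailing "_" leaves an empty entry
    var parts = entries[i].split(":");
    var cell = parts[0];
    var time = parseInt(parts[1], 10);
    totals[cell] = (totals[cell] || 0) + time; // a cell can appear multiple times
  }
  return totals;
}
```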
3.1.2.4 Final product bought by the user
Once the user had browsed through the webpage and scrolled through the table reading about the various configurations of the five laptops, giving us one case of the training data, he was required to select one of the products. This simulates the actual shopping portal scenario, where a person reads about various products and finally buys one of them. To select a product, the user performs a mouse click on the 'Buy Now' button associated with the product, as shown in Figure 3.
As soon as any 'Buy Now' button on the webpage is triggered by the user, a JavaScript function named 'bought('ProductID')' is invoked. This function uses the AJAX protocol and sends the userID and the ID of the clicked product to the 'bought.php' file on the server. The 'bought.php' file on the web server connects to the MySQL database and inserts this information as a row in the table 'bought'. The complete code of the PHP script 'bought.php' is available in the appendix, and the JavaScript function is as follows:
function bought(product){
done=1;
var query_bought;
query_bought = "bought.php?userId="+userId+"&product="+product;
http.open("GET", query_bought, true);
http.onreadystatechange = handleHttpResponseBought;
http.send(null);
}
Once the user selects the product, further mouse tracking is disabled. This is done by changing the value of the JavaScript 'done' variable.
3.1.3 Testing the initial website
The website, once completed, was hosted on a public web server and was tested thoroughly for bugs and errors. The main points in the checklist were:
• The queue variables ('queue1' and 'queue2') in the JavaScript file record the cellID and time appropriately, and the data is extracted accurately at the server.
• Data is sent properly from the frontend JavaScript functions to the backend PHP files via AJAX.
• The link between the database and PHP files is working correctly.
• Both tables in the database are receiving data and inserting it properly without any error.
3.2 Data collection
When the website as explained in the previous section had been developed and tested completely, it was opened to the general public. Volunteers were invited via email and social media to visit the webpage. The selection of the volunteers was completely random and was drawn primarily from the contact group of the author. All the volunteers / visitors were asked to browse the webpage and buy a product on it (at zero cost, virtually), similar to the way they would on a real shopping site. From this sample, the initial training data for the model was collected and saved into the databases as explained in the previous section. No personal information or any other data was asked from any visitor.
The duration of this step depends on the requirements of the initial training data for building the model. The more sections there are in the website, i.e. the more independent variables in the model, the more cases of initial training data are required to build a relevant model.
In a short span of 14 days, 292 unique visitors accessed the webpage. 244 rows were collected in the 'bought' table and 16401 tuples were saved in the 'data' table. Around 350-400 users were expected, but due to the lack of visibility of the project and the absence of any compensation for the volunteers, that number could not be reached. In view of the time, the website was then taken off and the data was exported for further analysis and cleaning.
3.3 Data compilation and cleaning
3.3.1 Need and Specifications
The collected data in the two tables needs to be merged in such a way that each row of the new table corresponds to a single user and contains all information about him, i.e. each row is one case of the training data. Each case would include the times spent in all 132 sections of the webpage, along with the user id and the product finally bought by the user. This is also the format required to train a machine-learning model in WEKA.
Moreover, the collected data needs to be analyzed properly and checked for any errors. There might be some users who did not provide the information on the actual product bought, and the data related to them needs to be scrapped.
Some users are likely to have spent almost no time, as they might have accidentally visited the webpage, and hence all users spending less than some calculated threshold time need to be scrapped. Similarly, users waiting on a section for more than a certain fixed time should be removed. These steps are important to ensure that there are no outliers in the collected data, and that the model built and trained on this data is best suited for general usage on the website.
Since the absolute time spent on the different elements of the webpage depends on a number of other factors, primarily the speed of an individual user, the data needs to be normalized. Dividing the time spent by a user on an individual section by the total time spent by that user on the website gives the proportion of time spent by him reading that section of the webpage. Hence the final training data should only contain valid user responses of the product bought, along with the normalized breakup of the time spent by them on the various sections of the webpage.
3.3.2 Implementation
First, all the data needs to be compiled into a single table as stated above, and then it needs to be cleaned.
3.3.2.1 Data compilation
A new PHP script named 'alignData.php' was written to compile the data into a more usable format. This file writes all the data to a new table named 'finalData' with the following 134 attributes: 132 corresponding to the time spent in the 132 sections of the webpage (independent variables), 1 to record the userID of the user, and 1 to save the code of the final product bought (target / dependent variable). The final product bought is the variable to be predicted in our model, which shall be discussed in the next chapter. The attributes of the 'finalData' table are:
• userID To record the user id of the user
• a0 Time in milliseconds spent in cell ‘a0’ of the webpage
• a1 Time in milliseconds spent in cell ‘a1’ of the webpage
• a2 Time in milliseconds spent in cell ‘a2’ of the webpage
• a3 Time in milliseconds spent in cell ‘a3’ of the webpage
• a4 Time in milliseconds spent in cell ‘a4’ of the webpage
• a5 Time in milliseconds spent in cell ‘a5’ of the webpage
• b0 Time in milliseconds spent in cell ‘b0’ of the webpage
• b1 Time in milliseconds spent in cell ‘b1’ of the webpage
• b2 Time in milliseconds spent in cell ‘b2’ of the webpage
• .
• . Similarly from ‘b3’ to ‘u3’
• .
• u4 Time in milliseconds spent in cell ‘u4’ of the webpage
• u5 Time in milliseconds spent in cell ‘u5’ of the webpage
• v0 Time in milliseconds spent in cell ‘v0’ of the webpage
• v1 Time in milliseconds spent in cell ‘v1’ of the webpage
• v2 Time in milliseconds spent in cell ‘v2’ of the webpage
• v3 Time in milliseconds spent in cell ‘v3’ of the webpage
• v4 Time in milliseconds spent in cell ‘v4’ of the webpage
• v5 Time in milliseconds spent in cell ‘v5’ of the webpage
• bought To save the code of the final product bought by the user
The 'alignData.php' file selects all the responses stored in the tables 'data' and 'bought' and saves them in the table 'finalData'. The attribute 'userID' is the primary key of the table. The algorithm implemented in the 'alignData.php' file was:
1. Select a list of unique users from the table 'data'.
2. For each user with id 'userID', do:
a. Select all the data (cellIDs and associated time) corresponding to that user from the table 'data'. Use the SQL sum aggregate function on time and group by cellID.
b. This gives the total time spent on each visited cell, i.e. each section of the webpage visited by that user.
c. The time spent on all other cells, i.e. sections not visited by that user, is set to zero.
d. Insert all the time values for each cell into the 'finalData' table along with the user's userID.
e. Select the final product bought by the user using a select statement on the table 'bought'. In case the user has not bought any product, i.e. the output from the 'bought' table for that user is empty, assign him product number 0.
f. Update the 'finalData' table by inserting the value for the 'bought' field corresponding to that user.
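The per-user aggregation in steps a-d can be sketched as follows, in JavaScript for illustration only; the actual script is PHP with SQL aggregate queries.

```javascript
// Sum the time per cell from one user's rows of the 'data' table and
// fill every unvisited cell with zero, as in steps a-c above.
function compileUser(rows, allCells) {
  var totals = {};
  allCells.forEach(function (c) { totals[c] = 0; });          // step c: default zero
  rows.forEach(function (r) { totals[r.cellID] += r.time; }); // steps a-b: sum per cell
  return totals;
}
```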
After successful execution of this algorithm in the 'alignData.php' script, the 'finalData' table contained all the data collected from the initial website in a tabular manner, with each row corresponding to a unique user. This data could now be used directly for model building in WEKA, but it needed some cleaning. The 'data' table had a total of 16401 tuples from 292 unique users, whereas the 'bought' table had 244 tuples. After executing the above script, the total number of tuples in the 'finalData' table was 292. Out of these 292 tuples, 48 (292 minus 244) belonged to users who left the site without selecting any product. The table 'finalData' was then exported in a spreadsheet format (Microsoft Excel) for analysis, visualization and cleaning.
3.3.2.2 Data cleaning
On the obtained 292 rows of data in Excel, the next task was the data cleaning stage. This step removes all the outliers and other cases that can harm the training of the model, and eventually the model itself. There can be multiple reasons behind the occurrence of such unwanted cases in the initial dataset, such as non-serious respondents, accidentally entering the webpage and closing it immediately, accidentally pressing the enter key, leaving the computer with the website open while working on something else, etc.
The following steps to clean the collected data were followed:
• All the tuples where the value of the attribute 'bought' is 0, i.e. the user has not bought any product, were deleted. This was because the objective of the project is to select the best product for a user, and hence the training set should only contain users who have bought a product; training the model on data predicting that the user would not buy would make the model inappropriate for use in the current project. There were a total of 48 such tuples where the bought product value was 0. This left 244 tuples in the data, each corresponding to a unique visitor who bought a product (dependent variable is not 0).
• The total time spent by each user was calculated using Excel's built-in sum function, and the distribution of the total time spent by different users on the built webpage was studied. The average time spent by a user on the webpage was 33.08 seconds, the maximum was 225.8 seconds, and the minimum was 1.2 seconds.
• The minimum and maximum times spent by any user were analyzed to find the outliers. Since the minimum time in the current data is much lower than the expected minimum time any serious volunteer would spend, a threshold value of 8 seconds was selected. The maximum time of 225.8 seconds was found feasible, and hence no upper limit was set. The value of 8 seconds was judged feasible keeping in mind the webpage design; it was assumed that any user taking less than 8 seconds on the webpage has given incorrect data and is considered an outlier. There were 44 users who spent less than 8 seconds on the initial website while giving training data for model building. The rows associated with all 44 users were deleted from the collected sample, leaving a sample size of 200 tuples. The average time spent by a user became 40.26 seconds, and the minimum time spent by a user in the new dataset became 8.3 seconds.
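The 8-second rule above can be sketched as a simple filter over the rows; the row shape here is hypothetical, and the actual cleaning was done in Excel.

```javascript
// Keep only rows whose total time across all cells meets the threshold
// (8000 ms for the rule described above).
function cleanRows(rows, thresholdMs) {
  return rows.filter(function (row) {
    var total = row.times.reduce(function (a, b) { return a + b; }, 0);
    return total >= thresholdMs;
  });
}
```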
3.3.2.3 Data normalization
The data collected from the volunteers has the 132 time fields, corresponding to the time spent in the 132 sections of the website, in absolute values. It was realized that data normalization would be required. The reason is that different people spent different amounts of time on the webpage; the time spent depends upon their individual browsing speed, reading speed and several other personal attributes. Since the desired model has to cater to a general audience, the time spent in one section relative to the time spent in the other sections was thought to be more appropriate. There are several advantages to this step, primarily that the model is now capable of predicting in real time for a user who is in the process of browsing the webpage: whenever a prediction is needed, the current times spent in the various sections can be normalized and fed into the model. Since the model is now immune to the absolute time value, successive predictions for the same user are not biased by the total time spent but depend only on the relative time spent on the different sections of the webpage. Another advantage is that all the data used for training the model is now equivalent: the 200 cases in the training set are more comparable and do not vary on an absolute scale. This step is expected to train the model better.
Implementation
To carry out data normalization, the total time spent by each user was calculated in Excel (also done in the data cleaning step). The time spent in each individual section of the webpage by that user was then divided by the total time spent on the webpage by him. This step gave the proportion of time spent by the user in each section of the webpage.
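The normalization step amounts to dividing each cell time by the user's total; a sketch follows (the actual computation was done in Excel).

```javascript
// Convert absolute per-cell times into proportions of the user's
// total time on the page.
function normalizeTimes(times) {
  var total = 0;
  for (var cell in times) total += times[cell];
  var normalized = {};
  for (var cell in times) normalized[cell] = times[cell] / total;
  return normalized;
}
```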
The new dataset, with 200 tuples and 134 attributes (132 independent variables, 1 userID and 1 dependent variable) containing the normalized time data, was saved in the CSV format, which could be imported directly into WEKA for the model building task.
The next chapter explains the procedure of building machine learning models in WEKA using the data collected in this chapter.
4 BUILDING MACHINE LEARNING MODELS
Using the collected data, various machine learning models were built and tested. This chapter explains the complete methodology followed, along with the details of the models obtained. It later explains the best models that were selected and the rationale behind them.
4.1 Machine Learning
According to Wikipedia1, "Machine Learning is a scientific discipline that is concerned with the design and development of algorithms that allow computers to learn based on data, such as from sensor data or databases." It can be defined as a set of algorithms that automatically learn and recognize complex patterns and are capable of making intelligent decisions based on data.
There are several software packages available that can be used to build and implement machine-learning models; MATLAB and WEKA are two commonly used ones. The models used in the project were built using WEKA.
1 Wikipedia, Machine Learning - Wikipedia, http://en.wikipedia.org/wiki/Machine_learning.
4.1.1 WEKA
Weka1 is open source data mining software written in Java. It is primarily a collection of various machine-learning algorithms that can be applied directly and easily to different types of data. It has a built-in interface to visualize the data and can perform tasks like attribute selection, clustering, etc. It is available under the GNU General Public License and can be downloaded from its website.
4.1.2 Why Machine Learning?
The primary objective of the project is to automatically learn the user's mouse movement behavior from the collected training data. Machine learning, as stated above, is a branch of science that deals with algorithms capable of learning patterns; this exactly fits the primary requirement. The project further demands the capability to predict further content for a new user based on his mouse movements. Machine learning algorithms, once trained on a large set of data, are capable of predicting the value of the dependent variable for any new case. Moreover, machine-learning algorithms can be retrained again and again with new data. The complete objective of the project can thus easily be catered for using machine-learning algorithms.
4.2 Methods evaluated
In machine learning, in order to classify or predict for any new case, a model is first built and trained on training data. There can be a number of different types of models that can be built, and a number of different algorithms for building them. The types of machine learning models generally used are Decision Trees, Neural Networks, Genetic Algorithms, Fuzzy Networks, etc. Keeping the scope of this project in mind, only Decision Tree and Neural Network based models were evaluated. The data was modeled using both methods, using the J48 classification algorithm for the decision tree and a multilayer perceptron for the neural network. The two models were later evaluated on the training data.

1 The University of Waikato, Weka 3: Data Mining Software in Java, http://www.cs.waikato.ac.nz/ml/weka/.
4.2.1 Decision Tree
A decision tree can be defined as a decision support classifier that uses a tree-like structure of conditions and their possible consequences. Each node of a decision tree can be a leaf node or a decision node, where:
• Leaf node – This node mentions the value of the dependent (target) variable.
• Decision node – These nodes contain one condition each, specifying some test on a single attribute value. The outcome of the condition is further divided into branches with sub-trees or leaf nodes.
The attribute to be predicted is known as the dependent variable, since its value depends upon, or is decided by, the values of all the other attributes. The other attributes, which help in predicting the value of the dependent variable, are known as the independent variables in the dataset.
4.2.2 Neural Network
"An Artificial Neural Network is an interconnected assembly of simple processing elements, units or nodes (neurons), whose functionality is inspired by the functioning of the natural neuron from brain. The processing ability of the neural network is stored in the inter-unit connection strengths, or weights, obtained by a process of learning from a set of training patterns."1
4.3 Implemented algorithms
There are several algorithms for decision trees commonly used nowadays, namely ID3, C4.5, C5.0, etc. After careful evaluation of these three algorithms, C4.5 was chosen for the project. The reasons behind choosing C4.5 over ID3 and C5.0 were:
• C4.5 handles continuous variables in a better way, by creating a threshold and then splitting the list on that value. Since all the attributes in the required decision tree are continuous, whereas the target variable has five discrete values, C4.5 was used.
• C4.5 has the capability to prune trees. Pruning is a method of going backwards in a tree to remove any branches that do not help in further classification and replace them with leaf nodes.
• C5.0 is generally ranked above C4.5 because of its higher speed of building a tree and its lower memory requirements. Since the scope of the project demanded none of these features, there was no significant advantage in C5.0. Also, C5.0 can be used to weight attributes, which was not required in the problem under consideration.
Similarly, neural networks can be implemented in one of various available forms, namely feedforward neural networks, radial basis function networks, Kohonen self-organizing networks, recurrent networks, stochastic neural networks, modular neural networks, holographic associative memory, etc. The neural network implemented in the project was a feedforward neural network with a non-linear activation function.

1 Kevin N Gurney, An introduction to neural networks, illustrated (CRC Press, 1997).
4.3.1 Decision Tree (C4.5)
WEKA implements the C4.5 decision tree algorithm as the ‘J48 decision tree classifier’. The explanation of the C4.5 algorithm, as well as the J48 implementation, is as follows:
• Whenever a set of items (training set) is encountered, the algorithm identifies the attribute that discriminates the various instances most clearly. This is done using the standard equation of information gain.
• Among the possible values of this feature, if there is any value for which there is no ambiguity, that is, for which the data instances falling within its category have the same value for the target variable, then that branch is terminated and the obtained target value is assigned to it.
• For all other cases, another attribute is sought that gives the highest information gain.
• This continues in the same manner until either a clear decision on the value of the target variable is reached with a combination of conditions on the various independent variables/attributes, or we run out of attributes.
• In the event of running out of attributes, or of getting an ambiguous result from the available information, the branch is assigned the target value that the majority of the items under this branch possess.
The name of the classifier in WEKA that follows the above-mentioned C4.5 algorithm is ‘weka.classifiers.trees.J48’.
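The attribute-selection step described above relies on the standard information-gain computation. A minimal sketch of that computation follows (in JavaScript for illustration; the function names and the toy rows are assumptions of this sketch, not taken from the project's dataset):

```javascript
// Sketch of the information-gain computation C4.5 uses to pick a split.
// Names and data are illustrative, not taken from the project's dataset.

// Shannon entropy of an array of class labels.
function entropy(labels) {
  const counts = {};
  for (const l of labels) counts[l] = (counts[l] || 0) + 1;
  let h = 0;
  for (const k in counts) {
    const p = counts[k] / labels.length;
    h -= p * Math.log2(p);
  }
  return h;
}

// Information gain of splitting `rows` on a binary threshold test.
function infoGain(rows, rowPassesTest) {
  const labels = rows.map(r => r.label);
  const left = rows.filter(r => rowPassesTest(r)).map(r => r.label);
  const right = rows.filter(r => !rowPassesTest(r)).map(r => r.label);
  return entropy(labels)
    - (left.length / rows.length) * entropy(left)
    - (right.length / rows.length) * entropy(right);
}

// Toy data: a perfectly discriminating threshold gives maximal gain.
const rows = [
  { b5: 0.01, label: 1 }, { b5: 0.02, label: 1 },
  { b5: 0.08, label: 2 }, { b5: 0.09, label: 2 },
];
const gain = infoGain(rows, r => r.b5 <= 0.045);
```

For a continuous attribute, C4.5 evaluates candidate thresholds in this way and keeps the one with the highest gain, which is the behavior the first bullet above describes.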
4.3.2 Neural Network (Multilayer Perceptron)
The multilayer perceptron is a feedforward neural network based classifier that uses backpropagation to classify instances. All the nodes in this network are sigmoid units, which means that the activation function is a sigmoid.

In a multilayer perceptron, there is an input layer with a node for each of the independent variables, at least one hidden layer, and an output layer with a node for each of the different classes of the target variable. The network is trained on initial data that determines the appropriate weights for the connections between all the nodes of adjacent layers, and also determines the bias/threshold value of each node.
The name of the classifier in WEKA is ‘weka.classifiers.functions.MultilayerPerceptron’
4.4 Model building
WEKA was opened in Explorer mode and the saved CSV file was opened using the open file button in the preprocess tab of WEKA. From the attributes pane, the attribute userID was deleted, because this field is irrelevant to the process of model building. The file was then saved in Attribute-Relation File Format (ARFF) simply by clicking the save button.

The saved ARFF file was opened in a text editor to change the properties of the predicted variable, i.e. the attribute ‘bought’, from numeric to nominal scale. This is an essential step because the ‘bought’ variable has only five discrete values, one for each product. It also enables the use of the J48 tree classifier, which requires nominal data for the predicted variable. To convert ‘bought’ from numeric to nominal mode, the property ‘numeric’ was changed to ‘{1,2,3,4,5}’, where 1,2,3,4,5 were the codes for the five laptop products. The output expected from the models is one of the five laptop codes. The file was saved and closed.
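The manual change described above amounts to editing a single @attribute line in the ARFF header. A minimal sketch of the relevant fragment (the attribute names other than ‘bought’ are illustrative):

```
@relation MLData_Normalized

% ... 132 numeric mouse-movement attributes, e.g.:
@attribute a0 numeric

% the target declaration was changed from
%   @attribute bought numeric
% to the nominal form:
@attribute bought {1,2,3,4,5}

@data
```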
4.4.1 Decision Tree
The saved ARFF file was then re-opened in WEKA and, under the classify tab, the J48 tree classifier was chosen. The J48 tree classifier has different parameters such as binary splits, number of folds, pruning, etc. Using a trial and error method, the various parameters were changed and each model was tested for accuracy on the training data. Models were tested using two methodologies, namely testing directly on the training data and testing using cross-validation. The set of parameters giving the maximum percentage of correctly classified instances was chosen. The final model, giving maximum accuracy on the training dataset, was also saved for later use.
4.4.1.1 Details of the chosen decision tree
The final parameters selected that gave the best output on the training data are:
• binarySplits: By the WEKA definition of this parameter, it is considered for nominal variables only. Since the dataset under consideration had no nominal independent variable, the value of this attribute had no impact on the built tree.
• confidenceFactor: This attribute defines the confidence factor used for pruning. It was found that with a confidence factor of 0.75, a decision tree of good accuracy was obtained when C4.5 pruning was used.
• debug: This parameter is only used to output some additional information at the console. Its value of either true or false didn’t impact the final model.
• minNumObj: This determines the minimum number of instances at every leaf node. This attribute was set to a value of ‘2’.
• numFolds: This parameter determines the amount of data used for reduced-error pruning. In the decision tree built, numFolds was kept at ‘11’. This means that one fold was used for pruning, and the rest for growing the tree.
• reducedErrorPruning: This was set to ‘False’, as it signifies whether reduced-error pruning should be used instead of C4.5 pruning.
• saveInstanceData: This attribute is just to save the instance data for visualization in the future.
• seed: The seed determines the value used to randomize the data when reduced-error pruning is applied. Since reduced-error pruning was not used, the seed parameter had no relevance.
• subtreeRaising: Subtree raising while pruning is always advisable when used with a high confidence factor. Since a confidence factor of 0.75 was used, this parameter was set to ‘true’.
• unpruned: Since we wanted pruning to happen, the ‘unpruned’ parameter was set to ‘false’.
• useLaplace: This parameter determines if counts at leaves are smoothed based on Laplace. The parameter had no influence on the model output.
All the parameters used in the final decision tree can be summarized as:
Figure 8: Parameters used for building the Decision Tree model
The output from WEKA is as follows:
=== Run information ===

Scheme:       weka.classifiers.trees.J48 -L -C 0.75 -M 2 -A
Relation:     MLData_Normalized-weka.filters.unsupervised.attribute.Remove-R1
Instances:    200
Attributes:   133
              [list of attributes omitted]
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree
------------------
b5 <= 0.04509
|   k4 <= 0.013828
|   |   v1 <= 0.000362
|   |   |   r0 <= 0.000626
|   |   |   |   d5 <= 0.003481
|   |   |   |   |   d5 <= 0.001586
|   |   |   |   |   |   g4 <= 0.033267
|   |   |   |   |   |   |   s3 <= 0.004874
|   |   |   |   |   |   |   |   u1 <= 0.002108
|   |   |   |   |   |   |   |   |   f1 <= 0.039667
|   |   |   |   |   |   |   |   |   |   f4 <= 0.028894
|   |   |   |   |   |   |   |   |   |   |   i4 <= 0.004699
|   |   |   |   |   |   |   |   |   |   |   |   d2 <= 0.001173
|   |   |   |   |   |   |   |   |   |   |   |   |   e5 <= 0.001377
|   |   |   |   |   |   |   |   |   |   |   |   |   |   e1 <= 0.029566
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   r3 <= 0.000861
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   c1 <= 0.043665
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   a3 <= 0.206815
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b1 <= 0.007319
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   f3 <= 0.001471
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b4 <= 0.00214: 2 (11.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b4 > 0.00214
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   a4 <= 0.004126: 3 (3.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   a4 > 0.004126: 2 (2.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   f3 > 0.001471: 3 (3.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b1 > 0.007319
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b3 <= 0.123969: 2 (12.0/2.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b3 > 0.123969: 1 (2.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   a3 > 0.206815: 1 (2.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   c1 > 0.043665: 1 (3.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   r3 > 0.000861: 3 (3.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   e1 > 0.029566: 1 (3.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   e5 > 0.001377: 3 (2.0)
|   |   |   |   |   |   |   |   |   |   |   |   d2 > 0.001173
|   |   |   |   |   |   |   |   |   |   |   |   |   s4 <= 0.002873: 2 (32.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   s4 > 0.002873: 4 (2.0)
|   |   |   |   |   |   |   |   |   |   |   i4 > 0.004699: 1 (3.0/1.0)
|   |   |   |   |   |   |   |   |   |   f4 > 0.028894: 4 (2.0)
|   |   |   |   |   |   |   |   |   f1 > 0.039667: 3 (3.0)
|   |   |   |   |   |   |   |   u1 > 0.002108: 3 (6.0/1.0)
|   |   |   |   |   |   |   s3 > 0.004874
|   |   |   |   |   |   |   |   q1 <= 0.004708
|   |   |   |   |   |   |   |   |   r4 <= 0.007391: 3 (16.0)
|   |   |   |   |   |   |   |   |   r4 > 0.007391: 2 (2.0)
|   |   |   |   |   |   |   |   q1 > 0.004708: 2 (2.0/1.0)
|   |   |   |   |   |   g4 > 0.033267
|   |   |   |   |   |   |   g5 <= 0.004141
|   |   |   |   |   |   |   |   k4 <= 0.001354: 4 (8.0)
|   |   |   |   |   |   |   |   k4 > 0.001354: 3 (3.0/1.0)
|   |   |   |   |   |   |   g5 > 0.004141: 2 (3.0/1.0)
|   |   |   |   |   d5 > 0.001586: 4 (4.0)
|   |   |   |   d5 > 0.003481
|   |   |   |   |   g5 <= 0.004141
|   |   |   |   |   |   b5 <= 0.002996
|   |   |   |   |   |   |   g4 <= 0.003922: 2 (4.0)
|   |   |   |   |   |   |   g4 > 0.003922: 1 (2.0)
|   |   |   |   |   |   b5 > 0.002996: 3 (2.0)
|   |   |   |   |   g5 > 0.004141: 5 (3.0)
|   |   |   r0 > 0.000626: 4 (3.0/1.0)
|   |   v1 > 0.000362
|   |   |   s4 <= 0.005561
|   |   |   |   t4 <= 0.002371
|   |   |   |   |   e0 <= 0.001979
|   |   |   |   |   |   h2 <= 0.005305: 1 (18.0/1.0)
|   |   |   |   |   |   h2 > 0.005305: 2 (2.0)
|   |   |   |   |   e0 > 0.001979: 2 (2.0)
|   |   |   |   t4 > 0.002371: 2 (2.0/1.0)
|   |   |   s4 > 0.005561: 2 (2.0/1.0)
|   k4 > 0.013828
|   |   f5 <= 0.001805: 4 (9.0/1.0)
|   |   f5 > 0.001805: 2 (2.0/1.0)
b5 > 0.04509
|   t3 <= 0.000515
|   |   d4 <= 0.008991
|   |   |   e2 <= 0.011901
|   |   |   |   a1 <= 0.001341
|   |   |   |   |   g2 <= 0.001762: 4 (3.0/1.0)
|   |   |   |   |   g2 > 0.001762: 5 (2.0)
|   |   |   |   a1 > 0.001341: 5 (4.0)
|   |   |   e2 > 0.011901: 4 (3.0)
|   |   d4 > 0.008991: 2 (2.0/1.0)
|   t3 > 0.000515: 3 (3.0)

Number of Leaves  : 42

Size of the tree : 83
Time taken to build model: 0.75 seconds
4.4.1.2 Testing the decision tree
The model was tested using two different methodologies, namely testing directly on the training dataset and testing using cross-validation with 10 folds. Testing on the training data gave a result of 89.5% accuracy, whereas testing using cross-validation gave an accuracy of 66%. The complete results, along with a discussion, are as follows:
4.4.1.2.1 Testing on Training Data
=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances        179               89.5    %
Incorrectly Classified Instances       21               10.5    %
Kappa statistic                         0.8586
Mean absolute error                     0.1650
Root mean squared error                 0.2382
Relative absolute error                54.9013 %
Root relative squared error            61.5103 %
Total Number of Instances             200
=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.848     0.030     0.848      0.848     0.848       0.953       1
                 0.959     0.079     0.875      0.959     0.915       0.969       2
                 0.932     0.019     0.932      0.932     0.932       0.994       3
                 0.795     0.019     0.912      0.795     0.849       0.987       4
                 0.818     0.000     1.000      0.818     0.900       0.999       5
Weighted Avg.    0.895     0.042     0.897      0.895     0.894       0.977
=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 28  4  0  1  0 |  a = 1
  1 70  1  1  0 |  b = 2
  1  2 41  0  0 |  c = 3
  3  3  2 31  0 |  d = 4
  0  1  0  1  9 |  e = 5
4.4.1.2.2 Testing by Cross-Validation (10 folds)
=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        132               66      %
Incorrectly Classified Instances       68               34      %
Kappa statistic                         0.5303
Mean absolute error                     0.2133
Root mean squared error                 0.308
Relative absolute error                70.9833 %
Root relative squared error            79.5133 %
Total Number of Instances             200
=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.545     0.072     0.600      0.545     0.571       0.865       1
                 0.890     0.315     0.619      0.890     0.730       0.871       2
                 0.500     0.000     1.000      0.500     0.667       0.874       3
                 0.538     0.081     0.618      0.538     0.575       0.833       4
                 0.545     0.016     0.667      0.545     0.600       0.983       5
Weighted Avg.    0.66      0.143     0.702      0.66      0.653       0.869
=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 18 14  0  0  1 |  a = 1
  7 65  0  1  0 |  b = 2
  1 13 22  7  1 |  c = 3
  4 13  0 21  1 |  d = 4
  0  0  0  5  6 |  e = 5
4.4.1.2.3 Discussion
Testing directly on the training data classified 179 cases correctly out of 200, which is an accuracy of 89.5%. A very high accuracy when testing on the training data is always desired, because it signifies the extent to which the model has learnt the training data. Since there were 5 classes in the target variable (5 products), any model accuracy of more than 20% (the equal probability of each class being 1/5 = 0.2 = 20%) has to be considered better than chance. An accuracy of 89.5% signifies that the built decision tree has learnt the training data quite accurately.

Testing using cross-validation is a process of dividing the data into different subsets and then carrying out the analysis on one subset and testing it on the other. Doing this with 10 folds is the process of carrying out cross-validation 10 times and averaging out the accuracy score. Again, as stated above, any accuracy of more than 20% is better than chance. The achieved result of an average of 132 correct classifications out of 200, an accuracy of 66%, is well within the desired range.

Ideally, the model should have been trained on more data. Due to the limitation of time, and with no compensation available to volunteers, only 200 tuples of useful data could be collected. It is expected that with a bigger training dataset, the accuracy of the models would increase.
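The 10-fold procedure described above can be sketched as follows. The fold assignment, the toy data, and the majority-class “classifier” are illustrative assumptions of this sketch; the real classifier is abstracted behind train/predict callbacks:

```javascript
// Sketch of k-fold cross-validation accuracy. `train` builds a model on
// the training rows; `predict` applies it to one held-out row.
function crossValidate(data, folds, train, predict) {
  let correct = 0;
  for (let f = 0; f < folds; f++) {
    // every `folds`-th row (offset f) forms the held-out test fold
    const test = data.filter((_, i) => i % folds === f);
    const trainSet = data.filter((_, i) => i % folds !== f);
    const model = train(trainSet);
    for (const row of test) {
      if (predict(model, row) === row.label) correct++;
    }
  }
  return correct / data.length; // accuracy pooled over all folds
}

// Toy check: a majority-class "classifier" on a 2:1 label mix.
const data = [];
for (let i = 0; i < 30; i++) data.push({ label: i % 3 === 0 ? 2 : 1 });
const majority = rows => {
  const c = {};
  rows.forEach(r => (c[r.label] = (c[r.label] || 0) + 1));
  return Object.keys(c).reduce((a, b) => (c[a] >= c[b] ? a : b));
};
const acc = crossValidate(data, 10, majority, (m, r) => Number(m));
```

WEKA additionally stratifies the folds so that each preserves the overall class proportions; this sketch omits that refinement.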
4.4.2 Neural Network
The saved ARFF file was re-opened in WEKA and, under the classify tab, the MultilayerPerceptron function was chosen. There are different parameters associated with this neural network function and, as was done with the decision trees, a trial and error method was used to find the best set. The best set of parameters was the one that gave the maximum classification accuracy on the training dataset. Each obtained model was tested using two methodologies, namely testing directly on the training data and testing using cross-validation. After multiple iterations of trial and error, a model giving a good classification accuracy was obtained. The model was also saved for later use.
4.4.2.1 Details of the chosen neural network
The final parameters selected that gave the best output on the training data are:
• GUI: The GUI parameter brings up an interface. It doesn’t really impact the final model, unless some changes in the learning rate and momentum are desired while training. It was set to ‘False’ in the project.
• autoBuild: The ANN was built automatically and hence this parameter was set to ‘true’.
• debug: This is to view additional information on the console.
• decay: It was observed that a ‘true’ decay value gave slightly less accuracy, and hence in the final model ‘decay’ was set to ‘false’.
• hiddenLayers: Since an automatic neural network was desired, WEKA was left to decide the number of hidden layers, and hence the final set of parameters had a value of ‘a’ in the hiddenLayers field. ‘a’, when used as a value for hiddenLayers, means ‘automatic’.
• learningRate: The amount by which the weights are updated was set to 0.1.
• momentum: Momentum of 0.2 was applied to the weights during updating.
• nominalToBinaryFilter: There were no nominal variables in the data and hence this parameter had no impact on the model.
• normalizeNumericClass: Since the class is not numeric, and the data was already normalized, there was no use for this feature, and hence it was set to ‘false’.
• reset: With reset set to false, no error message was received. Moreover, the chosen learning rate of 0.1 is already quite low, and hence this feature was set to ‘false’.
• seed: A seed value of 0 was used. As in the case of decision trees, this value is used to initialize the random number generator. Random numbers are used for setting the initial weights of the connections between nodes, and also for shuffling the training data.
• trainingTime: The number of epochs to train through was set to 5000.
• validationSetSize: The percentage size of the validation set was made 0, which signifies that no validation set will be used; instead the network will train for the specified number of epochs, i.e. for 5000 epochs.
• validationThreshold: This parameter was set to 20, which dictates that the validation set error can get worse 20 times in a row before training is terminated.
The parameters used in the final neural network model can be summarized as:
Figure 9: Parameters used for building the Neural Network model
It was not possible to include the complete model output in this document, and hence the summary of the model obtained is as follows:
=== Run information ===

Scheme:       weka.classifiers.functions.MultilayerPerceptron -L 0.1 -M 0.2 -N 5000 -V 0 -S 0 -E 20 -H a -R
Relation:     MLData_Normalized-weka.filters.unsupervised.attribute.Remove-R1
Instances:    200
Attributes:   133
              [list of attributes omitted]
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===
The chosen neural network had 1 hidden layer with 68 nodes. There were 132 input nodes accepting the 132 normalized time values corresponding to each section of the webpage. The model had 5 output nodes, one for each of the five laptops. There were a total of 73 threshold values for the 73 nodes (68 hidden layer nodes + 5 output nodes), and there were 9316 weight values (132*68 + 68*5).
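The threshold and weight counts quoted above follow directly from the layer sizes, as a quick arithmetic check confirms:

```javascript
// Layer sizes reported by WEKA for the chosen network.
const inputs = 132;   // one node per webpage section
const hidden = 68;    // single hidden layer
const outputs = 5;    // one node per laptop product

// Every hidden and output node carries one bias/threshold value.
const thresholds = hidden + outputs;                 // 68 + 5 = 73
// Fully connected layers: input-to-hidden plus hidden-to-output weights.
const weights = inputs * hidden + hidden * outputs;  // 8976 + 340 = 9316
```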
4.4.2.2 Testing the neural network model
The neural network model was also tested in the same way as the decision trees, using two different methodologies, namely testing directly on the training set and testing using cross-validation with 10 folds. It was found that testing on the training dataset gave an exceptionally good result of 95.0%, whereas testing using cross-validation with 10 folds gave a classification accuracy of 41.0%.
4.4.2.2.1 Testing on Training Data
=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances        190               95      %
Incorrectly Classified Instances       10                5      %
Kappa statistic                         0.9335
Mean absolute error                     0.0219
Root mean squared error                 0.1313
Relative absolute error                 7.2772 %
Root relative squared error            33.8899 %
Total Number of Instances             200
=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.939     0.012     0.939      0.939     0.939       0.966       1
                 0.918     0.024     0.957      0.918     0.937       0.936       2
                 1.000     0.026     0.917      1.000     0.957       0.993       3
                 0.949     0.006     0.974      0.949     0.961       0.957       4
                 1.000     0.000     1.000      1.000     1.000       1.000       5
Weighted Avg.    0.95      0.017     0.951      0.95      0.95        0.961
=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 31  2  0  0  0 |  a = 1
  2 67  3  1  0 |  b = 2
  0  0 44  0  0 |  c = 3
  0  1  1 37  0 |  d = 4
  0  0  0  0 11 |  e = 5
4.4.2.2.2 Testing by Cross-Validation (10 folds)
=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         82               41      %
Incorrectly Classified Instances      118               59      %
Kappa statistic                         0.2165
Mean absolute error                     0.236
Root mean squared error                 0.4551
Relative absolute error                78.4778 %
Root relative squared error           117.4608 %
Total Number of Instances             200
=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.333     0.12      0.355      0.333     0.344       0.614       1
                 0.575     0.22      0.6        0.575     0.587       0.706       2
                 0.295     0.237     0.26       0.295     0.277       0.626       3
                 0.282     0.155     0.306      0.282     0.293       0.652       4
                 0.455     0.042     0.385      0.455     0.417       0.856       5
Weighted Avg.    0.41      0.185     0.415      0.41      0.412       0.671
=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 11  8  5  8  1 |  a = 1
  9 42 19  2  1 |  b = 2
  6 12 13 11  2 |  c = 3
  5  7 12 11  4 |  d = 4
  0  1  1  4  5 |  e = 5
4.4.2.2.3 Discussion
Testing on the training data classified 190 cases correctly out of 200, which is an accuracy of 95.0%. Such a high classification accuracy clearly signifies that the built neural network model has learnt the training data very well.

Testing using cross-validation is a process of dividing the data into different subsets and then carrying out the analysis on one subset and testing it on the other. Doing this with 10 folds is the process of carrying out cross-validation 10 times and averaging out the accuracy score. The achieved result of an average of 82 correct classifications out of 200, i.e. an accuracy of 41.0%, is comparatively low but still well above the 20% chance level.

Ideally, the model should have been trained on more data. Due to the limitation of time, and with no compensation available to volunteers, only 200 tuples of useful data could be collected. Since there is a hidden layer with 68 nodes and a total of 9316 weight values involved, a much bigger training dataset was required. It is expected that with a bigger training dataset, the testing accuracy would increase.
4.4.3 Decision Tree vs. Neural Network
Based on the initial 200 data cases, one model each of a decision tree and a neural network was trained. Upon testing using cross-validation, the decision tree showed considerably better accuracy (66%) than the neural network model (41%), even though the neural network fit the training data better. The other factors worth considering about the two models are:
• Building a neural network model in WEKA is easy but time consuming; moreover, it slows down the performance of the website after its implementation. The objective of the project is to determine the product for the users in real time while they are still browsing, which requires very fast computation. A decision tree is a set of conditions, which can be evaluated much more efficiently than the calculations and temporary variables required by a neural network. However, if a parallel web server capable of performing the calculations faster is used, a neural network could also be considered for implementation.
• With time, the website would keep accumulating more and more mouse movement data, and the model should be improved/retrained on new data whenever required. This would require re-implementing the new model every time an update is desired. As stated above, this would be more difficult, time consuming and error prone for neural networks than for decision trees.
• Decision trees are more transparent than neural network models. This means that for a person visually inspecting the two models, a decision tree could give him some information, whereas a neural network can visually tell him nothing. This was, however, not one of the points considered before taking a final call on the model to be chosen.
Despite all these points, the final models of both the neural network and the decision tree were implemented in two similar copies of the same website. Further tests of accuracy and performance were conducted later in order to conclude which model is better for the problem at hand. The next chapter will explain the steps required to put these models into the website so that they can be used in real time to predict relevant content for a user.
5 EMBEDDING THE MACHINE LEARNING MODELS IN THE WEBSITE
This chapter explains the complete methodology adopted to apply the built machine learning models in the website. It also explains the interaction between the models and the website, and how a user’s mouse movement data was used to predict the best content for him in real time.
5.1 What and Why?
As explained in the previous chapter, a decision tree and a neural network model capable of predicting the product the user is most likely to buy were built. These models need to be implemented in the website so that they can take the mouse movement behavior of new users as input and predict the appropriate product for them.
5.2 Specifications
The initial website, built as explained in Chapter 3 for collecting the training data, was modified to implement the decision tree and neural network models. Some additional characteristics required from the website were:
• The model should reside on the server. This is essential from a security point of view; otherwise any user would have access to the model, which by reverse engineering could give information about the products bought by other users.
• Real-time model evaluation on the real-time mouse movement data.
• Real-time transfer of the model output from the web server to the frontend website, so that the website can use the model prediction.
• Determining the product the user is most likely to buy using the embedded models periodically, say every 10 seconds. This would involve including the latest mouse movement data and transferring the output again to the frontend HTML website, so that if any change in the final product is predicted, it can be reflected on the frontend.
• All the tracking and model evaluation was to be carried out in a hidden layer; the user was not asked for any explicit information, and speed and performance were not compromised.
• Needless to say, the website should continue to track mouse movement as explained in earlier chapters.
5.3 Implementation
The website built initially to collect training data already had mouse movement tracking capability. A few new functions and scripts were added to enable model evaluation on the captured mouse movements.

A new JavaScript function named ‘predict()’ was programmed in the JavaScript file. ‘predict()’ was made to be called every 10 seconds, because a prediction from the machine learning model was expected every 10 seconds. Every subsequent 10 seconds, the database would contain more mouse movement data that could be used by the machine learning models to, ideally, predict more accurately.
The ‘predict()’ function takes no arguments and calls a PHP script named ‘predict.php’, passing it the userID of the current user via the GET method. The ‘predict.php’ file resides on the server, and the call from JavaScript was programmed using standard AJAX techniques. The code snippet of the JavaScript ‘predict()’ function is:
// Schedule the first prediction 10 seconds after the page loads.
function autoPredict()
{
    setTimeout("predict()", 10000);
}

// Ask the server for a prediction for the current user via AJAX.
// `http` is an XMLHttpRequest object created elsewhere in the script.
function predict()
{
    http.open("GET", "predict.php?userId=" + userId, true);
    http.onreadystatechange = predictResponse;
    http.send(null);
}
The ‘predict.php’ file connects to the MySQL database and selects the mouse movement data for the current user using a simple SQL ‘SELECT *…’ statement. The mouse movement data was saved into 132 temporary variables that correspond to each section of the webpage. The total time spent by the user so far was also calculated while saving these temporary variables. The absolute time value spent in each section, as saved in the 132 temporary variables, was then replaced by the normalized time spent in that section, obtained by dividing the absolute time value by the total time spent by that user. Hence, after this step, the 132 temporary variables in the ‘predict.php’ file contain the normalized/relative time spent by the user in the corresponding 132 sections of the webpage. These 132 temporary variables are the 132 independent input variables for the model.
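The project performs this normalization in PHP; the same transformation can be sketched in JavaScript as follows (the function name and the toy 4-section input are illustrative):

```javascript
// Divide each section's absolute dwell time by the user's total time,
// turning raw times into the relative attention values the models expect.
function normalizeTimes(sectionTimes) {
  const total = sectionTimes.reduce((sum, t) => sum + t, 0);
  if (total === 0) return sectionTimes.map(() => 0); // no movement yet
  return sectionTimes.map(t => t / total);
}

// Toy input: 4 sections instead of the project's 132.
const normalized = normalizeTimes([2, 1, 1, 0]);
```

The normalized values sum to 1, so users who browse for different lengths of time become comparable to the model.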
The two models (decision tree and neural network) were then coded and given access to these 132 temporary variables, so that they can evaluate the normalized times and make their respective predictions. It should, however, be noted that for testing, only one of the models was used at a time; both models were tested separately later for comparison purposes. The implementation of the two models is as follows:
5.3.1 Implementing the Decision Tree model
A function named ‘decisionTree()’ was coded in the PHP file ‘predict.php’. This function had access to all the 132 input variables stated above. The model made in WEKA had a set of 83 if-else statements (83 being the size of the tree). All these 83 if-else statements from the WEKA model, along with the prediction values, were coded in PHP. The if-else statements perform comparisons on the 132 independent variables so as to imitate the decision tree. The output of this set of if-else statements is a single value, which is also the output of the decision tree model: the product the user is most likely to buy according to the implemented decision tree. This value was returned to the main program by the function. The complete code of the ‘decisionTree()’ function and the ‘predict.php’ file is available in the appendix.
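The shape of such a hand-coded tree can be illustrated with a small fragment. The thresholds below are taken from the printed J48 tree, but only the first few branches are transcribed and the deeper sub-trees are replaced by placeholder returns, so this is a sketch rather than the project's full 83-statement function (which was written in PHP):

```javascript
// Sketch of a decision tree transcribed into nested if-else statements,
// mirroring the first branches of the printed J48 tree. `v` holds the
// normalized section times keyed by attribute name.
function decisionTreeSketch(v) {
  if (v.b5 <= 0.04509) {
    if (v.k4 > 0.013828) {
      return v.f5 <= 0.001805 ? 4 : 2; // leaf predictions (product codes)
    }
    return 2; // placeholder for the deeper sub-tree, elided in this sketch
  }
  if (v.t3 > 0.000515) return 3;
  return 5; // placeholder for the remaining sub-tree
}

const product = decisionTreeSketch({ b5: 0.03, k4: 0.02, f5: 0.001, t3: 0 });
```

Evaluating such a function is just a walk down one root-to-leaf path, which is why the decision tree was attractive for real-time use.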
5.3.2 Implementing the Neural Network model
Another function, named ‘neuralNetwork()’, was implemented. This function also had access to the 132 independent input variables stated above. The neural network built in WEKA had one hidden layer with 68 nodes. To implement this hidden layer, 68 new temporary variables named ‘Node5’, ‘Node6’, ‘Node7’, ….., ‘Node72’ were created, with values computed using the standard neural network formula.
All the coefficient values as well as the threshold limits were used as given by WEKA during model building. To implement the output layer, the same formula was applied to these 68 temporary variables (the 68 hidden layer nodes, i.e. Node5, Node6, …, Node72). The output layer of the neural network model had 5 nodes corresponding to the five laptop products. The product corresponding to the node with the highest value was predicted as the laptop the current user is most likely to buy.
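The per-node computation, a weighted sum plus bias/threshold passed through the sigmoid, can be sketched as follows. The tiny 2-2-2 network and its weights are illustrative assumptions of this sketch, not the project's 132-68-5 network:

```javascript
// Standard feedforward computation for one sigmoid layer:
// out[j] = sigmoid(bias[j] + sum_i w[j][i] * in[i])
const sigmoid = x => 1 / (1 + Math.exp(-x));

function layerForward(inputs, weights, biases) {
  return weights.map((row, j) =>
    sigmoid(biases[j] + row.reduce((s, w, i) => s + w * inputs[i], 0))
  );
}

// Toy 2-input, 2-hidden, 2-output network (weights are illustrative).
const hidden = layerForward([1, 0], [[1, -1], [-1, 1]], [0, 0]);
const output = layerForward(hidden, [[2, -2], [-2, 2]], [0, 0]);
// The predicted class is the output node with the highest value.
const predicted = output.indexOf(Math.max(...output)) + 1;
```

The project's ‘neuralNetwork()’ function performs exactly this kind of computation in PHP, once for the 68 hidden nodes and once for the 5 output nodes, then takes the index of the largest output as the product code.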
5.4 Using model outputs
As stated above, only one of the two models was used at a time for a given user. After receiving the output from the model in use (decision tree or neural network), the output was sent back to the frontend JavaScript function named ‘predictResponse()’ via AJAX. It should be noted that the model output is the code of the one of the 5 laptops that the current user is most likely to buy.

After receiving the prediction, the ‘predictResponse()’ JavaScript function can be programmed as per the needs. In the current project, the author decided to simply highlight the border of the predicted laptop in red. The predicted laptop is the one the user is most likely to buy, as predicted by one of the machine learning models based on the user’s mouse movement behavior. The function definition of the ‘predictResponse()’ function is as follows:
function predictResponse()
{
    if (http.readyState == 4)   // response fully received
    {
        predictProduct = http.responseText;
        var colName = Number(predictProduct) + 1;
        // reset the style of all product columns...
        document.getElementById("cg2").className = "";
        document.getElementById("cg3").className = "";
        document.getElementById("cg4").className = "";
        document.getElementById("cg5").className = "";
        document.getElementById("cg6").className = "";
        // ...then highlight the column of the predicted product
        document.getElementById("cg" + colName).className = "oce-predict";
        alert("Product : " + predictProduct);
        // schedule the next prediction in 10 seconds
        setTimeout("predict()", 10000);
    }
}
The function above gets the response from the PHP script via the standard AJAX http.responseText property. The output is then used to simply change the style of the column containing that product; the style of all other columns is first reset before changing the style of the predicted laptop column. The JavaScript ‘predict()’ function is then called again after 10,000 milliseconds. In the current demonstration, a popup was also shown to the user with the code of the laptop he is most likely to buy; this was done using the alert statement.
There can be several other usages of the prediction. It can be imagined that a customer would be served more easily and appropriately if the shopkeeper knows the product the customer is most likely to buy. The customer could be given other options similar to the predicted product. Even if not used by the content generator of the website, this prediction can always help visitors in finding the information they have been looking for.
The screenshot of the prediction made by the Decision Tree model is shown in Figure 10.
Figure 10: Screenshot of the prediction done by the model
5.5 What next
Once the website was programmed and the machine learning models were embedded, it was again made public and users were invited to visit it again. All the mouse movement data was saved in the databases as designed earlier, along with the product the user buys. The users were also shown the real-time prediction as per the model after every 10 seconds. The prediction done by the model was not saved in any database, for the following reasons:
• Connecting the ‘predict.php’ file to the databases and saving data would certainly take time. The time used up in saving the predicted output would affect the performance of the website, mainly because it would delay the return of the model output from the ‘predict.php’ file to the JavaScript ‘predictResponse()’ function.
• The final prediction made for any user by the model can always be recalculated, as the databases keep a record of the mouse movement data for every user. This would be done later, in the testing phase of the project.
• The prediction was made every 10 seconds. This means that there would be several predictions (four on average) for every user. The count of four predictions was estimated because it was found earlier, in section 3.3.2.2, that the average time spent by a user on the webpage is 40.26 seconds. Saving every prediction per user is again a performance issue, as the table storing them would be expected to grow with time.
The final website, capable of predicting the product the user is most likely to buy, was made public and was kept online for 7 days. The users were again invited via emails, social media, chats, etc., and were asked to surf the final version of the webpage. The volunteers were required to buy one of the products after evaluating all the options (5 laptops) available on that page. While doing so, the users were shown the product they are most likely to buy. The visitors reported informally, via email and in-person conversations, that the predictions were quite accurate.
The next chapter will explain a more formal and quantitative method of testing the predictions made by the two models. It will also describe the methodology adopted to test the time performance of the two models.
6 TESTING AND RESULTS
This chapter describes the complete testing phase of the project. It describes the data collection steps and the parameters on which the models were evaluated. It also explains the testing methodology and summarizes the final results obtained.
6.1 Testing methodology
Two types of tests were conducted to evaluate the implementation. One test was conducted in WEKA on the collected test data to check the classification accuracy of each model (decision tree or neural network). The other test was conducted on the ‘predict.php’ file to check the time performance of the website after implementing the model. Both of the above-mentioned tests were performed on both models separately. The methodology adopted and the results obtained are described in the following sections.
6.2 Testing for model accuracy
Testing data was collected while the final website was live and was used to further test the two models in WEKA. It was found that the decision tree model gave an accuracy of 84.09%, whereas the neural network model gave an accuracy of 34.09%, on the collected test data. Details about the tests conducted are as follows:
6.2.1 Testing data collection
While the website with one of the machine learning models was live, the users' mouse tracking data and the final product bought by each user were being saved in the tables ‘data’ and ‘bought’ respectively. It was found that in the 7 days for which the test website was live, 49 unique users visited the webpage. There were 1275 tuples in the ‘data’ table and 44 tuples in the ‘bought’ table. The difference between the cardinality of the ‘bought’ table and the number of visitors arose because 5 users (49 minus 44) did not click the buy button and left the site after browsing it for a while.
This data was processed in a similar way to the initial data, as described in section 3.3.2. The steps followed to analyze and prepare the test data are as follows:
• The data was converted into a more usable format using the PHP script named ‘alignData.php’. The details of this script are given in section 3.3.2.1. This step converted the test data into ‘one user per row’ form, with the time values of each user in the same row along with the product bought.
• This data was exported into Excel and normalized. To normalize the times, the total time spent by each user was calculated, and the time spent in each section/cell was then divided by that total. This is explained in detail in section 3.3.2.3.
• It should be noted that no outliers were removed in this step. The reason is that the data was collected from actual users, and it is expected that all kinds of people will use the website in all possible ways; an accurate measure of accuracy requires that all these cases, including any outliers, be taken into consideration.
• This data was saved in a CSV file, which was then opened in WEKA.
• The CSV file was opened in WEKA and saved in WEKA's default ARFF format. The ARFF file was opened in a text editor and the type of the bought attribute was changed from numeric to nominal, as stated in section 4.4.
This data was then opened in WEKA again, and the model testing was carried out as explained in the following sections:
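As an aside, the time-normalization step described above can be sketched as follows; this is a minimal JavaScript illustration (the language of the site's client code), with hypothetical cell names and times rather than the actual ‘alignData.php’ output:

```javascript
// Normalize one user's per-cell dwell times so that each value becomes the
// fraction of the user's total time, as was done for the test data in Excel.
function normalizeTimes(cellTimes) {
  const total = Object.values(cellTimes).reduce((a, b) => a + b, 0);
  const normalized = {};
  for (const [cell, time] of Object.entries(cellTimes)) {
    normalized[cell] = time / total;
  }
  return normalized;
}

// Hypothetical row: milliseconds spent in three cells by one user
const row = { a1: 5000, b2: 15000, c3: 20000 };
console.log(normalizeTimes(row)); // { a1: 0.125, b2: 0.375, c3: 0.5 }
```

Normalizing in this way makes users comparable regardless of how long each one stayed on the page.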
6.2.2 Model testing in WEKA using test data
The saved files of the two models were opened in WEKA. In the Classify tab, the supplied test set option was chosen and, after pressing the Set button, the collected and normalized test data file was opened. The loaded model was then made to evaluate on this testing data by right-clicking the model and selecting “Re-evaluate model on current test set”. This method evaluates the model on the collected test dataset and shows the accuracy results on this test data. It is equivalent to running the model on the website using PHP: the output given by the model while testing in WEKA would be exactly the same as the one given by the PHP script online, because the WEKA model obtained was the one implemented in the website. This was the reason that the predictions were not saved, as stated in section 5.5. Checking for accuracy is then simply a matter of comparing the model prediction with the actual product bought by the user. The details of the results given by both models when evaluated on the test set are explained in the following subsections.
6.2.2.1 Decision Tree model
The test dataset with 44 cases was evaluated using the built decision tree model. It was found that the tree was able to correctly classify 37 out of the 44 cases, an accuracy of 84.0909%.
The output obtained after re‐evaluation from WEKA was:
=== Re-evaluation on test set ===

User supplied test set
Relation:     MasterData-1-weka.filters.unsupervised.attribute.Remove-R1
Instances:    unknown (yet). Reading incrementally
Attributes:   133

=== Summary ===

Correctly Classified Instances          37               84.0909 %
Incorrectly Classified Instances         7               15.9091 %
Kappa statistic                          0.7825
Mean absolute error                      0.1916
Root mean squared error                  0.2814
Total Number of Instances               44

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.857     0.027     0.857       0.857    0.857       0.986      1
0.875     0.143     0.778       0.875    0.824       0.872      2
0.917     0.031     0.917       0.917    0.917       0.921      3
0.833     0.026     0.833       0.833    0.833       0.945      4
0.333     0         1           0.333    0.5         0.89       5
Weighted Avg.   0.841   0.068   0.851   0.841   0.834   0.915

=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
  6  1  0  0  0 |  a = 1
  1 14  0  1  0 |  b = 2
  0  1 11  0  0 |  c = 3
  0  0  1  5  0 |  d = 4
  0  2  0  0  1 |  e = 5
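The summary figures above can be recomputed directly from the confusion matrix; the following JavaScript sketch (not part of the project code) derives the accuracy and the Kappa statistic from the matrix rows and columns:

```javascript
// Confusion matrix from the WEKA output above: rows are actual classes,
// columns are predicted classes (laptops 1-5).
const m = [
  [6, 1, 0, 0, 0],
  [1, 14, 0, 1, 0],
  [0, 1, 11, 0, 0],
  [0, 0, 1, 5, 0],
  [0, 2, 0, 0, 1],
];

const n = m.flat().reduce((a, b) => a + b, 0);           // 44 instances in total
const correct = m.reduce((s, row, i) => s + row[i], 0);  // 37 on the diagonal
const accuracy = correct / n;

// Kappa corrects the raw agreement for chance using the row/column marginals
const rowSums = m.map(r => r.reduce((a, b) => a + b, 0));
const colSums = m[0].map((_, j) => m.reduce((s, r) => s + r[j], 0));
const pe = rowSums.reduce((s, r, i) => s + r * colSums[i], 0) / (n * n);
const kappa = (accuracy - pe) / (1 - pe);

console.log((100 * accuracy).toFixed(4) + " %", kappa.toFixed(4)); // 84.0909 % 0.7825
```

Both values match the WEKA summary, which is a useful sanity check on the reported output.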
6.2.2.2 Neural Network model
The dataset of 44 test cases was evaluated on the neural network model, and it was found that it classified 15 cases correctly. This amounts to an accuracy of only 34.0909% for the neural network model on the testing data.
The output obtained from WEKA was:
=== Re-evaluation on test set ===

User supplied test set
Relation:     MasterData-1-weka.filters.unsupervised.attribute.Remove-R1
Instances:    unknown (yet). Reading incrementally
Attributes:   133

=== Summary ===

Correctly Classified Instances          15               34.0909 %
Incorrectly Classified Instances        29               65.9091 %
Kappa statistic                          0.1367
Mean absolute error                      0.2695
Root mean squared error                  0.5001
Total Number of Instances               44

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.429     0.135     0.375       0.429    0.4         0.695      1
0.313     0.25      0.417       0.313    0.357       0.694      2
0.25      0.281     0.25        0.25     0.25        0.505      3
0.5       0.184     0.3         0.5      0.375       0.623      4
0.333     0.024     0.5         0.333    0.4         0.78       5
Weighted Avg.   0.341   0.216   0.354   0.341   0.34   0.639

=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
  3  3  0  1  0 |  a = 1
  2  5  7  2  0 |  b = 2
  3  4  3  2  0 |  c = 3
  0  0  2  3  1 |  d = 4
  0  0  0  2  1 |  e = 5
6.2.3 Discussion
On the training data, the decision tree gave an accuracy of 89.5% whereas the neural network gave an accuracy of 95%. The same decision tree and neural network models gave accuracies of 84.0909% and 34.0909% respectively when evaluated on the test dataset. For comparing models, accuracy on the test dataset, i.e. data on which the model has not been trained, is one of the most important parameters. As discussed in section 4.4.3, there are several drawbacks to using neural networks in the present situation, but after evaluating the two models on the test dataset it is clear that the decision tree has clearly outperformed the neural network and should be used for making predictions. This, however, depends on many parameters, the most important being the sizes of the training and testing datasets. Since the scope of this project was limited, a large amount of data could not be collected, but it is advised that both decision trees and neural networks, along with other machine learning models, should be evaluated before settling on one of them.
6.3 Testing time performance of the models
A new PHP script was written and executed on the server to estimate the average time that model processing takes when executed in real time. To do this, the PHP script was connected to the database containing the test data. Both model functions were then called, and the time taken by them to evaluate all 44 test cases was measured. This was averaged over the 44 cases to estimate the average time each model takes for a single prediction in PHP. The whole process was carried out 10 separate times so as to average out any clashes with unforeseen tasks at the server that might delay the model execution.
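The timing procedure described above can be sketched as follows. This is a JavaScript illustration of the measurement logic, not the actual PHP script; ‘model’ and ‘testCases’ stand in for the embedded model functions and the 44 collected cases:

```javascript
// Time a model over all test cases, average per prediction, and repeat the
// whole measurement several times to smooth out unrelated server load.
function averagePredictionTime(model, testCases, runs = 10) {
  const perRunAverages = [];
  for (let r = 0; r < runs; r++) {
    const start = Date.now();
    for (const c of testCases) model(c);
    perRunAverages.push((Date.now() - start) / testCases.length);
  }
  // Final estimate: mean of the per-run averages (in milliseconds)
  return perRunAverages.reduce((a, b) => a + b, 0) / runs;
}
```

In the project, the same scheme was applied to both model functions over the 44 test cases.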
The time taken by the model is an important characteristic, as the intelligent website is expected to predict the output as soon as possible and, of course, in real time. A model taking more than some threshold time for its calculations is of little practical use. The process and results are explained in the following sections.
6.3.1 Decision Tree model
The decision tree was made to execute on all 44 test cases and the time taken was averaged. This was done 10 times, and the average times (in seconds) taken by the script to evaluate the decision tree were:
From the above 10 time values, the following insights can be seen:
• Minimum time taken by the model was approximately 0.00053 seconds
• Maximum time taken by the model was approximately 0.00737 seconds
• Average time taken by the model was 0.00163 seconds
6.3.2 Neural Network model
As was done for the decision tree, the neural network model was also made to evaluate the 44 test cases and the average time taken was noted. This was done 10 times. The average execution times taken by the neural network model were:
0.496645451, 0.707032805
From the above 10 time values, the following insights can be seen:
• Minimum time taken by the model was approximately 0.4839 seconds
• Maximum time taken by the model was approximately 0.7461 seconds
• Average time taken by the model was 0.5973 seconds
6.3.3 Discussion
It is clearly seen that the neural network model takes far more time to execute than the decision tree model. It was also found that the chosen decision tree model runs at least 350 times faster than the chosen neural network. Since the objective is to predict in real time, speed is a very important parameter, and the decision tree model has completely won the time performance battle.
6.4 Results
After testing both models (decision tree and neural network) on the prediction accuracy and time performance parameters, it was clearly found that the decision tree proved much better suited for implementation in the current problem than the neural network. The results obtained in the tests are summarized below:
• Accuracy (on test dataset):
o Decision Tree: 84.0909%
o Neural Network: 34.0909%
• Time Performance (PHP scripts running on Apache):
o Decision Tree: 0.0016 seconds
o Neural Network: 0.5973 seconds
It should, however, be noted that these results were obtained when the models were trained on only 200 cases. The neural network model had a total of 73 nodes, including 68 hidden nodes; to train such a neural network properly, at least a few thousand cases would be required. The neural network model was built to establish the fact that it can be used on a website to predict relevant content for the user. The decision tree, on the other hand, is also expected to give better results when larger training and testing datasets are available. It should also be noted that there were five classes of the dependent variable (5 possible laptop products), and hence a model would have been considered void only if its accuracy were close to 20% (20% being the equally likely chance of each class). Since the accuracy obtained for both machine learning models was far above the 20% benchmark, both models have shown promise that they have the potential to recommend relevant content for a user based on his mouse movement behavior.
The next chapter will give a brief conclusion of the work done.
7 CONCLUSION
This chapter gives the conclusion of the project and discusses the scope for possible future work. It also describes some other possible implementations of the explained methodology.
It has been successfully demonstrated that, by building a machine learning model on a user's mouse movement data, appropriate content for that user can be predicted. The dummy shopping website developed, embedded with a decision tree machine learning model, gave a remarkable accuracy of 84.09% on the test data. The accuracy was measured as the ratio of correct predictions to the total number of predictions made by the model. It was also found that implementing a decision tree model in a website would not affect the performance of the page, as the average time taken by the model was found to be around 1.6 milliseconds. A neural network model was similarly evaluated; it gave an accuracy of 34.09% and took an average time of 597.3 milliseconds to process a single case of data.
The objective of the project was to use the mouse movement behavior of a user to predict appropriate content for him intelligently and in real time. This objective was successfully achieved, and several other sub-objectives were also reached while working on the project.
User’s mouse tracking was implemented successfully using a completely new algorithm.
This was done using PHP, AJAX, HTML and MySQL. The performance of the website after
implementing mouse tracking was not compromised, and the accuracy of the collected mouse tracking data was found to be very high. A webpage imitating a shopping portal was developed, and some highlighting techniques were applied to it to help ensure that the user's mouse pointer stays close to his point of gaze.
The initial website, developed in PHP, was live for around two weeks and collected 200 cases of training data. The data was then used to train two separate machine learning models, namely a Decision Tree model and a Neural Network model. Both models gave promising results when tested on the training data, which indicated that the models built had learned the mouse movement behavior appropriately.
Both machine learning models were coded back into the website using PHP and AJAX. The website collected mouse movement data, which was dynamically read by the models, and an output was generated. This predicted output was sent to the webpage for further personalization.
A total of 44 test cases was also collected from the final website. Using these 44 test cases, both models were evaluated, and the decision tree model was found to perform extremely well compared to the neural network model, both in terms of accuracy and of time performance. The decision tree classified 2.5 times as many cases correctly in the set of 44 cases, and was 350 times faster than the neural network model.
This, however, cannot be generalized, as it depends on the size of the initial training dataset (which was small in the current scope of the project) and on the number of independent variables (which was large in the current implementation).
The working demonstration of the project, along with its documentation and the source code under the GNU General Public License, is available online at http://sparshgupta.name/MSc/Project
7.1 Future Work
The proposed idea has shown huge potential, and there is a lot of scope for future innovation and improvement if properly explored. The lack of data was the prime limitation of the current study.
If a commercial website is required to be intelligent, then models built on several thousands of cases of training data should be used, and once that much data is available, other machine learning algorithms could also be explored. The data collected in the testing phase can later be used to train the models.
There is a never-ending cycle of model training and improvement involved in the proposed concept and implementation: with time, the website will accumulate a lot of data that, at regular intervals, can be used to further train the implemented model or to build a new one. It is expected that with every improvement in the model, its capability to predict the relevant content for a new user will increase.
The proposed implementation requires that each section of the website call the mouse tracking functions whenever the mouse enters or leaves that section. This requires explicit coding of function call statements in every cell, which might not be feasible in highly dynamic websites; hence, work could be done on implementing the idea on any given website while requiring almost no change to the existing web code.
In the current project, the information about the predicted content (i.e., the laptop the user is most likely to buy) was not exploited further. Work can be done to make the website interact with the user like a salesman. The website could remove all the products the user would be least interested in and show him only the products he is most likely to buy.
The current implementation involved using only a single machine learning model at a time. Multiple models could be implemented in the webpage, and the strength of the prediction made could also be used to further interact with the user. In case all the different implemented models give the same prediction, it can be assumed to be a strong prediction, and the webpage can adapt accordingly immediately.
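The multiple-model idea above could be sketched along these lines (a hypothetical illustration; no such combiner exists in the project code):

```javascript
// Combine the predictions of several embedded models: the majority product is
// returned, and a unanimous vote is flagged as a strong prediction the page
// can act on immediately.
function combinePredictions(predictions) {
  const counts = {};
  for (const p of predictions) counts[p] = (counts[p] || 0) + 1;
  // Pick the product predicted most often (majority vote)
  const product = Object.keys(counts).reduce((a, b) =>
    counts[a] >= counts[b] ? a : b);
  return { product: Number(product), strong: counts[product] === predictions.length };
}

console.log(combinePredictions([2, 2, 2])); // { product: 2, strong: true }
console.log(combinePredictions([2, 3, 2])); // { product: 2, strong: false }
```

A weaker (non-unanimous) vote could instead trigger a softer adaptation, such as merely highlighting the column.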
Other implementations possible
A shopping portal with intelligent prediction of the product a user is most likely to buy is just one of the many possible implementations of the proposed concept. Some other possible implementations are:
• A Search Engine Feedback System: Current search engines display results as a list of links, each with a small snippet of text relevant to the search. Most users choose links after reading the text snippet associated with each link, and they spend different amounts of time on different links. Current search feedback is based entirely on mouse clicks, which is in a sense binary feedback (either Yes or No). The feedback system could be made more accurate by determining the relative time a user spent on a link compared to the other links.
• News Content Prediction: An online news website shows several news items under different heads on a page, and different users have different priorities for news. Based on a user's mouse movement activity, relevant news content can be shown to him. For example, if a user is spending more time around football and cricket news headlines than around political headlines, then it can be predicted that he is more interested in sports news and, accordingly, the website can be molded for him.
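The graded feedback suggested in the first idea above could be computed along these lines (a hypothetical sketch; the dwell times are invented):

```javascript
// Score each search result by the share of total hover time spent on it,
// giving a graded relevance signal instead of a binary clicked/not-clicked one.
function relativeDwellScores(dwellTimesMs) {
  const total = dwellTimesMs.reduce((a, b) => a + b, 0);
  return dwellTimesMs.map(t => t / total);
}

// Hypothetical dwell times (ms) over three result links
console.log(relativeDwellScores([8000, 1000, 1000])); // [ 0.8, 0.1, 0.1 ]
```

These fractions could then be fed back to the ranking function in place of raw click counts.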
BIBLIOGRAPHY
Aaltonen, Antti, Aulikki Hyrskykari, and Kari-Jouko Räihä. "101 spots, or how do users
read menus?" Conference on Human Factors in Computing Systems, 1998: 132 -
139.
Arroyo, Ernesto, Ted Selker, and Willy Wei. "Usability tool for analysis of web
designs using mouse tracks." Conference on Human Factors in Computing Systems,
2006: 484 - 489.
Atterer, Richard, and Albrecht Schmidt. "Tracking the interaction of users with AJAX
applications for usability testing." Conference on Human Factors in Computing
Systems, 2007: 1347 - 1350.
Atterer, Richard, Monica Wnuk, and Albrecht Schmidt. "Knowing the User’s Every
Move – User Activity Tracking for Website Usability Evaluation and Implicit
Interaction." ACM.
Balabanovic, Marko, Yoav Shoham, and Yeogirl Yun. "An Adaptive Agent for
Automated Web Browsing." 1997.
Byrne, Michael D, John R Anderson, Scott Douglass, and Michael Matessa. "Eye
tracking the visual search of click-down menus." Conference on Human Factors in
Computing Systems, 1999.
Chen, Mon Chu, John R Anderson, and Myeong Ho Sohn. "What can a mouse
cursor tell us more?: correlation of eye/mouse movements on web browsing."
Conference on Human Factors in Computing Systems, 2001.
Dutta, Partha, Sandip Debnath, and Sandip Sen. "A shopper's assistant."
International Conference on Autonomous Agents, 2001.
Edmonds, Andy. "Why the Mouse Doesn't Always Keep Up with the Eye." 2008.
Guo, Qi, and Eugene Agichtein. "Exploring mouse movements for inferring query
intent." Annual ACM Conference on Research and Development in Information
Retrieval, 2008: 1.
Linden, Greg. "Geeking with Greg Exploring the future of personalized information."
Mueller, Florian, and Andrea Lockerd. "Cheese: tracking mouse movement activity
on websites, a tool for user modeling." Conference on Human Factors in Computing
Systems, 2001.
Pazzani, Michael, and Daniel Billsus. "Learning and Revising User Profiles: The
Identification of Interesting Web Sites." Machine Learning 27, no. 3 (1997): 313 -
331.
Rodden, Kerry, Xin Fu, Anne Aula, and Ian Spiro. "Eye-Mouse Coordination Patterns
on Web Search Results Pages." Conference on Human Factors in Computing
Systems, 2008: 5.
Salzberg, Steven L. "C4.5: Programs for Machine Learning." Machine Learning 16,
no. 3 (1994): 235-240.
Torres, Luis A. Leiva, and Roberto Vivo Hernando. "Real time mouse tracking
registration and visualization tool for usability evaluation on websites."
http://smt.speedzinemedia.com/smt/docs/smt_IADIS07.pdf.
Witten, Ian H, and Eibe Frank. Data Mining: Practical machine learning tools and
techniques. San Francisco: Morgan Kaufmann, 2005.
APPENDIX: SOURCE CODE
HTML final webpage

The HTML code of the final website, capable of tracking the user's mouse movements as well as predicting the relevant product for the user, is as follows:
<body onload="start_It();">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td><p class="oce-first"><span class="bold">NOTE:</span> Surf on this page like you do on a
shopping portal comparison page and decide on a model based on its configuration and buy it.
Thanks</p></td>
</tr>
<tr>
<td> </td>
</tr>
<tr>
<td><table width="100%" border="0" align="center" cellpadding="0" cellspacing="0"
onMouseOver="hiliteColumn(event);" onMouseOut="resetColumn(event);" class="one-column-emphasis">
<colgroup class="oce-first" id="na"></colgroup>
<colgroup id="cg2" class=""></colgroup>
<colgroup id="cg3" class=""></colgroup>
<colgroup id="cg4" class=""></colgroup>
<colgroup id="cg5" class=""></colgroup>
<colgroup id="cg6" class=""></colgroup>
<thead>
<tr>
<th onmouseout="movement_out('a0');" onmouseover="movement_in();">Product Name</th>
<th onmouseout="movement_out('a1');" onmouseover="movement_in();">Lenovo IdeaPad Y650
4185</th>
<th onmouseout="movement_out('a2');" onmouseover="movement_in();">HP Pavilion dv7-
1285dx</th>
<th onmouseout="movement_out('a3');" onmouseover="movement_in();">Sony VAIO VGN-P588E</th>
<th onmouseout="movement_out('a4');" onmouseover="movement_in();">Dell Studio XPS 16</th>
<th onmouseout="movement_out('a5');" onmouseover="movement_in();">Toshiba Satellite A205-
S4617</th>
</tr>
</thead>
<tbody>
<tr>
<td class="oce-first" onmouseout="movement_out('b0');"
onmouseover="movement_in();"> </td>
<td onmouseout="movement_out('b1');" onmouseover="movement_in();"><img src="images/1.gif"
width="120" height="90" border="0" /></td>
<td onmouseout="movement_out('b2');" onmouseover="movement_in();"><img src="images/2.gif"
width="120" height="90" border="0" /></td>
<td onmouseout="movement_out('b3');" onmouseover="movement_in();"><img src="images/3.gif"
width="120" height="90" border="0" /></td>
<td onmouseout="movement_out('b4');" onmouseover="movement_in();"><img src="images/4.gif"
width="120" height="90" border="0" /></td>
The JavaScript file
function autoPredict()
{
setTimeout("predict()",10000);
}
function predict()
{
http.open("GET", "predict.php?userId="+userId, true);
http.onreadystatechange = predictResponse;
http.send(null);
}
function predictResponse()
{
if (http.readyState == 4)
{
predictProduct = http.responseText;
var colName=Number(predictProduct)+1;
document.getElementById("cg2").className="";
document.getElementById("cg3").className="";
document.getElementById("cg4").className="";
document.getElementById("cg5").className="";
document.getElementById("cg6").className="";
document.getElementById("cg"+colName).className="oce-predict";
alert("Product : "+predictProduct);
setTimeout("predict()",10000);
}
}
function handleHttpResponse()
{
if (http.readyState == 4)
{
startIt();
}
}
function handleHttpResponseBought()
{
if (http.readyState == 4)
{
alert("Thanks for Participating");
}
}
function start_It() {
if(done==0)
{
setTimeout("sendData()",2000);
}
if(startpredict==0)
{
++startpredict;
autoPredict();
}
}
function sendData() {
if(flag==0)
{
queue2="";
flag=1;
var query_string = "data.php?userId="+userId+"&queue="+queue1;
queue1="";
}
else
{
queue1="";
flag=0;
var query_string = "data.php?userId="+userId+"&queue="+queue2;
queue2="";
}
http.open("GET", query_string, true);
http.onreadystatechange = handleHttpResponse;
http.send(null);
}
function movement_in(){
cellEntryDate = new Date();
}
function movement_out(cell){
cellExitDate = new Date();
time = cellExitDate.getTime()-cellEntryDate.getTime();
if(done==0)
{
if(flag==0)
{
queue1 = queue1+cell+":"+time+"_";
}
else
{
queue2 = queue2+cell+":"+time+"_";
}
}
}
function bought(product){
done=1;
var query_bought = "bought.php?userId="+userId+"&product="+product;
http.open("GET", query_bought, true);
http.onreadystatechange = handleHttpResponseBought;
http.send(null);
}
function getHTTPObject() {
var xmlhttp;
/*@cc_on
@if (@_jscript_version >= 5)
try {
xmlhttp = new ActiveXObject("Msxml2.XMLHTTP");
} catch (e) {
try {
xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
} catch (E) {
xmlhttp = false;
}
}
@else
xmlhttp = false;
@end @*/
if (!xmlhttp && typeof XMLHttpRequest != 'undefined') {
try {
xmlhttp = new XMLHttpRequest();
} catch (e) {
xmlhttp = false;
}
}
return xmlhttp;
}
function hiliteColumn(e) {
var o = (document.all) ? e.srcElement : e.target;
if (o.nodeName != "TD") return;
document.getElementById("cg"+(o.cellIndex+1)).className="over";
}
function resetColumn(e) {
var o = (document.all) ? e.srcElement : e.target;
if (o.nodeName != "TD") return;
document.getElementById("cg"+(o.cellIndex+1)).className="";
}
The CSS file
body {
margin-left: 0px;
margin-top: 0px;
margin-right: 0px;
margin-bottom: 0px;
text-align: left;
}
colgroup.over {
background: #ebeeff;
}
.oce-first
{
background: #d0dafd;
border-right: 10px solid transparent;
border-left: 10px solid transparent;
min-width:199px;
font-size: 14px;
padding: 12px 15px;
color: #039;
text-align:justify;
}
.oce-predict
{
background: #d0dafd;
border-right: 3px solid #F00;
border-left: 3px solid #F00;
border-top: 3px solid #F00;
border-bottom: 3px solid #F00;
min-width:199px;
font-size: 14px;
padding: 12px 15px;
color: #039;
text-align:justify;
}
table.one-column-emphasis
{
font-family: "Lucida Sans Unicode", "Lucida Grande", Sans-Serif;
font-size: 12px;
width: 100%;
border-collapse: collapse;
color: #969;
}
table.one-column-emphasis th
{
font-size: 14px;
font-weight: bold;
padding: 12px 15px;
color: #039;
text-align:center;
}
table.one-column-emphasis td
{
padding: 10px 15px;
color: #669;
border-top: 1px solid #e8edff;
min-width:166px;
text-align:center;
}
table.one-column-emphasis tr:hover td
{
background: #ebeeff;
text-align: center;
}
table.one-column-emphasis tr:hover td:hover
{
color: #039;
background: #94acff;
}
.bold {
font-weight: bold;
}
.italics {
font-style: italic;
}
.oce-first {
text-align: justify;
}
The PHP scripts
data.php
<?php
$queue=$HTTP_GET_VARS['queue'];
$userId=$HTTP_GET_VARS['userId'];
include("connect.php");
$queueArray=explode("_",$queue);
for($i=0;$i<substr_count($queue,"_");$i++)
{
$values=explode(":",$queueArray[$i]);
mysql_query("INSERT into data
values(\"".$userId."\",\"".$values[0]."\",\"".$values[1]."\")");
}
mysql_close($conn);
?>
connect.php
<?php
$dbhost = 'localhost:8889';
$dbuser = 'root';
$dbpass = 'root';
$conn = mysql_connect($dbhost, $dbuser, $dbpass) or die ('Error connecting to mysql');
$dbname = 'MSc';
mysql_select_db($dbname);
?>
bought.php
<?php
$product=$HTTP_GET_VARS['product'];
$userId=$HTTP_GET_VARS['userId'];
include("connect.php");
mysql_query("INSERT into bought values(\"".$userId."\",\"".$product."\")");
mysql_close($conn);
?>
alignData.php
<?php
include("connect.php");
$result=mysql_query("SELECT * FROM `bought`");
while($row = mysql_fetch_array($result))
{
$result_1=mysql_query("SELECT * FROM `data` WHERE `userID`=\"".$row['userId']."\" order by
`cellID`");
$columnNames="";
$values="";
$row_1 = mysql_fetch_array($result_1);
$previous_column=$row_1['cellID'];
$previous_value=$row_1['time'];
while($row_1 = mysql_fetch_array($result_1))
{
if($previous_column==$row_1['cellID']) {
$previous_value+=$row_1['time'];
}
else {
$columnNames=$columnNames.",".$previous_column;
$values=$values.",\"".$previous_value."\"";
$previous_value=$row_1['time'];
$previous_column=$row_1['cellID'];
}
}
$columnNames=$columnNames.",".$previous_column.",product";
$values=$values.",\"".$previous_value."\",\"".$row['product']."\"";
mysql_query("INSERT INTO finalData(userID".$columnNames.") values
(\"".$row['userId']."\"".$values.")");
mysql_query("DELETE from `bought` where `userId` = \"".$row['userId']."\"");
mysql_query("DELETE from `data` where `userID` = \"".$row['userId']."\"");
}
mysql_close($conn);
?>
predict.php
<?php
include("connect.php");
$totalTime=0;
// Compute this user's total recorded time first so each cell time can be
// normalized below (otherwise $totalTime would stay 0 and the divisions fail)
$result_t=mysql_query("SELECT SUM(`time`) FROM `data` WHERE `userID`=\"".$_GET['userId']."\"");
$row_t=mysql_fetch_array($result_t);
$totalTime=$row_t[0];
$result_1=mysql_query("SELECT * FROM `data` WHERE `userID`=\"".$_GET['userId']."\" order by `cellID`");
$columnNames="";
$values="";
$row_1 = mysql_fetch_array($result_1);
$previous_column=$row_1['cellID'];
$previous_value=$row_1['time'];
while($row_1 = mysql_fetch_array($result_1))
{
if($previous_column==$row_1['cellID'])
{
$previous_value+=$row_1['time'];
}
else
{
$$previous_column=$previous_value/$totalTime;
$previous_value=$row_1['time'];
$previous_column=$row_1['cellID'];
}
}
$$previous_column=$previous_value/$totalTime;
decisionTree();
//neuralNetwork();
function decisionTree()
{
$model_DT=0;
if($b5 <= 0.04509)
if($k4 <= 0.013828)
if($v1 <= 0.000362)
if($r0 <= 0.000626)
if($d5 <= 0.003481)
if($d5 <= 0.001586)
if($g4 <= 0.033267)
if($s3 <= 0.004874)
if($u1 <= 0.002108)
if($f1 <= 0.039667)
if($f4 <= 0.028894)
if($i4 <= 0.004699)
if($d2 <= 0.001173)
if($e5 <= 0.001377)
// Similar nested comparisons for the remaining branches of the decision
// tree have been omitted here. The complete code is available online for
// reference.
}
function neuralNetwork()
{
$Node5=(-0.0209449762256399)
 +($a0*0.0120761574490061)+($a1*-0.0174298014185729)+($a2*-0.0175622955697642)+($a3*-0.000798046164731245)+($a4*-0.00566210278243689)+($a5*-0.00257021437573848)
 +($b0*0.0813554156049207)+($b1*-0.0383601651270091)+($b2*0.0315342748963075)+($b3*0.04750940128612)+($b4*0.00444930879229902)+($b5*0.0447743155601993)
 +($c0*0.0127846301489485)+($c1*0.0167829106398277)+($c2*0.0412283962113621)+($c3*0.0647197008365273)+($c4*0.026137495413712)+($c5*0.0292672102649498)
 +($d0*0.0575247995032596)+($d1*-0.0248903478567491)+($d2*-0.0356248056960633)+($d3*0.0131503378763436)+($d4*-0.00943722882163672)+($d5*0.0254130310753136)
 +($e0*0.0953293388209953)+($e1*-0.0358630730881965)+($e2*0.09184645890614)+($e3*0.0879998946588433)+($e4*-0.0210989430518799)+($e5*0.0236328879965554)
 +($f0*0.0521255666178908)+($f1*0.0562279524027289)+($f2*0.0420766593208718)+($f3*0.0219358641315261)+($f4*0.0500915161629286)+($f5*0.0598788090622592)
 +($g0*-0.0106339935340819)+($g1*0.0158371741591566)+($g2*0.0828753056435395)+($g3*-0.0152508552198513)+($g4*-0.00815101349601804)+($g5*0.0268439313590316)
 +($h0*0.070123678107641)+($h1*-0.0147305324346031)+($h2*0.0517135568746786)+($h3*-0.0117294349734072)+($h4*-0.00594235655570873)+($h5*0.0410639065208286)
 +($i0*-0.00105630930040345)+($i1*-0.00543787837624847)+($i2*0.0603755263497366)+($i3*0.0287693595250936)+($i4*0.0554227984526808)+($i5*0.0600355834517169)
 +($j0*0.0186135251521197)+($j1*0.00984875030922667)+($j2*0.0193290574626347)+($j3*0.021484574396215)+($j4*0.0484829773111019)+($j5*0.0233728871769681)
 +($k0*0.0410110073637687)+($k1*-0.00743846515678319)+($k2*0.0446579060767132)+($k3*0.00789530586935209)+($k4*0.0185589336156669)+($k5*0.0178833473514336)
 +($l0*0.0366297156412459)+($l1*0.0297884220860898)+($l2*0.0450253751867714)+($l3*0.0705159823038729)+($l4*0.074643360814636)+($l5*0.049178643898654)
 +($m0*0.00649293306157912)+($m1*0.0235761949995652)+($m2*0.0282972581223614)+($m3*0.00995247757969736)+($m4*0.0635360916248171)+($m5*-0.0185514952082912)
 +($n0*0.0798799834823821)+($n1*-0.0367274799798666)+($n2*0.0461992904934746)+($n3*0.0354383668658634)+($n4*-0.00123240277220675)+($n5*-0.0150807856098709)
 +($o0*-0.0260784636646052)+($o1*0.0553028912171675)+($o2*0.0802089447351997)+($o3*-0.0235601224487924)+($o4*-0.0281363990127924)+($o5*0.0319917291420718)
 +($p0*-0.0257109331590629)+($p1*-0.0279769700636828)+($p2*0.0433907293866429)+($p3*-0.0310545628159805)+($p4*0.0348153094694314)+($p5*-0.00776438719161176)
 +($q0*-0.0069736497593223)+($q1*0.0161811177301145)+($q2*0.0576906924312276)+($q3*0.0441712928131897)+($q4*0.0165528172670987)+($q5*-0.0274805831321372)
 +($r0*0.0120430047036489)+($r1*-0.000892653621313331)+($r2*0.0868045378672117)+($r3*0.0281943074796785)+($r4*0.0670839346752799)+($r5*0.0110772507057164)
 +($s0*0.0214207237015366)+($s1*-0.032511653106313)+($s2*0.0328856849361516)+($s3*0.0313926662260086)+($s4*0.0111177031525771)+($s5*0.0284289901014687)
 +($t0*0.0428425565992686)+($t1*0.0534413420371503)+($t2*0.0244766875457709)+($t3*0.0647078085232812)+($t4*0.0112235270733354)+($t5*0.0097765520400492)
 +($u0*0.0259846759422365)+($u1*-0.0430507927467189)+($u2*0.107107831659775)+($u3*0.0467301403971514)+($u4*0.0571975966844622)+($u5*-0.0079845822250066)
 +($v0*0.0303173561775128)+($v1*-0.0043169837441232)+($v2*0.0866140345320475)+($v3*0.00261036151061667)+($v4*0.00523185366643474)+($v5*-0.0239702999191261);
// Similar code for the remaining 72 nodes has been omitted, as the full
// listing runs to about 40 pages. The complete code is available online
// for reference.
$max=max((1/(1+(1/pow(2.718282,$Node0)))),
         (1/(1+(1/pow(2.718282,$Node1)))),
         (1/(1+(1/pow(2.718282,$Node2)))),
         (1/(1+(1/pow(2.718282,$Node3)))),
         (1/(1+(1/pow(2.718282,$Node4)))));
// The remainder of predict.php has been omitted here; the complete code
// is available online for reference.
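The activation used in predict.php, `1/(1+(1/pow(2.718282,$x)))`, is algebraically the logistic sigmoid 1/(1+e^-x) with a truncated value of e, and the `$max=max(...)` call is a winner-take-all selection over the output nodes. A small Python sketch (an illustrative cross-check, not part of the original appendix; the node values are hypothetical) confirming both:

```python
import math

def sigmoid_as_in_listing(x):
    # Form used in predict.php: 1 / (1 + 1 / e**x), with e truncated to 2.718282
    return 1 / (1 + (1 / math.pow(2.718282, x)))

def sigmoid(x):
    # Standard logistic form: 1 / (1 + e**-x)
    return 1 / (1 + math.exp(-x))

# The two forms agree up to the listing's truncated value of e
for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    assert abs(sigmoid_as_in_listing(x) - sigmoid(x)) < 1e-5

# Winner-take-all over five output nodes, as in the $max=max(...) call;
# since the sigmoid is monotonic, the winner is the node with the largest sum
node_outputs = [0.12, -0.8, 1.4, 0.3, -0.2]   # hypothetical $Node0..$Node4
activations = [sigmoid_as_in_listing(v) for v in node_outputs]
winner = activations.index(max(activations))  # index 2 here
```

Because the sigmoid is strictly increasing, applying it before `max` never changes which node wins; it only maps the scores into (0, 1).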