
Web Content Recommendation using Machine Learning on User Mouse Tracking Data

Sparsh Gupta

Pembroke College | Computing Laboratory
University of Oxford

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

September 2009

ABSTRACT

Websites are becoming more and more dynamic, but not intelligent. Based on certain mouse clicks or user choices, today's dynamic websites can mold themselves, but they cannot predict relevant data intelligently. The data contained in today's websites is growing, and the number of users demanding unique information is also ever increasing. This has created the challenging problem of delivering the right content to every user.

This thesis is an original work concentrating on solving this problem of generating relevant content for each individual user. One of the primary inputs used by the project is the mouse movement behavior of the user. If the website capturing mouse movements is built in such a way that the mouse pointer stays mostly close to the user's point of gaze, then tracking the mouse movement behavior would, in theory, amount to tracking the user's eyes. Based on this mouse movement data, further content can be predicted and personalized for each user using one or more machine learning models.

This thesis proposes a complete methodology for building and implementing such a system. As a proof of concept, an online shopping website has been built, and tests have been conducted which gave a remarkable accuracy of 84.09% when the predictions were compared with the actual needs of the users.

The working demonstration of the project, along with its description, is available online at http://sparshgupta.name/MSc/Project


Keywords: adaptive web, machine learning, mouse movement, gaze point




ACKNOWLEDGEMENT

I am heartily thankful to my supervisor, Dr. Vasile Palade, whose encouragement, guidance, confidence in my idea and support from the initial to the final stage enabled me to develop this project and understand the subject. I am thankful to the Computing Laboratory, University of Oxford, for accepting my proposal and giving me the opportunity to work on this idea. I gratefully acknowledge the support and help of all the volunteers who helped me collect the data for my work.

I would like to thank Prof. Luke Ong and Pembroke College for their co‐operation and readiness to always help me when needed. I would also like to acknowledge the efforts and facilities provided by the staff of the Computing Laboratory Library, the Radcliffe Science Library and the Pembroke College Library.

Lastly, I offer my regards to my parents, my sister and my friends, who always supported me in all respects during the completion of this project.

Sparsh Gupta



TABLE OF CONTENTS

Abstract ........................................................................... ii
Acknowledgement .......................................................... iii
Table of Contents ........................................................... iv
Table of Figures .............................................................. ix

Introduction ........................................................... 1
1.1 A Primer .................................................................................. 1
1.1.1 The World Wide Web ............................................................................... 1
1.1.2 The computer mouse device .................................................................... 2
1.1.3 Eye tracking ............................................................................................... 2
1.1.4 WWW and the missing gap ..................................................................... 3
1.1.5 Tracking mouse pointer to track user's eyes ........................................... 3
1.2 Motivation .............................................................................. 4
1.3 Objectives ............................................................................... 5
1.4 Structure of the dissertation ................................................. 7

Background, Literature review and Project overview .................................................................. 8
2.1 Coordination of mouse and eye movements ........................ 8
2.2 Capturing mouse movements ............................................. 10
2.3 Tracking mouse movement to determine user's behaviour ... 11
2.4 Discussion ............................................................................. 11
2.5 Project overview ................................................................... 12

Data Collection and Pre‐processing ..................... 15
3.1 The initial website ................................................................ 15
3.1.1 Specifications .......................................................................................... 15
3.1.2 Implementation ...................................................................................... 16
3.1.2.1 Webpage Design ............................................................................... 17
3.1.2.2 Database Design .............................................................................. 20
3.1.2.3 Implementing mouse tracking ........................................................ 22
3.1.2.4 Final product bought by the user ................................................... 25
3.1.3 Testing the initial website ...................................................................... 25
3.2 Data collection ..................................................................... 26
3.3 Data compilation and cleaning ........................................... 27
3.3.1 Need and Specifications ........................................................................ 27
3.3.2 Implementation ..................................................................................... 28
3.3.2.1 Data compilation ............................................................................. 28
3.3.2.2 Data cleaning .................................................................................. 30
3.3.2.3 Data normalization ......................................................................... 32

Building machine learning models ..................... 34
4.1 Machine Learning ................................................................ 34
4.1.1 WEKA ...................................................................................................... 35
4.1.2 Why Machine Learning? ........................................................................ 35
4.2 Methods evaluated ............................................................... 35
4.2.1 Decision Tree ......................................................................................... 36
4.2.2 Neural Network ..................................................................................... 36
4.3 Implemented algorithms ..................................................... 37
4.3.1 Decision Tree (C4.5) .............................................................................. 38
4.3.2 Neural Network (Multilayer Perceptron) ............................................ 39
4.4 Model building .................................................................... 39
4.4.1 Decision Tree ......................................................................................... 40
4.4.1.1 Details of the chosen decision tree ................................................ 40
4.4.1.2 Testing the decision tree ................................................................. 45
4.4.1.2.1 Testing on Training Data ......................................................... 45
4.4.1.2.2 Testing by Cross‐Validation (folds 10) ................................... 46
4.4.1.2.3 Discussion ................................................................................ 46
4.4.2 Neural Network ..................................................................................... 47
4.4.2.1 Details of the chosen neural network ............................................ 48
4.4.2.2 Testing the neural network model ................................................. 51
4.4.2.2.1 Testing on Training Data ......................................................... 52
4.4.2.2.2 Testing by Cross‐Validation (folds 10) ................................... 52
4.4.2.2.3 Discussion ................................................................................ 53
4.4.3 Decision Tree Vs Neural Networks ...................................................... 54

Embedding the machine learning models in the website ................................................................. 56
5.1 What and Why? .................................................................... 56
5.2 Specifications ....................................................................... 56
5.3 Implementation .................................................................... 57
5.3.1 Implementing the Decision Tree model ............................................... 59
5.3.2 Implementing the Neural Network model .......................................... 59
5.4 Using model outputs ........................................................... 60
5.5 What next ............................................................................. 62

Testing and Results .............................................. 64
6.1 Testing methodology ........................................................... 64
6.2 Testing for model accuracy ................................................. 64
6.2.1 Testing data collection .......................................................................... 65
6.2.2 Model testing in WEKA using test data ............................................... 66
6.2.2.1 Decision Tree model ....................................................................... 67
6.2.2.2 Neural Network model .................................................................. 68
6.2.3 Discussion ............................................................................................. 69
6.3 Testing time performance of the models ............................ 69
6.3.1 Decision Tree model ............................................................................. 70
6.3.2 Neural Network model ......................................................................... 70
6.3.3 Discussion .............................................................................................. 71
6.4 Results ................................................................................... 71

Conclusion ........................................................... 73
Future Work ............................................................................... 75

Bibliography ......................................................... 78

Appendix: Source Code ........................................ 82
HTML final webpage ........................................................................................ 82
The JavaScript file ............................................................................................. 87
The CSS file ....................................................................................................... 90
The PHP scripts ................................................................................................ 92
data.php ......................................................................................................... 92
connect.php ................................................................................................... 92
bought.php .................................................................................................... 92
alignData.php ................................................................................................ 92
predict.php .................................................................................................... 93


TABLE OF FIGURES

Figure 1: Project outline ................................................................................................. 14
Figure 2: Screenshot of the top half of the developed webpage .................................... 17
Figure 3: Screenshot of the developed webpage ............................................................ 18
Figure 4: Code given to each section of the webpage .................................................... 19
Figure 5: Screenshot with a cell highlighted .................................................................. 20
Figure 6: Database table 'data' ....................................................................................... 21
Figure 7: Database table 'bought' ................................................................................... 21
Figure 8: Parameters used for building the Decision Tree model ................................. 42
Figure 9: Parameters used for building the Neural Network model ............................. 50
Figure 10: Screenshot of the prediction done by the model .......................................... 62

1 INTRODUCTION

This chapter includes a brief overview of a few terms. It then discusses the coordination between eye and mouse movements and how mouse movement data can be used as pseudo eye tracking data. Later, the chapter describes the motivation behind this project and clarifies the objectives of the research and the structure of this document.


1.1 A Primer


This section discusses a brief history of the World Wide Web (WWW), the use of the computer mouse and current eye tracking technology. It then explains how the WWW can be improved by using eye tracking data and how a mouse pointer can be used to collect pseudo eye tracking data.


1.1.1 The World Wide Web


In 1990, CERN launched the world's first website1, which was only a few lines of text and hyperlinks. In the nineteen years since, websites have been completely revolutionized. Plain text is now accompanied by all sorts of rich media, including images, music, videos, animations and colours. Dynamic data from ever‐increasing databases is rapidly replacing the static content of websites. Web servers are now capable of more real‐time computing. Data can not only be shown to a user but also be collected from him easily. Recently, the success of AJAX1 has completely changed the web experience by making it much more interactive and more data driven.

Today, the Internet has changed everything: how we do business, how we study, how we connect with friends and, in general, how we live.

1 CERN, Welcome to info.cern.ch/, http://info.cern.ch/.


1.1.2 The computer mouse device


Most people in the world use a computer pointing device (generally a mouse) to navigate through a website. They click hyperlinks spread across different sections of a webpage, select text or scroll through a long page using a computer mouse. The mouse can safely be called a personal assistant while working on a computer, and especially while browsing a website.


1.1.3 Eye tracking


Eye tracking, or gaze tracking, is the process of measuring the gaze, i.e., keeping track of the point at which a user is looking. Most websites present visual information in the form of text, images, graphics, etc., and almost all the information a user obtains from a website is perceived through his eyes.

Eye tracking, when applied to a website, can be imagined as a method of determining the portion of the screen at which the user is looking. This information can potentially give a fair idea of the sections most relevant to him. The more time a user spends looking at a particular section, reading it or simply viewing it, the more interested he is in that section compared to the others on the same page.

1 W3Schools, Ajax, http://www.w3schools.com/Ajax/.


1.1.4 WWW and the missing gap


Websites have started becoming dynamic by accepting inputs from a user, which are then used to select relevant content or information for him. The kinds of input current websites primarily employ are mouse clicks, key presses, text entered and choices made by the user in the form elements of a page. Conversely, this means that in case the user is not interested in giving any data as input, the website ends up static, without any information on the user's needs.


Eye tracking data, if captured for a general user, could be used extensively to make today's websites more adaptive and intelligent by harnessing knowledge of the user's interests and the information he is most interested in. Without seeking any external data from the user, his interests and needs can be determined based on his eye movements, and he can be served the data he is most interested in.


1.1.5 Tracking mouse pointer to track user's eyes


There has been a lot of research into improving the computing experience for a user by tracking his eye location, but there are a few drawbacks associated with it. Firstly, the tracking equipment is expensive and the user needs to physically wear the tracking gadget. Not everyone using the Internet would want to, or can, wear the tracking equipment, and hence general public websites cannot be made dependent on it. There is also ongoing research into determining the movement of the eyes using a camera device, but as of now the accuracy of determining the gaze position is low; it depends on the movements of the user and the lighting conditions and, most importantly, the user needs to download external software. Because of these limitations of eye tracking methods, there has been research into finding alternatives.


Recently, Googlers Kerry Rodden and Xin Fu proposed in their paper (Rodden, et al. 2008) that mouse movements show potential as a way to estimate what the user has considered before deciding where to click. Other studies have provided a reasonable estimate of the coordination of mouse and eye, especially on a page in which a click is likely to happen. Hence, tracked mouse movements can sometimes be used as pseudo eye tracking data. There are several interface design techniques in Human Computer Interaction with which a website can make sure that, in most cases, the user's mouse pointer is close to his point of gaze. One of the techniques employed in this project is mouse‐over cell highlighting. If the content at the current location of the mouse pointer is highlighted to make it stand out from the rest of the page, then this can almost always ensure that the mouse pointer's movement is synchronized with the area the user is currently reading or gazing at.
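As a minimal sketch of the accounting behind this technique (the `DwellTracker` name, section ids and timestamps below are illustrative, not taken from the project's code), per-section dwell times can be accumulated as the pointer moves between highlighted cells. In the real page, `enter` and `leave` would be called from the `mouseover`/`mouseout` handlers that also toggle the highlight style.

```javascript
// Minimal per-section dwell accounting, assuming mouseover/mouseout events.
// "DwellTracker", the section ids and the timestamps are illustrative.
class DwellTracker {
  constructor() {
    this.totals = {};    // sectionId -> accumulated milliseconds
    this.current = null; // section currently under the pointer
    this.enteredAt = 0;
  }
  // Call from a mouseover handler (which would also add a highlight class).
  enter(sectionId, now) {
    this.leave(now); // close out the previous section, if any
    this.current = sectionId;
    this.enteredAt = now;
  }
  // Call from a mouseout handler (which would also remove the highlight).
  leave(now) {
    if (this.current !== null) {
      this.totals[this.current] =
        (this.totals[this.current] || 0) + (now - this.enteredAt);
      this.current = null;
    }
  }
}

const tracker = new DwellTracker();
tracker.enter('ram', 0);     // pointer enters the RAM cell at t = 0 ms
tracker.enter('cpu', 400);   // moves to the processor cell at t = 400 ms
tracker.leave(1000);         // leaves the page at t = 1000 ms
console.log(tracker.totals); // → { ram: 400, cpu: 600 }
```

Passing timestamps in explicitly, rather than reading the clock inside the tracker, keeps the accounting testable and independent of the browser.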



1.2 Motivation


Many websites do not ask for any explicit input from the user but can still adapt themselves. They primarily use either some geographical information (which can be obtained from the user's IP address) or the browser/operating system specifications to adapt the web content for the user. This adaptation is of course not targeted at an individual user; it is only a broad adaptation to cater to a group of users having similar demographics or preferences. The adaptation of a website can be based on even the smallest bit of information from the user: the more information the website obtains about the user, the better it is capable of adapting to his needs.





The primary medium of interaction between a user and a website is the mouse, and it produces a huge amount of data in the form of mouse movement behaviour. The motivation behind this thesis and project is the existing gap between the demand for more user data to make a website adaptive and the availability of ample data from the user in the form of his mouse movements.


Further, if a website is designed in such a way that, more often than not, the user's mouse pointer movement is synchronized with his point of gaze, as discussed in Section 1.1.5, then this data can also roughly be called eye tracking data.


1.3 Objectives


The objective of this project is to effectively utilize the mouse movement data of a user to make the web content more adaptive for him, by dynamically predicting further relevant content.

In order to achieve this main objective, the following sub‐objectives need to be addressed:


• Collecting the initial training dataset of mouse movement behavior from a large set of users in order to train and build a model. This will involve developing a website with well‐defined areas, sections or elements where mouse movements can be tracked. The website needs to be built such that the user's mouse pointer synchronizes with his point of gaze.

• Asking volunteering users to visit this site and choose or select content for themselves as they would on any other website. The required data is the time spent at each section/element of the page while the user is browsing it. The target (predicted or dependent) variable is the content relevant to him, and hence, in order to train the model, this data point (as collected explicitly from the user) also needs to be saved in the databases.

• Data cleaning and processing, an essential step after data collection. This is because it is important to remove all outliers that can harm the model. The time a user spent at different sections of the webpage should ideally be normalized by the total time spent by him.

• Building machine learning models using the collected mouse movement data as the training and initial testing dataset. The distribution of normalized time spent by the user at each section of the webpage forms the independent variables, and hence the input attributes of the model; the further content for the user is the output of the model, as the dependent variable.

• Embedding the machine learning models back into the website so that they can be put to use. The website would continue tracking the user's mouse movements and would use the built model to compute further content for him in real time.

• Testing the accuracy of the implementation. To do this, the predicted content needs to be compared with the actual content desired by the user.
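The normalization step above can be sketched as a small function that converts raw per-section dwell times into fractions of the total time spent, which is the form the model's input attributes take; the section names here are illustrative only.

```javascript
// Normalize raw per-section dwell times (in ms) into fractions of the
// total time spent on the page. Section names are illustrative.
function normalizeDwell(totalsMs) {
  const total = Object.values(totalsMs).reduce((sum, ms) => sum + ms, 0);
  const normalized = {};
  for (const [section, ms] of Object.entries(totalsMs)) {
    normalized[section] = total > 0 ? ms / total : 0;
  }
  return normalized;
}

// A user who spent 10 s on the page, mostly reading about RAM:
console.log(normalizeDwell({ ram: 4000, cpu: 3000, hdd: 2000, other: 1000 }));
// → { ram: 0.4, cpu: 0.3, hdd: 0.2, other: 0.1 }
```

Normalizing by total time makes users comparable regardless of how long each one browsed, which is why it is listed as part of the data cleaning step.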


To demonstrate these objectives, a sample shopping webpage has been developed. This webpage contains a comparison of the specifications of five laptop models. Based on the mouse movement behavior of a user across this page, the best laptop is recommended to him.


This can be visualized as follows: if a user's browsing pattern shows that he is spending, say, 40% of his time reading about the RAM of the laptops (with a further distribution of time spent on the different RAM sizes of the different models), 30% reading about the processors, 20% about the hard disk drives and the remaining 10% similarly reading about other specifications, then, based on this data and the developed machine learning model, the most suitable laptop can be recommended to him. The accuracy of the recommendation can be checked by comparing the product finally bought by the user with the product recommended by the website.
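The accuracy check described here amounts to a simple hit rate over recorded sessions, which can be sketched as follows; the session records below are illustrative, not the thesis's actual test data.

```javascript
// Percentage of sessions where the recommended laptop matches the one
// the user actually bought. The records are illustrative.
function recommendationAccuracy(records) {
  if (records.length === 0) return 0;
  const hits = records.filter(r => r.recommended === r.bought).length;
  return (100 * hits) / records.length;
}

const sessions = [
  { recommended: 'laptopA', bought: 'laptopA' },
  { recommended: 'laptopC', bought: 'laptopB' },
  { recommended: 'laptopD', bought: 'laptopD' },
  { recommended: 'laptopB', bought: 'laptopB' },
];
console.log(recommendationAccuracy(sessions)); // → 75
```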


1.4 Structure of the dissertation


This document starts by giving an idea of the related research being done across the globe. It then explains the complete implementation outline as a big picture of the project. In Chapter 3, the thesis discusses the methodology for collecting the initial training data, including a complete description of the development of the initial website. It explains the process of collecting data, along with the structures of the databases and the data cleaning procedure. Chapter 4 gives the details of the machine learning models built and the procedure involved, along with the testing results of the models obtained on the training data. Chapter 5 explains the procedure adopted to implement the built models in the website and the details of the AJAX communication link between the model, the data and the website. The thesis then explains the methodology used to collect testing data, the testing methodology and the results obtained on the models. The thesis closes with some conclusions and the author's view on the possibility of future work. The attached appendix contains all the source code.

The working demonstration of the project, along with its documentation and the GNU General Public License source code, is available online at http://sparshgupta.name/MSc/Project


2 BACKGROUND, LITERATURE REVIEW AND PROJECT OVERVIEW

This chapter explains previous work, related to the problem, being carried out around the world. The chapter is divided into sections describing the independent and combined work done under each heading. It then summarizes the ongoing work and presents an overview of the project carried out by the author.



The work done in this project is an original idea, and there is no record of prior work using the same methodology. The problem has been tackled to some extent and has been considered by a few research groups, but their methodologies and final conclusions were very different from what is proposed in this thesis. The following parts of the chapter highlight some of the recent developments and work done in related fields.


2.1 Coordination of mouse and eye movements


The prime question, whether mouse tracking can substitute for, or at least partially replicate, eye tracking, is an active one.



(Chen, Anderson and Sohn 2001) studied the relationship between the gaze position of a user and his cursor position on a computer screen during web browsing. They conducted tests on several websites, recorded the eye and mouse movements of the users and studied them separately. They concluded that there is a strong relationship between gaze position and cursor position, and also that there are regular patterns of coordination. They further argued that a mouse could provide more information than just x and y coordinates, which could be used to design better interfaces for human computer interaction. They wrote in their conclusion that "Our data show that the dwell time of cursor among different regions has strong correlation to how likely a user will look at that region. Also, in over 75% of chances, a mouse saccade will move to a meaningful region and, in these cases, it is quite likely that the eye gaze is very close to the cursor. This result implies that, by predicting the users' interests on web pages, mouse device could be a very good alternative to an eye‐tracker as a tool for usability evaluation."


According to work done at Google Labs (Rodden, et al. 2008), several different patterns of coordination between the eye and the mouse pointer were observed on a web search results page. The behavior patterns identified as indicating active usage were: following the eye horizontally, following the eye vertically and marking a particular result. This work was done entirely on a search results page, but it clearly concludes that coordination between a user's eye and his mouse pointer exists.


There have been further studies, (Byrne, et al. 1999) and others, on the relationship and coordination between eye movements and mouse movements on the web. They found that some users will use the mouse pointer to help them read the page, or to help them make a decision about where to click. It was concluded that, given an intent or opportunity to click in the current user activity, the mouse is much more likely to be close to the eye. Eye tracking can provide insights into users' behavior while using a search results page, but eye‐tracking equipment is expensive and can only be used for studies where the user is physically present. The equipment also requires calibration, adding overhead to studies. In contrast, the coordinates of mouse movements on a web page can be collected accurately and easily, in a way that is transparent to the user. This means mouse tracking can be used in studies involving a number of participants working simultaneously, or remotely via client‐side implementations, greatly increasing the volume and variety of data available.

There is a basic rationality that states: "If I might click, I might as well keep the mouse close to my eyes." Where there is no potential to click, either because the user is in an evaluative mode or the content of interest is devoid of links, the mouse and eye diverge.


2.2 Capturing mouse movements

There can be several different methodologies for capturing the mouse movement behavior of a user over a webpage, depending primarily on the type of data required and the mouse movement expected. (Arroya, Selker and Wei 2006) proposed a tool that needs no installation and is capable of tracking a user's mouse movements. This mouse movement data can be visualized in an inbuilt system and can be used to further refine the usability of the webpage. They have, however, not proposed any methodology to automatically refine the webpage.


(Edmonds, et al. 2007) discusses techniques and uses of mouse tracking on a website, but entirely from a usability point of view. It handles the capturing of a user's mouse movement data in a more detailed way, recording the coordinates and the row and column IDs along with many other parameters. This methodology was found to be effective but was not directly relevant to the current problem.


The paper (Torres and Hernando, Real time mouse tracking registration and visualization tool for usability evaluation on websites n.d.) proposes a methodology to track mouse movements on a webpage and visualize them in a tool that the authors have developed. They used HTML and AJAX and proposed a method to link the mouse movements with the server logs and web‐stat data to obtain additional information about the user's behavior.


2.3 Tracking mouse movement to determine users' behaviour


There was a famous project named 'Cheese' done at MIT (Mueller and Lockerd 2001), which extended the conventional web interface user model (based on responses to mouse clicks only) to account for all mouse movements on a page as an additional layer of information for inferring user interest. They developed a straightforward way to record all mouse movements on a page, and conducted a user study to analyze and investigate mouse behavior trends, finding certain mouse behaviors common across many users. They also proposed that there are certain categories of mouse behavior and that, after tracking them, the website could be molded accordingly.


2.4 Discussion


It was found after the literature review that a lot of work has been done to prove and support the coordination of the eye and mouse movements of a user on a website. Eye tracking data has been used by Google to improve the usability of their search pages. There are several ongoing discussions on the effective use of eye or mouse tracking data to manually refine the content and usability design of a webpage.


It was, however, found that no work has been done on using mouse tracking data in a machine learning model to automatically refine or predict content on a website for a user based on his mouse movement or eye movement behavior.





2.5 Project overview


The project undertaken can be stated as a proposed method to automatically refine or predict the contents of a webpage for a user, based on his mouse movement behavior. From earlier studies, as stated in Section 2.1, it has been assumed that there is certainly some coordination between a user's eye movement and his mouse movement. Based on the mouse movements of an individual user, his preferences for content and his needs can be predicted, and this information can further be used by the owners of the website. If not the owners, this information can definitely help the user in finding the right content for himself.


To do this, the first task was to devise a methodology to track a user's mouse movements on a webpage. There can be several ways in which tracking could be done, and further, there can be several different data points that can be saved for a user based on his mouse movements. This thesis proposes a method to track the time spent by a user in every section of a webpage. Several JavaScript functions were written, and modifications were made to a standard website to enable mouse tracking in a hidden layer. AJAX was used to connect the JavaScript functions with the server-end PHP scripts, which were further connected to MySQL databases for storing the data. To demonstrate all this, a new dummy website imitating a shopping portal was developed. Once the website was developed with mouse tracking capabilities, it was made available to the public for two weeks. This was done to collect some initial data on users' mouse movement behavior. The data collected was processed and cleaned before analyzing and modeling it. This complete step of initial website development and data collection is explained in detail in Chapter 3 (Data Collection and Pre-processing).


It was then required to study and analyze the collected data and build a model on it so that it could be used in the future for new visitors. To do this, WEKA was used and different types of models were built. The models took as independent variables the times spent by the mouse pointer in different sections of the webpage, and predicted the relevant content for the user as the dependent variable. They were all built and trained on the initially collected data and were tested on the same training data. After several iterations, two models, one based on a Decision Tree and the other on a Neural Network, were obtained that gave significant accuracy on the training data. The complete model-building phase of the project, along with the test results obtained, is explained in Chapter 4 (Building machine learning models).


Once the two models (one Decision Tree and one Neural Network) were obtained, the task was to embed them both into the initial website. This was necessary so that the built models could be used for future visitors and the content relevant to them could be predicted based on their mouse movement activities. The models were coded in PHP on an Apache server and were connected with the front-end HTML page using AJAX. The PHP script was made to read the real-time mouse movement data of a given user directly from the MySQL databases and execute the model on it to predict further content for him. The whole procedure is explained in detail in Chapter 5 (Embedding the machine learning models in the website).


After embedding the two models into the website, volunteers were again asked to visit the website. This time, not only were the user's mouse movements captured, but he was also recommended appropriate content based on one of the two machine learning models. The mouse movement data was saved in the MySQL databases to be analyzed for accuracy later. This step is explained in Chapter 6 (Testing and Results).


The collected data was used as the test dataset and the two models were evaluated on their accuracy as well as their time performance. It was found that, under the present limitation of lack of data, the Decision Tree model edged out the Neural Network model on both the accuracy and the time performance fronts. The details of this step are given in Chapter 7 (Conclusion).


The whole project can be outlined as follows:


1. Building the initial website capable of tracking the mouse movements of the visitors.
2. Asking volunteering users to visit the website and capturing their mouse movements; cleaning and compiling the collected data.
3. Using the captured mouse movement data of the users, building and training machine learning models.
4. Coding the obtained machine learning models back into the website.
5. Collecting the test dataset from the final website, which is now capable of recommending the appropriate content for a user based on his mouse movement behavior.
6. Testing the accuracy of the built models using the collected test data, and evaluating the time performances of the models on the web server.

Figure 1: Project outline




3 DATA COLLECTION AND PRE-PROCESSING

This chapter will explain the complete training dataset collection steps. This involves details of the initial website developed and an explanation of the steps followed to obtain the required training data from it. Later, this chapter will explain the data compilation and cleaning steps performed on the initially collected data.


3.1 The initial website



To analyze the mouse movement behavior of users on a webpage, the first step would be the development of the website under consideration. Since the proposed method of analyzing and modeling the data is machine learning, some initial training data is also required. To cater to both needs, a dummy website capable of tracking the user's mouse movements was built and made public. The website was kept live until the required data had been collected. The specifications and details of the implementation are as follows:


3.1.1 Specifications


The functionalities, requirements and specifications of the initial webpage built are:


• The user interface design of the initial webpage needs to be exactly the same as that of the required final website. This is important because a user's mouse movements depend on the interface of the webpage. It is necessary that the data collected to build and train the machine-learning model is from the same webpage where the model is finally required to be implemented.


• The mouse tracking needs to be implemented in a hidden layer so that the user can experience the web in the same rich way without any compromise on speed or performance. He should not be asked for any explicit information at any time.


• As stated in Section 1.3, the webpage developed was a dummy shopping portal showing five laptop models, comparing them on their configurations.


• There were 5 laptops with 22 attributes each. There was an empty (no laptop) specification-heading information space on the left-hand side of the page. The total number of sections in the built page was (5 + 1) × 22 = 132, where 5 is the number of laptops, 1 is for the specification-heading category (the no-laptop space) and 22 is the count of attributes per laptop.


• Each of these 132 sections of the webpage gets highlighted as soon as the mouse pointer reaches it. This ensured that the user is most likely to read the highlighted section of the webpage, and hence that the user's mouse pointer stays close to his point of gaze. This step ensured that the mouse movement data provides pseudo eye-tracking data of the user. The cell-highlighting feature was implemented using Cascading Style Sheets, where the cell color is changed as soon as the mouse pointer enters the cell.


• A MySQL database was connected for recording the mouse pointer time on each section of the webpage. The final product bought by each user was also saved in the database.


3.1.2 Implementation


The webpage was developed in HTML using PHP as the server-side scripting language. JavaScript and AJAX were used to dynamically transfer data from the HTML fields to the PHP scripts. The database was designed in MySQL, and PHP scripts were written to connect and transfer data between MySQL and the Apache server.


3.1.2.1 Webpage Design


The webpage was designed in HTML in a tabular format with 6 columns and 22 rows. Column 1 had the headings of the specifications, the other 5 columns had the specifications of each laptop, and every row held one specification. Each of the 132 cells thus obtained in the table corresponded to an independent variable for the model (an input variable). A screenshot of the top half of the developed page is shown in Figure 2 and a screenshot of the complete webpage is shown in Figure 3.



Figure 2: Screenshot of the top half of the developed webpage


It can be seen clearly that there are 6 columns and 22 rows on the webpage, and hence 132 cells. Since each of these cells is an input variable to the model, they were all given a code. Each laptop was given a number from 1 to 5 and the specification-heading space was given the code 0. Each specification was given an alphabetic code from 'a' to 'v'. Hence, each of the 132 sections of the webpage got as its code the combination of the alphabet of the specification and the number of the laptop, like a0, a1, a2, a3, a4, a5, b0, b1, b2, …, v3, v4, v5. The coding methodology for the first few cells is shown in Figure 4. These codes were not added anywhere on the webpage but were only used while calling the mouse tracking functions, as will be explained in subsequent sections.
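The coding scheme above can be reproduced programmatically. The following sketch (an illustrative reconstruction, not taken from the thesis code) enumerates all 132 codes:

```javascript
// Enumerate the 132 cell codes: one letter 'a'..'v' per specification
// row (22 rows) combined with a column number 0..5, where 0 is the
// specification-heading column and 1..5 are the five laptops.
function makeCellCodes() {
  var codes = [];
  for (var c = "a".charCodeAt(0); c <= "v".charCodeAt(0); c++) {
    for (var col = 0; col <= 5; col++) {
      codes.push(String.fromCharCode(c) + col);
    }
  }
  return codes;
}
```

Calling makeCellCodes() yields the sequence a0, a1, …, v5 in exactly the order described above.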



Figure 3: Screenshot of the developed webpage





Figure 4: Code given to each section of the webpage


To make sure that in most cases the user's mouse pointer is close to his point of gaze, a Cascading Style Sheet was attached to the HTML webpage. The CSS file had two different style formats that could be applied to each cell. One of the styles was the normal white background, whereas the other format had a blue background to enable cell highlighting. As soon as the mouse enters a cell, the normal style is replaced by the highlighting style for that cell. This is reset as soon as the mouse leaves the highlighted cell. Similarly, the row and the column in which the mouse pointer is currently present are also highlighted in a light shade of blue. The CSS code of the different styles is available in the appendix of this thesis. A screenshot with cell 'g2' highlighted is shown in Figure 5.
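The switch between the two CSS styles can be sketched as a small JavaScript helper. This is a hypothetical reconstruction (the class names "highlight" and "normal" are assumptions; the thesis's actual CSS is in its appendix):

```javascript
// Swap a cell between the normal and the highlighted CSS class,
// mirroring the enter/leave behavior described above. 'cell' is any
// object with a className property (a <td> element in the real page).
function toggleHighlight(cell, entering) {
  cell.className = entering ? "highlight" : "normal";
  return cell.className;
}
```

In the page itself, this logic would run from the cell's mouse-enter and mouse-leave events, with `cell` being the `<td>` DOM element.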


For every visitor of the website, a unique user id was generated as soon as the page loaded. To keep the user id simple, it was set to the current JavaScript time value at page load. The JavaScript time function returns the current time in milliseconds since January 1, 1970. This ensured that, in the current scope of the project, all visiting users would have a unique user id. The JavaScript code to generate the user id is:




var userId = new Date();
userId = userId.getTime();

A JavaScript file named 'mouseover.js' was associated with this webpage, containing several JavaScript variables and functions required to track and record mouse movements. The HTML code of the website was also given an 'onload' event to call a JavaScript function named 'start_It()', which triggers the mouse tracking functionality of the website.


<body onload="start_It();">

The algorithms for mouse tracking and the complete implementation will be explained later, after the details of the database design.



Figure 5: Screenshot with a cell highlighted


3.1.2.2 Database Design


A database was created in MySQL with two tables, namely 'data' and 'bought'. The attributes of the two tables are:





Figure 6: Database table 'data'



Figure 7: Database table 'bought'


Table: data

• userID: to record the user id of the user
• cellID: to save the cell ID that was assigned to each subsection of the webpage
• time: contains the time in milliseconds spent in that cell


Table: bought

• userID: to record the user id of the user
• bought: to save the code of the final product bought by the user


The table 'data' would save the time spent by a user in each cell, i.e. each section of the webpage. There can be 132 different sections / cell IDs for each user, and each of them can appear multiple times. The time spent in each section by a user will be an independent variable for the model.


The table 'bought' is made to record the final product selected by the user. The attribute 'userID' in both tables is a foreign key and is the primary key in the 'bought' table.




The rationale behind such a design was to implement database normalization so that all data repetition could be avoided. Also, the insert queries would be simple and short, and hence efficient, and would not slow the webpage while tracking the mouse and interacting with the database simultaneously. The only drawback of such a design is that the data needs merging before it can be used for training the model.


3.1.2.3 Implementing mouse tracking


Each of the 132 cells of the webpage had JavaScript 'onmouseover' and 'onmouseout' event statements. OnMouseOver specifies that the 'movement_in()' JavaScript function be called every time the mouse comes over that cell. OnMouseOut similarly specifies that the 'movement_out('cellID')' JavaScript function be called when the mouse pointer leaves the cell. The code snippet demonstrating these function calls is:


<td onmouseout="movement_out('c1');" onmouseover="movement_in();"></td>


As soon as the mouse pointer enters a cell, the current DateTime is recorded in a temporary variable named 'cellEntryDate' in the function 'movement_in()'. This function is not passed any attribute. As soon as the mouse pointer exits a cell, the time spent in that cell in milliseconds is calculated in the function 'movement_out('cellID')' by subtracting 'cellEntryDate' from the current DateTime. The movement_out() function is also passed the unique two-character cell code to record the cell ID. The time spent in the cell, along with the cell ID, is concatenated onto the data queue variable named 'queue1' or 'queue2'. The JavaScript function definitions are as follows:


function movement_in() {
    cellEntryDate = new Date();
}

function movement_out(cell) {
    cellExitDate = new Date();
    time = cellExitDate.getTime() - cellEntryDate.getTime();
    if (done == 0) {
        if (flag == 0) {
            queue1 = queue1 + cell + ":" + time + "_";
        } else {
            queue2 = queue2 + cell + ":" + time + "_";
        }
    }
}


The 'done' variable in the above code checks whether the current user is still active and has not already bought a product. 'Flag' is a variable to check which queue variable is currently available.


Two instances of the queue variable were made to ensure that while one queue's data is being transmitted to the server via AJAX, the other queue variable can record the cell movements. This is of great importance especially when the Internet bandwidth is low and data transfer, in the worst case, can take a lot of time. This step also ensures that the interaction experience of the user will not be affected while mouse tracking is going on in the background.


As stated above, the built website had an 'onload' JavaScript event calling a function named 'start_It()'. The start_It() function is a recursive function which calls the 'sendData()' function every 2 seconds. The sendData() function contains the AJAX statement to transfer the generated user ID (variable 'userID') and one of the queue variables, 'queue1' or 'queue2', to the 'data.php' file on the backend server. The self-explanatory JavaScript function definitions are as follows:




function start_It() {
    if (done == 0) {
        setTimeout("sendData()", 2000);
    }
}

function sendData() {
    var query_string;
    if (flag == 0) {
        queue2 = "";
        flag = 1;
        query_string = "data.php?userId=" + userId + "&queue=" + queue1;
        queue1 = "";
    } else {
        queue1 = "";
        flag = 0;
        query_string = "data.php?userId=" + userId + "&queue=" + queue2;
        queue2 = "";
    }
    http.open("GET", query_string, true);
    http.onreadystatechange = handleHttpResponse;
    http.send(null);
}



The 'sendData()' JavaScript function uses standard AJAX calls and the standard 'http' open, onreadystatechange and send functions. The 'query_string' variable contains the PHP file to which the arguments are passed via the GET method.


The 'data.php' file was coded such that it takes the queue variable as sent by the JavaScript 'sendData()' function and explodes the string to extract the various cell IDs and the time values associated with them. It then opens a connection to the MySQL database and inserts records with the cell information into the 'data' table using the received user ID. The complete code of the 'data.php' file is available in the appendix of the thesis.
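The parsing step performed on the server can be illustrated in JavaScript (the real 'data.php' does this in PHP with explode(); this is a hypothetical equivalent of the string-splitting step only):

```javascript
// Split a queue string such as "c1:340_d2:120_" into records of
// {cell, time}. Pairs are separated by '_' and each pair has the
// form 'cellID:milliseconds', matching the format built up in the
// movement_out() function.
function parseQueue(queue) {
  var records = [];
  var parts = queue.split("_");
  for (var i = 0; i < parts.length; i++) {
    if (parts[i] === "") continue; // skip the trailing separator
    var pair = parts[i].split(":");
    records.push({ cell: pair[0], time: parseInt(pair[1], 10) });
  }
  return records;
}
```

Each resulting record corresponds to one row inserted into the 'data' table for the current user ID.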




3.1.2.4 Final product bought by the user


Once the user had browsed through the webpage and scrolled over the table reading about the various configurations of the five laptops, giving us one case of the training data, he was required to select one of the products. This is to simulate the actual shopping portal scenario where a person reads about various products and finally buys one of them. To select a product, he performs a mouse click on the 'Buy Now' button associated with the product, as shown in Figure 3.


As soon as any 'Buy Now' button on the webpage is triggered by the user, a JavaScript function named 'bought('ProductID')' is invoked. This function uses the AJAX protocols and sends the userID and the ID of the clicked product to the 'bought.php' file on the server. The 'bought.php' file on the web server connects to the MySQL database and inserts this information as a row in the 'bought' table. The complete code of the PHP script 'bought.php' is available in the appendix, and the JavaScript function is as follows:


function bought(product) {
    done = 1;
    var query_bought;
    query_bought = "bought.php?userId=" + userId + "&product=" + product;
    http.open("GET", query_bought, true);
    http.onreadystatechange = handleHttpResponseBought;
    http.send(null);
}


Once the user selects a product, further mouse tracking is disabled. This is done by changing the value of the JavaScript 'done' variable.


3.1.3 Testing the initial website


The website, once completed, was hosted on a public web server and was tested thoroughly for bugs and errors. The main points in the checklist were:




• The queue variables ('queue1' and 'queue2') in the JavaScript file record the cell ID and time appropriately, and the data is extracted accurately at the server.

• Data is sent properly from the frontend JavaScript functions to the backend PHP files via AJAX.

• The link between the database and the PHP files works correctly.

• Both tables in the database receive data and insert it properly without any error.


3.2 Data collection


When the website as explained in the previous section had been developed and tested completely, it was made open to the general public. Volunteers were invited via email and social media to visit the webpage. The selection of the volunteers was completely random and was primarily the contact group of the author. All the volunteers / visitors were asked to browse the webpage and buy a product on it (at zero cost, virtually), similar to the way they would on a real shopping site. From this sample, the initial training data for the model was collected and saved into the database as explained in the previous section. No personal information or any other data was asked of any visitor.


The duration of this step depends on the requirements of the initial training data for building the model. The more sections in the website, i.e. the more independent variables in the model, the more cases of initial training data are required to build a relevant model.


In a short span of 14 days, 292 unique visitors accessed the webpage. 244 rows were collected in the 'bought' table and 16401 tuples were saved in the 'data' table. The expected number of users was around 350-400, but due to the lack of visibility of the project and no compensation being available to the volunteers, this number could not be reached. In view of the time, the website was taken off and the data was exported for further analysis and cleaning.


3.3 Data compilation and cleaning


3.3.1 Need and Specifications


The collected data in the two tables needs to be merged in such a way that each row of the new table corresponds to a single user and contains all information about him, i.e. each row is one case of the training data. Each case would include the times spent in all 132 sections of the webpage, along with the user id and the product finally bought by the user. This is also the format required to train a machine-learning model in WEKA.



Moreover, the collected data needs to be analyzed properly and checked for any errors. There might be some users who did not provide information on the actual product bought, and hence the data related to them needs to be scrapped. Some users are likely to have spent almost no time, as they might have accidentally visited the webpage; hence all users spending less than some calculated threshold time need to be scrapped. Similarly, users waiting on a section for more than a certain fixed time should be removed. These steps are important to ensure that there are no outliers in the collected data and that the model built and trained on this data is best suited for general usage on the website.


Since the absolute time spent on different elements of the webpage depends on a number of other factors, primarily the speed of the individual user, the data needs to be normalized. Dividing the time spent by a user on an individual section by the total time spent by that user on the website gives the proportion of time spent by him reading that section of the webpage.


Hence the final training data should only contain valid user responses of the product bought, along with the normalized breakup of the time spent by them on the various sections of the webpage.


3.3.2 Implementation


First, all the data needs to be compiled into a single table as stated above, and then it needs to be cleaned.


3.3.2.1 Data compilation


A new PHP script named 'alignData.php' was written to compile the data into a more usable format. This file writes all the data to a new table named 'finalData' with the following 134 attributes: 132 corresponding to the times spent in the 132 sections of the webpage (independent variables), 1 to record the userID of the user, and 1 to save the code of the final product bought (target / dependent variable). The final product bought is the predicted variable in our model, which shall be discussed in the next chapter. The attributes of the 'finalData' table are:


• userID: to record the user id of the user
• a0: time in milliseconds spent in cell 'a0' of the webpage
• a1: time in milliseconds spent in cell 'a1' of the webpage
• a2: time in milliseconds spent in cell 'a2' of the webpage
• a3: time in milliseconds spent in cell 'a3' of the webpage
• a4: time in milliseconds spent in cell 'a4' of the webpage
• a5: time in milliseconds spent in cell 'a5' of the webpage
• b0: time in milliseconds spent in cell 'b0' of the webpage
• b1: time in milliseconds spent in cell 'b1' of the webpage
• b2: time in milliseconds spent in cell 'b2' of the webpage
• … (similarly from 'b3' to 'u3')
• u4: time in milliseconds spent in cell 'u4' of the webpage
• u5: time in milliseconds spent in cell 'u5' of the webpage
• v0: time in milliseconds spent in cell 'v0' of the webpage
• v1: time in milliseconds spent in cell 'v1' of the webpage
• v2: time in milliseconds spent in cell 'v2' of the webpage
• v3: time in milliseconds spent in cell 'v3' of the webpage
• v4: time in milliseconds spent in cell 'v4' of the webpage
• v5: time in milliseconds spent in cell 'v5' of the webpage
• bought: to save the code of the final product bought by the user


The 'alignData.php' file selects all the responses stored in the tables 'data' and 'bought' and saves them in the table 'finalData'. The attribute 'userID' is the primary key of the table. The algorithm implemented in the 'alignData.php' file was:


1. Select a list of unique users from the table 'data'.
2. For each user with id 'userID', do:
   a. Select all the data (cell IDs and the associated times) corresponding to that user from the table 'data'. Use the SUM aggregate function in SQL on time and group the rows by cell ID.
   b. This gives the total time spent on each visited cell, i.e. each section of the webpage visited by that user.
   c. The time spent on all other cells, i.e. the sections not visited by that user, is set to zero.
   d. Insert all the time values for each cell into the 'finalData' table along with the user's userID.
   e. Select the final product bought by the user using a select statement on the table 'bought'. In case the user has not bought any product, i.e. the output from the 'bought' table for that user is empty, assign him product number 0.
   f. Update the 'finalData' table by inserting the value of the 'bought' field corresponding to that user.
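The merge performed by the steps above can be sketched in JavaScript over in-memory rows (the actual implementation is SQL inside 'alignData.php'; the record shapes used here are assumptions for illustration):

```javascript
// Merge rows of the 'data' table ({userID, cellID, time}) and the
// 'bought' table ({userID, bought}) into one record per user: times
// are summed per cell and 'bought' defaults to 0 when the user never
// selected a product. The real script also writes explicit zeros for
// all 132 cell columns; unvisited cells are simply absent here.
function alignData(dataRows, boughtRows) {
  var users = {};
  for (var i = 0; i < dataRows.length; i++) {
    var row = dataRows[i];
    if (!users[row.userID]) {
      users[row.userID] = { userID: row.userID, times: {}, bought: 0 };
    }
    var u = users[row.userID];
    u.times[row.cellID] = (u.times[row.cellID] || 0) + row.time;
  }
  for (var j = 0; j < boughtRows.length; j++) {
    var b = boughtRows[j];
    if (users[b.userID]) users[b.userID].bought = b.bought;
  }
  return users;
}
```

Each value in the returned map corresponds to one row of the 'finalData' table.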


After successful execution of this algorithm in the 'alignData.php' script, the 'finalData' table contained all the data collected from the initial website in a tabular manner, with each row corresponding to a unique user. This data could now be used directly for model building in WEKA, but it needed some cleaning first.


The 'data' table had a total of 16401 tuples with 292 unique users, whereas the 'bought' table had 244 tuples. After executing the above script, the total number of tuples in the 'finalData' table was 292. Out of these 292 tuples, 48 (292 minus 244) users were those who left the site without selecting any product. The table 'finalData' was then exported in a spreadsheet format (Microsoft Excel) for analysis, visualization and cleaning.


3.3.2.2 Data cleaning


On the obtained 292 rows of data in Excel, the next task was the data cleaning stage. This step removes all the outliers and other cases that can harm the training of the model and eventually the model itself. There can be multiple reasons behind the occurrence of such unwanted cases in the initial dataset, such as non-serious respondents, accidentally entering the webpage and closing it immediately, accidentally pressing the enter key, leaving the computer with the website open while working on something else, etc.


The following steps were followed to clean the collected data:


• All the tuples where the value of the attribute 'bought' is 0, i.e. the user has not bought any product, were deleted. This was because the objective of the project is to select the best product for a user, and hence the training set should only contain users who have bought a product. Training the model on data predicting that the user would not buy would make the model inappropriate for use in the current project. There were a total of 48 such tuples where the bought product value was 0. The number of tuples in the remaining data was 244, each corresponding to a unique visitor. All 244 users had bought a product (the dependent variable is not 0).


• The total time spent by each user was calculated using the built-in Excel sum function, and the distribution of the total time spent by different users on the built webpage was studied. It was found that the average time spent by a user on the webpage was 33.08 seconds. The maximum time spent by a user was 225.8 seconds, whereas the minimum was 1.2 seconds.



• The
 minimum
 and
 the
 maximum
 time
 spent
 by
 any
 user
 were
 analyzed
 to
 find

the
outliers.
Since
the
minimum
time
in
the
current
data
is
much
lower
than
the

expected
minimum
time
any
serious
volunteer
would
spend,
a
threshold
value
of

8
seconds
was
selected.
The
maximum
time
of
225.8
seconds
was
found
feasible

and hence no upper limit was applied.



This value of 8 seconds was judged feasible keeping the webpage design in mind. It was assumed that any user taking less than 8 seconds on that webpage had given incorrect data and would be considered an outlier.

There
were
44
users
who
spent
less
than
8
seconds
on
the

initial
 website
 while
 giving
 training
 data
 for
 model
 building.
 Rows

associated
 with
 all
 44
 users
 were
 deleted
 from
 the
 collected
 sample

leaving
the
sample
size
to
200
tuples.

The
 average
 time
 spent
 by
 a
 user
 became
 40.26
 seconds
 and
 the

minimum
time
spent
by
a
user
in
the
new
dataset
became
8.3
seconds.
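The two cleaning rules described above — drop visitors who bought nothing and drop visits below the 8-second threshold — can be sketched as a small filter. This is only an illustrative sketch with hypothetical field names; the actual cleaning was done in Excel.

```python
# Illustrative sketch of the cleaning rules (field names are hypothetical).
THRESHOLD_SECONDS = 8.0

def clean(rows):
    """rows: list of dicts with a 'bought' product code and per-section times."""
    kept = []
    for row in rows:
        if row["bought"] == 0:            # non-buyer: not useful for training
            continue
        total = sum(row["times"])          # total time spent on the page
        if total < THRESHOLD_SECONDS:      # too fast to be a serious visit
            continue
        kept.append(row)
    return kept

sample = [
    {"bought": 2, "times": [10.0, 15.0, 8.3]},   # kept
    {"bought": 0, "times": [20.0, 14.0, 6.0]},   # dropped: no purchase
    {"bought": 3, "times": [1.0, 0.2, 0.0]},     # dropped: under 8 seconds
]
print(len(clean(sample)))  # -> 1
```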


3.3.2.3 Data
normalization


The data collected from the volunteers has 132 time fields corresponding to the time spent in the 132 sections of the website, in absolute values. It was realized that data normalization would be required.
The
reason
behind
this
was
that
different
people
have

spent
 different
 time
 on
 the
 webpage.
 The
 time
 spent
 depends
 upon
 their
 individual

browsing
speed,
reading
speed
and
other
several
personal
attributes.
Since
the
desired

model
 has
 to
cater to a general audience,
 time
 spent
 in
 one
 section
 relative
 to
 the
 time

spent
in
the
other
sections
was
thought
to
be
more
appropriate.


There are several advantages to this step, primarily that the model would now be capable of predicting in real time for a user who is in the process of browsing the webpage.

Whenever
the
prediction
is
needed,
the
current
times
spent
in
various
sections
could
be

normalized
and
fed
into
the
model.
Since the model would now be immune to absolute time values, with every prediction for the same user it would not be biased by the time spent but would depend only on the relative time spent on the different sections of the webpage.
Another advantage is that all the data used for training the model is now on an equivalent footing: the 200 cases in the training set are more comparable and do not vary on an absolute scale. This step is expected to train the model better.


Implementation


To carry out data normalization, the total time spent by each user was calculated in Excel (as already done in the data cleaning step). The time spent in each individual section of the webpage by that user was then divided by the total time spent on the webpage by that user. This step gave the percentage of time spent by the user in each section of the webpage.
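The per-user normalization described above amounts to dividing each of the 132 section times by that row's total, a minimal sketch of which is:

```python
def normalize(times):
    """Convert absolute per-section times into fractions of the user's total,
    so rows from fast and slow readers become comparable."""
    total = sum(times)
    return [t / total for t in times]

# A user who spent 5 s, 15 s and 20 s in three sections:
print(normalize([5.0, 15.0, 20.0]))  # -> [0.125, 0.375, 0.5]
```

Because each row sums to 1 after this step, a prediction made mid-visit can normalize the partial times the same way and be fed to the same model.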


The new dataset, with 200 tuples and 134 attributes (a userID field, 132 independent variables and 1 dependent variable)
 with
 normalized
 time
 data
 was
 saved
 in
 the
 CSV
 format,
 which

could
be
imported
directly
into
WEKA
for
model
building
task.
The
next
chapter
would

explain
the
procedure
of
building
machine
learning
models
on
WEKA
using
the
data
as

collected
in
this
chapter.



4
BUILDING
MACHINE
LEARNING
MODELS

Using
 the
 collected
 data,
 various
 machine­learning
 models
 were
 built
 and

tested.
This
chapter
explains
the
complete
methodology
followed
along
with

the
 details
 of
 the
 models
 obtained.
 It
 later
 explains
 the
 best
 models
 that

were
selected
and
the
rationale
behind
them.


4.1 Machine
Learning


According
 to
 Wikipedia1
 “Machine
 Learning
 is
 a
 scientific
 discipline
 that
 is
 concerned

with
the
design
and
development
of
algorithms
that
allow
computers
to
learn
based
on

data, such as from sensor data or databases.”
It
can
be
defined
as
a
set
of
algorithms
to

automatically
 learn
 and
 recognize
 complex
 patterns
 and
 are
 capable
 of
 making

intelligent
decisions
based
on
data.


There are several software packages available that could be used to build and implement machine-learning models. MATLAB and WEKA are two commonly used packages. The models used in the project were built using WEKA.




























































1

Wikipedia,
Machine
Learning
‐
Wikipedia,
http://en.wikipedia.org/wiki/Machine_learning.



4.1.1 WEKA


Weka1
is
open
source
data
mining
software
written
in
Java.
It
is
primarily
a
collection
of

various
 machine‐learning
 algorithms
 that
 could
 be
 applied
 directly
 and
 easily
 on

different
types
of
data.
It
has
a
built‐in
interface
to
visualize
the
data
and
can
perform

tasks
 like
 attribute
 selection,
 clustering
 etc.
 It
 is
available under the GNU General Public License (GPL)
and
can
be
downloaded
from
its
website.


4.1.2 Why
Machine
Learning?


The
 primary
 objective
 of
 the
 project
 is
 to
 automatically
 learn
 the
 user’s
 mouse

movement
behavior
from
the
collected
training
data.
Machine
learning
as
stated
above

is
a
branch
of
science
that
deals
with
algorithms
that
are
capable
of
learning
patterns.

This
exactly
fits
the
primary
requirement.


The project further demands the capability to predict further content for a new user based on his mouse movements.
 Machine
 learning
 algorithms
 once
 trained
 on
 a
 large
 set
 of

data
 are
 then
 capable
 of
 predicting
 the
 value
 of
 the
 dependent
 variable
 for
 any
 new

case.
Moreover,
machine‐learning
algorithms
can
be
trained
again
and
again
with
new

data.
The
complete
objective
of
the
project
can
easily
be
catered for
using
machine‐learning

algorithms.


4.2 Methods
evaluated


In
machine
learning,
in
order
to
classify/predict
for
any
new
case,
a
model
is
first
made

and
 trained
 on
 training
 data.
 There
 can
 be
 a
 number
 of
 different
 types
 of
 models
 that




























































1

The
University
of
Waikato,
Weka
3:
Data
Mining
Software
in
Java,

http://www.cs.waikato.ac.nz/ml/weka/.



can be built, and further a number of different algorithms with which to build a model. Different types of

machine
learning
models
generally
used
are
Decision
Trees,
Neural
Networks,
Genetic

Algorithms,
Fuzzy
Networks
etc.
Keeping the scope of this project in mind,
only
Decision

Trees
and
Neural
Networks
based
models
were
evaluated.
The
data
was
modeled
using

both methods, using the J48 classification algorithm for the decision tree and a multilayer perceptron for the neural network.
 The
 two
 models
 were
 later
 evaluated
 on
 the
 training

data.


4.2.1 Decision
Tree


A decision tree can be defined as a decision-support classifier that uses a tree-like structure of conditions and their possible consequences. Each node of a decision tree is either a leaf node or a decision node, where:

• Leaf node – these nodes state the value of the dependent (target) variable.

• Decision node – these nodes contain one condition each, specifying some test on a single attribute value. The outcome of the condition is further divided into branches with sub-trees or leaf nodes.


The
attribute
that
is
to
be
predicted
is
known
as
the
dependent
variable,
since
its
value

depends
 upon,
 or
 is
 decided
 by,
 the
 values
 of
 all
 the
 other
 attributes.
 The
 other

attributes,
which
help
in
predicting
the
value
of
the
dependent
variable,
are
known
as

the
independent
variables
in
the
dataset.


4.2.2 Neural
Network


“An
 Artificial
 Neural
 Network
 is
 an
 interconnected
 assembly
 of
 simple
 processing

elements,
units
or
nodes
(neurons),
whose
functionality
is
inspired
by
the
functioning
of

the
natural
neuron
from
brain.
The
processing
ability
of
the
neural
network
is
stored
in



the
inter‐unit
connection
strengths,
or
weights,
obtained
by
a
process
of
learning
from
a

set
of
training
patterns.”1


4.3 Implemented
algorithms


There are several decision-tree algorithms commonly used nowadays, namely ID3, C4.5, C5.0, etc. After careful evaluation of these three algorithms, C4.5 was chosen for the project. The reasons behind choosing C4.5 over ID3 and C5.0 were:


• C4.5
 handles
 continuous
 variables
 in
 a
 better
 way
 by
 creating
 a
 threshold
 and

then
 splitting
 the
 list
 on
 that
 value.
 Since
 all
 the
 attributes
 in
 the
 required

decision
tree
are
continuous
whereas
the
target
variable
has
five
discrete
values,

C4.5
was
used.


• C4.5
has
a
capability
to
prune
trees.
Pruning
is
a
method
of
going
backwards
in
a

tree
 to
 remove
 any
 branches
 that
 do
 not
 help
 in
 further
 classifications
 and

replace
them
by
leaf
nodes.


• C5.0
is
generally
ranked
above
C4.5
because
of
its
higher
speed
of
building
a
tree

and
low
memory
requirements.
Since
the
scope
of
the
project
demanded
none
of

these
 features,
 there
 was
 no
 significant
 advantage
 with
 C5.0.
 Also
 C5.0
 can
 be

used
 to
weight
 attributes,
 which
 wasn’t
 required
 in
 the
 problem
 under

consideration.



Similarly,
 Neural
 networks
 can
 be
 implemented
 in
 one
 of
 the
 various
 available
 ways

namely‐
 Feedforward
 neural
 network,
 Radial
 basis
 function
 network,
 Kohonen
 self‐
organizing
 network,
 Recurrent
 network,
 Stochastic
 neural
 networks,
 Modular
 neural




























































1

Kevin
N
Gurney,
An
introduction
to
neural
networks,
illustrated
(CRC
Press,
1997).



networks,
Holographic
associative
memory
etc.
The
neural
network
implemented
in
the

project
was
a
feedforward
neural
network
with
non‐linear
activation
function.


4.3.1 Decision
Tree
(C4.5)


WEKA
implements
Decision
tree
C4.5
algorithm
using
‘J48
Decision
tree
classifier’.
The

explanation
of
the
C4.5
algorithm
as
well
as
the
J48
implementation
is
as
follows:


• Whenever
 a
 set
 of
 items
 (training
 set)
 is
 encountered,
 the
 algorithm
 identifies

the
attribute
that
discriminates
the
various
instances
most
clearly.
This
is
done

using
the
standard
equation
of
information
gain


• Among
the
possible
values
of
this
feature,
if
there
is
any
value
for
which
there
is

no
ambiguity,
that
is,
for
which
the
data
instances
falling
within
its
category
have

the
 same
 value
 for
 the
 target
 variable,
 then
 that
 branch
 is
 terminated
 and
 the

obtained
target
value
is
assigned
to
it.


• For all other cases, the other attributes are examined to find the one that gives the highest information gain.


• This
is
continued
in
the
same
manner
until
either
a
clear
decision
of
the
value
of

the
 target
 variable
 is
 reached
 with
 a
 combination
 of
 conditions
 on
 various

independent
variables/
attributes,
or
we
run
out
of
attributes.


• In
the
event
of
running
out
of
attributes,
or
getting
an
ambiguous
result
from
the

available
information,
the
branch
is
assigned
a
target
value
that
the
majority
of

the
items
under
this
branch
possess.


The
name
of
the
classifier
in
WEKA
that
follows
the
above
mentioned
C4.5
algorithm
is

‘weka.classifiers.trees.J48’
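The attribute-selection step described above relies on information gain: the entropy of the class labels before a split minus the weighted entropy of the branches after it. A minimal sketch of that calculation (an illustration, not WEKA's actual code) is:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, split_mask):
    """Gain of splitting `labels` by the boolean `split_mask`
    (e.g. attribute value <= threshold, as C4.5 does for continuous data)."""
    left = [y for y, m in zip(labels, split_mask) if m]
    right = [y for y, m in zip(labels, split_mask) if not m]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

# A split that separates the two classes perfectly gains the full 1 bit:
print(information_gain([1, 1, 2, 2], [True, True, False, False]))  # -> 1.0
```

C4.5 evaluates such a gain for every candidate threshold of every continuous attribute and branches on the best one.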



4.3.2 Neural
Network
(Multilayer
Perceptron)


The multilayer perceptron is a feedforward-neural-network-based classifier that uses backpropagation when trained to classify instances. All the nodes in this network are sigmoid units, which means that the activation function is a sigmoid.


In
 a
 multilayer
 perceptron,
 there
 is
 an
 input
 layer
 with
 a
 node
 each
 for
 all
 the

independent
variables,
at
least
one
hidden
layer
and
an
output
layer
with
a
node
each

for
 different
 classes
 of
 the
 target
 variable.
 The
 network
 is
 trained
 by
 initial
 data
 that

determines
the
appropriate
weights
for
connections
between
all
the
nodes
of
adjacent

layers
and
also
determines
the
bias/
threshold
value
of
each
node.


The
name
of
the
classifier
in
WEKA
is
‘weka.classifiers.functions.MultilayerPerceptron’
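The structure described above — an input node per independent variable, one sigmoid hidden layer, and an output node per class — can be sketched as a plain forward pass. The weights below are random placeholders (WEKA learns them by backpropagation), and the layer sizes are shrunk for readability; the thesis model is 132/68/5.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass through a single-hidden-layer sigmoid network;
    returns the 1-based index of the most activated output node."""
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
              for w, b in zip(w_hidden, b_hidden)]
    out = [sigmoid(sum(wi * hi for wi, hi in zip(w, hidden)) + b)
           for w, b in zip(w_out, b_out)]
    return out.index(max(out)) + 1  # e.g. a laptop code 1..5

rng = random.Random(0)
n_in, n_hidden, n_out = 4, 3, 5   # toy sizes for the sketch
w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b_hidden = [rng.uniform(-1, 1) for _ in range(n_hidden)]
w_out = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b_out = [rng.uniform(-1, 1) for _ in range(n_out)]

prediction = forward([0.1, 0.4, 0.3, 0.2], w_hidden, b_hidden, w_out, b_out)
print(prediction)
```

Training adjusts the weight and bias values so that the output node matching the bought product is the most activated one.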


4.4 Model
building


WEKA
was
opened
in
Explorer
mode
and
the
saved
CSV
file
was
opened
using
the
open

file
 button
 in
 the
 preprocess
 tab
 of
 WEKA.
 From
 the
 attributes
 pane,
 the
 attribute

userID
 was
 deleted.
 This
 is
 because
 this
 field
 is
 irrelevant
 in
 the
 process
 of
 model

building.
 The
 file
 was
 then
 saved
 in
 Attribute‐Relation
 File
 Format
 (ARFF)
 simply
 by

clicking
the
save
button.
The
saved
ARFF
file
was
opened
in
a
text
editor
to
change
the

properties
 of
 predicted
 variable,
 i.e.
 attribute
 ‘bought’
 from
 number
 to
 nominal
 scale.

This is an essential step because
the
‘bought’
variable
has
only
five
discrete
values
each
for

each
product.
This
will
also
enable
the
use
of
J48
tree
classifier,
as
the
nominal
data
for

the
 predicted
 variable
 is
 a
 requirement.
 To
 convert
 ‘bought’
 from
 number
 to
 nominal

mode,
 the
 property
 ‘numeric’
 was
 changed
 to
 ‘{1,2,3,4,5}’,
 where
 1,2,3,4,5
 were
 the

codes
 for
 the
 five
laptop
 products.
 The
 output
 expected
 from
 the
 models
 is
 one
 of
the

five
laptop
codes.
File
was
saved
and
closed.



4.4.1 Decision
Tree


The
 saved
 ARFF
 was
 then
 re‐opened
 in
 WEKA
 and
 under
 the
 classify
 tab,
 J48
 tree

classifier
 was
 chosen.
 There
 are
 different
 parameters
 of
 J48
 tree
 classifier
 like
 binary

splits,
 number
 of
 folds,
 pruning
 etc.
 Using
 trial
 and
 error
 method,
 various
 parameters

were
 changed
 and
 each
 model
 was
 tested
 for
 accuracy
 on
 the
 training
 data.
 Models

were
 tested
 using
 two
 methodologies
 namely
 testing
 directly
 on
 training
 data
 and

testing
using
cross
validation.
The
set
of
parameters
giving
the
maximum
percentage
of

correctly
classified
instances
were
chosen.
The
final
model
giving
maximum
accuracy
on

the
training
dataset
was
also
saved
for
later
use.


4.4.1.1 Details
of
the
chosen
decision
tree


The
final
parameters
selected
that
gave
the
best
output
on
training
data
are‐


• binarySplits:
By
WEKA
definition
of
this
parameter,
it
is
considered
for
nominal

variables
 only.
 Since
 the
 dataset
 under
 consideration
 had
 no
 nominal

independent
variable,
the
value
of
this
attribute
had
no
impact
on
the
built
tree.


• confidenceFactor:
This
attribute
defines
the
confidence
factor
used
for
pruning.

It was found that with a confidence factor value of 0.75, a decision tree with good accuracy was obtained when C4.5 pruning was used.


• debug:
This
parameter
is
only
used
to
output
some
additional
information
at
the

console.
Its
value
of
either
true
or
false
didn’t
impact
the
final
model.


• minNumObj:
 This
 determines
 the
 minimum
 number
 of
 instances
 at
 every
 leaf

node.
This
attribute
was
set
to
a
value
of
‘2’.



• numFolds:
 This
 parameter
 determines
 the
 amount
 of
 data
 used
 for
 reduced‐
error
pruning.
In
the
decision
tree
built,
numFolds
was
kept
at
‘11’.
This
would

mean
that
one
fold
was
used
for
pruning,
and
rest
for
growing
the
tree.



• reducedErrorPruning:
 This
 was
 set
 to
 ‘False’
 as
 it
 signifies
 if
 reduced‐error

pruning
should
be
used
instead
of
C4.5 pruning.


• saveInstanceData:
This
attribute
is
just
to
save
the
instance
for
visualization
in

future


• seed: The seed is used to initialize the randomization of the data when reduced-error pruning is used. Since reduced-error pruning was not used, the seed parameter had no relevance.


• subtreeRaising:
 Subtree
 raising
 while
 pruning
 is
 always
 advisable
 when
 used

with
 a
 high
 confidence
 factor.
 Since
 a
 confidence
 factor
 of
 0.75
 was
 used,
 this

parameter
was
set
as
‘true’.


• unpruned:
Since
we
wanted
pruning
to
happen,
the
‘unpruned’
parameter
was

set
to
‘false’.


• useLaplace:
This
parameter
determines
if
counts
at
leaves
are
smoothed
based

on
Laplace.
The
parameter
had
no
influence
on
the
model
output.


All
the
parameters
used
in
the
final
decision
tree
can
be
summarized
as‐




Figure
8:
Parameters
used
for
building
the
Decision
Tree
model


The
output
from
WEKA
is
as
follow:


=== Run information ===

Scheme:       weka.classifiers.trees.J48 -L -C 0.75 -M 2 -A
Relation:     MLData_Normalized-weka.filters.unsupervised.attribute.Remove-R1
Instances:    200
Attributes:   133
              [list of attributes omitted]
Test mode:    evaluate on training data

=== Classifier model (full training set) ===


J48 pruned tree
------------------

b5 <= 0.04509
|   k4 <= 0.013828
|   |   v1 <= 0.000362
|   |   |   r0 <= 0.000626
|   |   |   |   d5 <= 0.003481
|   |   |   |   |   d5 <= 0.001586
|   |   |   |   |   |   g4 <= 0.033267
|   |   |   |   |   |   |   s3 <= 0.004874
|   |   |   |   |   |   |   |   u1 <= 0.002108
|   |   |   |   |   |   |   |   |   f1 <= 0.039667
|   |   |   |   |   |   |   |   |   |   f4 <= 0.028894
|   |   |   |   |   |   |   |   |   |   |   i4 <= 0.004699
|   |   |   |   |   |   |   |   |   |   |   |   d2 <= 0.001173
|   |   |   |   |   |   |   |   |   |   |   |   |   e5 <= 0.001377
|   |   |   |   |   |   |   |   |   |   |   |   |   |   e1 <= 0.029566
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   r3 <= 0.000861
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   c1 <= 0.043665
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   a3 <= 0.206815
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b1 <= 0.007319
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   f3 <= 0.001471
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b4 <= 0.00214: 2 (11.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b4 > 0.00214
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   a4 <= 0.004126: 3 (3.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   a4 > 0.004126: 2 (2.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   f3 > 0.001471: 3 (3.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b1 > 0.007319
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b3 <= 0.123969: 2 (12.0/2.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b3 > 0.123969: 1 (2.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   a3 > 0.206815: 1 (2.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   c1 > 0.043665: 1 (3.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   r3 > 0.000861: 3 (3.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   e1 > 0.029566: 1 (3.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   e5 > 0.001377: 3 (2.0)
|   |   |   |   |   |   |   |   |   |   |   |   d2 > 0.001173
|   |   |   |   |   |   |   |   |   |   |   |   |   s4 <= 0.002873: 2 (32.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   s4 > 0.002873: 4 (2.0)
|   |   |   |   |   |   |   |   |   |   |   i4 > 0.004699: 1 (3.0/1.0)
|   |   |   |   |   |   |   |   |   |   f4 > 0.028894: 4 (2.0)
|   |   |   |   |   |   |   |   |   f1 > 0.039667: 3 (3.0)
|   |   |   |   |   |   |   |   u1 > 0.002108: 3 (6.0/1.0)
|   |   |   |   |   |   |   s3 > 0.004874
|   |   |   |   |   |   |   |   q1 <= 0.004708
|   |   |   |   |   |   |   |   |   r4 <= 0.007391: 3 (16.0)
|   |   |   |   |   |   |   |   |   r4 > 0.007391: 2 (2.0)
|   |   |   |   |   |   |   |   q1 > 0.004708: 2 (2.0/1.0)
|   |   |   |   |   |   g4 > 0.033267
|   |   |   |   |   |   |   g5 <= 0.004141
|   |   |   |   |   |   |   |   k4 <= 0.001354: 4 (8.0)
|   |   |   |   |   |   |   |   k4 > 0.001354: 3 (3.0/1.0)
|   |   |   |   |   |   |   g5 > 0.004141: 2 (3.0/1.0)
|   |   |   |   |   d5 > 0.001586: 4 (4.0)
|   |   |   |   d5 > 0.003481
|   |   |   |   |   g5 <= 0.004141
|   |   |   |   |   |   b5 <= 0.002996
|   |   |   |   |   |   |   g4 <= 0.003922: 2 (4.0)
|   |   |   |   |   |   |   g4 > 0.003922: 1 (2.0)
|   |   |   |   |   |   b5 > 0.002996: 3 (2.0)
|   |   |   |   |   g5 > 0.004141: 5 (3.0)
|   |   |   r0 > 0.000626: 4 (3.0/1.0)
|   |   v1 > 0.000362
|   |   |   s4 <= 0.005561
|   |   |   |   t4 <= 0.002371
|   |   |   |   |   e0 <= 0.001979
|   |   |   |   |   |   h2 <= 0.005305: 1 (18.0/1.0)
|   |   |   |   |   |   h2 > 0.005305: 2 (2.0)
|   |   |   |   |   e0 > 0.001979: 2 (2.0)
|   |   |   |   t4 > 0.002371: 2 (2.0/1.0)
|   |   |   s4 > 0.005561: 2 (2.0/1.0)
|   k4 > 0.013828
|   |   f5 <= 0.001805: 4 (9.0/1.0)
|   |   f5 > 0.001805: 2 (2.0/1.0)
b5 > 0.04509
|   t3 <= 0.000515
|   |   d4 <= 0.008991
|   |   |   e2 <= 0.011901
|   |   |   |   a1 <= 0.001341
|   |   |   |   |   g2 <= 0.001762: 4 (3.0/1.0)
|   |   |   |   |   g2 > 0.001762: 5 (2.0)
|   |   |   |   a1 > 0.001341: 5 (4.0)
|   |   |   e2 > 0.011901: 4 (3.0)
|   |   d4 > 0.008991: 2 (2.0/1.0)
|   t3 > 0.000515: 3 (3.0)

Number of Leaves  :  42

Size of the tree  :  83



Time
taken
to
build
model:
0.75
seconds


4.4.1.2 Testing
the
decision
tree


The
model
was
tested
using
two
different
methodologies
namely,
testing
directly
on
the

training
dataset
and
testing
using
cross‐validation
with
10
folds.


Testing
on
the
training
data
gave
a
result
of
89.5%
accuracy
whereas
testing
using
cross

validation
gave
an
accuracy
of
66%.
The
complete
result
along
with
the
discussion
is
as

follows:


4.4.1.2.1 Testing
on
Training
Data

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances         179               89.5    %
Incorrectly Classified Instances        21               10.5    %
Kappa statistic                          0.8586
Mean absolute error                      0.165
Root mean squared error                  0.2382
Relative absolute error                 54.9013 %
Root relative squared error             61.5103 %
Total Number of Instances              200

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.848     0.03       0.848      0.848     0.848      0.953      1
                 0.959     0.079      0.875      0.959     0.915      0.969      2
                 0.932     0.019      0.932      0.932     0.932      0.994      3
                 0.795     0.019      0.912      0.795     0.849      0.987      4
                 0.818     0          1          0.818     0.9        0.999      5
Weighted Avg.    0.895     0.042      0.897      0.895     0.894      0.977

=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 28  4  0  1  0 | a = 1
  1 70  1  1  0 | b = 2
  1  2 41  0  0 | c = 3
  3  3  2 31  0 | d = 4
  0  1  0  1  9 | e = 5


4.4.1.2.2 Testing
by
Cross‐Validation
(folds
10)

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         132               66      %
Incorrectly Classified Instances        68               34      %
Kappa statistic                          0.5303
Mean absolute error                      0.2133
Root mean squared error                  0.308
Relative absolute error                 70.9833 %
Root relative squared error             79.5133 %
Total Number of Instances              200

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.545     0.072      0.6        0.545     0.571      0.865      1
                 0.89      0.315      0.619      0.89      0.73       0.871      2
                 0.5       0          1          0.5       0.667      0.874      3
                 0.538     0.081      0.618      0.538     0.575      0.833      4
                 0.545     0.016      0.667      0.545     0.6        0.983      5
Weighted Avg.    0.66      0.143      0.702      0.66      0.653      0.869

=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 18 14  0  0  1 | a = 1
  7 65  0  1  0 | b = 2
  1 13 22  7  1 | c = 3
  4 13  0 21  1 | d = 4
  0  0  0  5  6 | e = 5


4.4.1.2.3 Discussion


Testing directly on the training data classified 179 cases correctly out of 200, an accuracy of 89.5%. Accuracy when testing on training data is desired to be very high because it signifies the extent to which the model has learnt the training data. Since there were 5 classes in the target variable (5 products), any accuracy above 20% (the probability of guessing each class at random is 1/5 = 0.2 = 20%) is better than chance. An accuracy of 89.5% is well above this baseline and signifies that the built decision tree has learnt the training data quite accurately.


Testing using cross-validation is the process of dividing the data into different subsets, carrying out the analysis on some subsets and testing on the remaining one. Doing this with 10 folds means carrying out this procedure 10 times, each time holding out a different fold, and averaging the accuracy scores. Again, as stated above, any accuracy of more than 20% beats random guessing. The achieved result of an average of 132 correct classifications out of 200, an accuracy of 66%, is well within the desired range.
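The 10-fold procedure can be sketched as follows: split the rows into ten folds, train on nine, test on the held-out one, and average the ten accuracies. This is a generic sketch, not WEKA's implementation; the `majority_trainer` stand-in is hypothetical.

```python
import random

def cross_validate(data, labels, train_fn, k=10, seed=0):
    """k-fold cross-validation: returns the mean held-out accuracy."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]           # k roughly equal folds
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train_x = [data[i] for i in idx if i not in held_out]
        train_y = [labels[i] for i in idx if i not in held_out]
        model = train_fn(train_x, train_y)           # fit on the other k-1 folds
        correct = sum(model(data[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k

# A trivial 'majority class' learner as a stand-in for a real classifier:
def majority_trainer(xs, ys):
    majority = max(set(ys), key=ys.count)
    return lambda x: majority

data = [[i] for i in range(20)]
labels = [2] * 15 + [1] * 5            # class 2 dominates
print(cross_validate(data, labels, majority_trainer))  # -> 0.75
```

Because every case is tested exactly once on a model that never saw it, the cross-validated 66% is a more honest estimate of generalization than the 89.5% on the training data.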


Ideally,
the
model
should
have
been
trained
on
more
data.
Due
to
the
limitation
of
time,

and
 no
 compensation
 available
 to
 volunteers,
 only
 200
 tuples
 of
 useful
 data
 could
 be

collected.
It
is
expected
that
with
a
bigger
training
dataset,
the
accuracy
of
the
models

would
increase.


4.4.2 Neural
Network


The
 saved
 ARFF
 file
 was
 re‐opened
 in
 WEKA
 and
 under
 the
 classify
 tab,

MultilayerPerceptron
 function
 was
 chosen.
 There
 are
 different
 parameters
 associated

with
 this
 neural
 network
 function
 and
 as
 done
 with
 decision
 trees,
 trial
 and
 error

method
was
used
to
find
the
best
set.
The
best
set
of
parameters
was
the
one
that
gave

maximum
 accuracy
 of
 classification
 on
 the
 training
 dataset.
 Each
 obtained
 model
 was

tested
 using
 two
 methodologies
 namely
 testing
 directly
 on
 training
 data
 and
 testing

using
 cross
 validation.
 After
 multiple
 iterations
 using
trial
 and
 error
 method,
 a
 model

giving
a
good
accuracy
of
classification
was
obtained.
The
model
was
also
saved
for
later

use.



4.4.2.1 Details
of
the
chosen
neural
network


The
final
parameters
selected
that
gave
the
best
output
on
training
data
are‐


• GUI:
The
GUI
parameter
brings
up
an
interface.
It
doesn’t
really
impact
the
final

model,
 unless
 some
 changes
 in
 the
 learning
 rate
 and
 momentum
 are
 desired

while
training.
It
was
set
as
‘False’
in
the
project.


• autoBuild:
An
ANN
was
built
automatically
and
hence
this
parameter
was
set
to

‘true’


• debug:
This
is
to
view
additional
information
on
the
console.


• decay:
It
was
observed
that
the
‘true’
decay
value
gave
slightly
less
accuracy
and

hence
in
the
final
model,
‘decay’
was
set
to
‘false’


• hiddenLayers:
Since
an
automatic
neural
network
was
desired,
the
WEKA
was

left
to
decide
the
number
of
hidden
layers
and
hence
the
final
set
of
parameters

had
 a
 value
 of
 ‘a’
 in
 the
 field
 of
 hiddenLayers.
 ‘a’
 when
 used
 as
 a
 value
 for

hiddenLayers
mean
‘automatic’.


• learningRate:
 The
 amount
 at
 which
 the
 weights
 should
 be
 updated
 was
 set
 to

0.1


• momentum:
Momentum
of
0.2
was
applied
to
the
weights
during
updating.


• nominalToBinaryFilter:
There
were
no
nominal
variables
in
the
data
and
hence

this
parameter
had
no
impact
on
the
model


• normalizeNumericClass:
Since
the
class
is
not
numeric
but
already
normalized,

there
was
no
use
of
using
this
feature
and
hence
it
was
set
to
‘false’


• reset:
When
the
reset
was
set
to
false,
no
error
message
was
received.
Moreover

the
set
learning
rate
of
0.1
is
already
quite
low
and
hence
this
feature
was
set
as

‘false’


• seed:
Seed
value
of
0
was
used.
As
in
case
of
decision
trees,
this
value
is
used
to

initialize
 the
 random
 number
 generator.
 Random
 numbers
 are
 used
 for
 setting



the
 initial
 weights
 of
 the
 connections
 between
 nodes,
 and
 also
 for
 shuffling
 the

training
data.


• trainingTime:
The
number
of
epochs
to
train
through
was
set
to
5000.


• validationSetSize:
The
percentage
size
of
the
validation
set
was
made
0
which

signifies
that
no
validation
set
will
be
used
and
instead
the
network
will
train
for

the
specified
number
of
epochs,
i.e.
for
5000
epochs


• validationThreshold:
This
parameter
was
set
to
20
which
dictates
that
20
times

in
a
row
the
validation
set
error
can
get
worse
before
training
is
terminated.


The
parameters
used
in
the
final
neural
network
model
can
be
summarized
as:




Figure
9:
Parameters
used
for
building
the
Neural
Network
model



The complete model output was too large to include in this document; hence, a summary of the model obtained is as follows:


=== Run information ===

Scheme:       weka.classifiers.functions.MultilayerPerceptron -L 0.1 -M 0.2 -N 5000 -V 0 -S 0 -E 20 -H a -R
Relation:     MLData_Normalized-weka.filters.unsupervised.attribute.Remove-R1
Instances:    200
Attributes:   133
              [list of attributes omitted]
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===



The
 chosen
 neural
 network
 had
 1
 hidden
 layer
 with
 68
 nodes.
 There
 were
 132
 input

nodes
 accepting
 132
 normalized
 time
 values
 corresponding
 to
 each
 section
 of
 the

webpage.
The
model
had
5
output
nodes
each
for
one
of
the
five
laptops.


There
were
a
total
of
73
threshold
values
for
73
nodes
(68
hidden
layer
nodes+5
output

nodes)
and
there
were
9316
weight
values
(132*68
+
68*5)


4.4.2.2 Testing
the
neural
network
model


The
 neural
 network
 model
 was
 also
 tested
 similarly
 as
 decision
 trees
 using
 two

different
 methodologies
 namely,
 tested
 directly
 on
 training
 set
 and
 using
 cross‐
validation
with
10
folds.


It
was
found
that
testing
on
training
dataset
gave
an
exceptionally
good
result
of
95.0%

whereas
 testing
 using
 cross
 validation
 with
 10
 folds
 gave
 a
 classification
 accuracy
 of

41.0%



4.4.2.2.1 Testing
on
Training
Data

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances         190               95      %
Incorrectly Classified Instances        10                5      %
Kappa statistic                          0.9335
Mean absolute error                      0.0219
Root mean squared error                  0.1313
Relative absolute error                  7.2772 %
Root relative squared error             33.8899 %
Total Number of Instances              200

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.939     0.012      0.939      0.939     0.939      0.966      1
                 0.918     0.024      0.957      0.918     0.937      0.936      2
                 1         0.026      0.917      1         0.957      0.993      3
                 0.949     0.006      0.974      0.949     0.961      0.957      4
                 1         0          1          1         1          1          5
Weighted Avg.    0.95      0.017      0.951      0.95      0.95       0.961

=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 31  2  0  0  0 | a = 1
  2 67  3  1  0 | b = 2
  0  0 44  0  0 | c = 3
  0  1  1 37  0 | d = 4
  0  0  0  0 11 | e = 5


4.4.2.2.2 Testing by Cross-Validation (folds 10)

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          82               41      %
Incorrectly Classified Instances       118               59      %
Kappa statistic                          0.2165
Mean absolute error                      0.236
Root mean squared error                  0.4551
Relative absolute error                 78.4778 %
Root relative squared error            117.4608 %
Total Number of Instances              200

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.333    0.12     0.355      0.333   0.344      0.614     1
               0.575    0.22     0.6        0.575   0.587      0.706     2
               0.295    0.237    0.26       0.295   0.277      0.626     3
               0.282    0.155    0.306      0.282   0.293      0.652     4
               0.455    0.042    0.385      0.455   0.417      0.856     5
Weighted Avg.  0.41     0.185    0.415      0.41    0.412      0.671

=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 11  8  5  8  1 | a = 1
  9 42 19  2  1 | b = 2
  6 12 13 11  2 | c = 3
  5  7 12 11  4 | d = 4
  0  1  1  4  5 | e = 5


4.4.2.2.3 Discussion

Testing on the training data classified 190 cases correctly out of 200, which is an accuracy of 95.0%. Such a high classification accuracy clearly signifies that the built neural network model has learnt the training data very closely.



Testing using cross-validation is a process of dividing the data into different subsets, carrying out the training on some subsets and testing on the remaining one. Doing this with 10 folds means carrying out this train-and-test cycle 10 times and averaging the accuracy scores. The achieved result, an average of 82 correct classifications out of 200, i.e. an accuracy of 41.0%, is comparatively low but still well within the desired range.
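The 10-fold procedure described above can be sketched as follows. This is an illustrative JavaScript sketch, not the WEKA implementation (which additionally stratifies the folds); the `trainMajority` demo learner and the array-based data layout are invented purely for the example.

```javascript
// Illustrative k-fold cross-validation: every k-th tuple (offset by the
// fold index) forms the held-out test set; the rest is used for training.
function kFoldAccuracy(data, k, train, classify) {
  var correct = 0;
  for (var fold = 0; fold < k; fold++) {
    var trainSet = data.filter(function (_, i) { return i % k !== fold; });
    var testSet  = data.filter(function (_, i) { return i % k === fold; });
    var model = train(trainSet);
    testSet.forEach(function (tuple) {
      if (classify(model, tuple) === tuple.label) correct++;
    });
  }
  // each tuple is tested exactly once, so divide by the dataset size
  return correct / data.length;
}

// demo learner: always predict the most frequent label in the training set
function trainMajority(rows) {
  var counts = {};
  rows.forEach(function (r) { counts[r.label] = (counts[r.label] || 0) + 1; });
  return Object.keys(counts).reduce(function (a, b) {
    return counts[a] >= counts[b] ? a : b;
  });
}
```

With any real learner plugged in for `train` and `classify`, the returned ratio corresponds to the averaged accuracy score reported above.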


Ideally, the model should have been trained on more data. Due to the limitation of time, and with no compensation available to volunteers, only 200 tuples of useful data could be collected. Since there is a hidden layer with 68 nodes and a total of 9316 weight values involved, a much bigger training dataset was required. It is expected that with a bigger training dataset, the accuracy of testing would increase.
increase.


4.4.3 Decision Tree vs Neural Networks

Based on the initial 200 data cases, one model each of decision tree and neural network was trained. Upon testing on the training dataset, the decision tree showed slightly better accuracy as compared to the neural network model. The other factors worth considering about the two models are:


• Building a neural network model is easy but time consuming in WEKA; moreover, it slows down the performance of the website after its implementation. The objective of the project is to determine the product for users in real time, while they are still browsing, which requires very fast computation. Decision trees are a set of conditions that can be evaluated much more efficiently than the calculations and temporary variables required by neural networks. However, if a parallel web server capable of performing calculations faster is used, a neural network could also be considered for implementation.


• With time the website would keep accumulating more and more mouse movement data, and the model should be improved / retrained on new data whenever required. This would require re-implementing the new model every time an update is desired. As stated above, this would be more difficult, time consuming and error prone with neural networks than with decision trees.


• Decision trees are more transparent as compared to neural network models. This means that a person visually inspecting the two models could gain some information from a decision tree, whereas a neural network can visually tell him nothing. This was, however, not one of the points considered before taking a final call on the model to be chosen.


Despite all these points, final models of both neural networks and decision trees were implemented in two similar copies of the same website. Further tests of accuracy and performance were conducted later in order to conclude which model is better for the problem at hand.


The next chapter will explain the steps required to embed these models into the website so that they can be used in real time to predict relevant content for a user.




5 EMBEDDING THE MACHINE LEARNING MODELS IN THE WEBSITE

This chapter explains the complete methodology adopted to apply the built machine learning models in the website. It also explains the interaction between the models and the website, and how a user's mouse movement data was used to predict the best content for him in real time.


5.1 What and Why?

As explained in the previous chapter, a decision tree and a neural network model capable of predicting the product the user is most likely to buy were built. These models need to be implemented in the website so that they can take the mouse movement behavior of new users as input and predict the appropriate product for each of them.


5.2 Specifications

The initial website, built as explained in Chapter 3 for collecting the training data, was modified to implement the decision tree and neural network models. Some additional characteristics required from the website were:




• The model should reside on the server. This is essential from a security point of view; otherwise any user would have access to the model, which by reverse engineering could give information about the products bought by other users.


• Real-time model evaluation on the real-time mouse movement data.


• Real-time transfer of the model output from the web server to the frontend website, so that the website can use the model prediction.


• Determining the product the user is most likely to buy using the embedded models periodically, say every 10 seconds. This involves including the latest mouse movement data and transferring the output again to the frontend HTML website, so that if any change in the final product is predicted, it can be reflected on the front end.


• All the tracking and model evaluation was to be carried out in a layer hidden from the user; the user was not asked for any explicit information, and speed and performance were not to be compromised.


• Not to mention, the website should continue to track mouse movements as explained in the earlier chapters.


5.3 Implementation

The website built initially to collect training data already had mouse movement tracking capability. A few new functions and scripts were added to enable model evaluation on the captured mouse movements.


A new JavaScript function named 'predict()' was programmed in the JavaScript file. 'predict()' is a recursive function that calls itself every 10 seconds, because the prediction was expected to be made every 10 seconds using the machine learning model. Every subsequent 10 seconds, the database would contain more mouse movement data that could be used by the machine learning models to, ideally, predict more accurately.


The 'predict()' function takes no arguments and calls a PHP script named 'predict.php', passing it the userID of the current user via the GET method. The 'predict.php' file resides on the server, and the call from JavaScript was programmed using standard AJAX techniques. The code snippet of the JavaScript 'predict()' function is:


function autoPredict() {
    // schedule the first prediction 10 seconds after page load
    setTimeout("predict()", 10000);
}

function predict() {
    // 'http' is the page's XMLHttpRequest object and 'userId'
    // identifies the current visitor
    http.open("GET", "predict.php?userId=" + userId, true);
    http.onreadystatechange = predictResponse;
    http.send(null);
}


The 'predict.php' file connects to the MySQL database and selects the mouse movement data for the current user using a simple SQL 'SELECT *…' statement. The mouse movement data was saved into 132 temporary variables that correspond to the sections of the webpage. The total time spent by the user so far was also calculated while saving these temporary variables. The absolute time value spent in each section, as saved in the 132 temporary variables, was then replaced by the normalized time spent in that section, obtained by dividing the absolute time value by the total time spent by that user.


Hence, after this step the 132 temporary variables in the 'predict.php' file contain the normalized / relative time spent by the user in the corresponding 132 sections of the webpage. These 132 temporary variables are the 132 independent input variables for the model.
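The normalization step described above can be sketched as follows. The array-based `normalizeTimes` helper is illustrative only; the project's actual version runs inside 'predict.php' over 132 named variables.

```javascript
// Sketch of the normalization step: each per-section time is divided by
// the user's total time, so the model sees relative rather than absolute
// durations. The zero-total guard covers a user with no movement yet.
function normalizeTimes(sectionTimes) {
  var total = sectionTimes.reduce(function (sum, t) { return sum + t; }, 0);
  if (total === 0) return sectionTimes.slice();
  return sectionTimes.map(function (t) { return t / total; });
}
```

After this transformation the values of a user sum to 1, which is what makes tuples from fast and slow browsers comparable.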




The two models (decision tree and neural network) were then coded and given access to these 132 temporary variables, so that they could evaluate the normalized times and make their respective predictions. It should, however, be noted that for testing only one of the models was used at a time; both models were tested separately later for comparison purposes. The implementation of the two models is as follows:


5.3.1 Implementing the Decision Tree model

A function named 'decisionTree()' was coded in the PHP file 'predict.php'. This function had access to all the 132 input variables stated above.



The model made in WEKA had a set of 83 if-else statements (83 being the size of the tree). All these 83 if-else statements from the WEKA model, along with the prediction values, were coded in PHP. The if-else statements perform comparisons on the 132 independent variables so as to imitate the decision tree. The output of this set of if-else statements is a single value, which is also the output of the decision tree model: the product the user is most likely to buy according to the implemented decision tree. This value was returned to the main program by the function. The complete code of the 'decisionTree()' function and the 'predict.php' file is available in the appendix.
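The structure of such an exported tree can be sketched in miniature. The branch variables and threshold values below are entirely hypothetical; the real 'decisionTree()' function consists of 83 if-else statements generated from the WEKA tree, written in PHP rather than the JavaScript used here.

```javascript
// Miniature analogue of the 'decisionTree()' function described above.
// 't' is the array of normalized section times; the section indices and
// split thresholds are invented purely for illustration.
function decisionTree(t) {
  if (t[17] <= 0.12) {
    if (t[42] <= 0.05) return 1; // predict laptop 1
    return 3;                    // predict laptop 3
  }
  if (t[88] <= 0.3) return 2;    // predict laptop 2
  return 5;                      // predict laptop 5
}
```

Because the model reduces to plain comparisons, a prediction costs only a handful of branch evaluations, which is what makes the decision tree so cheap at request time.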


5.3.2 Implementing the Neural Network model

Another function, named 'neuralNetwork()', was implemented. This function also had access to the 132 independent input variables stated above.


The neural network built in WEKA had one hidden layer with 68 nodes. To implement this hidden layer, 68 new temporary variables named 'Node5', 'Node6', 'Node7', ….., 'Node72' were created, with values computed using the standard neural network formula.




All the coefficient values as well as the threshold limits were used as given by WEKA during model building.


To implement the output layer, the same formula was applied to these 68 temporary variables (the 68 hidden-layer nodes, i.e. Node5, Node6, …, Node72). The output layer of the neural network model had 5 nodes corresponding to the five laptop products. The product corresponding to the node with the highest value was predicted as the laptop the current user is most likely to buy.
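The forward pass described above can be sketched as follows. A sigmoid over a weighted sum plus a per-node threshold is the standard formula used by WEKA's multilayer perceptron; the tiny 3-input / 2-hidden / 2-output network and all weight values below are invented for illustration, standing in for the project's 132 / 68 / 5 layout.

```javascript
function sigmoid(x) { return 1 / (1 + Math.exp(-x)); }

// One layer: weights[j][i] connects input i to node j, thresholds[j] is
// the bias of node j, mirroring the per-node formula described above.
function layer(inputs, weights, thresholds) {
  return weights.map(function (w, j) {
    var sum = thresholds[j];
    for (var i = 0; i < inputs.length; i++) sum += w[i] * inputs[i];
    return sigmoid(sum);
  });
}

function neuralNetwork(inputs, net) {
  var hidden = layer(inputs, net.hiddenW, net.hiddenT);
  var output = layer(hidden, net.outputW, net.outputT);
  // the product corresponding to the highest output node is predicted
  return output.indexOf(Math.max.apply(null, output)) + 1;
}
```

The two `layer` calls correspond to computing Node5 … Node72 and then the five output nodes in 'predict.php'.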


5.4 Using model outputs


As stated above, only one of the two models was used at a time for a given user. After receiving the output from the model in use (decision tree or neural network), the output was sent back to the frontend JavaScript function named 'predictResponse()' via AJAX. It should be noted that the model output is the code of one of the 5 laptops, namely the one the current user is most likely to buy.


The 'predictResponse()' JavaScript function, after receiving the prediction, can be programmed as needed. In the current project, the author decided to simply highlight the border of the predicted laptop in red. The predicted laptop is the one the user is most likely to buy, as predicted by one of the machine learning models based on the user's mouse movement behavior. The function definition of 'predictResponse()' is as follows:




function predictResponse() {
    if (http.readyState == 4) {
        // the PHP script returns the code of the predicted laptop
        predictProduct = http.responseText;
        var colName = Number(predictProduct) + 1;
        // reset the style of every product column ...
        document.getElementById("cg2").className = "";
        document.getElementById("cg3").className = "";
        document.getElementById("cg4").className = "";
        document.getElementById("cg5").className = "";
        document.getElementById("cg6").className = "";
        // ... then highlight the predicted one
        document.getElementById("cg" + colName).className = "oce-predict";
        alert("Product : " + predictProduct);
        // schedule the next prediction in 10 seconds
        setTimeout("predict()", 10000);
    }
}


The function above gets the response from the PHP script via the standard AJAX http.responseText property. The output is then used to simply change the style of the column containing that product; the style of all other columns is first reset before changing the style of the predicted laptop's column. The JavaScript 'predict()' function is scheduled again every 10,000 milliseconds. In the current demonstration, a popup showing the code of the laptop the user is most likely to buy was also displayed, using the alert statement.


There can be several other usages of the prediction. It can be imagined that a customer would be served more easily and appropriately if the shopkeeper knows the product the customer is most likely to buy. The customer could be shown other options similar to the predicted product. Even if not used by the content generator of the website, this prediction can always help visitors find the information they have been looking for. A screenshot of the prediction made by the Decision Tree model is shown in Figure 10.





Figure 10: Screenshot of the prediction done by the model


5.5 What Next


Once the website was programmed and the machine learning models were embedded, it was again made public and users were invited to visit it. All the mouse movement data was saved in the database as designed earlier, along with the product the user buys. The users were also shown the real-time prediction made by the model every 10 seconds. The predictions made by the model were not saved in any database, for the following reasons:


• Connecting the 'predict.php' file to the database and saving data would certainly take time. The time used up in saving the predicted output would affect the performance of the website, mainly because it would delay the return of the model output from the 'predict.php' file to the JavaScript 'predictResponse()' function.



• The
final
prediction
done
for
any
user
as
per
the
model
can
always
be
calculated

again
 as
 the
 databases
 are
 keeping
 a
 record
 of
 the
 mouse
 movement
 data
 for

every
user.
This
would
be
done
later
in
the
testing
phase
of
the
project.




• The prediction was made every 10 seconds. This means that there would be several predictions per user (on average four). The count of four predictions was estimated because it was earlier found, in section 3.3.2.2, that the average time spent by a user on the webpage is 40.26 seconds. Saving all predictions per user is again a performance issue, as the table storing them would be expected to grow with time.


The final website, capable of predicting the product the user is most likely to buy, was made public and kept online for 7 days. Users were again invited using emails, social media, chats etc. and were asked to surf the final version of the webpage. The volunteers were required to buy one of the products after evaluating all the options (5 laptops) available on that page. While doing so, the users were shown the product they are most likely to buy. The visitors reported informally, via email and in-person conversations, that the predictions were quite accurate.


The next chapter will explain a more formal and quantitative method of testing the predictions made by the two models. It will also describe the methodology adopted to test the time performance of the two models.




6 TESTING AND RESULTS

This chapter describes the complete testing phase of the project. It describes the data collection steps and the parameters on which the models were evaluated. It also explains the testing methodology and a summary of the final results obtained.


6.1 Testing methodology

There were two types of tests conducted to evaluate the implementation. One test was conducted in WEKA, on the collected test data, to check the classification accuracy of each model (decision tree or neural network). The other test was conducted on the 'predict.php' file, to check the time performance of the website after implementing the model.

Both of the above-mentioned tests were performed on the two models separately. The methodology adopted and the results obtained are described in the following sections.


6.2 Testing for model accuracy

Testing data was collected while the final website was live and was used to further test the two models in WEKA. It was found that the decision tree model gave an accuracy of 84.09% whereas the neural network model gave an accuracy of 34.09% on the collected test data. Details about the tests conducted are as follows:




6.2.1 Testing data collection

While the website with one of the machine learning models was live, the users' mouse tracking data and the final product bought by each user were saved in the tables 'data' and 'bought' respectively. It was found that in the 7 days for which the test website was live, 49 unique users visited the webpage. There were 1275 tuples in the 'data' table and 44 tuples in the 'bought' table. The difference between the cardinality of the 'bought' table and the number of visitors arose because 5 users (49 minus 44) did not click the buy button and left the site after browsing it for a while.


This data was processed in a similar way to the initial data, as described in section 3.3.2. The steps followed to analyze and prepare the test data were:


• The data was converted into a more usable format using the PHP script named 'alignData.php'; the details of this script are given in section 3.3.2.1. This step converted the test data into 'one user per row' form, with the time values of each user in the same row along with the product bought.


• This data was exported into Excel and normalized. To normalize the times, the total time spent by each user was calculated, and the time spent in each section / cell was divided by the total time. This is explained in detail in section 3.3.2.3.


• It should be noted that in this step no outliers were removed. The reason is that the data was collected from actual users, and it is expected that all kinds of people will use the website in all possible ways; an accurate measure of accuracy is obtained only when all these cases, including any outliers, are taken into consideration.


• This data was saved in a CSV file that was then opened in WEKA.




• The CSV file was opened in WEKA and saved in WEKA's default ARFF format. The ARFF file was opened in a text editor and the type of the bought attribute was changed from numeric to nominal, as stated in section 4.4.


This data was then opened in WEKA again and the model testing was carried out as explained in the following sections:


6.2.2 Model testing in WEKA using test data

The saved files of the two models were opened in WEKA. In the classifier tab, the 'supplied test set' testing option was chosen and, after pressing the Set button, the collected and normalized test data file was opened. The loaded model was then made to evaluate this testing data by right-clicking the model and selecting "Re-evaluate the model on current test set". This method evaluates the model on the collected test dataset and shows the accuracy results on this test data.


This method is similar to running the model on the website using PHP. The output given by the model while testing in WEKA would be exactly the same as the one given by the PHP script online, because the obtained WEKA model was the one implemented in the website. This was the reason that the predictions were not saved, as stated in section 5.5. Checking for accuracy is now simply a matter of comparing the model prediction with the actual product bought by the user.


The details of the results given by both models when evaluated on the test set are explained in the following subsections:




6.2.2.1 Decision Tree model

The test dataset with 44 cases was evaluated using the built decision tree model. It was found that the tree was able to correctly classify 37 out of the 44 cases, an accuracy of 84.0909%.


The output obtained after re-evaluation from WEKA was:

=== Re-evaluation on test set ===

User supplied test set
Relation:    MasterData-1-weka.filters.unsupervised.attribute.Remove-R1
Instances:   unknown (yet). Reading incrementally
Attributes:  133

=== Summary ===

Correctly Classified Instances          37               84.0909 %
Incorrectly Classified Instances         7               15.9091 %
Kappa statistic                          0.7825
Mean absolute error                      0.1916
Root mean squared error                  0.2814
Total Number of Instances               44

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.857    0.027    0.857      0.857   0.857      0.986     1
               0.875    0.143    0.778      0.875   0.824      0.872     2
               0.917    0.031    0.917      0.917   0.917      0.921     3
               0.833    0.026    0.833      0.833   0.833      0.945     4
               0.333    0        1          0.333   0.5        0.89      5
Weighted Avg.  0.841    0.068    0.851      0.841   0.834      0.915

=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
  6  1  0  0  0 | a = 1
  1 14  0  1  0 | b = 2
  0  1 11  0  0 | c = 3
  0  0  1  5  0 | d = 4
  0  2  0  0  1 | e = 5




6.2.2.2 Neural Network model

The dataset with 44 test cases was evaluated on the neural network model, and it was found that it classified 15 cases correctly. This shows an accuracy of only 34.0909% on the testing data for the neural network model.

The output obtained from WEKA was:


=== Re-evaluation on test set ===

User supplied test set
Relation:    MasterData-1-weka.filters.unsupervised.attribute.Remove-R1
Instances:   unknown (yet). Reading incrementally
Attributes:  133

=== Summary ===

Correctly Classified Instances          15               34.0909 %
Incorrectly Classified Instances        29               65.9091 %
Kappa statistic                          0.1367
Mean absolute error                      0.2695
Root mean squared error                  0.5001
Total Number of Instances               44

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.429    0.135    0.375      0.429   0.4        0.695     1
               0.313    0.25     0.417      0.313   0.357      0.694     2
               0.25     0.281    0.25       0.25    0.25       0.505     3
               0.5      0.184    0.3        0.5     0.375      0.623     4
               0.333    0.024    0.5        0.333   0.4        0.78      5
Weighted Avg.  0.341    0.216    0.354      0.341   0.34       0.639

=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
  3  3  0  1  0 | a = 1
  2  5  7  2  0 | b = 2
  3  4  3  2  0 | c = 3
  0  0  2  3  1 | d = 4
  0  0  0  2  1 | e = 5




6.2.3 Discussion

On the training data, the decision tree gave an accuracy of 89.5% whereas the neural network gave an accuracy of 95%. The same decision tree and neural network models gave accuracies of 84.0909% and 34.0909% respectively when evaluated on the test dataset.


For model comparison, accuracy on the test dataset, i.e. the data on which the model has not been trained, is one of the most important parameters. As discussed in section 4.4.3, there are several drawbacks to using neural networks in the present situation, and after evaluating the two models on the test dataset, it is clear that the decision tree has clearly outperformed the neural network and should be used for predictions.


This, however, depends on many parameters, the most important being the sizes of the training and testing datasets. Since the scope of this project was limited, a large amount of data could not be collected, but it is advised that both decision trees and neural networks, along with other machine learning models, should be evaluated before settling on one of them.



6.3 Testing time performance of the models


A new PHP script was written and executed on the server to estimate the average time the model processing takes when executed in real time. To do this, the PHP script was connected to the database containing the test data. Both model functions were then called, and the time taken by them to evaluate all 44 test cases was measured. This was averaged over the 44 cases to estimate the average time each model takes for a single prediction in PHP. The whole process was carried out 10 separate times, so as to avoid any clashes with unforeseen tasks at the server that might delay model execution.
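The timing harness described above can be sketched as follows. The project used a PHP script on the server (presumably timing with microtime()); the JavaScript analogue below, with its hypothetical `averagePredictionTime` helper, is illustrative only.

```javascript
// Runs 'fn' once per test case, averages over the cases, and repeats the
// whole measurement 'runs' times to smooth out server-load spikes, as in
// the procedure described above.
function averagePredictionTime(fn, testCases, runs) {
  var perRunAverages = [];
  for (var r = 0; r < runs; r++) {
    var start = Date.now();
    testCases.forEach(fn);
    perRunAverages.push((Date.now() - start) / testCases.length);
  }
  // overall mean of the per-run averages, in milliseconds per prediction
  return perRunAverages.reduce(function (a, b) { return a + b; }, 0) / runs;
}
```

Averaging per run first, and then across runs, mirrors the "averaged over 44 cases, carried out 10 times" scheme used in the project.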




The time taken by the model is an important characteristic, as the intelligent website is expected to predict the output as soon as possible and, of course, in real time. A model taking more than some threshold time for its calculations is of no practical use. The process and results are explained in the following sections:


6.3.1 Decision Tree model

The decision tree was made to execute on all 44 test cases and the time taken was averaged. This was done 10 times, and the average times in seconds taken by the script to evaluate the decision tree were:


0.000929258, 0.000544337, 0.000656968, 0.004135495, 0.000538674,
0.000537385, 0.000534681, 0.000545979, 0.000546981, 0.007368538


From the above 10 time values, the following insights can be drawn:

• Minimum time taken by the model: approximately 0.00053 seconds

• Maximum time taken by the model: approximately 0.00737 seconds

• Average time taken by the model: 0.00163 seconds


6.3.2 Neural Network model

As was done for the decision tree, the neural network model was also made to evaluate the 44 test cases and the average time taken was noted. This was done 10 times. The average execution times taken by the neural network model were:


0.658177257, 0.543146627, 0.658050104, 0.746059109, 0.482899054,
0.536261456, 0.639314229, 0.505210876, 0.496645451, 0.707032805




From the above 10 time values, the following insights can be drawn:

• Minimum time taken by the model: approximately 0.4829 seconds

• Maximum time taken by the model: approximately 0.7461 seconds

• Average time taken by the model: 0.5973 seconds


6.3.3 Discussion

It is clearly seen that the neural network model takes far more time to execute than the decision tree model. It was also found that the chosen decision tree model runs at least 350 times faster than the chosen neural network.


Since the objective is to predict in real time, speed is a very important parameter, and the decision tree model has completely won the time performance battle.

6.4 Results

After testing both models (decision tree and neural network) on prediction accuracy and time performance, it was clearly found that the decision tree proved much better for implementation in the current problem than the neural network.

The results obtained in the tests are summarized below:

• Accuracy (on test dataset):
o Decision Tree: 84.0909%
o Neural Network: 34.0909%

• Time performance (PHP scripts running on Apache):
o Decision Tree: 0.0016 seconds
o Neural Network: 0.5973 seconds




It should, however, be noted that these results were obtained when the models were trained on only 200 cases. The neural network model had a total of 73 nodes, including 68 hidden nodes; to train it properly, at least a few thousand cases would be required. The neural network model was built to establish that it can be used on a website to predict relevant content for the user. The decision tree, on the other hand, is also expected to give better results when larger training and testing datasets are available.


It should also be noted that there were five classes of the dependent variable (5 possible laptop products), and hence a model would have been considered void only if its accuracy were close to 20% (20% being the accuracy of an equally likely chance prediction over five classes). Since the accuracy obtained for both machine learning models was far above the 20% benchmark, both models have shown promise that they do have the potential to recommend relevant content for a user based on his mouse movement behavior.
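This comparison against the chance level can be reproduced directly from the decision tree's test confusion matrix in section 6.2.2.1. The `accuracyFromConfusion` helper below is an illustrative sketch; only the matrix values come from the project's results.

```javascript
// Accuracy is the ratio of diagonal (correct) counts to the total count.
function accuracyFromConfusion(m) {
  var correct = 0, total = 0;
  m.forEach(function (row, i) {
    row.forEach(function (n, j) { total += n; if (i === j) correct += n; });
  });
  return correct / total;
}

// decision tree test confusion matrix from section 6.2.2.1
var dtConfusion = [
  [6, 1, 0, 0, 0],
  [1, 14, 0, 1, 0],
  [0, 1, 11, 0, 0],
  [0, 0, 1, 5, 0],
  [0, 2, 0, 0, 1]
];
var acc = accuracyFromConfusion(dtConfusion); // 37/44, i.e. about 84.09%
var chance = 1 / dtConfusion.length;          // 5 classes, so 0.20
```

The diagonal sums to 37 of 44 instances, which recovers the 84.0909% figure and places the model well above the 20% chance baseline.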


The next chapter gives a brief conclusion of the work done.




7 CONCLUSION

This chapter concludes the project, discusses the scope for future work, and describes some other possible implementations of the proposed methodology.


It has been successfully demonstrated that by building a machine-learning model on a user's mouse movement data, appropriate content can be predicted for that user. The dummy shopping website developed, embedded with a decision tree machine learning model, gave a remarkable accuracy of 84.09% on the test data. The accuracy was measured as the ratio of correct predictions to the total number of predictions made by the model. It was also found that implementing a decision tree model in a website would not affect the performance of the page, as the average time taken by the model was around 1.6 milliseconds. A neural network model was similarly evaluated; it gave an accuracy of 34.09% and took an average of 597.3 milliseconds to process a single case of data.


The objective of the project was to use the mouse movement behavior of a user to predict appropriate content for him intelligently and in real time. This objective was successfully achieved, and several sub-objectives were also reached while working on the project.


The user's mouse tracking was implemented successfully using a completely new algorithm. This was done using PHP, AJAX, HTML and MySQL. The performance of the website after implementing mouse tracking was not compromised, and the accuracy of the mouse tracking data collected was found to be very high. A webpage was developed imitating a shopping portal, and some highlighting techniques were applied to it to ensure that the user's mouse pointer stays close to his point of gaze.


The initial website, developed in PHP, was live for around two weeks and collected 200 cases of training data. The data was then used to train two separate machine-learning models, namely a Decision Tree model and a Neural Network model. Both models gave promising results when tested on the training data, which showed that they had learned the mouse movement behavior appropriately.


Both machine learning models were then coded back into the website using PHP and AJAX. The website collected mouse movement data, which was read dynamically by the models to generate an output. This predicted output was sent to the webpage for further personalization. A total of 44 test cases were also collected from the final website.


Using the 44 collected test cases, both models were evaluated, and the decision tree model was found to perform extremely well compared to the neural network model, both in accuracy and in time performance. The decision tree classified 2.5 times as many cases correctly in the set of 44 cases and was 350 times faster than the neural network model. This, however, cannot be generalized, as it depends on the size of the initial training dataset (which was small in the current scope of the project) and on the number of independent variables (which was large in the current implementation).


The working demonstration of the project, along with its documentation and the GNU General Public License source code, is available online at http://sparshgupta.name/MSc/Project




7.1 Future Work


The proposed idea has shown huge potential, and there is much scope for future innovation and improvement if it is properly explored. The lack of data was the prime limitation of the current study. If a commercial website is required to be intelligent, models built on several thousand cases of training data should be used, and once that data is obtained, other machine learning algorithms could be explored.


The data collected in the testing phase can later be used to train the models. There is a never-ending chain of model training and improvement involved in the proposed concept and implementation: over time, the website will accumulate a large amount of data that can be used at regular intervals to further train the implemented model or to build a new one. It is expected that with every improvement in the model, its capability to predict relevant content for a new user will increase.


The proposed implementation requires that each section of the website call the mouse tracking functions whenever the mouse enters or leaves that section. This requires explicit coding of function call statements in every cell, which might not be possible in highly dynamic websites; hence, work could be done on implementing the idea on any given website with almost no change to the existing web code.
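One way to achieve this, sketched below purely as an illustration (none of this code is part of the thesis implementation), is event delegation: a single pair of listeners on the table can time every cell, so no per-cell onmouseover/onmouseout attributes are needed. The dwell-time bookkeeping is kept separate from the DOM so it can run anywhere:

```javascript
// Bookkeeping for per-cell dwell times; independent of the DOM.
function CellTimer(now) {
  this.now = now || Date.now;  // injectable clock, useful for testing
  this.entered = null;
  this.queue = "";             // same "id:millis_" format as the appendix code
}
CellTimer.prototype.enter = function () { this.entered = this.now(); };
CellTimer.prototype.leave = function (cellId) {
  if (this.entered === null) return;
  this.queue += cellId + ":" + (this.now() - this.entered) + "_";
  this.entered = null;
};

// Illustrative wiring: two delegated listeners replace the per-cell
// handler attributes (browser only; skipped outside the DOM).
if (typeof document !== "undefined") {
  const timer = new CellTimer();
  const table = document.querySelector("table.one-column-emphasis");
  table.addEventListener("mouseover", (e) => {
    if (e.target.closest("td, th")) timer.enter();
  });
  table.addEventListener("mouseout", (e) => {
    const cell = e.target.closest("td, th");
    if (cell) timer.leave(cell.id || "unknown");
  });
}
```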


In the current project, the information about the predicted content (i.e., the laptop the user is most likely to buy) was not exploited. Work can be done to make the website interact with the user like a salesman: the website can remove all the products the user would be least interested in and show him only the products he is most likely to buy.
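A minimal sketch of such filtering is given below; the per-product scores are hypothetical, since the thesis model outputs a single predicted class rather than a ranking:

```javascript
// Keep only the products the user is most likely to buy, given a
// hypothetical score per product (higher = more likely).
function topProducts(scores, keep) {
  return Object.entries(scores)
    .sort((a, b) => b[1] - a[1])  // best first
    .slice(0, keep)
    .map(([name]) => name);
}

// Example with made-up scores for the five demo laptops:
const scores = { Lenovo: 0.10, HP: 0.35, Sony: 0.05, Dell: 0.40, Toshiba: 0.10 };
console.log(topProducts(scores, 2)); // [ 'Dell', 'HP' ]
```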




The current implementation used only a single machine-learning model at a time. Multiple models could be implemented in the webpage, and the strength of a prediction could be used to further interact with the user. In case all the implemented models give the same prediction, it can be treated as a strong prediction, and the webpage can adapt accordingly immediately.
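The agreement rule could look like the following sketch (the model outputs here are hypothetical product identifiers, not values from the thesis):

```javascript
// A prediction is "strong" when every implemented model agrees on it.
function combinePredictions(predictions) {
  const first = predictions[0];
  const unanimous = predictions.every((p) => p === first);
  return { product: first, strong: unanimous };
}

console.log(combinePredictions([4, 4, 4])); // { product: 4, strong: true }
console.log(combinePredictions([4, 2, 4])); // { product: 4, strong: false }
```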


Other possible implementations


A shopping portal that intelligently predicts the product a user is most likely to buy is only one of many possible implementations of the proposed concept. Some other possible implementations are:


• A Search Engine Feedback System: Current search engines display results as a list of links, each with a small text snippet relevant to the search. Most users choose a link after reading its snippet, and they spend different amounts of time on different links. Current search feedback is based entirely on mouse clicks, which is in a sense binary feedback (either Yes or No). The feedback system could be made more accurate by determining the relative time a user spent on a link compared to other links.
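The relative-time idea can be sketched as a simple normalization; the dwell times below are hypothetical, and a real system would collect them the same way the thesis collects cell times:

```javascript
// Convert absolute dwell times on search results into relative feedback
// scores in [0, 1], so the signal is no longer binary.
function relativeFeedback(dwellMillis) {
  const total = Object.values(dwellMillis).reduce((a, b) => a + b, 0);
  const scores = {};
  for (const [link, ms] of Object.entries(dwellMillis)) {
    scores[link] = total > 0 ? ms / total : 0;
  }
  return scores;
}

// Hypothetical times spent reading three result snippets:
console.log(relativeFeedback({ result1: 6000, result2: 3000, result3: 1000 }));
// { result1: 0.6, result2: 0.3, result3: 0.1 }
```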


• News Content Prediction: An online news website shows many news items under different heads on a page, and different users prioritize news differently. Based on a user's mouse movement activity, relevant news content can be shown to him. For example, if a user spends more time around football and cricket headlines than around political headlines, it can be predicted that he is more interested in sports news and, accordingly, the website can be molded for him.
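This could be sketched by aggregating dwell times per category and taking the maximum; the headline data below is hypothetical:

```javascript
// Aggregate per-headline dwell times into category totals and pick the
// category the user seems most interested in.
function preferredCategory(events) {
  const totals = {};
  for (const { category, ms } of events) {
    totals[category] = (totals[category] || 0) + ms;
  }
  return Object.keys(totals).reduce((a, b) => (totals[a] >= totals[b] ? a : b));
}

const events = [
  { category: "sport", ms: 4200 },    // football headline
  { category: "sport", ms: 2600 },    // cricket headline
  { category: "politics", ms: 1800 },
];
console.log(preferredCategory(events)); // "sport"
```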




BIBLIOGRAPHY

Aaltonen, Antti, Aulikki Hyrskykari, and Kari-Jouko Räihä. "101 spots, or how do users read menus?" Conference on Human Factors in Computing Systems, 1998: 132-139.

Arroyo, Ernesto, Ted Selker, and Willy Wei. "Usability tool for analysis of web designs using mouse tracks." Conference on Human Factors in Computing Systems, 2006: 484-489.

Atterer, Richard, and Albrecht Schmidt. "Tracking the interaction of users with AJAX
applications for usability testing." Conference on Human Factors in Computing
Systems, 2007: 1347 - 1350.

Atterer, Richard, Monica Wnuk, and Albrecht Schmidt. "Knowing the User’s Every
Move – User Activity Tracking for Website Usability Evaluation and Implicit
Interaction." ACM.

Balabanovic, Marko, Yoav Shoham, and Yeogirl Yun. "An Adaptive Agent for
Automated Web Browsing." 1997.

Byrne, Michael D, John R Anderson, Scott Douglass, and Michael Matessa. "Eye
tracking the visual search of click-down menus." Conference on Human Factors in
Computing Systems, 1999.

CERN. Welcome to info.cern.ch/. http://info.cern.ch/.



Chen, Mon Chu, John R Anderson, and Myeong Ho Sohn. "What can a mouse
cursor tell us more?: correlation of eye/mouse movements on web browsing."
Conference on Human Factors in Computing Systems, 2001.

Dutta, Partha, Sandip Debnath, and Sandip Sen. "A shopper's assistant."
International Conference on Autonomous Agents, 2001.

Edmonds, A, R White, D Morris, and S Drucker. "Instrumenting the Dynamic Web." Journal of Web Engineering 6, no. 3 (2007): 243-260.

Edmonds, Andy. "Why the Mouse Doesn't Always Keep Up with the Eye." 2008.

Guo, Qi, and Eugene Agichtein. "Exploring mouse movements for inferring query
intent." Annual ACM Conference on Research and Development in Information
Retrieval, 2008: 1.

Gurney, Kevin N. An introduction to neural networks. illustrated. CRC Press, 1997.

Haykin, Simon. Neural Networks: A Comprehensive Foundation. Prentice Hall.

Jayaputera, G. T., S. W. Loke, and A. Zaslavsky. "Design, implementation and run-time evolution of a mission-based multiagent system." Web Intelligence and Agent Systems 5, no. 2 (2007): 20.

Kohn, Nicholas, and Takashi Yamauchi. "Feature Inference: Tracking Mouse Movement."

Linden, Greg. "Geeking with Greg Exploring the future of personalized information."



Mitchell, Tom. "Decision Tree Learning." In Machine Learning. The McGraw-Hill Companies, Inc., 1997.

Mueller, Florian, and Andrea Lockerd. "Cheese: tracking mouse movement activity
on websites, a tool for user modeling." Conference on Human Factors in Computing
Systems, 2001.

Pazzani, Michael, and Daniel Billsus. "Learning and Revising User Profiles: The
Identification of Interesting Web Sites." Machine Learning 27, no. 3 (1997): 313 -
331.

Perkowitz, M, and O Etzioni. "Towards adaptive web sites: Conceptual framework and case study." Artificial Intelligence 118, no. 1 (2000): 245-275.

Quinlan, J. R. "Improved Use of Continuous Attributes in C4.5." Journal of Artificial Intelligence Research 4 (1996): 77-90.

Rodden, Kerry, Xin Fu, Anne Aula, and Ian Spiro. "Eye-Mouse Coordination Patterns
on Web Search Results Pages." Conference on Human Factors in Computing
Systems, 2008: 5.

Salzberg, Steven L. "C4.5: Programs for Machine Learning." Machine Learning 16,
no. 3 (1994): 235-240.

Schafer, J. Ben, Joseph Konstan, and John Riedl. "Recommender systems in e-commerce." Electronic Commerce, 1999.

The University of Waikato. Weka 3: Data Mining Software in Java. http://www.cs.waikato.ac.nz/ml/weka/.



Torres, Luis A. Leiva, and Roberto Vivo Hernando. "Real time mouse tracking
registration and visualization tool for usability evaluation on websites."
http://smt.speedzinemedia.com/smt/docs/smt_IADIS07.pdf.

Usmani, Zeeshan-ul-hassan, Fawzi A. Alghamdi, and Talal Naveed Puri. "Intelligent Web Interactions - What, When and How?" Web Intelligence & Intelligent Agent, 2008: 3.

W3Schools. Ajax. http://www.w3schools.com/Ajax/.

Wikipedia. C4.5 Algorithm. http://en.wikipedia.org/wiki/C4.5_algorithm.

—. Machine Learning Wikipedia. http://en.wikipedia.org/wiki/Machine_learning.

—. Multilayer Perceptron. http://en.wikipedia.org/wiki/Multilayer_perceptron.

Winston, P. Learning by building identification trees. Addison-Wesley Publishing Company, 1992.

Witten, Ian H, and Eibe Frank. Data Mining: Practical machine learning tools and
techniques. San Francisco: Morgan Kaufmann, 2005.



APPENDIX: SOURCE CODE

The final HTML webpage


The HTML code of the final website, capable of tracking the user's mouse movements as well as predicting the relevant product for the user, is as follows:


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"


"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>MSc Project - Compare Laptops</title>
<link rel="stylesheet" type="text/css" href="mouseover.css"/>
<script type="text/javascript" src="mouseover.js" ></script>
</head>

<body onload="start_It();">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td><p class="oce-first"><span class="bold">NOTE:</span> Surf on this page like you do on a
shopping portal comparison page and decide on a model based on its configuration and buy it.
Thanks</p></td>
</tr>
<tr>
<td>&nbsp;</td>
</tr>
<tr>
<td><table width="100%" border="0" align="center" cellpadding="0" cellspacing="0"
onMouseOver="hiliteColumn(event);" onMouseOut="resetColumn(event);" class="one-column-emphasis">
<colgroup class="oce-first" id="na"></colgroup>
<colgroup id="cg2" class=""></colgroup>
<colgroup id="cg3" class=""></colgroup>
<colgroup id="cg4" class=""></colgroup>
<colgroup id="cg5" class=""></colgroup>
<colgroup id="cg6" class=""></colgroup>

<thead>
<tr>
<th onmouseout="movement_out('a0');" onmouseover="movement_in();">Product Name</th>
<th onmouseout="movement_out('a1');" onmouseover="movement_in();">Lenovo IdeaPad Y650
4185</th>
<th onmouseout="movement_out('a2');" onmouseover="movement_in();">HP Pavilion dv7-
1285dx</th>
<th onmouseout="movement_out('a3');" onmouseover="movement_in();">Sony VAIO VGN-P588E</th>
<th onmouseout="movement_out('a4');" onmouseover="movement_in();">Dell Studio XPS 16</th>
<th onmouseout="movement_out('a5');" onmouseover="movement_in();">Toshiba Satellite A205-
S4617</th>
</tr>
</thead>
<tbody>
<tr>
<td class="oce-first" onmouseout="movement_out('b0');"
onmouseover="movement_in();">&nbsp;</td>
<td onmouseout="movement_out('b1');" onmouseover="movement_in();"><img src="images/1.gif"
width="120" height="90" border="0" /></td>
<td onmouseout="movement_out('b2');" onmouseover="movement_in();"><img src="images/2.gif"
width="120" height="90" border="0" /></td>
<td onmouseout="movement_out('b3');" onmouseover="movement_in();"><img src="images/3.gif"
width="120" height="90" border="0" /></td>
<td onmouseout="movement_out('b4');" onmouseover="movement_in();"><img src="images/4.gif"
width="120" height="90" border="0" /></td>

82

>>
Appendix:
Source
Code


<td onmouseout="movement_out('b5');" onmouseover="movement_in();"><img src="images/5.gif"


width="120" height="90" border="0" /></td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('c0');"
onmouseover="movement_in();">Price</td>
<td onmouseout="movement_out('c1');" onmouseover="movement_in();">$1,249.00</td>
<td onmouseout="movement_out('c2');" onmouseover="movement_in();">$1,199.99</td>
<td onmouseout="movement_out('c3');" onmouseover="movement_in();">$1,133.00</td>
<td onmouseout="movement_out('c4');" onmouseover="movement_in();">$1,224.00</td>
<td onmouseout="movement_out('c5');" onmouseover="movement_in();">$1,249.00</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('d0');" onmouseover="movement_in();">CNET
editors' rating</td>
<td onmouseout="movement_out('d1');" onmouseover="movement_in();">3.5/5.0</td>
<td onmouseout="movement_out('d2');" onmouseover="movement_in();">3.5/5.0</td>
<td onmouseout="movement_out('d3');" onmouseover="movement_in();">3.5/5.0</td>
<td onmouseout="movement_out('d4');" onmouseover="movement_in();">3.5/5.0</td>
<td onmouseout="movement_out('d5');" onmouseover="movement_in();">3.5/5.0</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('e0');" onmouseover="movement_in();">Average
user rating</td>
<td onmouseout="movement_out('e1');" onmouseover="movement_in();">No Data</td>
<td onmouseout="movement_out('e2');" onmouseover="movement_in();">4.0/5.0</td>
<td onmouseout="movement_out('e3');" onmouseover="movement_in();">2.0/5.0</td>
<td onmouseout="movement_out('e4');" onmouseover="movement_in();">No Data</td>
<td onmouseout="movement_out('e5');" onmouseover="movement_in();">3.0/5.0</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('f0');" onmouseover="movement_in();">Release
date</td>
<td onmouseout="movement_out('f1');" onmouseover="movement_in();">April 15, 2009</td>
<td onmouseout="movement_out('f2');" onmouseover="movement_in();">February 01, 2009</td>
<td onmouseout="movement_out('f3');" onmouseover="movement_in();">January 08, 2009</td>
<td onmouseout="movement_out('f4');" onmouseover="movement_in();">January 07, 2009</td>
<td onmouseout="movement_out('f5');" onmouseover="movement_in();">April 16, 2007</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('g0');" onmouseover="movement_in();">The
Bottom Line</td>
<td onmouseout="movement_out('g1');" onmouseover="movement_in();">Online media consumers
who want a portable laptop with high style and plenty of screen real estate should give the Y650 a
look.</td>
<td onmouseout="movement_out('g2');" onmouseover="movement_in();">HP's Pavilion dv7-1245dx
is a slick multimedia machine with great battery life, but for $1,200, we want a full 1080p
display.</td> <td onmouseout="movement_out('g3');" onmouseover="movement_in();">Sony's upscale
Atom-powered Lifestyle PC has the components of a cheaper machine but the design of a more
expensive one. The end result will be a useful travel PC for some and a conversation piece for
others.</td>
<td onmouseout="movement_out('g4');" onmouseover="movement_in();">Dell's new 16:9 Studio
XPS 16 adds upscale extras such as a leather trim and a backlit keyboard to a fairly standard set
of components, without jacking up the price (too much).</td>
<td onmouseout="movement_out('g5');" onmouseover="movement_in();">Toshiba adds faster Draft
N Wi-Fi to this attractive if otherwise fairly conventional laptop. Just be sure you've got an
802.11n router to go along with it.</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('h0');" onmouseover="movement_in();">Similar
Products</td>
<td onmouseout="movement_out('h1');" onmouseover="movement_in();">&nbsp;</td>
<td onmouseout="movement_out('h2');" onmouseover="movement_in();">&nbsp;</td>
<td onmouseout="movement_out('h3');" onmouseover="movement_in();">&nbsp;</td>
<td onmouseout="movement_out('h4');" onmouseover="movement_in();">&nbsp;</td>
<td onmouseout="movement_out('h5');" onmouseover="movement_in();">&nbsp;</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('i0');"
onmouseover="movement_in();">Networking</td>
<td onmouseout="movement_out('i1');" onmouseover="movement_in();">Network adapter -
Ethernet<br />
- IEEE 802.11a<br />
- IEEE 802.11b<br />

- IEEE 802.11g<br />
- Fast Ethernet<br />
- Gigabit Ethernet<br />
- Bluetooth 2.1 EDR<br />
- IEEE 802.11n (draft)</td>
<td onmouseout="movement_out('i2');" onmouseover="movement_in();">Network adapter -
Ethernet<br />
- IEEE 802.11a<br />
- IEEE 802.11b<br />
- IEEE 802.11g<br />
- Fast Ethernet<br />
- Gigabit Ethernet<br />
- IEEE 802.11n (draft)
</td>
<td onmouseout="movement_out('i3');" onmouseover="movement_in();">Network adapter -
Ethernet<br />
- IEEE 802.11b<br />
- IEEE 802.11g<br />
- Fast Ethernet<br />
- Gigabit Ethernet<br />
- Bluetooth 2.1 EDR<br />
- IEEE 802.11n (draft)
</td>
<td onmouseout="movement_out('i4');" onmouseover="movement_in();">Network adapter - Gigabit
Ethernet</td>
<td onmouseout="movement_out('i5');" onmouseover="movement_in();">Network adapter -
Ethernet<br />
- IEEE 802.11a<br />
- IEEE 802.11b<br />
- IEEE 802.11g<br />
- Fast Ethernet<br />
- IEEE 802.11n (draft)</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('j0');"
onmouseover="movement_in();">Graphics Controller</td>
<td onmouseout="movement_out('j1');" onmouseover="movement_in();">NVIDIA GeForce G105M -
256 MB</td>
<td onmouseout="movement_out('j2');" onmouseover="movement_in();">NVIDIA GeForce 9600M GT -
512 MB</td>
<td onmouseout="movement_out('j3');" onmouseover="movement_in();">Intel GMA 500</td>
<td onmouseout="movement_out('j4');" onmouseover="movement_in();">ATI Mobility RADEON HD
3670 - 512MB - 512 MB</td>
<td onmouseout="movement_out('j5');" onmouseover="movement_in();">Intel GMA 950 - 8 MB</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('k0');"
onmouseover="movement_in();">Notebook Camera</td>
<td onmouseout="movement_out('k1');" onmouseover="movement_in();">Integrated - 1.3
Megapixel</td>
<td onmouseout="movement_out('k2');" onmouseover="movement_in();">Info unavailable</td>
<td onmouseout="movement_out('k3');" onmouseover="movement_in();">Integrated</td>
<td onmouseout="movement_out('k4');" onmouseover="movement_in();">Info unavailable</td>
<td onmouseout="movement_out('k5');" onmouseover="movement_in();">Info unavailable</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('l0');" onmouseover="movement_in();">Optical
Storage</td>
<td onmouseout="movement_out('l1');" onmouseover="movement_in();">DVD-Writer -
Integrated</td>
<td onmouseout="movement_out('l2');" onmouseover="movement_in();">DVD±RW (±R DL) / DVD-RAM
with LightScribe Technology</td>
<td onmouseout="movement_out('l3');" onmouseover="movement_in();">None</td>
<td onmouseout="movement_out('l4');" onmouseover="movement_in();">8X DVD+/- RW(DVD/CD
read/write) Slot Load Drive</td>
<td onmouseout="movement_out('l5');" onmouseover="movement_in();">DVD±RW (±R DL) / DVD-RAM
- Integrated</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('m0');"
onmouseover="movement_in();">RAM</td>
<td onmouseout="movement_out('m1');" onmouseover="movement_in();">4 GB (installed) / 8 GB
(max) - DDR3 SDRAM - 1066 MHz - PC3-8500 ( 2 x 2 GB )</td>

<td onmouseout="movement_out('m2');" onmouseover="movement_in();">6 GB (installed) / 8 GB
(max) - DDR2 SDRAM</td>
<td onmouseout="movement_out('m3');" onmouseover="movement_in();">2 GB (installed) / 2 GB
(max) - DDR2 SDRAM - 533 MHz ( 1 x 2 GB )</td>
<td onmouseout="movement_out('m4');" onmouseover="movement_in();">4 GB DDR3 SDRAM</td>
<td onmouseout="movement_out('m5');" onmouseover="movement_in();">2 GB (installed) / 4 GB
(max) - DDR2 SDRAM - 667 MHz - PC2-5300</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('n0');" onmouseover="movement_in();">Cache
Memory</td>
<td onmouseout="movement_out('n1');" onmouseover="movement_in();">3 MB - L2 cache</td>
<td onmouseout="movement_out('n2');" onmouseover="movement_in();">3 MB - L2 cache</td>
<td onmouseout="movement_out('n3');" onmouseover="movement_in();">512 KB - L2 cache</td>
<td onmouseout="movement_out('n4');" onmouseover="movement_in();">Info unavailable</td>
<td onmouseout="movement_out('n5');" onmouseover="movement_in();">2 MB - L2 cache</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('o0');"
onmouseover="movement_in();">Processor</td>
<td onmouseout="movement_out('o1');" onmouseover="movement_in();">Intel Core 2 Duo P8700 /
2.53 GHz ( Dual-Core )</td>
<td onmouseout="movement_out('o2');" onmouseover="movement_in();">Intel Core 2 Duo P8600 /
2.4 GHz ( Dual-Core )</td>
<td onmouseout="movement_out('o3');" onmouseover="movement_in();">Intel 1.33 GHz</td>
<td onmouseout="movement_out('o4');" onmouseover="movement_in();">Intel Core 2 Duo P8700 /
2.53 GHz</td>
<td onmouseout="movement_out('o5');" onmouseover="movement_in();">Intel Core 2 Duo T5500 /
1.66 GHz ( Dual-Core )</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('p0');" onmouseover="movement_in();">Hard
Drive</td>
<td onmouseout="movement_out('p1');" onmouseover="movement_in();">320 GB - Serial ATA-300 -
5400 rpm</td>
<td onmouseout="movement_out('p2');" onmouseover="movement_in();">500 GB - Serial ATA-150 -
5400 rpm</td>
<td onmouseout="movement_out('p3');" onmouseover="movement_in();">64 GB - Serial ATA-
150</td>
<td onmouseout="movement_out('p4');" onmouseover="movement_in();">500 GB - 5400 rpm</td>
<td onmouseout="movement_out('p5');" onmouseover="movement_in();">250 GB - Serial ATA-150 -
4200 rpm</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('q0');"
onmouseover="movement_in();">Display</td>
<td onmouseout="movement_out('q1');" onmouseover="movement_in();">16 in TFT active matrix
1366 x 768 ( WXGA ) - VibrantView</td>
<td onmouseout="movement_out('q2');" onmouseover="movement_in();">17 in TFT active matrix
1440 x 900 ( WXGA+ ) - BrightView</td>
<td onmouseout="movement_out('q3');" onmouseover="movement_in();">8 in TFT active matrix
1600 x 768</td>
<td onmouseout="movement_out('q4');" onmouseover="movement_in();">16.0</td>
<td onmouseout="movement_out('q5');" onmouseover="movement_in();">15.4 in TFT active matrix
1280 x 800 ( WXGA ) - 24-bit (16.7 million colors)</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('r0');"
onmouseover="movement_in();">Battery</td>
<td onmouseout="movement_out('r1');" onmouseover="movement_in();">Lithium ion</td>
<td onmouseout="movement_out('r2');" onmouseover="movement_in();">Lithium ion</td>
<td onmouseout="movement_out('r3');" onmouseover="movement_in();">Lithium ion</td>
<td onmouseout="movement_out('r4');" onmouseover="movement_in();">Info unavailable</td>
<td onmouseout="movement_out('r5');" onmouseover="movement_in();">Lithium ion</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('s0');"
onmouseover="movement_in();">Dimensions (WxDxH)</td>
<td onmouseout="movement_out('s1');" onmouseover="movement_in();">15.4 in x 10.2 in x 1
in</td>
<td onmouseout="movement_out('s2');" onmouseover="movement_in();">15.6 in x 11.2 in x 1.7
in</td>
<td onmouseout="movement_out('s3');" onmouseover="movement_in();">9.6 in x 4.7 in x 0.8
in</td>

<td onmouseout="movement_out('s4');" onmouseover="movement_in();">Info unavailable</td>
<td onmouseout="movement_out('s5');" onmouseover="movement_in();">14.3 in x 10.6 in x 1.3
in</td>
</tr><tr>
<td class="oce-first" onmouseout="movement_out('t0');"
onmouseover="movement_in();">Weight</td>
<td onmouseout="movement_out('t1');" onmouseover="movement_in();">5.5 lbs</td>
<td onmouseout="movement_out('t2');" onmouseover="movement_in();">7.7 lbs</td>
<td onmouseout="movement_out('t3');" onmouseover="movement_in();">1.4 lbs</td>
<td onmouseout="movement_out('t4');" onmouseover="movement_in();">Info unavailable</td>
<td onmouseout="movement_out('t5');" onmouseover="movement_in();">6.4 lbs</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('u0');" onmouseover="movement_in();">OS
Provided</td>
<td onmouseout="movement_out('u1');" onmouseover="movement_in();">Microsoft Windows Vista
Home Premium 64-bit Edition</td>
<td onmouseout="movement_out('u2');" onmouseover="movement_in();">Microsoft Windows Vista
Home Premium</td>
<td onmouseout="movement_out('u3');" onmouseover="movement_in();">Microsoft Windows Vista
Home Premium Edition</td>
<td onmouseout="movement_out('u4');" onmouseover="movement_in();">Microsoft Windows
Vista</td>
<td onmouseout="movement_out('u5');" onmouseover="movement_in();">Microsoft Windows Vista
Home Premium</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('v0');"
onmouseover="movement_in();">Attribute X</td>
<td onmouseout="movement_out('v1');" onmouseover="movement_in();">x1</td>
<td onmouseout="movement_out('v2');" onmouseover="movement_in();">x2</td>
<td onmouseout="movement_out('v3');" onmouseover="movement_in();">x3</td>
<td onmouseout="movement_out('v4');" onmouseover="movement_in();">x4</td>
<td onmouseout="movement_out('v5');" onmouseover="movement_in();">x5</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('w0');"
onmouseover="movement_in();">Attribute Y</td>
<td onmouseout="movement_out('w1');" onmouseover="movement_in();">y1</td>
<td onmouseout="movement_out('w2');" onmouseover="movement_in();">y2</td>
<td onmouseout="movement_out('w3');" onmouseover="movement_in();">y3</td>
<td onmouseout="movement_out('w4');" onmouseover="movement_in();">y4</td>
<td onmouseout="movement_out('w5');" onmouseover="movement_in();">y5</td>
</tr>
<tr>
<td class="oce-first" onmouseout="movement_out('x0');"
onmouseover="movement_in();">Attribute Z</td>
<td onmouseout="movement_out('x1');" onmouseover="movement_in();">z1</td>
<td onmouseout="movement_out('x2');" onmouseover="movement_in();">z2</td>
<td onmouseout="movement_out('x3');" onmouseover="movement_in();">z3</td>
<td onmouseout="movement_out('x4');" onmouseover="movement_in();">z4</td>
<td onmouseout="movement_out('x5');" onmouseover="movement_in();">z5</td>
</tr> </tbody>
<tfoot>
<tr>
<td class="oce-first" onmouseout="movement_out('z0');"
onmouseover="movement_in();">&nbsp;</td>
<td onmouseout="movement_out('z1');" onmouseover="movement_in();"><input type="submit"
name="button" onclick="bought('1');" value="Buy Now" /></td>
<td onmouseout="movement_out('z2');" onmouseover="movement_in();"><input type="submit"
name="button" onclick="bought('2');" value="Buy Now" /></td>
<td onmouseout="movement_out('z3');" onmouseover="movement_in();"><input type="submit"
name="button" onclick="bought('3');" value="Buy Now" /></td>
<td onmouseout="movement_out('z4');" onmouseover="movement_in();"><input type="submit"
name="button" onclick="bought('4');" value="Buy Now" /></td>
<td onmouseout="movement_out('z5');" onmouseover="movement_in();"><input type="submit"
name="button" onclick="bought('5');" value="Buy Now" /></td>
</tr></tfoot>
</table></td>
</tr>
</table>
</body>
</html>



The JavaScript file


var http = getHTTPObject();


var cellEntryDate;
var cellExitDate;
var time;
var queue1="";
var queue2="";
var flag=0;
var done = 0;
var tempQueue="";
var userId=new Date();
userId=userId.getTime();
var startpredict=0;
var predictProduct=0;

function autoPredict()
{
setTimeout("predict()",10000);
}

function predict()
{
http.open("GET", "predict.php?userId="+userId, true);
http.onreadystatechange = predictResponse;
http.send(null);
}

function predictResponse()
{
if (http.readyState == 4)
{
predictProduct = http.responseText;
var colName=Number(predictProduct)+1;
document.getElementById("cg2").className="";
document.getElementById("cg3").className="";
document.getElementById("cg4").className="";
document.getElementById("cg5").className="";
document.getElementById("cg6").className="";
document.getElementById("cg"+colName).className="oce-predict";

alert("Product : "+predictProduct);
setTimeout("predict()",10000);
}
}

function handleHttpResponse()
{
if (http.readyState == 4)
{
start_It();
}
}

function handleHttpResponseBought()
{
if (http.readyState == 4)
{
alert("Thanks for Participating");
}
}

function start_It() {
if(done==0)
{
setTimeout("sendData()",2000);
}
if(startpredict==0)
{
++startpredict;
autoPredict();
}

}

function sendData() {
if(flag==0)
{
queue2="";
flag=1;
var query_string = "data.php?userId="+userId+"&queue="+queue1;
queue1="";
}
else
{
queue1="";
flag=0;
var query_string = "data.php?userId="+userId+"&queue="+queue2;
queue2="";
}
http.open("GET", query_string, true);
http.onreadystatechange = handleHttpResponse;
http.send(null);
}

function movement_in(){
cellEntryDate = new Date();
}

function movement_out(cell){
cellExitDate = new Date();
time = cellExitDate.getTime()-cellEntryDate.getTime();
if(done==0)
{
if(flag==0)
{
queue1 = queue1+cell+":"+time+"_";
}
else
{
queue2 = queue2+cell+":"+time+"_";
}
}
}

function bought(product){
done=1;
var query_bought = "bought.php?userId="+userId+"&product="+product;
http.open("GET", query_bought, true);
http.onreadystatechange = handleHttpResponseBought;
http.send(null);
}

function getHTTPObject() {
var xmlhttp;
/*@cc_on
@if (@_jscript_version >= 5)
try {
xmlhttp = new ActiveXObject("Msxml2.XMLHTTP");
} catch (e) {
try {
xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
} catch (E) {
xmlhttp = false;
}
}
@else
xmlhttp = false;
@end @*/
if (!xmlhttp && typeof XMLHttpRequest != 'undefined') {
try {
xmlhttp = new XMLHttpRequest();
} catch (e) {
xmlhttp = false;
}
}
return xmlhttp;

}

function hiliteColumn(e) {
var o = (document.all) ? e.srcElement : e.target;
if (o.nodeName != "TD") return;
document.getElementById("cg"+(o.cellIndex+1)).className="over";
}

function resetColumn(e) {
var o = (document.all) ? e.srcElement : e.target;
if (o.nodeName != "TD") return;
document.getElementById("cg"+(o.cellIndex+1)).className="";
}



The CSS file


body {
margin-left: 0px;
margin-top: 0px;
margin-right: 0px;
margin-bottom: 0px;
text-align: left;
}
colgroup.over {
background: #ebeeff;
}

.oce-first
{
background: #d0dafd;
border-right: 10px solid transparent;
border-left: 10px solid transparent;
min-width:199px;
font-size: 14px;
padding: 12px 15px;
color: #039;
text-align:justify;
}

.oce-predict
{
background: #d0dafd;
border-right: 3px solid #F00;
border-left: 3px solid #F00;
border-top: 3px solid #F00;
border-bottom: 3px solid #F00;
min-width:199px;
font-size: 14px;
padding: 12px 15px;
color: #039;
text-align:justify;
}

table.one-column-emphasis
{
font-family: "Lucida Sans Unicode", "Lucida Grande", Sans-Serif;
font-size: 12px;
width: 100%;
border-collapse: collapse;
color: #969;
}

table.one-column-emphasis th
{
font-size: 14px;
font-weight: bold;
padding: 12px 15px;
color: #039;
text-align:center;
}

table.one-column-emphasis td
{
padding: 10px 15px;
color: #669;
border-top: 1px solid #e8edff;
min-width:166px;
text-align:center;
}

table.one-column-emphasis tr:hover td
{
background: #ebeeff;
text-align: center;
}
table.one-column-emphasis tr:hover td:hover

{
color: #039;
background: #94acff;
}
.bold {
font-weight: bold;
}

.italics {
font-style: italic;
}

.oce-first {
text-align: justify;
}



The PHP scripts


data.php


<?php
$queue=$_GET['queue'];
$userId=$_GET['userId'];
include("connect.php");
$queueArray=explode("_",$queue);
for($i=0;$i<substr_count($queue,"_");$i++)
{
$values=explode(":",$queueArray[$i]);
mysql_query("INSERT into data
values(\"".$userId."\",\"".$values[0]."\",\"".$values[1]."\")");
}
mysql_close($conn);
?>

connect.php


<?php
$dbhost = 'localhost:8889';
$dbuser = 'root';
$dbpass = 'root';
$conn = mysql_connect($dbhost, $dbuser, $dbpass) or die ('Error connecting to mysql');
$dbname = 'MSc';
mysql_select_db($dbname);
?>

bought.php


<?php
$product=$_GET['product'];
$userId=$_GET['userId'];
include("connect.php");
mysql_query("INSERT into bought values(\"".$userId."\",\"".$product."\")");
mysql_close($conn);
?>

alignData.php


<?php
include("connect.php");
$result=mysql_query("SELECT * FROM `bought`");
while($row = mysql_fetch_array($result))
{
$result_1=mysql_query("SELECT * FROM `data` WHERE `userID`=\"".$row['userId']."\" order by
`cellID`");
$columnNames="";
$values="";
$row_1 = mysql_fetch_array($result_1);
$previous_column=$row_1['cellID'];
$previous_value=$row_1['time'];
while($row_1 = mysql_fetch_array($result_1))
{
if($previous_column==$row_1['cellID']) {
$previous_value+=$row_1['time'];
}



else {
$columnNames=$columnNames.",".$previous_column;
$values=$values.",\"".$previous_value."\"";
$previous_value=$row_1['time'];
$previous_column=$row_1['cellID'];
}
}
$columnNames=$columnNames.",".$previous_column.",product";
$values=$values.",\"".$previous_value."\",\"".$row['product']."\"";
mysql_query("INSERT INTO finalData(userID".$columnNames.") values
(\"".$row['userId']."\"".$values.")");
mysql_query("DELETE from `bought` where `userId` = \"".$row['userId']."\"");
mysql_query("DELETE from `data` where `userID` = \"".$row['userId']."\"");
}
mysql_close($conn);
?>
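alignData.php walks each user's time-ordered rows and sums consecutive entries that share a cellID before writing a single row per user into finalData. Since the rows arrive sorted by cellID, that is equivalent to grouping by cell; the step can be sketched in JavaScript for illustration (aggregateTimes is a hypothetical helper, not project code):

```javascript
// Sum dwell times per cell from a list of { cell, time } records,
// reproducing the consecutive-row aggregation done in alignData.php
// (rows are sorted by cellID there, so grouping by key is equivalent).
function aggregateTimes(records) {
  var totals = {};
  records.forEach(function (r) {
    totals[r.cell] = (totals[r.cell] || 0) + r.time;
  });
  return totals;
}

// Example: the user visited cell a1 twice and b3 once.
var totals = aggregateTimes([
  { cell: "a1", time: 120 },
  { cell: "a1", time: 30 },
  { cell: "b3", time: 450 }
]);
// totals -> { a1: 150, b3: 450 }
```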

predict.php


<?php
include("connect.php");
$totalTime=0;
$result_1=mysql_query("SELECT * FROM `data` WHERE `userID`=\"".$_GET['userId']."\" order by cellID");
$result_2=mysql_query("SELECT * FROM `data` WHERE `userID`=\"".$_GET['userId']."\" order by cellID");
while($row_2 = mysql_fetch_array($result_2))
$totalTime+=$row_2['time'];

$columnNames="";
$values="";
$row_1 = mysql_fetch_array($result_1);
$previous_column=$row_1['cellID'];
$previous_value=$row_1['time'];
while($row_1 = mysql_fetch_array($result_1))
{
if($previous_column==$row_1['cellID'])
{
$previous_value+=$row_1['time'];
}
else
{
$$previous_column=$previous_value/$totalTime;
$previous_value=$row_1['time'];
$previous_column=$row_1['cellID'];

}
}
$$previous_column=$previous_value/$totalTime;

decisionTree();
//neuralNetwork();

function decisionTree()
{
$model_DT=0;
if($b5 <= 0.04509)
if($k4 <= 0.013828)
if($v1 <= 0.000362)
if($r0 <= 0.000626)
if($d5 <= 0.003481)
if($d5 <= 0.001586)
if($g4 <= 0.033267)
if($s3 <= 0.004874)
if($u1 <= 0.002108)
if($f1 <= 0.039667)
if($f4 <= 0.028894)
if($i4 <= 0.004699)
if($d2 <= 0.001173)
if($e5 <= 0.001377)



if($e1 <= 0.029566)


if($r3 <= 0.000861)
if($c1 <= 0.043665)
if($a3 <= 0.206815)
if($b1 <= 0.007319)
if($f3 <= 0.001471)
if($b4 <= 0.00214)
$model_DT=2;
else
if($a4 <=
0.004126) $model_DT=3;
else $model_DT=2;
else $model_DT=3;
else
if($b3 <= 0.123969)
$model_DT=2;
else $model_DT=1;
else $model_DT=1;
else $model_DT=1;
else $model_DT=3;
else $model_DT=1;
else $model_DT=3;
else
if($s4 <= 0.002873) $model_DT=2;
else $model_DT=4;
else $model_DT=1;
else $model_DT=4;
else $model_DT=3;
else $model_DT=3;
else
if($q1 <= 0.004708)
if($r4 <= 0.007391) $model_DT=3;
else $model_DT=2;
else $model_DT=2;
else
if($g5 <= 0.004141)
if($k4 <= 0.001354) $model_DT=4;
else $model_DT=3;
else $model_DT=2;
else $model_DT=4;
else
if($g5 <= 0.004141)
if($b5 <= 0.002996)
if($g4 <= 0.003922) $model_DT=2;
else $model_DT=1;
else $model_DT=3;
else $model_DT=5;
else $model_DT=4;
else
if($s4 <= 0.005561)
if($t4 <= 0.002371)
if($e0 <= 0.001979)
if($h2 <= 0.005305) $model_DT=1;
else $model_DT=2;
else $model_DT=2;
else $model_DT=2;
else $model_DT=2;
else
if($f5 <= 0.001805) $model_DT=4;
else $model_DT=2;
else
if($t3 <= 0.000515)
if($d4 <= 0.008991)
if($e2 <= 0.011901)
if($a1 <= 0.001341)
if($g2 <= 0.001762) $model_DT=4;
else $model_DT=5;
else $model_DT=5;
else $model_DT=4;
else $model_DT=2;
else $model_DT=3;
echo $model_DT;
}



function neuralNetwork()
{
$Node5=(-0.0209449762256399)
+($a0*0.0120761574490061)+($a1*-0.0174298014185729)+($a2*-0.0175622955697642)+($a3*-0.000798046164731245)+($a4*-0.00566210278243689)+($a5*-0.00257021437573848)
+($b0*0.0813554156049207)+($b1*-0.0383601651270091)+($b2*0.0315342748963075)+($b3*0.04750940128612)+($b4*0.00444930879229902)+($b5*0.0447743155601993)
+($c0*0.0127846301489485)+($c1*0.0167829106398277)+($c2*0.0412283962113621)+($c3*0.0647197008365273)+($c4*0.026137495413712)+($c5*0.0292672102649498)
+($d0*0.0575247995032596)+($d1*-0.0248903478567491)+($d2*-0.0356248056960633)+($d3*0.0131503378763436)+($d4*-0.00943722882163672)+($d5*0.0254130310753136)
+($e0*0.0953293388209953)+($e1*-0.0358630730881965)+($e2*0.09184645890614)+($e3*0.0879998946588433)+($e4*-0.0210989430518799)+($e5*0.0236328879965554)
+($f0*0.0521255666178908)+($f1*0.0562279524027289)+($f2*0.0420766593208718)+($f3*0.0219358641315261)+($f4*0.0500915161629286)+($f5*0.0598788090622592)
+($g0*-0.0106339935340819)+($g1*0.0158371741591566)+($g2*0.0828753056435395)+($g3*-0.0152508552198513)+($g4*-0.00815101349601804)+($g5*0.0268439313590316)
+($h0*0.070123678107641)+($h1*-0.0147305324346031)+($h2*0.0517135568746786)+($h3*-0.0117294349734072)+($h4*-0.00594235655570873)+($h5*0.0410639065208286)
+($i0*-0.00105630930040345)+($i1*-0.00543787837624847)+($i2*0.0603755263497366)+($i3*0.0287693595250936)+($i4*0.0554227984526808)+($i5*0.0600355834517169)
+($j0*0.0186135251521197)+($j1*0.00984875030922667)+($j2*0.0193290574626347)+($j3*0.021484574396215)+($j4*0.0484829773111019)+($j5*0.0233728871769681)
+($k0*0.0410110073637687)+($k1*-0.00743846515678319)+($k2*0.0446579060767132)+($k3*0.00789530586935209)+($k4*0.0185589336156669)+($k5*0.0178833473514336)
+($l0*0.0366297156412459)+($l1*0.0297884220860898)+($l2*0.0450253751867714)+($l3*0.0705159823038729)+($l4*0.074643360814636)+($l5*0.049178643898654)
+($m0*0.00649293306157912)+($m1*0.0235761949995652)+($m2*0.0282972581223614)+($m3*0.00995247757969736)+($m4*0.0635360916248171)+($m5*-0.0185514952082912)
+($n0*0.0798799834823821)+($n1*-0.0367274799798666)+($n2*0.0461992904934746)+($n3*0.0354383668658634)+($n4*-0.00123240277220675)+($n5*-0.0150807856098709)
+($o0*-0.0260784636646052)+($o1*0.0553028912171675)+($o2*0.0802089447351997)+($o3*-0.0235601224487924)+($o4*-0.0281363990127924)+($o5*0.0319917291420718)
+($p0*-0.0257109331590629)+($p1*-0.0279769700636828)+($p2*0.0433907293866429)+($p3*-0.0310545628159805)+($p4*0.0348153094694314)+($p5*-0.00776438719161176)
+($q0*-0.0069736497593223)+($q1*0.0161811177301145)+($q2*0.0576906924312276)+($q3*0.0441712928131897)+($q4*0.0165528172670987)+($q5*-0.0274805831321372)
+($r0*0.0120430047036489)+($r1*-0.000892653621313331)+($r2*0.0868045378672117)+($r3*0.0281943074796785)+($r4*0.0670839346752799)+($r5*0.0110772507057164)
+($s0*0.0214207237015366)+($s1*-0.032511653106313)+($s2*0.0328856849361516)+($s3*0.0313926662260086)+($s4*0.0111177031525771)+($s5*0.0284289901014687)
+($t0*0.0428425565992686)+($t1*0.0534413420371503)+($t2*0.0244766875457709)+($t3*0.0647078085232812)+($t4*0.0112235270733354)+($t5*0.0097765520400492)
+($u0*0.0259846759422365)+($u1*-0.0430507927467189)+($u2*0.107107831659775)+($u3*0.0467301403971514)+($u4*0.0571975966844622)+($u5*-0.0079845822250066)
+($v0*0.0303173561775128)+($v1*-0.0043169837441232)+($v2*0.0866140345320475)+($v3*0.00261036151061667)+($v4*0.00523185366643474)+($v5*-0.0239702999191261);

// Similar code for the remaining 72 nodes has been omitted here because the full
// listing runs to roughly 40 pages. The complete code is available online for reference.

$max=max((1/(1+(1/pow(2.718282,$Node0)))),(1/(1+(1/pow(2.718282,$Node1)))),(1/(1+(1/pow(2.718282,$Node2)))),(1/(1+(1/pow(2.718282,$Node3)))),(1/(1+(1/pow(2.718282,$Node4)))));

if($max==(1/(1+(1/pow(2.718282,$Node0))))) echo "1";
else if($max==(1/(1+(1/pow(2.718282,$Node1))))) echo "2";
else if($max==(1/(1+(1/pow(2.718282,$Node2))))) echo "3";
else if($max==(1/(1+(1/pow(2.718282,$Node3))))) echo "4";
else if($max==(1/(1+(1/pow(2.718282,$Node4))))) echo "5";
}
mysql_close($conn);
?>
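The expression 1/(1+(1/pow(2.718282,$Node))) in predict.php is the logistic sigmoid 1/(1+e^-x), and the predicted product is simply the output node with the largest activation. That final step can be sketched as follows (illustrative JavaScript; the function names are hypothetical, not part of the project code):

```javascript
// Logistic sigmoid, as written in predict.php: 1/(1+(1/e^x)) = 1/(1+e^-x).
function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

// Return the 1-based index of the output node with the highest
// sigmoid activation - the predicted product column.
function predictedProduct(nodes) {
  var best = 0;
  for (var i = 1; i < nodes.length; i++) {
    if (sigmoid(nodes[i]) > sigmoid(nodes[best])) best = i;
  }
  return best + 1;
}

// Example with five output activations: the second node dominates.
var product = predictedProduct([-0.3, 1.7, 0.2, -1.1, 0.9]);
// product -> 2
```

Since the sigmoid is monotonic, comparing the raw node values would yield the same argmax; applying it first simply matches how the network's output layer is defined in the listing.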

