Using Machine Learning on User Mouse Tracking Data

Sparsh Gupta
Pembroke College | Computing Laboratory
University of Oxford

Submitted in partial fulfillment of the requirements for the degree of
Master of Science in Computer Science

September 2009
ABSTRACT
Websites are becoming more and more dynamic, but not intelligent. Based on certain mouse clicks or user choices, today's dynamic websites can mold themselves, but they cannot intelligently predict relevant data. The data contained in today's websites is growing, and the number of users demanding unique, different information is also ever increasing. This has created the challenging problem of delivering the right content to every user. This thesis is an original work concentrating on solving this problem of generating relevant content for each individual user. One of the primary inputs used by the project is the mouse movement behavior of the user. If the website capturing mouse movements is built in such a way that the mouse pointer stays mostly close to the user's point of gaze, then tracking the mouse movement behavior would, in theory, amount to tracking the user's eyes. Based on this mouse movement data, further content can be predicted and personalized for each user using one or more machine learning models. This thesis proposes a complete methodology for building and implementing such a system. As a proof of concept, an online shopping website has been built, and tests have been conducted which gave a remarkable accuracy of 84.09% when compared with the actual needs of the user. A working demonstration of the project, along with its description, is available online at http://sparshgupta.name/MSc/Project
Keywords: adaptive web, machine learning, mouse movement, gaze point
ACKNOWLEDGEMENT
I am heartily thankful to my supervisor, Dr. Vasile Palade, whose encouragement, guidance, confidence in my idea and support from the initial to the final level enabled me to develop this project and understand the subject. I am thankful to the Computing Laboratory, University of Oxford, for accepting my proposal and giving me an opportunity to work on this idea. I gratefully acknowledge the support and help of all the volunteers who helped me collect the data for my work. I would like to thank Prof. Luke Ong and Pembroke College for their co-operation and readiness to help me whenever needed. I would also like to acknowledge the efforts and facilities provided by the staff of the Computing Laboratory Library, the Radcliffe Science Library and the Pembroke College Library. Lastly, I offer my regards to my parents, my sister and my friends, who supported me in all respects throughout the completion of this project.
Sparsh Gupta
TABLE OF CONTENTS

Abstract ................................................................................. ii
Acknowledgement ................................................................. iii
Table of Contents .................................................................. iv
Table of Figures ..................................................................... ix
Introduction ........................................................................... 1
1.1 A Primer ........................................................................... 1
1.1.1 The World Wide Web .................................................... 1
1.2 Motivation ........................................................................ 4
1.3 Objectives ........................................................................ 5
Background, Literature review and Project overview ............. 8
2.1 Coordination of mouse and eye movements ..................... 8
2.4 Discussion ...................................................................... 11
2.5 Project overview ............................................................. 12
Data Collection and Pre-processing ...................................... 15
3.1 The initial website .......................................................... 15
3.1.1 Specifications .............................................................. 15
3.1.2.2 Database Design ...................................................... 20
3.1.2.3 Implementing mouse tracking .................................. 22
3.1.2.4 Final product bought by the user .............................. 25
3.3.2 Implementation ........................................................... 28
3.3.2.1 Data compilation ...................................................... 28
3.3.2.2 Data cleaning ........................................................... 30
Building machine learning models ....................................... 34
4.1 Machine Learning ........................................................... 34
4.1.1 WEKA .......................................................................... 35
4.2 Methods evaluated ......................................................... 35
4.2.1 Decision Tree .............................................................. 36
4.2.2 Neural Network ........................................................... 36
4.3 Implemented algorithms ................................................. 37
4.3.1 Decision Tree (C4.5) .................................................... 38
4.3.2 Neural Network (Multilayer Perceptron) ...................... 39
4.4 Model building ............................................................... 39
4.4.1 Decision Tree .............................................................. 40
4.4.1.2 Testing the decision tree .......................................... 45
4.4.1.2.1 Testing on Training Data ....................................... 45
4.4.2.2.1 Testing on Training Data ....................................... 52
4.4.2.2.2 Testing by Cross-Validation (folds 10) ................... 52
4.4.2.2.3 Discussion ............................................................ 53
Embedding the machine learning models in the website ...... 56
5.1 What and Why? ............................................................... 56
5.3.2 Implementing the Neural Network model .................... 59
5.4 Using model outputs ...................................................... 60
Testing and Results .............................................................. 64
6.1 Testing methodology ...................................................... 64
6.2 Testing for model accuracy ............................................. 64
6.2.1 Testing data collection ................................................ 65
6.2.2.1 Decision Tree model ................................................ 67
6.3 Testing time performance of the models ........................ 69
6.3.1 Decision Tree model ................................................... 70
Bibliography ......................................................................... 78
Appendix: Source Code ........................................................ 82
HTML final webpage ............................................................. 82
The JavaScript file ................................................................ 87
The PHP scripts .................................................................... 92
data.php ............................................................................... 92
connect.php ......................................................................... 92
bought.php .......................................................................... 92
alignData.php ....................................................................... 92
predict.php .......................................................................... 93

TABLE OF FIGURES

Figure 1: Project outline ....................................................... 14
Figure 2: Screenshot of the top half of the developed webpage ... 17
Figure 3: Screenshot of the developed webpage ................... 18
Figure 4: Code given to each section of the webpage ........... 19
Figure 6: Database table 'data' ............................................. 21
Figure 7: Database table 'bought' ........................................ 21
1 INTRODUCTION
This chapter includes a brief overview of a few terms. It then discusses the coordination between eye and mouse movement and how mouse movement data can be used as pseudo eye tracking data. Later, this chapter talks about the motivation behind this project and clarifies the objectives of the research and the structure of this document.
1.1 A Primer
This section of the chapter will discuss a brief history of the World Wide Web (WWW), the use of a computer mouse and the current eye tracking technology. It will later explain how the WWW can be improved by using eye tracking data and how a mouse pointer can be used to collect pseudo eye tracking data.
1.1.1 The World Wide Web
In 1990, CERN launched the world's first website¹, which was only a few lines of text and hyperlinks. In the nineteen years since, websites have been completely revolutionized. Plain text is now accompanied by all sorts of rich media, including images, music, videos, animations, colours, etc. Dynamic data from ever-increasing databases is rapidly replacing the static content of websites. Web servers are now capable of more real-time computing. Data cannot only be shown to a user, but can also be collected from him easily. Recently, the success of AJAX¹ has completely changed the web experience by making it much more interactive and more data driven. Today, the Internet has changed everything: how we do business, how we study, how we connect with friends and, in general, how we live.

¹ CERN, Welcome to info.cern.ch/, http://info.cern.ch/.
1.1.2 The computer mouse device
Most people in the world use a computer-pointing device (generally a mouse) to navigate through a website. They click hyperlinks spread across different sections of a webpage, select text or scroll through a long page using a computer mouse. The mouse can safely be called a personal assistant while working on a computer, and especially while browsing a website.
1.1.3 Eye tracking
Eye tracking, or gaze tracking, is the process of measuring the gaze, i.e., keeping track of the point at which a user is looking. Most websites carry visual information in the form of text, images, graphics, etc., and almost all the information a user obtains from a website is perceived through his eyes. Eye tracking, when applied to a website, can be imagined as a method of determining the portion of the screen at which the user is looking. This information can potentially give a fair idea of the sections most relevant to him. The more time a user spends looking at a particular section, reading it or simply viewing it, the more interested he is in that section compared to the others on the same page.

¹ W3Schools, Ajax, http://www.w3schools.com/Ajax/.
1.1.4 WWW and the missing gap
Websites have started becoming dynamic by accepting inputs from a user, which are then used to select relevant content or information for him. The kinds of input that current websites primarily employ are mouse clicks, key presses, text entered, and choices made by the user in the form elements of the page. This, conversely, means that if the user is not interested in giving any data as input, the website ends up static, without any information on user needs.

Eye tracking data, if captured for a general user, can be used extensively to make today's websites more adaptive and intelligent by harnessing knowledge of the user's interests and of the information he is most interested in. Without seeking any external data from the user, his interests and needs can be determined from his eye movements, and he can be served the data he is most interested in.
1.1.5 Tracking mouse pointer to track user’s eyes
There has been a lot of research into improving the computing experience for a user by tracking his eye location, but there are a few drawbacks associated with it. Firstly, the tracking equipment is expensive and the user needs to physically wear the tracking gadget. Not everyone using the Internet would want to, or can, wear the tracking equipment, and hence general public websites cannot be made dependent on it. There is also ongoing research into determining the movement of the eyes using a camera device but, as of now, the accuracy of determining the gaze position is low; it depends on the movements of the user and the lighting conditions and, most importantly, the user needs to download external software. Because of these limitations of the eye tracking methods, there has been research into finding other alternatives.

Recently, Googlers Kerry Rodden and Xin Fu proposed in their paper (Rodden, et al. 2008) that mouse movements show potential as a way to estimate where the user has looked before deciding where to click. Other studies have provided a reasonable estimate of the coordination of mouse and eye, especially on a page on which a click is likely to happen. Hence, tracking a user's mouse movements can sometimes serve as pseudo eye tracking data.

There are several interface design techniques in Human Computer Interaction with which a website can make sure that, in most cases, the user's mouse pointer is close to his point of gaze. One of the techniques employed in this project is mouse-over cell highlighting. If the content at the current location of the mouse pointer is highlighted to make it stand out from the rest of the page, then this can almost always ensure that the movement of the mouse pointer is synchronized with the area the user is currently reading or gazing at.
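The cell-highlighting technique can be sketched as a small state machine: at any moment exactly one section is emphasized, and entering a new section resets the previous one. The sketch below models sections as plain objects so that it is self-contained; on a real page the same logic would live in a mouseover handler that toggles a CSS class. The section names are illustrative, not the ones used in the project.

```javascript
// Sketch of mouse-over cell highlighting: the section under the pointer
// is emphasized, and the previously highlighted section is reset.
// Sections are plain objects here; on a real page these would be table
// cells and `highlighted` would be a CSS class toggle.
function createHighlighter(sections) {
  let active = null; // id of the currently highlighted section
  return function highlight(id) {
    if (active !== null) sections[active].highlighted = false;
    sections[id].highlighted = true;
    active = id;
  };
}

const sections = { ram: { highlighted: false }, hdd: { highlighted: false } };
const highlight = createHighlighter(sections);
highlight('ram'); // pointer enters the RAM cell
highlight('hdd'); // pointer moves on: RAM is reset, HDD is highlighted
```

Because only the area under the pointer stands out, the user's gaze is drawn toward the pointer, which is what keeps the two synchronized.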
1.2 Motivation
Many websites do not ask for any explicit input from the user but can still adapt themselves. They primarily use either some geographical information (which can be obtained from the user's IP address) or the browser/operating system specifications to adapt the web content for the user. This adaptation is, of course, not targeted at an individual user; it is only a broad adaptation to cater to a group of users having similar demographics or preferences. The adaptation of a website can be based on even the smallest bit of information from the user. The more information the website obtains about the user, the better it is capable of adapting to his needs.
The primary medium of interaction of a user with a website is the mouse device, and it produces a huge amount of data in the form of mouse movement behaviour. The motivation behind this thesis and project is the existing gap between the demand for more user data to make a website adaptive and the availability of ample data from the user in the form of his mouse movements. Further, if a website is designed in such a way that, more often than not, the user's mouse pointer movement is synchronized with his point of gaze, as discussed in Section 1.1.5, then this data can also loosely be called eye tracking data.
1.3 Objectives
The objective of this project is to effectively utilize the mouse movement data of a user to make the web content more adaptive for him, by dynamically predicting further relevant content for him. In order to achieve this main objective, the following sub-objectives need to be met:
• Collecting the initial training dataset of mouse movement behavior from a large set of users in order to train and build a model. This will involve developing a website with well-defined areas, sections or elements where mouse movements can be tracked. The website needs to be built in such a way that the user's mouse pointer synchronizes with his point of gaze.
• Asking volunteering users to visit this site and choose or select content for themselves, as they do on any other website. The required data is the time spent at each section/element of the page while the user is browsing it. The target (predicted or dependent) variable is the relevant content for him, and hence, in order to train the model, this data point (as collected explicitly from the user) also needs to be saved in the databases.
• Building machine learning models using the collected mouse movement data as the training and initial testing dataset. The distribution of normalized time spent by the user at each section of the webpage would be the independent variables, and hence the input attributes of the model; further content for the user will be the output of the model, as the dependent variable.
• Embedding the machine learning models back into the website so that the models can be put to use. The website would continue tracking users' mouse movements and would use the built model to compute further content for them in real time.
• Testing the accuracy of the implementation. To do this, the predicted content needs to be compared with the actual content desired by the user.
To demonstrate the objectives, a sample shopping webpage has been developed. This webpage contains a comparison of the specifications of five laptop models. Based on the mouse movement behavior of a user across this page, the best laptop is recommended to him.
This can be visualized as follows: if a user has a browsing pattern that signifies that he is spending, say, 40% of his time reading about the RAM of the laptops (with a further distribution of time spent on the different RAM sizes of the different models), 30% of his time reading about the processors, 20% about the Hard Disk Drives and the remaining 10% similarly reading about other specifications, then, based on this data and the developed machine learning model, the most suitable laptop can be recommended to him. The accuracy of the recommendation can be checked by comparing the product finally bought by the user with the product recommended by the website.
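A percentage distribution of this kind is exactly the feature vector the models consume: raw dwell times per section, normalized so that they sum to one. A minimal sketch of that normalization step, using hypothetical section names rather than the project's actual section codes:

```javascript
// Sketch: turn raw per-section dwell times (in milliseconds) into the
// normalized time distribution used as the model's input attributes.
// The section names are hypothetical illustrations.
function normalizeDwellTimes(dwellMs) {
  const total = Object.values(dwellMs).reduce((sum, t) => sum + t, 0);
  const features = {};
  for (const [section, t] of Object.entries(dwellMs)) {
    features[section] = total > 0 ? t / total : 0;
  }
  return features;
}

// The distribution discussed in the text (40% RAM, 30% processor, ...):
const features = normalizeDwellTimes({
  ram: 40000, processor: 30000, hdd: 20000, other: 10000,
});
// features.ram === 0.4, features.processor === 0.3, ...
```

Normalizing makes users comparable regardless of how long they browse in absolute terms: a fast reader and a slow reader with the same relative interests produce the same feature vector.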
1.4 Structure of the dissertation
This document starts by giving an idea of the related research being done across the globe. It then explains the complete implementation outline as the big picture of the project. In Chapter 3, the thesis discusses the methodology for collecting the initial training data, which also involves a complete description of the development procedure of the initial website. It explains the process of collecting data, along with the structures of the databases and the data cleaning procedure. Chapter 4 gives the details of the machine learning models built and the procedure involved, along with the testing results of the models obtained on the training data. Chapter 5 explains the procedure adopted to implement the built models in the website and the details of the AJAX communication link between the model, the data and the website. The thesis then explains the methodology used to collect the testing data, followed by the testing methodology and the results obtained for the models. The thesis closes with some conclusions and the author's view on the possibility of future work. The attached appendix contains all the source code. A working demonstration of the project, along with its documentation and the GNU General Public License source code, is available online at http://sparshgupta.name/MSc/Project
2 BACKGROUND, LITERATURE REVIEW AND PROJECT OVERVIEW
This chapter explains the previous work related to the problem, already going on around the world. The chapter is divided into different sections explaining independent and combined work going on, or already done, under each heading. The chapter later summarizes the ongoing work and also presents an overview of the project carried out by the author. The work done in the project is an original idea, and there is no record of any work using the same methodology. The problem has been tackled to some extent and has been considered by a few research groups, but their methodologies and final conclusions were very different from what has been proposed in this thesis. The following parts of the chapter highlight some of the recent developments and work done in related fields.
2.1 Coordination of mouse and eye movements
The prime question, of whether mouse tracking can substitute for, or at least partially replicate, eye tracking, is an active one. (Chen, Anderson and Sohn 2001) studied the relationship between the gaze position of a user and his cursor position on a computer screen during web browsing. They conducted tests on several websites, recorded the eye and mouse movements of the users and studied them separately. They concluded that there is a strong relationship between gaze position and cursor position, and also that there are regular patterns in the coordination. They have also argued that a mouse could provide more information than just x and y coordinates, which could be used to design better interfaces for human computer interaction. They wrote in their conclusion that "Our data show that the dwell time of cursor among different regions has strong correlation to how likely a user will look at that region. Also, in over 75% of chances, a mouse saccade will move to a meaningful region and, in these cases, it is quite likely that the eye gaze is very close to the cursor. This result implies that, by predicting the users' interests on web pages, mouse device could be a very good alternative to an eye-tracker as a tool for usability evaluation."
According to the work done at Google labs (Rodden, et al. 2008), several different patterns of coordination between the eye and the mouse pointer were observed on a web search results page. The behavior patterns identified as indicating active usage were: following the eye horizontally, following the eye vertically, and marking a particular result. This work was done entirely on a search results page, but it clearly concludes that coordination between a user's eye and his mouse pointer exists.
Studying the coordination between eye movements and mouse movements on the web, they found that some users use the mouse pointer to help them read the page, or to help them make a decision about where to click. It was concluded that, given an intent or opportunity to click in the current user activity, the mouse is much more likely to be close to the eye.
Eye tracking can provide insights into users' behavior on a search results page, but eye-tracking equipment is expensive and can only be used for studies where the user is physically present. The equipment also requires calibration, adding overhead to studies. In contrast, the coordinates of mouse movements on a web page can be collected accurately and easily, in a way that is transparent to the user. This means that mouse tracking can be used in studies involving a number of participants working simultaneously, or remotely through client-side implementations, greatly increasing the volume and variety of the data available.
There is a basic rationale that states: "If I might click, I might as well keep the mouse close to my eyes." Where there is no potential to click, either because the user is in an evaluative mode or because the content of interest is devoid of links, the mouse and eye diverge.
2.2 Capturing mouse movements
There can be several different methodologies for capturing the mouse movement behavior of a user over a webpage. The choice primarily depends upon the type of data required and the mouse movement expected. (Arroyo, Selker and Wei 2006) proposed a tool that needs no installation and is capable of tracking a user's mouse movements. This mouse movement data can be visualized in an inbuilt system and can be used to further refine the usability of the webpage. They have, however, not proposed any methodology to automatically refine the webpage.
(Edmonds, et al. 2007) talks about techniques and uses of mouse tracking on a website, but entirely from a usability point of view. It handles the capturing of a user's mouse movement data in a more detailed way, recording the coordinates and the row and column IDs along with many other parameters. This methodology was found to be effective, but showed no particular significance from the point of view of the current problem.
The paper (Torres and Hernando, Real time mouse tracking registration and visualization tool for usability evaluation on websites n.d.) proposes a methodology to track mouse movements on a webpage and visualize them in a tool that the authors have developed. They have used HTML and AJAX and have proposed a method to link the mouse movements with the server logs and web-stat data to get additional information on the user's behaviour.

There was a famous project named 'Cheese' done at MIT (Mueller and Lockerd 2001), which extended the conventional web interface user model (based on responses to mouse clicks only) to account for all mouse movements on a page as an additional layer of information for inferring user interest. They developed a straightforward way to record all mouse movements on a page, and conducted a user study to analyze and investigate mouse behavior trends; they found certain mouse behaviors common across many users. They also proposed that there are certain categories of mouse behavior and that, after tracking them, the website could be molded accordingly.
2.4 Discussion
The literature review showed that a lot of work has been done to prove and support the coordination of the eye and mouse movements of a user on a website. Eye tracking data has been used by Google to improve the usability of their search pages. There are several ongoing discussions on the effective use of eye or mouse tracking data to manually refine the content and usability design of a webpage. It was, however, found that no work has been done on using mouse tracking data in a machine learning model to automatically refine or predict the content of a website for a user based on his mouse movement or eye movement behavior.
2.5 Project overview
The project undertaken can be stated as a proposed method to automatically refine or predict the contents of a webpage for a user, based on his mouse movement behavior. From earlier studies, as stated in Section 2.1, it has been assumed that there is certainly some coordination between a user's eye movement and his mouse movement. Based on the mouse movements of an individual user, his preferences for content and his needs can be predicted, and this information can further be used by the owners of the website. If not the owners, this information can definitely help the user in finding the right content for him.
To do this, the first task was to devise a methodology to track a user's mouse movements on a webpage. There are several ways in which tracking could be done and, further, several different data points that could be saved for a user based on his mouse movements. This thesis proposes a method to track the time spent by a user in every section of a webpage. Several JavaScript functions were written, and modifications were made to a standard website, to enable mouse tracking in a hidden layer. AJAX was used to connect the JavaScript functions with the server-side PHP scripts, which were in turn connected to MySQL databases for storing the data.
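The bookkeeping behind such tracking can be sketched, in simplified form, as the following state machine. This is not the project's actual code (which appears in the Appendix); DOM events and the AJAX calls to the PHP scripts are omitted, and the names are illustrative.

```javascript
// Simplified sketch of dwell-time bookkeeping for mouse tracking.
// On the real page, enter() would be driven by mouseover events on each
// section, and snapshot() would be flushed to the server via AJAX.
function createTracker() {
  let current = null;   // section currently under the pointer
  let enteredAt = null; // timestamp (ms) when the pointer entered it
  const totals = {};    // accumulated milliseconds per section
  return {
    // Called when the pointer moves over a new section.
    enter(sectionId, timestampMs) {
      if (current !== null) {
        totals[current] = (totals[current] || 0) + (timestampMs - enteredAt);
      }
      current = sectionId;
      enteredAt = timestampMs;
    },
    // Called periodically (or on page unload) to read the totals so far.
    snapshot(timestampMs) {
      if (current !== null) {
        totals[current] = (totals[current] || 0) + (timestampMs - enteredAt);
        enteredAt = timestampMs;
      }
      return { ...totals };
    },
  };
}

const tracker = createTracker();
tracker.enter('ram', 0);
tracker.enter('processor', 500);       // 500 ms were spent on 'ram'
tracker.enter('ram', 800);             // 300 ms were spent on 'processor'
const totals = tracker.snapshot(1000); // 200 ms more on 'ram'
// totals: { ram: 700, processor: 300 }
```

Keeping only per-section totals, rather than raw coordinate streams, keeps the payload sent to the server small and maps directly onto the model's input attributes.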
To demonstrate all this, a new dummy website imitating a shopping portal was developed. Once the website was developed with mouse tracking capabilities, it was made available to the public for two weeks. This was done to collect some initial data on users' mouse movement behavior. The data collected was processed and cleaned before being analyzed and modeled. This complete step of initial website development and data collection is explained in detail in Chapter 3 (Data Collection and Pre-processing).
It was then required to study and analyze the collected data and build a model on it, so that it could be used in the future for new visitors. To do this, WEKA was used and different types of models were built. The models took as independent variables the time spent by the mouse pointer in the different sections of the webpage, and predicted the relevant content for the user as the dependent variable. They were all built and trained on the initially collected data and were tested on the same training data. After several iterations, two models, one based on a Decision Tree and the other on a Neural Network, were obtained that gave significant accuracy on the training data. The complete model-building phase of the project, along with the test results obtained, is explained in Chapter 4 (Building machine learning models).
Once the two models (one Decision Tree and one Neural Network) were obtained, the task was to embed them both into the initial website. This was necessary so that the built models could be used for future visitors, for whom the relevant content could be predicted based on their mouse movement activities. The models were coded in PHP on an Apache server and were connected with the front-end HTML page using AJAX. The PHP scripts were made to read the real time mouse movement data of a given user directly from the MySQL databases and execute the model on it to predict further content for him. The whole procedure is explained in detail in Chapter 5 (Embedding the machine learning models in the website).
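To illustrate what embedding a model amounts to: a learned decision tree reduces to plain nested conditionals that can be ported to any server-side language, as was done here in PHP. The sketch below uses JavaScript with invented splits; the thresholds and labels are not the ones actually learned in Chapter 4.

```javascript
// Sketch of a decision tree "embedded" as nested conditionals, analogous
// to the project's PHP port of its C4.5 model. The splits and labels are
// invented for illustration, not learned from the collected data.
function predictLaptop(features) {
  // features: normalized share of dwell time per section (sums to 1).
  if (features.ram > 0.35) {
    return features.processor > 0.25 ? 'laptop-A' : 'laptop-B';
  }
  if (features.hdd > 0.4) {
    return 'laptop-C';
  }
  return 'laptop-D';
}

const recommendation = predictLaptop({
  ram: 0.4, processor: 0.3, hdd: 0.2, other: 0.1,
});
// recommendation === 'laptop-A'
```

Hand-porting the tree this way avoids running WEKA on the server at prediction time: evaluating a few comparisons per request is cheap enough for real-time use.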
After embedding the two models into the website, volunteers were again asked to visit the website. This time, not only were the user's mouse movements captured, but he was also recommended appropriate content based on one of the two machine learning models. The mouse movement data was saved in the MySQL databases to be analyzed for accuracy later. This step is explained in Chapter 6 (Testing and Results).
The collected data was used as the test dataset, and the two models were evaluated on their accuracy as well as their time performance. It was found that, under the present limitation of a lack of data, the Decision Tree model edged out the Neural Network model on both the accuracy and the time performance fronts. The details of this step are given in Chapter 7 (Conclusion).
The whole project can be outlined as follows:
[Figure 1: Project outline — a flowchart of the project steps: building the initial website capable of tracking the mouse movements of visitors; asking volunteering users to visit the website and capturing their mouse movements; cleaning and compiling the collected data; building and training machine learning models using the captured mouse movement data of the users; coding the obtained machine learning models back into the website; collecting a test dataset from the final website, which is now capable of recommending the appropriate content for a user based on his mouse movement behavior; and testing the accuracy of the built models using the collected test data, while also evaluating the time performance of the models on the web server.]

Figure 1: Project outline
3 DATA COLLECTION AND PRE-PROCESSING
This chapter will explain the complete training dataset collection steps: the details of the initial website developed and the steps followed to obtain the required training data from it. Later, this chapter will explain the data compilation and cleaning steps performed on the initially collected data.
3.1 The initial website
To analyze the mouse movement behavior of users on a webpage, the first step is the development of the website under consideration. Since the proposed method of analyzing and modeling the data is machine learning, some initial training data is also required. To cater to both needs, a dummy website capable of tracking the user's mouse movements was built and made public. The website was kept live until the required data was obtained. The specifications and details of the implementation are as follows:
3.1.1 Specifications
The functionalities, requirements and specifications of the initial webpage built are:
• The user interface design of the initial webpage needs to be exactly the same as that of the required final website. This is important because users' mouse movements depend on the interface of the webpage. It is necessary that the data collected to build and train the machine-learning model comes from the same webpage where the model is finally required to be implemented.
• The mouse tracking needs to be implemented in a hidden layer, so that the user can experience the web in the same rich way without any compromise on speed or performance. The user should not be asked for any explicit information at any time.
• As stated in section 1.3, the webpage developed was a dummy shopping portal showing five laptop models and comparing them on their configurations.
• There were 5 laptops with 22 attributes each. There was an empty (no laptop) specification heading information space on the left hand side of the page. The total number of sections in the built page was (5 + 1) × 22 = 132, where 5 is the number of laptops, 1 is for the specification heading category (no laptop space) and 22 is the count of attributes per laptop.
• Each of these 132 sections of the webpage gets highlighted as soon as the mouse pointer reaches it. This ensured that the user is most likely to read the highlighted section of the webpage, and hence that the user's mouse pointer stays close to his point of gaze. This step ensured that the mouse movement data provides pseudo eye tracking data of the user. The cell-highlighting feature was implemented using Cascading Style Sheets, where the cell color was changed as soon as the mouse pointer entered the cell.
• A MySQL database was connected for recording the mouse pointer time on each section of the webpage. The final product bought by each user was also saved in the database.
3.1.2 Implementation
The webpage was developed in HTML using PHP as the server side scripting language. JavaScript and AJAX were used to dynamically transfer data from the HTML fields to the PHP scripts. The database was designed in MySQL, and PHP scripts were written to connect and transfer data between MySQL and the Apache server.
3.1.2.1 Webpage Design
The webpage was designed in HTML in a tabular format with 6 columns and 22 rows. Column 1 had the headings of the specifications, the remaining 5 columns had the specifications of each laptop, and every row had one specification. Each of the 132 cells thus obtained in the table corresponded to an independent (input) variable for the model. A screenshot of the top half of the developed page is shown in Figure 2 and a screenshot of the complete webpage is shown in Figure 3.
Figure 2: Screenshot of the top half of the developed webpage
It can be seen clearly that there are 6 columns and 22 rows on the webpage, and hence 132 cells. Since each of these cells is an input variable to the model, they were all given a code. Each laptop was given a number from 1 to 5 and the specification heading space was given the code 0. Each specification was given an alphabetic code from 'a' to 'v'. Hence, each of the 132 sections of the webpage got as its code the combination of the alphabet of the specification and the number of the laptop, like a0, a1, a2, a3, a4, a5, b0, b1, b2, …, v3, v4, v5. The coding methodology for the first few cells is shown in Figure 4. These codes were not added anywhere on the webpage but were only used while calling the mouse tracking functions, as will be explained in subsequent sections.
Figure 3: Screenshot of the developed webpage
Figure 4: Code given to each section of the webpage
To make sure that in most cases the user's mouse pointer is close to his point of gaze, a Cascading Style Sheet was attached to the HTML webpage. The CSS file had two different style formats that could be applied to each cell. One of the styles was the normal white background, whereas the other had a blue background to enable cell highlighting. As soon as the mouse enters a cell, the normal style is replaced by the highlighting style for that cell. This is reset as soon as the mouse leaves the highlighted cell. Similarly, the row and the column in which the mouse pointer is currently present are also highlighted in a light shade of blue. The CSS code of the different styles is available in the appendix of this thesis. A screenshot with cell 'g2' highlighted is shown in Figure 5.
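The style swap described above can be sketched as follows. The class names 'cellNormal' and 'cellHighlighted' are assumptions for illustration; the actual class names appear only in the CSS code in the appendix.

```javascript
// Sketch of the described highlighting logic: swap the cell's CSS class
// when the mouse enters (on = true) or leaves (on = false) it.
// The class names are assumed, not the thesis's actual ones.
function setHighlight(cell, on) {
  cell.className = on ? "cellHighlighted" : "cellNormal";
}
```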
For every visitor of the website, a unique user id was generated as soon as the page loaded. To keep the user id simple, it was set to the current JavaScript Time value at page load. The JavaScript time function returns the current time in milliseconds since January 1, 1970. This ensured that, within the current scope of the project, all visiting users would have a unique user id. The JavaScript code to generate the user id is:
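A minimal version of such a statement, consistent with the description above, would be:

```javascript
// The user id is the number of milliseconds elapsed since
// January 1, 1970 at the moment the page loads.
var userId = new Date().getTime();
```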
A JavaScript file named 'mouseover.js' was associated with this webpage, with several JavaScript variables and functions required to track and record mouse movements. The HTML code of the website was also given an 'onload' event to call a JavaScript function named 'start_It()', which triggers the mouse tracking functionality of the website.
<body onload="start_It();">
The algorithms of mouse tracking and the complete implementation are explained later, after the details of the database design.
Figure 5: Screenshot with a cell highlighted
3.1.2.2 Database Design
A database was created in MySQL with two tables, namely 'data' and 'bought'. The attributes of the two tables are:
Figure 6: Database table 'data'
Figure 7: Database table 'bought'
Table: data
• userID To record the user id of the user
• cellID To save the cell ID that was assigned to each sub section of the webpage
• time To record the time in milliseconds spent in the cell with that cellID
Table: bought
• userID To record the user id of the user
• bought To save the code of the final product bought by the user
The table 'data' saves the time spent by a user in each cell, i.e. each section of the webpage. There can be 132 different sections / cellIDs for each user, and they can all appear multiple times. The time spent in each section by a user is an independent variable for the model. The table 'bought' records the final product selected by the user. The attribute 'userID' in both tables is the foreign key, and it is the primary key in the 'bought' table.
The rationale behind such a design was to implement database normalization, so that all data repetition could be avoided. Also, the insert queries would be simple and short, and hence efficient, and would not slow the webpage while tracking the mouse and interacting with the database simultaneously. The only drawback of such a design is that the data needs merging before it can be used for training the model.
3.1.2.3 Implementing mouse tracking
Each of the 132 cells of the webpage had JavaScript 'onmouseover' and 'onmouseout' event statements. OnMouseOver specifies that the 'movement_in()' JavaScript function be called every time the mouse comes over that cell. OnMouseOut similarly specifies that the 'movement_out('cellID')' JavaScript function be called when the mouse pointer leaves the cell. The code snippet demonstrating these function calls is:
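A minimal reconstruction of such a cell declaration, consistent with the description above (the cell code 'a0' is only an example), would be:

```html
<td onmouseover="movement_in();" onmouseout="movement_out('a0');">…</td>
```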
As soon as the mouse pointer enters a cell, the current DateTime is recorded in a temporary variable named 'cellEntryDate' in the function 'movement_in()'. This function is not passed any attribute. As soon as the mouse pointer exits a cell, the time spent in that cell in milliseconds is calculated by subtracting 'cellEntryDate' from the current DateTime in the function 'movement_out('cellID')'. The movement_out() function is also passed the unique 2-letter cell code to record the cell ID. The time spent in the cell, along with the cell ID, is concatenated into the data queue variable named 'queue1' or 'queue2'. The JavaScript function definitions are as follows:
function movement_in() {
cellEntryDate = new Date();
}
function movement_out(cell) {
cellExitDate = new Date();
time = cellExitDate.getTime()-cellEntryDate.getTime();
if(done==0)
{
if(flag==0) {
queue1 = queue1+cell+":"+time+"_";
}
else {
queue2 = queue2+cell+":"+time+"_";
}
}
}
The 'done' variable in the above code checks whether the current user is still active and has not already bought a product. 'flag' is a variable to check which queue variable is currently available.
Two instances of the queue variable were made to ensure that while one queue's data is being transmitted to the server via AJAX, the other queue variable can record the cell movements. This is of great importance especially when the Internet bandwidth is low and data transfer in the worst case can take a lot of time. This step also ensures that the interaction experience of the user is not affected while mouse tracking is going on in the background.
As stated above, the built website had an 'onload' JavaScript event calling a function named 'start_It()'. The start_It() function is a recursive function which calls the 'sendData()' function every 2 seconds. The sendData() function contains the AJAX statement to transfer the generated user ID (variable 'userId') and one of the queue variables, 'queue1' or 'queue2', to the 'data.php' file at the backend server. The self-explanatory JavaScript function definitions are as follows:
function start_It(){
if(done==0) {
setTimeout("sendData()",2000);
}
}
function sendData(){
var query_string;
if(flag==0) {
queue2="";
flag=1;
query_string = "data.php?userId="+userId+"&queue="+queue1;
queue1="";
}
else {
queue1="";
flag=0;
query_string = "data.php?userId="+userId+"&queue="+queue2;
queue2="";
}
http.open("GET", query_string, true);
http.onreadystatechange = handleHttpResponse;
http.send(null);
}
The 'sendData()' JavaScript function uses standard AJAX calls and the standard 'http' open, onreadystatechange and send functions. The 'query_string' variable contains the PHP file to which the arguments are passed via the GET method. The 'data.php' file was coded such that it takes the queue variable as sent by the JavaScript 'sendData()' function and explodes the string to extract the various cell IDs and the time values associated with them. It then opens a connection with the MySQL database and inserts records with the cell information into the 'data' table using the received user ID. The complete code of the 'data.php' file is available in the appendix of the thesis.
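The queue string format built by movement_out() can be illustrated with a short sketch. The parsing below is written in JavaScript for illustration only; the actual extraction happens server-side in 'data.php' using PHP's explode.

```javascript
// Parse a queue string of the form "a0:123_b2:456_" into a map of
// cellID -> total milliseconds, mirroring what 'data.php' does in PHP.
function parseQueue(queue) {
  var totals = {};
  var entries = queue.split("_");
  for (var i = 0; i < entries.length; i++) {
    if (entries[i] === "") continue;           // the trailing "_" leaves an empty entry
    var parts = entries[i].split(":");
    var cell = parts[0];
    var time = parseInt(parts[1], 10);
    totals[cell] = (totals[cell] || 0) + time; // a cell can appear multiple times
  }
  return totals;
}
```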
3.1.2.4 Final product bought by the user
Once the user had browsed through the webpage and scrolled through the table reading about the various configurations of the five laptops, giving us one case of the training data, he was required to select one of the products. This simulates the actual shopping portal scenario, where a person reads about various products and finally buys one of them. To select a product, the user performs a mouse click on the 'Buy Now' button associated with the product, as shown in Figure 3.
As soon as any 'Buy Now' button on the webpage is triggered by the user, a JavaScript function named 'bought('ProductID')' is invoked. This function uses the AJAX protocol and sends the userID and the ID of the clicked product to the 'bought.php' file on the server. The 'bought.php' file on the web server connects to the MySQL database and inserts this information as a row in the table 'bought'. The complete code of the PHP script 'bought.php' is available in the appendix, and the JavaScript function is as follows:
function bought(product){
done=1;
var query_bought;
query_bought = "bought.php?userId="+userId+"&product="+product;
http.open("GET", query_bought, true);
http.onreadystatechange = handleHttpResponseBought;
http.send(null);
}
Once the user selects the product, further mouse tracking is disabled. This is done by changing the value of the JavaScript 'done' variable.
3.1.3 Testing the initial website
The website, once completed, was hosted on a public web server and was tested thoroughly for bugs and errors. The main points in the checklist were:
• The queue variables ('queue1' and 'queue2') in the JavaScript file record the cellID and time appropriately, and the data is extracted accurately at the server.
• Data is sent properly from the frontend JavaScript functions to the backend PHP files via AJAX.
• The link between the database and PHP files is working correctly.
• Both tables in the database are receiving data and inserting it properly without any error.
3.2 Data collection
When the website as explained in the previous section had been developed and tested completely, it was opened to the general public. Volunteers were invited via email and social media to visit the webpage. The selection of the volunteers was completely random and was drawn primarily from the contact group of the author. All the volunteers / visitors were asked to browse the webpage and buy a product on it (at zero cost, virtually), similar to the way they would on a real shopping site. From this sample, the initial training data for the model was collected and saved into the databases as explained in the previous section. No personal information or any other data was asked from any visitor.
The duration of this step depends on the requirements of the initial training data for building the model. The more sections there are in the website, i.e. the more independent variables in the model, the more cases of initial training data are required to build a relevant model.
In a short span of 14 days, 292 unique visitors accessed the webpage. 244 rows were collected in the 'bought' table and 16401 tuples were saved in the 'data' table. Around 350-400 users were expected, but due to the lack of visibility of the project and the absence of any compensation for the volunteers, that number could not be reached. In view of the time, the website was then taken off and the data was exported for further analysis and cleaning.
3.3 Data compilation and cleaning
3.3.1 Need and Specifications
The collected data in the two tables needs to be merged in such a way that each row of the new table corresponds to a single user and contains all information about him, i.e. each row is one case of the training data. Each case would include the times spent in all 132 sections of the webpage, along with the user id and the product finally bought by the user. This is also the format required to train a machine-learning model in WEKA.
Moreover, the collected data needs to be analyzed properly and checked for any errors. There might be some users who did not provide the information on the actual product bought, and the data related to them needs to be scrapped.
Some users are likely to have spent almost no time, as they might have accidentally visited the webpage, and hence all users spending less than some calculated threshold time need to be scrapped. Similarly, users waiting on a section for more than a certain fixed time should be removed. These steps are important to ensure that there are no outliers in the collected data, and that the model built and trained on this data is best suited for general usage on the website.
Since the absolute time spent on the different elements of the webpage depends on a number of other factors, primarily the speed of an individual user, the data needs to be normalized. Dividing the time spent by a user on an individual section by the total time spent by that user on the website gives the proportion of time spent by him reading that section of the webpage. Hence the final training data should only contain valid user responses of the product bought, along with the normalized breakup of the time spent by them on the various sections of the webpage.
3.3.2 Implementation
First, all the data needs to be compiled into a single table as stated above, and then it needs to be cleaned.
3.3.2.1 Data compilation
A new PHP script named 'alignData.php' was written to compile the data into a more usable format. This file writes all the data to a new table named 'finalData' with the following 134 attributes: 132 corresponding to the time spent in the 132 sections of the webpage (independent variables), 1 to record the userID of the user, and 1 to save the code of the final product bought (target / dependent variable). The final product bought is the variable to be predicted in our model, which shall be discussed in the next chapter. The attributes of the 'finalData' table are:
• userID To record the user id of the user
• a0 Time in milliseconds spent in cell ‘a0’ of the webpage
• a1 Time in milliseconds spent in cell ‘a1’ of the webpage
• a2 Time in milliseconds spent in cell ‘a2’ of the webpage
• a3 Time in milliseconds spent in cell ‘a3’ of the webpage
• a4 Time in milliseconds spent in cell ‘a4’ of the webpage
• a5 Time in milliseconds spent in cell ‘a5’ of the webpage
• b0 Time in milliseconds spent in cell ‘b0’ of the webpage
• b1 Time in milliseconds spent in cell ‘b1’ of the webpage
• b2 Time in milliseconds spent in cell ‘b2’ of the webpage
• .
• . Similarly from ‘b3’ to ‘u3’
• .
• u4 Time in milliseconds spent in cell ‘u4’ of the webpage
• u5 Time in milliseconds spent in cell ‘u5’ of the webpage
• v0 Time in milliseconds spent in cell ‘v0’ of the webpage
• v1 Time in milliseconds spent in cell ‘v1’ of the webpage
• v2 Time in milliseconds spent in cell ‘v2’ of the webpage
• v3 Time in milliseconds spent in cell ‘v3’ of the webpage
• v4 Time in milliseconds spent in cell ‘v4’ of the webpage
• v5 Time in milliseconds spent in cell ‘v5’ of the webpage
• bought To save the code of the final product bought by the user
The 'alignData.php' file selects all the responses stored in the tables 'data' and 'bought' and saves them in the table 'finalData'. The attribute 'userID' is the primary key of the table. The algorithm implemented in the 'alignData.php' file was:
1. Select a list of unique users from the table 'data'.
2. For each user with id 'userID', do:
a. Select all the data (cellIDs and associated time) corresponding to that user from the table 'data'. Use the SQL sum aggregate function on time and group by cellID.
b. This gives the total time spent on each visited cell, i.e. each section of the webpage visited by that user.
c. The time spent on all other cells, i.e. sections not visited by that user, is set to zero.
d. Insert all the time values for each cell into the 'finalData' table along with the user's userID.
e. Select the final product bought by the user using a select statement on the table 'bought'. In case the user has not bought any product, i.e. the output from the 'bought' table for that user is empty, assign him product number 0.
f. Update the 'finalData' table by inserting the value for the 'bought' field corresponding to that user.
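The per-user aggregation in steps a-d can be sketched as follows, in JavaScript for illustration only; the actual script is PHP with SQL aggregate queries.

```javascript
// Sum the time per cell from one user's rows of the 'data' table and
// fill every unvisited cell with zero, as in steps a-c above.
function compileUser(rows, allCells) {
  var totals = {};
  allCells.forEach(function (c) { totals[c] = 0; });          // step c: default zero
  rows.forEach(function (r) { totals[r.cellID] += r.time; }); // steps a-b: sum per cell
  return totals;
}
```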
After successful execution of this algorithm in the 'alignData.php' script, the 'finalData' table contained all the data collected from the initial website in a tabular manner, with each row corresponding to a unique user. This data could now be used directly for model building in WEKA, but it needed some cleaning. The 'data' table had a total of 16401 tuples from 292 unique users, whereas the 'bought' table had 244 tuples. After executing the above script, the total number of tuples in the 'finalData' table was 292. Out of these 292 tuples, 48 (292 minus 244) belonged to users who left the site without selecting any product. The table 'finalData' was then exported in a spreadsheet format (Microsoft Excel) for analysis, visualization and cleaning.
3.3.2.2 Data cleaning
On the obtained 292 rows of data in Excel, the next task was the data cleaning stage. This step removes all the outliers and other cases that can harm the training of the model, and eventually the model itself. There can be multiple reasons behind the occurrence of such unwanted cases in the initial dataset, such as non-serious respondents, accidentally entering the webpage and closing it immediately, accidentally pressing the enter key, leaving the computer with the website open while working on something else, etc.
The following steps to clean the collected data were followed:
• All the tuples where the value of the attribute 'bought' is 0, i.e. the user has not bought any product, were deleted. This was because the objective of the project is to select the best product for a user, and hence the training set should only contain users who have bought a product; training the model on data predicting that the user would not buy would make the model inappropriate for use in the current project. There were a total of 48 such tuples where the bought product value was 0. This left 244 tuples in the data, each corresponding to a unique visitor who bought a product (dependent variable is not 0).
• The total time spent by each user was calculated using Excel's built-in sum function, and the distribution of the total time spent by different users on the built webpage was studied. The average time spent by a user on the webpage was 33.08 seconds, the maximum was 225.8 seconds, and the minimum was 1.2 seconds.
• The minimum and maximum times spent by any user were analyzed to find the outliers. Since the minimum time in the current data is much lower than the expected minimum time any serious volunteer would spend, a threshold value of 8 seconds was selected. The maximum time of 225.8 seconds was found feasible, and hence no upper limit was set. The value of 8 seconds was judged feasible keeping in mind the webpage design; it was assumed that any user taking less than 8 seconds on the webpage has given incorrect data and is considered an outlier. There were 44 users who spent less than 8 seconds on the initial website while giving training data for model building. The rows associated with all 44 users were deleted from the collected sample, leaving a sample size of 200 tuples. The average time spent by a user became 40.26 seconds, and the minimum time spent by a user in the new dataset became 8.3 seconds.
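The 8-second rule above can be sketched as a simple filter over the rows; the row shape here is hypothetical, and the actual cleaning was done in Excel.

```javascript
// Keep only rows whose total time across all cells meets the threshold
// (8000 ms for the rule described above).
function cleanRows(rows, thresholdMs) {
  return rows.filter(function (row) {
    var total = row.times.reduce(function (a, b) { return a + b; }, 0);
    return total >= thresholdMs;
  });
}
```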
3.3.2.3 Data normalization
The data collected from the volunteers has the 132 time fields, corresponding to the time spent in the 132 sections of the website, in absolute values. It was realized that data normalization would be required. The reason is that different people spent different amounts of time on the webpage; the time spent depends upon their individual browsing speed, reading speed and several other personal attributes. Since the desired model has to cater to a general audience, the time spent in one section relative to the time spent in the other sections was thought to be more appropriate. There are several advantages to this step, primarily that the model is now capable of predicting in real time for a user who is in the process of browsing the webpage: whenever a prediction is needed, the current times spent in the various sections can be normalized and fed into the model. Since the model is now immune to the absolute time value, successive predictions for the same user are not biased by the total time spent but depend only on the relative time spent on the different sections of the webpage. Another advantage is that all the data used for training the model is now equivalent: the 200 cases in the training set are more comparable and do not vary on an absolute scale. This step is expected to train the model better.
Implementation
To carry out data normalization, the total time spent by each user was calculated in Excel (also done in the data cleaning step). The time spent in each individual section of the webpage by that user was then divided by the total time spent on the webpage by him. This step gave the proportion of time spent by the user in each section of the webpage.
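The normalization step amounts to dividing each cell time by the user's total; a sketch follows (the actual computation was done in Excel).

```javascript
// Convert absolute per-cell times into proportions of the user's
// total time on the page.
function normalizeTimes(times) {
  var total = 0;
  for (var cell in times) total += times[cell];
  var normalized = {};
  for (var cell in times) normalized[cell] = times[cell] / total;
  return normalized;
}
```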
The new dataset, with 200 tuples and 134 attributes (132 independent variables, 1 userID and 1 dependent variable) containing the normalized time data, was saved in the CSV format, which could be imported directly into WEKA for the model building task.
The next chapter explains the procedure of building machine learning models in WEKA using the data collected in this chapter.
4 BUILDING MACHINE LEARNING MODELS
Using the collected data, various machine learning models were built and tested. This chapter explains the complete methodology followed, along with the details of the models obtained. It later explains the best models that were selected and the rationale behind them.
4.1 Machine Learning
According to Wikipedia1, "Machine Learning is a scientific discipline that is concerned with the design and development of algorithms that allow computers to learn based on data, such as from sensor data or databases." It can be defined as a set of algorithms that automatically learn and recognize complex patterns and are capable of making intelligent decisions based on data.
There are several software packages available that can be used to build and implement machine-learning models; MATLAB and WEKA are two commonly used ones. The models used in the project were built using WEKA.
1 Wikipedia, Machine Learning - Wikipedia, http://en.wikipedia.org/wiki/Machine_learning.
4.1.1 WEKA
Weka1 is open source data mining software written in Java. It is primarily a collection of various machine-learning algorithms that can be applied directly and easily to different types of data. It has a built-in interface to visualize the data and can perform tasks like attribute selection, clustering, etc. It is available under the GNU General Public License and can be downloaded from its website.
4.1.2 Why Machine Learning?
The primary objective of the project is to automatically learn the user's mouse movement behavior from the collected training data. Machine learning, as stated above, is a branch of science that deals with algorithms capable of learning patterns; this exactly fits the primary requirement. The project further demands the capability to predict further content for a new user based on his mouse movements. Machine learning algorithms, once trained on a large set of data, are capable of predicting the value of the dependent variable for any new case. Moreover, machine-learning algorithms can be retrained again and again with new data. The complete objective of the project can thus easily be catered for using machine-learning algorithms.
4.2 Methods evaluated
In machine learning, in order to classify or predict for any new case, a model is first built and trained on training data. There can be a number of different types of models that can be built, and a number of different algorithms for building them. The types of machine learning models generally used are Decision Trees, Neural Networks, Genetic Algorithms, Fuzzy Networks, etc. Keeping the scope of this project in mind, only Decision Tree and Neural Network based models were evaluated. The data was modeled using both methods, using the J48 classification algorithm for the decision tree and a multilayer perceptron for the neural network. The two models were later evaluated on the training data.

1 The University of Waikato, Weka 3: Data Mining Software in Java, http://www.cs.waikato.ac.nz/ml/weka/.
4.2.1 Decision Tree
A decision tree can be defined as a decision support classifier that uses a tree-like structure of conditions and their possible consequences. Each node of a decision tree can be a leaf node or a decision node, where:
• Leaf node – This node mentions the value of the dependent (target) variable.
• Decision node – These nodes contain one condition each, specifying some test on a single attribute value. The outcome of the condition is further divided into branches with sub-trees or leaf nodes.
The attribute to be predicted is known as the dependent variable, since its value depends upon, or is decided by, the values of all the other attributes. The other attributes, which help in predicting the value of the dependent variable, are known as the independent variables in the dataset.
4.2.2 Neural Network
"An Artificial Neural Network is an interconnected assembly of simple processing elements, units or nodes (neurons), whose functionality is inspired by the functioning of the natural neuron from brain. The processing ability of the neural network is stored in the inter-unit connection strengths, or weights, obtained by a process of learning from a set of training patterns."1
4.3 Implemented algorithms
There are several algorithms for decision trees commonly used nowadays, namely ID3, C4.5, C5.0, etc. After careful evaluation of these three algorithms, C4.5 was chosen for the project. The reasons behind choosing C4.5 over ID3 and C5.0 were:
• C4.5 handles continuous variables in a better way, by creating a threshold and then splitting the list on that value. Since all the attributes in the required decision tree are continuous, whereas the target variable has five discrete values, C4.5 was used.
• C4.5 has the capability to prune trees. Pruning is a method of going backwards in a tree to remove any branches that do not help in further classification and replace them with leaf nodes.
• C5.0 is generally ranked above C4.5 because of its higher speed of building a tree and its lower memory requirements. Since the scope of the project demanded none of these features, there was no significant advantage in C5.0. Also, C5.0 can be used to weight attributes, which was not required in the problem under consideration.
Similarly, neural networks can be implemented in one of various available forms, namely feedforward neural networks, radial basis function networks, Kohonen self-organizing networks, recurrent networks, stochastic neural networks, modular neural networks, holographic associative memory, etc. The neural network implemented in the project was a feedforward neural network with a non-linear activation function.

1 Kevin N Gurney, An introduction to neural networks, illustrated (CRC Press, 1997).
4.3.1 Decision Tree (C4.5)
WEKA implements the C4.5 decision tree algorithm as the ‘J48 decision tree classifier’. The explanation of the C4.5 algorithm, as well as the J48 implementation, is as follows:
• Whenever a set of items (training set) is encountered, the algorithm identifies the attribute that discriminates the various instances most clearly. This is done using the standard equation of information gain.
• Among the possible values of this feature, if there is any value for which there is no ambiguity, that is, for which the data instances falling within its category have the same value for the target variable, then that branch is terminated and the obtained target value is assigned to it.
• For all other cases, another attribute is sought that gives the highest information gain.
• This continues in the same manner until either a clear decision on the value of the target variable is reached with a combination of conditions on the various independent variables/attributes, or we run out of attributes.
• In the event of running out of attributes, or of getting an ambiguous result from the available information, the branch is assigned the target value that the majority of the items under this branch possess.
The name of the classifier in WEKA that follows the above-mentioned C4.5 algorithm is ‘weka.classifiers.trees.J48’.
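The attribute-selection step described above relies on the standard information-gain computation. A minimal sketch of that computation follows (in JavaScript for illustration; the function names and the toy rows are assumptions of this sketch, not taken from the project's dataset):

```javascript
// Sketch of the information-gain computation C4.5 uses to pick a split.
// Names and data are illustrative, not taken from the project's dataset.

// Shannon entropy of an array of class labels.
function entropy(labels) {
  const counts = {};
  for (const l of labels) counts[l] = (counts[l] || 0) + 1;
  let h = 0;
  for (const k in counts) {
    const p = counts[k] / labels.length;
    h -= p * Math.log2(p);
  }
  return h;
}

// Information gain of splitting `rows` on a binary threshold test.
function infoGain(rows, rowPassesTest) {
  const labels = rows.map(r => r.label);
  const left = rows.filter(r => rowPassesTest(r)).map(r => r.label);
  const right = rows.filter(r => !rowPassesTest(r)).map(r => r.label);
  return entropy(labels)
    - (left.length / rows.length) * entropy(left)
    - (right.length / rows.length) * entropy(right);
}

// Toy data: a perfectly discriminating threshold gives maximal gain.
const rows = [
  { b5: 0.01, label: 1 }, { b5: 0.02, label: 1 },
  { b5: 0.08, label: 2 }, { b5: 0.09, label: 2 },
];
const gain = infoGain(rows, r => r.b5 <= 0.045);
```

For a continuous attribute, C4.5 evaluates candidate thresholds in this way and keeps the one with the highest gain, which is the behavior the first bullet above describes.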
4.3.2 Neural Network (Multilayer Perceptron)
The multilayer perceptron is a feedforward neural network based classifier that uses backpropagation to classify instances. All the nodes in this network are sigmoid units, which means that the activation function is a sigmoid.

In a multilayer perceptron, there is an input layer with a node for each of the independent variables, at least one hidden layer, and an output layer with a node for each of the different classes of the target variable. The network is trained on initial data that determines the appropriate weights for the connections between all the nodes of adjacent layers, and also determines the bias/threshold value of each node.
The name of the classifier in WEKA is ‘weka.classifiers.functions.MultilayerPerceptron’
4.4 Model building
WEKA was opened in Explorer mode and the saved CSV file was opened using the open file button in the preprocess tab of WEKA. From the attributes pane, the attribute userID was deleted, because this field is irrelevant to the process of model building. The file was then saved in Attribute-Relation File Format (ARFF) simply by clicking the save button.

The saved ARFF file was opened in a text editor to change the properties of the predicted variable, i.e. the attribute ‘bought’, from numeric to nominal scale. This is an essential step because the ‘bought’ variable has only five discrete values, one for each product. It also enables the use of the J48 tree classifier, which requires nominal data for the predicted variable. To convert ‘bought’ from numeric to nominal mode, the property ‘numeric’ was changed to ‘{1,2,3,4,5}’, where 1,2,3,4,5 were the codes for the five laptop products. The output expected from the models is one of the five laptop codes. The file was saved and closed.
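The manual change described above amounts to editing a single @attribute line in the ARFF header. A minimal sketch of the relevant fragment (the attribute names other than ‘bought’ are illustrative):

```
@relation MLData_Normalized

% ... 132 numeric mouse-movement attributes, e.g.:
@attribute a0 numeric

% the target declaration was changed from
%   @attribute bought numeric
% to the nominal form:
@attribute bought {1,2,3,4,5}

@data
```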
4.4.1 Decision Tree
The saved ARFF file was then re-opened in WEKA and, under the classify tab, the J48 tree classifier was chosen. The J48 tree classifier has different parameters such as binary splits, number of folds, pruning, etc. Using a trial and error method, the various parameters were changed and each model was tested for accuracy on the training data. Models were tested using two methodologies, namely testing directly on the training data and testing using cross-validation. The set of parameters giving the maximum percentage of correctly classified instances was chosen. The final model, giving maximum accuracy on the training dataset, was also saved for later use.
4.4.1.1 Details of the chosen decision tree
The final parameters selected that gave the best output on the training data are:
• binarySplits: By the WEKA definition of this parameter, it is considered for nominal variables only. Since the dataset under consideration had no nominal independent variable, the value of this attribute had no impact on the built tree.
• confidenceFactor: This attribute defines the confidence factor used for pruning. It was found that with a confidence factor of 0.75, a decision tree of good accuracy was obtained when C4.5 pruning was used.
• debug: This parameter is only used to output some additional information at the console. Its value of either true or false didn’t impact the final model.
• minNumObj: This determines the minimum number of instances at every leaf node. This attribute was set to a value of ‘2’.
• numFolds: This parameter determines the amount of data used for reduced-error pruning. In the decision tree built, numFolds was kept at ‘11’. This means that one fold was used for pruning, and the rest for growing the tree.
• reducedErrorPruning: This was set to ‘False’, as it signifies whether reduced-error pruning should be used instead of C4.5 pruning.
• saveInstanceData: This attribute is just to save the instance data for visualization in the future.
• seed: The seed determines the value used to randomize the data when reduced-error pruning is applied. Since reduced-error pruning was not used, the seed parameter had no relevance.
• subtreeRaising: Subtree raising while pruning is always advisable when used with a high confidence factor. Since a confidence factor of 0.75 was used, this parameter was set to ‘true’.
• unpruned: Since we wanted pruning to happen, the ‘unpruned’ parameter was set to ‘false’.
• useLaplace: This parameter determines if counts at leaves are smoothed based on Laplace. The parameter had no influence on the model output.
All the parameters used in the final decision tree can be summarized as:
Figure 8: Parameters used for building the Decision Tree model
The output from WEKA is as follows:
=== Run information ===

Scheme:       weka.classifiers.trees.J48 -L -C 0.75 -M 2 -A
Relation:     MLData_Normalized-weka.filters.unsupervised.attribute.Remove-R1
Instances:    200
Attributes:   133
              [list of attributes omitted]
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree
------------------
b5 <= 0.04509
|   k4 <= 0.013828
|   |   v1 <= 0.000362
|   |   |   r0 <= 0.000626
|   |   |   |   d5 <= 0.003481
|   |   |   |   |   d5 <= 0.001586
|   |   |   |   |   |   g4 <= 0.033267
|   |   |   |   |   |   |   s3 <= 0.004874
|   |   |   |   |   |   |   |   u1 <= 0.002108
|   |   |   |   |   |   |   |   |   f1 <= 0.039667
|   |   |   |   |   |   |   |   |   |   f4 <= 0.028894
|   |   |   |   |   |   |   |   |   |   |   i4 <= 0.004699
|   |   |   |   |   |   |   |   |   |   |   |   d2 <= 0.001173
|   |   |   |   |   |   |   |   |   |   |   |   |   e5 <= 0.001377
|   |   |   |   |   |   |   |   |   |   |   |   |   |   e1 <= 0.029566
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   r3 <= 0.000861
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   c1 <= 0.043665
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   a3 <= 0.206815
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b1 <= 0.007319
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   f3 <= 0.001471
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b4 <= 0.00214: 2 (11.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b4 > 0.00214
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   a4 <= 0.004126: 3 (3.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   a4 > 0.004126: 2 (2.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   f3 > 0.001471: 3 (3.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b1 > 0.007319
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b3 <= 0.123969: 2 (12.0/2.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   b3 > 0.123969: 1 (2.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   a3 > 0.206815: 1 (2.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   c1 > 0.043665: 1 (3.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   r3 > 0.000861: 3 (3.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   |   e1 > 0.029566: 1 (3.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   e5 > 0.001377: 3 (2.0)
|   |   |   |   |   |   |   |   |   |   |   |   d2 > 0.001173
|   |   |   |   |   |   |   |   |   |   |   |   |   s4 <= 0.002873: 2 (32.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   |   s4 > 0.002873: 4 (2.0)
|   |   |   |   |   |   |   |   |   |   |   i4 > 0.004699: 1 (3.0/1.0)
|   |   |   |   |   |   |   |   |   |   f4 > 0.028894: 4 (2.0)
|   |   |   |   |   |   |   |   |   f1 > 0.039667: 3 (3.0)
|   |   |   |   |   |   |   |   u1 > 0.002108: 3 (6.0/1.0)
|   |   |   |   |   |   |   s3 > 0.004874
|   |   |   |   |   |   |   |   q1 <= 0.004708
|   |   |   |   |   |   |   |   |   r4 <= 0.007391: 3 (16.0)
|   |   |   |   |   |   |   |   |   r4 > 0.007391: 2 (2.0)
|   |   |   |   |   |   |   |   q1 > 0.004708: 2 (2.0/1.0)
|   |   |   |   |   |   g4 > 0.033267
|   |   |   |   |   |   |   g5 <= 0.004141
|   |   |   |   |   |   |   |   k4 <= 0.001354: 4 (8.0)
|   |   |   |   |   |   |   |   k4 > 0.001354: 3 (3.0/1.0)
|   |   |   |   |   |   |   g5 > 0.004141: 2 (3.0/1.0)
|   |   |   |   |   d5 > 0.001586: 4 (4.0)
|   |   |   |   d5 > 0.003481
|   |   |   |   |   g5 <= 0.004141
|   |   |   |   |   |   b5 <= 0.002996
|   |   |   |   |   |   |   g4 <= 0.003922: 2 (4.0)
|   |   |   |   |   |   |   g4 > 0.003922: 1 (2.0)
|   |   |   |   |   |   b5 > 0.002996: 3 (2.0)
|   |   |   |   |   g5 > 0.004141: 5 (3.0)
|   |   |   r0 > 0.000626: 4 (3.0/1.0)
|   |   v1 > 0.000362
|   |   |   s4 <= 0.005561
|   |   |   |   t4 <= 0.002371
|   |   |   |   |   e0 <= 0.001979
|   |   |   |   |   |   h2 <= 0.005305: 1 (18.0/1.0)
|   |   |   |   |   |   h2 > 0.005305: 2 (2.0)
|   |   |   |   |   e0 > 0.001979: 2 (2.0)
|   |   |   |   t4 > 0.002371: 2 (2.0/1.0)
|   |   |   s4 > 0.005561: 2 (2.0/1.0)
|   k4 > 0.013828
|   |   f5 <= 0.001805: 4 (9.0/1.0)
|   |   f5 > 0.001805: 2 (2.0/1.0)
b5 > 0.04509
|   t3 <= 0.000515
|   |   d4 <= 0.008991
|   |   |   e2 <= 0.011901
|   |   |   |   a1 <= 0.001341
|   |   |   |   |   g2 <= 0.001762: 4 (3.0/1.0)
|   |   |   |   |   g2 > 0.001762: 5 (2.0)
|   |   |   |   a1 > 0.001341: 5 (4.0)
|   |   |   e2 > 0.011901: 4 (3.0)
|   |   d4 > 0.008991: 2 (2.0/1.0)
|   t3 > 0.000515: 3 (3.0)

Number of Leaves  : 42

Size of the tree : 83
Time taken to build model: 0.75 seconds
4.4.1.2 Testing the decision tree
The model was tested using two different methodologies, namely testing directly on the training dataset and testing using cross-validation with 10 folds. Testing on the training data gave a result of 89.5% accuracy, whereas testing using cross-validation gave an accuracy of 66%. The complete results, along with a discussion, are as follows:
4.4.1.2.1 Testing on Training Data
=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances        179               89.5    %
Incorrectly Classified Instances       21               10.5    %
Kappa statistic                         0.8586
Mean absolute error                     0.1650
Root mean squared error                 0.2382
Relative absolute error                54.9013 %
Root relative squared error            61.5103 %
Total Number of Instances             200
=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.848     0.030     0.848      0.848     0.848       0.953       1
                 0.959     0.079     0.875      0.959     0.915       0.969       2
                 0.932     0.019     0.932      0.932     0.932       0.994       3
                 0.795     0.019     0.912      0.795     0.849       0.987       4
                 0.818     0.000     1.000      0.818     0.900       0.999       5
Weighted Avg.    0.895     0.042     0.897      0.895     0.894       0.977
=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 28  4  0  1  0 |  a = 1
  1 70  1  1  0 |  b = 2
  1  2 41  0  0 |  c = 3
  3  3  2 31  0 |  d = 4
  0  1  0  1  9 |  e = 5
4.4.1.2.2 Testing by Cross-Validation (10 folds)
=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        132               66      %
Incorrectly Classified Instances       68               34      %
Kappa statistic                         0.5303
Mean absolute error                     0.2133
Root mean squared error                 0.308
Relative absolute error                70.9833 %
Root relative squared error            79.5133 %
Total Number of Instances             200
=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.545     0.072     0.600      0.545     0.571       0.865       1
                 0.890     0.315     0.619      0.890     0.730       0.871       2
                 0.500     0.000     1.000      0.500     0.667       0.874       3
                 0.538     0.081     0.618      0.538     0.575       0.833       4
                 0.545     0.016     0.667      0.545     0.600       0.983       5
Weighted Avg.    0.66      0.143     0.702      0.66      0.653       0.869
=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 18 14  0  0  1 |  a = 1
  7 65  0  1  0 |  b = 2
  1 13 22  7  1 |  c = 3
  4 13  0 21  1 |  d = 4
  0  0  0  5  6 |  e = 5
4.4.1.2.3 Discussion
Testing directly on the training data classified 179 cases correctly out of 200, which is an accuracy of 89.5%. A very high accuracy when testing on the training data is always desired, because it signifies the extent to which the model has learnt the training data. Since there were 5 classes in the target variable (5 products), any model accuracy of more than 20% (the equal probability of each class being 1/5 = 0.2 = 20%) has to be considered better than chance. An accuracy of 89.5% signifies that the built decision tree has learnt the training data quite accurately.

Testing using cross-validation is a process of dividing the data into different subsets and then carrying out the analysis on one subset and testing it on the other. Doing this with 10 folds is the process of carrying out cross-validation 10 times and averaging out the accuracy score. Again, as stated above, any accuracy of more than 20% is better than chance. The achieved result of an average of 132 correct classifications out of 200, an accuracy of 66%, is well within the desired range.

Ideally, the model should have been trained on more data. Due to the limitation of time, and with no compensation available to volunteers, only 200 tuples of useful data could be collected. It is expected that with a bigger training dataset, the accuracy of the models would increase.
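The 10-fold procedure described above can be sketched as follows. The fold assignment, the toy data, and the majority-class “classifier” are illustrative assumptions of this sketch; the real classifier is abstracted behind train/predict callbacks:

```javascript
// Sketch of k-fold cross-validation accuracy. `train` builds a model on
// the training rows; `predict` applies it to one held-out row.
function crossValidate(data, folds, train, predict) {
  let correct = 0;
  for (let f = 0; f < folds; f++) {
    // every `folds`-th row (offset f) forms the held-out test fold
    const test = data.filter((_, i) => i % folds === f);
    const trainSet = data.filter((_, i) => i % folds !== f);
    const model = train(trainSet);
    for (const row of test) {
      if (predict(model, row) === row.label) correct++;
    }
  }
  return correct / data.length; // accuracy pooled over all folds
}

// Toy check: a majority-class "classifier" on a 2:1 label mix.
const data = [];
for (let i = 0; i < 30; i++) data.push({ label: i % 3 === 0 ? 2 : 1 });
const majority = rows => {
  const c = {};
  rows.forEach(r => (c[r.label] = (c[r.label] || 0) + 1));
  return Object.keys(c).reduce((a, b) => (c[a] >= c[b] ? a : b));
};
const acc = crossValidate(data, 10, majority, (m, r) => Number(m));
```

WEKA additionally stratifies the folds so that each preserves the overall class proportions; this sketch omits that refinement.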
4.4.2 Neural Network
The saved ARFF file was re-opened in WEKA and, under the classify tab, the MultilayerPerceptron function was chosen. There are different parameters associated with this neural network function and, as was done with the decision trees, a trial and error method was used to find the best set. The best set of parameters was the one that gave the maximum classification accuracy on the training dataset. Each obtained model was tested using two methodologies, namely testing directly on the training data and testing using cross-validation. After multiple iterations of trial and error, a model giving a good classification accuracy was obtained. The model was also saved for later use.
4.4.2.1 Details of the chosen neural network
The final parameters selected that gave the best output on the training data are:
• GUI: The GUI parameter brings up an interface. It doesn’t really impact the final model, unless some changes in the learning rate and momentum are desired while training. It was set to ‘False’ in the project.
• autoBuild: The ANN was built automatically and hence this parameter was set to ‘true’.
• debug: This is to view additional information on the console.
• decay: It was observed that a ‘true’ decay value gave slightly less accuracy, and hence in the final model ‘decay’ was set to ‘false’.
• hiddenLayers: Since an automatic neural network was desired, WEKA was left to decide the number of hidden layers, and hence the final set of parameters had a value of ‘a’ in the hiddenLayers field. ‘a’, when used as a value for hiddenLayers, means ‘automatic’.
• learningRate: The amount by which the weights are updated was set to 0.1.
• momentum: Momentum of 0.2 was applied to the weights during updating.
• nominalToBinaryFilter: There were no nominal variables in the data and hence this parameter had no impact on the model.
• normalizeNumericClass: Since the class is not numeric, and the data was already normalized, there was no use for this feature, and hence it was set to ‘false’.
• reset: With reset set to false, no error message was received. Moreover, the chosen learning rate of 0.1 is already quite low, and hence this feature was set to ‘false’.
• seed: A seed value of 0 was used. As in the case of decision trees, this value is used to initialize the random number generator. Random numbers are used for setting the initial weights of the connections between nodes, and also for shuffling the training data.
• trainingTime: The number of epochs to train through was set to 5000.
• validationSetSize: The percentage size of the validation set was made 0, which signifies that no validation set will be used; instead the network will train for the specified number of epochs, i.e. for 5000 epochs.
• validationThreshold: This parameter was set to 20, which dictates that the validation set error can get worse 20 times in a row before training is terminated.
The parameters used in the final neural network model can be summarized as:
Figure 9: Parameters used for building the Neural Network model
It was not possible to include the complete model output in this document, and hence the summary of the model obtained is as follows:
=== Run information ===

Scheme:       weka.classifiers.functions.MultilayerPerceptron -L 0.1 -M 0.2 -N 5000 -V 0 -S 0 -E 20 -H a -R
Relation:     MLData_Normalized-weka.filters.unsupervised.attribute.Remove-R1
Instances:    200
Attributes:   133
              [list of attributes omitted]
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===
The chosen neural network had 1 hidden layer with 68 nodes. There were 132 input nodes accepting the 132 normalized time values corresponding to each section of the webpage. The model had 5 output nodes, one for each of the five laptops. There were a total of 73 threshold values for the 73 nodes (68 hidden layer nodes + 5 output nodes), and there were 9316 weight values (132*68 + 68*5).
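The threshold and weight counts quoted above follow directly from the layer sizes, as a quick arithmetic check confirms:

```javascript
// Layer sizes reported by WEKA for the chosen network.
const inputs = 132;   // one node per webpage section
const hidden = 68;    // single hidden layer
const outputs = 5;    // one node per laptop product

// Every hidden and output node carries one bias/threshold value.
const thresholds = hidden + outputs;                 // 68 + 5 = 73
// Fully connected layers: input-to-hidden plus hidden-to-output weights.
const weights = inputs * hidden + hidden * outputs;  // 8976 + 340 = 9316
```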
4.4.2.2 Testing the neural network model
The neural network model was also tested in the same way as the decision trees, using two different methodologies, namely testing directly on the training set and testing using cross-validation with 10 folds. It was found that testing on the training dataset gave an exceptionally good result of 95.0%, whereas testing using cross-validation with 10 folds gave a classification accuracy of 41.0%.
4.4.2.2.1 Testing on Training Data
=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances        190               95      %
Incorrectly Classified Instances       10                5      %
Kappa statistic                         0.9335
Mean absolute error                     0.0219
Root mean squared error                 0.1313
Relative absolute error                 7.2772 %
Root relative squared error            33.8899 %
Total Number of Instances             200
=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.939     0.012     0.939      0.939     0.939       0.966       1
                 0.918     0.024     0.957      0.918     0.937       0.936       2
                 1.000     0.026     0.917      1.000     0.957       0.993       3
                 0.949     0.006     0.974      0.949     0.961       0.957       4
                 1.000     0.000     1.000      1.000     1.000       1.000       5
Weighted Avg.    0.95      0.017     0.951      0.95      0.95        0.961
=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 31  2  0  0  0 |  a = 1
  2 67  3  1  0 |  b = 2
  0  0 44  0  0 |  c = 3
  0  1  1 37  0 |  d = 4
  0  0  0  0 11 |  e = 5
4.4.2.2.2 Testing by Cross-Validation (10 folds)
=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         82               41      %
Incorrectly Classified Instances      118               59      %
Kappa statistic                         0.2165
Mean absolute error                     0.236
Root mean squared error                 0.4551
Relative absolute error                78.4778 %
Root relative squared error           117.4608 %
Total Number of Instances             200
=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.333     0.12      0.355      0.333     0.344       0.614       1
                 0.575     0.22      0.6        0.575     0.587       0.706       2
                 0.295     0.237     0.26       0.295     0.277       0.626       3
                 0.282     0.155     0.306      0.282     0.293       0.652       4
                 0.455     0.042     0.385      0.455     0.417       0.856       5
Weighted Avg.    0.41      0.185     0.415      0.41      0.412       0.671
=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 11  8  5  8  1 |  a = 1
  9 42 19  2  1 |  b = 2
  6 12 13 11  2 |  c = 3
  5  7 12 11  4 |  d = 4
  0  1  1  4  5 |  e = 5
4.4.2.2.3 Discussion
Testing on the training data classified 190 cases correctly out of 200, which is an accuracy of 95.0%. Such a high classification accuracy clearly signifies that the built neural network model has learnt the training data very well.

Testing using cross-validation is a process of dividing the data into different subsets and then carrying out the analysis on one subset and testing it on the other. Doing this with 10 folds is the process of carrying out cross-validation 10 times and averaging out the accuracy score. The achieved result of an average of 82 correct classifications out of 200, i.e. an accuracy of 41.0%, is comparatively low but still well above the 20% chance level.

Ideally, the model should have been trained on more data. Due to the limitation of time, and with no compensation available to volunteers, only 200 tuples of useful data could be collected. Since there is a hidden layer with 68 nodes and a total of 9316 weight values involved, a much bigger training dataset was required. It is expected that with a bigger training dataset, the testing accuracy would increase.
4.4.3 Decision Tree vs. Neural Network
Based on the initial 200 data cases, one model each of a decision tree and a neural network was trained. Upon testing using cross-validation, the decision tree showed considerably better accuracy (66%) than the neural network model (41%), even though the neural network fit the training data better. The other factors worth considering about the two models are:
• Building a neural network model in WEKA is easy but time consuming; moreover, it slows down the performance of the website after its implementation. The objective of the project is to determine the product for the users in real time while they are still browsing, which requires very fast computation. A decision tree is a set of conditions, which can be evaluated much more efficiently than the calculations and temporary variables required by a neural network. However, if a parallel web server capable of performing the calculations faster is used, a neural network could also be considered for implementation.
• With time, the website would keep accumulating more and more mouse movement data, and the model should be improved/retrained on new data whenever required. This would require re-implementing the new model every time an update is desired. As stated above, this would be more difficult, time consuming and error prone for neural networks than for decision trees.
• Decision trees are more transparent than neural network models. This means that for a person visually inspecting the two models, a decision tree could give him some information, whereas a neural network can visually tell him nothing. This was, however, not one of the points considered before taking a final call on the model to be chosen.
Despite all these points, the final models of both the neural network and the decision tree were implemented in two similar copies of the same website. Further tests of accuracy and performance were conducted later in order to conclude which model is better for the problem at hand. The next chapter will explain the steps required to put these models into the website so that they can be used in real time to predict relevant content for a user.
5 EMBEDDING THE MACHINE LEARNING MODELS IN THE WEBSITE
This chapter explains the complete methodology adopted to apply the built machine learning models in the website. It also explains the interaction between the models and the website, and how a user’s mouse movement data was used to predict the best content for him in real time.
5.1 What and Why?
As explained in the previous chapter, a decision tree and a neural network model capable of predicting the product the user is most likely to buy were built. These models need to be implemented in the website so that they can take the mouse movement behavior of new users as input and predict the appropriate product for them.
5.2 Specifications
The initial website, built as explained in Chapter 3 for collecting the training data, was modified to implement the decision tree and neural network models. Some additional characteristics required from the website were:
• The model should reside on the server. This is essential from a security point of view; otherwise any user would have access to the model, which by reverse engineering could give information about the products bought by other users.
• Real-time model evaluation on the real-time mouse movement data.
• Real-time transfer of the model output from the web server to the frontend website, so that the website can use the model prediction.
• Determining the product the user is most likely to buy using the embedded models periodically, say every 10 seconds. This would involve including the latest mouse movement data and transferring the output again to the frontend HTML website, so that if any change in the final product is predicted, it can be reflected on the frontend.
• All the tracking and model evaluation was to be carried out in a hidden layer; the user was not asked for any explicit information, and speed and performance were not compromised.
• Needless to say, the website should continue to track mouse movement as explained in earlier chapters.
5.3 Implementation
The website built initially to collect training data already had mouse movement tracking capability. A few new functions and scripts were added to enable model evaluation on the captured mouse movements.

A new JavaScript function named ‘predict()’ was programmed in the JavaScript file. ‘predict()’ was made to be called every 10 seconds, because a prediction from the machine learning model was expected every 10 seconds. Every subsequent 10 seconds, the database would contain more mouse movement data that could be used by the machine learning models to, ideally, predict more accurately.
The ‘predict()’ function takes no arguments and calls a PHP script named ‘predict.php’, passing it the userID of the current user via the GET method. The ‘predict.php’ file resides on the server, and the call from JavaScript was programmed using standard AJAX techniques. The code snippet of the JavaScript ‘predict()’ function is:
// Schedule the first prediction 10 seconds after the page loads.
function autoPredict()
{
    setTimeout("predict()", 10000);
}

// Ask the server for a prediction for the current user via AJAX.
// `http` is an XMLHttpRequest object created elsewhere in the script.
function predict()
{
    http.open("GET", "predict.php?userId=" + userId, true);
    http.onreadystatechange = predictResponse;
    http.send(null);
}
The ‘predict.php’ file connects to the MySQL database and selects the mouse movement data for the current user using a simple SQL ‘SELECT *…’ statement. The mouse movement data was saved into 132 temporary variables that correspond to each section of the webpage. The total time spent by the user so far was also calculated while saving these temporary variables. The absolute time value spent in each section, as saved in the 132 temporary variables, was then replaced by the normalized time spent in that section, obtained by dividing the absolute time value by the total time spent by that user. Hence, after this step, the 132 temporary variables in the ‘predict.php’ file contain the normalized/relative time spent by the user in the corresponding 132 sections of the webpage. These 132 temporary variables are the 132 independent input variables for the model.
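The project performs this normalization in PHP; the same transformation can be sketched in JavaScript as follows (the function name and the toy 4-section input are illustrative):

```javascript
// Divide each section's absolute dwell time by the user's total time,
// turning raw times into the relative attention values the models expect.
function normalizeTimes(sectionTimes) {
  const total = sectionTimes.reduce((sum, t) => sum + t, 0);
  if (total === 0) return sectionTimes.map(() => 0); // no movement yet
  return sectionTimes.map(t => t / total);
}

// Toy input: 4 sections instead of the project's 132.
const normalized = normalizeTimes([2, 1, 1, 0]);
```

The normalized values sum to 1, so users who browse for different lengths of time become comparable to the model.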
The two models (decision tree and neural network) were then coded and given access to these 132 temporary variables, so that they can evaluate the normalized times and make their respective predictions. It should, however, be noted that for testing, only one of the models was used at a time; both models were tested separately later for comparison purposes. The implementation of the two models is as follows:
5.3.1 Implementing the Decision Tree model
A function named ‘decisionTree()’ was coded in the PHP file ‘predict.php’. This function had access to all the 132 input variables stated above. The model made in WEKA had a set of 83 if-else statements (83 being the size of the tree). All these 83 if-else statements from the WEKA model, along with the prediction values, were coded in PHP. The if-else statements perform comparisons on the 132 independent variables so as to imitate the decision tree. The output of this set of if-else statements is a single value, which is also the output of the decision tree model: the product the user is most likely to buy according to the implemented decision tree. This value was returned to the main program by the function. The complete code of the ‘decisionTree()’ function and the ‘predict.php’ file is available in the appendix.
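The shape of such a hand-coded tree can be illustrated with a small fragment. The thresholds below are taken from the printed J48 tree, but only the first few branches are transcribed and the deeper sub-trees are replaced by placeholder returns, so this is a sketch rather than the project's full 83-statement function (which was written in PHP):

```javascript
// Sketch of a decision tree transcribed into nested if-else statements,
// mirroring the first branches of the printed J48 tree. `v` holds the
// normalized section times keyed by attribute name.
function decisionTreeSketch(v) {
  if (v.b5 <= 0.04509) {
    if (v.k4 > 0.013828) {
      return v.f5 <= 0.001805 ? 4 : 2; // leaf predictions (product codes)
    }
    return 2; // placeholder for the deeper sub-tree, elided in this sketch
  }
  if (v.t3 > 0.000515) return 3;
  return 5; // placeholder for the remaining sub-tree
}

const product = decisionTreeSketch({ b5: 0.03, k4: 0.02, f5: 0.001, t3: 0 });
```

Evaluating such a function is just a walk down one root-to-leaf path, which is why the decision tree was attractive for real-time use.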
5.3.2 Implementing the Neural Network model
Another function, named ‘neuralNetwork()’, was implemented. This function also had access to the 132 independent input variables stated above. The neural network built in WEKA had one hidden layer with 68 nodes. To implement this hidden layer, 68 new temporary variables named ‘Node5’, ‘Node6’, ‘Node7’, ….., ‘Node72’ were created, with values computed using the standard neural network formula.
All the coefficient values as well as the threshold limits were used as given by WEKA during model building. To implement the output layer, the same formula was applied to these 68 temporary variables (the 68 hidden layer nodes, i.e. Node5, Node6, …, Node72). The output layer of the neural network model had 5 nodes corresponding to the five laptop products. The product corresponding to the node with the highest value was predicted as the laptop the current user is most likely to buy.
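The per-node computation, a weighted sum plus bias/threshold passed through the sigmoid, can be sketched as follows. The tiny 2-2-2 network and its weights are illustrative assumptions of this sketch, not the project's 132-68-5 network:

```javascript
// Standard feedforward computation for one sigmoid layer:
// out[j] = sigmoid(bias[j] + sum_i w[j][i] * in[i])
const sigmoid = x => 1 / (1 + Math.exp(-x));

function layerForward(inputs, weights, biases) {
  return weights.map((row, j) =>
    sigmoid(biases[j] + row.reduce((s, w, i) => s + w * inputs[i], 0))
  );
}

// Toy 2-input, 2-hidden, 2-output network (weights are illustrative).
const hidden = layerForward([1, 0], [[1, -1], [-1, 1]], [0, 0]);
const output = layerForward(hidden, [[2, -2], [-2, 2]], [0, 0]);
// The predicted class is the output node with the highest value.
const predicted = output.indexOf(Math.max(...output)) + 1;
```

The project's ‘neuralNetwork()’ function performs exactly this kind of computation in PHP, once for the 68 hidden nodes and once for the 5 output nodes, then takes the index of the largest output as the product code.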
5.4 Using model outputs
As stated above, only one of the two models was used at a time for a given user. After receiving the output from the model in use (decision tree or neural network), the output was sent back to the frontend JavaScript function named ‘predictResponse()’ via AJAX. It should be noted that the model output is the code of the one of the 5 laptops that the current user is most likely to buy.

After receiving the prediction, the ‘predictResponse()’ JavaScript function can be programmed as per the needs. In the current project, the author decided to simply highlight the border of the predicted laptop in red. The predicted laptop is the one the user is most likely to buy, as predicted by one of the machine learning models based on the user’s mouse movement behavior. The function definition of the ‘predictResponse()’ function is as follows:
function predictResponse()
{
    if (http.readyState == 4)   // response fully received
    {
        predictProduct = http.responseText;
        var colName = Number(predictProduct) + 1;
        // reset the style of all product columns...
        document.getElementById("cg2").className = "";
        document.getElementById("cg3").className = "";
        document.getElementById("cg4").className = "";
        document.getElementById("cg5").className = "";
        document.getElementById("cg6").className = "";
        // ...then highlight the column of the predicted product
        document.getElementById("cg" + colName).className = "oce-predict";
        alert("Product : " + predictProduct);
        // schedule the next prediction in 10 seconds
        setTimeout("predict()", 10000);
    }
}
The function above gets the response from the PHP script via the standard AJAX http.responseText property. The output is then used to simply change the style of the column containing that product; the style of all other columns is first reset before changing the style of the predicted laptop column. The JavaScript ‘predict()’ function is then called again after 10,000 milliseconds. In the current demonstration, a popup was also shown to the user with the code of the laptop he is most likely to buy; this was done using the alert statement.
There can be several other usages of the prediction. It can be imagined that a customer would be served more easily and appropriately if the shopkeeper knows the product the customer is most likely to buy. The customer could be given other options similar to the predicted product. Even if not used by the content generator of the website, this prediction can always help visitors in finding the information they have been looking for.
The screenshot of the prediction made by the Decision Tree model is shown in Figure 10.
Figure 10: Screenshot of the prediction done by the model
5.5 What next
Once the website was programmed and the machine learning models were embedded, it was again made public and users were invited to visit it again. All the mouse movement data was saved in the databases as designed earlier, along with the product the user buys. The users were also shown the real-time prediction as per the model after every 10 seconds. The prediction done by the model was not saved in any database, for the following reasons:
• Connecting the ‘predict.php’ file to the databases and saving data would certainly take time. The time used up in saving the predicted output would affect the performance of the website, mainly because it would delay the return of the model output from the ‘predict.php’ file to the JavaScript ‘predictResponse()’ function.
• The final prediction made for any user by the model can always be recalculated, as the databases keep a record of the mouse movement data for every user. This would be done later, in the testing phase of the project.
• The prediction was made every 10 seconds. This means that there would be several predictions (four on average) for every user. The count of four predictions was estimated because it was found earlier, in section 3.3.2.2, that the average time spent by a user on the webpage is 40.26 seconds. Saving every prediction per user is again a performance issue, as the table storing them would be expected to grow with time.
The final website, capable of predicting the product the user is most likely to buy, was made public and was kept online for 7 days. The users were again invited via emails, social media, chats, etc., and were asked to surf the final version of the webpage. The volunteers were required to buy one of the products after evaluating all the options (5 laptops) available on that page. While doing so, the users were shown the product they are most likely to buy. The visitors reported informally, via email and in-person conversations, that the predictions were quite accurate.
The next chapter will explain a more formal and quantitative method of testing the predictions made by the two models. It will also describe the methodology adopted to test the time performance of the two models.
6 TESTING AND RESULTS
This chapter describes the complete testing phase of the project. It describes the data collection steps and the parameters on which the models were evaluated. It also explains the testing methodology and summarizes the final results obtained.
6.1 Testing methodology
Two types of tests were conducted to evaluate the implementation. One test was conducted in WEKA on the collected test data to check the classification accuracy of each model (decision tree or neural network). The other test was conducted on the ‘predict.php’ file to check the time performance of the website after implementing the model. Both of the above-mentioned tests were performed on both models separately. The methodology adopted and the results obtained are described in the following sections.
6.2 Testing for model accuracy
Testing data was collected while the final website was live and was used to further test the two models in WEKA. It was found that the decision tree model gave an accuracy of 84.09%, whereas the neural network model gave an accuracy of 34.09%, on the collected test data. Details about the tests conducted are as follows:
6.2.1 Testing data collection
While the website with one of the machine learning models was live, the users' mouse tracking data and the final product bought by each user were being saved in the tables ‘data’ and ‘bought’ respectively. It was found that in the 7 days for which the test website was live, 49 unique users visited the webpage. There were 1275 tuples in the ‘data’ table and 44 tuples in the ‘bought’ table. The difference between the cardinality of the ‘bought’ table and the number of visitors arose because 5 users (49 minus 44) did not click the buy button and left the site after browsing it for a while.
This data was processed in a similar way to the initial data, as described in section 3.3.2. The steps followed to analyze and prepare the test data are as follows:
• The data was converted into a more usable format using the PHP script named ‘alignData.php’. The details of this script are given in section 3.3.2.1. This step converted the test data into ‘one user per row’ form, with the time values of each user in the same row along with the product bought.
• This data was exported into Excel and normalized. To normalize the times, the total time spent by each user was calculated, and the time spent in each section/cell was then divided by that total. This is explained in detail in section 3.3.2.3.
• It should be noted that no outliers were removed in this step. The reason is that the data was collected from actual users, and it is expected that all kinds of people will use the website in all possible ways; an accurate measure of accuracy requires that all these cases, including any outliers, be taken into consideration.
• This data was saved in a CSV file, which was then opened in WEKA.
• The CSV file was opened in WEKA and saved in WEKA's default ARFF format. The ARFF file was opened in a text editor and the type of the bought attribute was changed from numeric to nominal, as stated in section 4.4.
This data was then opened in WEKA again, and the model testing was carried out as explained in the following sections:
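As an aside, the time-normalization step described above can be sketched as follows; this is a minimal JavaScript illustration (the language of the site's client code), with hypothetical cell names and times rather than the actual ‘alignData.php’ output:

```javascript
// Normalize one user's per-cell dwell times so that each value becomes the
// fraction of the user's total time, as was done for the test data in Excel.
function normalizeTimes(cellTimes) {
  const total = Object.values(cellTimes).reduce((a, b) => a + b, 0);
  const normalized = {};
  for (const [cell, time] of Object.entries(cellTimes)) {
    normalized[cell] = time / total;
  }
  return normalized;
}

// Hypothetical row: milliseconds spent in three cells by one user
const row = { a1: 5000, b2: 15000, c3: 20000 };
console.log(normalizeTimes(row)); // { a1: 0.125, b2: 0.375, c3: 0.5 }
```

Normalizing in this way makes users comparable regardless of how long each one stayed on the page.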
6.2.2 Model testing in WEKA using test data
The saved files of the two models were opened in WEKA. In the Classify tab, the supplied test set option was chosen and, after pressing the Set button, the collected and normalized test data file was opened. The loaded model was then made to evaluate on this testing data by right-clicking the model and selecting “Re-evaluate model on current test set”. This method evaluates the model on the collected test dataset and shows the accuracy results on this test data. It is equivalent to running the model on the website using PHP: the output given by the model while testing in WEKA would be exactly the same as the one given by the PHP script online, because the WEKA model obtained was the one implemented in the website. This was the reason that the predictions were not saved, as stated in section 5.5. Checking for accuracy is then simply a matter of comparing the model prediction with the actual product bought by the user. The details of the results given by both models when evaluated on the test set are explained in the following subsections.
6.2.2.1 Decision Tree model
The test dataset with 44 cases was evaluated using the built decision tree model. It was found that the tree was able to correctly classify 37 out of the 44 cases, an accuracy of 84.0909%.
The output obtained after re‐evaluation from WEKA was:
=== Re-evaluation on test set ===

User supplied test set
Relation:     MasterData-1-weka.filters.unsupervised.attribute.Remove-R1
Instances:    unknown (yet). Reading incrementally
Attributes:   133

=== Summary ===

Correctly Classified Instances          37               84.0909 %
Incorrectly Classified Instances         7               15.9091 %
Kappa statistic                          0.7825
Mean absolute error                      0.1916
Root mean squared error                  0.2814
Total Number of Instances               44

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.857     0.027     0.857       0.857    0.857       0.986      1
0.875     0.143     0.778       0.875    0.824       0.872      2
0.917     0.031     0.917       0.917    0.917       0.921      3
0.833     0.026     0.833       0.833    0.833       0.945      4
0.333     0         1           0.333    0.5         0.89       5
Weighted Avg.   0.841   0.068   0.851   0.841   0.834   0.915

=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
  6  1  0  0  0 |  a = 1
  1 14  0  1  0 |  b = 2
  0  1 11  0  0 |  c = 3
  0  0  1  5  0 |  d = 4
  0  2  0  0  1 |  e = 5
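The summary figures above can be recomputed directly from the confusion matrix; the following JavaScript sketch (not part of the project code) derives the accuracy and the Kappa statistic from the matrix rows and columns:

```javascript
// Confusion matrix from the WEKA output above: rows are actual classes,
// columns are predicted classes (laptops 1-5).
const m = [
  [6, 1, 0, 0, 0],
  [1, 14, 0, 1, 0],
  [0, 1, 11, 0, 0],
  [0, 0, 1, 5, 0],
  [0, 2, 0, 0, 1],
];

const n = m.flat().reduce((a, b) => a + b, 0);           // 44 instances in total
const correct = m.reduce((s, row, i) => s + row[i], 0);  // 37 on the diagonal
const accuracy = correct / n;

// Kappa corrects the raw agreement for chance using the row/column marginals
const rowSums = m.map(r => r.reduce((a, b) => a + b, 0));
const colSums = m[0].map((_, j) => m.reduce((s, r) => s + r[j], 0));
const pe = rowSums.reduce((s, r, i) => s + r * colSums[i], 0) / (n * n);
const kappa = (accuracy - pe) / (1 - pe);

console.log((100 * accuracy).toFixed(4) + " %", kappa.toFixed(4)); // 84.0909 % 0.7825
```

Both values match the WEKA summary, which is a useful sanity check on the reported output.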
6.2.2.2 Neural Network model
The dataset of 44 test cases was evaluated on the neural network model, and it was found that it classified 15 cases correctly. This amounts to an accuracy of only 34.0909% for the neural network model on the testing data.
The output obtained from WEKA was:
=== Re-evaluation on test set ===

User supplied test set
Relation:     MasterData-1-weka.filters.unsupervised.attribute.Remove-R1
Instances:    unknown (yet). Reading incrementally
Attributes:   133

=== Summary ===

Correctly Classified Instances          15               34.0909 %
Incorrectly Classified Instances        29               65.9091 %
Kappa statistic                          0.1367
Mean absolute error                      0.2695
Root mean squared error                  0.5001
Total Number of Instances               44

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.429     0.135     0.375       0.429    0.4         0.695      1
0.313     0.25      0.417       0.313    0.357       0.694      2
0.25      0.281     0.25        0.25     0.25        0.505      3
0.5       0.184     0.3         0.5      0.375       0.623      4
0.333     0.024     0.5         0.333    0.4         0.78       5
Weighted Avg.   0.341   0.216   0.354   0.341   0.34   0.639

=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
  3  3  0  1  0 |  a = 1
  2  5  7  2  0 |  b = 2
  3  4  3  2  0 |  c = 3
  0  0  2  3  1 |  d = 4
  0  0  0  2  1 |  e = 5
6.2.3 Discussion
On the training data, the decision tree gave an accuracy of 89.5% whereas the neural network gave an accuracy of 95%. The same decision tree and neural network models gave accuracies of 84.0909% and 34.0909% respectively when evaluated on the test dataset. For comparing models, accuracy on the test dataset, i.e. data on which the model has not been trained, is one of the most important parameters. As discussed in section 4.4.3, there are several drawbacks to using neural networks in the present situation, but after evaluating the two models on the test dataset it is clear that the decision tree has clearly outperformed the neural network and should be used for making predictions. This, however, depends on many parameters, the most important being the sizes of the training and testing datasets. Since the scope of this project was limited, a large amount of data could not be collected, but it is advised that both decision trees and neural networks, along with other machine learning models, should be evaluated before settling on one of them.
6.3 Testing time performance of the models
A new PHP script was written and executed on the server to estimate the average time that model processing takes when executed in real time. To do this, the PHP script was connected to the database containing the test data. Both model functions were then called, and the time taken by them to evaluate all 44 test cases was measured. This was averaged over the 44 cases to estimate the average time each model takes for a single prediction in PHP. The whole process was carried out 10 separate times so as to average out any clashes with unforeseen tasks at the server that might delay the model execution.
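The timing procedure described above can be sketched as follows. This is a JavaScript illustration of the measurement logic, not the actual PHP script; ‘model’ and ‘testCases’ stand in for the embedded model functions and the 44 collected cases:

```javascript
// Time a model over all test cases, average per prediction, and repeat the
// whole measurement several times to smooth out unrelated server load.
function averagePredictionTime(model, testCases, runs = 10) {
  const perRunAverages = [];
  for (let r = 0; r < runs; r++) {
    const start = Date.now();
    for (const c of testCases) model(c);
    perRunAverages.push((Date.now() - start) / testCases.length);
  }
  // Final estimate: mean of the per-run averages (in milliseconds)
  return perRunAverages.reduce((a, b) => a + b, 0) / runs;
}
```

In the project, the same scheme was applied to both model functions over the 44 test cases.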
The time taken by the model is an important characteristic, as the intelligent website is expected to predict the output as soon as possible and, of course, in real time. A model taking more than some threshold time for its calculations is of little practical use. The process and results are explained in the following sections.
6.3.1 Decision Tree model
The decision tree was made to execute on all 44 test cases and the time taken was averaged. This was done 10 times, and the average times (in seconds) taken by the script to evaluate the decision tree were:
From the above 10 time values, the following insights can be seen:
• Minimum time taken by the model was approximately 0.00053 seconds
• Maximum time taken by the model was approximately 0.00737 seconds
• Average time taken by the model was 0.00163 seconds
6.3.2 Neural Network model
As was done for the decision tree, the neural network model was also made to evaluate the 44 test cases and the average time taken was noted. This was done 10 times. The average execution times taken by the neural network model were:
0.496645451, 0.707032805
From the above 10 time values, the following insights can be seen:
• Minimum time taken by the model was approximately 0.4839 seconds
• Maximum time taken by the model was approximately 0.7461 seconds
• Average time taken by the model was 0.5973 seconds
6.3.3 Discussion
It is clearly seen that the neural network model takes far more time to execute than the decision tree model. It was also found that the chosen decision tree model runs at least 350 times faster than the chosen neural network. Since the objective is to predict in real time, speed is a very important parameter, and the decision tree model has completely won the time performance battle.
6.4 Results
After testing both models (decision tree and neural network) on the prediction accuracy and time performance parameters, it was clearly found that the decision tree proved much better suited for implementation in the current problem than the neural network. The results obtained in the tests are summarized below:
• Accuracy (on test dataset):
o Decision Tree: 84.0909%
o Neural Network: 34.0909%
• Time Performance (PHP scripts running on Apache):
o Decision Tree: 0.0016 seconds
o Neural Network: 0.5973 seconds
It should, however, be noted that these results were obtained when the models were trained on only 200 cases. The neural network model had a total of 73 nodes, including 68 hidden nodes; to train such a neural network properly, at least a few thousand cases would be required. The neural network model was built to establish the fact that it can be used on a website to predict relevant content for the user. The decision tree, on the other hand, is also expected to give better results when larger training and testing datasets are available. It should also be noted that there were five classes of the dependent variable (5 possible laptop products), and hence a model would have been considered void only if its accuracy were close to 20% (20% being the equally likely chance of each class). Since the accuracy obtained for both machine learning models was far above the 20% benchmark, both models have shown promise that they have the potential to recommend relevant content for a user based on his mouse movement behavior.
The next chapter will give a brief conclusion of the work done.
7 CONCLUSION
This chapter gives the conclusion of the project and discusses the scope for possible future work. It also describes some other possible implementations of the explained methodology.
It has been successfully demonstrated that, by building a machine learning model on a user's mouse movement data, appropriate content for that user can be predicted. The dummy shopping website developed, embedded with a decision tree machine learning model, gave a remarkable accuracy of 84.09% on the test data. The accuracy was measured as the ratio of correct predictions to the total number of predictions made by the model. It was also found that implementing a decision tree model in a website would not affect the performance of the page, as the average time taken by the model was found to be around 1.6 milliseconds. A neural network model was similarly evaluated; it gave an accuracy of 34.09% and took an average time of 597.3 milliseconds to process a single case of data.
The objective of the project was to use the mouse movement behavior of a user to predict appropriate content for him intelligently and in real time. This objective was successfully achieved, and several other sub-objectives were also reached while working on the project.
User’s mouse tracking was implemented successfully using a completely new algorithm.
This was done using PHP, AJAX, HTML and MySQL. The performance of the website after
implementing mouse tracking was not compromised, and the accuracy of the collected mouse tracking data was found to be very high. A webpage imitating a shopping portal was developed, and some highlighting techniques were applied to it to help ensure that the user's mouse pointer stays close to his point of gaze.
The initial website, developed in PHP, was live for around two weeks and collected 200 cases of training data. The data was then used to train two separate machine learning models, namely a Decision Tree model and a Neural Network model. Both models gave promising results when tested on the training data, which indicated that the models built had learned the mouse movement behavior appropriately.
Both machine learning models were coded back into the website using PHP and AJAX. The website collected mouse movement data, which was dynamically read by the models, and an output was generated. This predicted output was sent to the webpage for further personalization.
A total of 44 test cases was also collected from the final website. Using these 44 test cases, both models were evaluated, and the decision tree model was found to perform extremely well compared to the neural network model, both in terms of accuracy and of time performance. The decision tree classified 2.5 times as many cases correctly in the set of 44 cases, and was 350 times faster than the neural network model.
This, however, cannot be generalized, as it depends on the size of the initial training dataset (which was small in the current scope of the project) and on the number of independent variables (which was large in the current implementation).
The working demonstration of the project, along with its documentation and the source code under the GNU General Public License, is available online at http://sparshgupta.name/MSc/Project
7.1 Future Work
The proposed idea has shown huge potential, and there is a lot of scope for future innovation and improvement if properly explored. The lack of data was the prime limitation of the current study.
If a commercial website is required to be intelligent, then models built on several thousands of cases of training data should be used, and once that much data is available, other machine learning algorithms could also be explored. The data collected in the testing phase can later be used to train the models.
There is a never-ending cycle of model training and improvement involved in the proposed concept and implementation: with time, the website will accumulate a lot of data that, at regular intervals, can be used to further train the implemented model or to build a new one. It is expected that with every improvement in the model, its capability to predict the relevant content for a new user will increase.
The proposed implementation requires that each section of the website call the mouse tracking functions whenever the mouse enters or leaves that section. This requires explicit coding of function call statements in every cell, which might not be feasible in highly dynamic websites; hence, work could be done on implementing the idea on any given website while requiring almost no change to the existing web code.
In the current project, the information about the predicted content (i.e., the laptop the user is most likely to buy) was not exploited further. Work can be done to make the website interact with the user like a salesman. The website could remove all the products the user would be least interested in and show him only the products he is most likely to buy.
The current implementation involved using only a single machine learning model at a time. Multiple models could be implemented in the webpage, and the strength of the prediction made could also be used to further interact with the user. In case all the different implemented models give the same prediction, it can be assumed to be a strong prediction, and the webpage can adapt accordingly immediately.
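The multiple-model idea above could be sketched along these lines (a hypothetical illustration; no such combiner exists in the project code):

```javascript
// Combine the predictions of several embedded models: the majority product is
// returned, and a unanimous vote is flagged as a strong prediction the page
// can act on immediately.
function combinePredictions(predictions) {
  const counts = {};
  for (const p of predictions) counts[p] = (counts[p] || 0) + 1;
  // Pick the product predicted most often (majority vote)
  const product = Object.keys(counts).reduce((a, b) =>
    counts[a] >= counts[b] ? a : b);
  return { product: Number(product), strong: counts[product] === predictions.length };
}

console.log(combinePredictions([2, 2, 2])); // { product: 2, strong: true }
console.log(combinePredictions([2, 3, 2])); // { product: 2, strong: false }
```

A weaker (non-unanimous) vote could instead trigger a softer adaptation, such as merely highlighting the column.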
Other implementations possible
A shopping portal with intelligent prediction of the product a user is most likely to buy is just one of the many possible implementations of the proposed concept. Some other possible implementations are:
• A Search Engine Feedback System: Current search engines display results as a list of links, each with a small snippet of text relevant to the search. Most users choose links after reading the text snippet associated with each link, and they spend different amounts of time on different links. Current search feedback is based entirely on mouse clicks, which is in a sense binary feedback (either Yes or No). The feedback system could be made more accurate by determining the relative time a user spent on a link compared to the other links.
• News Content Prediction: An online news website shows several news items under different heads on a page, and different users have different priorities for news. Based on a user's mouse movement activity, relevant news content can be shown to him. For example, if a user is spending more time around football and cricket news headlines than around political headlines, then it can be predicted that he is more interested in sports news and, accordingly, the website can be molded for him.
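The graded feedback suggested in the first idea above could be computed along these lines (a hypothetical sketch; the dwell times are invented):

```javascript
// Score each search result by the share of total hover time spent on it,
// giving a graded relevance signal instead of a binary clicked/not-clicked one.
function relativeDwellScores(dwellTimesMs) {
  const total = dwellTimesMs.reduce((a, b) => a + b, 0);
  return dwellTimesMs.map(t => t / total);
}

// Hypothetical dwell times (ms) over three result links
console.log(relativeDwellScores([8000, 1000, 1000])); // [ 0.8, 0.1, 0.1 ]
```

These fractions could then be fed back to the ranking function in place of raw click counts.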
BIBLIOGRAPHY
Aaltonen, Antti, Aulikki Hyrskykari, and Kari-Jouko Räihä. "101 spots, or how do users
read menus?" Conference on Human Factors in Computing Systems, 1998: 132 -
139.
Arroyo, Ernesto, Ted Selker, and Willy Wei. "Usability tool for analysis of web
designs using mouse tracks." Conference on Human Factors in Computing Systems,
2006: 484 - 489.
Atterer, Richard, and Albrecht Schmidt. "Tracking the interaction of users with AJAX
applications for usability testing." Conference on Human Factors in Computing
Systems, 2007: 1347 - 1350.
Atterer, Richard, Monica Wnuk, and Albrecht Schmidt. "Knowing the User’s Every
Move – User Activity Tracking for Website Usability Evaluation and Implicit
Interaction." ACM.
Balabanovic, Marko, Yoav Shoham, and Yeogirl Yun. "An Adaptive Agent for
Automated Web Browsing." 1997.
Byrne, Michael D, John R Anderson, Scott Douglass, and Michael Matessa. "Eye
tracking the visual search of click-down menus." Conference on Human Factors in
Computing Systems, 1999.
Chen, Mon Chu, John R Anderson, and Myeong Ho Sohn. "What can a mouse
cursor tell us more?: correlation of eye/mouse movements on web browsing."
Conference on Human Factors in Computing Systems, 2001.
Dutta, Partha, Sandip Debnath, and Sandip Sen. "A shopper's assistant."
International Conference on Autonomous Agents, 2001.
Edmonds, Andy. "Why the Mouse Doesn't Always Keep Up with the Eye." 2008.
Guo, Qi, and Eugene Agichtein. "Exploring mouse movements for inferring query
intent." Annual ACM Conference on Research and Development in Information
Retrieval, 2008: 1.
Linden, Greg. "Geeking with Greg Exploring the future of personalized information."
Mueller, Florian, and Andrea Lockerd. "Cheese: tracking mouse movement activity
on websites, a tool for user modeling." Conference on Human Factors in Computing
Systems, 2001.
Pazzani, Michael, and Daniel Billsus. "Learning and Revising User Profiles: The
Identification of Interesting Web Sites." Machine Learning 27, no. 3 (1997): 313 -
331.
Rodden, Kerry, Xin Fu, Anne Aula, and Ian Spiro. "Eye-Mouse Coordination Patterns
on Web Search Results Pages." Conference on Human Factors in Computing
Systems, 2008: 5.
Salzberg, Steven L. "C4.5: Programs for Machine Learning." Machine Learning 16,
no. 3 (1994): 235-240.
Torres, Luis A. Leiva, and Roberto Vivo Hernando. "Real time mouse tracking
registration and visualization tool for usability evaluation on websites."
http://smt.speedzinemedia.com/smt/docs/smt_IADIS07.pdf.
Witten, Ian H, and Eibe Frank. Data Mining: Practical machine learning tools and
techniques. San Francisco: Morgan Kaufmann, 2005.
APPENDIX: SOURCE CODE
HTML final webpage

The HTML code of the final website, capable of tracking the user's mouse movements as well as predicting the relevant product for the user, is as follows:
<body onload="start_It();">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td><p class="oce-first"><span class="bold">NOTE:</span> Surf on this page like you do on a
shopping portal comparison page and decide on a model based on its configuration and buy it.
Thanks</p></td>
</tr>
<tr>
<td> </td>
</tr>
<tr>
<td><table width="100%" border="0" align="center" cellpadding="0" cellspacing="0"
onMouseOver="hiliteColumn(event);" onMouseOut="resetColumn(event);" class="one-column-emphasis">
<colgroup class="oce-first" id="na"></colgroup>
<colgroup id="cg2" class=""></colgroup>
<colgroup id="cg3" class=""></colgroup>
<colgroup id="cg4" class=""></colgroup>
<colgroup id="cg5" class=""></colgroup>
<colgroup id="cg6" class=""></colgroup>
<thead>
<tr>
<th onmouseout="movement_out('a0');" onmouseover="movement_in();">Product Name</th>
<th onmouseout="movement_out('a1');" onmouseover="movement_in();">Lenovo IdeaPad Y650
4185</th>
<th onmouseout="movement_out('a2');" onmouseover="movement_in();">HP Pavilion dv7-
1285dx</th>
<th onmouseout="movement_out('a3');" onmouseover="movement_in();">Sony VAIO VGN-P588E</th>
<th onmouseout="movement_out('a4');" onmouseover="movement_in();">Dell Studio XPS 16</th>
<th onmouseout="movement_out('a5');" onmouseover="movement_in();">Toshiba Satellite A205-
S4617</th>
</tr>
</thead>
<tbody>
<tr>
<td class="oce-first" onmouseout="movement_out('b0');"
onmouseover="movement_in();"> </td>
<td onmouseout="movement_out('b1');" onmouseover="movement_in();"><img src="images/1.gif"
width="120" height="90" border="0" /></td>
<td onmouseout="movement_out('b2');" onmouseover="movement_in();"><img src="images/2.gif"
width="120" height="90" border="0" /></td>
<td onmouseout="movement_out('b3');" onmouseover="movement_in();"><img src="images/3.gif"
width="120" height="90" border="0" /></td>
<td onmouseout="movement_out('b4');" onmouseover="movement_in();"><img src="images/4.gif"
width="120" height="90" border="0" /></td>
The JavaScript file
function autoPredict()
{
setTimeout("predict()",10000);
}
function predict()
{
http.open("GET", "predict.php?userId="+userId, true);
http.onreadystatechange = predictResponse;
http.send(null);
}
function predictResponse()
{
if (http.readyState == 4)
{
predictProduct = http.responseText;
var colName=Number(predictProduct)+1;
document.getElementById("cg2").className="";
document.getElementById("cg3").className="";
document.getElementById("cg4").className="";
document.getElementById("cg5").className="";
document.getElementById("cg6").className="";
document.getElementById("cg"+colName).className="oce-predict";
alert("Product : "+predictProduct);
setTimeout("predict()",10000);
}
}
function handleHttpResponse()
{
if (http.readyState == 4)
{
startIt();
}
}
function handleHttpResponseBought()
{
if (http.readyState == 4)
{
alert("Thanks for Participating");
}
}
function start_It() {
if(done==0)
{
setTimeout("sendData()",2000);
}
if(startpredict==0)
{
++startpredict;
autoPredict();
}
}
function sendData() {
if(flag==0)
{
queue2="";
flag=1;
var query_string = "data.php?userId="+userId+"&queue="+queue1;
queue1="";
}
else
{
queue1="";
flag=0;
var query_string = "data.php?userId="+userId+"&queue="+queue2;
queue2="";
}
http.open("GET", query_string, true);
http.onreadystatechange = handleHttpResponse;
http.send(null);
}
function movement_in(){
cellEntryDate = new Date();
}
function movement_out(cell){
cellExitDate = new Date();
time = cellExitDate.getTime()-cellEntryDate.getTime();
if(done==0)
{
if(flag==0)
{
queue1 = queue1+cell+":"+time+"_";
}
else
{
queue2 = queue2+cell+":"+time+"_";
}
}
}
function bought(product){
done=1;
var query_bought = "bought.php?userId="+userId+"&product="+product;
http.open("GET", query_bought, true);
http.onreadystatechange = handleHttpResponseBought;
http.send(null);
}
function getHTTPObject() {
var xmlhttp;
/*@cc_on
@if (@_jscript_version >= 5)
try {
xmlhttp = new ActiveXObject("Msxml2.XMLHTTP");
} catch (e) {
try {
xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
} catch (E) {
xmlhttp = false;
}
}
@else
xmlhttp = false;
@end @*/
if (!xmlhttp && typeof XMLHttpRequest != 'undefined') {
try {
xmlhttp = new XMLHttpRequest();
} catch (e) {
xmlhttp = false;
}
}
return xmlhttp;
}
function hiliteColumn(e) {
var o = (document.all) ? e.srcElement : e.target;
if (o.nodeName != "TD") return;
document.getElementById("cg"+(o.cellIndex+1)).className="over";
}
function resetColumn(e) {
var o = (document.all) ? e.srcElement : e.target;
if (o.nodeName != "TD") return;
document.getElementById("cg"+(o.cellIndex+1)).className="";
}
The CSS file
body {
margin-left: 0px;
margin-top: 0px;
margin-right: 0px;
margin-bottom: 0px;
text-align: left;
}
colgroup.over {
background: #ebeeff;
}
.oce-first
{
background: #d0dafd;
border-right: 10px solid transparent;
border-left: 10px solid transparent;
min-width:199px;
font-size: 14px;
padding: 12px 15px;
color: #039;
text-align:justify;
}
.oce-predict
{
background: #d0dafd;
border-right: 3px solid #F00;
border-left: 3px solid #F00;
border-top: 3px solid #F00;
border-bottom: 3px solid #F00;
min-width:199px;
font-size: 14px;
padding: 12px 15px;
color: #039;
text-align:justify;
}
table.one-column-emphasis
{
font-family: "Lucida Sans Unicode", "Lucida Grande", Sans-Serif;
font-size: 12px;
width: 100%;
border-collapse: collapse;
color: #969;
}
table.one-column-emphasis th
{
font-size: 14px;
font-weight: bold;
padding: 12px 15px;
color: #039;
text-align:center;
}
table.one-column-emphasis td
{
padding: 10px 15px;
color: #669;
border-top: 1px solid #e8edff;
min-width:166px;
text-align:center;
}
table.one-column-emphasis tr:hover td
{
background: #ebeeff;
text-align: center;
}
table.one-column-emphasis tr:hover td:hover
{
color: #039;
background: #94acff;
}
.bold {
font-weight: bold;
}
.italics {
font-style: italic;
}
.oce-first {
text-align: justify;
}
The PHP scripts
data.php
<?php
$queue=$HTTP_GET_VARS['queue'];
$userId=$HTTP_GET_VARS['userId'];
include("connect.php");
$queueArray=explode("_",$queue);
for($i=0;$i<substr_count($queue,"_");$i++)
{
$values=explode(":",$queueArray[$i]);
mysql_query("INSERT into data
values(\"".$userId."\",\"".$values[0]."\",\"".$values[1]."\")");
}
mysql_close($conn);
?>
connect.php
<?php
$dbhost = 'localhost:8889';
$dbuser = 'root';
$dbpass = 'root';
$conn = mysql_connect($dbhost, $dbuser, $dbpass) or die ('Error connecting to mysql');
$dbname = 'MSc';
mysql_select_db($dbname);
?>
bought.php
<?php
$product=$HTTP_GET_VARS['product'];
$userId=$HTTP_GET_VARS['userId'];
include("connect.php");
mysql_query("INSERT into bought values(\"".$userId."\",\"".$product."\")");
mysql_close($conn);
?>
alignData.php
<?php
include("connect.php");
$result=mysql_query("SELECT * FROM `bought`");
while($row = mysql_fetch_array($result))
{
$result_1=mysql_query("SELECT * FROM `data` WHERE `userID`=\"".$row['userId']."\" order by
`cellID`");
$columnNames="";
$values="";
$row_1 = mysql_fetch_array($result_1);
$previous_column=$row_1['cellID'];
$previous_value=$row_1['time'];
while($row_1 = mysql_fetch_array($result_1))
{
if($previous_column==$row_1['cellID']) {
$previous_value+=$row_1['time'];
}
else {
$columnNames=$columnNames.",".$previous_column;
$values=$values.",\"".$previous_value."\"";
$previous_value=$row_1['time'];
$previous_column=$row_1['cellID'];
}
}
$columnNames=$columnNames.",".$previous_column.",product";
$values=$values.",\"".$previous_value."\",\"".$row['product']."\"";
mysql_query("INSERT INTO finalData(userID".$columnNames.") values
(\"".$row['userId']."\"".$values.")");
mysql_query("DELETE from `bought` where `userId` = \"".$row['userId']."\"");
mysql_query("DELETE from `data` where `userID` = \"".$row['userId']."\"");
}
mysql_close($conn);
?>
predict.php
<?php
include("connect.php");
$totalTime=0;
// Compute this user's total recorded time first so each cell time can be
// normalized below (otherwise $totalTime would stay 0 and the divisions fail)
$result_t=mysql_query("SELECT SUM(`time`) FROM `data` WHERE `userID`=\"".$_GET['userId']."\"");
$row_t=mysql_fetch_array($result_t);
$totalTime=$row_t[0];
$result_1=mysql_query("SELECT * FROM `data` WHERE `userID`=\"".$_GET['userId']."\" order by `cellID`");
$columnNames="";
$values="";
$row_1 = mysql_fetch_array($result_1);
$previous_column=$row_1['cellID'];
$previous_value=$row_1['time'];
while($row_1 = mysql_fetch_array($result_1))
{
if($previous_column==$row_1['cellID'])
{
$previous_value+=$row_1['time'];
}
else
{
$$previous_column=$previous_value/$totalTime;
$previous_value=$row_1['time'];
$previous_column=$row_1['cellID'];
}
}
$$previous_column=$previous_value/$totalTime;
decisionTree();
//neuralNetwork();
function decisionTree()
{
$model_DT=0;
if($b5 <= 0.04509)
if($k4 <= 0.013828)
if($v1 <= 0.000362)
if($r0 <= 0.000626)
if($d5 <= 0.003481)
if($d5 <= 0.001586)
if($g4 <= 0.033267)
if($s3 <= 0.004874)
if($u1 <= 0.002108)
if($f1 <= 0.039667)
if($f4 <= 0.028894)
if($i4 <= 0.004699)
if($d2 <= 0.001173)
if($e5 <= 0.001377)
// Similar nested comparisons for the remaining branches of the decision
// tree have been omitted here. The complete code is available online for
// reference.
}
function neuralNetwork()
{
$Node5=(-0.0209449762256399)
 +($a0*0.0120761574490061)+($a1*-0.0174298014185729)+($a2*-0.0175622955697642)+($a3*-0.000798046164731245)+($a4*-0.00566210278243689)+($a5*-0.00257021437573848)
 +($b0*0.0813554156049207)+($b1*-0.0383601651270091)+($b2*0.0315342748963075)+($b3*0.04750940128612)+($b4*0.00444930879229902)+($b5*0.0447743155601993)
 +($c0*0.0127846301489485)+($c1*0.0167829106398277)+($c2*0.0412283962113621)+($c3*0.0647197008365273)+($c4*0.026137495413712)+($c5*0.0292672102649498)
 +($d0*0.0575247995032596)+($d1*-0.0248903478567491)+($d2*-0.0356248056960633)+($d3*0.0131503378763436)+($d4*-0.00943722882163672)+($d5*0.0254130310753136)
 +($e0*0.0953293388209953)+($e1*-0.0358630730881965)+($e2*0.09184645890614)+($e3*0.0879998946588433)+($e4*-0.0210989430518799)+($e5*0.0236328879965554)
 +($f0*0.0521255666178908)+($f1*0.0562279524027289)+($f2*0.0420766593208718)+($f3*0.0219358641315261)+($f4*0.0500915161629286)+($f5*0.0598788090622592)
 +($g0*-0.0106339935340819)+($g1*0.0158371741591566)+($g2*0.0828753056435395)+($g3*-0.0152508552198513)+($g4*-0.00815101349601804)+($g5*0.0268439313590316)
 +($h0*0.070123678107641)+($h1*-0.0147305324346031)+($h2*0.0517135568746786)+($h3*-0.0117294349734072)+($h4*-0.00594235655570873)+($h5*0.0410639065208286)
 +($i0*-0.00105630930040345)+($i1*-0.00543787837624847)+($i2*0.0603755263497366)+($i3*0.0287693595250936)+($i4*0.0554227984526808)+($i5*0.0600355834517169)
 +($j0*0.0186135251521197)+($j1*0.00984875030922667)+($j2*0.0193290574626347)+($j3*0.021484574396215)+($j4*0.0484829773111019)+($j5*0.0233728871769681)
 +($k0*0.0410110073637687)+($k1*-0.00743846515678319)+($k2*0.0446579060767132)+($k3*0.00789530586935209)+($k4*0.0185589336156669)+($k5*0.0178833473514336)
 +($l0*0.0366297156412459)+($l1*0.0297884220860898)+($l2*0.0450253751867714)+($l3*0.0705159823038729)+($l4*0.074643360814636)+($l5*0.049178643898654)
 +($m0*0.00649293306157912)+($m1*0.0235761949995652)+($m2*0.0282972581223614)+($m3*0.00995247757969736)+($m4*0.0635360916248171)+($m5*-0.0185514952082912)
 +($n0*0.0798799834823821)+($n1*-0.0367274799798666)+($n2*0.0461992904934746)+($n3*0.0354383668658634)+($n4*-0.00123240277220675)+($n5*-0.0150807856098709)
 +($o0*-0.0260784636646052)+($o1*0.0553028912171675)+($o2*0.0802089447351997)+($o3*-0.0235601224487924)+($o4*-0.0281363990127924)+($o5*0.0319917291420718)
 +($p0*-0.0257109331590629)+($p1*-0.0279769700636828)+($p2*0.0433907293866429)+($p3*-0.0310545628159805)+($p4*0.0348153094694314)+($p5*-0.00776438719161176)
 +($q0*-0.0069736497593223)+($q1*0.0161811177301145)+($q2*0.0576906924312276)+($q3*0.0441712928131897)+($q4*0.0165528172670987)+($q5*-0.0274805831321372)
 +($r0*0.0120430047036489)+($r1*-0.000892653621313331)+($r2*0.0868045378672117)+($r3*0.0281943074796785)+($r4*0.0670839346752799)+($r5*0.0110772507057164)
 +($s0*0.0214207237015366)+($s1*-0.032511653106313)+($s2*0.0328856849361516)+($s3*0.0313926662260086)+($s4*0.0111177031525771)+($s5*0.0284289901014687)
 +($t0*0.0428425565992686)+($t1*0.0534413420371503)+($t2*0.0244766875457709)+($t3*0.0647078085232812)+($t4*0.0112235270733354)+($t5*0.0097765520400492)
 +($u0*0.0259846759422365)+($u1*-0.0430507927467189)+($u2*0.107107831659775)+($u3*0.0467301403971514)+($u4*0.0571975966844622)+($u5*-0.0079845822250066)
 +($v0*0.0303173561775128)+($v1*-0.0043169837441232)+($v2*0.0866140345320475)+($v3*0.00261036151061667)+($v4*0.00523185366643474)+($v5*-0.0239702999191261);
// Similar code for the remaining 72 nodes has been omitted, as the full
// listing runs to about 40 pages. The complete code is available online
// for reference.
$max=max((1/(1+(1/pow(2.718282,$Node0)))),
         (1/(1+(1/pow(2.718282,$Node1)))),
         (1/(1+(1/pow(2.718282,$Node2)))),
         (1/(1+(1/pow(2.718282,$Node3)))),
         (1/(1+(1/pow(2.718282,$Node4)))));
// The remainder of predict.php has been omitted here; the complete code
// is available online for reference.
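The activation used in predict.php, `1/(1+(1/pow(2.718282,$x)))`, is algebraically the logistic sigmoid 1/(1+e^-x) with a truncated value of e, and the `$max=max(...)` call is a winner-take-all selection over the output nodes. A small Python sketch (an illustrative cross-check, not part of the original appendix; the node values are hypothetical) confirming both:

```python
import math

def sigmoid_as_in_listing(x):
    # Form used in predict.php: 1 / (1 + 1 / e**x), with e truncated to 2.718282
    return 1 / (1 + (1 / math.pow(2.718282, x)))

def sigmoid(x):
    # Standard logistic form: 1 / (1 + e**-x)
    return 1 / (1 + math.exp(-x))

# The two forms agree up to the listing's truncated value of e
for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    assert abs(sigmoid_as_in_listing(x) - sigmoid(x)) < 1e-5

# Winner-take-all over five output nodes, as in the $max=max(...) call;
# since the sigmoid is monotonic, the winner is the node with the largest sum
node_outputs = [0.12, -0.8, 1.4, 0.3, -0.2]   # hypothetical $Node0..$Node4
activations = [sigmoid_as_in_listing(v) for v in node_outputs]
winner = activations.index(max(activations))  # index 2 here
```

Because the sigmoid is strictly increasing, applying it before `max` never changes which node wins; it only maps the scores into (0, 1).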