
DATA MINING SEMINAR

TOPIC: DATA MINING FOR TELECOMMUNICATIONS

SUBMITTED BY: MAYURI KIRAN ANVEKAR

SUBMITTED TO: RINO CHERIAN

DATE: 30/03/2012

Table of Contents

I. ABSTRACT
II. INTRODUCTION
III. TYPES OF TELECOM DATA
   1. Call Summary Data
   2. Network Data
   3. Customer Data
IV. Mining Sequential Patterns in Telecommunication Database Using Genetic Algorithm
   1. Sequential Patterns Mining
   2. Genetic Algorithm
   3. Mining Sequential Patterns in Telecommunication Database Using Genetic Algorithm
      3.1 Chromosome
      3.2 Genetic Operators
      3.3 Fitness Function
      3.4 SPT-GA Algorithm
   4. Experiment Results
V. Discovering Structural Patterns In Telecommunications Data
VI. CONCLUSION
VII. REFERENCES

I.

ABSTRACT

Telecommunication companies generate a tremendous amount of data. These data include call detail data, which describes the calls that traverse the telecommunication networks; network data, which describes the state of the hardware and software components in the network; and customer data, which describes the telecommunication customers. This chapter describes how data mining can be used to uncover useful information buried within these data sets. Several data mining applications are described and together they demonstrate that data mining can be used to identify telecommunication fraud, improve marketing effectiveness, and identify network faults.

II.

INTRODUCTION

International telecommunications play an important role in message sharing and information passing. In the last decade a dramatic change in the structure of telecommunications companies has taken place, from public monopolies to private companies. The quick development of mobile telephone networks, video calling and Internet technologies has created enormous competitive pressure on the sector. As new competitors arise in the market, telecom companies need intelligent tools to gain profit and withstand the competition. Also, stock market expectations are huge, and investors and financial analysts need tested tools to gain information about how companies perform financially compared to their competitors, what they are good at, who the major competitors are, etc. In other terms, the telecom companies need to benchmark their performance against competitors in order to keep an important role in this market. There is an enormous amount of information about these companies' financial performance that is now publicly available. This amount greatly exceeds our capacity to analyze it; the problem is that we often lack tools to process these data quickly and accurately.
Data mining for telecommunications companies involves the use of simple, traditional and advanced mathematical techniques to analyze large populations of data and deliver insights, forecasts, explanations and predictions of how systems, customers, networks and the marketplace are likely to react to different situations.
The hypercompetitive nature of the telecom industry has created a need to understand customers, to keep them, and to model effective ways to market new products. This creates a great demand for innovative data mining techniques to help understand the new business trends involved, catch fraudulent activities, identify telecommunication patterns, make better use of resources, and improve the quality of services.
to help

III.

TYPES OF TELECOM DATA

The initial step in the data mining process is to understand the data. Here we discuss three main types of telecom data. They are as follows:

1. Call Summary Data


Every time a call is placed on a telecom network, descriptive information about the call is saved as a call detail record for future use. At a minimum, each call detail record will include the originating and terminating phone numbers, the date and time of the call and the duration of the call.
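
The minimum fields listed above can be sketched as a small record type, together with one possible way of rolling individual records up into per-caller summary data. This is only an illustrative sketch: the class name, field names, phone numbers and the choice of summary statistics are assumptions, not taken from any operator's actual record layout.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallDetailRecord:
    """Minimum fields named in the text; the field names are illustrative."""
    originating: str   # originating phone number
    terminating: str   # terminating phone number
    start: datetime    # date and time of the call
    duration_s: int    # duration of the call, in seconds


def summarize_calls(records):
    """Aggregate call detail records into one summary row per originating number."""
    totals = defaultdict(lambda: {"calls": 0, "total_s": 0})
    for rec in records:
        row = totals[rec.originating]
        row["calls"] += 1
        row["total_s"] += rec.duration_s
    return {
        number: {
            "calls": row["calls"],
            "total_s": row["total_s"],
            "avg_s": row["total_s"] / row["calls"],
        }
        for number, row in totals.items()
    }


# Made-up records, for illustration only.
records = [
    CallDetailRecord("555-0100", "555-0199", datetime(2012, 3, 30, 9, 15), 120),
    CallDetailRecord("555-0100", "555-0142", datetime(2012, 3, 30, 18, 5), 300),
    CallDetailRecord("555-0142", "555-0100", datetime(2012, 3, 30, 20, 40), 60),
]
summary = summarize_calls(records)
print(summary["555-0100"])
```

Summaries of this kind describe a customer's calling behaviour in one row, which is usually a more convenient input for mining than the raw call records.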

2. Network Data
Telecommunication networks are extremely complex configurations of equipment, comprised of interconnected components. Each network element is capable of generating error and status messages, which leads to a tremendous amount of network data.

3. Customer Data
Telecommunication companies, like other businesses, have millions of customers. Out of necessity, they have to maintain a database of information on these customers. This information will include name and address, and may include other information such as service plan, credit score, contract information, family income and payment history.

IV.

Mining Sequential Patterns in Telecommunication Database Using Genetic Algorithm

Sequential pattern mining is the process of finding the relationships between occurrences of sequential events, to find if there exists any specific order of the occurrences. The extraction of sequential patterns is not polynomial in time of execution. Other algorithms for performing sequential pattern mining can assure optimum solutions, but they do not take into consideration the time taken to reach such solutions. The algorithm described here, based on genetic concepts, instead gives a non-optimal solution but in a reasonable (polynomial) time of execution.

1. Sequential Patterns Mining

The goal is to find all subsequences from the given sets of transactions; this approach is useful when the data to be mined have some sequential nature, that is, to deal with databases that have time-series characteristics.
A sequential pattern can be defined as follows.
Definition: Let $I = \{x_1, \ldots, x_n\}$ be a set of items. An itemset is a non-empty subset of items, and an itemset with $k$ items is called a $k$-itemset. A sequence $s = (X_1 \ldots X_l)$ is an ordered list of itemsets, and an itemset $X_i$ ($1 \le i \le l$) in a sequence is called a transaction. In a set of sequences, a sequence $s$ is maximal if $s$ is not contained in other sequences.
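
The definition does not spell out what "contained" means. A minimal sketch, assuming the usual reading from the sequential pattern mining literature (sequence a is contained in sequence b when the itemsets of a can be matched, in order, to supersets among the itemsets of b), could look like this. The function names and the example country codes are illustrative, and the sequences are assumed to be distinct.

```python
def is_contained(a, b):
    """True if sequence a is contained in sequence b.

    Each itemset of a must be a subset of some itemset of b, and the
    matched itemsets of b must appear in the same order as in a.
    """
    pos = 0
    for itemset in a:
        # Greedily advance to the first remaining itemset of b that covers it.
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


def maximal_sequences(sequences):
    """Keep the sequences that are not contained in any other sequence."""
    return [
        s for i, s in enumerate(sequences)
        if not any(i != j and is_contained(s, t) for j, t in enumerate(sequences))
    ]


s1 = [{91}, {92}]
s2 = [{91, 249}, {973}, {92}]
s3 = [{92}, {91}]

print(is_contained(s1, s2))             # s1 fits inside s2 in order
print(is_contained(s3, s2))             # the order of 92 and 91 is reversed
print(maximal_sequences([s1, s2, s3]))  # s1 is dropped
```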

2. Genetic Algorithm
Genetic Algorithm (GA) is a part of evolutionary computing, which is a rapidly growing area of artificial intelligence. A genetic algorithm starts with a set of solutions (represented by chromosomes) called a population. Solutions from one population are taken and used to form a new population by mutation and crossover. This is motivated by a hope that the new population will be better than the old one. The best solutions, which are selected to form new solutions (offspring), are selected according to their fitness. This is repeated until some condition (for example number of populations or improvement of the best solution) is satisfied. To measure the quality of a solution, a fitness function is assigned to each chromosome in the population.
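
The loop described above can be sketched in a few lines. This is a generic toy example, not the SPT-GA algorithm: the fitness function simply counts the 1 bits, the stopping condition is a fixed number of generations, and the selection scheme (keep the fitter half as parents, copy the best chromosome unchanged) and the 0.1 mutation chance are assumptions made for illustration.

```python
import random


def run_ga(fitness, n_bits, pop_size=20, generations=50, seed=0):
    """Evolve bit-string chromosomes; return (best chromosome, best fitness per generation)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Rank by fitness; the fitter half become parents.
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]
        offspring = [ranked[0][:]]  # copy the best solution unchanged
        while len(offspring) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)      # crossover point
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.1:              # occasional single-bit mutation
                child[rng.randrange(n_bits)] ^= 1
            offspring.append(child)
        population = offspring
    return max(population, key=fitness), history


best, history = run_ga(fitness=sum, n_bits=16)
print(best, history[0], history[-1])
```

Because the best chromosome is always carried over, the best fitness never decreases from one generation to the next, which makes the effect of the loop easy to observe.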

3. Mining Sequential Patterns in Telecommunication Database Using Genetic Algorithm
The genetic algorithm used to mine sequential patterns in the Telecommunication Database is called the SPT-GA algorithm. Firstly, we present our chromosome structure and encoding schema and the genetic operators, and then we define the fitness assignment and selection criteria. Finally, we give the structure of the SPT-GA algorithm.
3.1 Chromosome.
The structure of the GA chromosomes and how they are represented is as follows:
3.1.1 Structure. In the Telecommunication Database, country code values are used for creating the chromosomes. In the algorithm, chromosomes have a fixed length, equal to the number of country codes that are available in the database, as in Figure 1.

3.1.2 Representation. In Genetic Algorithm, there are many alternatives to represent a chromosome depending on the problem, like binary and integer representation. To decide which representation is better to be used for Sequential Pattern rules, we should use short, low-order schemata that are relevant to the underlying problem and relatively unrelated to schemata over other fixed positions. Also, we should select the smallest alphabet that permits a natural expression of the problem, as presented in [8].
In the SPT-GA algorithm, we choose the binary representation because it is the most suitable for our algorithm, it needs less space, and it represents the needed information (element occurred or not).
For example, using the Telecommunication Database in Table 1, if a sequence is equal to <249, 973, 91>, it can be represented as in Figure 2.

Additionally, as you can see in Figure 2, order cannot be extracted directly. To solve this problem, we decided to associate the transaction sequence as metadata with each chromosome. For that, we use Vertical Bitmap Representation, which makes the SPT-GA algorithm take less time and space to be executed.
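
A minimal sketch of this encoding follows. It assumes a small hypothetical list of country codes (the real database has 60) and stores the call order as a plain tuple next to the bit string; the Vertical Bitmap Representation itself is not reproduced here.

```python
def encode(sequence, country_codes):
    """Return (bits, order) for a call sequence.

    bits has one position per country code (1 = the code occurs in the
    sequence, 0 = it does not); order keeps the original call order,
    which the bits alone cannot express.
    """
    present = set(sequence)
    unknown = present - set(country_codes)
    if unknown:
        raise ValueError(f"codes not in the database: {sorted(unknown)}")
    bits = [1 if code in present else 0 for code in country_codes]
    return bits, tuple(sequence)


# Hypothetical set of country codes, for illustration only.
country_codes = [20, 91, 92, 249, 973]

bits, order = encode([249, 973, 91], country_codes)
print(bits)   # [0, 1, 0, 1, 1]
print(order)  # (249, 973, 91)
```

The sequence <91, 249, 973> would produce exactly the same bits, which is why the order has to travel with the chromosome as metadata.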
3.2 Genetic Operators.
SPT-GA uses genetic operators to generate the offspring of the existing population.
The genetic algorithm works on chromosomes that are represented by a binary string, where each bit corresponds to an element occurrence (0 or 1); the number of bits is equal to the number of items. After encoding of the solution domain, many chromosome solutions are initially generated at random, depending on the population size.
The genetic algorithm will select the chromosomes according to our fitness function. The major measure that is used by our algorithm is the Sequential Interestingness Measure (SIM).
Then crossover takes place; it selects genes from parent chromosomes and creates a new offspring. The simplest way to do this is to choose randomly some crossover point, copy everything before this point from the first parent, and then copy everything after the crossover point from the second parent. Crossover is shown in Figure 3.
After a crossover is performed, mutation takes place. This is to prevent all solutions in the population from falling into a local optimum of the solved problem. Mutation randomly changes the new offspring. For binary encoding we can switch a few randomly chosen bits from 1 to 0 or from 0 to 1. Mutation is shown in Figure 3.
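
The two operators described above can be sketched as follows. The function names are illustrative, and the mutation rate in the example is the 0.001 value reported in the experiments; everything else is a plain textbook version of single-point crossover and bit-flip mutation rather than the authors' own code.

```python
import random


def single_point_crossover(parent1, parent2, point):
    """Copy everything before `point` from parent1 and the rest from parent2."""
    return parent1[:point] + parent2[point:]


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chromosome]


rng = random.Random(42)
parent1 = [1, 1, 1, 1, 1, 1]
parent2 = [0, 0, 0, 0, 0, 0]

point = rng.randrange(1, len(parent1))  # random crossover point
child = single_point_crossover(parent1, parent2, point)
mutated = mutate(child, rate=0.001, rng=rng)
print(point, child, mutated)
```

With a rate as low as 0.001, most offspring pass through mutation unchanged; the operator only occasionally nudges a chromosome away from its parents.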

3.3 Fitness Function.
The fitness function defines a new criterion called the sequential interestingness measure (SIM). This method discovers sequential patterns corresponding to the interests of users without using background knowledge.
Definition: The sequential interestingness measure of a rule $A \rightarrow C$ is:

$$\mathrm{SIM}(A \rightarrow C) = \min_{C_i \subseteq C} \left\{ \left( \mathrm{Confidence}(A \mid C_i) \right)^{\alpha} \right\} \cdot \mathrm{Support}(A \rightarrow C)$$

Where:
$\alpha$ ($\alpha \ge 0$) is a confidence priority that represents how important the frequency of the pattern is.
$C_i$ is a subsequence of $C$; it represents a condition of sequence $C$.
$i = 1, \ldots, n$, where $n$ is the number of conditions in $C$.
The first term of the criterion evaluates that the frequencies of the subpatterns are not frequent, while the second term evaluates that the frequency of the pattern is frequent.
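
As a worked illustration, the measure can be read as the smallest value of Confidence(A|Ci) over the conditions Ci, raised to the confidence priority, combined with the support of the rule. The sketch below assumes that the two terms are multiplied and that the confidence and support values have already been computed elsewhere; the function name and the numbers are made up for the example.

```python
def sim(confidences, support, alpha=1.0):
    """Sequential interestingness of a rule from precomputed statistics.

    confidences: Confidence(A|Ci) for each condition Ci of C
    support:     Support of the rule A -> C
    alpha:       confidence priority (>= 0)

    Assumption: the first term (minimum powered confidence) is
    multiplied by the second term (support).
    """
    if not confidences:
        raise ValueError("at least one condition is required")
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    return min(c ** alpha for c in confidences) * support


print(sim([0.5, 0.8], support=0.4, alpha=1))  # weakest condition 0.5, times 0.4
print(sim([0.5, 0.8], support=0.4, alpha=2))  # larger alpha penalizes it more
print(sim([0.5, 0.8], support=0.4, alpha=0))  # alpha = 0 leaves only the support
```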
3.4 SPT-GA Algorithm.
The SPT-GA algorithm that was proposed is described in this section. In Figure 4, the pseudo code of the SPT-GA algorithm is presented.
ALGORITHM
Input:
    Initial samples size: N
    Maximum generations: G
    Threshold value: T
    Minimum fitness: minF
Output:
    Malpractice user no lists
Begin
    Step 1: Initialize counter count = 0
    Step 2: Initialize population P of size N
    Step 3: For each chromosome i in P
                If S1 and S2 are given then
                    Measure the fitness F(i, S1, S2, T)
                Else if S1 is given then
                    Measure the fitness F(i, S1, T)
                Else if S2 is given then
                    Measure the fitness F(i, S2, T)
                Else
                    Measure the fitness F(i, T)
    Step 4: Mutate and crossover P
    Step 5: If (fitness >= minF)
                Select fittest rules from P
    Step 6: Set count = count + 1
    Step 7: If (count > G) then
                S = P
                Stop
            Else
                Go to Step 3
End
After the encoding of the dataset, using bitmap representation, the algorithm starts by selecting individuals for the initial population. Then the following processes are repeated until the pre-specified maximum number of generations is reached. The fitness values are determined for each selected individual, given the rule antecedent or consequent. The fittest rules in P, those whose fitness is larger than or equal to the minimum fitness, will be selected. Giving the antecedent or consequent is not mandatory, but it will reduce the time of search and extract more desired rules.
Existing chromosomes are used in generating new ones by applying crossover and mutation operators. Chromosomes survive based on their fitness used in the process. This way, the interesting set is determined and the target is achieved.

4. Experiment Results
The results of some experiments on a telecommunication database are presented and analyzed as follows. A Telecommunication Database is taken from a Telecommunication Company; it has 1091 transactions and 60 country codes. The SPT-GA algorithm is written in the MATLAB programming language. The user can tune the population size (N), generations (G), confidence priority (α) and minimum fitness (minF). The crossover probability used is 0.8, while the mutation probability is 0.001. The output of this experiment is a text file that includes the interesting rules that represent the most suitable telecommunication sequences. For example, the rule of Figure 5 tells us that when country code 91 is called, 40% of callers will call country code 92 afterwards.
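
A rule such as "91 is followed by 92 for 40% of callers" can be checked directly on the call sequences. The sketch below counts, among the sequences that contain the antecedent, how many contain the consequent at a later position. The function name and the six toy sequences are invented for illustration and are not data from the experiment.

```python
def rule_confidence(sequences, antecedent, consequent):
    """Share of sequences containing `antecedent` in which `consequent` occurs afterwards."""
    with_antecedent = 0
    followed = 0
    for seq in sequences:
        if antecedent in seq:
            with_antecedent += 1
            first = seq.index(antecedent)
            if consequent in seq[first + 1:]:
                followed += 1
    return followed / with_antecedent if with_antecedent else 0.0


# Toy call sequences (country codes in call order), one per caller.
calls = [
    [91, 92],
    [91, 249, 92],
    [91, 973],
    [92, 91],
    [91],
    [249, 973],
]

confidence = rule_confidence(calls, 91, 92)
print(confidence)  # 2 of the 5 callers who dialled 91 called 92 afterwards
```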

In the SPT-GA algorithm, there are four parameters that must be determined by a user: number of generations (G), population size (N), confidence priority (α) and minimum fitness (minF). The first experiment sets N = 20, α = 1 and minF = 0 with G = [20...200]. The second experiment sets G = 20, α = 1 and minF = 0 with N = [20...200]. The two experiments were done on two days of calls with 1091 transactions and 60 country codes.
According to the experiment, Figure 6 shows the time, in seconds, spent by the SPT-GA algorithm to extract the best rules, while Figure 7 shows the average fitness of the final output. Both figures show the result related to increasing the generations and population size.

The tests showed that when the number of generations increases, the running times stay close to each other, but when the population increases, the time increases. From the experiments, it is observed that increasing the number of generations takes less time than increasing the population size. However, as shown in Figure 6, either way the GA does not take a long time; it is only a matter of seconds. Moreover, the tests also showed that when the generation and the population increase, the best fitness values stay close to each other. In addition, increasing the generation and the population will not guarantee a large improvement of average fitness.

V.

Discovering Structural Patterns In Telecommunications Data

The SUBDUE system is a structural discovery tool that finds substructures in a graph-based representation of structural databases using the minimum description length (MDL) principle. SUBDUE discovers substructures that compress the original data and represent structural concepts in the data. Once a substructure is discovered, the substructure is used to simplify the data by replacing instances of the substructure with a pointer to the newly discovered substructure. The discovered substructures allow abstraction over detailed structures in the original data. Iteration of the substructure discovery and replacement process constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying levels of interpretation that can be accessed based on the specific goals of the data analysis.
SUBDUE represents structural data as a labeled graph. Objects in the data map to vertices or small subgraphs in the graph, and relationships between objects map to directed or undirected edges in the graph. A substructure is a connected subgraph within the graphical representation. This graphical representation serves as input to the substructure discovery system. Figure 8 shows a geometric example of such an input graph. The objects in the figure become labeled vertices in the graph, and the relationships become labeled edges in the graph. The graphical representation of the substructure discovered by SUBDUE from this data is also shown in Figure 8. One of the four instances of the substructure is highlighted in the input graph. An instance of a substructure in an input graph is a set of vertices and edges from the input graph that match, graph theoretically, to the graphical representation of the substructure.

Figure 8: Example Substructure in Graph Form


Experiment Results:
In a study, upon running the SUBDUE system on the data in graph form, the six substructures output by the system were analyzed. Figure 9 shows two of these six substructures. The substructures in this figure show that the following patterns are common:
1. Local calls originating between 11:00am and 2:30pm with 30 to 45 minute durations, and
2. Long distance calls originating between 6:00pm and 10:00pm with 5 to 20 minute durations.

Figure 9: Telecom pattern discovered by Subdue.
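
To make the graph form concrete, the sketch below stores each call as a small labeled graph (a "call" vertex joined by labeled edges to its attribute values) and counts how many call graphs contain a given substructure. It is only a simplified illustration: it does not perform SUBDUE's MDL-guided search, and the helper names, the attribute bands and the sample calls are assumptions.

```python
def call_to_graph(call_id, call_type, start_band, duration_band):
    """One call as a labeled graph: a 'call' vertex with three labeled edges."""
    vertices = {call_id: "call"}
    edges = []
    for edge_label, value in (
        ("type", call_type),
        ("start", start_band),
        ("duration", duration_band),
    ):
        vertex_id = f"{call_id}:{edge_label}"
        vertices[vertex_id] = value                    # labeled vertex
        edges.append((call_id, edge_label, vertex_id))  # labeled, directed edge
    return vertices, edges


def count_instances(graphs, substructure):
    """Count call graphs containing every (edge label, vertex label) pair of the substructure."""
    count = 0
    for vertices, edges in graphs:
        pairs = {(label, vertices[target]) for _, label, target in edges}
        if substructure <= pairs:
            count += 1
    return count


# Made-up calls, already bucketed into time and duration bands.
graphs = [
    call_to_graph("c1", "local", "11:00-14:30", "30-45 min"),
    call_to_graph("c2", "local", "11:00-14:30", "30-45 min"),
    call_to_graph("c3", "long distance", "18:00-22:00", "5-20 min"),
    call_to_graph("c4", "local", "18:00-22:00", "5-20 min"),
]

local_midday = {("type", "local"), ("start", "11:00-14:30"), ("duration", "30-45 min")}
print(count_instances(graphs, local_midday))
```

A substructure that occurs in many call graphs is a good candidate for compression, since every instance can be replaced by a single pointer to it.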

VI.

CONCLUSION

Genetic Algorithm is applied to find frequent sequences in a Telecommunication Database, to help telecommunication companies know the country codes that have a relation between them. So, the telecommunication companies can estimate the countries that have a specific order of occurrences and give a discount on the calls to these countries. The SPT-GA algorithm utilizes the property of evolutionary algorithms, discovering the best rules in a short time with meaningful results.

VII. REFERENCES
1. Data Mining In Telecommunications; Gary M. Weiss
2. Data Mining In Telecommunications And Studying Its Status In Iran Telecom Companies And Operator; Jamal Sophieh
3. Data Mining And CRM In Telecommunications; D. Camilovic
4. Genetic Algorithms; William H. Hsu
5. A Fraud Detection Approach in Telecommunication using Cluster GA; V. Umayaparvathi & Dr. K. Iyakutti
