
Term Paper

Business Data Mining and Decision Making

Application of Data Mining in the Telecommunication Industry

Krishna Kumar Balaraman 2009024


Presented to

Prof. V Nagadevara
Quantitative Methods & Information Systems

OCTOBER 22, 2011

Indian Institute of Management Bangalore

Table of Contents
Introduction
Privacy Concerns in datamining in telecommunication industry
Customer Segmentation and Analysis using Clustering technique
Reducing Customer Churn
Customer Fraud Prevention
Network Operation and Maintenance Management
Telecommunication Software Quality
Classification of telecommunications companies
Insights for future
Conclusion
Reference


Introduction
Data mining is the process of discovering interesting patterns and knowledge from large amounts of data, in which intelligent methods are applied to extract data patterns [26]. The telecommunications industry is one of the earliest adopters of data mining. Telecom companies (telcos) generate huge amounts of data, from customer details to call patterns to network usage, and this data is collected into the telecom data warehouse. A data warehouse can be described as a data set that is subject-oriented, integrated, relatively stable and reflective of historical changes, and that is used to support decision-making in management [2]. Telcos carry out data mining on the huge data sets held in this warehouse. The basic data mining techniques include association rules, classification, clustering, regression analysis, sequence analysis, discriminant analysis, outlier analysis, neural networks, fuzzy logic, etc.

I have been involved in the development of products for the telecom industry for close to two decades, working on products ranging from consumer devices to network products. Currently I am working on optical network management software which is used to monitor the critical backbone infrastructure. The amount of network data captured through alarms and operational logs is large, and making meaningful real-time decisions based on it requires appropriate data mining techniques.

The telecommunication industry has grown by leaps and bounds in the past few decades and generates a vast amount of data. Call detail records contain information about every call made, and billions of such records are created worldwide every month. According to the literature, AT&T long distance customers generated over 300 million call detail records per day in 2001 [1].
Such volumes of data offer a large opportunity for data mining, which is useful in the areas of marketing and fraud prevention. Highly evolved network management systems keep a pulse on the operation of the network and collect a vast amount of data in the form of alarms and network usage. This information can be mined to support better network deployment and utilization, and network planning tools benefit immensely from the predictive models developed by analyzing existing data. It is therefore easy to appreciate why the telecommunication industry is one of the early adopters of data mining techniques.

The nature of the data collected from telecommunication networks poses many interesting challenges for data mining. The scale of the data, with billions of records in raw time-series form, makes analysis difficult, and the data must be summarised into a form that is useful for analysis and for building rules and models. The events to be predicted, such as network failure (networks have a 99.999% uptime requirement) and telephone fraud, are rare, so the technique chosen to develop rules and models must be able to cope with rare events in vast amounts of data. Further, applications such as fraud detection apply the models and rules in real time and therefore need real-time performance.


The steps of data preparation are quite important in analyzing the data generated from telecommunication networks. The basic forms of data collected are broadly classified as call detail records, network records, and customer records.

[Figure 1: The general structure of data mining in the telecommunications industry. Components shown: decision makers; knowledge base / expert systems; model / algorithm evaluation; data mining; model base; data warehouse; external data; call detail records; network records; customer records.]


Call detail records hold descriptive information about each call, and the number of such records tends to be huge. As the purpose of data mining is to extract information at the customer level and not at the level of individual calls, call detail records are not mined as they are. Instead, each customer is summarised into a single record describing that customer's calling pattern. How useful this description is depends on a suitable choice of summary variables and features. Based on the calls received and made by a customer in a time frame P, Weiss [1] lists a few features which may summarise the customer:

- average call duration
- % no-answer calls
- % calls to/from a different area code
- % of weekday calls (Monday to Friday)
- % of daytime calls (9 am to 5 pm)
- average number of calls received per day
- average number of calls originated per day
- number of unique area codes called during P

These can be used to characterise the customer as a business or a residential caller or as a telemarketer, to find the tip-over from peak business to residential use, etc. Sometimes these summary descriptions, also called signatures, have to be updated in real time for millions of phone lines, which means that the features (summaries) must be relatively short and simple so that they can be updated quickly and efficiently.

The network records are generated by the performance and fault management reports of the thousands of components connected in a network. Nowadays, the network is a heterogeneous combination of different kinds of technologies and interconnections, which adds complexity to the isolation of problems. The data collected includes strings indicating the nature of the problem, the timestamp, the network elements involved, etc. Due to the critical nature of the operations, rule-based corrective actions are automatically initiated in many circumstances.
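The customer-level summary features that Weiss lists above can be illustrated with a small sketch. The `Call` record layout, the field names and the choice to average duration over answered calls only are assumptions made for this example and are not taken from [1].

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Call:
    # Hypothetical record layout, assumed for illustration only.
    start: datetime      # when the call started
    duration_s: int      # 0 means the call was not answered
    caller_area: str     # area code of the originating line
    callee_area: str     # area code of the receiving line
    outgoing: bool       # True if the customer originated the call


def summarise(calls, days_in_period):
    """Collapse one customer's calls in period P into a single feature record."""
    n = len(calls)
    if n == 0:
        return None
    answered = [c for c in calls if c.duration_s > 0]
    originated = [c for c in calls if c.outgoing]
    return {
        # Assumption: average duration is taken over answered calls only.
        "avg_call_duration_s": (sum(c.duration_s for c in answered) / len(answered)
                                if answered else 0.0),
        "pct_no_answer": 100.0 * (n - len(answered)) / n,
        "pct_other_area": 100.0 * sum(c.caller_area != c.callee_area for c in calls) / n,
        "pct_weekday": 100.0 * sum(c.start.weekday() < 5 for c in calls) / n,
        "pct_daytime": 100.0 * sum(9 <= c.start.hour < 17 for c in calls) / n,
        "avg_received_per_day": (n - len(originated)) / days_in_period,
        "avg_originated_per_day": len(originated) / days_in_period,
        "unique_area_codes_called": len({c.callee_area for c in originated}),
    }


# Made-up calls for one customer over a 7-day period.
calls = [
    Call(datetime(2011, 10, 17, 10, 0), 120, "080", "080", True),   # Monday, daytime
    Call(datetime(2011, 10, 18, 20, 0), 0, "080", "044", True),     # Tuesday, no answer
    Call(datetime(2011, 10, 22, 11, 0), 240, "011", "080", False),  # Saturday, received
    Call(datetime(2011, 10, 19, 9, 30), 60, "080", "044", True),    # Wednesday, daytime
]
sig = summarise(calls, days_in_period=7)
```

Each value in `sig` is a simple count or ratio, which is in keeping with the requirement that signatures be short and simple enough to update quickly for millions of lines.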
Data mining techniques can be used to identify faults by automatically extracting knowledge from the network data. This involves real-time data streams that have to be summarised appropriately for effective predictive and/or corrective action.

Customer records maintained by the telcos hold many attributes of the customer, from name and address to plan and average usage. Demographic and econometric data also form part of the customer record and can be used in conjunction with the call detail records to form rules for cross-selling and up-selling opportunities and for fraud detection.

Data mining applications for telcos include subscription fraud detection and superimposition fraud detection, using deviation methods and outlier analysis on real-time data, call detail records and customer data. For example, a rule for detecting fraud could be: if a person whose average international call lasts five minutes has a call going on for more than an hour, he or she may be a victim of cloning fraud. The cost of misclassification is very high in such cases, so the data mining technique and the rules developed have to be sensitive to the cost of letting a suspected fraudulent call go through versus blocking it. The rarity of fraud events makes it harder for data mining techniques to produce robust rules. In such cases, the fraud cases in the training dataset must be oversampled to compensate for their rarity and reduce the skew.
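The oversampling step mentioned above can be sketched as follows. This is a minimal illustration of random oversampling, assuming each training record is a dictionary with a boolean `fraud` label; the function name and record layout are hypothetical.

```python
import random


def oversample_minority(records, label_key="fraud", seed=0):
    """Duplicate randomly chosen minority-class records until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [r for r in records if r[label_key]]
    neg = [r for r in records if not r[label_key]]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        raise ValueError("need at least one example of each class")
    # Draw extra minority records with replacement to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Made-up training set: 98 legitimate calls and 2 fraudulent ones.
training = [{"duration_min": 5, "fraud": False} for _ in range(98)]
training += [{"duration_min": 70, "fraud": True} for _ in range(2)]
balanced = oversample_minority(training)
```

Only the training split should be rebalanced in this way; a test set is normally left at its natural class ratio so that measured error rates reflect real traffic.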


Data mining is also used for marketing purposes and customer profiling. Such usage raises privacy and anti-trust concerns. While certain countries have laws to address this, telcos can use data mining to devise programs to retain customers, increase customer lifetime value and profitability, and reduce churn through predictive model development. Classification and clustering are the two main techniques in customer segmentation; neural network based techniques are also used to mine telco data for customer segmentation and related customer analysis. In telecommunications, association rule learning can, for example, be used on call detail data to identify pairs of customers that frequently call each other, which in turn can be used to identify so-called calling circles.

Network fault isolation and corrective action models are developed using classification techniques and correlation models built from fault and performance management records. The time series data is summarized as a set of classified examples so that data mining can develop effective predictive models. Data mining methods (Table 1) are also used for optimizing network deployment and international roaming agreements. In short, data mining can help telcos handle the four key challenges that they face today, summarized by Pareek (2007) as the 4 Cs: consolidation, competition, commoditisation and customer service [10].
Data mining application areas: Marketing, Sales and CRM; Fraud Detection; Network Management

Data mining techniques: Association; Classification; Clustering; Forecasting; Regression; Sequence discovery; Visualization; Outliers detection; Deviation detection; Statistical modeling; Dynamic clustering; Classification; Visualization for pattern recognition; Classification; Prediction; Sequence analysis; Time-series data analysis

Data mining methods: Association rules; Decision trees; Genetic algorithms; Multiple factor analysis; Neural networks; K-Nearest Neighbour; Linear/logistic regression; Visualization methods; Anomaly detection techniques; Rule discovery; Clustering algorithms; Bayesian rules; Visualization methods for recognizing unusual patterns; Markov models; Neural networks; Genetic algorithms; Bayesian Belief Networks; Rough sets; Classification trees; Self-organizing maps

Table 1: Data Mining Techniques and Methods Applied in Telecommunications [10]
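The idea of finding pairs of customers that frequently call each other, mentioned in the introduction as a step towards calling circles, can be sketched with simple pair counting. This is not a full association rule miner: it only applies a support threshold to caller/callee pairs, and the function name, the input format and the threshold are assumptions made for illustration.

```python
from collections import Counter


def calling_pairs(cdrs, min_calls_each_way=2):
    """Return customer pairs that called each other at least min_calls_each_way times in each direction."""
    directed = Counter((a, b) for a, b in cdrs if a != b)
    pairs = set()
    for (a, b), n in directed.items():
        # Keep the pair only if the reverse direction is also frequent.
        if n >= min_calls_each_way and directed.get((b, a), 0) >= min_calls_each_way:
            pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


# Made-up (caller, callee) records.
cdrs = [("A", "B"), ("B", "A"), ("A", "B"), ("B", "A"),
        ("A", "C"), ("A", "C"), ("C", "D")]
pairs = calling_pairs(cdrs)
```

Here A and B qualify because each calls the other twice, while A and C do not because the calls go in one direction only.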


Privacy concerns in datamining in telecommunication industry


With increasing legal protection of privacy, and with customer data giving telcos a competitive edge, it has become important to ensure data privacy while mining the telco database for purposes such as marketing, differentiated services and customer retention. Some data mining techniques can themselves be used to preserve privacy, which assures customers that their data is not misused and also encourages inter-company collaboration on predictive models for issues such as fraud detection.

The way the data is partitioned matters when selecting the data abstraction method. When the data is horizontally partitioned, the same schema is used to store data in a distributed manner across sites. When the data is vertically partitioned, different schemas are used, i.e. different sites collect different information on the same entity. In the telecommunication world, one example of horizontal and vertical partitioning involves two telecommunication companies and a third party. When the two companies plan a joint system for detecting misuse of mobile phones (fraud detection), each maintains a database of daily aggregated customer calling data, such as the number of international calls associated with each customer on a given day. The information on previous frauds can be stored in another database which is maintained by the third party [12].

Two approaches to preserving data privacy are:

a) Data transformation/randomization: This involves modifying sensitive data so that it loses its sensitive meaning but still retains the statistical properties of interest. Privacy is then preserved by revealing only randomized and transformed versions of the data, for instance data perturbed by a randomization algorithm that maintains those statistical properties.

b) Secure multi-party computation: This is a computation performed by multiple parties, each of which possesses a part of the input needed to perform the computation. At the end of the computation, the parties should have learned only the result of the computation. Secure sum, secure set union, secure size of set union, etc. are used as building blocks in such techniques.

When using the dataset for data mining, the telcos can use the following techniques to maintain privacy.

Privacy Preserving Classification techniques address the questions of how to train a classifier without revealing the training data itself, and how to classify new unseen data elements without revealing those data elements. Classification techniques such as Privacy Preserving Nearest Neighbor Search for horizontally partitioned data and a Naïve Bayesian classifier for vertically partitioned databases have to be tailored to ensure privacy.

Privacy Preserving Association Rule Learning uses data transformation/randomization and secure multi-party computation techniques to ensure privacy preservation.

Privacy Preserving Clustering Techniques use data transformation/aggregation in such a way that clustering performed on the distorted data is still valid. The Multidimensional Scaling (MDS) technique, in which a set of points in a high-dimensional space is transformed to a lower-dimensional one while preserving the relative distances between pairs of points, is also used for privacy preservation.
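The secure sum building block named under secure multi-party computation can be illustrated with a single-process simulation of the usual ring-style protocol. This is a sketch under stated assumptions: the parties are honest and do not collude, all values are non-negative integers whose total is below `modulus`, and the function name and parameters are chosen for this example only.

```python
import random


def secure_sum(private_values, modulus=2**61 - 1, seed=None):
    """Simulate a ring-based secure sum: each party sees only a masked running total."""
    rng = random.Random(seed)
    mask = rng.randrange(modulus)                    # known only to the initiating party
    running = (mask + private_values[0]) % modulus   # initiator adds its value to the mask
    for value in private_values[1:]:
        # Each later party adds its own value and passes the masked total on.
        running = (running + value) % modulus
    # Back at the initiator, removing the mask reveals only the grand total.
    return (running - mask) % modulus


# Made-up daily counts of suspicious international calls held by three parties.
total = secure_sum([120, 75, 9], seed=1)
```

In a real deployment the mask would come from a cryptographic random source and the running total would travel over the network between sites; the simulation only shows why no intermediate party learns another party's individual value.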

Customer Segmentation and Analysis using Clustering technique


Customer call records are used by telcos to segment their customers for pricing and marketing strategies. Clustering techniques identify groups which can be targeted with specific and suitable promotional programs so as to increase the effectiveness of business campaigns. Cluster analysis is a statistical technique that identifies a set of groups that both minimize within-group variation and maximize between-group variation based on a distance or dissimilarity function; its aim is to find an optimal set of clusters [3]. The K-means method is a very popular approach to clustering because of its simplicity of implementation and fast execution, and it has been widely used in market segmentation, pattern recognition, information retrieval, etc. [3].

The K-means algorithm calculates the centers of the clusters iteratively. For a dataset D = {x_i; i = 1..n} with K clusters whose centers are C = {C_k; k = 1..K}, an item x_i is assigned to the cluster whose centroid C_k has the smallest Euclidean distance to the observation. The time complexity of this process is O(nkd), where n is the total number of items, k is the number of clusters initially set, and d is the dimension of the data vectors. The running time of the algorithm is therefore relatively large on large datasets. Qin, Zheng, and Huang [3] have proposed an improved K-means algorithm that reduces the computational complexity and hence the computation time for large datasets such as telecom call records.

Ren, Zheng and Wu [5] have proposed a genetic algorithm for telecommunication customer subdivision. First, the features of telecommunication customers (such as calling behavior and consuming behavior) are extracted. Then, the similarities between the multidimensional feature vectors of the customers are computed and mapped to distances between samples on a two-dimensional plane. Finally, the distances are adjusted by the genetic algorithm to gradually approximate the similarities [4].

In customer segmentation, the R (Recency), F (Frequency) and M (Monetary) factors matter in making meaningful decisions. The specific definitions of the RFM model are as follows [3]:

Recency of the last consumer: Recency refers to the interval between the time of the latest consuming behavior and the present. The shorter the interval, the greater R is.


Frequency of the consumers: Frequency refers to the number of transactions in a certain period of time, for example twice a month or twice a week. The greater the frequency, the bigger F is.

Monetary value of the consumers: Monetary value refers to the amount of money spent in a particular period. The greater the monetary value, the bigger M is.

According to research, larger R and F values indicate an increased likelihood of new transactions, and a larger M indicates an increased likelihood of repeat purchases. In the telecom space, what a customer pays is a very important factor in deciding the importance of that customer. Thus, in the RFM model, M is given the largest weight when centroids are used to cluster customers by their importance for targeting.

Case Study: Qin, Zheng, and Huang applied the improved K-means algorithm to a practical dataset from a mobile communication company for customer segmentation. They obtained eight classes of customers, with payment fees as the biggest indicator of customer importance. The mobile company can then use a different strategy for each class of customer. The results of the cluster analysis are given in Appendix 1.
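The clustering of customers on weighted RFM scores can be sketched with the plain K-means iteration described earlier in this section. This is the standard algorithm, not the improved variant of Qin, Zheng, and Huang [3]; the weights (1, 1, 3), the 0 to 1 scaling of the scores and the sample data are assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, weights=None, iters=50, seed=0):
    """Plain K-means on tuples, with optional per-dimension weights in the distance."""
    rng = random.Random(seed)
    d = len(points[0])
    w = weights or [1.0] * d

    def dist(p, c):
        # Weighted Euclidean distance between a point and a center.
        return math.sqrt(sum(w[j] * (p[j] - c[j]) ** 2 for j in range(d)))

    centers = rng.sample(points, k)   # initial centers drawn from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda i: dist(p, centers[i])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[i])   # keep an empty cluster's center unchanged
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Made-up (R, F, M) scores scaled to the range 0..1.
rfm = [(0.2, 0.1, 0.1), (0.3, 0.2, 0.15), (0.1, 0.2, 0.05),
       (0.8, 0.9, 0.9), (0.9, 0.8, 0.95), (0.7, 0.9, 0.85)]
centers, labels = kmeans(rfm, k=2, weights=(1.0, 1.0, 3.0))
```

Giving M three times the weight of R and F makes differences in spending dominate the distance, which mirrors the statement that M receives the largest weight. Each pass over the data costs on the order of n times k times d operations, which is why large call record datasets motivate faster variants.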

Reducing Customer Churn


When a customer leaves a telco's services to adopt those of another operator, it is a loss to the incumbent provider and is referred to as customer churn. This is a huge challenge for operators as it means lost revenue and has implications for customer loyalty. With the advent of number portability in the mobile space, the problem of customer churn has become more acute. By estimating which customers or classes of customers are likely to switch, telcos can develop campaigns and programs to increase customer loyalty and improve retention. Also, as the cost of acquiring customers increases, telcos are beginning to lay more emphasis on increasing the value of existing customers, and in this context reducing churn becomes crucial.

Data mining techniques can be used to build customer churn prediction models. Using information such as customer demographics, billing patterns, customer call records and service change records, a churn prediction model can be built with Artificial Neural Networks [4]. Decision trees are also used to predict customer churn. Research in the area indicates that the more data and customer records are available, the more accurate the churn prediction is. It is important to identify the reasons for customer churn, whether voluntary or non-voluntary. Noticeable causes of churn are given below (Table 2).

Table 2: The Causes of Churn (Reference 4)


As per research, the churn prediction factors for telcos include not having a discounted package, incoming local and long distance calls, being on a prepaid plan, and the standard deviation of calls from other operators. Richter, Yom-Tov, and Slonim [6] have attempted churn prediction by analysing social groups, and have used key performance indicators to predict churn even before the first churn has happened in a group. This uses only call records and not financial or demographic details. They used regression methods for data mining.

Data understanding and data preparation are extremely important in churn prediction for telcos because the number of records is vast and it has to be ensured that the dataset contains no duplicate, incorrect or incomplete records. It is also very important to identify the dependent variable for churn analysis.

Logistic regression is seen as an appropriate method for churn analysis. It can be used to find similarities between observations within each class in terms of the predictor values, which lends itself to classifying an observation as churn or not churn. It is crucial to balance the training dataset on the dependent variable, in this case whether the customer has churned or not, so as to get an unbiased prediction model. After this balancing, the variables which have no effect on the target variable are removed and the remaining variables are considered significant. These variables are further checked for correlation with one another. For a given level of significance, the logistic regression analysis is then done to determine the reasons for customer churn. Decision tree analysis is also used in customer churn analysis. Analysis done by Gürsoy [4] using the records of a mobile operator yielded the following results.
In the logistic regression analysis, the accuracy of correctly predicting non-churners is 74.3% and the accuracy of correctly predicting churners is 66%. For the C&RT decision tree algorithm, the accuracy of the model is 71.76%, which is considered very high; the probability of a wrong outcome is 28.24%.

In a study by Pan [7] on churn among broadband customers using three methods (C5.0, logistic regression, and a neural network algorithm), C5.0 had the best performance for that dataset. If the model trained by C5.0 is used as an early-warning system, about 80 percent of the potential churn customers can be found.

Oseman et al. [14] studied customer churn through data classification and construction of a decision tree using the ID3 (Iterative Dichotomiser 3) algorithm. In the decision tree analysis, the first classification attribute that contributes to churn is the area of the subscriber, which is related to the length of service and the total minutes. It was surmised that if the area is rural and the length of service is more than 20 years, subscribers are not likely to churn to other providers, and if the area is sub-urban and the total minutes that the customer spends on the line is less than 10, the subscribers are likely to churn.

Fuzzy correlation analysis can also be used to deduce the right kind of marketing for a particular group of customers so as to reduce churn. Using attributes like time to contract expiry and bill payment amount, the marketing method yielding the highest retention rate can be deduced [22]. Thus the marketing and customer data can be used to extract the key factors of telecom churn, which helps in choosing appropriate marketing techniques to reduce churn.

Case study: Two instances of the use of data mining to handle customer churn are cited in [4]. Vodafone, which bought Telsim, applies data mining for sales, marketing, financial management, future prediction and many other needs. Vodafone detects peak hours using its databases and readies a larger workforce to avoid any disruption in communication. Vodafone also determines the average prepaid minutes purchased and finds subscribers who are likely to churn. Turkcell, which obtains customer information through business intelligence and data mining techniques, offers new tariffs and develops campaigns for existing customers. Turkcell has also been developing programs that increase customer loyalty.
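The logistic regression approach to churn classification discussed in this section can be sketched in a few lines. This is a minimal illustration using batch gradient descent, not the procedure or the data of any of the cited studies; the two binary predictors (`is_prepaid`, `has_discount`), the toy records and the learning rate are assumptions, and the training set is assumed to have been balanced already.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the log-loss; returns [bias, w1, ..., wd]."""
    d = len(X[0])
    n = len(X)
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w


def churn_probability(w, x):
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


# Made-up balanced training data: (is_prepaid, has_discount) -> churned (1) or not (0).
X = [(1, 0), (1, 0), (1, 0), (0, 0), (1, 1),
     (0, 1), (0, 1), (0, 1), (1, 1), (0, 0)]
y = [1, 1, 1, 1, 1,
     0, 0, 0, 0, 0]
w = fit_logistic(X, y)
p_prepaid_no_discount = churn_probability(w, (1, 0))
p_postpaid_discount = churn_probability(w, (0, 1))
```

On this toy data the fitted weight for `is_prepaid` is positive and the weight for `has_discount` is negative, so a prepaid subscriber without a discounted package receives a churn probability above 0.5 and a postpaid subscriber with a discount receives one below 0.5.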

Customer Fraud Prevention


The telecom industry across the world suffers from customer fraud such as unpaid bills and arrears. This has a significant impact on the bottom line and top line of the telcos. Through the years telcos have collected huge amounts of data in the form of call records, calling patterns, payment history, etc. This data lends itself very well to data mining that detects malicious calls in real time, enabling operators to take real-time action to reduce arrears.

As there are no deterministic rules to identify a subscriber as a fraudster, the telcos can at best model a degree of belief in fraudulent behaviour using the call behaviour of customers. Graphical models such as Bayesian networks supply a general framework for dealing with uncertainty in a probabilistic setting and are therefore able to tackle the problem of fraud detection to a certain extent. Every graph of a Bayesian network encodes a class of probability distributions. The nodes of that graph correspond to the variables of the problem domain, which here is identifying fraudulent customers from the telco's point of view.

By using data mining techniques like the Deviation Determination Method on customer call records and payment history, financial losses can be reduced. Using usage patterns to form clusters of customers helps identify inconsistent features and outliers. Outlier analysis helps in finding abnormal and deviating data in the databases. The ability to find outliers using clustering techniques and to explain their causes leads to better fraud prevention in the telecom industry. The Kohonen neural network clustering algorithm, also called the self-organizing feature map neural network clustering method, has been used by Wu, Kang & Yang [8] to forecast customer fraud in the telecom industry and has been found to be effective at finding outliers [8]. Using an iterative algorithm to optimise the objective function, it helps in deriving clusters from datasets.
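The idea of flagging deviations in customer usage, mentioned above, can be illustrated with a simple z-score check of a customer's latest usage against that customer's own history. This is a generic sketch and not the Deviation Determination Method or the Kohonen approach of [8]; the function names, the threshold of three standard deviations and the sample data are assumptions.

```python
from statistics import mean, stdev


def deviation_score(history, value):
    """Number of standard deviations by which value differs from the customer's own history."""
    mu, sigma = mean(history), stdev(history)
    return 0.0 if sigma == 0 else (value - mu) / sigma


def flag_outliers(usage_by_customer, latest, threshold=3.0):
    """Return customers whose latest usage deviates strongly from their past usage."""
    return sorted(
        customer
        for customer, history in usage_by_customer.items()
        if abs(deviation_score(history, latest[customer])) > threshold
    )


# Made-up daily international call minutes per customer.
history = {"A": [4, 5, 6, 5, 5], "B": [30, 35, 28, 40, 32]}
latest = {"A": 65, "B": 36}
suspects = flag_outliers(history, latest)
```

Customer A normally uses about five minutes a day, so 65 minutes is flagged, while 36 minutes is unremarkable for customer B. A flagged customer is only a candidate for review, since there is no deterministic rule that identifies fraud.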
The Kohonen neural network is a two-layer feed-forward neural network with an input layer and an output layer. The number of neurons in the input layer is M, which equals the dimension of the input sample vector. The neurons of the output layer are competitive output neurons, whose values are in {0, 1}. After self-organizing learning, the Kohonen neural network makes the density of the connection weight vectors consistent with the probability distribution of the input model, which means the density of the connection weight vectors reflects the statistical features of the input model. The layout of neurons in the output layer can take many forms, the typical one being a two-dimensional plane matrix [8].

The Kohonen neural network algorithm is used to form clusters, after which classification techniques can be used to predict the potentially malicious customers. Clustering can extract the outliers from the entire data set but cannot state the features of this group, so it becomes necessary to use a classification decision tree to extract the classification rules of this special group and apply them in business forecasting.

Once the telcos identify customer fraud as an important business problem to solve, the relevant data can be obtained by regularly processing the bills and call records of the customers. Attributes like calling charge, time of call, duration of call and type of call can be combined with the customer information table and arrears table to build the feature record set of target customers. Attributes like customer ID, gender, age, time in network, state, long-distance call fees, roaming fees, average length of phone call, number of phone calls, fees in arrears, service change type, etc. also become crucial. The data preprocessing phase converts the raw data into reliable data for data mining through sampling, setting the label of the decision variable, etc. Subsequently, the neural network algorithm is used to form clusters, and classification rules are extracted to identify and predict malicious usage. The rules are verified using test data, from which a confusion matrix can be generated to test their effectiveness.

Forecast \ Actual | Actually not in arrears (0) | Actually in arrears (1)
Forecast not in arrears (0) | A | C
Forecast in arrears (1) | B | D

Accuracy rate of forecasting malicious arrears = D/(B+D)
Accuracy rate of forecasting not in arrears maliciously = A/(A+C)
Hitting rate of forecasting malicious arrears = D/(C+D)
Hitting rate of forecasting not in arrears maliciously = A/(A+B)

Table 3: Forecasting Effectiveness using confusion matrix [8]

Once the confusion matrix analysis suggests that the rules are suitable for use as per business policies and strategies, they can be applied to customers in real time. From information on a customer's calling duration and cost, account and state changes, it becomes possible to forecast the probability of that customer becoming a malicious account in later months.

Sequential pattern mining algorithms can be used to solve the problem of discovering the maximum frequent sequences in a given database. Genetic algorithms are one of the best ways to solve a problem about which little is known; they use the principles of selection and evolution to produce several solutions to a given problem. Sequential Pattern Clustering with a Genetic Algorithm (GA) has been used on a database of telephone numbers to detect fraud and identify misuse and malpractice customers in the telecommunication area [18]. Before the technique is applied, preliminary activities such as defining the fitness value, initial population size, mutation point and crossover point and ordering the data are carried out. The fitness function value is the criterion for generating new solutions or new candidate functions. The initial sample size, maximum generations, threshold value and minimum fitness are used as input, and the output is the list of malpractice user numbers. This study illustrates the use of clustering and GA techniques for mining the telco database to detect and prevent potential customer fraud.

Sanver and Karahoca [23] have applied the Adaptive Neuro-fuzzy Inference System (ANFIS) structure and optimization processes to call detail records to detect customer fraud. The ANFIS approach uses Gaussian functions for fuzzy sets and linear functions for the rule outputs. The initial parameters of the network are the mean and standard deviation of the membership functions (antecedent parameters) and the coefficients of the output linear functions (consequent parameters). The ANFIS learning algorithm then obtains these parameters by recursively updating the rule parameters until an acceptable level of error is achieved. Each iteration has two steps, one forward and one backward. In the forward pass, the antecedent parameters are fixed and the consequent parameters are obtained using the linear least-squares estimate. In the backward pass, the consequent parameters are fixed, the output error is back-propagated through the network, and the antecedent parameters are updated accordingly using the gradient descent method [23]. As per the study, the total duration of calls, international calls, and the total cost of calls are important factors in fraud identification.
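The four rates defined with Table 3 can be computed directly from paired actual and forecast labels. The sketch below follows the cell naming of that table (A, B, C, D); the function name, the dictionary keys and the sample labels are assumptions made for illustration. In more common terminology, the "accuracy rates" correspond to precision and the "hitting rates" to recall for each class.

```python
def forecast_rates(actual, forecast):
    """actual and forecast are equal-length sequences of 0 (not in arrears) and 1 (in arrears)."""
    pairs = list(zip(actual, forecast))
    A = sum(1 for a, f in pairs if a == 0 and f == 0)  # forecast not in arrears, actually not
    B = sum(1 for a, f in pairs if a == 0 and f == 1)  # forecast in arrears, actually not
    C = sum(1 for a, f in pairs if a == 1 and f == 0)  # forecast not in arrears, actually in arrears
    D = sum(1 for a, f in pairs if a == 1 and f == 1)  # forecast in arrears, actually in arrears

    def ratio(num, den):
        # None marks a rate that is undefined because its denominator is zero.
        return num / den if den else None

    return {
        "accuracy_arrears": ratio(D, B + D),
        "accuracy_not_arrears": ratio(A, A + C),
        "hitting_arrears": ratio(D, C + D),
        "hitting_not_arrears": ratio(A, A + B),
    }


# Made-up test data: 6 customers not in arrears, 4 in arrears.
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
forecast = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
rates = forecast_rates(actual, forecast)
```

With these labels A = 5, B = 1, C = 1 and D = 3, so both rates for the arrears class are 3/4 and both rates for the not-in-arrears class are 5/6.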

Network Operation and Maintenance Management


The telecommunications network is made up of a complex connection of network elements and other devices. The backbone of the network comprises optical fiber connections and core network elements. The edge layer consists of many intermediate devices of differing technologies which provide regional, WAN and MAN connectivity. The access network provides last-mile connectivity over copper, wireless, optical cables etc., with smaller distribution and booster devices.

This complex network generates large volumes of time-series data: event logs, fault and performance alarms, system state and disturbance records, etc. The complexity of this data is growing rapidly as network deployment expands to support the ever-increasing demand for speed and data transfer. Tens of gigabytes of data are generated every day in various networks across the world. Such vast amounts of data need to be mined to identify and correct, in real time, problems in the network that may affect millions of users and the operations of the telcos. Further, such data can be used to optimize network deployment based on traffic flow and needs. Some real-time traffic routing needs can also be addressed when certain segments of the network fail. Mining this data has helped in developing network planning tools which help the telcos predict and build optimized and efficient networks, with a significant impact on their opex and capex.

The Operation and Maintenance Centers (OMC) of the telcos have systems such as the Operating SubSystem (OSS) and the Network Management System (NMS) (Appendix 4). The data generated can be divided into three categories: system configuration data, system parameter data, and dynamic data that describe the operation of network functions.
The analysis of such data, though presented in a readable format, can be overwhelming for the Network Operation Center (NOC) personnel to comprehend and act upon all the time. Data mining methods can help immensely in the decision-making process at all levels of network management, from fault detection and correction to network optimization. Data mining begins with preprocessing the data, which reduces noise, copes with extremes, deals with missing values, and balances variables and their value ranges. Subsequently, various analysis methods are used depending on the statistical information available and the overall goal of the analysis. The phase following the analysis consists of interpretation, presentation and validation of the information and knowledge based on the features discovered in the previous phase.

Frequent patterns are identified from the dataset generated by the network. Frequent sets consist of value combinations that occur inside data records such as alarm entries. Frequent episodes describe common value sequences, such as alarm message types, that occur in the network. Frequent sets can be used for calculating association rules. In telecommunications management systems, the event creation, transfer and logging mechanisms result in unordered entries. The Apriori algorithm ([15], page 27) for frequent sets can be modified to compute frequent unordered episodes, for example by using a set of log/alarm sequence windows. Frequent patterns capture the information in these repetitive entries.

The Comprehensive Log Compression (CLC) method uses frequent patterns to summarise and remove repetitive entries from log data [15]. This makes it easier for a human observer to analyse the log contents. CLC also supports on-line analysis of large log files. The method analyses the data and identifies frequently occurring value or event-type combinations by searching for frequent patterns in the data.

For the purposes of the OMC in the telecom space, frequent episodes and association rules can be used to assist in defining rules for an alarm-correlation engine. However, maintenance and security event logs are much larger; some of the event types are unknown, and new types keep appearing as new components are added to the network.
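The frequent-set search over alarm windows mentioned above can be sketched as follows. This is a simplified, level-wise Apriori-style search written for illustration, not the modified algorithm of [15]; the alarm type names, the window contents and the `min_support` value are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_alarm_sets(windows, min_support, max_size=3):
    """Return {alarm_set: window_count} for sets seen in >= min_support windows.

    Each window is a collection of alarm types observed together; order
    inside a window is ignored, matching the unordered-episode view.
    """
    window_sets = [frozenset(w) for w in windows]

    # Level 1: count single alarm types.
    counts = Counter(alarm for w in window_sets for alarm in w)
    current = {frozenset([a]): c for a, c in counts.items() if c >= min_support}
    frequent = dict(current)

    size = 2
    while current and size <= max_size:
        alarms = sorted({a for s in current for a in s})
        # Apriori pruning: keep a candidate only if all of its
        # (size - 1)-subsets were frequent at the previous level.
        candidates = [
            frozenset(c)
            for c in combinations(alarms, size)
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        ]
        current = {}
        for cand in candidates:
            support = sum(1 for w in window_sets if cand <= w)
            if support >= min_support:
                current[cand] = support
        frequent.update(current)
        size += 1
    return frequent


# Hypothetical alarm windows taken from a log.
WINDOWS = [
    ["LINK_DOWN", "LOS", "AIS"],
    ["LINK_DOWN", "LOS"],
    ["LOS", "AIS"],
    ["LINK_DOWN", "LOS", "AIS"],
    ["HIGH_TEMP"],
]

for alarm_set, count in frequent_alarm_sets(WINDOWS, min_support=3).items():
    print(sorted(alarm_set), count)
```

Sets such as `{LINK_DOWN, LOS}` that recur across many windows are the kind of repetitive pattern that a summarisation method can report once instead of listing every matching entry.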
A method that addresses this kind of changing data set cannot rely on a pre-defined knowledge base alone. Hence, predictive algorithms become necessary, and these can be realised by self-learning methods.

Large volumes of daily network log data force network operators to compress and archive the data to offline storage. Whenever an incident occurs that requires immediate detailed analysis of recent history data, for example in system security monitoring, the data has to be fetched from the archiving system. Typically the data also has to be decompressed before it can be analysed. These kinds of on-line decision-making tasks at the tactical level are challenging for data mining methods and the knowledge discovery process. Log files that telecommunications networks produce are typically archived in compressed form. Closed sets can be used to create a representation that reduces the size of the stored log by coding the frequently occurring value combinations. The coding can be done without any prior knowledge about the entries. Using datamining techniques to identify the critical attributes to be stored can be a key differentiator in implementing efficient algorithms for real-time access to data without the need for decompression. Queryable Lossless Log data Compression (QLC) [15] is one way to archive data so that it can be accessed without decompression.

Burn-Thornton et al. [16] evaluated the efficacy of several datamining methods for proactive network management. The algorithms chosen were: k-NN (Statistical), C4.5 (Machine Learning, decision tree), CN2 (Machine Learning, rule-induction), RBF (Neural Networks), and OC1 (Machine Learning, decision tree algorithm). The study surmised that C4.5 could prove to be the best algorithm for accurately and rapidly mining multi-set, multivariate, small-class performance data, and for performing this task over smaller data sets. However, the k-NN algorithm appeared to be the most suitable data mining algorithm for providing the basis for proactive system management because of its better classification accuracy, although it was slower than C4.5.

Data mining can be used to produce a probabilistic network by correlating offline alarm event data, and then to deduce the cause of live alarm events using this probabilistic network. The cause-and-effect graph that forms a Bayesian graph/network can be considered a complex form of alarm correlation. The alarms are connected by edges that indicate the probabilistic strength of correlation. Induction has to be used to deduce this structure from the data, but this is an NP-hard problem because the vast number of variables gives rise to a very large number of potential graphical structures. In practice, when learning the cause-and-effect graph, the volume of event traffic and correlation of alarms can be reduced by a simple first-stage correlation (generally pattern matchers). The expert system approach (e.g. deduction from the probabilistic network) could then handle the remaining, more complex problems, taking advantage of the much reduced and enriched stream of events. Rules for the system can be written from the results mined with tools such as Clementine [17]. Based on the datamining results, additional rules could potentially be adapted to extend the existing correlation system in an element or network management system.

Another use of datamining on the huge amount of network data generated is to monitor the delivery of Service Level Agreements (SLA) to customers.
Data Exceptions (DE), specific indications of unexpected performance, mark periods where the delay data differs from some expectation for reasons such as spikes, delay step changes, and changes in time-of-day delay variation. By datamining weeks of collected network data, the network operator can classify unseen exceptions as DEs and hence take corrective action as required to maintain the SLAs. A neural network can be trained on the data to create decision engines which help the operator in doing this. In the study by Phillips et al. [19], DEs are detected automatically using a two-stage process. First, a statistical test (Kolmogorov-Smirnov) is performed, which indicates that an exception has occurred in the data. Subsequently, the delay data immediately surrounding the exception is passed to a neural network for classification. The neural network is initially trained with a set of DEs of various types taken from the communication network. In Phillips et al.'s evaluation of this approach, a classification accuracy of 99% was eventually achieved on this training set. When used on unseen exceptions, the trained neural network achieved a classification accuracy of 79%, averaged over all exception types. The KS test, which preceded the neural network, correctly identified 99.5% of the exceptions presented to it [19].

The Classification And Regression Trees (CART) algorithm can be used to classify QoS based on the Key Performance Indicators (KPI) of the telecommunication network and its elements, e.g. a cell site. CART is a binary decision tree which can be applied to both numerical and nominal data. Classification with CART is based on observations of a set of variables, which are used as predictors, and a classification variable (also called the target value) attached to these observations. The tree construction determines the binary splits of data set X using training data set L so that X is cut into smaller and smaller subsets. The solution is to search over every possible threshold in every variable for the split that best improves the tree structure according to a specified score function [20]. The terminal nodes can be written out as linguistic classification rules which can be understood relatively easily by the NOC operator, who can then take the necessary action based on the QoS classification.

The Self-Organizing Map (SOM), an unsupervised neural network, can also be used to manage telecommunication traffic. SOM has been analysed for various telecommunication purposes such as call admission control, controlling a router system, creating geographical clusters from calling patterns, adaptive resource allocation, user profiling to detect fraud in mobile telecommunications networks, visualizing the behavior of network cells, and visualizing the performance, detecting the anomalies and analyzing the trends of a mobile network [20].

Case Study: The Telecommunication Alarm Sequence Analyzer (TASA) is one of the data mining tools that help in fault identification by automatically discovering recurrent patterns of alarms within the network data. The patterns discovered by the tool are used to construct a rule-based alarm correlation system. TASA is also capable of finding episodic rules that depend on temporal relationships between the alarms [21].
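The exhaustive split search described above for CART (try every threshold in every variable and keep the one with the best score) can be sketched as follows. Gini impurity is assumed here as the score function, which is a common choice but is not stated in the text; the KPI columns and the cell-site values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split.

    Candidate thresholds are midpoints between consecutive distinct values
    of each feature. Returns None if no feature has two distinct values.
    """
    n = len(rows)
    best = None
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best


# Hypothetical cell-site KPIs: (dropped call rate %, handover success %).
KPI_ROWS = [(0.5, 98.0), (0.8, 97.0), (1.0, 91.0), (3.5, 96.0), (4.0, 90.0), (5.2, 93.0)]
QOS_LABELS = ["good", "good", "good", "bad", "bad", "bad"]

feature, threshold, score = best_split(KPI_ROWS, QOS_LABELS)
print(f"split on feature {feature} at {threshold} (weighted Gini {score})")
```

A full tree would apply `best_split` recursively to the left and right subsets; each path from the root to a terminal node then reads as a rule of the form "IF dropped call rate <= t THEN QoS is good".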

Telecommunication Software Quality


Large-scale telecommunication software such as NMS is deployed by operators and runs for many years, with new versions and generations of software updating the existing ones. The heterogeneity of networks and the large scale of deployment mean that many versions run simultaneously on the network. Given this scenario, developers would like to spend their time on the areas which are most critical and most prone to faults. Using the number of defects as the dependent variable, datamining techniques such as logistic regression and decision trees can be used to find the sub-modules which are most prone to error, so that effort can be concentrated on improving their quality. Various aspects of large-scale software, such as coupling in the software, the experience of developers, the history of changes and faults, the size of the software, the duration of deployment, etc., can be used as attributes for datamining purposes.

Many fault-proneness prediction models can be built by classifying files and submodules as fault-prone. Logistic Regression and Machine Learning techniques like C4.5 are used for building such models. Each leaf of a decision tree thus built corresponds to a subset of the available data set (e.g. source code characteristics and their fault/change history), and its probability distribution can be used for prediction when all the conditions leading to that leaf are met [25]. Coverage algorithms, which create independent rules with a probability of fault for a particular class or subset of code, iteratively identify attribute-value pairs that maximize the probability of the desired classification and, after each rule is generated, remove the instances that it covers before identifying the next optimal rule. Neural networks, trained for example with the back-propagation algorithm, can also be used for classification purposes. Another classification approach is the Support Vector Machine (SVM) classifier, which attempts to identify optimal hyperplanes with nonlinear boundaries in the variable space in order to minimize misclassification.

There are certain challenges in using datamining techniques to detect the fault proneness of software. When building models to predict faulty components or files, the process tends to be rather exploratory, which results in a large number of predictors with low correlation. The number of training instances needed for instance-based learning increases exponentially with the number of irrelevant variables present in the data set. At the same time, strong inter-correlations among variables affect variable selection heuristics in regression analysis. Variable selection schemes like Correlation-based Feature Selection (CFS) help in better selection of variables for analysis.

Arisholm et al. [25] used a large Java-based legacy system maintained by Telenor in Norway, consisting of more than 2600 Java classes amounting to about 148K SLOC (source lines of code). Using as the dependent variable the occurrence of corrections in the classes of a specific release that are due to field error reports, the aim was to focus unit testing and inspections on the most fault-prone modules so as to reduce the cost of quality. They found that C4.5 decision trees perform very well overall (for different percentages of code and test sets) and that the 20% most fault-prone classes account for around 70% of test sets. When this model was used, feedback from developers showed that they were able to uncover many new faults by investing extra days of unit testing on the classes predicted as most fault-prone.

Turhan, Kocak and Bener [23] analyzed twenty-five projects of a large telecommunication system in their study. To predict the defect proneness of modules, they trained models on publicly available NASA MDP data. They used projects implemented in Java and gathered 29 static code metrics from each.
In total, there are approximately 48,000 modules spanning 763,000 lines of code. All projects are from the presentation and application layers. In their experiments they used Static Call Graph Based Ranking as well as Nearest Neighbor Sampling for constructing method-level defect predictors. They found that the Naive Bayes methodology achieves significantly better results than many other mining algorithms for defect prediction.
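As an illustration of the kind of classifier referred to above, the following is a minimal sketch of a Naive Bayes defect predictor over static code metrics. It assumes Gaussian class-conditional distributions for each metric, which may differ from the variant used in the cited study; the two metrics (lines of code, cyclomatic complexity) and all values are hypothetical.

```python
import math
from collections import defaultdict


def train_gaussian_nb(rows, labels):
    """Fit per-class priors and per-metric (mean, variance) pairs."""
    grouped = defaultdict(list)
    for row, y in zip(rows, labels):
        grouped[y].append(row)
    model = {}
    for y, members in grouped.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small floor avoids division by zero
        model[y] = (prior, stats)
    return model


def predict(model, row):
    """Return the class with the highest log-posterior for one module."""
    def log_posterior(y):
        prior, stats = model[y]
        total = math.log(prior)
        for v, (mean, var) in zip(row, stats):
            # Log of the Gaussian density; metrics are treated as independent.
            total += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


# Hypothetical training modules: (lines of code, cyclomatic complexity).
METRICS = [(120, 4), (90, 3), (150, 5), (900, 30), (1100, 42), (800, 25)]
DEFECT_LABELS = ["clean", "clean", "clean", "defective", "defective", "defective"]

nb_model = train_gaussian_nb(METRICS, DEFECT_LABELS)
print(predict(nb_model, (1000, 35)))
print(predict(nb_model, (100, 4)))
```

The "naive" part is the independence assumption between metrics, which keeps training to a single pass over the data and is one reason the method scales to tens of thousands of modules.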

Classification of telecommunications companies


Costea and Eklund [12] have demonstrated the use of datamining techniques to classify telecommunications companies with respect to their financial performance. They used the Self-Organizing Map (SOM) algorithm to cluster the companies and applied two classification methods, multinomial logistic regression (MLR) and decision tree induction (DTI), to develop class prediction models. The research problem was to find a relationship between a categorical variable (financial performance class) and a financial data vector.

The SOM is an unsupervised-learning neural-network method that produces a similarity graph of input data. It consists of a finite set of models that approximate the open set of input data, and the models are associated with nodes (neurons) that are arranged as a regular, usually 2-D grid. The models are produced by a learning process that automatically orders them on the 2-D grid according to their mutual similarity. The original SOM algorithm is a recursive regression process [11].

The financial data of many telecommunications companies was used to derive six groups/clusters (Appendix 2). The cluster characteristics were used to classify companies into one of the six categories. Both classification methods relied on similar variables. The DT model relies heavily on ROE, Interest Coverage and Equity to Capital, while the MLR model equations generally have large positive coefficients for ROE, Interest Coverage and Equity to Capital, and negative or small coefficients for the other variables. This is an interesting case study of how the financial health of the telecommunications industry can be predicted using datamining techniques.
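The SOM learning process described above can be illustrated with a single training step: find the node whose model is closest to an input vector, then pull that node and its grid neighbours towards the input. This is a minimal sketch, not the configuration used by Costea and Eklund; the 2x2 grid, the Gaussian neighbourhood, the learning rate and the two-element input vector (standing in for scaled financial ratios) are hypothetical.

```python
import math


def best_matching_unit(grid, x):
    """Grid position whose weight vector is closest to x (squared Euclidean)."""
    return min(grid, key=lambda pos: sum((w - v) ** 2 for w, v in zip(grid[pos], x)))


def som_update(grid, x, learning_rate=0.5, radius=1.0):
    """Apply one SOM training step in place and return the winning position.

    grid maps (row, col) -> weight vector. Nodes near the winner on the
    grid move more strongly towards x than nodes far from it.
    """
    bmu = best_matching_unit(grid, x)
    for pos, weights in list(grid.items()):
        grid_dist_sq = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
        influence = math.exp(-grid_dist_sq / (2 * radius ** 2))
        grid[pos] = [w + learning_rate * influence * (v - w) for w, v in zip(weights, x)]
    return bmu


# Hypothetical 2x2 map over two scaled financial ratios.
som_grid = {
    (0, 0): [0.0, 0.0],
    (0, 1): [0.0, 1.0],
    (1, 0): [1.0, 0.0],
    (1, 1): [1.0, 1.0],
}

winner = som_update(som_grid, [0.9, 0.8])
print(winner, som_grid[winner])
```

Full SOM training repeats this step over many input vectors while gradually reducing the learning rate and the neighbourhood radius, which is what produces the ordered map on which companies with similar ratios land in nearby nodes.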

Insights for future


The telecommunication industry is undergoing vast change. From purely fixed, voice-based products, the industry has become wireless and data based. This has brought various challenges in addition to the business challenges mentioned above. The future holds many changes in the telecommunication space which have implications for telcos, equipment manufacturers and users. New areas of expansion include cloud, 4G, Internet Protocol (IP) based systems, the changing nature of communication with integrated devices, Voice over IP (VoIP) phones, free services like email, and upcoming public wireless systems. These pose great challenges in customer retention, the introduction of new services, and cost-effective marketing. Datamining can help by providing information that can be used for taking strategic decisions.

Market Basket Analysis can be used to learn the purchase trends of customers. This would help in bundling new services more effectively with existing ones. For example, the fact that a landline customer also has a high-speed internet connection can be used as an association rule for identifying customers who are likely to buy a high-end service like on-demand movies or to go for 4G mobile service. Existing association rules can be used to develop and test minimum support and minimum confidence thresholds, from which new association rules for prospective buyers of new services can be derived. This is helpful because it is never easy to predict the adoption of new technologies and products, and existing association rules can be used to target potential customers better. Such induction-based rules could be evolved to give better predictive capability on customer segmentation for new services. Frequent itemset mining methods like the Apriori algorithm can be used to understand the correlation between services, giving an idea of which customers are ready to adopt new services.
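The support and confidence thresholds mentioned above can be made concrete with a short calculation. The sketch below evaluates one candidate rule, "landline and broadband implies video on demand", against a handful of subscriptions; the service names and the customer data are hypothetical.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = share of all customers holding antecedent and consequent
    confidence = share of antecedent holders who also hold the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [b for b in baskets if antecedent <= set(b)]
    with_both = [b for b in with_antecedent if consequent <= set(b)]
    support = len(with_both) / len(baskets)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Hypothetical service subscriptions, one set per customer.
SUBSCRIPTIONS = [
    {"landline", "broadband", "video_on_demand"},
    {"landline", "broadband"},
    {"landline", "broadband", "video_on_demand"},
    {"mobile"},
    {"landline"},
]

support, confidence = rule_metrics(
    SUBSCRIPTIONS, {"landline", "broadband"}, {"video_on_demand"}
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A rule would be kept for targeting only if both values clear the chosen minimum thresholds; customers who match the antecedent but do not yet hold the consequent service are then the prospects.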
Customer targeting for the adoption of new services can also be done by pattern mining using sequential and time series patterns. This gives an indication of the range from early adopters to laggards for past services when they were introduced. The patterns linking demographics such as age to the adoption of new services over time indicate how to target customers. By using the association rules derived from past pattern matching, the current customer set can be targeted for corresponding services. For example, if the data indicates that new services have always had a high adoption rate in the 25-30 age group, any new service can be targeted towards such customers. Likewise, if the pattern suggests that high value-added, high-margin services have been adopted more by urban professionals in the 30-45 age group, this would be the segment to target for such future services. Fuzzy set based classification techniques would also be useful, as they classify into categories with a degree of membership.
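The idea of a degree of membership can be sketched with a trapezoidal membership function over customer age. The shape and the breakpoints below are hypothetical choices made for illustration; they soften the crisp 25-30 boundary from the example so that customers just outside the band still belong to the segment to some degree.

```python
def trapezoid(x, a, b, c, d):
    """Membership degree in [0, 1]: rises from a to b, flat to c, falls to d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


def early_adopter_degree(age):
    """Hypothetical fuzzy segment centred on the 25-30 age group."""
    return trapezoid(age, 22, 25, 30, 35)


for age in (21, 23.5, 27, 32.5, 40):
    print(age, early_adopter_degree(age))
```

A campaign could then rank customers by membership degree instead of applying a hard cut-off at the edges of the age band.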

Location-based advertisement services can be made more effective by using clustering techniques and creating clusters based on attributes derived from the connection between the demographics, location and product purchase history of the customer. This would involve collaboration between different data sets from different players, and issues of privacy will have to be addressed here. Methods of anonymisation will have to be used, along with advanced clustering methods like probabilistic model-based clustering or flexible fuzzy clustering using the Expectation-Maximisation algorithm. Because of the large amount of multi-dimensional data, Subspace Search Methods may have to be deployed to develop engines for targeted advertising.

Analysis of the data and text flowing through the networks will involve text mining techniques, both to understand customer trends and to assess network planning and network deployment needs. Such capability at the telco end would greatly enhance the competitiveness of telcos with respect to companies which use resident data for attracting advertisement revenue, and would enable telcos to become a more important player in the communication industry value chain.

The use of mobile devices and the availability of fast internet have shifted the usage patterns of customers. There are reports of networks being clogged by high usage of multimedia content over wireless, and of operators having to withdraw unlimited-usage plans. With landlines being outstripped by mobile connections and the need to increase Average Revenue Per User (ARPU), data applications have become an important part of the ecosystem. Datamining techniques applied to the type of traffic, plans and profitability would help the telcos offer the right kind of plan, customized for individual customers.
Online Analytical Processing systems can be developed to classify new or existing customers based on usage patterns, ARPU generated, technology adoption patterns etc., and to offer real-time plans which would greatly enhance customer retention and loyalty. Discriminant Analysis can be used to predict customer acceptability so as to reduce customer fraud. Logistic Regression techniques can be used to predict and ensure better quality of service based on SLAs and the evolving patterns of network traffic. Datamining techniques used for predictive analysis will play a critical role in the success of telcos, especially with the changing landscape of technology and players, customer choices, and lifestyles.

Conclusion
The various applications of datamining in the telecommunications industry have been presented. Even though the telecommunications industry has been one of the earliest adopters of datamining, changing technology trends and the focus on customer delight are making datamining an integral part of operators' businesses. The new and evolving techniques for predicting future trends from past data are critical in making datamining a powerful tool, especially since the uncertainty of adoption is high and is coupled with opportunities for substantial business gains.

Reference
1 Data Mining In Telecommunications - Gary M. Weiss, Department of Computer and Information Science, Fordham University, Copyright 2009, IGI Global
2 The Research of Data Mining in Telecom Data Warehouse - SHI Jun-yong, LI Ling-ling, 2010 IEEE Computer Society, DOI 10.1109/ICSEM.2010.160
3 Improved K-Means algorithm and application in customer segmentation - Xiaoping Qin, Shijue Zheng, Ying Huang, Guangsheng Deng (Vietnam), Department of Computer Science, Huazhong Normal University, 2010 IEEE Computer Society, 2010 Asia-Pacific Conference on Wearable Computing Systems, DOI 10.1109/APWCS.2010.68
4 Customer churn analysis in telecommunication sector - Umman Tuğba Şimşek Gürsoy, Istanbul University Journal of the School of Business Administration, Cilt/Vol:39, Sayı/No:1, 2010, 35-49, ISSN: 1303-1732, www.ifdergisi.org 2010
5 Clustering Analysis of Telecommunication Customers - H. Ren, Y. Zheng, Y. Wu, The Journal of China Universities of Post and Telecommunications. 16(2), 114-116 (2009).
6 Predicting customer churn in mobile networks through analysis of social groups - Yossi Richter, Elad Yom-Tov, Noam Slonim, IBM Haifa Research Lab, 165 Aba Hushi st., Haifa 31905, Israel.
7 On Customer Churn and Early Warning Model of Telecom Broadband - Ding Pan, Center for Business Intelligence Research, School of Management, Jinan University, Guangzhou, China, 978-0-7695-4261-4/10, 2010 IEEE Computer Society
8 Fraudulent Behavior Forecast in Telecom Industry Based on Data Mining Technology - Sen Wu, Naidong Kang, Liu Yang, School of Economics and Management, University of Science and Technology Beijing, Communications of the IIMA, 2007 Volume 7 Issue 4
9 Designing an Expert System for Fraud Detection in Private Telecommunications Networks - C.S. Hilas, Expert Systems with Applications. 36, 11559-69 (2009).
10 Business Intelligence Applications and Data Mining Methods in Telecommunications: A Literature Review - Dorina Kabakchieva, Sofia University, 125 Tzarigradsko shosse Blvd., 1113 Sofia, Bulgaria
11 Self Organization of a Massive Document Collection - Teuvo Kohonen, Samuel Kaski, Krista Lagus, Jarkko Salojärvi, Jukka Honkela, Vesa Paatero, Antti Saarela, IEEE Transactions On Neural Networks, Vol. 11, No. 3, May 2000
12 A Framework for Predictive Data Mining in the Telecommunications Sector - Adrian Costea, Tomas Eklund, Turku Centre for Computer Science and IAMSR / Åbo Akademi University, Lemminkäisenkatu 14 B, FIN-20520 Turku, Finland
13 Privacy Preserving Data Mining in Telecommunication Services - Ole-Christoffer Granmo, Vladimir A. Oleshchuk, ISSN 0085-7130, Telenor ASA 2007, Telektronikk 2.2007
14 Data Mining in Churn Analysis Model for Telecommunication Industry - Khalida binti Oseman, Sunarti binti Mohd Shukor, Norazrina Abu Haris, Faizin bin Abu Bakar, Journal of Statistical Modeling and Analytics Vol. 1 No. 19-27, 2010
15 Data mining for telecommunications network log analysis - Kimmo Hätönen, University of Helsinki, Finland
16 Pro-Active Network Management Using Data Mining - K.E. Burn-Thornton, J. Garibaldi & A.E. Mahdi, School of Electronic, Communication and Electrical Engineering, University of Plymouth, 0-7803-4984-9/98, 1998 IEEE.
17 Data Mining telecommunications network data for fault management and development testing - R. Sterritt, K. Adamson, C.M. Shapcott, E.P. Curran, Faculty of Informatics, University of Ulster, Northern Ireland.
18 A Fraud Detection Approach in Telecommunication using Cluster GA - V. Umayaparvathi, Dr. K. Iyakutti, MKU, International Journal of Computer Trends and Technology, May to June Issue 2011, ISSN: 2231-2803
19 Architecture for the Management and Presentation of Communication Network Performance Data - Iain Phillips, David Parish, Mark Sandford, Omar Bashir, and Anthony Pagonis, IEEE Transactions On Instrumentation And Measurement, VOL. 55, NO. 3, JUNE 2006, 0018-9456, 2006 IEEE
20 Data Mining for Managing Intrinsic Quality of Service in Digital Mobile Telecommunications Networks - Pekko Vehviläinen, Tampere University of Technology, Publications 458
21 Computational Intelligence in Data Mining and Prospects in Telecommunication Industry - Isinkaye O. Folasade, Journal of Emerging Trends in Engineering and Applied Sciences (JETEAS) 2 (4): 601-605 (ISSN: 2141-7016)
22 Analysis of marketing data to extract key factors of telecom churn management - Hao-En Chueh, African Journal of Business Management Vol. 5(20), pp. 8242-8247, 16 September, 2011
23 Data mining source code for locating software bugs: A case study in telecommunication industry - Burak Turhan, Gozde Kocak, Ayse Bener, Expert Systems with Applications 36 (2009) 9986-9990
24 Fraud Detection Using an Adaptive Neuro-Fuzzy Inference System in Mobile Telecommunication Networks - Mert Sanver (Institute for Computational and Mathematical Engineering, Stanford University, Stanford, 94305, USA), Adem Karahoca (Department of Computer Engineering, Bahcesehir University, Besiktas, Istanbul, 34900, Turkey)
25 Data Mining Techniques for Building Fault-proneness Models in Telecom Java Software - Erik Arisholm, Lionel C. Briand, Magnus Fuglerud, Simula Research Laboratory, 18th IEEE International Symposium on Software Reliability Engineering, 1071-9458/07, 2007 IEEE, DOI 10.1109/ISSRE.2007.22
26 Data Mining: Concepts and Techniques, Third Edition - Jiawei Han, Micheline Kamber, Jian Pei, Morgan Kaufmann Series

Appendix 1 : Results of Customer Analysis [3]

Appendix 2: Class predictions using two classification models [10]

Appendix 3: Taxonomy of Data Mining [14]

Appendix 4: A cluster hierarchy of an operator business model and responsibility areas. [15]
