
2017 3rd IEEE International Conference on Computer and Communications

Network Traffic Data Analysis Based on DGX

Dan Zou, Jun Liu, Qing Yan


Beijing Key Laboratory of Network System Architecture and Convergence, Beijing Laboratory of Advanced Information
Networks
Beijing University of Posts and Telecommunications
Beijing, China
e-mail: zoudanx@163.com

Abstract—Skewed data appear frequently in network traffic data; this paper focuses on modeling and analyzing such data. Research on data distribution models plays an important role in data mining. In many fields of science, various models have been proposed to describe skewed data, but most of them perform well only on particular data sets and fail on others. The purpose of this paper is to fit and analyze the skewed data in network traffic data with the Discrete Gaussian Exponential (DGX) distribution, and to estimate the model parameters with a genetic algorithm.

Keywords-network traffic data; data mining; DGX; genetic algorithm

I. INTRODUCTION

Network traffic analysis refers to the analysis of the traffic data generated by various network devices. The analysis of network traffic data is very important for network management, network structure optimization and abnormal traffic monitoring, and research on the distribution of these data is the first step. Network traffic data contain a large amount of skewed data.

Skewed data are data with uneven frequency. Skewed data can be seen everywhere in real life: "80% of wealth is concentrated in 20% of people" and "80% of web page hits come from 20% of users" are familiar sayings that describe skewed data. People have summarized a rule for this kind of data, known as the "80/20 principle". Some attempts have been made to model skewed data, such as the power-law distribution, the parabolic distribution and the Zipf distribution [1]. Among them, the Zipf distribution is the most widely used, but it is not optimal: when fitting skewed data with the Zipf distribution, studies [2, 3] found deviations at the top of the fitted curve, which are called "top concavity". Thus a question arises: can we find a better model for skewed data?

We develop a skewed data model, based on the Discrete Gaussian Exponential (DGX) distribution [4], which can be realized rapidly in a distributed system and can serve as a useful tool for data mining. We model an open data set and a network traffic data set with it to demonstrate its superiority in modeling skewed data.

II. RELATED WORK

People have made many attempts to model network traffic data. The literature [5, 6] studied these data, but did not find a good distribution model. The common distributions are the power-exponent distribution, the parabola distribution and the Zipf distribution; among them, the last is the most widely used. For example, Gabaix [7] investigated the growth of cities with the Zipf distribution, and a Zipf distribution has also been observed in the income distribution of companies [9]. Next, we introduce the Zipf distribution.

In 1935, Zipf, a linguist at Harvard University, found while researching English word frequency that if words are ordered by frequency in descending order, there is an inverse relationship between the frequency of each word and its rank; this is known as Zipf's law.

Zipf's law generally takes two forms: one describes the relationship between frequency and rank, while the other focuses on the relationship between frequency and count.

A. Frequency and Rank

    f_r ∝ 1 / r^α                                            (1)

where r represents the rank of an object, f_r represents the frequency of this object, and α is the parameter of the model.

B. Frequency and Count

    c_f ∝ 1 / f^β                                            (2)

where f represents the frequency, c_f represents the count of objects whose frequency is f, and β is the parameter of the model.

It is not hard to see that if the frequency-rank relationship of a data set obeys Zipf's law, the frequency-rank curve should be linear in log-log scale, and likewise for the count-frequency relationship.

For network traffic data, some previous work fits it with the Zipf distribution. It performs reasonably well, but we often observe a deviation from the fitted curve; this is called
978-1-5090-6352-9/17/$31.00 ©2017 IEEE 1199


"top concavity". Next, we will try to fit a network traffic data set with the Zipf distribution and check the performance. The data set comes from network traffic data; we mainly concentrate on the counts of server IPs. First, we draw the frequency-rank plot in Fig. 1.

As we can see in Fig. 1, the frequency-rank plot shows Zipf-like skewed behavior, but not all the points fall near a straight line, which means the data do not follow the Zipf distribution exactly. Similar phenomena have been found in other studies; this is the "top concavity" mentioned above.

Figure 1. Frequency-rank plot of server IP

III. MODEL AND ESTIMATING ALGORITHM

A. Probability Distribution Function

    P(x = k) = (A(μ,σ) / k) exp(−(ln(k) − μ)² / (2σ²)),  k = 1, 2, ...    (3)

where

    A(μ,σ) = [ Σ_{k=1}^∞ (1/k) exp(−(ln(k) − μ)² / (2σ²)) ]^{−1}          (4)

is a normalization constant depending on μ and σ.

Here we can find an interesting relationship. When μ → −∞, (3) becomes

    P(x = k) ∝ (1/k) exp(μ ln(k) / σ²) = k^{μ/σ² − 1}                     (5)

Let θ = 1 − μ/σ²; we get

    P(x = k) ∝ k^{−θ}                                                     (6)

This is the probability mass function of the generalized Zipf distribution. Thus the Zipf distribution can be regarded as a special DGX distribution in which the parameter μ tends to negative infinity; in other words, the Zipf distribution is a special case of the DGX distribution.

B. Estimating the Parameters

We estimate the parameters of the DGX distribution with maximum likelihood estimation (MLE). Assuming the input data set is X = {x₁, x₂, …, x_n}, the likelihood is

    L(μ,σ) = Π_{i=1}^n P(x_i) = A(μ,σ)^n Π_{i=1}^n (1/x_i) exp(−(ln(x_i) − μ)² / (2σ²))    (7)

and the log-likelihood is

    l(μ,σ) = n ln(A(μ,σ)) − Σ_{i=1}^n [ ln(x_i) + (ln(x_i) − μ)² / (2σ²) ]                 (8)

Calculating μ and σ amounts to finding the global maximum point of the function l(μ,σ). The genetic algorithm has shown its power in the determination of maximum points [8].

Figure 2. Schematic diagram of genetic algorithm

Next, we explain the process of finding the maximum point with the genetic algorithm. As shown in Fig. 2, the genetic algorithm has five steps:

1) Initialization
Generate the initial population with random points.

2) Selection
Calculate the fitness of each individual in the population, then select the individuals with better fitness by roulette-wheel selection. We also use an elitism mechanism to ensure that the best individual survives.

3) Crossover
Two different chromosomes exchange some of their genes using single-point crossover, according to the crossover probability.

4) Mutation
Chromosomal variation happens with a mutation rate.

5) Judgement
Determine whether the number of generations has reached the specified value; if so, output the best individual in the current generation. Otherwise, jump back to Step 2.

IV. EXPERIMENT

In this section, we set up a DGX model for an open data set [10] to explain the details of the DGX modeling process, and then evaluate its fitness. The data set is the severity of terrorist attacks worldwide from February 1968 to June 2006, measured by the death toll. D = {d₁, d₂, …, d_n} represents the input data set, where d_i (i = 1, 2, …, n) is the number of deaths in a terrorist attack.

First of all, we estimate the parameters μ and σ:

    [μ, σ] = Max[l(μ,σ), D]                                  (9)

where l(μ,σ) is the log-likelihood function defined in (8). Estimating the parameters amounts to finding the global maximum point of l(μ,σ). We can do this with the optimization function fminsearch() in Matlab. However, if the input data set is too large, this method is no longer applicable; in that case we can design our own algorithm, for example one based on the genetic algorithm, and implement it with Spark or another big-data computing framework.

We obtain the global maximum point at (−2.9634, 2.2535), which means μ = −2.9634 and σ = 2.2535. With the resulting probability distribution function we can generate a synthetic data set. The pseudocode for generating a synthetic data set with DGX is shown in Algorithm 1, where D = {d₁, d₂, …, d_n} represents the real data set.

Algorithm 1: Generate synthetic data with DGX model
Input:  μ: parameter μ of DGX model,
        σ: parameter σ of DGX model,
        D: the real data set.
Output: synthetic_data
1   sum ← 0
2   n ← size(D)
3   for i ← 1 to n do
4       sum ← sum + D(i)
5   end for
6   for j ← 1 to n do
7       num ← D(j)
8       count ← sum * P(μ, σ, num)
9       synthetic_data(num) ← count
10  end for
11  return synthetic_data

After generating a synthetic data set with DGX, we compare the real data set and the synthetic data set. As shown in Fig. 3, we draw the frequency-number plot and the quantile-quantile plot (qqplot) of the two data sets.

We fit the qqplot with a straight line and estimate its parameters with the least-squares method. Let S = b₀ + b₁R be the equation of the fitted line, where S represents the synthetic data, R represents the real data, and b₀ and b₁ are the undetermined coefficients. Then

    b₁ = (Σ_{i=1}^n R_i S_i − n S̄ R̄) / (Σ_{i=1}^n R_i² − n R̄²)    (10)

    b₀ = S̄ − b₁ R̄                                                  (11)

where

    S̄ = (1/n) Σ_{i=1}^n S_i                                        (12)

    R̄ = (1/n) Σ_{i=1}^n R_i                                        (13)

If b₀ is close to zero and b₁ is close to one, we consider the real data set and the synthetic data set to follow the same distribution. We can thus conclude that DGX performs well in modeling this data set.

Figure 3. Frequency-number plot and qqplot of deaths in terrorist attacks
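The estimation described above can be sketched in a few lines of Python; scipy's Nelder-Mead minimizer is the closest analogue of Matlab's fminsearch(). This is a minimal sketch under stated assumptions, not the authors' code: the infinite normalizing sum in (4) is truncated at an assumed k_max, and the starting point is arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def dgx_log_pmf(k, mu, sigma, k_max=100000):
    """Log of the DGX pmf (3), truncating the normalizing sum (4) at k_max."""
    ks = np.arange(1, k_max + 1)
    # log of each term (1/k) exp(-(ln k - mu)^2 / (2 sigma^2)) in (4)
    log_terms = -np.log(ks) - (np.log(ks) - mu) ** 2 / (2 * sigma ** 2)
    log_norm = -np.logaddexp.reduce(log_terms)      # = log A(mu, sigma)
    return log_norm - np.log(k) - (np.log(k) - mu) ** 2 / (2 * sigma ** 2)

def fit_dgx(data, start=(0.0, 1.0)):
    """Maximize the log-likelihood (8), i.e. minimize its negative."""
    data = np.asarray(data, dtype=float)

    def neg_ll(params):
        mu, sigma = params
        if sigma <= 0:
            return np.inf                           # keep sigma in its valid range
        return -np.sum(dgx_log_pmf(data, mu, sigma))

    res = minimize(neg_ll, start, method="Nelder-Mead")
    return res.x                                    # (mu_hat, sigma_hat)
```

For data sets too large for a single machine, the same negative log-likelihood could be evaluated in parallel and maximized with a genetic algorithm, as the text suggests.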

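Algorithm 1 and the least-squares check in (10)-(13) also translate directly. The following Python sketch is an illustration, again under the assumption that the normalizing sum in (4) can be truncated at a finite k_max; nothing here is taken from the authors' implementation.

```python
import numpy as np

def dgx_pmf(k, mu, sigma, k_max=100000):
    """P(x = k) from (3), with the normalizing sum (4) truncated at k_max."""
    ks = np.arange(1, k_max + 1)
    weights = np.exp(-(np.log(ks) - mu) ** 2 / (2 * sigma ** 2)) / ks
    return (np.exp(-(np.log(k) - mu) ** 2 / (2 * sigma ** 2)) / k) / weights.sum()

def synthetic_counts(D, mu, sigma):
    """Algorithm 1: expected count for each value observed in the real data set D."""
    total = sum(D)                                   # lines 1-5 of Algorithm 1
    return {num: total * dgx_pmf(num, mu, sigma) for num in D}  # lines 6-10

def qq_line(R, S):
    """Least-squares line S = b0 + b1*R per (10)-(13)."""
    R, S = np.asarray(R, dtype=float), np.asarray(S, dtype=float)
    n = len(R)
    b1 = (np.sum(R * S) - n * R.mean() * S.mean()) / (np.sum(R ** 2) - n * R.mean() ** 2)
    b0 = S.mean() - b1 * R.mean()
    return b0, b1
```

A fitted line with b₁ near one and b₀ near zero over the qqplot quantiles indicates that the two data sets agree, as argued in Section IV.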
V. NETWORK TRAFFIC DATA

Network traffic data refers to all the data transmitted over the network. The analysis of network traffic data is very important for network management, network optimization, and the detection of abnormal traffic. Network traffic data are massive and contain a lot of skewed data, such as the online hours of Internet users, the number of websites a user visits, and page views.

In this paper, we model and analyze these skewed data sets with the DGX model discussed above. The previous chapter showed the superiority of DGX in modeling skewed data. Moreover, the raw data are passed over only once during modeling, which means DGX has low time complexity, so it is suitable for modeling large-scale data such as network traffic data.

A. Page Views

The data set comes from network traffic data. In this section, we apply DGX to page views. The computed parameters are μ = −10.8381 and σ = 6.4008.

From Fig. 4, we can directly observe that the distributions of the two data sets are similar. In the qqplot, the slope of the fitted line is 0.9850 and the correlation coefficient of the quantiles is 0.9980. Thus, we can say that DGX performs very well in modeling this data set.

Obviously, page views are skewed data. Most websites receive very few page views, while a small number of very popular sites account for a large portion of requests. The relationship between the percentage of websites and the percentage of requests can be easily computed with DGX, which may be very important for Internet caching.

A major problem for Internet service providers is how to meet the need for quick access to Internet content while maintaining quality of service. Caching is the most important way to solve this problem, and research on the distribution of page views is of great benefit to it [11].

For example, suppose a US website is especially popular in China. Each request from China must cross the Pacific Ocean to reach the United States server, and the response must travel the same distance back to the user. The whole process obviously takes a lot of time, which is intolerable for users.

To minimize such long-distance transmission, network service providers place proxy servers all over the world. When a user tries to access a file, the nearest proxy server checks whether the file has been cached. If not, the proxy server sends a request to the server in the US, receives the response containing the requested file, and stores the file in its cache. If someone requests the same file again, the proxy server does not have to contact the US server; it simply serves the file from the cache. This not only reduces the waiting time for users, but also decreases the amount of data transmitted over the network, conserving valuable bandwidth.

However, a cache has a finite size, so measuring the relationship between cache size and hit ratio is very relevant. Here the DGX model can be used to analyze the requests. For example, Table I shows that if we expect the hit ratio to be above 50%, 11.0% of files have to be cached, while 16.6% of files should be cached if we expect the hit ratio to be above 90%. Thus the DGX distribution can provide guidance for cache design and for optimizing the trade-off between cache size and hit ratio.

TABLE I. PERCENTAGE OF WEBSITES AND REQUESTS

    Percentage of websites    Percentage of requests
    11.0%                     50%
    14.0%                     80%
    16.6%                     90%
    28.6%                     99%

Figure 4. Count-frequency plot and qqplot of page views
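The kind of calculation behind Table I can be sketched as follows. The sketch below is our own illustration, not taken from the paper: it assumes site popularity follows a DGX distribution, weights each popularity level k by the number of requests it contributes, and truncates the support at an assumed k_max.

```python
import numpy as np

def cache_fraction(mu, sigma, target_hit_ratio, k_max=100000):
    """Fraction of distinct sites (most popular first) that must be cached
    to reach a target hit ratio, assuming DGX-distributed popularity."""
    ks = np.arange(1, k_max + 1)
    p = np.exp(-(np.log(ks) - mu) ** 2 / (2 * sigma ** 2)) / ks
    p /= p.sum()                      # P(a site receives k requests), truncated
    req = ks * p
    req /= req.sum()                  # share of all requests from sites with k requests
    site_frac = np.cumsum(p[::-1])    # fraction of sites with at least k requests
    hit_frac = np.cumsum(req[::-1])   # fraction of requests those sites account for
    i = np.searchsorted(hit_frac, target_hit_ratio)
    return float(site_frac[i])
```

Caching the most-requested sites first, the returned fraction grows with the target hit ratio, which is the shape of the trade-off shown in Table I.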

Figure 5. Count-frequency plot and qqplot of requests

B. Number of Requests

The raw data are the same as in the previous sections; here we study the user requests. The computed DGX parameters are μ = 5.1469 and σ = 2.0923.

From Fig. 5, we can see that this data set clearly deviates from Zipf's law. In the qqplot, the slope of the fitted line is 1.0001 and the correlation coefficient of the quantiles is 0.9986. Therefore, we can say that the DGX model fits this data set well while Zipf's law does not.

Analyzing surfing behavior, such as the number of requests per user, is very important for advertising. We can model the number of requests per day with DGX and record the DGX parameters. After a period of time, we can use these parameters to simulate the data of those days and analyze the change in users' requests. The greatest advantage of this approach is that we do not need to store all the traffic data, only the parameters of the daily model. Storing huge amounts of data costs a lot of money, and with the DGX model we can approximately rebuild the original data. Although a small part of the information may be lost, this is worthwhile compared with the money saved.

As can be seen from the previous sections, DGX can be applied in advertising, data storage and other areas.

VI. CONCLUSION

The DGX model performs very well in modeling skewed data, and it also shows its superiority in modeling network traffic data. It not only fits the data accurately but also has low time complexity. Zipf's law can be regarded as a special case of the DGX model; therefore, data that Zipf's law fits well can also be modeled with the DGX model, and DGX can model some data that Zipf's law cannot fit well. At present, Zipf's law is applied in various fields, which raises a question: can the DGX model be applied in more areas? We believe that DGX will show its power in various fields in the future.

ACKNOWLEDGMENT

This work is supported in part by the National Natural Science Foundation of China (61671078), Funds of Beijing Laboratory of Advanced Information Networks of BUPT, Funds of Beijing Key Laboratory of Network System Architecture and Convergence of BUPT, and the 111 Project of China (B08004).

REFERENCES

[1] Newman M E J. Power laws, Pareto distributions and Zipf's law[J]. Contemporary Physics, 2005, 46(5): 323-351.
[2] Faloutsos C, Matias Y, Silberschatz A. Modeling skewed distributions using multifractals and the '80-20' law[J]. 1996.
[3] Powers D M W. Applications and explanations of Zipf's law[C]//Proceedings of the Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 1998: 151-160.
[4] Muthukrishnan S. Data streams: Algorithms and applications[J]. Foundations and Trends in Theoretical Computer Science, 2005, 1(2): 117-236.
[5] Crovella M E, Bestavros A. Self-similarity in World Wide Web traffic: evidence and possible causes[J]. IEEE/ACM Transactions on Networking, 1997, 5(6): 835-846.
[6] Bi Z, Faloutsos C, Korn F. The DGX distribution for mining massive, skewed data[C]//Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2001: 17-26.
[7] Gabaix X. Zipf's law and the growth of cities[J]. The American Economic Review, 1999, 89(2): 129-132.
[8] Obadage A S, Harnpornchai N. Determination of point of maximum likelihood in failure domain using genetic algorithms[J]. International Journal of Pressure Vessels & Piping, 2006, 83(4): 276-282.
[9] Okuyama K, Takayasu M, Takayasu H. Zipf's law in income distribution of companies[J]. Physica A: Statistical Mechanics and its Applications, 1999, 269(1): 125-131.
[10] A. Clauset, C. R. Shalizi, and M. E. J. Newman. (2009) Open dataset. Available: http://tuvalu.santafe.edu/~aaronc/powerlaws/data.htm
[11] Adamic L A, Huberman B A. Zipf's law and the Internet[J]. Glottometrics, 2001, 3.

