
A Markov Chain-based Availability Model of Virtual Cluster Nodes

Jianhua Che
Information & Network Security Key Lab. of State Grid
State Grid Electric Power Research Institute
Nanjing, China
chejianhua@zju.edu.cn
Weimin Lin
Information & Network Security Key Lab. of State Grid
State Grid Electric Power Research Institute
Nanjing, China
linweimin@sgepri.sgcc.com.cn
Tao Zhang
Information & Network Security Key Lab. of State Grid
State Grid Electric Power Research Institute
Nanjing, China
zhangtao@sgepri.sgcc.com.cn
Houwei Xi
Information & Network Security Key Lab. of State Grid
State Grid Electric Power Research Institute
Nanjing, China
xihouwei@sgepri.sgcc.com.cn

Abstract-Benefiting from virtualization technology, a virtual
cluster system offers many advantages over a traditional
cluster system. However, efficient methods for analyzing the
availability of virtual cluster systems are still lacking, and
the availability analysis of a virtual cluster node is the basis
for analyzing the availability of the whole system. In this
paper, we summarize a typical architecture paradigm of virtual
cluster nodes by studying the overall architecture of virtual
cluster systems and the deployment style of virtual cluster
nodes: two active virtual cluster nodes are built on one
physical machine and their standby virtual cluster nodes are
built on another physical machine. We give the state transition
diagram of this paradigm by analyzing the complete lifecycle of
a virtual cluster node and the transition conditions between
node states, and we design a Markov chain-based availability
model for it. The model characterizes the lifecycle states and
state transitions of a virtual cluster node and provides an
efficient way to assess the availability level of each virtual
cluster node in a complicated virtual cluster system or cloud
data center. Finally, numerical simulation experiments
demonstrate the practicability of the proposed model.
Keywords-virtual cluster node; availability modeling; Markov
Chain; virtualization
I. INTRODUCTION
The resurgence of the virtual machine (VM) [13] provides a
highly efficient solution for many IT service demands and
important business applications, e.g., cloud computing [10]
and the Internet Data Center (IDC) [14]. As virtual machines
have become widespread, they have been introduced into
traditional cluster systems, giving rise to the virtual cluster
system. A Virtual Cluster System (VCS) is a cluster system that
installs cluster nodes in virtual machines and manages them
with virtualization technology [6]. Compared with a traditional
cluster system, a virtual cluster system offers higher resource
utilization, lower standby cost, simpler management and higher
availability [11]. However, it also has several drawbacks: for
example, the virtual machine monitor (VMM) [9] introduces
instability, and a virtual cluster node can become a Single
Point of Failure (SPOF) [2]. How to evaluate the availability
of a virtual cluster system is becoming a focus of many
researchers, and the availability of a virtual cluster node is
the basis for evaluating the availability of the whole system.
Because virtual machines differ from physical machines in many
respects, availability models for traditional cluster nodes are
not suited to virtual cluster nodes. It is therefore necessary
to model the availability of virtual cluster nodes.
In this paper, we proposed a Markov Chain-based model
for analyzing the availability level of virtual cluster nodes,
and validated its practicability with numerical simulation
experiments. Specifically, the contribution of this paper is as
follows:
First, we summarized a typical architecture paradigm of
virtual cluster nodes by studying the overall architecture of
virtual cluster systems and the deployment style of virtual
cluster nodes;
Second, we described the state transition diagram of this
kind of typical virtual cluster nodes by analyzing their
complete lifecycle state and the transition conditions of
different lifecycle states;
Third, we designed an availability model based on the
Markov Chain theory to analyze the availability level of
virtual cluster nodes in a virtual cluster system or cloud data
center.
The rest of this paper is organized as follows: We begin
in Section 2 with related work. Then, we introduce the
proposed Markov Chain-based availability model for one
typical paradigm of virtual cluster nodes in Section 3.
Furthermore, we validate the practicability of the proposed
model with several numerical simulation experiments in
Section 4. Finally, we conclude with discussion in Section 5.
II. RELATED WORK
Although the availability evaluation of traditional computer
systems has been studied extensively, efficient methods for
evaluating the availability of virtual cluster systems are
still lacking. Johnson and Malek [7] gave an early survey of
the availability analysis models and evaluation tools for
traditional cluster systems, and introduced the elements of
availability models (including fault rate, recovery time and
service cost) and the analysis models (including fault tree,
reliability diagram, Markov chain and stochastic Petri nets).
After introducing some basic concepts of availability, Alan
Wood [1] explored the use of Markov models in availability
analysis.
2011 Seventh International Conference on Computational Intelligence and Security
978-0-7695-4584-4/11 $26.00 2011 IEEE
DOI 10.1109/CIS.2011.118
Regarding the lifecycle states and availability level of
virtual machines, Farr et al. [4] extended the virtual machine
state types in the DMTF specification with two kinds of
states: Active and Inactive. The Active state includes
Operational and Gemini, and the Inactive state includes
Planned and Unplanned. Hence, there are six kinds of virtual
machine states: Latent, Defined, Active, Inactive, Paused and
Suspend. Le et al. [8] studied fault injection in virtual
machine systems and the use of virtual machines for fault
injection in traditional computer systems. Qin and Xie [12]
analyzed the mutual impact between the scheduling of multiple
applications and availability in a heterogeneous system, and
modeled all nodes of a heterogeneous system according to the
computing power and availability data of each node. Cully et
al. [3] presented a general high-availability service framework
built on virtual machines and developed a prototype system,
Remus. Fischer and Mitasch [5] summarized the availability
problems of virtual machine systems and gave a node
architecture scheme that can increase the availability of a
virtual cluster system.
Thandar Thein, Jong Sou Park et al. [14] optimized the
rejuvenation process and enhanced the fault tolerance of
computer systems with virtualization and software
rejuvenation, designed a framework that can increase the
survivability of a distributed system, and clarified the
relation between the availability of a virtual machine system
and the number of backup virtual machines. Building on that
work, they evaluated the availability of virtual cluster
systems using software rejuvenation [17], formulated a
multiple-virtual-machine system with a state transition
diagram, and verified their work with numerical simulation
experiments [16]. The work in this paper builds on Thandar
Thein's work, but differs from it in model semantics and
parameter definitions.
III. MARKOV CHAIN-BASED AVAILABILITY MODEL OF
VIRTUAL CLUSTER NODES
Availability may be analyzed with many models, for example
fault trees, reliability block diagrams, Markov chains and
stochastic Petri nets. The Markov chain was first proposed by
the Russian mathematician Andrey Markov in 1907, and is often
used to model the availability of fault-tolerant computer
systems, dynamically redundant computer systems, and computer
systems with sequence-dependent faults and recovery. In the
Markov chain model, the stochastic process of a running
virtual cluster node is represented by a series of state
transitions. Each state of a virtual cluster node is a vertex
of the state diagram, each transition between states is an
edge, and the conditional probability of the transition is the
weight of that edge. The conditional probability of a virtual
cluster node moving from one state to the next is determined
only by the current state and is independent of earlier
states. Important solution methods for Markov chain models
include the steady-state solution, the transient solution and
the decomposition method.
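As a minimal illustration of the Markov property and of the
steady-state solution mentioned above, the following Python
sketch iterates a small discrete-time chain until its state
distribution stops changing. The three states and all
transition probabilities are assumed values chosen only for
illustration; they are not parameters of the model in this
paper.

```python
# Hypothetical three-state chain; every probability below is an assumed value.
STATES = ["Normal", "Unsteady", "Failure"]

# P[i][j] = probability of moving from state i to state j in one step.
# Each row depends only on the current state (Markov property) and sums to 1.
P = [
    [0.90, 0.10, 0.00],  # from Normal
    [0.70, 0.20, 0.10],  # from Unsteady
    [0.50, 0.00, 0.50],  # from Failure
]


def step(dist, matrix):
    """Advance a state distribution by one transition: new = dist * matrix."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def steady_state(matrix, tol=1e-12, max_iter=100_000):
    """Iterate from a uniform start until the distribution stops changing."""
    n = len(matrix)
    dist = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = step(dist, matrix)
        if max(abs(a - b) for a, b in zip(dist, nxt)) < tol:
            return nxt
        dist = nxt
    raise RuntimeError("steady state not reached within max_iter steps")


pi = steady_state(P)
for name, prob in zip(STATES, pi):
    print(f"{name}: {prob:.4f}")
```

For these assumed probabilities the stationary distribution can
also be found by hand from pi = pi * P, which gives 40/46, 5/46
and 1/46 for Normal, Unsteady and Failure; the iteration should
agree with those fractions.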
A. One Typical Paradigm Of Virtual Cluster Nodes
In a traditional cluster system, all nodes are built on
physical machines, and almost every active node has a
corresponding standby node to improve its availability. In a
virtual cluster system, however, many or even all nodes, both
active and standby, may be built in virtual machines. One
typical paradigm (we call it 2VNs/2PMs) is that two active
nodes are built in two virtual machines residing on one
physical machine, and their standby nodes are built in two
other virtual machines residing on another physical machine,
as shown in Figure 1.

Figure 1. One typical fundamental paradigm of virtual cluster nodes
This paradigm not only increases the resource utilization of
virtual cluster nodes, but also reduces their standby cost. It
also relieves the Single Point of Failure (SPOF) problem to a
certain extent. Hence, this paradigm is widely used.
B. Basic Concept and Prerequisite Condition
During the complete lifecycle of a virtual cluster node, there
are usually five main states: Normal, Unsteady, Rejuvenation,
Switchover and Failure. Normal means that the node is working
normally. Unsteady means that the node is in an abnormal,
unstable stage in which its service is still available but
with degraded performance. Rejuvenation means that the node is
in transition from the Unsteady state back to the Normal
state. Switchover means that a node in the Unsteady state is
switching to its standby node because of unrecoverable faults.
Failure means that the node has stopped working because of a
failure.
A virtual cluster node in the Normal state may enter the
Unsteady state after running for some time. A node in the
Unsteady state has three possible next states: Rejuvenation,
Switchover and Failure. A recoverable node enters the
Rejuvenation state and returns to the Normal state through
software rejuvenation. An unrecoverable node enters the
Switchover state, migrates its run-time context and
application workload to the standby virtual cluster node by
live migration, and then enters the Failure state. A node that
has no time to migrate its run-time context and application
workload, because of sudden unexpected events, goes directly
into the Failure state. A node in the Switchover state
eventually fails; its service is then provided by its standby
virtual cluster node, which has the same states and state
transitions. The state transition diagram of virtual cluster
nodes is shown in Figure 2.

Figure 2. The state transition of virtual cluster nodes
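The lifecycle described above can be written down as a small
transition table. The Python sketch below is only an
illustration: the dictionary layout and the helper names
is_valid_path and reachable are hypothetical, and the Failure
to Normal edge is taken from the parameter definitions (the
frequency of a VCN changing from Failure to Normal state)
rather than from the paragraph above.

```python
# Allowed lifecycle transitions of one virtual cluster node.
# The Failure -> Normal repair edge follows the parameter definitions.
TRANSITIONS = {
    "Normal": {"Unsteady"},
    "Unsteady": {"Rejuvenation", "Switchover", "Failure"},
    "Rejuvenation": {"Normal"},
    "Switchover": {"Failure"},
    "Failure": {"Normal"},
}


def is_valid_path(path):
    """Return True if every consecutive pair in path is an allowed transition."""
    return all(b in TRANSITIONS[a] for a, b in zip(path, path[1:]))


def reachable(start):
    """Return the set of states reachable from start (including start)."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in TRANSITIONS[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# A rejuvenation cycle is allowed; a direct Normal -> Failure jump is not,
# because the model neglects that transition.
print(is_valid_path(["Normal", "Unsteady", "Rejuvenation", "Normal"]))
print(is_valid_path(["Normal", "Failure"]))
print(sorted(reachable("Normal")))
```

Keeping the edges in one table makes it easy to check that
every state can be reached from Normal and that no transition
outside the diagram is used.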
Based on the above analysis, we define the model parameters as
shown in Table 1. According to these definitions, the
rejuvenation time of a virtual cluster node in the
Rejuvenation state is 1/, and the switchover time needed for a
virtual cluster node in the Switchover state to migrate to its
standby node is 1/. These model parameters are subject to the
following prerequisite conditions and hypotheses:
TABLE I. THE DEFINITION OF MARKOV CHAIN-BASED AVAILABILITY
MODEL PARAMETERS
Parameter   Definition
N           The time ratio of a VCN staying in Normal state
U           The time ratio of a VCN staying in Unsteady state
R           The time ratio of a VCN staying in Rejuvenation state
S           The time ratio of a VCN staying in Switchover state
F           The time ratio of a VCN staying in Failure state
            The frequency of a VCN changing from Normal to Unsteady state
            The probability of a VCN changing from Unsteady to Normal state
            The probability of a VCN changing from Rejuvenation to Normal state
            The probability of a VCN migrating from active to standby node
1/          The frequency of a VCN migrating from Switchover state to standby node
            The probability of a VCN changing from Unsteady to Failure state
            The frequency of a VCN changing from Failure to Normal state
1) The seven transition parameters defined above are the same
for all virtual cluster nodes and are steady;
2) Compared with the other probabilities, the probability of a
virtual cluster node moving directly from the Normal state to
the Failure state is negligible;
3) A virtual cluster node can still provide continuous service
during the rejuvenation process.
C. State Transition Diagram and Analysis Model
The two physical machines each host two virtual cluster nodes,
and one physical machine is the standby of the other. When an
active virtual cluster node hosted by the active physical
machine fails, its run-time context and application workload
are migrated to the standby virtual cluster node hosted by the
standby physical machine. When all virtual cluster nodes
hosted by a physical machine fail, the physical machine fails.
The state transition diagram of the 2VNs/2PMs paradigm is
shown in Figure 3. Here, N denotes the Normal state, U the
Unsteady state, R the Rejuvenation state, S the Switchover
state, and F the Failure state. The suffixes 1 and 2 give the
number of virtual cluster nodes hosted by the same physical
machine, the suffix A means that the virtual cluster node is
an active one, and the suffix S means that it is a standby
one.

Figure 3. The state transition diagram of the 2VNs/2PMs paradigm
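According to the hypotheses in the previous section, the
balance equations of the state transition diagram are as
follows. Here N_{1A} is the time ratio of state N with suffix
1A (and likewise for the other states), and each \square
stands for a coefficient formed from the transition parameters
of Table I whose symbol is not legible here:
\begin{align}
\square\, N_{1A} &= (\square + \square + \square)\, U_{1A} \tag{1} \\
\square\, U_{1A} &= \square\, R_{1A} \tag{2} \\
\square\, N_{1S} &= \square\, R_{1S} + \square\, S_{1A} + \square\, F_{S} \tag{3} \\
\square\, N_{1S} &= (\square + \square + \square)\, U_{1S} \tag{4} \\
\square\, U_{1S} &= \square\, R_{1S} \tag{5} \\
\square\, U_{1S} &= \square\, S_{1S} \tag{6} \\
\square\, N_{2A} &= \square\, U_{1A} + \square\, R_{2A} + \square\, S_{2S} \tag{7} \\
\square\, N_{2A} &= (\square + \square + \square)\, U_{2A} \tag{8} \\
\square\, U_{2A} &= \square\, R_{2A} \tag{9} \\
\square\, U_{2A} &= \square\, S_{2A} \tag{10} \\
\square\, N_{2S} &= \square\, U_{1S} + \square\, R_{2S} + \square\, S_{2A} \tag{11} \\
\square\, N_{2S} &= (\square + \square + \square)\, U_{2S} \tag{12} \\
\square\, U_{2S} &= \square\, R_{2S} \tag{13} \\
\square\, U_{2S} &= \square\, S_{2S} \tag{14} \\
\square\, U_{2A} &= \square\, F_{A} \tag{15} \\
\square\, U_{2S} &= \square\, F_{S} \tag{16}
\end{align}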
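Summing all state probabilities gives the conservation
equation of the state transition diagram, where N_{iA},
U_{iA}, and so on denote the time ratios of the corresponding
states:
\begin{align}
&\sum_{i=1}^{2} N_{iA} + \sum_{i=1}^{2} U_{iA} + \sum_{i=1}^{2} R_{iA} + \sum_{i=1}^{2} S_{iA} + \sum_{i=1}^{2} N_{iS} \notag \\
&\quad + \sum_{i=1}^{2} U_{iS} + \sum_{i=1}^{2} R_{iS} + \sum_{i=1}^{2} S_{iS} + F_{A} + F_{S} = 1 \tag{17}
\end{align}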
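To show how balance equations and a conservation equation
combine into state probabilities, the following Python sketch
solves a deliberately simplified chain: one node with the five
lifecycle states only, not the full 2VNs/2PMs diagram. The
function name solve_single_node, the rate names and every
numeric value are assumptions made for illustration, with all
rates expressed per hour.

```python
def solve_single_node(lam_u, r_ur, r_us, r_uf, mu_r, mu_s, mu_f):
    """Steady-state time ratios of a simplified five-state node.

    Balance equations used (flow out of a state = flow into it):
      Unsteady:      lam_u * N = (r_ur + r_us + r_uf) * U
      Rejuvenation:  r_ur * U  = mu_r * R
      Switchover:    r_us * U  = mu_s * S
      Failure:       r_uf * U + mu_s * S = mu_f * F
    Conservation:    N + U + R + S + F = 1
    """
    n = 1.0                                # unnormalised weight of Normal
    u = lam_u * n / (r_ur + r_us + r_uf)   # from the Unsteady balance
    r = r_ur * u / mu_r                    # from the Rejuvenation balance
    s = r_us * u / mu_s                    # from the Switchover balance
    f = (r_uf * u + mu_s * s) / mu_f       # from the Failure balance
    total = n + u + r + s + f              # conservation: rescale to sum 1
    return {k: v / total for k, v in zip("NURSF", (n, u, r, s, f))}


# Hypothetical rates per hour, chosen only to exercise the function.
RATES = {
    "lam_u": 0.003,   # Normal -> Unsteady
    "r_ur": 0.75,     # Unsteady -> Rejuvenation
    "r_us": 0.24,     # Unsteady -> Switchover
    "r_uf": 0.01,     # Unsteady -> Failure
    "mu_r": 6.0,      # Rejuvenation -> Normal
    "mu_s": 600.0,    # Switchover -> Failure (migration completes)
    "mu_f": 0.5,      # Failure -> Normal (repair)
}

probs = solve_single_node(**RATES)
for state, value in probs.items():
    print(f"{state}: {value:.6f}")
print("availability:", 1.0 - probs["S"] - probs["F"])
```

The pattern is to express every state in terms of one
reference state using the balance equations and then to
normalise with the conservation equation.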



Combining the balance equations (1) to (16) with the
conservation equation (17) gives the following expressions of
the state probabilities, where each \square stands for an
expression in the Table I parameters that is not legible here:
\begin{align}
U_{1A} &= \square\, U_{1S} \tag{18} \\
U_{2A} &= \square\, U_{1S} \tag{19} \\
U_{2S} &= \square\, U_{1S} \tag{20}
\end{align}
Furthermore, we can obtain the closed-form expression of the
virtual cluster node availability model for the 2VNs/2PMs
paradigm as follows:
\begin{align}
U_{1S} &= \square \tag{21}
\end{align}

As a virtual cluster node in the 2VNs/2PMs paradigm is
unavailable in the Switchover and Failure states, the
steady-state availability of a virtual cluster node in the
2VNs/2PMs paradigm is:
\[
A = \lim_{t \to \infty} A(t) = 1 - (S_{1A} + S_{1S} + S_{2A} + S_{2S} + F_{A} + F_{S})
\]
So the downtime in a given time interval L is:
\[
DT(L) = (S_{1A} + S_{1S} + S_{2A} + S_{2S} + F_{A} + F_{S}) \cdot L
\]
And the cost of downtime is:
\[
C(L) = (C_{S_{1A}} S_{1A} + C_{S_{1S}} S_{1S} + C_{S_{2A}} S_{2A} + C_{S_{2S}} S_{2S} + C_{F_{A}} F_{A} + C_{F_{S}} F_{S}) \cdot L
\]
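The three quantities above follow directly from the time
ratios of the Switchover and Failure states. The Python sketch
below evaluates them; the state probabilities, the per-hour
unit costs and the one-year interval are hypothetical inputs,
and the unit costs are read here as the cost per unit time of
staying in each state, which is an interpretation rather than
something the text states.

```python
# Switchover (S) and Failure (F) states of the 2VNs/2PMs paradigm.
UNAVAILABLE_STATES = ("S_1A", "S_1S", "S_2A", "S_2S", "F_A", "F_S")


def unavailability(probs):
    """Sum of the time ratios of all unavailable states."""
    return sum(probs[s] for s in UNAVAILABLE_STATES)


def availability(probs):
    """Steady-state availability: 1 minus the unavailable time ratio."""
    return 1.0 - unavailability(probs)


def downtime(probs, interval):
    """Expected downtime within an interval of the given length."""
    return unavailability(probs) * interval


def downtime_cost(probs, unit_costs, interval):
    """Downtime cost: each state's time ratio weighted by its unit cost."""
    return sum(unit_costs[s] * probs[s] for s in UNAVAILABLE_STATES) * interval


# Hypothetical inputs for illustration only.
probs = {"S_1A": 2e-5, "S_1S": 1e-5, "S_2A": 2e-5,
         "S_2S": 1e-5, "F_A": 3e-5, "F_S": 1e-5}
unit_costs = {state: 100.0 for state in UNAVAILABLE_STATES}  # cost per hour
HOURS_PER_YEAR = 365 * 24

print(f"availability: {100 * availability(probs):.4f}%")
print(f"downtime: {downtime(probs, HOURS_PER_YEAR):.3f} hours/year")
print(f"cost: {downtime_cost(probs, unit_costs, HOURS_PER_YEAR):.2f}")
```

With a uniform unit cost the cost is simply the downtime times
that cost; separate costs per state matter when a failure is
more expensive than a switchover.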



IV. ANALYSIS AND VALIDATION OF THE PROPOSED
MODEL
As precise values of the seven model parameters are hard to
measure, we choose six tuples of average model parameter
values from 10,00 real data tuples, as shown in Table 2, to
represent six different availability levels, and analyze the
availability of the 2VNs/2PMs paradigm using the proposed
availability model. All real data comes from one of our web
servers. In the experiments, every model parameter has a
default value, listed here in the order of Table I: 2
times/month (Normal to Unsteady), 75% (Unsteady to Normal), 6
times/hour (Rejuvenation to Normal), 24% (migration from
active to standby node), a switchover time of 6 seconds, 2
times/year (Unsteady to Failure) and 2 times/month (Failure to
Normal). When one model parameter is varied over its six
values in Table 2, the other model parameters are kept at
their default values.
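The default values are quoted in different time units, so they
need a common unit before they can be used together in one
model. The Python sketch below converts them to rates per
hour. The 30-day month and 365-day year are assumptions, and
the dictionary keys are hypothetical labels assigned by
matching the order of the defaults to the order of the
parameter definitions in Table I.

```python
# Calendar assumptions (not stated in the paper).
HOURS_PER_MONTH = 30 * 24
HOURS_PER_YEAR = 365 * 24
SECONDS_PER_HOUR = 3600


def per_month_to_per_hour(times_per_month):
    """Convert a frequency in times/month to times/hour."""
    return times_per_month / HOURS_PER_MONTH


def per_year_to_per_hour(times_per_year):
    """Convert a frequency in times/year to times/hour."""
    return times_per_year / HOURS_PER_YEAR


def mean_seconds_to_per_hour(mean_seconds):
    """A mean duration of mean_seconds corresponds to this hourly rate."""
    return SECONDS_PER_HOUR / mean_seconds


# Default values quoted in the text, with hypothetical descriptive keys.
DEFAULT_RATES_PER_HOUR = {
    "normal_to_unsteady": per_month_to_per_hour(2),
    "rejuvenation_to_normal": 6.0,
    "switchover": mean_seconds_to_per_hour(6),
    "unsteady_to_failure": per_year_to_per_hour(2),
    "failure_to_normal": per_month_to_per_hour(2),
}
DEFAULT_PROBABILITIES = {"unsteady_to_normal": 0.75, "active_to_standby": 0.24}

for name, rate in DEFAULT_RATES_PER_HOUR.items():
    print(f"{name}: {rate:.6g} per hour")
```

Expressed this way, the switchover rate is several orders of
magnitude larger than the failure rate, which is why the time
spent in the Switchover state stays small.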
TABLE II. SIX TUPLES OF AVERAGE MODEL PARAMETER VALUES IN
THE EXPERIMENTS
Model parameter (rows in the order of Table I)   1st   2nd   3rd   4th   5th   6th tuple
Normal to Unsteady (times/month)                 1     2     3     4     5     6
Unsteady to Normal                               60%   65%   70%   75%   80%   85%
Rejuvenation to Normal (times/hour)              6     10    12    15    20    30
Active to standby migration                      30%   25%   20%   15%   10%   5%
Switchover time 1/ (seconds)                     4     6     10    30    60    300
Unsteady to Failure (times/year)                 1     2     3     4     5     6
Failure to Normal (times/month)                  1     2     4     8     30    60
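The experimental procedure varies one parameter at a time while
the others stay at their defaults. The Python sketch below
shows that procedure in a generic form. The helper
one_at_a_time is hypothetical, and two_state_availability is
only a stand-in model (the standard availability of a single
repairable unit, mu / (lambda + mu)), not the 2VNs/2PMs model
of this paper; the swept values are the times/year and the
second times/month rows of Table 2.

```python
def two_state_availability(params):
    """Stand-in model: single repairable unit, availability = mu / (lam + mu)."""
    lam = params["failures_per_year"] / (365 * 24)   # failures per hour
    mu = params["repairs_per_month"] / (30 * 24)     # repairs per hour
    return mu / (lam + mu)


def one_at_a_time(model, defaults, sweeps):
    """Vary each parameter over its values while the others keep their defaults."""
    results = {}
    for name, values in sweeps.items():
        rows = []
        for value in values:
            params = dict(defaults)   # copy so the defaults stay untouched
            params[name] = value
            rows.append((value, model(params)))
        results[name] = rows
    return results


DEFAULTS = {"failures_per_year": 2, "repairs_per_month": 2}
SWEEPS = {
    "failures_per_year": [1, 2, 3, 4, 5, 6],
    "repairs_per_month": [1, 2, 4, 8, 30, 60],
}

results = one_at_a_time(two_state_availability, DEFAULTS, SWEEPS)
for name, rows in results.items():
    for value, avail in rows:
        print(f"{name}={value}: {100 * avail:.4f}%")
```

Any availability function that accepts a parameter dictionary
can be passed in place of the stand-in model, so the same loop
would serve for the full model once its equations are coded.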
The relation between the seven model parameters and the
availability of the 2VNs/2PMs paradigm is illustrated in
Figures 4 and 5. In Figure 4(a), the availability of the
2VNs/2PMs paradigm increases as the plotted parameter
increases, and the availability of the 1VN/1PM paradigm (used
as a benchmark) increases even more, because a larger value
means that, in the same interval, the probability of a virtual
cluster node entering the Rejuvenation state increases and the
probability of it entering the Failure state decreases. As the
1VN/1PM paradigm has no standby virtual machine to switch to
on failure, its availability is strongly influenced by the
failure rate, but not by the parameters that govern
switchover, such as the switchover time.
[Figure 4, panels (a) to (d): availability (%) of the 1VN/1PM and 2VNs/2PMs paradigms plotted against individual model parameters; the horizontal axis of panel (d) is the switchover time in seconds.]

Figure 4. The relation between the model parameters and the availability
of virtual cluster node
According to Figure 4(b), the availability of the 2VNs/2PMs
paradigm degrades as the plotted parameter increases, but to a
lesser extent than that of the 1VN/1PM paradigm because of its
higher rejuvenation rate. As Figure 4(c) shows, the
availability of the 2VNs/2PMs paradigm degrades as the
probability of virtual cluster nodes entering the Switchover
state increases. This is mainly because the switchover between
an active virtual cluster node and its standby virtual cluster
node makes the node unavailable and degrades its availability.
From Figure 4(d), we can see that the availability of virtual
cluster nodes degrades as the switchover time (i.e., the
downtime in the live migration process) increases.
[Figure 5, panels (e) to (g): availability (%) of the 1VN/1PM and 2VNs/2PMs paradigms plotted against model parameters measured in times/year (e), times/month (f) and times/month (g).]

Figure 5. The relation between the model parameters and the availability
of virtual cluster node
According to Figure 5(e), the availability of the 2VNs/2PMs
paradigm degrades as the plotted parameter increases, and the
availability of the 1VN/1PM paradigm degrades more. As Figure
5(f) shows, the availability of the 2VNs/2PMs paradigm
degrades as the plotted parameter increases, and is influenced
to a lesser extent. From Figure 5(g), we can see that the
availability of the 2VNs/2PMs paradigm increases as the
plotted parameter increases, and the availability of the
1VN/1PM paradigm increases by a larger amount.
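The availability levels compared in this section are
percentages very close to 100%, so small differences are
easier to judge as downtime. The Python sketch below converts
an availability percentage into expected downtime per year;
the helper name and the 365-day year are assumptions, and the
sample percentages are illustrative values in the range
discussed above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year


def downtime_minutes_per_year(availability_percent):
    """Expected downtime per year implied by an availability percentage."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


# Illustrative availability levels.
for level in (99.94, 99.995, 99.999):
    minutes = downtime_minutes_per_year(level)
    print(f"{level}% -> {minutes:.2f} minutes/year")
```

An availability of 99.995% corresponds to roughly 26 minutes
of downtime per year under these assumptions.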
V. CONCLUSION AND FUTURE WORK
The availability evaluation of virtual cluster systems is an
important issue for their promotion and application, and the
availability evaluation of virtual cluster nodes is its basis.
This paper summarized one typical architecture paradigm of
virtual cluster nodes by analyzing the overall architecture of
virtual cluster systems and the deployment style of virtual
cluster nodes, gave the state transition diagram of this
paradigm by studying the complete lifecycle of virtual cluster
nodes and the transition conditions between node states, and
proposed an availability model based on Markov chain theory to
support the availability analysis of virtual cluster nodes and
virtual cluster systems. Finally, numerical simulation
experiments demonstrated the practicability of the proposed
availability model. In the future, we will study the relation
between the availability of virtual cluster nodes and that of
the virtual cluster system.
ACKNOWLEDGMENT
This work is supported by the State Key Development
Program for Basic Research of China ("973 project",
No.2007CB310900) and the 2010 Annual Funding Project of
Baoding Association of society and Science (No.20100309).
The authors thank Prof. Qinming He of Zhejiang University for
his helpful advice.
REFERENCES
[1] Alan Wood. Availability modeling: understanding Markov models to
calculate system reliability. Circuits & Devices, pp.22-27, 1994.
[2] M. Aldinucci, M. Danelutto, and M. Torquati, and F. Polzella, and G.
Spinatelli, and M. Vanneschi, and A. Gervaso, and M. Cacitti, and P.
Zuccato. VirtuaLinux: virtualized high-density clusters with no single
point of failure. Proc. of the Int. Conference ParCo2007. Vol. 38,
pp.355-362, 2007.
[3] B. Cully, and G. Lefebvre, and D. Meyer, and M. Feeley, and N.
Hutchinson, and A. Warfield. Remus: High Availability via
Asynchronous Virtual Machine Replication. Proceedings of the 5th
USENIX Symposium on Networked Systems Design and
Implementation. San Francisco, California, pages 161-174, 2008.
[4] E. Farr, and R. Harper, and L. Spainhower, and J. Xenidis. A Case for
High Availability in a Virtualized Environment (HAVEN).
Proceedings of the 2008 Third International Conference on
Availability, Reliability and Security, 675-682, 2008.
[5] W. Fischer, and C. Mitasch. High availability clustering of virtual
machines-possibilities and pitfalls. Paper for the talk at the 12th
Linuxtag, Wiesbaden/Germany, May 3rd-6th, 2006.
[6] I. Foster, and T. Freeman, and K. Keahey, and D. Scheftner, and B.
Sotomayor, and X. Zhang. Virtual clusters for grid communities.
Proceedings of the Sixth IEEE International Symposium on Cluster
Computing and the Grid. pages 513-520, 2006.
[7] A.M. Jr. Johnson, and M. Malek. Survey of software tools for
evaluating reliability, availability, and serviceability. ACM
Computing Surveys (CSUR), 20(4): 227-269, 1988.
[8] M. Le, and A. Gallagher, and Y. Tamir. Challenges and Opportunities
with Fault Injection in Virtualized Systems. First International
Workshop on Virtualization Performance: Analysis, Characterization,
and Tools, Austin, Texas, April 2008.
[9] M. Rosenblum and T. Garfinkel. Virtual Machine Monitors: Current
Technology and Future Trends. IEEE Computer, 38(5):39-47, 2005.
[10] M.A. Vouk. Cloud computing-Issues, research and implementations.
Journal of Computing and Information Technology. 16(4): 235-246,
2008.
[11] H. Nishimura, and N. Maruyama, and S. Matsuoka. Virtual clusters
on the fly-fast, scalable, and flexible installation. Proceedings of the
Seventh IEEE International Symposium on Cluster Computing and
the Grid, Rio de Janeiro, Brazil. Pages 549-556, 2007.
[12] X. Qin, and T. Xie. An availability-aware task scheduling strategy for
heterogeneous systems. IEEE Transactions on Computers, 57(2):
188-199, 2008.
[13] M. Steinder, and I. Whalley, and D. Chess. Server virtualization in
autonomic management of heterogeneous workloads. ACM SIGOPS
Operating Systems Review. VOL.42NO.1:94-95. 2008.
[14] T. Thein, and M. Pokharel, and S.D. Chi, and J.S. Park. A Recovery
Model for Survivable Distributed Systems through the Use of
Virtualization. The Fourth International Conference on Networked
Computing and Advanced Information Management (NCM08).
Gyeongju, Korea. September 2-4, 2008.
[15] T. Thein, and J.S. Park. Availability analysis of application servers
using software rejuvenation and virtualization. Journal of Computer
Science and Technology. 24(2): 339-346 Mar. 2009.
[16] T. Thein, and J.S. Park, and S.D. Chi. Availability Modeling and
Analysis on Virtualized Clustering with Rejuvenation. International
Journal of Computer Science and Network Security. VOL.8 No.9,
September 2008.
[17] T. Thein, and S.D. Chi, and J.S. Park. Improving Fault Tolerance by
Virtualization and Software Rejuvenation. Proceedings of the 2008
Second Asia International Conference on Modeling & Simulation
(AMS), pages 855-860, 2008.