Estimating Reliability of Grid Systems Using Bayesian Networks

Reliability Engineering and System Safety 104 (2012) 96–105
An automated method for estimating reliability of grid systems using Bayesian networks

Ozge Doguc *, Jose Emmanuel Ramirez-Marquez
School of Systems & Enterprises, Stevens Institute of Technology, Hoboken, New Jersey, United States
Article info

Article history:
Received 16 February 2010
Received in revised form 11 March 2012
Accepted 14 March 2012
Available online 18 April 2012

Abstract
Grid computing has become relevant due to its applications to large-scale resource sharing, wide-area
information transfer, and multi-institutional collaboration. In general, in grid computing a service
requests the use of a set of resources, available in a grid, to complete certain tasks. Although analysis
tools and techniques for these types of systems have been studied, grid reliability is generally
computation-intensive to obtain due to the complexity of the system. Moreover, conventional
reliability models share some common assumptions that cannot be applied to grid systems.
Therefore, new analytical methods are needed for effective and accurate assessment of grid reliability.
This study presents a new method for estimating grid service reliability which, unlike previous
studies, does not require prior knowledge about the grid system structure. Moreover, the proposed
method does not rely on any assumptions about the link and node failure rates. This approach is based
on a data-mining algorithm, K2, which discovers the grid system structure from raw historical system
data and thereby allows minimum resource spanning trees (MRST) to be found within the grid; it then
uses Bayesian networks (BN) to model the MRST and estimate grid service reliability.
© 2012 Elsevier Ltd. All rights reserved.
Keywords:
Grid systems
Bayesian networks
Reliability
Grid service
Minimum resource spanning tree
1. Introduction
Grid computing has become relevant due to its applications to
large-scale resource sharing, wide-area information transfer, and
multi-institutional collaboration. In general, in grid computing
services request a set of resources, available in a grid, to complete
certain tasks. Many experts believe that grid technologies will
offer a chance to extend the benefits of the Internet [1]. However,
it is difficult to analyze grid reliability due to the highly
heterogeneous and distributed characteristics of the grid. Because grid
systems involve cross-organizational sharing, they support existing
distributed computing technologies. As an example, enterprise-level
distributed computing systems can use grid
technologies to achieve resource sharing across their different
institutions. Although several development tools and techniques
for grid systems have been studied, estimating grid reliability
is not straightforward due to the size and complexity of the
grid [2]. Therefore, new analytical methods are needed to evaluate
grid reliability.

Abbreviations: BN, Bayesian network; RST, resource spanning tree; MRST,
minimum resource spanning tree; CPT, conditional probability table; K2, named
after Kutató 2; QoS, quality of service; RM, resource manager; RN, root node
* Corresponding author. Tel.: +1 201 920 4332; fax: +1 201 920 4641.
E-mail addresses: ozgedoguc@hotmail.com, odoguc@stevens.edu (O. Doguc).
http://dx.doi.org/10.1016/j.ress.2012.03.016

Over the past several years, research and development efforts
have focused on the challenges that arise when large grid
organizations [1–4] are built. As this is a recent topic, there are only a few
studies on estimating grid system reliability in the literature
[5–8]. In these studies, the grid system reliability is estimated
by focusing on the reliabilities of services provided in the grid
system. For this purpose, the grid system components that are
involved in a grid service are classified into spanning trees, and
each tree is studied separately. However, these studies mainly
focus on understanding grid system structures rather than estimating
the actual system reliability. Thus, for simplification
purposes, they make certain assumptions on component failure
rates, such as following a given probability distribution [7].
For reliability estimation, Bayesian networks (BN) have been
proposed as an efficient method [9–12]. BN provide significant
advantages over traditional frameworks for systems engineers,
mainly because they are easy to interpret and they can be used in
interaction with domain experts in the reliability field [13]. Using
the BN structure and the probabilistic values, the system reliability
can be estimated with the help of Bayes' rule [12]. There are
several recent studies on reliability estimation using BN
[9,11,14–16], which require specialized networks that are
designed for a specific system. That is, the BN to be used for
analyzing system reliability should be known beforehand (i.e. the
BN can be built by an expert who has adequate knowledge about
the system under consideration). However, human intervention is
always open to unintentional mistakes that could cause discrepancies in the results [17].

Nomenclature
Gi    Component i in the grid system
Si    Service i in the grid system
Ri    Resource i in the grid system
u     Maximum number of parents in the BN
T     Historical dataset
Pi    Set of parents of node i in the BN

To address these issues, this paper introduces a methodology
for estimating grid system reliability by combining techniques
such as BN construction from raw component and system data,
association rule mining and evaluation of conditional probabilities.
Based on an extensive literature review, this is the first
study that incorporates these methods for estimating grid system
reliability. With the increasing popularity of computer environments
in systems engineering, grid systems have been widely
used in various system-related applications. Understanding the
grid system structure and the component relationships is essential
for systems engineers, both for optimal resource allocation and
for improving the system reliability. This study provides a methodology
for automated discovery of component relationships and
estimation of the reliability of grid services to help systems
engineers.
The methodology suggested in this paper automates the
process of spanning tree discovery and BN construction by using
the K2 algorithm (a commonly used association rule mining
algorithm), which identifies the associations among the grid system
components by using a predefined scoring function and a heuristic.
According to the proposed method, once the BN is efficiently
and accurately constructed, reliabilities of grid services are
estimated with the help of Bayes' rule. Unlike previous studies,
the methodology proposed in this paper does not rely on any
assumptions about the component failure rates in grid systems.
Moreover, the proposed method does not require prior knowledge
about the grid system structure.
2. Background information
This section provides background information about the grid
systems, BN and the K2 algorithm. Earlier studies on estimating
grid system reliability are also discussed in this section.
2.1. Grid systems
The term "grid" was first used in the 1990s to represent distributed
computing infrastructures for advanced science and engineering [3].
The grid concept was first developed to enable
resource sharing within geographically diverse scientific organizations.
The main problem that underlies the concept of grid
systems is coordinated resource sharing and problem solving in
dynamic and multi-institutional organizations [1]. Unlike
typical distributed systems, computational grid systems
require large-scale sharing of resources on different types of
components. A service request in a grid system involves a set of
nodes and links, through which the service can be provided. In a
grid system, the Resource Managers (RM) control and share
resources, while the Root Nodes (RN) request service from RM
(an RN may also share resources). Also, Dai and Wang [7] showed
that the links and nodes in each grid service form a spanning tree.
They define the resource spanning tree (RST) as a tree that starts
from the requestor RN (as its root) and covers all resources that
are required for the requested grid service.

Nomenclature (continued)
ti       Time at which observation i was made on the grid system
f        Scoring function for the K2 algorithm
m        Number of nodes in the parent set
pi       Set of parents for node i in the BN
rel      Doguc and Ramirez-Marquez's method for estimating reliability
combine  Dai and Wang's method for combining reliabilities

An example grid system is displayed in Fig. 1. In the figure, the RM are
shown as single circles and the RN (G1) as a double circle.
As an example grid service S1 in the grid system in Fig. 1,
assume that G1 requests the resources R2, R4 and R5. For the sake
of simplicity and without sacrificing correctness, it can be
assumed that the grid component Gi shares resource Ri only. So, in
this example the components G2, G4 and G5 share resources R2, R4
and R5, respectively.
The reliability of a grid system can be estimated by using the reliabilities
of the services provided through the system [7]. In order to
evaluate the reliability of a grid service, the links and nodes that
are involved in that service should be identified. Dai and Wang
previously showed that the reliability of a grid service can be
estimated by using the reliabilities of minimum RSTs (MRST) [7].
In Fig. 1, although there are several RST for S1 that include all
requested resources ({G1, G2, G3, G6, G8, G5, G4}, {G1, G2, G3, G6, G8,
G9, G4, G4}, {G1, G3, G6, G8, G7, G5, G4}, etc.), only one of them is an
MRST: {G1, G2, G3, G5, G4}. Other possible spanning trees in the
grid are either larger than the MRST or do not include all
requested resources. Moreover, since there is no other component
in this example that shares the resource R5, it can be concluded
that there is only one MRST for the grid service S1.
There are several studies in the literature that focus on the
reliability of grid systems; however, many of them rely on certain
assumptions [5–8,19] that will be discussed in Section 3. Dai and
Wang [7] present a methodology to optimally allocate the
resources in a grid system in order to maximize the grid service
reliability. They use a genetic algorithm to find the optimum
solution efficiently among numerous possibilities. Later, Levitin
and Dai [19] propose dividing grid services into smaller tasks
and subtasks, then assigning the same tasks to different RM for
parallel processing. This paper focuses on the reliabilities of
MRSTs in the grid system, where the reliability of an MRST is
the probability that the MRST provides the given service. The K2
algorithm is used to discover the MRSTs, and BN are used to evaluate grid
service reliabilities. The next section provides information about
the BN and the K2 algorithm.
Fig. 1. A sample grid system.
2.2. Bayesian networks

Estimation of system reliability using BN dates back to
1988, when it was first defined by Barlow [21]. The idea of using
BN in system reliability has mainly gained acceptance because of
the simplicity with which it represents systems, and its efficiency in
obtaining component associations. The concept of BN has been
discussed in several earlier studies [22–24]. More recently, BN
have found applications in software reliability [15,25], fault
finding systems [23], and general reliability modeling [26].
One could summarize the BN as an approach that represents
the interactions among the components in a system from a
probabilistic perspective. This representation is performed via a
directed acyclic graph, where the nodes represent the variables
and the links between each pair of nodes represent the causal
relationships between the variables. From a system reliability
perspective, the variables of a BN are defined as the components
in the system while the links represent the interaction of the
components leading to system success or failure. In a BN this
interaction is represented as a directed link between two components,
forming a child and parent relationship, so that the
dependent component is called the child of the other. Therefore,
the success probability of a child node is conditional on the
success probabilities associated with each of its parents. The
conditional probabilities of the child nodes are calculated by using
Bayes' theorem via the probability values assigned to the
parent nodes. Also, the absence of a link between any two nodes of a
BN indicates that these components do not interact for system
failure/success; thus, they are considered independent of each
other and their probabilities are calculated separately. During
the process of system reliability estimation, calculations for the
independent nodes are skipped, reducing the total amount of
computational work.
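
The independence structure described above corresponds to the standard factorization of the joint distribution of a BN. As a brief sketch, writing Gi for a component, Pi for its parent set and n for the number of nodes (the symbol n is an assumption made here for illustration), the joint probability can be written as:

```latex
p(G_1, \ldots, G_n) = \prod_{i=1}^{n} p(G_i \mid P_i)
```

A node with an empty parent set contributes only its prior probability p(Gi) to this product.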
2.2.1. Finding component associations using historical data

The K2 algorithm, for finding component associations, was first
defined by Cooper and Herskovits [27] as a heuristic search
method to discover associations in a graph. This algorithm
searches for the parent set (i.e. the preceding components in the
graph) of a component that has the maximum association with it.
The K2 algorithm is composed of two main factors: a scoring
function to quantify the associations and rank the parent sets
according to their scores, and a heuristic to reduce the search
space to find the parent set with the highest degree of association.
Without the heuristic, the K2 algorithm would need to examine
all possible parent sets, starting from the empty set. Even with a
restriction on the maximum number of parents (u), the search space
would be as large as 2^u (the total number of subsets of a set of
size u), which requires an exponential-time search algorithm to find
the optimal parent set. With the heuristic, the K2 algorithm does not need to
consider the whole search space; it starts with the assumption
that the component has no parents and incrementally adds the
parent whose addition most increases the scoring function. When
the addition of no single parent can increase the score, the K2
algorithm stops adding parents to the component. Using the
heuristic reduces the size of the search space from exponential
to quadratic [27]. The pseudo-code of the K2 algorithm is given
in Fig. 2.

Fig. 2. Pseudo-code for the K2 algorithm [27].

This paper uses the K2 algorithm to discover the MRSTs in the
grid system. Also, the K2 algorithm is used to construct the BN
models that represent the MRST. Details of the proposed method will be
discussed in Section 3.

2.3. Estimating system reliability using BN

BN are known to be useful in assessing the probabilistic
relationships and identifying probabilistic mappings between
system components [5]. The components are assigned
individual conditional probability tables (CPT) within the BN.
The CPT of a given component Gi contains p(Gi | Pi), where Pi is the
set of Gi's parents. In Gi's CPT, all of its parents are instantiated as
either "Reachable" or "Unreachable"; so for m parents there are
2^m different parent set instantiations, and thus 2^m entries in the CPT. The
BN is complete when all the conditional probabilities are calculated
and represented in the model.
To illustrate these concepts, the BN shown in Fig. 3 presents an
expert's perspective on how the five components of a system
interact. Each component in the grid system is represented with a
node in the BN and the overall MRST behavior (discussed in
Section 3) is shown as a node at the bottom. For this BN the child-parent
relationships of the components can be observed, where on
the quantitative side [16] the degrees of these relationships
(associations) are expressed as probabilities and each node is
associated with a CPT. In Fig. 3, the topmost nodes (G1, G2 and G4)
do not have any incoming edges; therefore they are conditionally
independent of the rest of the components in the system. The prior
probabilities that are assigned to these nodes should be known
beforehand with the help of a domain expert or by using historical
data about the system. Based on these prior probabilities, the
success probability of a dependent node, such as G3, can be
calculated using Bayes' theorem as shown in Eq. (1):

$$p(G_3 \mid G_1, G_2) = \frac{p(G_1, G_2 \mid G_3)\, p(G_3)}{p(G_1, G_2)} \qquad (1)$$

Fig. 3. An example BN.

Table 1
Example historical dataset.
Observation  G1  G2  G3  G4  G5  G6  G7  G8  G9
1            0   0   1   1   0   1   0   1   0
2            1   0   0   0   0   0   1   1   1
3            1   0   0   1   1   0   0   1   1
4            1   1   1   1   1   0   1   0   0
5            1   1   1   1   1   0   0   1   1
6            0   1   0   0   1   1   1   0   0
7            0   0   0   0   0   1   1   0   1
8            1   1   1   1   1   1   0   0   1
9            0   1   0   1   0   1   0   1   0
10           1   0   0   1   1   0   0   1   0

As shown in Eq. (1), the probability of the node G3 depends only
on its parents, G1 and G2, in the BN shown in Fig. 3. The
total number of computations done for calculating this probability
is reduced from 2^n (where n is the number of nodes in the network)
to 2^m, where m is the number of parents of a node (and m ≪ n). Similar
to the prior probabilities, the CPT can be computed by using historical
data of the system. Eq. (1) can be applied to node G5 similarly, using
G4 as input. The overall reliability (i.e. the probability of availability of
the bottommost node) can also be calculated by using the same
equation.
3. Estimating grid service reliability using BN

As discussed in Section 2, there are several studies [9,10,11,13,
14,29–31] in the literature that define reliability estimation
methods for traditional small-scale systems. However, these
studies mostly rely on certain assumptions about the system topology
and the operational probabilities of the components (links and
nodes). In the case of dynamic grid systems, these assumptions
may be invalid since links can be destroyed or established on the
fly. Moreover, due to the dynamic creation and modification of nodes
and links, logical connections between nodes may exist [7]; thus
the operational probabilities of nodes and links cannot be
assigned constant values. There are a number of recent
studies on estimating grid service reliability using MRST [5–7];
however, these studies rely on the assumption that node and link
failures follow a given probability distribution. As discussed in Section
2.1, Dai and Wang [7,20] previously discussed an efficient way of
discovering the MRST for a given grid service when the grid
system structure is known. However, for large and complex grid
systems, it is usually impossible to understand the overall
structure of the grid system. Therefore, these studies do not
provide realistic applications for real-life grid systems.
As an alternative to traditional approaches, the new method for
automated MRST discovery uses the historical raw grid system data
as input, and discovers the MRSTs to evaluate the service reliabilities. The proposed method consists of two steps, MRST discovery
and reliability estimation. Both steps of the proposed method are
automated and require very limited human intervention.
Table 1 shows an example historical dataset for the grid system in
Fig. 1, where each of the nine components of the example grid
system is represented in a separate column. Each row in
Table 1 shows the state of the grid components at the instant of
time ti when the observation was made. For the sake of simplicity
and without loss of generality in the proposed method, the
historical raw system data can be assumed to follow a binary
behavior. That is, for each component in the grid system the value
of 0 represents unavailability, while the value of 1 represents
availability at the time of the observation.
This study first shows that a historical dataset similar to the
one in Table 1 can be used to discover the associations between
the components of a grid system. The proposed method uses the
K2 algorithm to discover the MRSTs for grid services. The method
starts with the RN that requests the service, and discovers the
links to the RMs that share the required resources. The method
keeps discovering the links until all required resources are
accessible by the RN through the links. Unlike the original
K2 algorithm, the proposed method stops when it finds all MRSTs
for the requested grid service. On the other hand, in the case of
many providers, multiple MRSTs may exist. For these cases, the
method keeps running until it visits all RMs that provide the
requested resources. When the provider RMs form a small subset
of all RMs in the grid system, the proposed method ignores the
rest of the system and still finds all MRSTs efficiently.
This method provides an efficient solution when the required
resources are shared by a small set of components in the grid system.
The pseudo-code for the first step of the proposed method is shown
in Fig. 4. An illustration of this step is also given in Section 3.3.
As explained in Section 2, this study focuses on the reliabilities of
MRST in a grid system in order to evaluate the overall grid service
reliability. In this manner, each MRST can be thought of as a smaller
grid system [7] and can therefore be modeled and evaluated
separately. For example, {G1, G2, G3, G4, G5} forms an MRST in the
example grid system given in Fig. 1; thus it can be considered as a
system with five components and modeled with a BN. For this
purpose, each component in the MRST is represented with a
separate node in this BN model.
Fig. 5 shows the pseudo-code for the second step of the proposed
method. In the second step, when a historical system dataset
(as shown in Table 1) is available, Doguc and Ramirez-Marquez's
[12] method is used to construct a BN for each MRST and estimate its
reliability. Their method takes a historical dataset for a system
(or MRST), and calculates its reliability. Doguc and
Ramirez-Marquez's method is shown as the function "rel" in the figure. In the
case of multiple MRSTs, separate BNs are constructed by using
historical data to calculate the reliability of each MRST. Dai
and Wang [7] showed that these reliability values can be combined
by using Bayes' rule to calculate the overall grid service reliability. In
Fig. 5, Dai and Wang's [7] method for combining multiple reliabilities
is represented with "combine". The combine function takes an
MRST and aggregates its reliability with the others.
In the next section, an example case scenario is provided to
demonstrate how the proposed method automatically discovers
the MRST and constructs the BN model to estimate the grid
service reliability.
3.1. Illustration of the proposed method
This section provides an illustration of the proposed method.
As an example, the grid system shown in Fig. 1 and the grid service
S1 that was discussed in Section 2 are used, so that the RN
G1 requests resources R2, R4 and R5 in the sample grid system.
The method starts with the initial ordering of the components
(i.e. from G1 to G9) and relies on the K2 algorithm to find
the associations between the components in a grid system. Using
the historical dataset shown in Table 1, the proposed method
starts from the second component G2 and looks for an association
between G2 and G1. This association is decided by the K2 algorithm
using the historical dataset: if both G1 and G2 are available
(i.e. reachable from each other) during most of the observations, then
there is a strong association between them. Using the dataset
shown in Table 1, the K2 algorithm finds a strong association
between the two components; therefore the proposed method decides
that there should be a link between them. Moreover, the G2
component contains the resource R2; however, the resources R4
and R5 are still needed for the requested service S1, so
the method keeps on iterating. Since there are no more components
preceding G2, the method considers the component G3 next
(Figs. 6–8).

Fig. 4. Pseudo-code for the proposed method for MRST discovery.
Fig. 5. Pseudo-code for the second step of the proposed method.
Fig. 6. MRST discovery (step 1).
Fig. 7. MRST discovery (step 2).
Fig. 8. MRST discovery (last step).

In this step, the method looks for the associations between G3
and the preceding components, G1 and G2. Again using the
historical dataset, the method decides that both G1 and G2 are
highly associated with G3; therefore there should be links
between them.
From the grid structure learned so far, the RN G1 can access the
resources R1, R2 and R3; however, the requested resources R4 and R5
are still not reachable by G1. Therefore, the method considers the
component G4 next. According to the historical dataset, the G4 component
is not associated with either of the preceding components;
therefore the method moves to the next component, G5. At this
step the method finds associations between G5 and G3, and G5 and
G4, and creates links between them. Also, the method updates the
list of resources reachable so far ({R1, R2, R3, R4, R5}), which now
contains all the resources requested by G1. At this point, the
method stops and does not consider the rest of the components
in the grid system. Since all requested resources are reachable, the
discovered part of the grid system already contains the MRST,
which is required for estimating grid service reliability. By skipping
the rest of the components in the grid system the proposed method
provides an efficient solution, since the grid system can include a
large number of components. In this example, the proposed
method needs to consider only a small subset of the whole grid
system to discover the MRST.
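
The walk-through above can be sketched in Python. This is a minimal illustration and not the authors' pseudo-code from Fig. 4: the function names and the `associated` predicate are hypothetical, and the associations that the paper obtains from the K2 algorithm are simply supplied here as a fixed set taken from the text.

```python
from collections import deque

def reachable_from(root, links):
    """Return the components connected to root through the discovered links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def discover_links(order, root, requested, shares, associated):
    """Scan components in order, linking each one to associated predecessors,
    and stop as soon as every requested resource is reachable from root."""
    links = set()
    for idx, comp in enumerate(order):
        for prev in order[:idx]:
            if associated(comp, prev):  # hypothetical stand-in for the K2 decision
                links.add((prev, comp))
        reached = reachable_from(root, links)
        if requested <= {shares[c] for c in reached}:
            return links, reached
    return links, reachable_from(root, links)

# Assumptions taken from the example: Gi shares Ri only, and the
# associations are the ones reported in the walk-through.
order = [f"G{i}" for i in range(1, 10)]
shares = {f"G{i}": f"R{i}" for i in range(1, 10)}
found = {("G2", "G1"), ("G3", "G1"), ("G3", "G2"), ("G5", "G3"), ("G5", "G4")}

links, reached = discover_links(
    order, "G1", {"R2", "R4", "R5"}, shares, lambda a, b: (a, b) in found
)
print(sorted(reached))
```

With these inputs the scan stops at G5, so G6 to G9 are never examined, which mirrors the early stopping described in the text.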
After the MRST is discovered, the proposed method uses the K2
algorithm to compute the probabilistic assignments for the
component associations, construct the BN model and estimate the
grid service reliability. For this purpose, the method uses the
historical dataset for the discovered MRST as shown in Table 2.
In addition to the historical data for the components G1, ..., G5,
Table 2 contains an extra column (MRST Behavior) that shows whether
the grid service can be provided through the MRST at the time of
observation. For example, the grid service cannot be provided in
the first 3 observations, and can be provided during observations
4 and 5.
As a first step, the K2 algorithm starts with the first component
in the dataset, G1. Since G1 does not have any preceding components,
i.e. possible candidate parents in the BN, the K2 algorithm
skips it and picks the second component in the dataset, which is
G2. For G2, there are two alternative parent sets: the empty set
∅, or {G1}. Therefore, the K2 algorithm computes the scoring
function f using each of these alternative parent sets and compares
the results. Then, the set of candidate parents with the highest
f score is chosen as the parent set for G2. At the end of this
iteration the values 1/1320 and 1/4200 are computed and
compared, and the former, representing the score of the empty
set ∅, is picked as the parent set. So the K2 algorithm decides that G2
has no parents in the BN.
In the following iterations of the K2 algorithm, the number of
possible candidate parent sets to be considered, and thus the
amount of computation for the f score calculation, increases.

Table 2
Example historical dataset.
Observation  G1  G2  G3  G4  G5  MRST Behavior
1            0   0   1   1   0   0
2            1   0   0   0   0   0
3            1   0   0   1   1   0
4            1   1   1   1   1   1
5            1   1   1   1   1   1
6            0   1   0   0   1   0
7            0   0   0   0   0   0
8            1   1   1   1   1   1
9            0   1   0   1   0   0
10           1   0   0   1   1   0

Table 3
f scores for all possible candidate parent sets for G3.
Parent set   f score
∅            1/1320
{G1}         1/2800
{G2}         1/1800
{G1, G2}     1/640

Skipping the details, the f scores of the candidate parent sets for the
G3 component are given in Table 3. Because the K2 algorithm
iterates over the components according to their ordering in the dataset,
components G4 and G5 are not taken into account as candidate
parents for G3. In this step, the K2 algorithm selects the set {G1, G2}
as the parent set of G3, because it has the highest f score. The number
of computations grows with the order of the component, and when
the K2 algorithm finishes processing the last column (MRST Behavior
in Table 2), it outputs the BN structure displayed in Fig. 3.
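
The greedy parent search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pseudo-code of Fig. 2: the scoring function f is assumed to be the Cooper and Herskovits metric [27] for binary variables, computed in log space, and the names `log_score`, `k2` and `TABLE2` are hypothetical. Under that assumption the Table 2 data gives 1/4200 for G2 with parent set {G1}, the value quoted in the text; other scores need not coincide with the quoted ones.

```python
from itertools import product
from math import exp, lgamma

# Columns of Table 2; "MRST" stands for the MRST Behavior column.
TABLE2 = {
    "G1":   [0, 1, 1, 1, 1, 0, 0, 1, 0, 1],
    "G2":   [0, 0, 0, 1, 1, 1, 0, 1, 1, 0],
    "G3":   [1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
    "G4":   [1, 0, 1, 1, 1, 0, 0, 1, 1, 1],
    "G5":   [0, 0, 1, 1, 1, 1, 0, 1, 0, 1],
    "MRST": [0, 0, 0, 1, 1, 0, 0, 1, 0, 0],
}

def log_score(data, child, parents):
    """Log of the assumed score: for every parent instantiation j,
    (r - 1)! / (N_j + r - 1)! * prod_k N_jk!, with r = 2 states."""
    n = len(data[child])
    total = 0.0
    for inst in product((0, 1), repeat=len(parents)):
        rows = [k for k in range(n)
                if all(data[p][k] == v for p, v in zip(parents, inst))]
        counts = [sum(1 for k in rows if data[child][k] == s) for s in (0, 1)]
        total += lgamma(2) - lgamma(len(rows) + 2)
        total += sum(lgamma(c + 1) for c in counts)
    return total

def k2(data, order, max_parents):
    """Greedy search: keep adding the predecessor that most raises the score."""
    parents = {v: [] for v in order}
    for i, child in enumerate(order):
        best = log_score(data, child, parents[child])
        candidates = list(order[:i])
        while candidates and len(parents[child]) < max_parents:
            top, pick = max(
                (log_score(data, child, parents[child] + [c]), c)
                for c in candidates
            )
            if top <= best:  # no single addition improves the score
                break
            parents[child].append(pick)
            candidates.remove(pick)
            best = top
    return parents

structure = k2(TABLE2, list(TABLE2), max_parents=2)
print(structure["G2"], sorted(structure["G3"]))
print(round(1 / exp(log_score(TABLE2, "G2", ["G1"]))))
```

Working with log scores avoids the very small products of factorial ratios that appear as the number of observations grows.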
The next step of the proposed method is estimating the grid
service reliability using the BN that was constructed by the K2
algorithm. Besides the associations that were discovered in the
previous step, the inference rules described in Section 2.2 should
be used to calculate the conditional probabilities. The conditional
probabilities are calculated and stored in CPT and each component with a non-empty parent set in the BN is associated with a
CPT. The ones with no parents are independent of others and
associated with prior probabilities as explained in Section 2.2.
The probability values in the CPT are calculated by using the
raw data in Table 2 and can be expressed as the probability of a
component state given an instantiation of its parent set. For example,
the probability of G3 being 0 given the parent instantiations G1 = 0 and G2 = 0 is
0.5, since in two out of ten observations the parents are instantiated as 0
and 0, and in one of these cases G3 is instantiated as 0. In the next
step, with the help of the CPT and the prior probabilities of G1 and
G2, the success probability value for G3 can be calculated.
According to the BN structure (in Fig. 3), components G1 and G2 are
independent of the others; therefore their success probabilities can be
directly inferred from the observations dataset in Table 2. From
Table 2 it can be evaluated that p(G1 = 1) = 0.6 and p(G2 = 1) = 0.5.
While evaluating the other components in the BN, the success
probabilities for the rest of the components in the sample MRST
can be evaluated, such that p(G4 = 1) = 0.6 and p(G5 = 1) = 0.75.
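
The frequency-based estimates for G1, G2 and G3 can be reproduced with a short Python sketch. It is an illustration only: the function names are hypothetical, parent instantiations that never occur in the data are simply skipped, and the marginal for G3 is obtained by treating its parents as independent, which is an assumption based on G1 and G2 having no parents in Fig. 3.

```python
from itertools import product

# G1, G2 and G3 columns of Table 2.
TABLE2 = {
    "G1": [0, 1, 1, 1, 1, 0, 0, 1, 0, 1],
    "G2": [0, 0, 0, 1, 1, 1, 0, 1, 1, 0],
    "G3": [1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
}

def prior(data, node):
    """Relative frequency of node = 1 over all observations."""
    return sum(data[node]) / len(data[node])

def cpt(data, child, parents):
    """Map each observed parent instantiation to P(child = 1 | instantiation)."""
    n = len(data[child])
    table = {}
    for inst in product((0, 1), repeat=len(parents)):
        rows = [k for k in range(n)
                if all(data[p][k] == v for p, v in zip(parents, inst))]
        if rows:  # unobserved instantiations are left out
            table[inst] = sum(data[child][k] for k in rows) / len(rows)
    return table

def success_probability(data, child, parents):
    """P(child = 1), summing the CPT over independent parent states."""
    total = 0.0
    for inst, p_child in cpt(data, child, parents).items():
        weight = 1.0
        for parent, value in zip(parents, inst):
            p1 = prior(data, parent)
            weight *= p1 if value == 1 else 1.0 - p1
        total += p_child * weight
    return total

cpt_g3 = cpt(TABLE2, "G3", ["G1", "G2"])
p_g3_zero = 1.0 - cpt_g3[(0, 0)]  # P(G3 = 0 | G1 = 0, G2 = 0)
p_g3 = success_probability(TABLE2, "G3", ["G1", "G2"])
print(prior(TABLE2, "G1"), prior(TABLE2, "G2"), p_g3_zero)
```

The first three printed values correspond to p(G1 = 1), p(G2 = 1) and the CPT entry discussed in the text.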
In the last step, the MRST reliability can be calculated by using
these probability values and the CPT of the MRST Behavior node in
the BN structure given in Fig. 3. The success probability of the
MRST Behavior node can be calculated as 0.35 or 35%; which is the
reliability of the MRST used in this section. The reader must recall
that this reliability value is calculated based on only 10 observations on the sample system. With more observations available,
the K2 algorithm could provide more accurate results on the
degrees of associations between the system components and
calculate more precise values in the CPT of the nodes; which will
increase the accuracy of the calculated service reliability.
4. Experimental analysis
This section provides an experimental analysis of the proposed
method for grid service reliability estimation. For the experimental
analysis, the proposed method is implemented in Matlab 8, using
a computer equipped with an Intel Core 2 Duo 2.1 GHz CPU and 2 GB
of RAM. This computer runs the 32-bit Windows Vista Business
operating system.
First, the experiments on the performance of the proposed
method are provided. For this purpose, several grid services are
created in the example grid system shown in Fig. 9. In this grid
system, the nodes with double circles represent the RN, while the
others represent the RM. For each grid service requested by the RNs,
the times spent discovering the MRST, building
the BN model and estimating the grid service reliability are measured. In order
to obtain statistically significant results, the experiments are
re-run 100 times and the minimums, maximums, averages and
standard deviations of the running times are calculated.
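
The measurement protocol described above can be sketched as follows. The paper's experiments were run in Matlab; this Python fragment is only an illustration, and `time_runs` and `placeholder_task` are hypothetical names, with the placeholder standing in for the actual discovery and estimation steps.

```python
import statistics
import time

def time_runs(task, runs=100):
    """Run task repeatedly and summarize its wall-clock running times in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "max": max(durations),
        "mean": statistics.mean(durations),
        "stdev": statistics.stdev(durations),
    }

def placeholder_task():
    """Stand-in workload; replace with the step being measured."""
    sum(i * i for i in range(1000))

stats = time_runs(placeholder_task)
print(stats)
```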
Tables 4 and 5 provide the list of resources that are shared by
each component in the grid system. Also, Table 6 shows the
individual reliabilities of the components in the example grid
system. The next section provides experimental results on the
performance of the method proposed in this paper. Later, in order
to show the accuracy of the proposed method, the experimental
results are compared with the ones provided by Dai and Wang [7].
For this purpose, the same grid network example presented in
their paper is used.
4.1. Performance analysis

As discussed in Section 3, the proposed method consists of two
major steps: MRST discovery and grid service reliability estimation.
This section discusses the performance of each step separately.
The performance of the proposed method for MRST
discovery is also compared with the performance of Dai and
Wang's method. For the experimental analysis, the grid system
shown in Fig. 9 is used and the grid services shown in Table 5
are defined.
As discussed in Section 2, different numbers of MRSTs can be
found for each grid service. For example, for the grid service
S1 there are three MRSTs: {G1, G2, G3, G4}, {G1, G2, G5, G6} and
{G1, G2, G5, G7}. The performance of the proposed MRST discovery
method is estimated by measuring its running time for each
MRST in the example grid system. Also, as discussed in Section 2,
all three MRSTs are used for estimating the reliability of the grid
service S1. The historical datasets contained
100 observations in each experiment. Fig. 10 shows the
experimental results for the proposed MRST discovery method.
Also, statistical details of the experimental results are given in
Table 7. (In both Fig. 10 and Table 7, the running times are given
in seconds.) The number of MRSTs differs between the grid services
in the example grid system and varies between 3 and 10. Section 3
discussed that the size of the discovered MRST is an important
factor in the running time of the discovery step; thus, the average
MRST size for each grid service is given in Table 7. Table 7 also
provides the average, minimum and maximum running times for MRST
discovery for each grid service, along with the average running
times for Dai and Wang's method. As discussed at the beginning of
this section, the numbers in Table 7 are calculated by performing
the experiments 100 times.
As can be observed from Fig. 10, the running time of the
proposed MRST discovery algorithm increases linearly with
the MRST size and the number of MRSTs for the grid service. Also,
the standard deviations in Table 7 are very low, indicating that the
measured running times are consistent across the 100 runs. Moreover,
in all cases, the method is more efficient than Dai and Wang's
method, which uses a genetic algorithm to discover the MRST.
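The comparison with Dai and Wang's method can be made concrete by dividing their average running time by that of the proposed method for each case. The short Python sketch below uses the average running times reported in Table 7; the variable names are illustrative only.

```python
# Average MRST discovery times in seconds for cases 1-9, as reported in Table 7.
proposed = [0.9, 2.46, 5.98, 10.35, 21.34, 44.65, 89.98, 172.23, 359.03]
dai_wang = [1.12, 3.35, 6.54, 12.67, 24.02, 57.9, 112.83, 259.49, 582.65]

# Ratio > 1 means the proposed method was faster on average for that case.
speedups = [other / ours for ours, other in zip(proposed, dai_wang)]

for case, ratio in enumerate(speedups, start=1):
    print(f"case {case}: Dai and Wang / proposed = {ratio:.2f}")
```

Every ratio is above 1, which matches the statement that the proposed method is faster in all nine cases.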
Next, the experimental analysis results for BN construction
using the K2 algorithm and reliability estimation with Bayes' rule
are given. Fig. 11 shows the performance of the K2 algorithm in
constructing the BN model for the MRST. As expected, the running time
of the algorithm increases quadratically with respect to the number
of nodes in the BN (components in the MRST).
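To indicate where the quadratic growth comes from, the following is a minimal Python sketch of a K2-style greedy search using the Cooper and Herskovits metric [27]. It is not the authors' Matlab implementation. It assumes binary component states, a given node ordering and a bound on the number of parents, and the 8-row dataset is hypothetical.

```python
from itertools import product
from math import lgamma

def log_k2_score(data, child, parents, arity=2):
    """Log of the Cooper-Herskovits K2 metric for `child` given `parents`.

    data is a list of tuples of integer states in range(arity);
    child and parents are column indices.
    """
    score = 0.0
    for config in product(range(arity), repeat=len(parents)):
        # counts[k]: rows with this parent configuration and child state k
        counts = [0] * arity
        for row in data:
            if all(row[p] == v for p, v in zip(parents, config)):
                counts[row[child]] += 1
        n_ij = sum(counts)
        # log[(r - 1)! / (N_ij + r - 1)!] + sum over k of log(N_ijk!)
        score += lgamma(arity) - lgamma(n_ij + arity)
        score += sum(lgamma(c + 1) for c in counts)
    return score

def k2(data, order, max_parents):
    """Greedy search: a node may only take parents that precede it in
    `order`, and a parent is added only while it improves the score."""
    parents = {node: [] for node in order}
    for position, node in enumerate(order):
        best = log_k2_score(data, node, parents[node])
        candidates = set(order[:position])
        while candidates and len(parents[node]) < max_parents:
            scored = [(log_k2_score(data, node, parents[node] + [c]), c)
                      for c in candidates]
            top_score, top = max(scored)
            if top_score <= best:
                break
            best = top_score
            parents[node].append(top)
            candidates.remove(top)
    return parents

# Hypothetical observations: column 1 copies column 0, column 2 is unrelated.
data = [(0, 0, 0), (0, 0, 1), (0, 0, 0), (0, 0, 1),
        (1, 1, 0), (1, 1, 1), (1, 1, 0), (1, 1, 1)]
structure = k2(data, order=[0, 1, 2], max_parents=2)
print(structure)
```

Each node compares itself against every node that precedes it in the ordering, so with a fixed bound on the number of parents the number of score evaluations grows with the square of the number of nodes.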
Table 8 provides the experimental results for the performance
of the BN construction and reliability estimation steps. It can be
Fig. 9. Example grid system used for experimental analysis.
Table 4
List of resources shared by each component.
Component   List of shared resources
G1          R1, R3, R4, R7, R8, R15
G2          R2, R3, R5, R6, R8, R11, R11
G3          R1, R3, R7, R8, R9, R10
G4          R4, R5, R6, R10, R11, R12, R14
G5          R1, R3, R4, R5, R8, R10
G6          R4, R6, R7, R8, R11, R12, R13, R15
G7          R2, R4, R7, R8, R9
G8          R4, R5, R7, R8, R9, R10, R14
G9          R6, R7, R8, R9, R10, R11
G10         R3, R6, R8, R10, R11, R14, R16
G11         R1, R6, R9, R11, R12, R15
G12         R2, R4, R7, R10, R12, R34, R14
G13         R5, R6, R7, R13
Table 5
List of grid services and required resources.
Grid service   Requestor RN   Required resources
S1             G2             R2, R5, R7, R8, R11, R12, R14
S2             G2             R2, R3, R4, R7, R8, R9
S3             G3             R1, R3, R6, R8, R10, R11, R12
S4             G3             R4, R7, R8, R9, R10, R11, R12, R13, R14, R15
S5             G3             R2, R4, R6, R7, R10, R11, R12
S6             G6             R3, R4, R6, R7, R9, R11, R12, R13, R14
S7             G6             R1, R2, R5, R6, R7, R9, R10, R11, R12, R13, R14, R15
S8             G9             R1, R2, R3, R4, R5, R6, R7, R9, R10, R11
S9             G9             R1, R2, R4, R5, R6, R8, R9, R10, R11, R13, R14, R16
Table 6
Reliabilities of components in the grid system.
Component   Reliability
G1          0.99
G2          0.91
G3          0.95
G4          0.93
G5          0.99
G6          0.94
G7          0.95
G8          0.98
G9          0.98
G10         0.99
G11         0.91
G12         0.94
G13         0.91
Fig. 10. Performance of the proposed MRST discovery method.
Table 7
Statistical details of the experimental results for MRST discovery times.
Case #   Number of MRST   Avg. MRST size   Avg.     Min.     Max.     Std. Dev.   Dai and Wang
1        3                4.33             0.9      0.82     1.03     0.105987    1.12
2        3                5.33             2.46     2.32     2.71     0.197569    3.35
3        4                5                5.98     5.76     6.33     0.28746     6.54
4        5                5.4              10.35    10.25    10.54    0.147309    12.67
5        6                5.16             21.34    21.12    21.77    0.330606    24.02
6        7                5.14             44.65    44.25    45.06    0.40501     57.9
7        8                5.375            89.98    89.28    90.7     0.710023    112.83
8        9                5.55             172.23   171.57   173.02   0.725971    259.49
9        10               6                359.03   358.29   360.32   1.027343    582.65
Table 8
Experimental results for the performance of BN construction and reliability
estimation.
Case #   Number of MRST   MRST size   Avg. BN construction time   Avg. reliability estimation time
1        3                4.33        4.24                        0.77
2        3                5.33        8.12                        1.44
3        4                5           14.51                       2.29
4        5                5.4         21.44                       3.76
5        6                5.16        29.54                       5.05
6        7                5.14        37.91                       7.11
7        8                5.375       44.04                       8.59
8        9                5.55        50.12                       9.91
9        10               6           58.47                       11.54
Table 9
Reliability estimation results for the case grid services.
Case #   Average   Minimum   Maximum   Std. Dev.
1        0.9757    0.9798    0.9832    0.1909
2        0.963     0.9588    0.979     0.1528
3        0.9557    0.9364    0.9754    0.1545
4        0.9575    0.9381    0.9905    0.1920
5        0.9786    0.9638    0.983     0.1761
6        0.9792    0.9719    0.9974    0.1203
7        0.9795    0.9672    0.9946    0.1750
8        0.9780    0.9683    0.9871    0.1787
9        0.9817    0.9662    0.9942    0.1721
Fig. 11. Performance of the K2 algorithm to construct the BN.
observed that the running time of BN construction is significantly
larger than the running time of reliability estimation. Also, both
BN construction and reliability estimation times increase with the
sizes of the MRST. The running times in Table 8 are provided in
seconds. The estimated reliabilities of the MRSTs are given in
Table 9.
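The gap between the two steps in Table 8 can be quantified by taking the ratio of the average BN construction time to the average reliability estimation time for each case. The Python sketch below does this with the values reported in Table 8; the variable names are illustrative only.

```python
# Average running times in seconds for cases 1-9, as reported in Table 8.
bn_construction = [4.24, 8.12, 14.51, 21.44, 29.54, 37.91, 44.04, 50.12, 58.47]
estimation = [0.77, 1.44, 2.29, 3.76, 5.05, 7.11, 8.59, 9.91, 11.54]

# How many times longer BN construction takes than reliability estimation.
ratios = [build / estimate
          for build, estimate in zip(bn_construction, estimation)]

for case, ratio in enumerate(ratios, start=1):
    print(f"case {case}: construction / estimation = {ratio:.2f}")
```

In every case BN construction takes more than five times as long as reliability estimation.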
4.2. Accuracy analysis

This section analyzes the accuracy of the grid service reliability
results using known reliability values from Dai and Wang [7]. In
their study, Dai and Wang created an example grid system and
estimated the reliability of one service within the grid system.
Their example grid system is shown in Fig. 12.
Dai and Wang's method for grid service reliability estimation
relies on assumptions about the failure rates of the grid components.
Fig. 12. Example grid system used in Dai and Wang's study [7].
Table 10
Failure rates (per second) of the components.
Component   G1      G2      G3      G4      G5      G6      G7      G8      G9
λi          0.002   0.001   0.003   0.001   0.005   0.004   0.002   0.003   0.002
Table 11
Comparison of the experimental results with Dai and Wang's results.
                       Average   Minimum   Maximum   Std. Dev.
Dai and Wang           0.96034   0.9428    0.9696    0.005752
Proposed methodology   0.96108   0.9538    0.9714    0.01873
Table 12
Running times of MRST discovery, BN construction and reliability estimation.
                         Average   Minimum   Maximum   Std. Dev.
MRST discovery           23.77     19.45     25.78     0.405
BN construction          34.05     30.84     38.9      1.352
Reliability estimation   6.88      5.61      8.19      0.535
In their example, they assumed that the failure rates are
statistically distributed with the λ values shown in Table 10.
Using this setup, Dai and Wang estimated the reliability of the
grid service S, where the resources R1, R2, R3 and R4 are requested by
the RN G1. The MRST for this service involves 5 components: G1, G2,
G3, G7 and G9. In order to estimate the reliability of the grid service S,
100 historical datasets, each containing 100 observations, were generated
using the failure rate assumptions in Table 10. The implementation
of the proposed method is run 100 times, each time with a different
historical dataset. The average, minimum and maximum reliability values of
the experimental results (out of 100 experiments) and the comparison
with Dai and Wang's results are provided in Table 11.
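The following Python sketch shows one way such historical datasets could be generated from the failure rates in Table 10. It rests on several assumptions that are not stated in the paper: failure times are taken to be exponentially distributed and independent, and the mission time MISSION_TIME is a hypothetical value. The function names are illustrative, and the success fraction computed at the end is a naive check on the data, not the BN-based estimate of the proposed method.

```python
import math
import random

# Failure rates (per second) of the components, from Table 10.
FAILURE_RATES = {
    "G1": 0.002, "G2": 0.001, "G3": 0.003, "G4": 0.001, "G5": 0.005,
    "G6": 0.004, "G7": 0.002, "G8": 0.003, "G9": 0.002,
}
MRST = ["G1", "G2", "G3", "G7", "G9"]  # MRST of service S, as given in the text

def generate_dataset(rates, mission_time, observations, rng):
    """Draw one historical dataset of component up (1) / down (0) states.

    Assumes exponentially distributed, independent failure times, so a
    component survives a mission of length t with probability exp(-rate * t).
    """
    return [
        {name: int(rng.random() < math.exp(-rate * mission_time))
         for name, rate in rates.items()}
        for _ in range(observations)
    ]

def mrst_success_fraction(dataset, mrst):
    """Fraction of observations in which every MRST component is up."""
    return sum(all(row[c] for c in mrst) for row in dataset) / len(dataset)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
MISSION_TIME = 4.0      # seconds; hypothetical value, not from the paper

# 100 datasets of 100 observations each, as in the experiment described above.
datasets = [generate_dataset(FAILURE_RATES, MISSION_TIME, 100, rng)
            for _ in range(100)]
fractions = [mrst_success_fraction(d, MRST) for d in datasets]
print(min(fractions), sum(fractions) / len(fractions), max(fractions))
```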
As can be observed from Table 11, the experimental results
are very close (less than 0.1% difference) to Dai and Wang's
results. Moreover, the low standard deviation values show that
the results are consistent across the 100 experiments. Table 12
shows the MRST discovery, BN construction and reliability estimation
times (in seconds) for the example grid system shown in Fig. 12.
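The "less than 0.1%" figure can be checked directly from the two average reliability values in Table 11. The short Python sketch below takes the first reported average (0.96034) as Dai and Wang's result and the second (0.96108) as that of the proposed method; this assignment is an assumption based on the row order of the table.

```python
dai_wang_average = 0.96034   # Table 11, read as Dai and Wang's result
proposed_average = 0.96108   # Table 11, read as the proposed method's result

relative_difference = abs(proposed_average - dai_wang_average) / dai_wang_average
percent_difference = 100 * relative_difference

print(f"relative difference: {percent_difference:.3f}%")
```

The difference stays below 0.1% whichever of the two averages is used as the denominator.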
5. Conclusions
Grid systems are a newly developed concept for large-scale
distributed systems. A grid system can contain various nodes
that are logically and physically distributed, and large-scale
sharing of resources between these nodes is essential. There are
mainly two types of nodes in a grid system: RMs share resources
and RNs request service from them. Identification of the links and
nodes between the RN and the RMs is essential for estimating the
reliability of the requested service. Due to the special and
complex nature of grid systems, traditional reliability estimation
methods cannot be used for them. As grid systems have become popular
in the last decade, they have found new application areas in systems
engineering; however, the question of estimating the reliability of
grid systems has remained wide open. Although there have been
studies in the literature on estimating grid service reliability,
these studies rely on certain assumptions about the link and
component failure rates [7,8,20] and/or assume that the grid
system structure is completely known [7,28]. However, in
real-life grid systems these assumptions may not hold at all
times. First, component and link failures in real-life systems
may occur randomly, and making assumptions about their failure
rates can lead to incorrect results. Second, grid systems can
be so large and dynamic that their structure may not be exactly
known a priori.
This study discusses an automated method for estimating grid
service reliability without relying on any assumptions about the
component and link failures. Moreover, the proposed method does
not require prior knowledge about the grid system structure.
Instead, the method is based on a popular data mining
algorithm, K2, and finds the associations between the grid
components automatically. To find the component associations,
the proposed method works with a dataset that records the past
availabilities of the grid components. After discovering the grid
structure based on this dataset, the proposed method finds the
MRSTs and also computes CPTs for the components that it
considers during the process. The MRSTs and CPTs are essential
for estimating the grid service reliability, which is estimated with
the help of BN and Bayes' theorem. Moreover, the proposed
method does not need to consider all components in the grid
system, which can be a very large set. Instead, it stops when it
finds all possible MRSTs, which usually requires considering only
a small subset of the components in the grid system. Also,
experimental analysis of the performance and accuracy of the
proposed method is provided. It is shown that the proposed
method discovers the MRSTs in less time than Dai and Wang's
method (which uses a genetic algorithm) and provides reliability
values within 0.1% of Dai and Wang's results for the example
studied. Finally, the proposed method will be very useful
for system and reliability engineers, since it is fully automated,
does not rely on failure-rate assumptions and does not require prior
knowledge of the grid system structure.
References
[1] Foster I, Kesselman C, Tuecke S. The anatomy of the grid: enabling scalable
virtual organizations. International Journal of Supercomputer Applications
2001;15(3).
[2] Frey J, Tannenbaum T, Foster I, Livny M, Tuecke S. Condor-G: a computation
management agent for multi-institutional grids. Cluster Computing
2002;5(3):237-246.
[3] Foster I, Kesselman C. Computational grids. In: VECPAR, 1998; Morgan
Kaufmann, 1998. p. 15-52.
[4] Buyya R, Date S, Mizuno-Matsumoto Y, Venugopal S, Abramson D. Economic
and on demand brain activity analysis on global grids. Computing Research
Repository 2003.
[5] Dai YS, Levitin G. Optimal resource allocation for maximizing performance
and reliability in tree-structured grid services. IEEE Transactions on
Reliability 2007;56(3):444-453.
[6] Dai YS, Pan Y, Zou X. A hierarchical modeling and analysis for grid service
reliability. IEEE Transactions on Computers 2007;56:681-691.
[7] Dai YS, Wang X. Optimal resource allocation on grid systems for maximizing
service reliability using a genetic algorithm. Reliability Engineering & System
Safety 2006;91(9):1071-1082.
[8] Dai YS, Xie M, Poh KL. Reliability of grid service systems. Computers and
Industrial Engineering 2006;50:130-147.
[9] Amasaki S, Takagi Y, Mizuno O, Kikuno T. A Bayesian belief network for
assessing the likelihood of fault content. In: 14th international symposium on
software reliability engineering, 2003, p. 125.
[10] Boudali H, Dugan JB. A continuous-time Bayesian network reliability modeling,
and analysis framework. IEEE Transactions on Reliability 2006;55(1):86-97.
[11] Gran BA, Helminen A. A Bayesian belief network for reliability assessment.
Safecomp 2001 2187, 2001, p. 35-45.
[12] Doguc O, Ramirez-Marquez JE. A generic method for estimating system
reliability using Bayesian networks. Reliability Engineering and System
Safety 2009;94(2):542-550.
[13] Sigurdsson JH, Walls LA, Quigley JL. Bayesian belief nets for managing expert
judgment and modeling reliability. Quality and Reliability Engineering
International 2001;17:181-190.
[14] Hugin Expert. http://www.hugin.dk.
[15] Gran BA, Dahll G, Eisinger S, Lund EJ, Norstrøm JG, Strocka P, Ystanes BJ.
Estimating dependability of programmable systems using BBNs. In: Safecomp
2000, 2000; Springer 2000, p. 309-320.
[16] Langseth H, Portinale L. Bayesian networks in reliability. Reliability
Engineering and System Safety 2007;92(1):92-108.
[17] Inamura T, Inaba M, Inoue H. User adaptation of human-robot interaction
model based on Bayesian network and introspection of interaction experience.
In: IEEE/RSJ international conference on intelligent robots and systems. 2000. p.
2139-2144.
[19] Levitin G, Dai YS. Performance and reliability of a star topology grid service
with data dependency and two types of failure. IIE Transactions 2007;39(8):
783.
[20] Dai YS, Levitin G. Optimal resource allocation for maximizing performance
and reliability in tree-structured grid services. IEEE Transactions on
Reliability 2007.
[21] Barlow RE. Using influence diagrams. Accelerated life testing and experts'
opinions in reliability 1988:145-150.
[22] Cowell RG, Dawid AP, Lauritzen SL, Spiegelhalter DJ. Probabilistic networks
and expert systems. New York, NY: Springer-Verlag; 1999.
[23] Jensen FV. Bayesian networks and decision graphs. New York, NY:
Springer-Verlag; 2001.
[24] Pearl J. Probabilistic reasoning in intelligent systems. San Francisco,
CA: Morgan Kaufmann; 1988.
[25] Fenton N, Krause P, Neil M. Software measurement: uncertainty and causal
modeling. IEEE Software 2002;10(4):116-122.
[26] Bobbio A, Portinale L, Minichino M, Ciancamerla E. Improving the analysis of
dependable systems by mapping fault trees into Bayesian networks. Reliability
Engineering and System Safety 2001;71(3):249-260.
[27] Cooper GF, Herskovits E. A Bayesian method for the induction of probabilistic
networks from data. Machine Learning 1992;9(4):309-347.
[28] Doguc O, Ramirez-Marquez JE. A Bayesian approach for estimating grid
system reliability. In: International conference on grid computing and
applications. Las Vegas, NV; July 14-17, 2008.
[29] Chen DJ, Chen RS, Huang TH. A heuristic approach to generating file spanning
trees for reliability analysis of distributed computing systems. Computers
and Mathematics with Application 1997;34:115-131.
[30] Dai YS, Xie M, Poh KL, Liu GQ. A study of service reliability and availability for
distributed systems. Reliability Engineering and System Safety 2003;79:
103-112.
[31] Kumar A, Agrawal DP. A generalized algorithm for evaluating
distributed-program reliability. IEEE Transactions on Reliability 1993;42:416-424.
