SESSION NMS-2201
9627_05_2004_c2
Agenda
Introduction
Availability Measurement Methodologies
Trouble Ticketing
Device Reachability: ICMP (Ping), SA Agent, COOL
SNMP: Uptime, Ping-MIB, COOL, EEM, SA Agent
Application
Associated Sessions
NMS-1N01: Intro to Network Management
NMS-1N02: Intro to SNMP and MIBs
NMS-1N04: Intro to Service Assurance Agent
NMS-1N41: Introduction to Performance Management
NMS-2042: Performance Measurement with Cisco IOS
ACC-2010: Deploying Mobility in HA Wireless LANs
NMS-2202: How Cisco Achieved HA in Its LAN
RST-2514: HA in Campus Network Deployments
NMS-4043: Advanced Service Assurance Agent
RST-4312: High Availability in Routing
INTRODUCTION
WHY MEASURE AVAILABILITY?
[Figure: sources of downtime: Technology (hardware, links, design): 20%; User Error and Process: 40%; Software and Application: 40%. Contributing factors include environmental issues, natural disasters, process consistency, power, software quality and issues, performance and load, scaling, network design, capacity (unanticipated peaks), solutions validation, inadvertent configuration change, change management, protocol implementations and misbehavior, and hardware fault]
Revenue/Employee
Productivity
Time to market
Organizational mission
Customer perspective
Satisfaction
Retention
Market Share
Process Design
Network and
Physical Plant Design
INTRODUCTION
WHAT IS NETWORK
AVAILABILITY?
Availability   Annual Downtime
99.000%        3 Days, 15 Hours, 36 Minutes
99.500%        1 Day, 19 Hours, 48 Minutes
99.900%        8 Hours, 46 Minutes
99.950%        4 Hours, 23 Minutes
99.990%        53 Minutes
99.999%        5 Minutes
99.9999%       30 Seconds
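The downtime figures above follow directly from the availability percentage; a minimal Python sketch (the function name is illustrative):

```python
def annual_downtime_minutes(availability):
    """Minutes of downtime per year implied by an availability fraction."""
    return (1 - availability) * 365 * 24 * 60

# 99.999% ("five nines") allows roughly 5 minutes of downtime per year
five_nines = annual_downtime_minutes(0.99999)   # ~5.3 minutes
three_nines = annual_downtime_minutes(0.999)    # ~526 minutes, i.e. ~8.8 hours
```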
Availability Definition
Availability definition is
based on business
objectives
Is it the user experience you are
interested in measuring?
Are some users more important
than others?
Availability groups?
Definitions of different groups
Network Design
What Is Reliability?
Reliability is often used as a general term that
refers to the quality of a product
Failure rate
MTBF (Mean Time Between Failures) or
MTTF (Mean Time To Failure)
Engineered availability
MTBF Defined
MTBF stands for Mean Time Between Failure
MTTF stands for Mean Time to Failure
This is the average length of time between failures (MTBF)
or, to a failure (MTTF)
More technically, it is the mean time to go from an
OPERATIONAL STATE to a NON-OPERATIONAL STATE
MTBF is usually used for repairable systems, and MTTF is
used for non-repairable systems
Availability = MTBF / (MTBF + MTTR)
Annual uptime = 8,760 hrs/year x (0.9988) = 8,749.5 hrs
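The availability and annual uptime arithmetic can be sketched as follows; the MTBF/MTTR inputs are hypothetical values chosen to reproduce the 0.9988 figure on the slide:

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

HOURS_PER_YEAR = 8760

a = availability(8500, 10.2)        # hypothetical MTBF/MTTR giving ~0.9988
annual_uptime = HOURS_PER_YEAR * a  # ~8,749.5 hrs, as on the slide
```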
RBD
[Figure: reliability block diagram with component A in series, B1/B2 in 1-of-2 parallel, and D1/D2/D3 in 2-of-3 parallel]
Standby redundant
Backup components are not operating
Perfect switching
Switch-over is immediate and without fail
Switch-over reliability
The probability of switchover when it is not perfect
Load sharing
All units are on and workload is distributed
MEASURING THE
PRODUCTION NETWORK
Types of Availability
Device/interface
Path
Users
Application
NETWORK AVAILABILITY
COLLECTION METHODS
TROUBLE TICKETING METHODS
Step II
Measure uptime ongoing
Track defects per million (DPM) or IUM or
availability (%)
Step III
Track customer impact for each ticket/MTTR
Categorize DPM by reason code and
begin trending
Identify initiatives and areas of focus to
eliminate defects
DPM = (8 x 24) / (100 x 24 x 365) x 10^6 = 219.2
Availability = 1 - (8 x 24) / (100 x 24 x 365) = 0.99978082
MTBF = (24 x 365) / 8 = 1095 (hours)
MTTR = 1095 x (1 - 0.99978082) / 0.99978082 = 0.24 (hours)
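The arithmetic above can be sketched in Python, assuming the figures describe 100 devices with 8 outages of 24 hours each over one year (that reading of the formula is an inference from the numbers):

```python
DEVICES = 100
OUTAGES_PER_YEAR = 8
OUTAGE_HOURS = 24

total_hours = DEVICES * 24 * 365          # device-hours of scheduled service
down_hours = OUTAGES_PER_YEAR * OUTAGE_HOURS

dpm = down_hours / total_hours * 1e6               # defects per million
availability = 1 - down_hours / total_hours        # 0.99978082
mtbf = 24 * 365 / OUTAGES_PER_YEAR                 # 1095 hours
mttr = mtbf * (1 - availability) / availability    # ~0.24 hours
```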
[Illustrative figure: monthly availability trend, July through June, ranging roughly 99.50% to 99.80%]
Key takeaways
[Illustrative figure: monthly DPM by network segment (Backbone, NAS, PG, POP, Radius Server, VPN Radius Server), June through December, against a 99.99% target of 100 DPM]

Illustrative data, June through September:

                  June    July    Aug     Sept
Other             339.5   424.9   394.7   362.2
Platform Related  49.2    82.5    104     52.6
Total DPM         388.7   507.4   498.7   414.8
99.99% Target     100     100     100     100

Platform-related breakdown (June through September; some monthly values not recovered):
Backbone: 1.5, .8, 15.7, 2.3
NAS: 21.7, 19.4, 27, 26.1
PG: 26, 59.6, 56.8, 18.9
POP: 3.9, .5, 1.6
Radius Server: 1.2, .3
VPN Radius: 8.8, 2.8, 3.4
Total: 49.2, 82.5, 104, 52.6
DPM by Cause
[Illustrative figure and table: monthly DPM by cause (Unknown, Human Error, Environmental, Power, Other, HW, Config/SW), December through May, with monthly totals ranging from roughly 1,200 to 3,800 DPM]
[Illustrative figures: monthly MTTR in hours, June through December (roughly 7.2 to 15.1 hours), and monthly fault counts broken out by time-to-repair bucket (<1 Hr, 1-4 Hr, 4-12 Hr, 12-24 Hr, >24 Hr)]
Unplanned DPM (illustrative), by cause and month:

          Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Other     70    100   35    79    80    80    165   110   40    10    -
Process   90    80    55    100   100   90    210   180   75    10    -
HW        90    200   80    104   180   115   385   325   245   110   -
SW        60    140   50    67    80    65    200   145   100   40    10
TOTAL     310   520   220   350   440   350   960   760   460   170   40

[Chart: stacked monthly unplanned DPM, February through December]
Key takeaways
Action plans
Identify areas of focus to enable
reduction of DPM to achieve network
availability goal
Cons
Some internal subjective/consistency process issues
Outages may occur that are not included in the trouble
ticketing systems
Resources needed to scrub data and create reports
May not work with existing trouble ticketing
system/process
AUTOMATED FAULT
MANAGEMENT EVENTS METHOD
Step II
Establish network availability baseline
Measure uptime on an ongoing basis
Step III
Track root cause and customer impact
Begin trending of availability issues
Identify initiatives and areas of focus
to eliminate defects
Event Log
Analysis of events
received from the
network devices
Analysis of accuracy
of the data
Cons
Requires an excellent change management/provisioning
process
Requires an efficient and effective fault management system
Requires custom development
Does not account for routing problems
Not true end-to-end measure
NETWORK AVAILABILITY
DATA COLLECTION
SAMPLE OUTPUT
Category        # of Devices  Count of Incidents  Total Down Time (hhh:mm:ss)  % Down   % Up      Shortest Outage  Mean Time to Repair  Longest Outage  Events per Device
Host Totals     2389          801                 202:27:27                    .0673%   99.9327%  0:00:19          0:20:47              7:48:46         24.42
Network Totals  4732          1673                430:02:03                    .1309%   99.8691%  0:00:24          0:22:36              9:49:35         14.90
Other Totals    897           173                 212:29:46                    .0509%   99.9491%  0:00:17          0:26:07              2:16:10         16.84
GRAND TOTAL     8018          2647                844:59:16                    .0830%   99.9170%  0:00:20          0:23:10              6:38:11         18.72
[Pie charts: distribution of total down time and count of incidents across Host Totals, Network Totals, and Other Totals]
How:
Edge interfaces and/or devices are defined and pinged
on a determined interval
Unavailability:
Pre-defined, non-response from the interface
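The ping-based method above can be sketched in Python; the function names are illustrative, and a Unix-style `ping` command is assumed:

```python
import subprocess

def ping_once(host, timeout_s=1):
    """Single ICMP echo via the system 'ping' (Unix-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        capture_output=True)
    return result.returncode == 0

def availability_from_polls(poll_results):
    """Fraction of poll intervals in which the target responded."""
    return sum(poll_results) / len(poll_results)
```

In practice a scheduler would call `ping_once` for each defined edge interface at the chosen interval and feed the booleans into `availability_from_polls`.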
Cons
Point to multipoint implies not true end-to-end measure
Availability granularity limited by ping frequency
Maintenance of the device database: requires a solid
change management and provisioning process
How:
An agent is configured to SNMP poll and tabulate outage
times for defined devices or links; database maintains
outage times and total service time; sometimes trap
information is used to augment this method by providing
more accurate information on outages
Unavailability:
Pre-defined, non-redundant links, ports, or devices that
are down
The smallest measurable outage interval is 1 / sampling rate
How:
Utilizing existing NMS systems that are currently SNMP
polling to tabulate outage times for defined devices or links
A database maintains outage times and total service time
SNMP Trap information is also used to augment this
method by providing more accurate information on
outages
Cons
No canned SW exists to do this; requires custom development
Maintaining element device database challenging
Requires an excellent change mgmt and provisioning
process
Does not account for routing problems
Not a true end-to-end measure
How:
A data collector creates SA Agents on the routers to
monitor certain network/service performances; the data
collector then collects this data from the routers,
aggregates it and makes it available
Unavailability:
Pre-defined paths with reporting on non-redundant links,
ports, or devices that are down within a path
Case Study:
Financial Institution (Collection)
Internet
Web Sites
DNS
SA Agent Collectors
Remote Sites
DPM = 1/10,000 x 10^6 = 100 (100 probes out of 1 million will fail)
Availability = 1 - 1/10,000 = 0.9999
Sample Size
Sample size is the number of samples that have
been collected
The more samples collected the higher the confidence that
the data accurately represents the network
Confidence (margin of error) is defined by
m = 1 / sqrt(sample size)
m = 1 / sqrt(24) = 0.2041
m = 1 / sqrt(24 x 31) = 0.0367
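The margin-of-error rule above in Python; the 24 and 24 x 31 sample counts correspond to one poll per hour over a day and over a 31-day month:

```python
import math

def margin_of_error(sample_size):
    """m = 1 / sqrt(n): rough confidence bound on a sampled availability figure."""
    return 1 / math.sqrt(sample_size)

one_day = margin_of_error(24)         # hourly polls for one day   -> ~0.204
one_month = margin_of_error(24 * 31)  # hourly polls for one month -> ~0.037
```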
Cons
Requires a system to collect the SAA data
Requires implementation in the router configurations
Availability granularity limited by polling frequency
Definition of the critical network paths to be measured
COOL Objectives
To automate the measurement to increase
operational efficiency and reduce operational cost
To measure the outage as close to the source of
outage events as possible, to pinpoint the cause of
the outages
To cope with a large number of network elements
without causing system and network performance
degradation
To maintain measurement data reliably in the presence
of element failure or network partition
To support simplicity in deployment, configuration,
and data collection (autonomous measurement)
COOL Features
[Figure: COOL on the access router, feeding an NMS and 3rd-party tools (NetTools, C-NOTE, PNL); customer equipment attached below]
Two-tier framework
Reduces performance impact on the router
Provides scalability to the NMS
Makes it easy to deploy
[Figure: COOL on access and core routers, polled by multiple NMS; customer equipment attached]
Outage Model
[Figure: access router with monitored objects (physical and logical interfaces, power, fan, RP) and links to a MUX/hub/switch, customer equipment, and a peer router; a network management system observes the router]
Failure modes / objects monitored:
Physical entity objects
Interface objects
Remote objects
Software objects
Outage Characterization
Data Definition
Defect threshold: a value across which the object is considered to be
defective (service degradation or complete outage)
Duration threshold: the minimum period beyond which an outage needs
to be reported (given SLA)
Start time: when the object outage starts
End time: when the outage ends
[Figure: timeline of an outage: a down event crosses the defect threshold at the start time; the outage is reported once its duration exceeds the duration threshold; an up event marks the end time]
Architecture
[Figure: COOL architecture within Cisco IOS: customer interfaces (Outage Monitor MIB via SNMP polling and notification; CLI configuration); an outage manager derives measurement metrics from an event map table; internal and remote component outage detectors are fed by fault manager callbacks, syslog, CPU usage, and time stamps; outage data, including crash reason, persists in NVRAM and ATA flash; optional measurement methods (ping detection, SAA APIs) reach customer equipment]
Router 1
[Figure: up/down timeline for Router 1 with two system crashes of 10 minutes each]
[Figure: up/down timelines for the router device (two 10-minute system crashes, affecting the logical and physical interfaces) and Interface 1 (an additional 7-minute interface failure)]
Service affecting: AOT = 27; NAF = 3
Router device: AOT = 20; NAF = 2
Interface 1: AOT = 7; NAF = 1
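The AOT/NAF bookkeeping above (AOT = accumulated outage time, NAF = number of accumulated failures) can be sketched as follows. This is an illustrative model of the counters, not the IOS implementation:

```python
class OutageTracker:
    """Per-object bookkeeping in the style of COOL's AOT/NAF counters."""

    def __init__(self):
        self.aot = 0.0  # accumulated outage time, in minutes
        self.naf = 0    # number of accumulated failures

    def record_outage(self, minutes):
        self.aot += minutes
        self.naf += 1

    def mttr(self):
        return self.aot / self.naf

router = OutageTracker()
for crash_minutes in (10, 10):       # two 10-minute system crashes
    router.record_outage(crash_minutes)

interface = OutageTracker()
for outage_minutes in (10, 10, 7):   # the crashes plus a 7-minute interface failure
    interface.record_outage(outage_minutes)
# router: AOT = 20, NAF = 2; interface: AOT = 27, NAF = 3 (as on the slide)
```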
Example: MTTR
Find MTTR for Object i
MTTRi = AOTi/NAFi
= 14/2
= 7 min
[Figure: up/down timeline for Object i between T1 and T2, with two failures and times to repair (TTR) of 10 min. and 4 min.]
[Figure: up/down timeline between T1 and T2 showing the time to failure (TTF) between two failures, with repair times of 10 min. and 4 min.]
Availability (%) = MTBF / (MTBF + MTTR) * 100
[Figure: up/down timeline between T1 and T2 with two failures and repair times of 10 min. and 4 min.]
Operation-caused (planned) outages: send break, reload, forced switchover
[Figure: timeline showing the upper bound of the planned outage]
Event Filtering
Flapping interface detection and filtering:
A faulty interface can keep changing state between up and down
May cause virtual network disconnection
May cause an event storm, with hundreds of messages for each
flapping event
May make the object MTBF unreasonably low due to frequent
short failures
This unstable condition needs to get the operator's attention
COOL detects the flapping status by:
Catching very short outage events (less than the duration threshold)
Incrementing the event counter
Flapping status: if the counter exceeds the flapping threshold (3 events)
within a short period (1 sec), COOL sends a notification
Stable status: if it drops below the threshold, COOL sends another
notification
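The flap criterion described above can be sketched as a sliding-window check. The function name, parameters, and the sliding-window detail are assumptions for illustration, not the IOS internals:

```python
def is_flapping(event_times, threshold=3, window_s=1.0):
    """True if `threshold` or more short outage events fall within any
    window of `window_s` seconds (flap heuristic modeled on the slide)."""
    times = sorted(event_times)
    for i in range(len(times) - threshold + 1):
        # compare the first and last event of each candidate window
        if times[i + threshold - 1] - times[i] <= window_s:
            return True
    return False
```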
[Figure: COOL periodically updates outage data from RAM to persistent storage in NVRAM, with copies to flash on both the active and standby RP]
Data persistency
To avoid data loss due to a link outage or a crash of the router itself
Data redundancy
To continue the outage measurement after the switchover
To retain the outage data even if the RP is physically replaced
Outage Monitor MIB structure:
cOutageObjectTable: Object-Type; Object-Index; Object-Status;
Object-AOT; Object-NAF
Event Reason Map Table: Event-Reason-Index; Event-Time;
Event-Interval (event description)
Cross-referenced MIBs: IF-MIB ifTable (interface object description);
ENTITY-MIB entPhysicalTable (physical object description);
CISCO-PROCESS-MIB cpmProcessTable (process object description)
Configuration
[Figure: the COOL event table and object table are read via show CLI (show event-table; show object-table) or MIB display, and configured via config CLI (run; add; removal; filtering-enable); Cisco IOS updates the tables as the detection function monitors customer equipment]
Enabling COOL
ari#dir
Directory of disk0:/
1  -rw-  19014056  gsr-k4p-mz.120-26.S.bin
(Obtain authorization file)
(Enable COOL)
ari(config)#^Z
ari#wr mem
Building configuration...
[OK][OK][OK]
COOL
Pros
Accurate network availability for devices, components,
and software
Accounts for routing problems
Implementation with low network overhead
Enables correlation between active and passive availability
methodologies
Cons
Only a few systems currently have the COOL feature
Requires implementation in the router configurations of
production devices
Availability granularity limited by polling frequency
New Cisco IOS Feature
APPLICATION LAYER
MEASUREMENT
Application Reachability
How:
Agents on client and server computers collect data
Fire Runner, Ganymede Chariot, Gyra Research, Response
Networks, Vital Signs Software, NetScout, Custom applications
queries on customer systems
Unavailability:
Pre-defined QoS definition
Application Reachability
Pros
Actual application availability can be understood
QoS, by application, can be factored into the availability
measurement
Cons
Depending on scale, high overhead and cost can
be expected
Event MIB
Allows you to create custom notifications and log them and/or send
them as SNMP traps or informs
MIB persistence is supported: a MIB's SNMP data persists across
reloads
Can be used to test objects on other devices
More flexible than RMON events/alarms
RMON is tailored for use with counter objects
Mission statement:
Provide robust, scalable, powerful, and easy-to-use
embedded managers to solve problems such as syslog and
event management within Cisco routers and switches
[Figure: EEM architecture: event detectors (SNMP, others) feed EEM policies, which apply network knowledge and take actions such as notify, switchover, and reload]
EEM Versions
EEM Version 1
Allows policies to be defined using the Cisco IOS CLI applet
The following policy actions can be established:
Generate prioritized syslog messages
Generate a CNS event for upstream processing by
Cisco CNS devices
Reload the Cisco IOS software
Switch to a secondary processor in a fully redundant hardware
configuration
EEM Version 2
EEM Version 2 adds programmable actions using the Tcl
subsystem within Cisco IOS
Includes more event detectors and capabilities
[Figure: EEM server inside Cisco IOS: event detectors (syslog, watchdog/sysmon timer services, POSIX process manager, HA redundancy facility, counters, interface counters and stats, IOS process watchdog, SNMP, and application-specific event detectors) publish events from IOS subsystems to subscribers; EEM policies, defined as programmable local actions in Tcl via the Tcl shell, register with the EEM server, execute when events trigger, and use Tcl extensions for CLI control and defined actions]
Less downtime
Reduce susceptibility and Mean Time to Repair (MTTR)
Better service
Responsiveness
Prevent recurrence
Higher availability
INSTILLING AN
AVAILABILITY CULTURE
How Do I Start?
1. What are you using now?
a. Add or modify trouble ticketing analysis
b. Add or improve active monitoring method
SA Agent Between Access and Distribution
[Figure: campus building blocks: access, distribution, core/backbone, server farm, WAN, Internet, PSTN; SA Agent probes run between access and distribution]
SA Agent Between Servers and WAN Users
[Figure: same campus building blocks; SA Agent probes run between the server farm and WAN users]
[Figure: campus building blocks: access, distribution, core/backbone, server farm, WAN, Internet, PSTN]
Trouble Ticketing Methodology
[Figure: trouble ticketing applied across the campus building blocks: access, distribution, core/backbone, server farm, WAN, Internet, PSTN]
AVAILABILITY MEASUREMENT
SUMMARY
Summary
The availability metric is governed by your business
objectives
The primary goals of availability measurement are:
To provide an availability baseline (maintain)
To help identify where to improve the network
To monitor and control improvement projects
Recommended Reading
Performance and Fault
Management
ISBN: 1-57870-180-5
Network Performance
Baselining
ISBN: 1-57870-240-2
Appendix A: Acronyms
AVG: Average
ATM: Asynchronous Transfer Mode
DPM: Defects Per Million
FCAPS: Fault, Config, Acct, Perf, Security
GE: Gigabit Ethernet
HA: High Availability
HDLC: High-Level Data Link Control
HSRP: Hot Standby Routing Protocol
IPM: Internet Performance Monitor
IUM: Impacted User Minutes
MIB: Management Information Base
BACKUP SLIDES
ADDITIONAL
RELIABILITY SLIDES
Network Design
What Is Reliability?
Reliability is often used as a general term that
refers to the quality of a product
Failure Rate
MTBF (Mean Time Between Failures) or
MTTF (Mean Time to Failure)
Availability
Reliability Defined
Reliability:
1. The probability of survival (or no failure) for a
stated length of time
2. Or, the fraction of units that will not fail in the
stated length of time
A mission time must be stated
Annual reliability is the probability of
survival for one year
Availability Defined
Availability:
1. The probability that an item (or network, etc.) is
operational, and ready-to-go, at any point in time
2. Or, the expected fraction of time it is operational
Annual uptime is the amount (in days, hrs., min.,
etc.) the item is operational in a year
Example: For 98% availability, the annual availability is
0.98 * 365 days = 357.7 days
MTBF Defined
MTBF stands for Mean Time Between Failure
MTTF stands for Mean Time to Failure
This is the average length of time between failures (MTBF)
or, to a failure (MTTF)
More technically, it is the mean time to go from an
operational state to a non-operational state
MTBF is usually used for repairable systems, and MTTF is
used for non-repairable systems
MTTR Defined
MTTR stands for Mean Time to Repair
or Mean Time to Restore
Availability = MTBF / (MTBF + MTTR)
Uptime
Annual uptime = 8,760 hrs/year x (0.9988) = 8,749.5 hrs
Systems: Components In-Series
[Figure: two components in series and the corresponding reliability block diagram (RBD)]
In-Series
Part 1
Part 2
[Figure: up/down timelines for Part 1 and Part 2; the in-series system is down whenever either part is down]
In-Parallel
Part 1
Part 2
[Figure: up/down timelines for Part 1 and Part 2; the in-parallel system is down only when both parts are down]
In-Series MTBF
COMPONENT 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
COMPONENT 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
In-Series Reliability
COMPONENT 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
COMPONENT 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
In-Series Availability
COMPONENT 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
COMPONENT 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component Availability:
A = 2500 / (2500 + 10) = 0.996
System Availability:
A = 0.996 x 0.996 = 0.992
In-Parallel MTBF
COMPONENT 1: MTBF = 2,500 hrs.
COMPONENT 2: MTBF = 2,500 hrs.
System MTBF* = 2,500 + 2,500/2 = 3,750 hrs.
1-of-4 Example
System MTBF = 2,500/1 + 2,500/2 + 2,500/3 + 2,500/4 = 5,208 hrs.
In general*, system MTBF = sum over i = 1..n of MTBF/i
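The harmonic-sum rule above can be checked numerically; `one_of_n_mtbf` is an illustrative name for a 1-of-n system of identical components:

```python
def one_of_n_mtbf(component_mtbf, n):
    """MTBF of a 1-of-n parallel system of identical components,
    using the harmonic-sum formula from the slide."""
    return component_mtbf * sum(1 / i for i in range(1, n + 1))

two_way = one_of_n_mtbf(2500, 2)   # 2,500 + 1,250           = 3,750 hrs
four_way = one_of_n_mtbf(2500, 4)  # 2,500 + 1,250 + 833 + 625 ~= 5,208 hrs
```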
In-Parallel Reliability
COMPONENT 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
COMPONENT 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
System unreliability is the product of the component unreliabilities
In-Parallel Availability
COMPONENT 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
COMPONENT 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component Availability:
A = 2500 / (2500 + 10) = 0.996
System Availability:
A = 1 - (1 - 0.996)^2 = 0.999984
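The in-series and in-parallel availability computations on these slides can be sketched as follows (illustrative function names; components assumed independent):

```python
def series_availability(*components):
    """System is up only when every component is up."""
    a = 1.0
    for c in components:
        a *= c
    return a

def parallel_availability(*components):
    """System is down only when every component is down."""
    u = 1.0
    for c in components:
        u *= (1 - c)
    return 1 - u

a = 2500 / (2500 + 10)                  # single component, ~0.996
series = series_availability(a, a)      # ~0.992
parallel = parallel_availability(a, a)  # ~0.99998
```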
Complex Redundancy
Examples: 1-of-2, 2-of-3, 2-of-4, 8-of-10, m-of-n
[Figure: m-of-n redundancy block with components 1, 2, 3, ..., n]
Standby redundant
Backup components are not operating
Perfect switching
Switch-over is immediate and without fail
Switchover reliability
The probability of switchover when it is not perfect
Load sharing
All units are on and workload is distributed
[Figure: reliability block diagram with component A in series, B1/B2 in 1-of-2 parallel, and D1/D2/D3 in 2-of-3 parallel]
Failure Rate
The number of failures per time:
Failures/hour
Failures/day
Failures/week
Failures/10^6 hours
Failures/10^9 hours, called FITs (Failures in Time)
Approximating MTBF
13 units are tested in a lab for 1,000 hours with 2
failures occurring
Another 4 units were tested for 6,000 hours with 1
failure occurring
The failed units are repaired (or replaced)
What is the approximate MTBF?
MTBF ~= (13 x 1,000 + 4 x 6,000) / (2 + 1) = 37,000 / 3 = 12,333 hours
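The estimate above is total unit-hours on test divided by the number of failures; in Python:

```python
unit_hours = 13 * 1000 + 4 * 6000  # 37,000 unit-hours on test
failures = 2 + 1                   # failures observed across both tests

mtbf_estimate = unit_hours / failures  # ~12,333 hours
```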
Modeling Time-to-Failure Distributions
[Figure: frequency vs. time-to-failure curves for the Normal, Log-Normal, Weibull, and Exponential distributions, each marked with its MTBF]
[Figure: failure rate over a product's life: a decreasing failure rate (infant mortality) followed by an increasing failure rate (wear-out)]
Calculating Reliability
A certain Cisco router has an MTBF of 100,000 hrs;
what is the annual reliability?
Annual reliability is the reliability for one year or 8,760 hrs
R = e^(-8760/100000) = 91.6%
This says that the probability of no failure in one
year is 91.6%; or, 91.6% of all units will survive
one year
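The calculation above uses the exponential (constant failure rate) model; a minimal sketch:

```python
import math

def annual_reliability(mtbf_hours, mission_hours=8760):
    """R = e^(-t/MTBF): probability of surviving the mission time
    under the exponential (constant failure rate) model."""
    return math.exp(-mission_hours / mtbf_hours)

r = annual_reliability(100000)  # ~0.916 for a 100,000-hr MTBF router
```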
ADDITIONAL TROUBLE
TICKETING SLIDES
Field                      Format             Description
Date                       dd/mmm/yy
Ticket                     Alphanumeric
Start Date                 dd/mmm/yy          Date of Fault
Start Time                 hh:mm              Time of Fault
Resolution Date            dd/mmm/yy          Date of Resolution
Resolution Time            hh:mm              Time of Resolution
Customers Impacted         Integer
Problem Description        String
Root Cause                 String
Component/Part/SW Version  Alphanumeric
Type                       Planned/Unplanned
Resolution                 String
HA Metrics/NAIS Synergy
[Figure: trouble tickets, categorized by definitions (planned/unplanned), root cause, resolution, and equipment, feed data analysis: baseline availability, DPM (Defects Per Million), MTTR, and network reliability improvement analysis; referral for analysis drives operational process and procedures analysis: problem management, data accuracy, collection processes, fault management, resiliency assessment, change management, performance management, availability management]
Management Application
1. User configures Collectors
through Mgmt Application GUI
2. Mgmt Application provisions
Source routers with Collectors
SA Agent
3. Source router measures and
stores performance data,
e.g.:
Response time
Availability
[Figure: management system provisioning SA Agent probes P1, P2, P3 across an IP core toward router R3]
[Figure: probes P1...PN measuring target paths TP1...TPx across networks Nw1...NwN]
Throughput
Availability
Jitter
Evaluates SLAs
Proactively sends notification of SLA violations
CLIs
Configuration CLI Commands
[no] cool run <cr>
[no] cool interface interface-name(idb) <cr>
[no] cool physical-FRU-entity entity-index (int) <cr>
[no] cool group-interface group-objectID(string) <cr>
[no] cool add-cpu objectID threshold duration <cr>
[no] cool remote-device dest-IP(paddr) obj-descr(string) rate(int) repeat(int) [local-ip(paddr) mode(int) ]<cr>
[no] cool if-filter group-objectID (string)<cr>
Measurement Example:
Router Device Outage
Reload (Operational) ,
Power Outage, or
Device H/W failure
Measurement Example:
Cisco IOS S/W Outage
The standby RP in slot 0 is crashed using an address error (AdEL
exception) test crash; the outage is caused purely by Cisco IOS S/W
Add a Linecard
Reset the Linecard
Object Table: ATM2/0.1, ATM2/0.2, ATM2/0.4, ATM2/0.5

12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut
(Shut ATM2/0: interface down events captured)

show cool event-table
**** COOL Event Table ****
type index event time-stamp  interval hist_id object-name
1    33    1     1054859105  18       1       ATM2/0.1
1    35    1     1054859106  18       2       ATM2/0.2
1    39    1     1054859107  17       3       ATM2/0.4
1    41    1     1054859108  18       4       ATM2/0.5

12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut
(No shut ATM2/0: interface up events captured)

show cool event-table
**** COOL Event Table ****
type index event time-stamp  interval hist_id object-name
1    33    0     1054859146  41       1       ATM2/0.1
1    35    0     1054859147  41       2       ATM2/0.2
1    39    0     1054859149  42       3       ATM2/0.4
1    41    0     1054859150  42       4       ATM2/0.5
Measurement Example:
Remote Device Outage
12406-R1202(config)#cool remote-device 1 50.1.1.2 remobj.1 30 2 50.1.1.1 1
12406-R1202(config)#cool remote-device 2 50.1.2.2 remobj.2 30 2 50.1.2.1 1
12406-R1202(config)#cool remote-device 3 50.1.3.2 remobj.3 30 2 50.1.3.1 1

12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut
(Down events captured for the remote objects)
type index event time-stamp  interval hist_id object-name
4    2     5     1054867105  42       2       remobj.2
4    1     5     1054867108  47       3       remobj.1
4    3     5     1054867130  65       10      remobj.3

12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut
(Up events captured)
type index event time-stamp  interval hist_id object-name
4    1     4     1054867171  63       1       remobj.1
4    3     4     1054867193  63       8       remobj.3
4    2     4     1054867200  95       10      remobj.2