SESSION NMS-2201
9627_05_2004_c2
Agenda
Introduction
Availability Measurement Methodologies
Trouble Ticketing
Device Reachability: ICMP (Ping), SA Agent, COOL
SNMP: Uptime, Ping-MIB, COOL, EEM, SA Agent
Application
Associated Sessions
NMS-1N01: Intro to Network Management
NMS-1N02: Intro to SNMP and MIBs
NMS-1N04: Intro to Service Assurance Agent
NMS-1N41: Introduction to Performance Management
NMS-2042: Performance Measurement with Cisco IOS
ACC-2010: Deploying Mobility in HA Wireless LANs
NMS-2202: How Cisco Achieved HA in Its LAN
RST-2514: HA in Campus Network Deployments
NMS-4043: Advanced Service Assurance Agent
RST-4312: High Availability in Routing
INTRODUCTION
WHY MEASURE AVAILABILITY?
[Figure: sources of downtime: Technology (hardware, links, design): 20%; User Error and Process: 40%; Software and Application: 40%. Contributing factors include environmental issues, natural disasters, process consistency, power, software quality and issues, performance and load, scaling, network design, capacity (unanticipated peaks), solutions validation, inadvertent configuration change, change management, protocol implementations and misbehavior, and hardware fault]
Revenue/Employee
Productivity
Time to market
Organizational mission
Customer perspective
Satisfaction
Retention
Market Share
Process Design
Network and
Physical Plant Design
INTRODUCTION
WHAT IS NETWORK
AVAILABILITY?
Availability   Annual Downtime
99.000%        3 Days, 15 Hours, 36 Minutes
99.500%        1 Day, 19 Hours, 48 Minutes
99.900%        8 Hours, 46 Minutes
99.950%        4 Hours, 23 Minutes
99.990%        53 Minutes
99.999%        5 Minutes
99.9999%       30 Seconds
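The downtime figures above follow directly from the availability percentage; a minimal Python sketch (the function name is illustrative):

```python
def annual_downtime_minutes(availability):
    """Minutes of downtime per year implied by an availability fraction."""
    return (1 - availability) * 365 * 24 * 60

# 99.999% ("five nines") allows roughly 5 minutes of downtime per year
five_nines = annual_downtime_minutes(0.99999)   # ~5.3 minutes
three_nines = annual_downtime_minutes(0.999)    # ~526 minutes, i.e. ~8.8 hours
```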
Availability Definition
Availability definition is
based on business
objectives
Is it the user experience you are
interested in measuring?
Are some users more important
than others?
Availability groups?
Definitions of different groups
Network Design
What Is Reliability?
Reliability is often used as a general term that
refers to the quality of a product
Failure rate
MTBF (Mean Time Between Failures) or
MTTF (Mean Time To Failure)
Engineered availability
MTBF Defined
MTBF stands for Mean Time Between Failure
MTTF stands for Mean Time to Failure
This is the average length of time between failures (MTBF)
or, to a failure (MTTF)
More technically, it is the mean time to go from an
OPERATIONAL STATE to a NON-OPERATIONAL STATE
MTBF is usually used for repairable systems, and MTTF is
used for non-repairable systems
Availability = MTBF / (MTBF + MTTR)
Annual uptime = 8,760 hrs/year x (0.9988) = 8,749.5 hrs
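The availability and annual uptime arithmetic can be sketched as follows; the MTBF/MTTR inputs are hypothetical values chosen to reproduce the 0.9988 figure on the slide:

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

HOURS_PER_YEAR = 8760

a = availability(8500, 10.2)        # hypothetical MTBF/MTTR giving ~0.9988
annual_uptime = HOURS_PER_YEAR * a  # ~8,749.5 hrs, as on the slide
```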
RBD
[Figure: reliability block diagram with component A in series, B1/B2 in 1-of-2 parallel, and D1/D2/D3 in 2-of-3 parallel]
Standby redundant
Backup components are not operating
Perfect switching
Switch-over is immediate and without fail
Switch-over reliability
The probability of switchover when it is not perfect
Load sharing
All units are on and workload is distributed
MEASURING THE
PRODUCTION NETWORK
Types of Availability
Device/interface
Path
Users
Application
NETWORK AVAILABILITY
COLLECTION METHODS
TROUBLE TICKETING METHODS
Step II
Measure uptime ongoing
Track defects per million (DPM) or IUM or
availability (%)
Step III
Track customer impact for each ticket/MTTR
Categorize DPM by reason code and
begin trending
Identify initiatives and areas of focus to
eliminate defects
DPM = (8 x 24) / (100 x 24 x 365) x 10^6 = 219.2
Availability = 1 - (8 x 24) / (100 x 24 x 365) = 0.99978082
MTBF = (24 x 365) / 8 = 1095 (hours)
MTTR = 1095 x (1 - 0.99978082) / 0.99978082 = 0.24 (hours)
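The arithmetic above can be sketched in Python, assuming the figures describe 100 devices with 8 outages of 24 hours each over one year (that reading of the formula is an inference from the numbers):

```python
DEVICES = 100
OUTAGES_PER_YEAR = 8
OUTAGE_HOURS = 24

total_hours = DEVICES * 24 * 365          # device-hours of scheduled service
down_hours = OUTAGES_PER_YEAR * OUTAGE_HOURS

dpm = down_hours / total_hours * 1e6               # defects per million
availability = 1 - down_hours / total_hours        # 0.99978082
mtbf = 24 * 365 / OUTAGES_PER_YEAR                 # 1095 hours
mttr = mtbf * (1 - availability) / availability    # ~0.24 hours
```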
[Illustrative figure: monthly availability trend, July through June, ranging roughly 99.50% to 99.80%]
Key takeaways
[Illustrative figure: monthly DPM by network segment (Backbone, NAS, PG, POP, Radius Server, VPN Radius Server), June through December, against a 99.99% target of 100 DPM]

Illustrative data, June through September:

                  June    July    Aug     Sept
Other             339.5   424.9   394.7   362.2
Platform Related  49.2    82.5    104     52.6
Total DPM         388.7   507.4   498.7   414.8
99.99% Target     100     100     100     100

Platform-related breakdown (June through September; some monthly values not recovered):
Backbone: 1.5, .8, 15.7, 2.3
NAS: 21.7, 19.4, 27, 26.1
PG: 26, 59.6, 56.8, 18.9
POP: 3.9, .5, 1.6
Radius Server: 1.2, .3
VPN Radius: 8.8, 2.8, 3.4
Total: 49.2, 82.5, 104, 52.6
DPM by Cause
[Illustrative figure and table: monthly DPM by cause (Unknown, Human Error, Environmental, Power, Other, HW, Config/SW), December through May, with monthly totals ranging from roughly 1,200 to 3,800 DPM]
[Illustrative figures: monthly MTTR in hours, June through December (roughly 7.2 to 15.1 hours), and monthly fault counts broken out by time-to-repair bucket (<1 Hr, 1-4 Hr, 4-12 Hr, 12-24 Hr, >24 Hr)]
Unplanned DPM (illustrative), by cause and month:

          Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Other     70    100   35    79    80    80    165   110   40    10    -
Process   90    80    55    100   100   90    210   180   75    10    -
HW        90    200   80    104   180   115   385   325   245   110   -
SW        60    140   50    67    80    65    200   145   100   40    10
TOTAL     310   520   220   350   440   350   960   760   460   170   40

[Chart: stacked monthly unplanned DPM, February through December]
Key takeaways
Action plans
Identify areas of focus to enable
reduction of DPM to achieve network
availability goal
Cons
Some internal subjective/consistency process issues
Outages may occur that are not included in the trouble
ticketing systems
Resources needed to scrub data and create reports
May not work with existing trouble ticketing
system/process
AUTOMATED FAULT
MANAGEMENT EVENTS METHOD
Step II
Establish network availability baseline
Measure uptime on an ongoing basis
Step III
Track root cause and customer impact
Begin trending of availability issues
Identify initiatives and areas of focus
to eliminate defects
Event Log
Analysis of events
received from the
network devices
Analysis of accuracy
of the data
Cons
Requires an excellent change management/provisioning
process
Requires an efficient and effective fault management system
Requires custom development
Does not account for routing problems
Not true end-to-end measure
NETWORK AVAILABILITY
DATA COLLECTION
SAMPLE OUTPUT
Category        # of Devices  Count of Incidents  Total Down Time (hhh:mm:ss)  % Down   % Up      Shortest Outage  Mean Time to Repair  Longest Outage  Events per Device
Host Totals     2389          801                 202:27:27                    .0673%   99.9327%  0:00:19          0:20:47              7:48:46         24.42
Network Totals  4732          1673                430:02:03                    .1309%   99.8691%  0:00:24          0:22:36              9:49:35         14.90
Other Totals    897           173                 212:29:46                    .0509%   99.9491%  0:00:17          0:26:07              2:16:10         16.84
GRAND TOTAL     8018          2647                844:59:16                    .0830%   99.9170%  0:00:20          0:23:10              6:38:11         18.72
[Pie charts: distribution of total down time and count of incidents across Host Totals, Network Totals, and Other Totals]
How:
Edge interfaces and/or devices are defined and pinged
on a determined interval
Unavailability:
Pre-defined, non-response from the interface
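The ping-based method above can be sketched in Python; the function names are illustrative, and a Unix-style `ping` command is assumed:

```python
import subprocess

def ping_once(host, timeout_s=1):
    """Single ICMP echo via the system 'ping' (Unix-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        capture_output=True)
    return result.returncode == 0

def availability_from_polls(poll_results):
    """Fraction of poll intervals in which the target responded."""
    return sum(poll_results) / len(poll_results)
```

In practice a scheduler would call `ping_once` for each defined edge interface at the chosen interval and feed the booleans into `availability_from_polls`.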
Cons
Point to multipoint implies not true end-to-end measure
Availability granularity limited by ping frequency
Maintenance of the device database: requires a solid
change management and provisioning process
How:
An agent is configured to SNMP poll and tabulate outage
times for defined devices or links; database maintains
outage times and total service time; sometimes trap
information is used to augment this method by providing
more accurate information on outages
Unavailability:
Pre-defined, non-redundant links, ports, or devices that
are down
The smallest measurable outage interval is 1 / sampling rate
How:
Utilizing existing NMS systems that are currently SNMP
polling to tabulate outage times for defined devices or links
A database maintains outage times and total service time
SNMP Trap information is also used to augment this
method by providing more accurate information on
outages
Cons
No canned SW exists to do this; requires custom development
Maintaining element device database challenging
Requires an excellent change mgmt and provisioning
process
Does not account for routing problems
Not a true end-to-end measure
How:
A data collector creates SA Agents on the routers to
monitor certain network/service performances; the data
collector then collects this data from the routers,
aggregates it and makes it available
Unavailability:
Pre-defined paths with reporting on non-redundant links,
ports, or devices that are down within a path
Case Study:
Financial Institution (Collection)
Internet
Web Sites
DNS
SA Agent Collectors
Remote Sites
DPM = 1/10,000 x 10^6 = 100 (100 probes out of 1 million will fail)
Availability = 1 - 1/10,000 = 0.9999
Sample Size
Sample size is the number of samples that have
been collected
The more samples collected the higher the confidence that
the data accurately represents the network
Confidence (margin of error) is defined by
m = 1 / sqrt(sample size)
m = 1 / sqrt(24) = 0.2041
m = 1 / sqrt(24 x 31) = 0.0367
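The margin-of-error rule above in Python; the 24 and 24 x 31 sample counts correspond to one poll per hour over a day and over a 31-day month:

```python
import math

def margin_of_error(sample_size):
    """m = 1 / sqrt(n): rough confidence bound on a sampled availability figure."""
    return 1 / math.sqrt(sample_size)

one_day = margin_of_error(24)         # hourly polls for one day   -> ~0.204
one_month = margin_of_error(24 * 31)  # hourly polls for one month -> ~0.037
```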
Cons
Requires a system to collect the SAA data
Requires implementation in the router configurations
Availability granularity limited by polling frequency
Definition of the critical network paths to be measured
COOL Objectives
To automate the measurement to increase
operational efficiency and reduce operational cost
To measure the outage as close to the source of
outage events as possible, to pinpoint the cause of
the outages
To cope with a large number of network elements
without causing system and network performance
degradation
To maintain measurement data reliably in the presence
of element failure or network partition
To support simplicity in deployment, configuration,
and data collection (autonomous measurement)
COOL Features
[Figure: COOL on the access router, feeding an NMS and 3rd-party tools (NetTools, C-NOTE, PNL); customer equipment attached below]
Two-tier framework
Reduces performance impact on the router
Provides scalability to the NMS
Makes it easy to deploy
[Figure: COOL on access and core routers, polled by multiple NMS; customer equipment attached]
Outage Model
[Figure: access router with monitored objects (physical and logical interfaces, power, fan, RP) and links to a MUX/hub/switch, customer equipment, and a peer router; a network management system observes the router]
Failure modes / objects monitored:
Physical entity objects
Interface objects
Remote objects
Software objects
Outage Characterization
Data Definition
Defect threshold: a value across which the object is considered to be
defective (service degradation or complete outage)
Duration threshold: the minimum period beyond which an outage needs
to be reported (given SLA)
Start time: when the object outage starts
End time: when the outage ends
[Figure: timeline of an outage: a down event crosses the defect threshold at the start time; the outage is reported once its duration exceeds the duration threshold; an up event marks the end time]
Architecture
[Figure: COOL architecture within Cisco IOS: customer interfaces (Outage Monitor MIB via SNMP polling and notification; CLI configuration); an outage manager derives measurement metrics from an event map table; internal and remote component outage detectors are fed by fault manager callbacks, syslog, CPU usage, and time stamps; outage data, including crash reason, persists in NVRAM and ATA flash; optional measurement methods (ping detection, SAA APIs) reach customer equipment]
Router 1
[Figure: up/down timeline for Router 1 with two system crashes of 10 minutes each]
[Figure: up/down timelines for the router device (two 10-minute system crashes, affecting the logical and physical interfaces) and Interface 1 (an additional 7-minute interface failure)]
Service affecting: AOT = 27; NAF = 3
Router device: AOT = 20; NAF = 2
Interface 1: AOT = 7; NAF = 1
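The AOT/NAF bookkeeping above (AOT = accumulated outage time, NAF = number of accumulated failures) can be sketched as follows. This is an illustrative model of the counters, not the IOS implementation:

```python
class OutageTracker:
    """Per-object bookkeeping in the style of COOL's AOT/NAF counters."""

    def __init__(self):
        self.aot = 0.0  # accumulated outage time, in minutes
        self.naf = 0    # number of accumulated failures

    def record_outage(self, minutes):
        self.aot += minutes
        self.naf += 1

    def mttr(self):
        return self.aot / self.naf

router = OutageTracker()
for crash_minutes in (10, 10):       # two 10-minute system crashes
    router.record_outage(crash_minutes)

interface = OutageTracker()
for outage_minutes in (10, 10, 7):   # the crashes plus a 7-minute interface failure
    interface.record_outage(outage_minutes)
# router: AOT = 20, NAF = 2; interface: AOT = 27, NAF = 3 (as on the slide)
```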
Example: MTTR
Find MTTR for Object i
MTTRi = AOTi/NAFi
= 14/2
= 7 min
[Figure: up/down timeline for Object i between T1 and T2, with two failures and times to repair (TTR) of 10 min. and 4 min.]
[Figure: up/down timeline between T1 and T2 showing the time to failure (TTF) between two failures, with repair times of 10 min. and 4 min.]
Availability (%) = MTBF / (MTBF + MTTR) * 100
[Figure: up/down timeline between T1 and T2 with two failures and repair times of 10 min. and 4 min.]
Operation-caused (planned) outages: send break, reload, forced switchover
[Figure: timeline showing the upper bound of the planned outage]
Event Filtering
Flapping interface detection and filtering:
A faulty interface can keep changing state between up and down
May cause virtual network disconnection
May cause an event storm, with hundreds of messages for each
flapping event
May make the object MTBF unreasonably low due to frequent
short failures
This unstable condition needs to get the operator's attention
COOL detects the flapping status by:
Catching very short outage events (less than the duration threshold)
Incrementing the event counter
Flapping status: if the counter exceeds the flapping threshold (3 events)
within a short period (1 sec), COOL sends a notification
Stable status: if it drops below the threshold, COOL sends another
notification
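The flap criterion described above can be sketched as a sliding-window check. The function name, parameters, and the sliding-window detail are assumptions for illustration, not the IOS internals:

```python
def is_flapping(event_times, threshold=3, window_s=1.0):
    """True if `threshold` or more short outage events fall within any
    window of `window_s` seconds (flap heuristic modeled on the slide)."""
    times = sorted(event_times)
    for i in range(len(times) - threshold + 1):
        # compare the first and last event of each candidate window
        if times[i + threshold - 1] - times[i] <= window_s:
            return True
    return False
```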
[Figure: COOL periodically updates outage data from RAM to persistent storage in NVRAM, with copies to flash on both the active and standby RP]
Data persistency
To avoid data loss due to a link outage or a crash of the router itself
Data redundancy
To continue the outage measurement after the switchover
To retain the outage data even if the RP is physically replaced
Outage Monitor MIB structure:
cOutageObjectTable: Object-Type; Object-Index; Object-Status;
Object-AOT; Object-NAF
Event Reason Map Table: Event-Reason-Index; Event-Time;
Event-Interval (event description)
Cross-referenced MIBs: IF-MIB ifTable (interface object description);
ENTITY-MIB entPhysicalTable (physical object description);
CISCO-PROCESS-MIB cpmProcessTable (process object description)
Configuration
[Figure: the COOL event table and object table are read via show CLI (show event-table; show object-table) or MIB display, and configured via config CLI (run; add; removal; filtering-enable); Cisco IOS updates the tables as the detection function monitors customer equipment]
Enabling COOL
ari#dir
Directory of disk0:/
1  -rw-  19014056  gsr-k4p-mz.120-26.S.bin
(Obtain authorization file)
(Enable COOL)
ari(config)#^Z
ari#wr mem
Building configuration...
[OK][OK][OK]
COOL
Pros
Accurate network availability for devices, components,
and software
Accounts for routing problems
Implementation with low network overhead
Enables correlation between active and passive availability
methodologies
Cons
Only a few systems currently have the COOL feature
Requires implementation in the router configurations of
production devices
Availability granularity limited by polling frequency
New Cisco IOS Feature
APPLICATION LAYER
MEASUREMENT
Application Reachability
How:
Agents on client and server computers collect data
Fire Runner, Ganymede Chariot, Gyra Research, Response
Networks, Vital Signs Software, NetScout, Custom applications
queries on customer systems
Unavailability:
Pre-defined QoS definition
Application Reachability
Pros
Actual application availability can be understood
QoS, by application, can be factored into the availability
measurement
Cons
Depending on scale, high overhead and cost can
be expected
Event MIB
Allows you to create custom notifications and log them and/or send
them as SNMP traps or informs
MIB persistence is supported: a MIB's SNMP data persists across
reloads
Can be used to test objects on other devices
More flexible than RMON events/alarms
RMON is tailored for use with counter objects
Mission statement:
Provide robust, scalable, powerful, and easy-to-use
embedded managers to solve problems such as syslog and
event management within Cisco routers and switches
[Figure: EEM architecture: event detectors (SNMP, others) feed EEM policies, which apply network knowledge and take actions such as notify, switchover, and reload]
EEM Versions
EEM Version 1
Allows policies to be defined using the Cisco IOS CLI applet
The following policy actions can be established:
Generate prioritized syslog messages
Generate a CNS event for upstream processing by
Cisco CNS devices
Reload the Cisco IOS software
Switch to a secondary processor in a fully redundant hardware
configuration
EEM Version 2
EEM Version 2 adds programmable actions using the Tcl
subsystem within Cisco IOS
Includes more event detectors and capabilities
[Figure: EEM server inside Cisco IOS: event detectors (syslog, watchdog/sysmon timer services, POSIX process manager, HA redundancy facility, counters, interface counters and stats, IOS process watchdog, SNMP, and application-specific event detectors) publish events from IOS subsystems to subscribers; EEM policies, defined as programmable local actions in Tcl via the Tcl shell, register with the EEM server, execute when events trigger, and use Tcl extensions for CLI control and defined actions]
Less downtime
Reduce susceptibility and Mean Time to Repair (MTTR)
Better service
Responsiveness
Prevent recurrence
Higher availability
INSTILLING AN
AVAILABILITY CULTURE
How Do I Start?
1. What are you using now?
a. Add or modify trouble ticketing analysis
b. Add or improve active monitoring method
SA Agent Between Access and Distribution
[Figure: campus building blocks: access, distribution, core/backbone, server farm, WAN, Internet, PSTN; SA Agent probes run between access and distribution]
SA Agent Between Servers and WAN Users
[Figure: same campus building blocks; SA Agent probes run between the server farm and WAN users]
[Figure: campus building blocks: access, distribution, core/backbone, server farm, WAN, Internet, PSTN]
Trouble Ticketing Methodology
[Figure: trouble ticketing applied across the campus building blocks: access, distribution, core/backbone, server farm, WAN, Internet, PSTN]
AVAILABILITY MEASUREMENT
SUMMARY
Summary
The availability metric is governed by your business
objectives
The primary goals of availability measurement are:
To provide an availability baseline (maintain)
To help identify where to improve the network
To monitor and control improvement projects
Recommended Reading
Performance and Fault
Management
ISBN: 1-57870-180-5
Network Performance
Baselining
ISBN: 1-57870-240-2
Appendix A: Acronyms
AVG: Average
ATM: Asynchronous Transfer Mode
DPM: Defects Per Million
FCAPS: Fault, Config, Acct, Perf, Security
GE: Gigabit Ethernet
HA: High Availability
HDLC: High-Level Data Link Control
HSRP: Hot Standby Routing Protocol
IPM: Internet Performance Monitor
IUM: Impacted User Minutes
MIB: Management Information Base
BACKUP SLIDES
ADDITIONAL
RELIABILITY SLIDES
Network Design
What Is Reliability?
Reliability is often used as a general term that
refers to the quality of a product
Failure Rate
MTBF (Mean Time Between Failures) or
MTTF (Mean Time to Failure)
Availability
Reliability Defined
Reliability:
1. The probability of survival (or no failure) for a
stated length of time
2. Or, the fraction of units that will not fail in the
stated length of time
A mission time must be stated
Annual reliability is the probability of
survival for one year
Availability Defined
Availability:
1. The probability that an item (or network, etc.) is
operational, and ready-to-go, at any point in time
2. Or, the expected fraction of time it is operational
Annual uptime is the amount (in days, hrs., min.,
etc.) the item is operational in a year
Example: For 98% availability, the annual availability is
0.98 * 365 days = 357.7 days
MTBF Defined
MTBF stands for Mean Time Between Failure
MTTF stands for Mean Time to Failure
This is the average length of time between failures (MTBF)
or, to a failure (MTTF)
More technically, it is the mean time to go from an
operational state to a non-operational state
MTBF is usually used for repairable systems, and MTTF is
used for non-repairable systems
MTTR Defined
MTTR stands for Mean Time to Repair
or Mean Time to Restore
Availability = MTBF / (MTBF + MTTR)
Uptime
Annual uptime = 8,760 hrs/year x (0.9988) = 8,749.5 hrs
Systems: Components In-Series
[Figure: two components in series and the corresponding reliability block diagram (RBD)]
In-Series
Part 1
Part 2
[Figure: up/down timelines for Part 1 and Part 2; the in-series system is down whenever either part is down]
In-Parallel
Part 1
Part 2
[Figure: up/down timelines for Part 1 and Part 2; the in-parallel system is down only when both parts are down]
In-Series MTBF
COMPONENT 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
COMPONENT 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
In-Series Reliability
COMPONENT 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
COMPONENT 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
In-Series Availability
COMPONENT 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
COMPONENT 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component Availability:
A = 2500 / (2500 + 10) = 0.996
System Availability:
A = 0.996 x 0.996 = 0.992
In-Parallel MTBF
COMPONENT 1: MTBF = 2,500 hrs.
COMPONENT 2: MTBF = 2,500 hrs.
System MTBF* = 2,500 + 2,500/2 = 3,750 hrs.
1-of-4 Example
System MTBF = 2,500/1 + 2,500/2 + 2,500/3 + 2,500/4 = 5,208 hrs.
In general*, system MTBF = sum over i = 1..n of MTBF/i
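The harmonic-sum rule above can be checked numerically; `one_of_n_mtbf` is an illustrative name for a 1-of-n system of identical components:

```python
def one_of_n_mtbf(component_mtbf, n):
    """MTBF of a 1-of-n parallel system of identical components,
    using the harmonic-sum formula from the slide."""
    return component_mtbf * sum(1 / i for i in range(1, n + 1))

two_way = one_of_n_mtbf(2500, 2)   # 2,500 + 1,250           = 3,750 hrs
four_way = one_of_n_mtbf(2500, 4)  # 2,500 + 1,250 + 833 + 625 ~= 5,208 hrs
```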
In-Parallel Reliability
COMPONENT 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
COMPONENT 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
System unreliability is the product of the component unreliabilities
In-Parallel Availability
COMPONENT 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
COMPONENT 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component Availability:
A = 2500 / (2500 + 10) = 0.996
System Availability:
A = 1 - (1 - 0.996)^2 = 0.999984
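The in-series and in-parallel availability computations on these slides can be sketched as follows (illustrative function names; components assumed independent):

```python
def series_availability(*components):
    """System is up only when every component is up."""
    a = 1.0
    for c in components:
        a *= c
    return a

def parallel_availability(*components):
    """System is down only when every component is down."""
    u = 1.0
    for c in components:
        u *= (1 - c)
    return 1 - u

a = 2500 / (2500 + 10)                  # single component, ~0.996
series = series_availability(a, a)      # ~0.992
parallel = parallel_availability(a, a)  # ~0.99998
```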
Complex Redundancy
Examples: 1-of-2, 2-of-3, 2-of-4, 8-of-10, m-of-n
[Figure: m-of-n redundancy block with components 1, 2, 3, ..., n]
Standby redundant
Backup components are not operating
Perfect switching
Switch-over is immediate and without fail
Switchover reliability
The probability of switchover when it is not perfect
Load sharing
All units are on and workload is distributed
[Figure: reliability block diagram with component A in series, B1/B2 in 1-of-2 parallel, and D1/D2/D3 in 2-of-3 parallel]
Failure Rate
The number of failures per time:
Failures/hour
Failures/day
Failures/week
Failures/10^6 hours
Failures/10^9 hours, called FITs (Failures in Time)
Approximating MTBF
13 units are tested in a lab for 1,000 hours with 2
failures occurring
Another 4 units were tested for 6,000 hours with 1
failure occurring
The failed units are repaired (or replaced)
What is the approximate MTBF?
MTBF ~= (13 x 1,000 + 4 x 6,000) / (2 + 1) = 37,000 / 3 = 12,333 hours
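The estimate above is total unit-hours on test divided by the number of failures; in Python:

```python
unit_hours = 13 * 1000 + 4 * 6000  # 37,000 unit-hours on test
failures = 2 + 1                   # failures observed across both tests

mtbf_estimate = unit_hours / failures  # ~12,333 hours
```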
Modeling Time-to-Failure Distributions
[Figure: frequency vs. time-to-failure curves for the Normal, Log-Normal, Weibull, and Exponential distributions, each marked with its MTBF]
[Figure: failure rate over a product's life: a decreasing failure rate (infant mortality) followed by an increasing failure rate (wear-out)]
Calculating Reliability
A certain Cisco router has an MTBF of 100,000 hrs;
what is the annual reliability?
Annual reliability is the reliability for one year or 8,760 hrs
R = e^(-8760/100000) = 91.6%
This says that the probability of no failure in one
year is 91.6%; or, 91.6% of all units will survive
one year
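The calculation above uses the exponential (constant failure rate) model; a minimal sketch:

```python
import math

def annual_reliability(mtbf_hours, mission_hours=8760):
    """R = e^(-t/MTBF): probability of surviving the mission time
    under the exponential (constant failure rate) model."""
    return math.exp(-mission_hours / mtbf_hours)

r = annual_reliability(100000)  # ~0.916 for a 100,000-hr MTBF router
```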
ADDITIONAL TROUBLE
TICKETING SLIDES
Field                      Format             Description
Date                       dd/mmm/yy
Ticket                     Alphanumeric
Start Date                 dd/mmm/yy          Date of Fault
Start Time                 hh:mm              Time of Fault
Resolution Date            dd/mmm/yy          Date of Resolution
Resolution Time            hh:mm              Time of Resolution
Customers Impacted         Integer
Problem Description        String
Root Cause                 String
Component/Part/SW Version  Alphanumeric
Type                       Planned/Unplanned
Resolution                 String
HA Metrics/NAIS Synergy
[Figure: trouble tickets, categorized by definitions (planned/unplanned), root cause, resolution, and equipment, feed data analysis: baseline availability, DPM (Defects Per Million), MTTR, and network reliability improvement analysis; referral for analysis drives operational process and procedures analysis: problem management, data accuracy, collection processes, fault management, resiliency assessment, change management, performance management, availability management]
Management Application
1. User configures Collectors
through Mgmt Application GUI
2. Mgmt Application provisions
Source routers with Collectors
SA Agent
3. Source router measures and
stores performance data,
e.g.:
Response time
Availability
[Figure: management system provisioning SA Agent probes P1, P2, P3 across an IP core toward router R3]
[Figure: probes P1...PN measuring target paths TP1...TPx across networks Nw1...NwN]
Throughput
Availability
Jitter
Evaluates SLAs
Proactively sends notification of SLA violations
CLIs
Configuration CLI Commands
[no] cool run <cr>
[no] cool interface interface-name(idb) <cr>
[no] cool physical-FRU-entity entity-index (int) <cr>
[no] cool group-interface group-objectID(string) <cr>
[no] cool add-cpu objectID threshold duration <cr>
[no] cool remote-device dest-IP(paddr) obj-descr(string) rate(int) repeat(int) [local-ip(paddr) mode(int) ]<cr>
[no] cool if-filter group-objectID (string)<cr>
Measurement Example:
Router Device Outage
Reload (Operational) ,
Power Outage, or
Device H/W failure
Measurement Example:
Cisco IOS S/W Outage
The standby RP in slot 0 is crashed using an address error (AdEL
exception) test crash; the outage is caused purely by Cisco IOS S/W
Add a Linecard
Reset the Linecard
Object Table: ATM2/0.1, ATM2/0.2, ATM2/0.4, ATM2/0.5

12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut
(Shut ATM2/0: interface down events captured)

show cool event-table
**** COOL Event Table ****
type index event time-stamp  interval hist_id object-name
1    33    1     1054859105  18       1       ATM2/0.1
1    35    1     1054859106  18       2       ATM2/0.2
1    39    1     1054859107  17       3       ATM2/0.4
1    41    1     1054859108  18       4       ATM2/0.5

12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut
(No shut ATM2/0: interface up events captured)

show cool event-table
**** COOL Event Table ****
type index event time-stamp  interval hist_id object-name
1    33    0     1054859146  41       1       ATM2/0.1
1    35    0     1054859147  41       2       ATM2/0.2
1    39    0     1054859149  42       3       ATM2/0.4
1    41    0     1054859150  42       4       ATM2/0.5
Measurement Example:
Remote Device Outage
12406-R1202(config)#cool remote-device 1 50.1.1.2 remobj.1 30 2 50.1.1.1 1
12406-R1202(config)#cool remote-device 2 50.1.2.2 remobj.2 30 2 50.1.2.1 1
12406-R1202(config)#cool remote-device 3 50.1.3.2 remobj.3 30 2 50.1.3.1 1

12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut
(Down events captured for the remote objects)
type index event time-stamp  interval hist_id object-name
4    2     5     1054867105  42       2       remobj.2
4    1     5     1054867108  47       3       remobj.1
4    3     5     1054867130  65       10      remobj.3

12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut
(Up events captured)
type index event time-stamp  interval hist_id object-name
4    1     4     1054867171  63       1       remobj.1
4    3     4     1054867193  63       8       remobj.3
4    2     4     1054867200  95       10      remobj.2