Table of Contents
Introduction
The Necessity of Performance Management
Metro Ethernet Performance Challenges
A Solution to the Challenges of Metro Ethernet Performance
Service-Centric Performance Management of Metro Ethernet
KPIs
Resource Monitoring KPIs
End-to-End Measurement KPIs
Traffic Analysis KPIs
Service Monitoring
Aggregation, Abstraction and Analysis
Aggregation
Abstraction
Analysis
Introduction
Metro Ethernet is today a billion-dollar market, according to market-research firm IDC, which predicts that it will reach $6.3 billion worldwide by 2008. The advantages of Metro Ethernet lie in its ability to move data at higher speeds and lower cost than alternative technologies, which makes a compelling argument for using Ethernet as a WAN transport service. Enterprise customers are increasingly aware of this benefit, as well as others related to lower management cost and ease of use. With significant expertise and comfort around Ethernet, Enterprises are seeking to adopt Metro Ethernet as a cost-effective networking technology.
Similarly, Service Providers have anticipated the needs of their customers and are building out new Metro Ethernet networks and services. They foresee not only the promise of new revenue and greater flexibility in bandwidth provisioning but operational and capital savings as well.
However, to harness the full potential of Metro Ethernet, considerable service quality challenges must be addressed. To do so, it is imperative that Service Providers and their customers integrate a comprehensive Performance Management platform into their overall management delivery chain.
After briefly reviewing the unique performance characteristics of Metro Ethernet, this paper describes the challenges associated with guaranteeing service quality. It goes on to explain the paramount role of Performance Management in addressing those challenges to ensure reliable service delivery and customer-centric Quality of Service. In doing so, Performance Management provides proactive information that can be used to take appropriate action, preemptively and rapidly, before conditions worsen and before customers are affected. Performance Management is more effective than traditional fault management solutions, which are inherently reactive.
Performance management is a proactive discipline: it combines real-time and historical monitoring and analysis of network performance to detect ongoing and impending degradations. Trend analysis predicts the point in time at which capacity will be reached and performance compromised. Long-term historical depth of detailed metrics supports traffic engineering and troubleshooting. Across these capabilities, threshold analysis is performed outside the equipment, allowing for the creation of performance-based traps as opposed to traps coming from within the equipment. This opens the door to more sophisticated aspects of preventative performance problem analysis, such as identifying deviations from a baseline, accounting for transient peaks, and detecting hazardous combinations of stressed performance. More advanced Performance Management with a service-centric focus can then abstract those measurements into easily understood information within the context of service and business parameters.
As summarized in Table 1, limitations of Metro Ethernet can pose service level performance risk to the service provider and their customers. Only a sound Performance Management approach can address these issues.
Table 1: Key Aspects of Metro Ethernet Compared to Frame Relay and ATM

Transport: ATM/Frame Relay
Topology: Static PVC-based, long-distance point-to-point connections.
Strengths: Highly reliable, guaranteed channels providing permanent bandwidth and predictable performance. Inherent congestion avoidance ensures consistently low latency.
Limitations: Less flexibility and greater complexity of service activation, provisioning and change. Higher cost of operations.

Transport: Metro Ethernet
Topology: Dynamic, connection-less, point to point or any to any packet flow.
Strengths: High speed, flexible, low cost replacement of Frame Relay and ATM.
Limitations: Lack of over-provisioning and congestion controls. Performance is at risk unless carefully managed/engineered.
Table 2: Traffic Profile Parameters

Parameter: CIR (Committed Info Rate)
Description: The maximum rate at which service frames can be delivered, on average. CIR settings limit the frame size to emulate speed throttling, e.g. speeds such as 2 Mbps, 10 Mbps, 50 Mbps over a 100 Mbps 802.1 interface. CIR will need to meet end-to-end service level objectives.

Parameter: CBS (Committed Burst Size)
Description: CBS is the maximum allowable frame size in accordance with the CIR.

Parameter: EIR (Excess Information Rate)
Description: The maximum rate at which service frames can be delivered above the CIR, on average. These frames will have no service level objective.

Parameter: EBS (Excess Burst Size)
Description: EBS is the maximum allowable frame size in accordance with the EIR.

Frame rates above the CIR (and frame sizes above the CBS) may be discarded, depending on the EIR parameter. Frame rates and sizes above the EIR and EBS are subject to discard.
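How the four traffic profile parameters interact can be illustrated with a two-bucket token meter. The sketch below is an assumption rather than something this paper specifies: it follows the general shape of a two-rate, three-color meter, treats the burst sizes as bucket depths in bytes, and uses hypothetical names (`TrafficProfileMeter`, the color labels) and example values.

```python
class TrafficProfileMeter:
    """Illustrative two-bucket meter for one traffic profile (hypothetical design)."""

    def __init__(self, cir_bps, cbs_bytes, eir_bps, ebs_bytes):
        self.cir_bps = cir_bps      # committed rate, bits per second
        self.cbs_bytes = cbs_bytes  # depth of the committed bucket, bytes
        self.eir_bps = eir_bps      # excess rate, bits per second
        self.ebs_bytes = ebs_bytes  # depth of the excess bucket, bytes
        # Assumption: both buckets start full.
        self.committed_tokens = float(cbs_bytes)
        self.excess_tokens = float(ebs_bytes)
        self.last_arrival_s = 0.0

    def classify(self, arrival_s, frame_bytes):
        """Return 'green' (within CIR/CBS), 'yellow' (within EIR/EBS) or 'red'."""
        elapsed = arrival_s - self.last_arrival_s
        self.last_arrival_s = arrival_s
        # Refill each bucket at its rate (bits/s converted to bytes/s), capped at its depth.
        self.committed_tokens = min(
            self.cbs_bytes, self.committed_tokens + elapsed * self.cir_bps / 8
        )
        self.excess_tokens = min(
            self.ebs_bytes, self.excess_tokens + elapsed * self.eir_bps / 8
        )
        if frame_bytes <= self.committed_tokens:
            self.committed_tokens -= frame_bytes
            return "green"   # delivered under the service level objective
        if frame_bytes <= self.excess_tokens:
            self.excess_tokens -= frame_bytes
            return "yellow"  # delivered with no service level objective
        return "red"         # subject to discard


# Example: three back-to-back 1000-byte frames, then one more a second later.
meter = TrafficProfileMeter(cir_bps=8000, cbs_bytes=1500, eir_bps=8000, ebs_bytes=1500)
colors = [meter.classify(t, 1000) for t in (0.0, 0.0, 0.0, 1.0)]
```

In this example the first frame fits the committed bucket, the second spills into the excess bucket, the third exceeds both, and after one second both buckets have refilled enough for a committed frame again. Counting the frames in each color per VLAN and CoS is one way the drop-related KPIs discussed later could be tied back to the profile.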
The traffic profiles are a key point of reference against which Metro Ethernet performance and service levels are measured. In Figure 1, a simplified Ethernet Line (E-Line) architecture is shown to illustrate how traffic profiles apply to the physical and virtual components that must be managed. This requires Performance Management that employs a combination of collection, analysis and reporting techniques that tie Metro Ethernet traffic statistics to UNIs, Customer Equipment (CEs), Provider Equipment (PEs) and end-to-end service levels.
Figure 1 legend: UNI: User-Network Interface; CE: Customer Equipment; EVC: Ethernet Virtual Connection; U-PE: User-Facing Provider Equipment
In doing so, service-centric Performance Management allows for prioritization of identified performance problems, troubleshooting, end-to-end service level management and proactive capacity planning based on customer, service and resource impact. The following service-centric Performance Management collection and analysis components are required to reliably offer and manage Metro Ethernet services.
KPIs
Multiple techniques are necessary to gather sufficient information to manage Metro Ethernet performance. Resource monitoring, end-to-end testing and traffic flow analysis KPIs are all needed in combination to confirm service levels, anticipate capacity limitations, detect problems as they develop and troubleshoot.
[Figure labels: network segments Access, Aggregation, Edge, Core, Edge, Access; devices CE, U-PE, PE-AGG, N-PE, N-PE, U-PE; End-to-End Measurements]
Key Performance Indicators (KPIs) and detailed troubleshooting metrics that deliver real value. Some equipment vendors offer Element Management Systems (EMS), such as the Alcatel 5620 Service Aware Manager (SAM) and Cisco IP Solution Center (Cisco ISC), which support configuration, provisioning and monitoring of their own devices. Performance data is gathered by these systems, by SNMP and through proprietary methods, but always in a controlled way. The information that can be drawn from them through APIs and log files is generally the same as can be gathered directly. Yet in the case of Metro Ethernet, the EMS information source is more extensive.
Table 4 provides a non-exhaustive list of the KPIs that are most important to managing Metro Ethernet performance and service levels. The analysis behind these KPIs is not particularly complex, involving raw-to-rate and other basic calculations. More advanced KPIs can be obtained when performing aggregation, abstraction and analysis, as described in the next section.
Metro Ethernet services will have QoS policies imposed to make them more reliable and deterministic. The network KPIs listed in Table 4 are derived with respect to each CoS within each VLAN, where available. This includes VLANs (or customer ports) that carry only one CoS level.
Table 4: Resource KPIs

Device KPIs
- Availability: Percentage of time a physical or logical resource is available for use. A variety of protocols (such as SNMP and ICMP) are typically used.
- CPU, Memory, Buffer Utilization: Calculated as a percentage of capacity. Excessive levels, either individually or in combination, indicate resource overloading which can result in delays in data transfer and a high rate of packet errors or dropped packets.

Network KPIs
- CIR Utilization: Percentage of customer utilization levels relative to the CIR.
- Queue Drop: Discarded packets, including those measured as Tail and Random drop. Implies congestion or improper CoS policy, too much over-provisioning and/or excessive customer usage beyond capacity.
- Errors: The percentage of frames or packets that were detected in error during transmission and were discarded. An excessive error rate causes a high incidence of retransmission. This can be an indication of the quality of the transmission line.
- Dropped Frames: Measured relative to submitted and transmitted traffic, dropped traffic is an indicator of congestion or improper CoS policy. Also based on the number of frames discarded due to frame rates and frame sizes in excess of EIR/EBS.
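The raw-to-rate calculation behind KPIs such as CIR Utilization can be sketched as follows. This is a minimal illustration under stated assumptions: the inputs are two successive readings of a cumulative octet counter that may wrap once between polls, and the function names (`counter_delta`, `octets_to_bps`, `percentage`) are hypothetical, not part of any vendor API.

```python
def counter_delta(previous, current, counter_bits=32):
    """Difference between two cumulative counter readings, allowing one wrap."""
    modulus = 1 << counter_bits
    return (current - previous) % modulus


def octets_to_bps(previous, current, interval_s, counter_bits=32):
    """Convert two octet-counter readings into an average rate in bits per second."""
    if interval_s <= 0:
        raise ValueError("polling interval must be positive")
    return counter_delta(previous, current, counter_bits) * 8 / interval_s


def percentage(part, whole):
    """Express part as a percentage of whole (e.g. rate vs. CIR, errors vs. frames)."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Example: one 5-minute poll of a customer port with a 2 Mbps CIR.
rate_bps = octets_to_bps(previous=1_000_000, current=38_500_000, interval_s=300)
cir_utilization_pct = percentage(rate_bps, 2_000_000)
```

The same `percentage` helper applies to the Errors and Dropped Frames KPIs once the error and frame counters have been turned into deltas for the polling interval.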
Description:
- The percentage of time that an end element responds to transaction requests.
- The average time it takes for a request to reach a target.
- Percentage of time a device or interface is available for use.
- One way and round-trip delay of layer 2 and 3 transactions.
- Variation of delay.
- Percentage of undelivered frames compared to total number of frames submitted.

Source:
- Typically a resource monitoring entity that performs end-to-end testing.
- Cisco IP SLAs and Juniper RPM probes configured by a monitoring entity. Ideally one probe for each application, CoS and type of protocol that go through the path.
- Alcatel OAM MAC Ping
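The delay, delay variation and frame loss measurements described above can be derived from a series of probe results. The sketch below is illustrative only: it assumes each probe yields a round-trip delay in milliseconds or `None` when the probe was lost, and it takes delay variation to be the mean absolute difference between consecutive delays, which is one common convention and not necessarily the one any given probe technology uses. The function name `end_to_end_kpis` is hypothetical.

```python
from statistics import mean


def end_to_end_kpis(delays_ms):
    """Summarize probe results; a value of None marks a lost probe."""
    sent = len(delays_ms)
    if sent == 0:
        raise ValueError("at least one probe result is required")
    received = [d for d in delays_ms if d is not None]
    # Differences between consecutive received probes approximate delay variation.
    differences = [abs(later - earlier) for earlier, later in zip(received, received[1:])]
    return {
        "frame_loss_pct": 100.0 * (sent - len(received)) / sent,
        "avg_delay_ms": mean(received) if received else None,
        "jitter_ms": mean(differences) if differences else None,
    }


# Example: five probes on one CoS, one of them lost.
kpis = end_to_end_kpis([10.0, 14.0, None, 12.0, 12.0])
```

Running one such summary per probe, that is per application, CoS and protocol on a path, keeps the results aligned with the way the probes are configured.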
Service Modeling
Metro Ethernet is a service-centric technology which requires performance measurements against customer SLAs as defined between customer UNIs. Service Modeling is the foundation of service-centric Performance Management of Metro Ethernet services. It captures the relationships between resources, the services they support, the customers subscribing to those services and their respective performance indicators. Without this structure, performance management cannot support higher-value management tasks such as service-centric problem identification, customer resource impact analysis, service-oriented capacity planning and business-prioritized troubleshooting. Without the aid of such a model, KPIs and reports can have no logical order or connectedness. Assembling, aggregating, abstracting and organizing the volumes of information collected across multiple customers and network elements would otherwise be a costly and time-prohibitive process.
The service model is fed by Service, Customer, Inventory and Traffic Profile information from service activation and provisioning systems. The automatic feed of information to the model during provisioning plays a critical role in terms of immediacy, accuracy and operational efficiency. The same process performed manually on a multi-customer, multi-service scale is neither feasible nor accurate. Vendor EMS solutions such as Cisco ISC and Alcatel 5620 SAM, as well as multi-vendor OSS provisioning platforms such as those offered by Cramer, MetaSolv and Syndesis, provide proven Metro Ethernet provisioning solutions that are capable of delivering service-centric topology feeds to Performance Management. As configuration changes are made and customers are added, the model is dynamically updated so that the correct resources can be monitored relative to all the appropriate service and customer dimensions, as indicated in Figure 3. The model therefore provides a structure for gathering, storing, aggregating, organizing and locating information.
[Figure 3 labels: Provisioning. Customer Profiles: Names, Sites, CE Names, Ports, VLAN IDs, VLAN CoSs, EVCs. Bandwidth Profiles: CIR, CBS, EIR, EBS]
To reiterate, the goal of the service model is to interrelate Ethernet service attributes, KPIs and detailed performance metrics end to end. However, considering the rapidly growing state of Metro Ethernet business requirements and technology, no rigid performance management service model can possibly achieve that goal. If the model design is flexible and open to change, the performance management solution will be able to adapt and support both vendor-agnostic and service-centric objectives through the evolution of Metro Ethernet.
The underlying model design has a direct impact on flexibility and maintainability. Models built upon relational tables of KPIs and service topology attributes are difficult to maintain. Table interlinking is generally linear and hence requires a complex series of iterative changes to take place. Conversely, object-oriented models treat service and resource properties and their KPIs as modular objects that can be linked hierarchically. Changes to the intra-model relationships can be made centrally. Global changes to the way KPIs are associated, aggregated or analyzed can be made in a single step through inheritance.
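The object-oriented approach described above can be pictured with a small class hierarchy. This is only a sketch under assumptions: the class names (`ModelObject`, `Customer`, `Service`, `Interface`), the KPI key and the customer name are hypothetical, and a production model would carry far more attributes. The point it illustrates is that the roll-up rule lives in one base class, so changing it there changes it for every type of object at once.

```python
class ModelObject:
    """Base class for every node in the service model (hypothetical design)."""

    def __init__(self, name):
        self.name = name
        self.children = []
        self.kpis = {}

    def add(self, child):
        """Link a child object hierarchically and return it for chaining."""
        self.children.append(child)
        return child

    def rollup(self, kpi):
        """Worst (lowest) value of a KPI across this node and all descendants."""
        values = [self.kpis[kpi]] if kpi in self.kpis else []
        for child in self.children:
            child_value = child.rollup(kpi)
            if child_value is not None:
                values.append(child_value)
        return min(values) if values else None


# Subclasses inherit the roll-up rule; redefining it in ModelObject changes all of them.
class Customer(ModelObject):
    pass


class Service(ModelObject):
    pass


class Interface(ModelObject):
    pass


# Example: one customer, one E-Line service, two UNIs.
customer = Customer("Example Corp")
eline = customer.add(Service("E-Line 1"))
eline.add(Interface("UNI-A")).kpis["availability_pct"] = 99.9
eline.add(Interface("UNI-B")).kpis["availability_pct"] = 97.5
worst_availability = customer.rollup("availability_pct")
```

Here the customer-level availability reflects the weaker of the two UNIs without the `Customer` or `Service` classes containing any aggregation logic of their own.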
Aggregation
KPI data is reduced from short time intervals to longer time intervals through temporal aggregation. This is a basic but important factor when accommodating large amounts of information over time. For example, KPIs such as end-to-end measurements of Jitter collected every 5 minutes are aggregated into hourly, daily and weekly increments to support SLA reporting.
Grouping of KPIs by the various service model dimensions allows considerable condensing of information into basic status views. Examples include availability of resources supporting a particular customer, availability of PE resources in general, interfaces with the most errors in a particular location, and number of dropped packets for the different QoS classes on a group of VLANs.
Abstraction
Multiple KPIs are combined into a composite KPI to infer the health of a resource or interface through the use of a single value (also known as a level of wellbeing). Beyond a simple reduction of many indicators to one, the use of composite indicators helps identify problems not otherwise noticeable at a granular level. Examples of abstracted KPI calculations include:
- Composite Platform Status: CPU usage, memory usage and buffer usage values that exceed a hazardous level* are counted, weighted and combined. Ideally, CPU usage is considered for all CPUs of the monitored device; memory usage includes process and kernel values; and buffer utilization is relative to the maximum number of buffers allowed.
- Prolonged Interface Hazards: Based on the amount of time a network interface has been in a 'degraded' mode, a condition where the average rate of errors in combination with the average utilization (for inbound and outbound traffic) reaches hazardous levels*.
* Hazardous performance levels should be based on industry standards and best practices but are more effective when refined via empirical use-case analysis.
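Temporal aggregation and a composite indicator of the Composite Platform Status kind can be sketched in a few lines. Both functions are illustrations under assumptions: averaging is used as the aggregation rule, the composite is taken to be the sum of the weights of the metrics that exceed their hazardous level, and the function names, weights and hazard levels are hypothetical example values rather than recommended settings.

```python
from collections import defaultdict
from statistics import mean


def aggregate(samples, bucket_s=3600):
    """Average (timestamp_s, value) samples into fixed buckets keyed by bucket start."""
    buckets = defaultdict(list)
    for timestamp_s, value in samples:
        buckets[timestamp_s - timestamp_s % bucket_s].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


def composite_platform_status(utilization_pct, hazard_pct, weights):
    """Sum the weights of every metric whose utilization exceeds its hazardous level."""
    return sum(
        weights[metric]
        for metric, value in utilization_pct.items()
        if value > hazard_pct[metric]
    )


# Example: 5-minute jitter samples rolled up to hourly values.
hourly_jitter = aggregate([(0, 2.0), (300, 4.0), (600, 6.0), (3600, 10.0)])

# Example: CPU and buffers are above their hazardous levels, memory is not.
status = composite_platform_status(
    utilization_pct={"cpu": 95, "memory": 60, "buffer": 85},
    hazard_pct={"cpu": 90, "memory": 85, "buffer": 80},
    weights={"cpu": 3, "memory": 2, "buffer": 1},
)
```

Calling `aggregate` again on the hourly output with a larger `bucket_s` gives daily or weekly increments, although averaging averages is only exact when every bucket holds the same number of samples.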
Analysis
Advanced KPI analysis produces proactive information at two levels: 1) early warnings (through alerts) of existing and potential performance problems, based on real-time analysis, and 2) long-term predictions of capacity utilization levels, based on historical analysis. The calculations behind these KPIs are as follows.
Analysis: Baseline
Description: Records hourly, daily, weekly, monthly norms, typically against network utilization KPIs.

Analysis: Linear, multi-level thresholds
Description: Multiple upper and lower limits that detect different levels of performance degradation severity.

Analysis: Dynamic threshold analysis
Description: Automatic adjustment of thresholds in relation to baseline. By tracking a so-called tunnel of normality, detects sudden and significantly high abnormalities or changes.

Analysis: Pattern distributions, standard deviation and percentile
Description: Improves accuracy by differentiating brief peaks in performance from sustained problematic conditions.

Analysis: Historical trending
Description: Predicts where utilization levels will be in the near and distant future. Calculations based on historical data improve accuracy. Time-to-capacity values are critical to capacity planning.
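Two of these calculations, the dynamic threshold and the time-to-capacity trend, lend themselves to a short worked example. The sketch makes assumptions that the paper does not state: the tunnel of normality is modelled as the baseline mean plus or minus `k` standard deviations, the trend is a least-squares straight line through equally spaced samples, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def tunnel_of_normality(baseline, k=3.0):
    """Lower and upper bounds around the baseline norm (mean +/- k standard deviations)."""
    center = mean(baseline)
    spread = pstdev(baseline)
    return center - k * spread, center + k * spread


def is_abnormal(value, baseline, k=3.0):
    """True when a new measurement falls outside the tunnel of normality."""
    low, high = tunnel_of_normality(baseline, k)
    return value < low or value > high


def periods_to_capacity(history, capacity):
    """Periods after the last sample until a fitted line reaches capacity.

    Returns None when the trend is flat or falling.
    """
    n = len(history)
    if n < 2:
        raise ValueError("at least two samples are required")
    xs = range(n)
    x_mean, y_mean = mean(xs), mean(history)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    return (capacity - intercept) / slope - (n - 1)


# Example: weekly utilization (%) growing 5 points per week toward 100 %.
weeks_left = periods_to_capacity([40, 45, 50, 55, 60], capacity=100)
```

In the example the fitted line gains five points per week from 60 %, so the projected time to capacity is eight weeks. A straight line is the simplest choice; seasonal or accelerating growth would call for a different model.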
Service-Customer-Resource Relationships
With Metro Ethernet topology parameters inserted into the model, as illustrated previously in Figure 3, service-to-customer-to-resource relationships are established to achieve logical associations of performance information. The model holds relationships, as shown in Figure 4, between service provider edge devices (U-PEs, N-PEs), physical and virtual UNI to UNI connection components (ports, VLANs, VPLSs, Pseudowires, LSPs), traffic prioritization by QoS (DiffServ, ToS) and CoS (802.1p), customer edge devices (CEs), customer names and locations. Some of this information comes from the Service Configuration and Activation systems as changes occur, and some of it is drawn from the environment itself during resource monitoring.
Probe set-up for end-to-end measurements, as discussed earlier, relies completely on this model to know which QoS classes to test, through which interfaces, using which equipment. However, as not all devices support the creation of end-to-end probes, the model should be equipped to query all network devices and discover which of them do support this function. Once the probes are in place, end-to-end measurements are associated with all the other aspects of the model: services, customers, sites, resources and QoS as represented in Figure 4.
A structured service-centric approach to performance management is fully realized when, based on the service model, network managers, service support personnel and account managers can move seamlessly from one type of information to another. As demonstrated in Figure 4, regardless of the entry point into the model, relevant workflow-based access (not any-to-any access) is possible.
For example, starting in the top left corner of Figure 4 and moving clockwise, customers have E-Line and E-LAN services with end-to-end KPIs that show the health of those services. The services are delivered to customer sites where their devices (switches) reside. The devices have physical and virtual interfaces that connect to specific provider devices (multi-service provider edge devices) and their specific interfaces. Those provider devices are mapped back to the end-to-end service level KPI measurements that traverse the core from edge to edge (PE to PE). Each interface is mapped to EVCs, LSPs and VLANs, each with their own Quality of Service utilization and performance levels. Each VLAN with its class of service performance is mapped back to the customers. With this interconnected relevance, an account executive can, for instance, confirm Metro Ethernet utilization levels for customers by name, location or service, allowing proactive recommendation of an upgrade before capacity is exceeded and performance suffers.
Figure 4: Points of Entry to Paths of Interconnection (labels: Points of Entry; Aggregated & Abstracted KPIs; End-to-End KPIs; Devices)
Another important aspect here is the ability to "manage by exception" in relevant ways. Static, multi-level and dynamic threshold analysis, and the performance-based alerts they produce, will also fall within the structure of the model to determine who is affected, and how and where a problem lies. A simple process of elimination of performance alerts through services, customers, locations - groups and nested subgroups - can be supported by the model. The result is an acute ability to understand dependencies and to prioritize action based on service criticality and business impact, as opposed to traditional first-come first-served event management models.
Beyond the operations and account management purposes of the model to deliver relevant actionable information internally, customer-facing reporting [1] brings value as well. Whether through premium performance reporting or Managed Services offerings, those reports help customers gain insight into class-of-service loading and impending capacity shortages, and provide them the rationale for justifying service upgrades.
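The difference between first-come first-served handling and prioritization by service criticality and business impact can be shown with a small sorting example. This is a sketch under assumptions: the `Alert` fields, the numeric criticality scale and the resource names are all hypothetical, and a real deployment would look these attributes up in the service model rather than carry them on the alert.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    received_order: int       # position in the first-come first-served queue
    resource: str             # resource that raised the performance alert
    service_criticality: int  # assumption: higher means more critical
    customers_impacted: int   # taken from the service model relationships


def prioritize(alerts):
    """Order alerts by criticality, then customer impact, then arrival."""
    return sorted(
        alerts,
        key=lambda a: (-a.service_criticality, -a.customers_impacted, a.received_order),
    )


# Example: the alert that arrived first is the least important one.
queue = [
    Alert(1, "U-PE-3 port 7", service_criticality=1, customers_impacted=1),
    Alert(2, "N-PE-1 uplink", service_criticality=3, customers_impacted=40),
    Alert(3, "U-PE-9 port 2", service_criticality=3, customers_impacted=2),
]
worklist = [alert.resource for alert in prioritize(queue)]
```

The resulting worklist starts with the uplink alert that affects the most customers, even though it arrived second.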
[1] Managing IP VPN Networks: What to Expect from Your Service Provider, InfoVista White Paper, http://www.infovista.com/pdf/WP_InfoVista_Managing_IPVPN_Networks.pdf, 2005.
Conclusion
As more and more forms of communication are migrated to Enterprise and consumer networks (voice, video, databases, data storage, imagery, real-time applications), the demand for bandwidth is rapidly accelerating. Metro Ethernet offers the highest speed and lowest price per bit to date, making it a great opportunity for service providers and their customers. However, the dynamic, indeterminate nature of Metro Ethernet necessitates continuous supervision. Performance Management with a service-centric and customer-centric focus is the only option for sustaining reliability amid constant changes to subscribership, customer usage, end-to-end connectivity paths and traffic engineering policies. Only Service-centric Performance Management can relate Quality of Service from the perspective of the customer to performance from the perspective of the technology. To protect the bottom line - customer success and satisfaction - Metro Ethernet requires proactive monitoring and reporting that detects sudden changes in performance, service level degradations, capacity utilization levels and their evolution, and impending violation of SLA objectives. A Metro Ethernet Performance Management service model that is automatically populated and updated with ongoing changes is the key to the 'top-down' and 'bottom-up' approach to Performance Management that is required to achieve those goals.
World and European Headquarters
InfoVista S.A.
6, rue de la Terre de Feu, 91952 Courtaboeuf Cedex, Les Ulis, France
Tel: +33 (0)1 64 86 79 00 Fax: +33 (0)1 64 86 79 79

North American Headquarters
InfoVista Corporation
12950 Worldgate Drive, Suite 250, Herndon, VA 20170, USA
Tel: +1 703 435 2435 Fax: +1 703 435 5122

Asia-Pacific Headquarters
InfoVista (Asia-Pacific) Pte Ltd
Block 750 C, #03-16/17 Chai Chee Road, TechnoPark @Chai Chee, Singapore 469003
Tel: +65 6449 7641 Fax: +65 6449 3054
Copyright 2006 InfoVista S.A. All rights reserved. All other trademarks appearing in this document are acknowledged as the trademarks of their respective owners. InfoVista, VistaFoundation, VistaInsight, VistaOperations Center, VistaCapacity Planner, VistaService Manager, VistaView, VistaFinder, VistaNext, Vista Plug-in, VistaPortal, VistaMart, VistaNotifier, VistaLink, VistaProvisioner and VistaBridge are trademarks of InfoVista S.A.