Shahid Akhtar--Chief Engineer of Next Generation Networks at Etisalat. Mr. Shahid Akhtar has over 18 years of experience in the field of telecommunications, both with vendors and carriers. He has served in senior management roles at Nortel and Ciena, and is currently Chief Engineer of Next Generation Networks at Etisalat. He has authored or co-authored 3 US patents. He holds a B.Sc. and an M.Sc. in Electrical Engineering from Washington University in St. Louis, U.S.A.
Huawei Technologies
After years of operational experience, everyone knows you can never be too careful about guaranteeing reliability in telecom networks. On one hand, we must provide customers with brand-new service experiences in a highly efficient, flexible way, while on the other, we must be able to ensure the same reliability in multi-service IP bearer networks as in traditional networks.
Experts Forum
Designing Reliability in Converged Multi-Service Networks
Foreword
Increasingly, networks based on packet forwarding technology are being used to provide the core transport function for new multi-service networks. Since these networks are planned to bear the majority of services for a typical carrier, their availability becomes of paramount importance. It would also be of some value to be able to engineer different levels of reliability for different types of services. The base network, which all services share, should have the highest level of reliability, as it needs to support the service with the highest reliability requirement.
class of SONET/SDH which is optimized for data bearing ability can emulate Ethernet capability.
RPR (Resilient Packet Ring) or RPR over SONET/SDH: The RPR function was designed for bearer access, so it is a natural fit for these conditions. Although RPR is not exactly Ethernet, it works with the Ethernet functions in other devices to create an Ethernet layer.
Traditional Ethernet: Typical Ethernet based on spanning trees can be used in this layer. Several features have been added by new standards bodies to traditional Ethernet to make it more carrier-class.
The MSAN (Multi-Service Access Node) or access layer: The MSAN can be used to emulate the functionality of IP-DSLAM, access gateway or NG-DLC. The MSAN typically connects to the Ethernet layer using Ethernet. FTTH nodes such as OLTs can also be used in the access layer.
This paper describes the methods and techniques to engineer high levels of availability in the above described converged network.
Network topology
Network topology is perhaps the most important factor in network availability. Redundancy can be provided in the network
to route the traffic around link or node failures. Site level redundancy can also be provided, so that if a whole site fails, the network is able to restore itself. Since networks are sometimes built in layers on top of each other, a single failure in a lower layer may induce failures on multiple links in the upper layer. This issue can be handled by grouping the links that fail together and planning for the whole group to fail at once.
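The idea of planning for a group of links to fail together can be sketched as a simple check. The example below is a minimal illustration, not a planning tool: the node names, the duct labels and the function names are all hypothetical. It groups upper-layer links by the lower-layer resource they share, fails each group in turn, and reports the groups whose loss would split the network.

```python
from collections import defaultdict

def is_connected(nodes, links):
    """Return True if every node can reach every other node over the given links."""
    nodes = list(nodes)
    if not nodes:
        return True
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        current = stack.pop()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == set(nodes)

def groups_that_partition(nodes, link_to_group):
    """Fail each shared-risk group in turn; return the groups that split the network."""
    groups = defaultdict(list)
    for link, group in link_to_group.items():
        groups[group].append(link)
    harmful = []
    for group, failed_links in groups.items():
        surviving = [link for link in link_to_group if link not in failed_links]
        if not is_connected(nodes, surviving):
            harmful.append(group)
    return sorted(harmful)

# Hypothetical four-node ring in which two links share the same duct.
nodes = ["A", "B", "C", "D"]
link_to_group = {
    ("A", "B"): "duct-1",
    ("C", "D"): "duct-1",   # shares a duct with A-B, so both fail together
    ("B", "C"): "duct-2",
    ("D", "A"): "duct-3",
}
print(groups_that_partition(nodes, link_to_group))
```

In this made-up ring, any single link failure is survivable, but the shared duct takes out two links at once, which is exactly the case that per-link planning would miss.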
Security
Networks and nodes may fail due to the natural failure rate of hardware or due to software bugs. They can also fail if a person or system actively tries to make the network fail. Various security measures are put into place to make sure that no unauthorized action can take the network down.
Restoration speed
Restoration speed or restoration time is another key measure of reliability. Traditionally, the SONET/SDH layer provided 50ms of restoration time, and over time this has been accepted as a requirement for many services. Some detailed studies have revealed that most services can withstand up to 1s of restoration time; even so, faster restoration leads to better network availability. The MTTR (Mean Time To Repair) indicates the network restoration capability, and the MTBF (Mean Time Between Failures) indicates how often failures occur. Availability is related to MTTR and MTBF by: Availability = MTBF / (MTBF + MTTR).
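As a worked illustration of the standard steady-state relationship Availability = MTBF / (MTBF + MTTR), the short sketch below shows how a shorter repair time raises availability. The MTBF and MTTR figures are hypothetical and chosen only for the arithmetic; the function names are likewise illustrative.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the network is in service."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_minutes_per_year(avail: float) -> float:
    """Expected out-of-service minutes in a 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24 * 60

# Hypothetical figures: same failure rate, two different repair times.
slow_repair = availability(mtbf_hours=10_000, mttr_hours=4)
fast_repair = availability(mtbf_hours=10_000, mttr_hours=1)

for label, value in (("4 h MTTR", slow_repair), ("1 h MTTR", fast_repair)):
    print(f"{label}: availability {value:.6f}, "
          f"about {downtime_minutes_per_year(value):.0f} min downtime per year")
```

With the failure rate held constant, cutting the repair time is the only lever in this formula, which is why restoration speed is treated as a reliability factor in its own right.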
Configuration control
Several studies on network failures and their causes have shown that configuration failures account for a large percentage of all failures. A number of mechanisms can be put into place for testing and checking configurations before they are loaded on the network. The following sections detail each of the above factors in each of the converged network layers.
of these options is shown in Fig. 2 and described below:
A full mesh is used to obtain a very high level of connectivity and reliability in the core. However, it becomes very expensive as the number of nodes grows. The number of links in a full mesh network is L = N(N-1)/2. Generally it is quite hard to have a full mesh of fiber cables through the core, so some SRLG (Shared Risk Link Group) should be expected.
Partial mesh is used when a full mesh is considered expensive. Often an agreed degree of connectivity is chosen for a partial mesh; for example, the partial mesh network in Fig. 2 is 3-connected. Such a network, where each link lies on a separate fiber cable, can withstand two link failures simultaneously, assuming appropriate traffic engineering has been done for those failures.
Finally, a ring network is the most economic restorable network. It can withstand a single link loss or node loss.
Sometimes the core P networks are built out in a dual plane format. In this case, each site contains two routers and these router planes are used to transport traffic in a parallel manner. Typically a partial mesh is used when dual plane is used.
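The cost difference between these topologies can be made concrete with a small calculation. The sketch below uses L = N(N-1)/2 for the full mesh and, as an added assumption not taken from the text, the simple lower bound that a k-connected network needs every node to terminate at least k links. The function names are illustrative only.

```python
import math

def full_mesh_links(n: int) -> int:
    """Links in a full mesh of n nodes: L = N(N-1)/2."""
    return n * (n - 1) // 2

def min_links_for_connectivity(n: int, k: int) -> int:
    """Lower bound on links for a k-connected network of n nodes.

    Each node needs at least k links and each link serves two nodes,
    so at least ceil(n * k / 2) links are required.
    """
    return math.ceil(n * k / 2)

def ring_links(n: int) -> int:
    """A ring of n nodes (n >= 3) uses exactly n links."""
    return n

for n in (5, 10, 20, 40):
    print(f"N={n:>2}: ring {ring_links(n):>3}, "
          f"3-connected at least {min_links_for_connectivity(n, 3):>3}, "
          f"full mesh {full_mesh_links(n):>3}")
```

The ring and partial mesh grow linearly with the number of nodes, while the full mesh grows quadratically, which is why the full mesh is usually confined to a small core.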
Network Topology
As mentioned before, the converged network can be put into four layers, the P layer, the PE layer, the Ethernet layer and the access layer. In this section topology options for each of the converged network layers are explored.
Node reliability
Node reliability is part of network reliability. However, nodes are typically expected to be much more reliable than links as they typically have redundant hardware. It is more difficult to provide node level reliability in a network and
Restoration Speed
The speed at which failure is restored is referred to as MTTR as defined earlier. Given the definition of availability mentioned earlier, faster restoration speeds will increase availability. In addition, a certain speed of restoration may be required for specific services in order to meet their SLA requirements. The following describes the restoration options in each of the layers of the converged network.
tunnels which may in turn require tunnels from every end-point to every other end-point. This typically will need on the order of N² tunnels in the network, which can add significant complexity.
IGP based protection. If such stringent protection times are not required, traffic can run over the IGP or over LDP (Label Distribution Protocol). Both of these depend on IGP restoration, which typically acts within 0.5 - 2s. Both of these methods assume a very fast failure detection mechanism. PoS links can recognize failures in just a few milliseconds; however, Ethernet links may take several seconds. A new protocol called BFD (Bidirectional Forwarding Detection) is being developed which can detect failure over Ethernet in 10-40ms.
Other protection methods for RSVP tunnels. RSVP tunnels can also be protected using a few other methods, such as backup tunnels. However, these are typically not used in most networks.
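The growth in tunnel count mentioned above can be illustrated with a short calculation. The sketch assumes unidirectional tunnels, so a full mesh between N end-points needs one tunnel per ordered pair, N(N-1), which is on the order of N². The function name and the network sizes are hypothetical.

```python
def full_mesh_tunnels(endpoints: int) -> int:
    """Unidirectional tunnels needed to connect every end-point to every other one."""
    return endpoints * (endpoints - 1)

for n in (10, 50, 100, 200):
    print(f"{n:>3} end-points -> {full_mesh_tunnels(n):>6} tunnels")
```

Doubling the number of end-points roughly quadruples the number of tunnels to provision and monitor, which is the complexity the text refers to.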
SDH uses its own protection mechanisms for restoration, which can reach a 50ms restoration duration; RPR has its own restoration mechanisms, which are similar to those of SDH and typically work within 50ms.
Node Reliability
Earlier we discussed how network topology can protect against link failure or even node failure. However at the edge of the network where access devices connect to single nodes, a node failure can often cause an outage to a specific group of access devices. Nodes are protected against failure in a number of ways. The following discusses how hardware and software are organized to protect against node failure as well as some new mechanisms like graceful restart and Non-stop routing.
Hardware failures
Node failure can occur if a critical part of the router incurs a hardware failure. Most routers are designed to withstand any single hardware failure. Switch fabrics, control processors, fans and power supplies are usually supplied in a redundant manner. Most link connections, such as SONET/SDH or Ethernet, are also done in a redundant fashion with at least two cards. However, even with such precautions node failure can occur in a situation where a double
The multi-service network should be of the highest level of reliability, as it needs to support the service which has the highest reliability requirement.
5 SEPTEMBER 2006 . ISSUE 23
Software failures
A more frequent reason for node failure is software failure. A typical software failure results in the processor function moving to the redundant unit. However, when both processors undergo a software failure, a node failure occurs. A common case is when a noncritical task, such as diagnostics, incurs a serious software failure (like a TRAP). Unless the software is modular and written as tasks, this might cause a full processor failure. Sometimes software failures occur in multiple nodes at the same time, because the same problem is triggered in each of them. This is termed a catastrophic failure.
A number of mechanisms are employed to eliminate or reduce software failure:
Software is written in a manner where exception conditions are handled gracefully.
Software is organized in a modular fashion, so that each key function runs as an independent software task which has its own memory, disk, real-time and stack pre-allocated. These mechanisms decouple different software processes (tasks) so that they do not induce failures in other software processes.
The redundant processor unit receives frequent updates so that it can take over very quickly in case of processor failure. This is discussed further in the Graceful Restart and Non-stop routing sections.
packet forwarding processor to work even if the control processor is not working. This is required in several possible events, such as when the control processor is receiving a software upgrade. During a planned switch-over to the inactive control processor, the graceful restart feature allows the neighboring routers to continue to send traffic to this router. This is done by sending messages to the neighbors prior to the scheduled downtime to make sure they do not take down the connecting links. The forwarding plane of the router with the scheduled downtime continues to forward packets using the forwarding table supplied earlier. This works well when a router is scheduled to be down, but what happens if the control processors go out of service unexpectedly? Generally, standards-based graceful restart does not handle this case; however, a number of additional capabilities allow the routers to continue to forward packets even then. This is typically achieved by allowing the forwarding plane to process continuity packets such as Hello messages, so that the neighboring routers do not detect the loss of their neighbor's control plane.
messages. In this way the redundant processor is kept updated about the state of all the protocols (such as BGP) and can take over in an almost hitless fashion (less than 50ms). Non-stop routing is becoming popular with router vendors and carriers, as it seems to be more robust than using graceful restart only.
Security
A number of mechanisms are employed to secure the network from hacking and unauthorized access. These can be grouped into access security, where suspicious or non-conforming traffic is dropped; authentication, to make sure that only the correct traffic enters the network; and encryption of sensitive traffic. A full discussion of security mechanisms is beyond the scope of this paper.
Configuration Protection
To protect the network from incorrect or unauthorized configurations, a number of processes need to be employed. These include the following:
All changes in configuration need a team review. This protects against errors from a single person.
There is software available which can check for common errors in configurations for most of the popular routers available today.
The previous configuration should be saved in case the new configuration causes serious errors.
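The last two practices, checking a configuration before it is loaded and keeping the previous one for fallback, can be sketched as follows. This is a minimal illustration under stated assumptions: the `ConfigStore` class, the `duplicate_address_check` rule and the one-line-per-setting configuration format are all hypothetical and do not correspond to any particular router or tool.

```python
from typing import Callable, List, Optional

Check = Callable[[str], List[str]]  # a check returns a list of error messages

def duplicate_address_check(config: str) -> List[str]:
    """Hypothetical check: flag any 'address X' line that appears more than once."""
    seen, errors = set(), []
    for line in config.splitlines():
        line = line.strip()
        if line.startswith("address "):
            if line in seen:
                errors.append(f"duplicate: {line}")
            seen.add(line)
    return errors

class ConfigStore:
    """Hypothetical holder for a node's running and previous configuration."""

    def __init__(self, running: str) -> None:
        self.running = running
        self.previous: Optional[str] = None

    def apply(self, candidate: str, checks: List[Check],
              healthy: Callable[[str], bool]) -> List[str]:
        """Load a candidate only if it passes all checks; roll back if unhealthy."""
        errors = [msg for check in checks for msg in check(candidate)]
        if errors:
            return errors  # rejected before it ever reaches the network
        self.previous, self.running = self.running, candidate
        if not healthy(self.running):
            # Restore the saved configuration after a failed load.
            self.running, self.previous = self.previous, None
            return ["health check failed after load; previous configuration restored"]
        return []

store = ConfigStore("address 10.0.0.1")
bad = "address 10.0.0.2\naddress 10.0.0.2"
print(store.apply(bad, [duplicate_address_check], healthy=lambda cfg: True))
print(store.running)
```

The team review step is a process rather than something code can replace; the sketch only covers the automated check and the saved fallback.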
Non-stop routing
Since the control plane typically runs on redundant processors, the probability of both processors being out of service is negligible. However, the time it takes for the redundant processor to take over can be several seconds or even minutes. The elimination of this outage time is the primary function of Graceful Restart. Another method of eliminating this outage time is to update the redundant processor in a way it can take over in a hitless manner. There are two key options to achieve this objective. The redundant processor can run in a mirrored fashion as the main processor, receiving every message that the main processor receives and running exactly the same code as the main processor. This is typically not implemented as a software fault would manifest itself in both processors at the same time. A second and more popular option is to checkpoint all state information from the main processor to the redundant processor via specific
Graceful restart
Modern routers have separate control and data planes; that is, packet forwarding and control message processing are done independently and by different processors. Packet forwarding processors (typically network processors) reside in the interface cards, and the central processors (typically in the CPU card) work on the control messages from other routers. Earlier monolithic routers typically had one processor which forwarded data packets as well as processed control messages. This separation of planes allows the