
Experts Forum

Designing Reliability in Converged Multi-Service Networks



By Shahid Akhtar

Shahid Akhtar, Chief Engineer of Next Generation Networks at Etisalat. Mr. Akhtar has over 18 years of experience in the field of telecommunications, with both vendors and carriers. He has served in senior management roles at Nortel and Ciena, and is currently Chief Engineer of Next Generation Networks at Etisalat. He has authored or co-authored three US patents. He holds a B.Sc. and an M.Sc. in Electrical Engineering from Washington University in St. Louis, U.S.A.

SEPTEMBER 2006 . ISSUE 23

Huawei Technologies

After years of operating experience, everyone knows you can never be too careful about guaranteeing reliability in telecom networks. On one hand, we must provide customers with brand-new service experiences in a highly efficient, flexible way, while on the other, we must be able to ensure the same reliability in multi-service IP bearer networks as in traditional networks.



Fig.1 Four layers of the NGN bearer network

Fig.2 Topology modes of the core P network

We must ensure equivalent reliability in multi-service IP bearer networks as in traditional networks.

Foreword
Increasingly, networks based on packet forwarding technology are being used to provide the core transport function for new multi-service networks. Since these networks are planned to bear the majority of a typical carrier's services, their availability becomes of paramount importance. It would also be valuable to be able to engineer different levels of reliability for different types of services. The base network, which all services share, should have the highest level of reliability, as it needs to support the service with the highest reliability requirement.

The Typical Converged Network


The typical architecture of converged networks can be modeled in four layers, as indicated in Fig.1. These layers are the following:

The P router layer. It is used as a stable core transport layer that carries traffic from the PE layer, and it does not hold any customer-specific configuration.

The PE router layer. It is used as an access layer to the P router layer. Typically it also houses the PE function of IP-VPN as well as VPLS.

The Ethernet layer. It is used as an aggregation layer to the router (Layer 3) layer. Its nodes are typically located in a service provider's small central offices. A number of technologies fit into this layer:

Ethernet on SONET/SDH. The new class of SONET/SDH equipment, which is optimized for carrying data, can emulate Ethernet capability.

RPR (Resilient Packet Ring) or RPR over SONET/SDH. The RPR function was designed for bearer access, so it is a natural fit for these conditions. Although RPR is not exactly Ethernet, it works with the Ethernet functions in other devices to create an Ethernet layer.

Traditional Ethernet. Typical Ethernet based on spanning trees can be used in this layer. Standards bodies have added several features to traditional Ethernet to make it more carrier-class.

The MSAN (Multi-Service Access Node) or access layer. The MSAN can be used to emulate the functionality of an IP-DSLAM, an access gateway or an NG-DLC. The MSAN typically connects to the Ethernet layer using Ethernet. FTTH nodes such as OLTs can also be used in the access layer.

This paper describes the methods and techniques used to engineer high levels of availability in the converged network described above.

Factors Affecting Reliability


There are a number of factors which affect network availability in such networks. They are described below. A more detailed discussion on each of these factors is given later.

Fig.3 PE layer connection modes

Network topology
Network topology is perhaps the most important factor in network availability. Redundancy can be provided in the network to route traffic around link or node failures. Site-level redundancy can also be provided, so that the network can restore itself if a whole site fails. Since networks are sometimes built in layers on top of each other, a failure in a lower layer may cause multiple links in the upper layer to fail. This issue can be handled by grouping the links that fail together and planning for the whole group to fail at once.
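The grouping idea can be sketched as a simple design check: links that ride the same lower-layer resource carry a shared group label, and the design is tested by failing each group as a whole. This is only an illustration. The topology, node names and group labels below are invented and are not taken from the article's figures.

```python
from collections import defaultdict

# Hypothetical upper-layer links: (node_a, node_b, shared_risk_group).
# Links with the same group label are assumed to ride the same lower-layer
# resource (for example the same fiber duct) and therefore fail together.
LINKS = [
    ("A", "B", "duct-1"),
    ("A", "C", "duct-1"),
    ("B", "C", "duct-2"),
    ("C", "D", "duct-3"),
    ("B", "D", "duct-4"),
]


def is_connected(nodes, links):
    """Return True if every node can be reached from an arbitrary start node."""
    adjacency = defaultdict(set)
    for a, b, _group in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == set(nodes)


def groups_that_partition(links):
    """List the shared-risk groups whose failure disconnects the network."""
    nodes = {n for a, b, _group in links for n in (a, b)}
    groups = {group for _a, _b, group in links}
    return sorted(
        group
        for group in groups
        if not is_connected(nodes, [l for l in links if l[2] != group])
    )


if __name__ == "__main__":
    print("Groups whose failure partitions the network:", groups_that_partition(LINKS))
```

In this made-up example every single link failure is survivable, yet losing the group `duct-1` isolates node A, which is exactly the kind of case that planning per group rather than per link is meant to catch.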

Node failures typically entail a longer restoration time.

Security
Networks and nodes may fail due to the natural failure rate of hardware or due to software bugs. They can also fail if a person or system actively tries to make the network fail. Various security measures are put into place to make sure that no unauthorized action can take the network down.

Restoration speed
Restoration speed, or restoration time, is another key measure of reliability. Traditionally, the SONET/SDH layer provided a restoration time of 50ms, and over time this has been accepted as a requirement for many services. However, some detailed studies have revealed that most services can withstand a restoration time of up to 1s. Even so, faster restoration leads to better network availability. The MTTR (Mean Time To Repair) indicates the network's restoration capability, and the MTBF (Mean Time Between Failures) indicates how often failures occur. Availability is related to MTTR and MTBF as follows: Availability = MTBF / (MTBF + MTTR).
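A short worked example may help show how strongly restoration time drives availability. It uses the standard steady-state definition, Availability = MTBF / (MTBF + MTTR). The MTBF and MTTR figures are illustrative assumptions, not measurements from any network.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the network is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(mtbf_hours, mttr_hours):
    """Expected downtime per year implied by the availability figure."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * MINUTES_PER_YEAR


if __name__ == "__main__":
    mtbf = 10_000.0  # assumed: one failure roughly every 14 months
    # Assumed repair/restoration times: 4 hours, 1 hour, 1 second, 50 ms.
    for label, mttr in [("4 h", 4.0), ("1 h", 1.0), ("1 s", 1 / 3600), ("50 ms", 0.05 / 3600)]:
        a = availability(mtbf, mttr)
        d = downtime_minutes_per_year(mtbf, mttr)
        print(f"MTTR {label:>5}: availability {a:.9f}, downtime {d:.4f} min/year")
```

For a fixed MTBF, every reduction in MTTR reduces the expected downtime almost proportionally, which is why the restoration mechanisms in each layer matter so much.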

Configuration control
Several studies on network failures and their causes have shown that configuration failures account for a large percentage of all failures. A number of mechanisms can be put into place for testing and checking configurations before they are loaded on the network. The following sections detail each of the above factors in each of the converged network layers.

Each of the core P network topology options (full mesh, partial mesh and ring) is shown in Fig.2 and described below.

A full mesh is used to obtain a very high level of connectivity and reliability in the core. However, it becomes very expensive as the number of nodes grows: the number of links in a full mesh network of N nodes is L = N(N-1)/2. It is generally quite hard to have a full mesh of fiber cables through the core, so some SRLGs (Shared Risk Link Groups) should be expected.

A partial mesh is used when a full mesh is considered too expensive. Often an agreed degree of connectivity is chosen for a partial mesh; for example, the partial mesh network in Fig.2 is 3-connected. Such a network, where each link lies on a separate fiber cable, can withstand two simultaneous link failures, assuming appropriate traffic engineering has been done for those failures.

Finally, a ring network is the most economical restorable network. It can withstand the loss of a single link or a single node.

Sometimes the core P network is built in a dual-plane format. In this case, each site contains two routers, and the two router planes transport traffic in parallel. Typically a partial mesh is used when a dual plane is used.
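To make the cost argument concrete, the link counts of the three topologies can be compared directly. This is a small sketch with invented function names. The full mesh count is the formula L = N(N-1)/2 given in the text. The partial mesh figure is only a lower bound (each node of a k-connected network needs at least k links), not a design rule from the article.

```python
def full_mesh_links(n):
    """Links in a full mesh of n nodes: L = N(N-1)/2."""
    return n * (n - 1) // 2


def ring_links(n):
    """Links in a single ring of n nodes (a ring needs at least 3 nodes)."""
    if n < 3:
        raise ValueError("a ring needs at least 3 nodes")
    return n


def min_partial_mesh_links(n, k):
    """Lower bound on links in a k-connected network: ceil(n * k / 2)."""
    if not 2 <= k < n:
        raise ValueError("need 2 <= k < n")
    return (n * k + 1) // 2


if __name__ == "__main__":
    print(f"{'nodes':>5} {'ring':>6} {'3-connected (min)':>18} {'full mesh':>10}")
    for n in (5, 10, 20, 50):
        print(f"{n:>5} {ring_links(n):>6} {min_partial_mesh_links(n, 3):>18} {full_mesh_links(n):>10}")
```

The ring and partial mesh grow linearly with the number of nodes, while the full mesh grows quadratically, which is why a full mesh quickly becomes impractical in larger cores.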

PE layer network topology


PE routers typically connect either in chains that attach to two different P routers or via dual-homed connections to two P routers, as shown in Fig.3. Either arrangement allows any single P router to fail without causing an outage.

Network Topology
As mentioned before, the converged network can be divided into four layers: the P layer, the PE layer, the Ethernet layer and the access layer. This section explores the topology options for each of these layers.

Ethernet layer topology


The Ethernet layer is typically arranged in a topology similar to that of the PE layer. Ethernet nodes are arranged in a dual-homed fashion or as chains connected to two PE routers. If Ethernet or Ethernet over SDH is used, dual homing can be achieved by connecting the Ethernet rings to two routers and running VRRP (Virtual Router Redundancy Protocol) between the routers.

Node reliability
Node reliability is part of network reliability. However, nodes are typically expected to be much more reliable than links, as they typically have redundant hardware. It is more difficult to provide node-level reliability in a network.

Core P network topology


The core P network can be a full mesh, a partial mesh or a ring network, as shown in Fig.2.


MSAN or access layer topology


The MSAN topology is very similar to the Ethernet topology. In many cases MSANs are single homed to a single Ethernet switch via Ethernet link aggregation. RSTP or MSTP based protection is used for MSAN rings connected to multiple Ethernet switches.

Restoration Speed
The time taken to restore service after a failure is the MTTR defined earlier. Given the definition of availability mentioned earlier, faster restoration increases availability. In addition, a certain speed of restoration may be required for specific services in order to meet their SLA requirements. The following describes the restoration options in each of the layers of the converged network.

Running all traffic over RSVP tunnels may in turn require tunnels from every end-point to every other end-point. This typically needs on the order of N^2 tunnels in the network, which can add significant complexity.

IGP based protection. If such stringent protection times are not required, traffic can run over the IGP or over LDP (Label Distribution Protocol). Both of these depend on IGP restoration, which typically acts within 0.5-2s.

Both FRR and IGP based protection assume a very fast failure detection mechanism. PoS links can recognize failures in just a few milliseconds; however, Ethernet links may take several seconds. A new protocol called BFD (Bidirectional Forwarding Detection) is being developed which can detect failures over Ethernet in 10-40ms.

Other protection methods for RSVP tunnels. RSVP tunnels can also be protected using a few other methods, such as backup tunnels. However, these are typically not used in most networks.
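The interplay between failure detection and restoration can be sketched as a simple outage budget. The model below assumes that the worst-case outage is the detection time plus the restoration time, which is a simplification made for this example. Figures marked "assumed" are placeholders; the others are the values quoted in the text, and the 1s budget is the tolerance mentioned earlier for most services.

```python
# Worst-case failure detection times in seconds.
DETECTION_S = {
    "pos": 0.005,             # "a few milliseconds" (assumed to be 5 ms)
    "ethernet_native": 3.0,   # "several seconds" (assumed to be 3 s)
    "ethernet_bfd": 0.040,    # upper end of the quoted 10-40 ms range
}

# Worst-case restoration times in seconds once the failure is detected.
RESTORATION_S = {
    "frr": 0.050,  # FRR: within 50 ms
    "igp": 2.0,    # IGP or LDP: upper end of the quoted 0.5-2 s range
}


def worst_case_outage(detection, restoration):
    """Assumed additive model: outage = detection time + restoration time."""
    return DETECTION_S[detection] + RESTORATION_S[restoration]


def meets_budget(detection, restoration, budget_s=1.0):
    """True if the worst-case outage fits within the service's tolerance."""
    return worst_case_outage(detection, restoration) <= budget_s


if __name__ == "__main__":
    for detection in DETECTION_S:
        for restoration in RESTORATION_S:
            outage = worst_case_outage(detection, restoration)
            verdict = "ok" if meets_budget(detection, restoration) else "too slow"
            print(f"{detection:>16} + {restoration}: {outage:.3f} s ({verdict})")
```

Under these assumptions, even 50ms FRR cannot meet a 1s budget over Ethernet links without a fast detection mechanism such as BFD, because detection alone consumes the whole budget.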

Ethernet over SDH uses SDH for restoration, which can achieve a restoration time of 50ms; RPR has its own restoration mechanisms, which are similar to those of SDH and typically work within 50ms.

Node Reliability
Earlier we discussed how network topology can protect against link failure or even node failure. However, at the edge of the network, where access devices connect to single nodes, a node failure can often cause an outage for a specific group of access devices. Nodes are protected against failure in a number of ways. The following discusses how hardware and software are organized to protect against node failure, as well as some newer mechanisms such as graceful restart and non-stop routing.

Hardware failures
Node failure can occur if a critical part of the router incurs a hardware failure. Most routers are designed to withstand any single hardware failure. Switch fabrics, control processors, fans and power supplies are usually supplied in a redundant manner. Most link connections, such as SONET/SDH or Ethernet, are also made in a redundant fashion with at least two cards. However, even with such precautions, node failure can occur in a situation where a double hardware fault occurs.

Restoration in the IP/MPLS layer


In the P and PE networks, IP and MPLS protocols are used for routing and control. There are a number of restoration options available in this layer.

FRR (Fast Re-Route) based protection. If fast restoration (better than 50ms) is desired, then RSVP tunnel based FRR protection should be employed. This does require all traffic to run over RSVP tunnels.

Restoration in the Ethernet layer


As explained earlier, the Ethernet layer can be built using native Ethernet, VPLS, Ethernet over SDH or RPR. The restoration options for each are as follows. Native Ethernet typically uses RSTP for restoration, which typically converges in 1-2s; VPLS uses MPLS for restoration, for which several options are available as described above; Ethernet over SDH and RPR rely on their own restoration mechanisms.

The multi-service network should have the highest level of reliability, as it needs to support the service which has the highest reliability requirement.

A double hardware fault can still cause a node failure.

Software failures
A more frequent reason for node failure is software failure. A typical software failure results in the affected processor function moving to the redundant unit. However, when both processors undergo a software failure, a node failure occurs. A common cause is a serious software failure (such as a trap) in a non-critical task, such as diagnostics. Unless the software is modular and written as separate tasks, this might cause a full processor failure. Sometimes software failures occur in multiple nodes at the same time because the same problem arises in each of them. This is termed a catastrophic failure.

A number of mechanisms are employed to eliminate or reduce software failures:

Software is written so that exception conditions are handled gracefully.

Software is organized in a modular fashion, so that each key function has an independent software task with its own pre-allocated memory, disk, real-time and stack resources. This decouples the different software processes (tasks) so that they do not induce failures in one another.

The redundant processor unit receives frequent updates so that it can take over very quickly in case of processor failure. This is discussed further in the Graceful restart and Non-stop routing sections.

Keeping the packet forwarding processor working while the control processor is not working is required in several situations, such as when the control processor is receiving a software upgrade. During a planned switch-over to the inactive control processor, the graceful restart feature allows the neighboring routers to continue to send traffic to this router. This is done by sending messages to the neighbors prior to the scheduled downtime to make sure they do not take down the connecting links. The forwarding plane of the router with the scheduled downtime continues to forward packets using the forwarding table supplied earlier.

This works well when a router is scheduled to be down, but what happens if the control processors go out of service unexpectedly? Generally, standards-based graceful restart does not handle this case; however, a number of additional capabilities allow routers to continue forwarding packets even then. This is typically achieved by allowing the forwarding plane to process continuity packets, such as Hello messages, so that the neighboring routers do not detect the loss of their neighbor's control plane.

With state checkpointing, the redundant processor is kept up to date on the state of all the protocols (such as BGP) and can take over in an almost hitless fashion (in less than 50ms). Non-stop routing is becoming popular with router vendors and carriers, as it seems to be more robust than using graceful restart alone.
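The checkpointing approach can be illustrated with a toy model. This is a sketch only: the class and method names are invented for this example, the "protocol state" is reduced to a simple route table, and real routers checkpoint far richer state (such as BGP sessions) using vendor-specific mechanisms.

```python
class ControlProcessor:
    """Toy control processor holding protocol state as a route table."""

    def __init__(self):
        self.routes = {}   # prefix -> next hop
        self.last_seq = 0  # sequence number of the last applied change

    def apply(self, seq, prefix, next_hop):
        """Apply one state change; next_hop=None withdraws the route."""
        if next_hop is None:
            self.routes.pop(prefix, None)
        else:
            self.routes[prefix] = next_hop
        self.last_seq = seq


class RedundantPair:
    """Active and standby processors kept in step by checkpoint messages."""

    def __init__(self):
        self.active = ControlProcessor()
        self.standby = ControlProcessor()
        self._seq = 0

    def on_routing_update(self, prefix, next_hop):
        self._seq += 1
        self.active.apply(self._seq, prefix, next_hop)
        # Checkpoint message: the standby receives the resulting state
        # change, not the raw protocol message, and does not re-run the
        # protocol logic that produced it.
        self.standby.apply(self._seq, prefix, next_hop)

    def switch_over(self):
        """Promote the standby. A fresh standby would then need a full resync."""
        self.active, self.standby = self.standby, ControlProcessor()
        return self.active


if __name__ == "__main__":
    pair = RedundantPair()
    pair.on_routing_update("10.1.0.0/16", "192.0.2.1")
    pair.on_routing_update("10.2.0.0/16", "192.0.2.2")
    pair.on_routing_update("10.1.0.0/16", None)
    print("Routes after switch-over:", pair.switch_over().routes)
```

Because the standby receives state rather than executing the same code on the same inputs, a software fault triggered by a particular message on the active processor is less likely to be reproduced on the standby.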

Security
A number of mechanisms are employed to secure the network from hacking and unauthorized access. These can be grouped into access security, where suspicious or non-conforming traffic is dropped; authentication, which makes sure that only the correct traffic enters the network; and encryption of sensitive traffic. A full discussion of security mechanisms is beyond the scope of this paper.

Configuration Protection
To protect the network from incorrect or unauthorized configurations, a number of processes need to be employed. These include the following:

All changes in configuration need a team review. This protects against errors from a single person.

There is software available which can check for common errors in configurations for most of the popular routers available today.

The previous configuration should be saved in case the new configuration causes serious errors.
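The three processes above can be sketched together as a small commit workflow. Everything here is hypothetical: the configuration is reduced to a map of interface names to addresses, the only automated check is for invalid or duplicated addresses, and the threshold of two reviewers is an assumption standing in for "team review". Real configuration checkers cover far more.

```python
import ipaddress


def check_config(interfaces):
    """Return a list of problems found in {interface_name: "address/prefix"}."""
    problems = []
    seen = {}
    for name, cidr in interfaces.items():
        try:
            iface = ipaddress.ip_interface(cidr)
        except ValueError:
            problems.append(f"{name}: invalid address {cidr!r}")
            continue
        if iface.ip in seen:
            problems.append(f"{name}: address {iface.ip} already used by {seen[iface.ip]}")
        else:
            seen[iface.ip] = name
    return problems


class ConfigStore:
    """Holds the running configuration and the one saved before the last change."""

    def __init__(self, initial):
        self.running = dict(initial)
        self.previous = None

    def commit(self, candidate, reviewers):
        # Team review: assumed here to mean at least two distinct reviewers.
        if len(set(reviewers)) < 2:
            raise PermissionError("change needs review by at least two people")
        # Automated check for common errors before anything is loaded.
        problems = check_config(candidate)
        if problems:
            raise ValueError("; ".join(problems))
        # Save the previous configuration so the change can be undone.
        self.previous = self.running
        self.running = dict(candidate)

    def rollback(self):
        if self.previous is None:
            raise RuntimeError("no saved configuration to roll back to")
        self.running, self.previous = self.previous, None
```

A change that passes review and the automated check replaces the running configuration, and `rollback()` restores the saved one if the new configuration turns out to cause serious errors.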

Non-stop routing
Since the control plane typically runs on redundant processors, the probability of both processors being out of service is negligible. However, the time it takes for the redundant processor to take over can be several seconds or even minutes. The elimination of this outage time is the primary function of Graceful Restart. Another method of eliminating this outage time is to keep the redundant processor updated in such a way that it can take over in a hitless manner. There are two key options to achieve this objective.

The redundant processor can run as a mirror of the main processor, receiving every message that the main processor receives and running exactly the same code. This is typically not implemented, as a software fault would manifest itself in both processors at the same time.

A second and more popular option is to checkpoint all state information from the main processor to the redundant processor via specific messages.

Graceful restart
Modern routers have separate control and data planes; that is, packet forwarding and control message processing are done independently and by different processors. Packet forwarding processors (typically network processors) reside on the interface cards, and the central processors (typically on the CPU card) handle the control messages from other routers. Earlier monolithic routers typically had one processor which both forwarded data packets and processed control messages. This separation of planes allows the packet forwarding processor to work even if the control processor is not working.

Spare Capacity Planning


In order for the restoration mechanisms to work properly, appropriate bandwidth (and other resources) needs to be reserved on the spare routes. Planning this spare capacity in IP/MPLS or Ethernet networks is typically done using offline tools, which optimize the use of this capacity so that multiple failures use the same spare capacity. A detailed discussion of this issue is also beyond the scope of this document.

Editor: Xu Ping x.ping@huawei.com