
Link Aggregation with Failover

4/15/2009

Contents
Introduction
Terminology
References
Link Aggregation Types
Topologies
Direct Connect
Private Network
Local Network
Remote Network
Data Domain Link Aggregation and Failover
Bond Functions Available in Linux Distribution
Hash Methods Used
Link Failures
Other Link Aggregation
Cisco
Sun
Windows
AIX
HPUX
Data Domain Link Aggregation and Failover in the Customer's Environment
Normal Link Aggregation
Failover of NICs
Failover Associated with Link Aggregation
Recommended Link Aggregation
Switch Information

Introduction
This document describes the use of link aggregation and failover techniques to maximize throughput on
networks with Data Domain systems installed. The basic topologies are described with notes on the usefulness
of different aggregation methods, so the right method can be chosen for the site.
The goal of link aggregation is to evenly split the network traffic across all the links or ports that are in the
aggregation group. This is done to maximize the network throughput on the LAN or LANs until the processing
limit of the computer is reached. Normally the aggregation is between the local system and the network device
or system to which it is connected, usually a switch or router. In theory, aggregation allows the system to send
data on both links at the same time and therefore can provide up to double the throughput.
There are a few things that can impact how well the aggregation actually performs.
1. Speed of the switch
2. How much the DDR can process
3. Network overhead
4. Acknowledging and coalescing out-of-order packets
5. Number of clients
6. Number of streams (connections) per client
7. The aggregation method may not effectively distribute the data evenly across all the links

For impact 1, normally the switch can handle the speed of each link that is connected to it, but it may lose some
packets if all the packets coming from several ports, all running at maximum speed, are concentrated on one
uplink. Note: this implies that only one switch can be used for port aggregation coming out of a system. For
most implementations this is true, but there are some network topologies that allow for link aggregation
across multiple switches.
Impact 2 addresses the DDR itself. The processing rate of DDR systems and their programs is limited. As the hardware
gets faster and the use of parallel processing improves, DDR systems will support a higher network throughput,
but as the processing speed increases the network link speed will also increase. For example, with the current
systems it makes sense to aggregate 1 GbE links but not 10 GbE links, because one 10 GbE link can provide
enough data to saturate the processing power of the current DDR systems. As the system speed improves it
will make sense to aggregate 10 GbE links.
Impact 3 addresses the inherent overhead of the network programs. This overhead guarantees that the
transfer speed will never reach 100% of the link rate. The throughput will always be reduced by the overhead it takes to create
and send a packet of data through the system until it is put onto the wire. There is also an inherent delay separating
the sending of packets on Ethernet.
Impact 4 deals with the case where packets arrive out of order. The network program will need to coalesce
out-of-order packets back into the original order. If the link aggregation mode allows packets to be sent out of
order and the protocol requires that they be put back into the original order, this added overhead may reduce
throughput to the point where a link aggregation mode that causes out-of-order packets should not be used.
Impact 5 asks whether a single client can drive data fast enough to fully utilize multiple aggregated links. In most cases, either the
physical or OS resources cannot drive data at multiple Gbps. Also, due to hashing limitations, multiple clients
would be required to push data at those speeds.
Impact 6 is the number of streams. The number of streams, which translates to separate connections, can play a significant role in link utilization
depending on the hashing that is used.
A final impact deals with the effectiveness of the aggregation method used. If two systems are connected
together by direct connect cables, the use of Layer 2 (MAC) hashing or Layer 3 (IP) hashing would not provide
any aggregation at all. All the packets would go over the same link. In general the number of systems that will
be communicating with the Data Domain system will be small. So the aggregation method used will need to
work for a limited number of target systems.
The number of links that are aggregated will depend on the switch performance, the DDR system and
application performance and the link aggregation mode used.

Terminology:
DDR
Data Domain appliance, a Linux system used to perform only Data Domain operations.

EtherChannel
This is a term used by Cisco for the bundling of network links as described under Ethernet
Channel. With Cisco there are three ways to form an EtherChannel: manually, automatically using
PAgP, and automatically using LACP. If it is done manually, both sides have to be set up by the
administrator. If one of the protocols is used, protocol packets are sent to the other side and the
EtherChannel is set up based on the information in those packets.

Ethernet Channel
This is a set of individual Ethernet links bundled into a single logical link between systems.
This provides a higher throughput than a single link does. The term used by Cisco to identify this is
EtherChannel. The actual throughput depends on the number of links bundled together, the
speed of the individual links, and the switch or router that is actually being used. If a link
within the Ethernet Channel fails, the traffic that normally flows over the failed link is sent over the remaining links
within the bundle.

LACP
Link Aggregation Control Protocol (LACP) provides a dynamic network aggregation as defined in IEEE
802.3ad standard. This is not available in DDOS 4.9 and before.

Link Aggregation
Using multiple Ethernet network cables or ports in parallel, Link Aggregation increases the link speed
beyond the limits of any one single cable or port. Link aggregation is usually limited to being connected
to the same switch. Other terms used are EtherChannel (from Cisco), Trunking, Port Trunking, Port
aggregation, NIC bonding, and Load balancing. There are proprietary methods that are used, but the
main standard method is IEEE 802.3ad. Link aggregation can be used for a type of failover too.

Load Balancing
Aggregation methods used to try to distribute loads across all available links or ports.

Port Aggregation Protocol (PAgP)
This is Cisco's proprietary networking protocol providing logical aggregation of Ethernet ports. It is
used in Cisco's EtherChannel. This is the older method used by Cisco; later releases of their software
use the standard LACP to provide the same type of functions. Note that PAgP EtherChannels do not
interoperate with LACP EtherChannels. PAgP is not supported by DDRs.

Round Robin
Each new packet is sent to the least busy link or port. This usually means that packets are not sent to
the first link or port until packets have been sent to all the other links or ports, but it may take into
account the packet size in the distribution of packets.

RSTP
Rapid Spanning Tree Protocol, IEEE 802.1w, allows a network topology with bridges to provide
redundant paths. This allows for failover of network traffic among systems. It is an extension of the
Spanning Tree Protocol (STP), and the two names are used interchangeably.

TOE
TCP Offload Engine. Network interface cards (NICs) that implement the full TCP/IP stack on the card.

Trunking
Trunking is the use of multiple communication links to provide aggregated data transfer among
systems. For computers this may be referred to as port trunking to distinguish it from other types of
trunking such as frequency sharing.
Note: Cisco uses the term trunking to refer to VLAN tagging, not link aggregation, whereas other
vendors use this term in reference to link aggregation.

References:
Catalyst 4500 Series Switch Cisco IOS Software Configuration Guide (also used for the 4900 Series Switch too)
Release 12.2(44)SG, available from the Cisco Documentation site.
Cisco Documentation, http://www.cisco.com/univercd/home/home.htm
IEEE 802.3 Standard http://standards.ieee.org/getieee802/802.3.html
Also available under: http://iweb.datadomain.com/eweb/technical_library/Vendor/Cisco/
IEEE 802.3ad Standard is Clause 43 under IEEE802_3-sec3.pdf of the standards documents listed.
Linux distribution documentation, http://www.kernel.org/
Linux Ethernet Bonding Driver HOWTO, http://www.cyberciti.biz/howto/question/static/linux-ethernet-bonding-driver-howto.php, http://www.cyberciti.biz/tips/linux-bond-or-team-multiple-network-interfaces-nic-into-single-interface.html
Wikipedia, http://en.wikipedia.org/wiki/Main_Page
Various links on the web as noted within the document by hotlinks

Link Aggregation Types


Link aggregation needs to balance the number of packets across all the links within the aggregation group while
minimizing the splitting, reassembling, and reordering of out-of-order packets.
Currently IEEE 802.3ad is the accepted standard. This can be used by most systems that can support Link
aggregation, but there is no one size fits all. There are other aggregation types that may work better in some
situations such as round robin which is not part of the IEEE 802.3ad standard.
The IEEE 802.3ad standard is contained in clause 43 of the IEEE 802.3 standard that is freely available on the
web. In the IEEE standards the term clause 43 can be thought of as chapter 43. Clause 43 is part of the IEEE
802.3-2005 Section Three of the pdf file on the IEEE web site. A large part of the IEEE 802.3ad standard is
the LACP. This is a protocol that is used to coordinate the aggregation between the two systems that are
directly connected. Note: This standard does not identify how the actual link is selected to send a packet, but it
does emphatically mention two things: packets within a conversation should always be kept in order, and
packets should not be duplicated. For the purposes of this document, a conversation is the same as a
connection. Although the aggregation process is described in detail in this standard, the only thing that has an
impact on systems outside the local system is LACP and the messages sent using it. Note: LACP
is not supported in releases 4.9 and before.
If the IEEE standard is not used, the aggregation on both sides has to be set up manually. Other than round
robin, the DDR uses the Linux bonding module's balance-xor type to provide link aggregation. As implied in
the name, the aggregation is done by performing an XOR function on one or more of the addresses and/or port
numbers within the packet headers. This aggregation has to be manually set up on both sides and the
aggregation used needs to match. For example, if Layer 3+4 is used on the DDR it needs to also be used on the
system connected to the DDR.
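For example, if the DDR is configured with mode xor-L3L4 and is directly connected to a Linux media server, the matching setup on the media server side might look like the following sketch (the interface names and IP address are assumptions, and the exact steps vary by distribution):

# Load the Linux bonding driver with balance-xor and Layer 3+4 hashing to match the DDR
modprobe bonding mode=balance-xor xmit_hash_policy=layer3+4 miimon=100
# Bring up the bond and enslave the two interfaces cabled to the DDR
ifconfig bond0 192.168.50.2 netmask 255.255.255.0 up
ifenslave bond0 eth0 eth1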
An important consideration is the network topology. Important things to consider in the network topology are:
- The equipment directly connected to the DDR
  o It may be the media server or another DDR
  o If it is a switch or a router, the make and model number should also be determined
- Whether the target system is local or remote; there may be a gateway involved
- Whether the DDR is on a private network or shared with the rest of the customer's network
- The number of target computer systems that will be connected
- Single DDR or multiple DDRs
Each part of this information will have an impact on the type of link aggregation that is used.
Consider what systems will be doing the link aggregation. Normally link aggregation configuration requires
coordination from both the DDR system and the switch. Another type of link aggregation configuration can be
handled from the DDR system only (both transmit and receive). There is at least one network topology where a
switch may not be part of the configuration, i.e. direct connect. This will need the link aggregation to be
configured between the DDR and the media servers. If the DDR is on the local network and is communicating
with many systems, then using Layer 2 (MAC address) hashing could be acceptable. If the connection path goes through a
router/gateway, then Layer 3 (IP address only) or Layer 3+4 (IP address and port number) may be needed.
The link aggregation will need to address the use of different speed links, for example using both 1 GbE and
10 GbE. The 10 GbE TOE cards may have aggregation on the cards and not support aggregation off the card.
Most aggregation methods do not support links running at different speeds, so mixing speeds should be avoided.
There is also the question of the use of failover. Failover can be considered to be part of aggregation. Most
link aggregation modes include a failover component by allowing data transfer to continue in a degraded state.
For example, if one of the links goes down, the link aggregation can recognize this, drop that link from the
aggregation list, and continue with one less link. The customer may feel full failover is more important than link
aggregation. Instead of aggregating over multiple links, the links can be configured in full failover mode,
where idle spares that carry no data are kept until the active link fails. This way there would be no
degradation of throughput if the one link fails and data is sent over the other. One or more links would be kept in
standby mode until needed.
An administration network interface is also needed with DDRs. For direct connections and one-to-one server
connections there is a separate Ethernet interface for this, but it could also be part of the link aggregation
unless a physical separation is needed between the links.

Topologies
The basic types of network topologies are described below, along with their differing suitability for various types
of aggregation methods.

Direct Connect
The Data Domain system is directly connected to one or more backup servers. To be able to provide link
aggregation within this topology will require multiple links between each backup server and the Data Domain
system. Usually link aggregation is not done with this topology, especially with multiple backup servers,
because of the limited number of links available on the Data Domain system.

[Figure: Direct connect topology. The Data Domain system is cabled directly to the backup/media server, which also connects through a network switch to the business servers and to a tape library.]

Private Network
This topology is the same as the direct connect except the connections are through a switch rather than being
directly connected. This would normally be used to connect multiple media servers to multiple DDRs. The link
aggregation would be between a DDR and the switch or between a media server and the switch. The
aggregation would be to get the data to and from the switch. In this case the aggregation between the DDR and
the switch would be independent of the aggregation used between the media server and the switch. Note:
there is a possible special case where the switch would be only a pass through and would be transparent to the
aggregation. That would not be the norm and is discussed in further detail later.

[Figure: Private network topology. The Data Domain system and the backup/media server connect through a private network switch; the backup/media server also connects through the main network switch to the business servers, with a tape library attached.]

Local Network
The Data Domain system is connected to the backup server through a common switch. In the previous network
topologies shown the Data Domain system may have a connection through the common switch to handle
administration and maintenance tasks which need not be part of the aggregation. In this example the data is
also being sent through the shared network.

[Figure: Local network topology. The Data Domain system, backup/media servers, business servers, and tape library all connect through a common network switch.]

Remote Network
This is similar to the local network except that the connection goes through a router before it gets to the media server.
There will normally be a switch between the DDR and the router unless the router also provides switch
functionality. What is important to note in this diagram is that there is a gateway function involved in the
network data flow. It is important to maximize the data throughput between the DDR and the media servers, so
normally the DDR will be located on the same LAN and use the same switch as the media server. With multiple
media servers, some of them may be on separate VLANs, and the DDR would need to go through at least one
gateway to get to them. It is not expected that the remote network will go across a WAN.
A WAN topology is more likely in the case of DDR replication. Normally the data flow in replication is
low enough that it does not need aggregation, and the WAN would tend to make aggregation ineffective.
Yet there has been one customer that has asked about it and may be pursuing it.

[Figure: Remote network topology. The Data Domain system connects through a network router, across the network to a second router, and then to the backup/media servers, business servers, and tape library.]

Data Domain Link Aggregation and Failover


There are two link aggregation methods supported by Data Domain:
- Round Robin
- Balanced-xor (set up manually on both sides)

The balanced-xor aggregation is selected by choosing the specific hash that is supported:
- Layer 2
- Layer 3+4

There are four virtual interfaces that can be used to define the aggregation or failover: veth0, veth1, veth2, and veth3.

Any of the physical links that are available on the system can be included: eth0, eth1, eth2, eth3, etc. The on-board links (eth0 and eth1) have only recently been allowed to be added to the aggregation group. Older
installations of the Data Domain software may not allow those two links to be aggregated.
To specify aggregation of eth2 and eth3 in the virtual interface veth0 one of the following commands would be
used:
Net aggregate add veth0 mode roundrobin interfaces eth2 eth3
The first network packet sent to veth0 will be forwarded to one of the interfaces and the next packet would
be forwarded to the other. Sending of packets will continue to alternate between the interfaces until there
are no more packets or a link fails. If eth3 loses physical connection all packets are sent through eth2 until
the eth3 link is brought back up. To make this effective the other side of the network will also need to be
setup to do round robin. For direct connect (the only topology that is recommended for round robin) the
media server will have to be able to setup and support round robin.
Net aggregate add veth0 mode xor-L2 interfaces eth2 eth3
The aggregation used would be balanced-xor. The packets are distributed across eth2 and eth3 based on an
XOR of the source and destination MAC addresses. Because there are only 2 links to be aggregated, the
lowest bit of the result is used to determine the interface to use for the packet. If the result is 0 one interface will be
chosen; if the result is 1 the other interface will be used. To get the packets spread across the two
links, data must be sent to more than one destination, and the MAC addresses of the destinations
need to differ in such a way that the XOR results provide different numbers. This means that one
address needs to be odd and the other needs to be even. If three links are aggregated, the
XOR result is split three ways (modulo 3). There must be at least two media servers with odd and even MAC
addresses to get any aggregation at all. In general, this aggregation should
not be used with fewer than 4 media servers.
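As an illustration of the odd/even requirement (the MAC addresses below are made up), suppose the DDR interface MAC ends in 0x10 and two media servers have MACs ending in 0x2A and 0x2B:

(DDR MAC 0x10) XOR (server A MAC 0x2A) = 0x3A -> lowest bit 0 -> eth2
(DDR MAC 0x10) XOR (server B MAC 0x2B) = 0x3B -> lowest bit 1 -> eth3

If both servers had MAC addresses ending in even values, both XOR results would be even and all their traffic would use the same link.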
Net aggregate add veth0 mode xor-L3L4 interfaces eth2 eth3
The aggregation used with this command will also be balanced-xor. The packets are distributed across eth2
and eth3 based on the XOR of the source IP address, destination IP address, source port number, and the
destination port number. The result gives a number in which the lowest bit is used to determine which link
to use to send the packet. An even result will go over one and an odd result will go over the other. With
three links the result is divided by 3 with the remainder determining which interface to use. This aggregation
would be used when there are a lot of connections (there is one connection per stream) or a lot of media
servers or both. This is the mode of choice for Data Domain, but some switches do not support this type of
hashing.
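As a sketch with illustrative addresses, consider a single media server at 192.168.1.20 writing to a DDR at 192.168.1.10 over two backup streams, with source ports 45001 and 45002 and destination port 2049 (all values assumed for the example):

(source IP XOR dest IP) AND 0xffff = 0x0114 XOR 0x010A = 0x001E
stream 1: (45001 XOR 2049) XOR 0x001E = 0xA7C8 XOR 0x001E = 0xA7D6 -> even -> eth2
stream 2: (45002 XOR 2049) XOR 0x001E = 0xA7CB XOR 0x001E = 0xA7D5 -> odd  -> eth3

Even with a single media server, the two streams land on different links because their port numbers differ.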
Net failover add veth0 interfaces eth2 eth3

This is not aggregation but the command will group together interfaces eth2 and eth3 for failover. There is
only one failover type supported. If the active physical link goes away the data is sent to the second
physical link. The active interface is determined by which link comes up first when it is setup. This is
nondeterministic. It is dependent on several factors such as switch activity, network activity, and which
interface is brought up first when they are enabled. The active one can be determined by specifying one of
the links as primary. The primary interface will always be set as active if it is UP and RUNNING.
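Under the hood this corresponds to the Linux bonding driver's primary option. A minimal sketch of the equivalent native configuration follows (interface names and the address are assumptions; the DD CLI exposes the primary setting through its own syntax, which is not shown here):

# Active-backup failover with eth2 preferred as the active link whenever it is up and running
modprobe bonding mode=active-backup miimon=100 primary=eth2
ifconfig bond0 192.168.50.1 netmask 255.255.255.0 up
ifenslave bond0 eth2 eth3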

Functions available in Linux distribution


The following is a summary of the aggregation and failover modes and hashing available. A more complete
description can be found in Documentation/networking/bonding.txt in the Linux distribution:

Mode Options
1. balance-rr or BOND_MODE_ROUNDROBIN (0)
Aggregation using Round Robin
Failover with degradation
Normally a good type to use with direct connect or something equivalent
To get full aggregation both ends of the link needs to be set up to use round robin
2. active-backup or BOND_MODE_ACTIVEBACKUP (1)
Failover method used by Data Domain
Works only when one or more standby links are in the group
There is one active link and all others in the group are standby
The active link is non-deterministic unless a primary is specified
3. balance-xor or BOND_MODE_XOR (2)
Send transmit to a specific NIC based on specified hash method being used
Default (Source MAC address XOR Destination MAC address) modulo size of aggregation
group
Note: this only aggregates transmissions.
The receive needs to be aggregated on the other end
This mode is referred to as static because of the manual setup that is needed.

Hash method used:


1. Layer 2
Uses (source MAC XOR destination MAC) modulo count of links in aggregation group
This works best when there are many hosts and they are connected to the same switch
All packets to or through a specific MAC address go through the same link
2. Layer 3+4
Uses ((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff))
modulo count of links in aggregation group
This works best with many connections and/or many media servers
This can work with as few as one media server
For packets that do not include the port number such as IP fragmentation packets and non-TCP
and non-UDP packets this method will use the IP address only. For non-IP packets the Layer 2
mode is used. It is because of these exceptions that this is not IEEE compliant. Note that the
Data Domain network configuration is set up so that packets are not fragmented.
The DD CLI simplifies the interface by not requiring the administrator to specify any more than is necessary.
Therefore there is no option to specify mode 3, balance-xor, directly. Rather by specifying the hash method to
use the CLI will set the mode to number 3.
The aggregation method used is very important to getting the desired performance. In general the aggregation
of choice is mode xor-L3L4 along with many streams, but if the DDR and media servers are directly
connected and there are enough links to do aggregation then mode roundrobin may work best. There are
some switches that do not support port number hashing; in this case mode xor-L3L4 will not work.
Consider also that the best aggregation may be to have each media server use a different link instead of
grouping them together. Consider the following example:

- four media servers
- each media server is sending data at the same time
- there are 4 links available on the DDR

Assign a different IP address to each link and set up each media server to send data to one unique IP
address on the DDR. That way the throughput will approach 4 times a single link speed versus around
2.5 times if aggregation is used. This is very dependent on the expected traffic pattern from the media servers.
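A sketch of that assignment (the addresses are illustrative only, and the DD CLI commands for per-interface addressing are not shown here):

eth2  192.168.1.1  <- media server 1 sends its backups to 192.168.1.1
eth3  192.168.1.2  <- media server 2 sends its backups to 192.168.1.2
eth4  192.168.1.3  <- media server 3 sends its backups to 192.168.1.3
eth5  192.168.1.4  <- media server 4 sends its backups to 192.168.1.4

Each media server then stays on its own physical link with no dependence on hashing behavior.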

Link failures
A link can fail at several places. It can occur in the driver, the wire, the switch, the router, or the remote system.
For failover to work the program (this is the bonding module in the Data Domain case) must be able to
determine that a link to the other side is down. This information is normally provided by the hardware driver.
For a simple case consider a direct connect where the wire is disconnected. The driver can sense that the carrier
is down and will report this back to the bonding module. The bonding module will mark it as down and switch to
a different link. The bonding module will continue to monitor the link and when it comes back up it will mark it as
up. If the restored link is marked as the primary the data will be switched back to using that link again.
Otherwise the data flow will stay on the current link.
Note: the failover method that is currently supported is for directly attached hardware. The driver can sense
when the directly attached link is no longer functioning, but beyond that it gets a little harder. Consider the case
where there is a switch, or maybe two, in the middle. Can the driver determine that the connection to the remote
system has failed and therefore it needs to switch to the backup? This is possible if the switch provides link
fault signaling similar to what is defined in IEEE 802.3ae. This is supported by the Fujitsu 10 GbE switch and a
similar capability is supported by Cisco. This is a rather limited network topology where the systems are directly
connected via switches and there are no other routes available. This would be an extension of the direct
connect to the media server. Currently the driver and the bonding module do not support link fault
signaling because it is not widely available and applies to too limited a network topology.
For a more complex case consider the local network but with a switch and a router in the network path. There
are at least two distinct paths that can be followed to get to the router. Failures have to be detectable
on any part of each network path. For example, if there is a failure at the router port to which the DDR link is
connected via the switch, the driver would have to be able to determine that the remote link is down and mark
that link as down. In this case the switch itself would be able to switch the signal to the other path between the
switch and the router, and a failover at the DDR is not needed. Once again, the DDR need only determine that
there is a failure between its NIC and the switch or router to which it is attached.
There are two types of failover. One is failover to a standby. The standby is not used until a failure
happens and the traffic is redirected to the standby link. This is a waste of resources if there is never a failover.
This is the method used by Data Domain when the bonding method failover is specified:
Net failover add veth1 interfaces eth3
Another type of failover is failover with degradation. In this method there is no standby. All the links in the
group are being used. If there is a failure, the failed link is removed and the network traffic from that
link is redirected to the other links in the group. This is the failover associated with link aggregation, but it can
become complex if the bonding driver has to determine that a path to the target system no longer exists and
must stop sending data to that link.

Other Link Aggregation


The link aggregation used depends on what network equipment the DDR is connected to and on the network
topology. The equipment connected to the DDR could be a switch, a router, or the target system. So it is
important to understand what aggregation is provided by other systems. Most switches and routers support
LACP link aggregation (the IEEE 802.3ad standard). Some offer proprietary aggregation types. If they offer
aggregation, they support an XOR of the Layer 2 (MAC) addresses to define which packet goes to which port.

Cisco
Some of the older Cisco switches and routers only support the older proprietary protocol, PAgP. The Data
Domain system will not support this type of aggregation. Fortunately, the newer switches and routers support
the IEEE 802.3ad standard. When using Cisco switches and routers the IEEE 802.3ad should be used with
Layer 3 and 4 hashing. It may be possible in some cases to set the aggregation with PAgP to round robin, but
that is not currently supported for the DDR when connected to a switch or a router because of throughput
delays from potential packet ordering issues. At high speeds with fast retransmissions, out-of-order packets can
generate many more packets, which would decrease the overall performance.

Nortel
Nortel supports an aggregation called Split Multi-Link Trunking, which uses LACP_AUTO mode link aggregation.

Sun
The initial release of Solaris 10 and earlier versions supported Sun Trunking. Later releases of Solaris 10
support the IEEE 802.3ad standard in communicating with switches. Back-to-back link aggregation is
supported in which two systems are directly connected over multiple ports. The balancing of the load can be
done with L2 (MAC address), L3 (IP address), L4 (TCP port number), or any combination of these. Note the
DDR currently only supports L2 or L3+L4. Link aggregation can run in either passive mode or active mode. At
least one side must be in active mode. The DDR always uses active mode.
Sun trunking supports round robin type of aggregation. This type of aggregation could be used if the DDR is
connected directly to a Sun system.
For more information on Sun Aggregation refer to the following:
http://docs.sun.com/app/docs/doc/816-4554/fpjvl?l=en&q=%22link+aggregation%22&a=view
For more information on Sun Trunking refer to the following:
http://docs.sun.com/source/817-3374-11/preface.html

Windows
Microsoft's view of link aggregation is that it is a switch problem or a hardware problem, so Microsoft feels that
it should be handled by the switch/router and the NIC card. There is nothing in the OS that directly supports it.
Rather, if customers want it they should get NIC cards that support it and either have a special driver to
initiate it or use the switch to drive it. In the current documentation for Windows Server 2008, Microsoft refers to the support
of PAgP, an old proprietary Cisco aggregation protocol:
http://blogs.technet.com/winserverperformance/
They also refer to Receive-Side Scaling (RSS):
http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx
This refers to a way of distributing the handling of received packets across CPUs; packet handling is normally tied to a
specific CPU per NIC. There are drivers from outside of Microsoft that at least provide passive IEEE 802.3ad support, if
not active. Passive support means that the Windows system will respond to the IEEE 802.3ad protocol
packets, but it will not generate them. For direct connect this may be the only way to have a directly connected
aggregated link. The following link provides Microsoft's view of servers for 2008:
http://technet2.microsoft.com/windowsserver2008/en/library/59e1e955-3159-41a1-b8fd047defcbd3f41033.mspx?mfr=true
If the Windows server is not directly connected, then it is not important to the DDR system if or how link
aggregation is provided by Windows. That would be between the Windows server and the switch/router.

More specific information on which NIC cards support link aggregation is still TBD.

AIX
According to an AIX and Linux administration guide AIX supports EtherChannel and IEEE 802.3ad types of link
aggregation as mentioned in the RSCT administration guide:
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.rsct.doc/rsct_aix5l53/bl5a
dm05/bl5adm0559.html
When using a DDR, the round robin available through EtherChannel can be used when directly connected.
IEEE 802.3ad can be used if Layer 4 hashing is included. If the DDR is not directly connected, then it is dependent on
the switch or router being used.
AIX uses a variant of EtherChannel for backup, referred to as EtherChannel backup. This is similar to the active
backup supported by the Linux bonding driver and does not need any handshake from the equipment connected
to the links except to have multiple links available.

HPUX
The link aggregation product is referred to as HP Auto Port Aggregation (APA). As with Linux bonding, this
product provides either a full standby failover or a degradation failover by overloading the other links within an
aggregation group. The aggregation can use Layer 2, Layer 3, and/or Layer 4 hashing for aggregating across
the links. It also supports the IEEE 802.3ad standard. A summary of the product is given here:
http://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=J4240AA
The administration guide can be found here:
http://docs.hp.com/en/J4240-90039/index.html
According to the administration guide, direct connect server-to-server is supported, but round robin type
aggregation does not seem to be. This is further brought out in figure 3-4 of the document, where for direct
connect it is recommended to have many connections for load balancing to be effective. With round robin,
multiple connections are not required for effective aggregation. With this understanding, HPUX systems
would not support round robin with a directly connected system.

Data Domain Link Aggregation and Failover in the Customer's Environment
The history of the use of failover and link aggregation in the Data Domain products is as follows:
1. Failover was added in release 4.3
2. Link aggregation was added in release 4.4
3. Source routing was set as the default in release 4.5 to allow separate NICs to reside on the same network
so that the response packets get sent via the correct route. This means the following settings are the default:
net option set net.ipv4.route.src_ip_routing 1
net option set net.ipv4.route.flush 2
4. Since July 2008 a primary can be specified for failover. If the primary is up it will always be the active link.
5. Since July 2008 release for 4.4.4, 4.5.2, and beyond the on-board NICs can be included in link aggregation.

Normal Link Aggregation


Normally link aggregation is between the DDR system and the equipment it is connected to. For example, if the
DDR is directly connected to a media server then the link aggregation is between the DDR and the media
server. All the links in the aggregation bond would be directly connected to the media server and the media
server would need to provide a compatible aggregation to the type used in the DDR.

If the DDR is connected to a switch the link aggregation is between the DDR and the switch. All the links in the
aggregation bond would be directly connected to this switch and the switch would need to be setup to handle
the aggregation chosen. What type of aggregation is done on the target system which may be connected to
this same switch or a different switch is independent of what method is used by the DDR.
There is a case where there may be one or more switches between the DDR and the target system, but it is still
considered a direct connection to the target server. The following diagram shows the network topology of
this setup. Notice that there is a separate switch for each link and the switches do not communicate with each
other. This is important because the IP address for each link on the DDR is the same. The target server would
also have to have a similar setup, with the IP address at the media server being shared across its links.
[Figure: The Data Domain appliance's eth2 connects through Network switch A to the backup/media server's en5, and eth3 connects through Network switch B to en4; the two switches are not connected to each other.]

This setup may be done to handle distances that are too long for direct connect, but the user still wants to
directly connect the two systems. In this case the aggregation handshake would be between the two end
systems. It would be expected that round robin would be used and would have to be set up on both sides.
There are some concerns with this setup when dealing with failures. If the link between the target system and the
switch goes down, the local system would have to be able to detect this and send everything over the other
link(s). For example, suppose the link between switch B and en4 is broken. The media server would sense the
carrier is lost and route the traffic to en5, but the driver for eth3 on the DDR would also have to be able to sense
this and indicate a carrier-down condition to the bonding module so the DDR would also route all the traffic
through eth2. With the current software and switch hardware this is not done, and since the switches are
isolated the packets would just get dropped.

Failover of NICs
The case of the pure failover is different. In this case the bonded links do not necessarily need to be connected
to the same switch or router as long as all the links in the bond can transfer data to the target system. With
failover the data is not split among the links. It is sent over one link only and that link is referred to as the active
link. A single virtual IP is shared across all links in the failover bond, but the MAC does not necessarily need to
be the same. While the active link is up, the other links are idle. When a failure occurs, the DDR sends outgoing
packets over another link and redirects the received packets to that link through the use of ARP. To get the
receive packets to go to the new active link, ARP is turned off for all the links except the active link and a
gratuitous ARP is sent on the new active link. This updates the ARP caches in the associated switches and
routers.

Failover Associated with Link Aggregation


When aggregation is being performed there is a limited failover capability. The bonding module can sense
when a link is down and for link aggregation the bonding module will mark the link as down and will remove it
from the list of active aggregated links. This will also be conveyed to the associated switch or directly connected
system through the aggregation network protocol. If an aggregation network protocol is not supported, the other
side will also sense that a link is down and stop using it for aggregation. Once the link is brought back up this
condition will be sensed by the aggregation software and data will flow over the link again.
While a link within the aggregated group is down the data will be distributed among the remaining links. So
communication will be maintained, but the throughput will be degraded from the level that is able to be achieved
by the full aggregation. Note this degradation may have a temporary impact on the retransmission of packets,
but over time that will be corrected as timeout values adjust.

Recommended Link Aggregation


The following is what should be considered when trying to decide on aggregation. If no aggregation is to be
done then failover should be considered; therefore the last choice given in each list is failover as an alternative to
aggregation. When considering aggregation, some important things to consider are: How many media servers
are simultaneously and actively doing backups? If the number is less than 4, xor-L2 will not be very effective. As the
number of aggregated links increases, the number of active clients will also need to increase if xor-L2 is to be
used. With three aggregated links the number of active clients should be above 5. What is the network
topology? If the route goes through gateways, using xor-L2 won't work because the destination MAC will be that of the
gateway router.
Direct Connect
1. mode roundrobin (if it is supported by the media servers)
2. separate NIC per media server (if there are enough NICs)
3. mode xor-L3L4
4. failover (if aggregation cannot be used)
Private Network (fewer than 4 active clients or the route has gateways in the path)
1. separate NIC per media server (if there are enough NICs)
2. mode xor-L3L4
3. mode xor-L2 (if there are a suitable number of clients)
4. failover (if aggregation cannot be used)
Private Network (more than 4 active clients and the network path has no gateways)
1. mode xor-L2
2. mode xor-L3L4
3. separate NIC per media server (if there are enough NICs)
4. failover (if aggregation cannot be used)
Local Network (fewer than 4 active clients or the route has gateways in the path)
1. separate NIC per media server (if there are enough NICs)
2. mode xor-L3L4
3. mode xor-L2 (if there are a suitable number of active clients)
4. failover (if aggregation can not be used)
Local Network (more than 4 active clients and the network path has no gateways)
1. mode xor-L2
2. mode xor-L3L4
3. separate NIC per media server (if there are enough NICs)
4. failover (if aggregation can not be used)
Remote Network (normally through gateway and routers)
1. separate NIC per media server (if there are enough NICs)
2. mode xor-L3L4
3. failover (if aggregation can not be used)

Switch information
Link aggregation is set up on both sides of a link. The link aggregation does not necessarily have to match on
both sides of the link. For example, the DDR may be set to xor-L3L4 but the switch may be set to src-ip. A
good rule of thumb is to keep the aggregations close, such as xor-L3L4 on the DDR and src-dst-port on
the switch. The reason for this is that if an aggregation is good enough for one direction it is good enough for the
other direction.
Aggregation on the switch is used to distribute traffic being received by the DDR. If the main set of operations
being done is backup the switch aggregation is very important. Backup network traffic is mostly data being
received by the DDR.

Because of the limited number of clients communicating with the DDR, the recommended aggregation method is
balance-xor with Layer 3+4 hashing. To support this, the device directly connected to the DDR, e.g. the switch or
router (see Normal Link Aggregation), needs to support src-dst-port or at least src-port load balancing. This
section uses the vendors' documentation to identify potential switches that may work with the Layer 3+4
hashing and also some that may not. There are no plans to validate or certify these. The final authority on whether
a switch supports the desired aggregation is to physically try it. For example, there is at least one case where
round robin was desired, tried, and worked satisfactorily even though it is listed as not supported. Note
again, even though round robin may be supported by a switch, the aggregation performance may be poor or even
worse than not having it at all, mostly due to out-of-order packets.
Note: There are few switches that support Layer 3+4 aggregation. The supported aggregation may be for
Layer 3 only or Layer 4 only. Matching Layer 4 (port aggregation) with Layer 3+4 (IP address and port
aggregation) is not a problem, but be aware that it may cause data to be sent on one link and received on a
different link; the concern of out-of-order packets should not arise. Which link the data is sent on is not
important as long as all the data associated with a connection is sent on the same link.
Definitions:
Dest := Destination
IP := IP address
L4 := Layer 4 of the network stack, i.e. TCP
MAC := mac or hardware address
Port := TCP port number
Src := Source
SW := software
Switch brand & model            Vendor SW release   MAC hash          IP hash           L4 port hash      Round
                                                    (src/dest/both)   (src/dest/both)   (src/dest/both)   robin
Cisco Catalyst 6500 CatOS       8.6                 Yes/Yes/Yes       Yes/Yes/Yes       Yes/Yes/Yes       No
Cisco Catalyst 6500 IOS         12.2SXF             Yes/Yes/Yes       Yes/Yes/Yes       Yes/Yes/Yes       No
Cisco Catalyst 3560             12.2(44)SE          Yes/Yes/Yes       Yes/Yes/Yes       No/No/No          No
Cisco Catalyst 2960             12.2(44)SE          Yes/Yes/Yes       Yes/Yes/Yes       No/No/No          No
Cisco Catalyst 3750             12.2(44)SE          Yes/Yes/Yes       Yes/Yes/Yes       No/No/No          No
Cisco Catalyst 4500/4948/4924   12.2(37)SG          Yes/Yes/Yes       Yes/Yes/Yes       Yes/Yes/Yes       No

For directly connected systems the support for round robin is as follows:

Sun - yes
AIX - yes, it can
HPUX - no
Windows - maybe; it depends on the NIC software, but don't count on it.

Cisco Configuration
Set the EtherChannel mode to on (manually set the ports to participate in the channel group).

DDR Configuration        Cisco Load Balance Configuration
xor-L3L4                 src-dst-port
xor-L2                   src-dst-mac
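As a sketch of how the two sides might be paired for the recommended case (the switch port numbers and DDR interface names are assumptions; consult the Cisco configuration guide listed in the References for the exact syntax on a given model):

! Cisco IOS side: static EtherChannel with Layer 3+4 (port-based) load balancing
port-channel load-balance src-dst-port
interface range GigabitEthernet1/1 - 2
 channel-group 1 mode on

# DDR side: matching balanced-xor aggregation with Layer 3+4 hashing
Net aggregate add veth0 mode xor-L3L4 interfaces eth2 eth3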

Appendix
This appendix gives more details as to what aggregation is normally offered by the Linux system used by
Data Domain. The other options are not made available because they do not provide better aggregation or
failover than what is already available. It is expected that this section will be used by developers.
Data Domain uses the link aggregation and failover provided by the bonding module available in the Linux
distribution. The bonding module was developed separately from the Linux OS, but is now provided with each
distribution under drivers/net/bonding. For each mode used on the system a separate bonding module instance is
loaded. To do this, each bonding module instance is tied to a specific virtual interface. The names used by Data
Domain are veth0, veth1, veth2, and veth3. You can see these, along with all the physical interfaces available and
which aggregation or failover mode is being used, with the command:
net show settings

Bond functions available in Linux distribution


The following is a summary of the bonding modes and hashing available. A more complete description can be
found in Documentation/networking/bonding.txt in the Linux distribution:

Mode Options
1. balance-rr or BOND_MODE_ROUNDROBIN (0)
Aggregation using Round Robin
Failover with degradation
Normally a good type to use with direct connect or something equivalent
To get full aggregation both ends of the link needs to be set up to use round robin
2. active-backup or BOND_MODE_ACTIVEBACKUP (1)
Failover method used by Data Domain
Works only when one or more standby links are in the group
The active link is non-deterministic unless a primary is specified
3. balance-xor or BOND_MODE_XOR (2)
Note: this only aggregates transmissions. The receive needs to be aggregated on the other end
Send transmit to a specific NIC based on specified hash method being used
Default (Source MAC address XOR Destination MAC address) modulo size of aggregation
group
This mode is used when mode XOR is specified in the CLI
4. broadcast or BOND_MODE_BROADCAST (3)
Failover; sends everything on all links in the group
This mode is not available when using the Data Domain shell CLI
5. 802.3ad or BOND_MODE_8023AD (4)
IEEE dynamic link aggregation using the same hash as used by mode 3
The aggregation is determined by the hash method chosen, layer 2 is the default
Requires the driver to support ethtool
Requires the switch to support the IEEE 802.3ad standard, specifically the protocol
Requires the same IP address and MAC address across all the slaves
All the slaves must run at the same speed and be connected to the same switch
This mode is not available when using the Data Domain shell CLI
6. balance-tlb or BOND_MODE_TLB (5)
Aggregation and failover; does not require switch support, but deals with transmit only.
Aggregated according to the load of each link and the link speed.
Different speed links can be used.
The MAC addresses across the links do not have to be the same.
Matching MACs may be required for receive aggregation, but it is not tied to this.
The associated drivers must support the ethtool interface
This is not currently supported by the Data Domain shell CLI.
7. balance-alb or BOND_MODE_ALB (6)
Adaptive link aggregation for both transmit and receive without switch support
Uses ARP to control which link the receive uses. Different MAC addresses are used for each link
Links can be added to and removed from the aggregation group.
The switch should not be using aggregation to this system.
Links with different speeds can be used, but this is not recommended.
The associated drivers must support the ethtool interface
This is not currently supported by the Data Domain shell CLI.

Hash method used:


1. Layer 2
Uses (source MAC XOR destination MAC) modulo count of links in aggregation group
This works best when there are many hosts and they are connected to the same switch
All packets to or through a specific MAC address (first hop peer) go through the same link
2. Layer 3+4
Uses ((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff))
modulo count of links in aggregation group
This works best with many connections
This can work with many systems or just two systems
For packets that do not include the port number such as IP fragmentation packets and non-TCP
and non-UDP packets this method will use the IP address. For non-IP packets the Layer 2
mode is used. It is because of these exceptions that this is not IEEE compliant. Note that the
configuration is set up so that the packets are not fragmented.
3. Layer 2+3
Uses ((source MAC XOR dest MAC) XOR ((source IP XOR dest IP) AND 0xffff)) modulo count of links in aggregation group
This works best when there are many target hosts.
Not available in the bonding version used by Data Domain
Note, in some cases there is the comment that a feature is not supported by the CLI. This means that the
feature is in the code, but cannot be activated through the CLI. In the case of hash method three, the feature is in
later versions of the Linux code, but it is not in the current version used by Data Domain.
The CLI simplifies the interface by not requiring the administrator to specify any more than is necessary.
Therefore there is not an option to specify mode 5, 802.3ad, directly. Rather by specifying the hash method to
use the CLI will set the mode to number 5. Modes 6 and 7 do not support hash mode 2. So if these are
enabled in the CLI they would be referenced only by the mode name. For example, for mode 7 all that would be
specified would be balance-alb.
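For reference, the CLI choices map roughly onto the underlying bonding module options as follows (a sketch based on the bonding documentation; how DDOS passes these options internally is an assumption):

# roundrobin  -> mode=balance-rr
# xor-L2      -> mode=balance-xor xmit_hash_policy=layer2
# xor-L3L4    -> mode=balance-xor xmit_hash_policy=layer3+4
# failover    -> mode=active-backup
modprobe bonding mode=balance-xor xmit_hash_policy=layer3+4 miimon=100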
The aggregation method used is very important to getting the desired performance. Consider the different
network topologies that may be used, but especially consider the number of target systems. If there is only one
target system then aggregation methods that use the MAC address or the IP address will not be effective
because the addresses will always be the same. Also consider having a limited number of target systems. As
long as the addresses allow the traffic to go over different links there will be some aggregation, but if one system
has much more data than the other, or if the target systems are not transferring data at the same time, then the
aggregation will not provide the desired performance. This is why it is recommended that the Layer 3+4 hash
method be used along with using many streams. The multiple streams will create multiple connections and
each connection will have at least one unique port number. If the port numbers are suitably distributed they will
distribute the traffic across multiple links based on the port number.
