Introduction
Everything changes. In the early 1990s the microprocessor was a prized possession. By the year 2000, PCs were running microprocessors at clock speeds in the GHz range. But the way in which I/O was carried out remained more or less the same. The processor can now deliver data at blistering speeds, but the I/O subsystem that is supposed to accept that data cannot keep up. The bottleneck is the shared bus architecture.
As shown in Fig 1, the PCI devices are all attached to a parallel PCI bus, for which they all contend. In this kind of scenario, contention is inevitable. The performance chart is shown below.
PCI-X (133 MHz): 8 Gbps; DDR*: 16 Gbps; QDR**: 32 Gbps. Architecture: shared parallel bus. Limitation: bus contention.
Though the maximum bandwidth shown in the tables looks enormous, the bandwidth actually at hand turns out to be about 533 Mbps for the 66 MHz version of PCI. Also, because the PCI bus is shared, the fanout has to be lowered as the operating frequency is increased. This means that the number of devices that can be attached to the bus decreases. So PCI does not look like a viable option for next-generation I/O systems, though it looks set to exist for quite some time because of its wide market acceptance. What could be the solution to the bus contention issue? It is the use of serial switched architectures. InfiniBand is a technology that employs a serial switched architecture.
Fibre Channel (FC) is a proven technology in the field of data storage. Gigabit Ethernet (GigE) is also coming up in a big way, and networking is its main selling point. But what about server clustering? Server clustering needs a low-overhead, fast messaging service that is very reliable. This is where InfiniBand scores. Unlike other networking technologies, InfiniBand is designed to bypass the overhead of multi-layered protocol processing. The comparison in other areas is shown in the graphic.
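The peak figures in the performance chart can be reproduced with simple arithmetic: a parallel bus moves one bus-width of bits per transfer. The sketch below is only an illustration; the 64-bit bus width and the function name are assumptions that do not come from the text, and the results are theoretical peaks, not the bandwidth a device sees once it has to contend for the bus.

```python
def peak_bandwidth_gbps(bus_width_bits: int, clock_mhz: float,
                        transfers_per_clock: int = 1) -> float:
    """Theoretical peak of a parallel bus in Gbps (1 Gbps = 1000 Mbps)."""
    return bus_width_bits * clock_mhz * transfers_per_clock / 1000.0


# Assumed 64-bit bus at 133 MHz; DDR and QDR move 2 and 4 transfers per clock.
for name, transfers in [("PCI-X", 1), ("PCI-X DDR", 2), ("PCI-X QDR", 4)]:
    print(f"{name}: {peak_bandwidth_gbps(64, 133, transfers):.1f} Gbps")
```

Under these assumptions the results come out slightly above the rounded 8, 16, and 32 Gbps values quoted in the chart.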
Application focus - InfiniBand: server I/O and clustering; Gigabit Ethernet: local area networks; Fibre Channel: storage area networks.
Data transport/reliability - InfiniBand: high reliability; Gigabit Ethernet: data packets dropped during congestion, no failover capability; Fibre Channel: high reliability.
Systems management - InfiniBand: built-in, in-band fabric and hardware management; Gigabit Ethernet: no form factors or built-in management systems; Fibre Channel: no form factors or built-in management systems.
Components of InfiniBand
System Area Layout
Fig 4. InfiniBand topology
Fig 4 shows the InfiniBand topology in its most basic form. A node could be a server, a PC, or an I/O device such as a RAID subsystem. The fabric may be a single switch or an interconnection of switches and routers. All connections in this topology are switched, i.e. they are point to point, which eliminates the contention of a shared bus. Also, because the links are serial, they require only four cables instead of the wide parallel connection of the PCI bus.
Fig 5. A system-level view of the basic topology
In the system-level view (Fig 5) there are certain elements that need explanation. The leftmost part of the figure depicts the internals of a node. The memory controller is connected to a Host Channel Adapter (HCA), which is the node's entry point into the fabric. The HCA provides an interface through which InfiniBand integrates with the operating system. The HCA links the node to the switch, which in turn is connected to a number of Target Channel Adapters (TCAs). The TCAs interface target I/O devices such as RAID and JBOD subsystems with the InfiniBand fabric. Each TCA serves a specific kind of target, though multi-utility TCAs are also possible. These channel adapters contain ports, and a single TCA or HCA can contain more than one port. The ports connect the node to the fabric and vice versa.
InfiniBand Architecture
As is evident from Fig 7, InfiniBand operates via a network protocol stack. The stack is compared with the layers of the OSI model for convenience.
Fig 7. InfiniBand Protocol Stack compared with the OSI network Model
Source: InfiniBand Architecture Tutorial Hot Chips by Daniel Cassiday (InfiniBand Trade Association)
At the top, the client layers communicate in the form of transactions. These transactions are composed of messages that are moved through the transport layer. The messages are then divided into packets at the network layer, as shown in the graphic. IB routers can route these packets across network domains, using a global identifier called the GID [3] for this purpose. For routing within a subnet at the data-link layer, an identifier local to the subnet is used, known as the LID [3]. This is generally done by an IB switch.
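The layering described above (a message split into packets, each addressed by a LID within the subnet and optionally a GID across subnets) can be sketched as follows. This is an illustrative model only: the `Packet` fields, the `mtu` parameter, and the function names are assumptions made for the example and do not reflect the actual InfiniBand packet format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    dest_lid: int            # local identifier, used for switching inside a subnet
    dest_gid: Optional[str]  # global identifier, only needed to cross subnets
    sequence: int            # position of this packet within the message
    payload: bytes


def segment_message(message: bytes, mtu: int, dest_lid: int,
                    dest_gid: Optional[str] = None) -> List[Packet]:
    """Split one message into packets carrying at most `mtu` payload bytes."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    return [
        Packet(dest_lid, dest_gid, seq, message[offset:offset + mtu])
        for seq, offset in enumerate(range(0, len(message), mtu))
    ]


def reassemble(packets: List[Packet]) -> bytes:
    """Rebuild the message from its packets, ordered by sequence number."""
    return b"".join(p.payload for p in sorted(packets, key=lambda p: p.sequence))
```

A packet with `dest_gid` left as `None` stands for traffic that stays inside one subnet and is handled by switches alone; setting it models a packet that a router would forward to another subnet.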
At the lowest layer of the stack (which corresponds to the physical and data-link layers of the OSI model) the standards are more or less similar to those of FC. InfiniBand uses both optical fibre and copper cables. IB specifies a bit error rate of 10^-12 and uses 8B/10B encoding, which means that for every 8 bits of data to be sent, 10 bits are actually sent over the physical cabling.
IB also supports aggregating 4 or 12 physical lanes [6] into a single link; these are known as 4X and 12X respectively. Moreover, IB cabling is full duplex, i.e. a 4X channel contains 4 send and 4 receive lanes. This combination gives higher throughput. Though there are 4 lanes, they are treated as a single entity for management purposes.
IB incorporates a concept of segmenting bandwidth using virtual lanes (VLs) [6]. VLs are formed by a multiplexing arrangement in which unrelated data flows share the same link. IB has configurations of 1, 2, 4, 8, and 15 virtual lanes. VL15 is used only for network management; the rest are data lanes. By implementing this, IB allows multipoint communication among nodes and provides better utilization of the fabric.
IB also provides a method to logically group together nodes that are otherwise physically distant. This is known as partitioning [6]. It is analogous to VLANs in Ethernet data networks.
For the Receive Queue, the only type of operation is Post Receive Buffer, which identifies a buffer into which a client may send data, or from which it may receive data, through a Send, RDMA Write, or RDMA Read operation.
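The effect of 8B/10B encoding and link width on usable bandwidth can be worked out with a short calculation. The sketch below assumes a signalling rate of 2.5 Gbps per lane, which is the rate of first-generation InfiniBand links but is not stated in the text above; the function name is likewise only illustrative.

```python
LANE_SIGNALLING_GBPS = 2.5  # assumption: per-lane signalling rate, not given in the text


def data_rate_gbps(lanes: int, signalling_gbps: float = LANE_SIGNALLING_GBPS) -> float:
    """Usable data rate in one direction after 8B/10B encoding overhead."""
    # Only 8 of every 10 bits on the wire carry data.
    return lanes * signalling_gbps * 8 / 10


for width in (1, 4, 12):
    print(f"{width}X link: {data_rate_gbps(width):.1f} Gbps of data per direction")
```

Because the links are full duplex, the same data rate is available in each direction at once.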
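The role of a posted receive buffer can be illustrated with a toy model of a queue pair (QP). This is a conceptual sketch only: the class, method, and function names are hypothetical and are not the InfiniBand verbs interface, and only the plain Send case is modelled, with no RDMA operations.

```python
from collections import deque


class QueuePair:
    """Toy queue pair: a send queue, a receive queue, and a completion list."""

    def __init__(self) -> None:
        self.send_queue: deque = deque()     # pending send work requests
        self.receive_queue: deque = deque()  # buffers posted for incoming data
        self.completions: list = []          # stands in for a completion queue

    def post_send(self, data: bytes) -> None:
        self.send_queue.append(bytes(data))

    def post_receive_buffer(self, size: int) -> bytearray:
        buf = bytearray(size)
        self.receive_queue.append(buf)
        return buf


def transfer(sender: QueuePair, receiver: QueuePair) -> None:
    """Deliver the oldest queued Send into the receiver's oldest posted buffer."""
    if not sender.send_queue:
        raise RuntimeError("nothing queued to send")
    if not receiver.receive_queue:
        raise RuntimeError("receiver has not posted a receive buffer")
    data = sender.send_queue[0]
    if len(data) > len(receiver.receive_queue[0]):
        raise RuntimeError("posted buffer is too small")
    sender.send_queue.popleft()
    buf = receiver.receive_queue.popleft()
    buf[:len(data)] = data
    sender.completions.append(("send", len(data)))
    receiver.completions.append(("recv", len(data)))
```

The point of the model is that an incoming Send has nowhere to land unless the receiving side has posted a buffer beforehand, so `transfer` fails when the receive queue is empty.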
Types of services:
IB provides 5 different types of transport services [6]:
1. Reliable Connection
2. Unreliable Connection
3. Reliable Datagram
4. Unreliable Datagram
5. Raw Datagram
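The five services differ along two axes: whether delivery is reliable and whether communication is connection-oriented. The sketch below encodes that classification as data. The flags are read off the service names; treating Raw Datagram as unreliable and connectionless is an assumption, and the type and function names are purely illustrative.

```python
from typing import List, NamedTuple


class TransportService(NamedTuple):
    name: str
    connected: bool  # True for connection-oriented, False for datagram
    reliable: bool   # True if delivery is acknowledged and ordered


SERVICES = [
    TransportService("Reliable Connection", connected=True, reliable=True),
    TransportService("Unreliable Connection", connected=True, reliable=False),
    TransportService("Reliable Datagram", connected=False, reliable=True),
    TransportService("Unreliable Datagram", connected=False, reliable=False),
    TransportService("Raw Datagram", connected=False, reliable=False),  # assumed flags
]


def matching_services(connected: bool, reliable: bool) -> List[str]:
    """Names of the services with the requested properties."""
    return [s.name for s in SERVICES
            if s.connected == connected and s.reliable == reliable]
```

For example, `matching_services(connected=True, reliable=True)` selects the service suited to the low-overhead, reliable messaging that clustering needs.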
InfiniBand advantages: lower cost; simpler for chip-to-chip connections; clustering; scalability; quality of service; security; fault tolerance; multicast; fabric convergence; support for PCB, copper, and fiber.
Conclusion
The response to IB has been positive. According to analysts, a large percentage of servers will soon be IB-enabled. This growth will take place when IB becomes native to the server motherboard. It is predicted that IB will soon be used as a technology for clustering and storage as well as networking. The predictions may be positive, but the IT world is such that what is hot property today may be obsolete tomorrow. What lies in store for InfiniBand, only time will tell.
Glossary
1. AGP - Accelerated Graphics Port
2. BW - Bandwidth
3. CPU - Central Processing Unit
4. CQE - Completion Queue Element
5. DDR - Double Data Rate
6. FC - Fibre Channel
7. GID - Global Identifier
8. GigE - Gigabit Ethernet
9. HCA - Host Channel Adapter
10. IB - InfiniBand
11. IBTA - InfiniBand Trade Association
12. ISA - Industry Standard Architecture
13. LID - Local Identifier
14. PCI - Peripheral Component Interconnect
15. QDR - Quadruple Data Rate
16. QP - Queue Pair
17. RAM - Random Access Memory
18. RDMA - Remote Direct Memory Access
19. SNIA - Storage Networking Industry Association
20. TCA - Target Channel Adapter
21. VI - Virtual Interface
22. VL - Virtual Lane
23. WQE - Work Queue Element
References
1. InfiniBand Architecture Tutorial, Hot Chips, by Daniel Cassiday (InfiniBand Trade Association)
2. Introduction to the Value Proposition of InfiniBand, by Marc Staimer (Dragon Slayer Consulting)
3. An Introduction to InfiniBand Architecture, by Odysseas Pentakalos (http://www.oreillynet.com/pub/a/network/2002/02/04/windows.html)
4. How PCI Works, by Jeff Tyson (http://www.howstuffworks.com/pci.htm)
5. Understanding InfiniBand, by Gene Risi & Philip Bender
6. Building Storage Networks, 2nd Edition, by Marc Farley (Storage Networking Industry Association)