
Architecture Guide

Hitachi High-performance NAS Platform, Powered by BlueArc

Hitachi Data Systems



Executive Summary
Hitachi High-performance NAS Platform's SiliconServer Architecture, provided by BlueArc, enables a revolutionary step in file servers by creating a hardware-accelerated file system that can scale throughput, IOPS, and capacity well beyond conventional software-based file servers. With its ability to virtualize a large storage pool of up to 512TB of multitiered storage, High-performance NAS Platform can scale with growing storage requirements and become a competitive advantage for business processes. This document sets forth the technical details of the architecture to help technical readers understand the unique hardware-accelerated design and the object-based file system.

Introduction

With the massive increase in the number of desktop users, high-end workstations, application servers, high-performance computing (HPC) deployments, and nodes in compute clusters over the last decade, conventional network attached storage (NAS) solutions have been challenged to meet the resulting acceleration in customer requirements. While file server vendors have offered systems with faster off-the-shelf components and CPUs as they became available, storage demands have far outpaced the ability of these CPU-based appliances to keep up. To meet the increasing performance and capacity requirements, companies have been forced to deploy multiple NAS appliances concurrently, reducing the benefit of NAS, decentralizing data, and complicating storage management.

Many organizations looking for a solution to this performance deficit turned to storage area network (SAN) implementations, but SANs present challenges they did not experience with NAS. The first is high infrastructure cost. Adding one or two expensive Fibre Channel host bus adapters (HBAs) to each high-end workstation, each application and database server, and each cluster node is an expensive proposition compared to using existing Ethernet network interface cards (NICs). Expensive file system license fees, maintenance costs, and complexity add to the burden. But by far the biggest challenge to customers is that a SAN alone does not provide the standards-based shared file access needed for simple data management.

Another solution that some organizations are beginning to look at is the concept of storage clusters or grids; both will be referred to as storage-grids for the remainder of this discussion. For conventional NAS appliance vendors that cannot scale performance or capacity, this strategy is not an option but a necessity. Although storage-grids are interesting, they are far from ready for prime time. Consider the rise of compute-clusters as an allegory. In the summer of 1994, Thomas Sterling and Don Becker, working as contractors to NASA, built a clustered computer consisting of 16 DX4 processors connected by channel-bonded Ethernet. They called their machine Beowulf. Now, years later, compute-clusters are commonly used in research and are gaining wider acceptance in commercial enterprises. The key to this acceptance is that the complex software that ties compute-clusters together and distributes tasks to the nodes has finally matured to a point where companies can rely upon it for stable services. Some aspects of compute-clusters will translate directly to storage-grids; however, enormous complexities are introduced as well. Locking, cache coherency, client-side caching, and many other aspects of sharing a file system make it a daunting task. This will be

solved over time, but as with compute-clusters, it will take a significant amount of time to mature. The Internet Engineering Task Force (IETF) is proposing a new standard called pNFS. pNFS is an extension to NFSv4 and will help focus the industry toward a standards-based solution. The SiliconServer Architecture design illustrates a commitment to standards-based protocols and methods while taking advantage of its unique hardware architecture for acceleration of data flow. Moreover, the SiliconServer Architecture delivers a significantly less complex solution by keeping the node count low while still achieving the desired performance. It provides the fastest nodes for storage-grids, ensuring reduced complexity and cost while delivering best-in-class performance and scalability.

The SiliconServer Architecture was developed in 1998 to overcome the limitations of scaling an individual CPU-based NAS server. It was a fresh approach to the problem, founded on the belief that file services could be accelerated using hardware-based state machines and massive parallelization, in the same way Ethernet and TCP/IP had been accelerated by switch and router vendors. The network vendors moved from software-based solutions to hardware-accelerated solutions to accelerate packet-flow processing. It seemed logical that file services should follow this same evolution, and that an entirely new design would be required to attain the same benefits as experienced in networking.

SiliconServer's Founding Design Principles

SiliconServer adheres to key design principles: keeping it simple, maintaining a standards-based approach, and ensuring data availability while enabling significant increases in performance and capacity. High-performance NAS Platform uses the fourth-generation SiliconServer, which delivers significant performance, throughput, and capacity gains, as well as an excellent return on investment. This Architecture Guide discusses the SiliconServer Architecture in detail, including its hardware architecture, unique object-based file system structure, and methods of acceleration. It also looks at how this architecture enables innovative product designs to achieve performance and capacity levels dramatically higher than competitors in the network storage market and, in fact, allows the High-performance NAS Platform to be a NAS and iSCSI hybrid product that delivers the benefits of both SAN and NAS.

The High-performance NAS Platform is the fastest filer node to date, and Hitachi Data Systems will continue to enhance offerings at the node level, as well as ensure a path to storage-grids for organizations whose needs exceed the throughput of the High-performance NAS Platform. The High-performance NAS Platform supports all lines of storage offered by Hitachi Data Systems. Organizations can have the industry's most powerful storage and the fastest NAS in one solution.

Architecting a Better Network Storage Solution


It Starts with the Hardware Design

Looking at the network sector's technology advancements, networking progressed from routing functionality on standard UNIX servers, to specialized CPU-based appliances, to hardware-based router and switch solutions, where all processing is handled in the hardware, utilizing custom firmware and operating systems to drive the chipsets. SiliconServer Architecture was an evolutionary design that applied these proven market fundamentals to the NAS market. The goals were to solve the many pain points experienced by organizations currently employing NAS and SAN in the areas of performance, scalability, backup, management, and ease of use. The requirements were to build the best possible single or clustered NAS server that could track and even exceed the requirements of most customers, and that could also be leveraged to create a simpler storage-grid to scale even further as the data crunch continued to grow. To create the next-generation NAS server, the architecture required:

Significantly higher throughput and IOPS than CPU-based appliances and servers
Highly scalable storage capacity that would not reduce system performance
The ability to virtualize storage in order to extract maximum performance
Adherence to traditional protocol standards
Flexibility to add new innovative features and new protocol standards

Unlike conventional CPU-based NAS architectures, where you never really know how fast they are going to go until you try them, SiliconServer Architecture is designed from the start to achieve a certain level of performance. The design engineers decide up front how fast they want it to go based on what they think is achievable at acceptable cost with the appropriate technologies. This is the beauty of a silicon-based design: the on-paper goals translate directly into the final product capabilities. SiliconServer Architecture has consistently produced the sought-after performance anticipated in the design process and has met or exceeded customer expectations for a network storage solution.

The Modular Chassis

The High-performance NAS Platform chassis design was the first critical design consideration, as it would need to scale through newer generations of modules supporting increased throughput, IOPS, and scalability. The modular chassis design therefore needed to scale to 40Gbit/sec total throughput. The passive backplane design was chosen to support these requirements. The backplane has no active components and creates the foundation for a high-availability design, which includes dual redundant hot-pluggable power supplies and fans, as well as dual battery backup for NVRAM.

The passive backplane incorporates pathways upon which Low Voltage Differential Signaling (LVDS) guarantees low noise and very high throughput. The ANSI EIA/TIA-644 standard for LVDS is well suited for a variety of applications, including clock distribution and point-to-point and point-to-multipoint signal distribution. Further discussion of LVDS is beyond the scope of this paper; however, a simple Internet search will return significant information if you wish to understand LVDS technology. The hardware-based logic, or Field Programmable Gate Arrays (FPGAs), connects directly to these high-speed LVDS pipelines (also known as the FastPath Pipeline), meeting the high-throughput requirement of current and future product designs.

A key advantage of this design is the point-to-point relationship between the FPGAs along the pipelines. While traditional computers are filled with shared buses requiring arbitration between processes, this pipeline architecture allows data to transfer between logical blocks in a point-to-point fashion, ensuring no conflicts or bottlenecks. For example, data being processed and transferred from a network process to a file system process is completely independent of all other data transfers; it would have no impact on data moving to the storage interface. This is vastly different from conventional file servers, where all I/O must navigate through shared buses and memory, which can cause significant performance reductions and fluctuations. The backplane provides separate pipelines to transmit and receive data, meeting only on the storage module, in order to guarantee full-duplex performance. The convergence at the storage module allows a read directly from cache after a write; this is discussed further in this Architecture Guide.

The Modules

Four physical modules are inserted into the rear of the chassis. These are the Network Interface Module (NIM), two File System Modules (FSA and FSB), and the Storage Interface Module (SIM). Those who need to accelerate performance in a CIFS-intensive environment have the option to choose the new File System Module X (FSX) instead of the File System A module. Each module has clear responsibilities and typically operates completely independently from the others, although the FSA/FSX and FSB modules do have a cooperative relationship. Next-generation modules will continue the advancement of performance, port count, updated memory, and FPGA speeds.

Network Interface Module (NIM)

Responsible for:
High-performance Gigabit Ethernet (GigE) connectivity
Hardware processing of protocols at OSI Layers 1-4
Out-of-band management access

The NIM is responsible for handling all Ethernet-facing I/O functions corresponding to OSI Layers 1-4. The functions implemented on the NIM include handling Ethernet and jumbo Ethernet frames up to 9000 bytes, ARP, the IP protocol and routing, and of course the TCP and UDP protocols. The NIM works as an independent unit within the architecture: it has its own parallel state machines and memory banks. Like the overall architecture design, the TCP/IP stack is serviced in hardware on the NIM module. This design allows it to handle 64,000 sessions concurrently. Multiple hardware state machines, programmed into FPGAs and running in a massively parallel architecture, ensure that there are no wait-states. This results in nearly instantaneous network response, the highest performance, and the lowest latency. In fact, the predecessor to the NIM was one of the world's first TCP Offload Engines (TOE), similar to the ones used in some PC-based appliances today.

The purpose-built NIM provides an ideal network interface to the High-performance NAS Platform. A key difference between the NIM and an off-the-shelf TOE card is the substantial amount of resources available. While most TOE cards have no more than 64MB of buffer memory, the NIM has more than 2.75GB of buffer memory supporting the parallel state machines in the FPGAs. This allows the NIM to handle significantly higher throughput and more simultaneous connections. TOE cards used in PC-based architectures are also limited by PCI and memory bus contention in the server, whereas pipelines in the High-performance NAS Platform are contention free. Also, TOE cards used on NAS filers usually handle only certain protocols, putting the burden on the central CPU to handle other protocols, which affects overall performance and functions, whereas in the High-performance NAS Platform FPGAs handle virtually all protocols.

The current NIM module of the High-performance NAS Platform offers six GigE ports with SFP (Small Form-factor Pluggable) media to allow for either optical or copper physical interconnects. The NIM supports link aggregation (IEEE 802.3ad), including the Link Aggregation Control Protocol (LACP), thus supporting dynamic changes to the aggregation and enabling higher availability and higher throughput to the data, which is critical for high-performance shared data environments. Future NIM modules will scale to higher numbers of GigE and 10GigE ports, allowing for increased throughput and connectivity. The NIM card also has four shared Fast Ethernet ports for out-of-band management, which allow for direct access and/or connection to the System Management Unit (SMU) and the other devices that make up the total solution.

File System Modules (FSA/FSX and FSB)

Responsible for:
Advanced features
OSI Layer 5, 6, and 7 protocols (NFS, CIFS, iSCSI, NDMP)
Security and authentication
SiliconFS (hardware file system)
Object store layer
File system attribute caching
Metadata cache management
NVRAM logging

The two File System Modules work collaboratively to deliver the advanced features of the High-performance NAS Platform. The FSB board handles data movement and the FSA handles data management. The FSA is not in line with the pipeline. Rather, this module controls several advanced management and exception-processing functions of the file system, much like the supervisor module of a high-end network switch controls the higher-order features of the switch. Snapshot, quotas, and file and directory locking are a few examples of processes managed by the FSA module. It accomplishes these tasks by sending instructions to the FSB module, which handles the actual data control and movement associated with them. The FSA module has dedicated resources in support of its supervisory role, including 4GB of memory. Administrators have the option to use FSX instead of FSA to accelerate performance in a CIFS-intensive environment.

As mentioned, the FSB module handles all data movement and sits directly on the FastPath pipeline, transforming, sending, and receiving data to and from the NIM and the Storage Interface Module (SIM). The FSB module contains the highest population of FPGAs in the system and also contains 19.5GB of memory distributed across different functions. It is the

FSB module that moves and organizes the data via the object-based SiliconFS file system. The file system is discussed in detail later in this Architecture Guide. When the FSB module receives a request from the NIM module, it inspects the request to determine what is required to fulfill it, notifies the FSA module of the arrival of the new request, and takes any action that the FSA may deem necessary, if any. The protocol request is decoded and transformed into the Object Store API for further processing. This critical point is an example of where the parallel state-machine architecture really shows its benefit. Several functions execute simultaneously:

The data is pushed into NVRAM to guarantee the data is captured
The data is pushed across the High Speed Cluster Interconnect to update the cluster partner NVRAM, if it exists
The data is sent over the FastPath pipeline to the SIM for further processing
A response packet is formed

Upon successful completion of all of these elements, the response packet can be transmitted by the FSB back to the NIM, which in turn sends the response back to the client. Thus, what would be four serial steps in a traditional file server are collapsed into a single atomic parallel step. This kind of parallelization occurs throughout the entire system whenever possible.

Storage Interface Module (SIM)

Responsible for:
Fibre Channel processing
SCSI command processing
Sector cache management
Parallel striping
Cluster interconnect
NVRAM mirroring

The SIM module has two distinct responsibilities. The first is the handling and management of raw data on the SAN storage back end. The second is the high-availability features of the SAN and the cluster interconnect (when configured in a cluster). The SIM provides the redundant back-end SAN connection to the storage pool using four Gigabit Fibre Channel ports. The SIM logically organizes LUNs on the SAN into a virtualized pool of storage so that data is striped across an adequate number of drives in order to provide the high-speed throughput required for the NAS head. This parallel striping is a key advantage, as it allows more drive spindles to be involved in all data transfers, ensuring the best storage performance. (The virtualized storage is covered in detail later in this Architecture Guide.)

The SIM provides the high-availability failover capabilities for clustered systems. The SIM has two HSCI (High Speed Cluster Interconnect) ports used both for cluster communications and as the avenue for mirroring NVRAM between nodes for the highest degree of data protection. The SIM card uses two 10GigE ports for clustering, which provide an extremely fast HSCI connection. These connections are required for additional N-way cluster performance and can handle the increased inter-node communication required to support a high-performance storage-grid.

Memory (Buffers and Caches)

In order to achieve the performance design goal, there are a number of considerations to take into account. In particular, the minimum memory bandwidth requirements throughout the system are critical. High-performance NAS Platform has a robust set of memory pools, each dedicated to certain tasks, and these pools must operate within certain tolerances in order to achieve the desired performance. The amount of memory in each module is summarized below. This memory is distributed across various tasks on each module. By segregating memory pools (there are several dozen in the entire system) and ensuring that each has adequate bandwidth, the SiliconServer Architecture ensures that memory access will never be a bottleneck. This is critical to sustaining the High-performance NAS Platform's high-throughput performance. High-performance NAS Platform model 2200 has 32GB of distributed memory, cache, and NVRAM, distributed as follows:

NIM module: 2.75GB, network processing
FSA: 4GB, protocol handshaking and file system management
FSB: 15.5GB, metadata, NVRAM, and control memory
SIM: 9.75GB, sector cache and control memory
Total memory: 32GB

In designing memory requirements for a high-speed system, two key elements must be taken into consideration. First, peak transfer rates on an SDRAM interface cannot be sustained due to various per-transfer overheads. Second, for the various memories contained in each of the modules, the memory bandwidth must be doubled to support simultaneous high-performance reads and writes, as data is written into memory and pulled back out. Thus, memory bandwidth in the architecture is designed to have approximately 2.5 times (2.5x) the bandwidth required to sustain throughput. There are also areas of the architecture where memory bandwidth is even greater. The SIM, for example, has 8GB of raw block sector cache. On this module, the transmit and receive FastPath pipelines intersect, as data that is written into the sector cache must be immediately available for reading by other users even though the data may not yet have made it to disk. In this scenario, four simultaneous types of access into the memory must be considered:

Writes coming from the FSB
Reads being returned to the FSB
Updates of the cache from data coming from the SAN
Reads from the cache in order to flush data to the SAN

As a result, the SIM's sector cache must deliver 5x the bandwidth of the desired throughput of the overall system, which it does.

Field Programmable Gate Arrays (FPGAs)

At the heart of the SiliconServer Architecture is a unique implementation of parallel state machines in FPGAs. An FPGA is an integrated circuit that can be reprogrammed in the field, giving it the flexibility to perform new or updated tasks, support new protocols, or resolve issues. Upgrades are accomplished via a simple upgrade, as performed on switches or routers today, which can change the FPGA configuration to perform new functions or protocols. Today's FPGAs are high-performance hardware components with their own memory, input/output buffers, and clock distribution, all embedded within the chip. FPGAs are similar

to ASICs (Application Specific Integrated Circuits), which are used in high-speed switches and routers, but ASICs are not reprogrammable in the field and are generally used in high-volume, non-changing products. Hardware developers sometimes do their initial designs and releases on an FPGA, as it allows for quick ad hoc changes during the design phase and short production runs. Once the logic is locked down, they move the logic to an ASIC as product volumes ramp and all features are locked in, to get to a fixed, lower cost design. Yet in the SiliconServer Architecture, the FPGA serves as the final design implementation in order to provide the flexibility to add new features and support new protocols in hardware as they are introduced to the market. High-performance switches and routers use FPGAs and ASICs to pump network data for obvious reasons. Now, with the High-performance NAS Platform, the same capability exists for network storage.

For an analogy of how the FPGAs work, think of them as little factories. There are a number of loading docks called input/output blocks, workers called logic blocks, and, connecting everything up, assembly lines called programmable interconnects. Data enters through an input block, much like a receiving dock. The data is examined by a logic block and routed along the programmable interconnect to another logic block. Each logic block is capable of doing its task unfettered by whatever else is happening inside the FPGA. These are individual tasks, such as looking for a particular pattern in a data stream or performing a math function. The logic blocks perform their tasks within strict time constraints so that all finish at virtually the same time. This period of activity is gated by the clock cycle of the FPGA. High-performance NAS Platform FPGAs operate at 50 million cycles per second.
[Figure: Parallel FPGA state machines, each paired with dedicated memory and driven by a common clock, concurrently processing functions such as TCP/IP, NFS, CIFS, iSCSI, NDMP, metadata, block retrieval, block allocation, NVRAM, Fibre Channel, snapshots, and virtual volumes.]

Given the 750,000+ logic blocks inside the High-performance NAS Platform modules, this yields a peak processing capability of approximately 50 trillion tasks per second, over 10,000 times more tasks than the fastest general-purpose CPU. (Note: as of this writing, Intel's fastest microprocessor was rated for 3.8 billion tasks per second.) This massive parallel processing capability is what drives the architecture design and allows throughput improvement of nearly 100 percent per product generation. This contrasts sharply with conventional network storage servers, which rely on general-purpose CPUs that have only been able to scale at approximately 30 percent per product generation.

The fundamental failing of the CPU is that with each atomic step, a software delay is introduced as tasks, which are serially queued up to be processed, demand system resources in order to execute. When a client machine makes a request to a software appliance, every attempt is made to carry that request as far through the compute process as possible. The steps range from the device driver of the network card initially receiving the request, through error checking, to translation of the request into the file system interface. However, this is a best-effort strategy. In fact, it can only be a best-effort strategy, because the CPU at the heart of software appliances is limited to performing only one task at a time and must, by definition, time-share.
[Figure: A conventional CPU time-shares a single clock cycle and main memory across serially queued tasks such as metadata lookup, block allocation, OS operations, block retrieval, NVRAM writes, and RAID rebuild.]

This issue is exacerbated when advanced features or processes such as snapshot, mirroring, clustering, NDMP backup, and in some cases even RAID protection must be handled by the CPU. Each of these processes causes variations and slowdowns that adversely impact the throughput of traditional architectures, as the CPU's processing capability is diluted from having to time-share among these various tasks.

Virtualized Storage, Object Store, and File System

Now that the exclusive hardware advantage of the SiliconServer Architecture has been discussed, it is equally important to examine and understand the external storage system and the unique file system that take advantage of the high-performance and highly scalable architecture. As data passes through layers of FPGAs, the data itself is organized and managed via several layers within the Silicon File System. The best way to understand these layers is to work from the physical storage up to the actual file system seen by the end users.
[Figure: Layers of the Silicon File System, from top to bottom: a virtual file system (NAS cluster name space with a single root, up to 512TB); up to 32 virtual servers per system; multiple file systems per storage pool, each with multiple dynamic virtual volumes; virtual storage pools over virtual tiered storage; and parallel RAID striping with hundreds of spindles per span.]

Parallel Striping

The product is designed to meet two requirements: first, protect the data, and second, provide the high throughput and performance needed to feed the High-performance NAS Platform quickly enough to keep up with its throughput potential and each customer's requirements. The SAN back end usually comprises two or more redundant Fibre Channel switches, which allows for a scalable back end and high availability. The switches are cross-connected to the storage systems and the High-performance NAS Platform SIM module, providing the high-availability failover paths. The LUNs are usually configured for RAID-5 protection, providing failed-disk protection and reconstruction through rotating parity, again ensuring both high availability and good read/write performance.

The SIM module then stripes up to 32 LUNs into a larger logical unit called a Stripe. Stripes are organized into a higher-level entity known as a Span. New Stripes may be added to a Span at any time, without requiring any downtime, allowing dynamically scalable volumes and thin provisioning. This design allows the High-performance NAS Platform to scale in both capacity and back-end performance. Customers can optionally scale performance by adding more storage systems or scale capacity by simply adding more disks to existing storage systems. I/O is issued to the Span, which in turn sends it to an underlying Stripe, and all of the disk drives within that Stripe are brought to bear to achieve the enhanced performance required to sustain throughput. This feature, called SiliconStack, allows the High-performance NAS Platform to scale storage without a reduction in performance. In fact, adding more storage actually increases the performance of the High-performance NAS Platform, as it provides more spindles and controllers to feed the High-performance NAS Platform's high-throughput capability. This, combined with SiliconFS, allows scalability up to 512TB.
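As a rough illustration of the LUN/Stripe/Span relationship described above, the following Python sketch models how a Span might distribute I/O across the drives of its underlying Stripes. The class names, sizes, and placement policy are illustrative assumptions only, not part of any Hitachi or BlueArc API.

```python
# Conceptual model of SiliconStack-style parallel striping (illustrative only).
from dataclasses import dataclass, field
from typing import List

@dataclass
class LUN:
    name: str          # e.g. a RAID-5 logical unit exposed by the back-end array
    size_gb: int

@dataclass
class Stripe:
    luns: List[LUN]    # up to 32 LUNs striped together

    def lun_for_block(self, block: int) -> LUN:
        # Round-robin placement: consecutive blocks land on different LUNs,
        # so every spindle in the Stripe participates in large transfers.
        return self.luns[block % len(self.luns)]

@dataclass
class Span:
    stripes: List[Stripe] = field(default_factory=list)

    def add_stripe(self, stripe: Stripe) -> None:
        # New Stripes can be appended at any time, growing the volume online.
        self.stripes.append(stripe)

    def route_io(self, block: int) -> LUN:
        # Pick a Stripe, then a LUN within it (simplified placement policy).
        stripe = self.stripes[block % len(self.stripes)]
        return stripe.lun_for_block(block)

# Example: a Span built from two Stripes of four LUNs each.
span = Span()
for s in range(2):
    span.add_stripe(Stripe([LUN(f"lun-{s}-{i}", 500) for i in range(4)]))
print(span.route_io(1234).name)
```

The point of the model is simply that adding a Stripe adds spindles to the same Span, so capacity and back-end performance grow together without disturbing existing data paths.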

Virtualization and Tiered Storage

The High-performance NAS Platform delivers advanced virtualization features. Its Virtual Servers feature enables partitioning of storage resources, allows server consolidation, and provides multiprotocol support. When use patterns change or spikes in I/O demand occur, administrators can balance workloads and respond rapidly. They can also create up to 32 virtual servers per node or cluster within the same management framework, easily coordinating throughput by dedicating ports and separate IP addresses to virtual servers.

In addition to presenting an entire file system, the High-performance NAS Platform delivers flexible partitioning, called Virtual Volumes. Administrators may not wish to expose the entirety of the file system to everyone, and through the use of Virtual Volumes they can present a subset of the file system space to a specific group or user. Virtual Volumes are logical containers that can be grown and contracted with a simple size control. Client machines see changes to the size of Virtual Volumes instantly. When shared as an NFS export or CIFS share, the user or application sees only the available space assigned to the Virtual Volume. Administrators can use Virtual Volumes to granularly control directory, project, or user space. The sum of the space controlled by the Virtual Volumes may be greater than the size of the entire file system. This over-subscription approach, sometimes referred to as thin provisioning, provides additional flexibility when project growth rates are indeterminate. It allows administrators to present the appearance of a larger volume to their users and purchase additional storage as needed, while showing a much larger storage pool than is actually available.
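A minimal sketch of the over-subscription idea, assuming hypothetical names and a deliberately simplified accounting model (the product's actual space accounting is not documented here):

```python
# Illustrative thin-provisioning check; names and policy are hypothetical.
from dataclasses import dataclass

@dataclass
class VirtualVolume:
    name: str
    advertised_gb: int   # size presented to clients via the NFS export / CIFS share
    used_gb: int = 0

file_system_gb = 10_000  # physical file system capacity

# Advertised capacity may exceed the file system size...
volumes = [VirtualVolume("projects", 6_000), VirtualVolume("scratch", 8_000)]
advertised = sum(v.advertised_gb for v in volumes)
print(f"advertised {advertised}GB against {file_system_gb}GB physical")

def can_write(vol: VirtualVolume, request_gb: int) -> bool:
    # ...but a write must fit both the volume's advertised limit and the
    # physical space actually remaining in the underlying file system.
    used_total = sum(v.used_gb for v in volumes)
    return (vol.used_gb + request_gb <= vol.advertised_gb
            and used_total + request_gb <= file_system_gb)
```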


Further granular control can be realized by assigning user and group quotas to a Virtual Volume. Each Virtual Volume may have its own set of user and group quotas, and default quota values can be assigned to undefined users and groups. Of course, both hard and soft quotas are supported by the system, as is quota by file count.

The High-performance NAS Platform has the intrinsic property of allowing administrators to more granularly control their storage expenditure. Its unique Multi-Tiered Storage (MTS) feature allows administrators to choose the right Fibre Channel or Serial ATA disk for the specific application and customer requirements. High-performance Fibre Channel drives can be used for the highest throughput and I/O requirements, while lower cost, higher capacity Serial ATA drives can be used for lower throughput applications or nearline storage. As storage technology continues to get faster and achieve higher capacity, the High-performance NAS Platform will continue to accommodate and enhance the value of these mixed media types, as well as reduce the cost of storage management via its ability to migrate data between the storage tiers. It also has a policy-based engine that allows administrators to classify data based on predefined rules, such as data type or last access date. Data can be migrated transparently across storage tiers. The High-performance NAS Platform complements the powerful Hitachi HiCommand Tiered Storage Manager software. This integration allows organizations to combine the advanced file-based virtualization framework with the industry-leading block-based virtualization provided by the Hitachi Universal Storage Platform and Hitachi Network Storage Controller.
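The policy engine's actual rule syntax is not shown in this guide; purely as a conceptual sketch of the kind of rule it evaluates (classify by data type or last access date, then move data between tiers), consider the following. All function and field names are hypothetical.

```python
# Hypothetical sketch of a tiering rule: idle or bulky file types move from
# Fibre Channel to Serial ATA storage. Not an actual product API.
import time
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400

@dataclass
class FileInfo:
    path: str
    last_access: float   # epoch seconds
    tier: str            # "fc" (Fibre Channel) or "sata"

def should_demote(f: FileInfo, max_idle_days: int = 90,
                  cold_suffixes: tuple = (".iso", ".bak", ".tar")) -> bool:
    idle_days = (time.time() - f.last_access) / SECONDS_PER_DAY
    return f.tier == "fc" and (idle_days > max_idle_days
                               or f.path.endswith(cold_suffixes))

def run_policy(files: list[FileInfo]) -> None:
    for f in files:
        if should_demote(f):
            f.tier = "sata"   # in the real system the move is transparent to clients
```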

[Figure: The High-performance NAS Platform presenting multiple file systems (FS1, FS2) holding mixed file types (DOC, PPT, XLS) over redundant SAN paths to tiered back-end storage.]

Unique Object Store

The Object Store is a layer between the normal presentation of a file system view to the user and the raw blocks of storage managed by the SIM. An object is an organization of one or more raw blocks into a tree structure. Each element of the object is called an Onode. Objects are manipulated by logic residing in the FPGAs located on the FSB module.
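As an informal model of this tree structure (the Root, Direct, and Indirect Onodes are described in the figure and paragraphs that follow), a simplified sketch might look like the one below. The field names and layout are conceptual only, not the on-disk format.

```python
# Simplified, illustrative model of an object as a tree of Onodes.
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class DataOnode:
    blocks: bytes                      # raw user data held by this leaf

@dataclass
class DirectOnode:
    children: List[DataOnode] = field(default_factory=list)

@dataclass
class IndirectOnode:
    children: List[Union["IndirectOnode", DirectOnode]] = field(default_factory=list)

@dataclass
class RootOnode:
    oid: int                           # 64-bit Object Identifier
    metadata: dict                     # attributes relevant to the object
    top: Union[DataOnode, DirectOnode, IndirectOnode, None] = None

def read_all(node) -> bytes:
    # Walk the tree and concatenate the data held by the leaves.
    if node is None:
        return b""
    if isinstance(node, DataOnode):
        return node.blocks
    return b"".join(read_all(child) for child in node.children)

small = RootOnode(oid=0x1A2B, metadata={"type": "file_object"},
                  top=DataOnode(b"tiny file"))
print(read_all(small.top))
```

The depth of the tree grows with the amount of content, which is why the same structure can describe both a tiny file and an object spanning the entire file system.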


[Figure: An object as a tree of Onodes. Left and Right Root Onodes point to Direct Onodes and Indirect Onodes, which in turn lead to the Data Onodes (Data 0 through Data 9).]

The primary element at the base of an object is called the Root Onode. Each Root Onode contains a unique 64-bit identifier called the Object Identifier (OID), as well as the metadata information relevant to the object. Root Onodes point either directly to Data Onodes, to Direct Pointer Onodes, or to Indirect Pointer Onodes, depending on the amount of content to be stored. These pointer Onodes are simply the connectors that ultimately lead to the Data Onodes. Via this extensibility, High-performance NAS Platform can support a single object as large as the entire file system, or billions of smaller files, in a very efficient manner.

For each object, two versions of the Root Onode are maintained, referred to as the Left and Right Root Onodes. At any given moment, one of these Root Onodes is atomically correct while its partner is subject to updates and changes. In combination with the NVRAM implementation, this ensures that data integrity is preserved even in the case of a system failure. NVRAM recovery is discussed later in this Architecture Guide. Finally, Root Onodes are versioned when snapshots are taken so that previous incarnations of the object can be accessed.

Different kinds of objects serve different purposes. User data is contained in a file_object. A directory_name_table_object contains file and directory names in various formats (DOS short names, POSIX, etc.), file handles, a crc32 hash value, and the associated OID that points to the location of another object such as a subdirectory (another directory_name_table_object) or a file (file_object). Directory and file manipulation, Snapshot, and other features benefit from this object implementation versus a more traditional file-level structure.

One key example of this is delivered via a unique object called a directory_tree_object. For each directory_name_table_object, there exists a peer called the directory_tree_object. This is a sorted binary search tree (BST) of Onodes containing numeric values (hashes). These hashes are derived by first converting the directory or file name to lowercase and then applying a crc32 algorithm to it. The payoff comes when it is time to find a directory or a file. When a user request asks for a particular file or directory by name, that value is again converted to lowercase, the crc32 algorithm is applied, and then an FPGA on the FSB module executes a binary search of numeric values (as opposed to having to do string compares of names) to locate the position


within the directory_name_table_object at which to begin the search for the required name. The result is a dramatic improvement in lookup speed (a conceptual sketch of this hashed lookup follows at the end of this section). Where all other network storage servers break down, High-performance NAS Platform maintains its performance even with very densely populated directories. High-performance NAS Platform can support over four million files in a single directory, while keeping directory search times to a minimum and sustaining overall system performance. This is one of the reasons that High-performance NAS Platform is ideal for Internet services companies, as they have millions and millions of files, and fewer directories allows for a simplified data structure.

In addition to a large number of files within a directory, the High-performance NAS Platform's Object Store allows the file system itself to be significantly larger, currently supported at up to 256TB. Compare this to other file systems that theoretically support 16TB to 32TB but are often limited to less than half this size due to performance penalties. This capability, combined with the Cluster Name Space feature, allows High-performance NAS Platform to support up to 512TB today, and even more in the future, as these are not architectural limits.

Client machines have no concept of objects, but rather see only the standards-based representation of files. Via the NFS or CIFS protocols they expect to work with string names and file handles; thus, High-performance NAS Platform presents what is expected by these clients and handles all the conversion to objects transparently to ensure perfect compatibility. This view of what the client expects is the job of yet another FPGA, also located on the FSB module, which presents the Virtual File System layer to the clients.

For those clients that require or prefer block-level access, High-performance NAS Platform supports iSCSI. iSCSI requires a view of raw blocks of storage; the client formats and lays down its own file system structure upon this view. To make this happen, High-performance NAS Platform simply creates a single large object of up to 2TB in size (an iSCSI limitation) within the Object Store, which is presented as a run of logical blocks to the client. Since the iSCSI volume is just another object, features like Snapshot or dynamic growth of the object are possible.

By implementing an Object Store, High-performance NAS Platform delivers many outstanding file system characteristics beyond just performance:

Maximum supported volume size: currently 256TB, architected for 2PB
Maximum supported object size: currently 256TB, architected for 2PB
Maximum supported capacity: currently 512TB, architected for over 2PB
Maximum objects per directory: approximately four million, depending on the number of attributes the objects contain and the file name lengths themselves (the High-performance NAS Platform can perform at its maximum capability even with this kind of directory population, as long as the back-end physical storage can sustain the throughput)
Maximum number of snapshots per volume: 1024
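As promised above, here is a minimal sketch of the hashed directory lookup: names are lowercased, hashed with crc32, and located via a binary search over sorted hash values rather than string comparisons. It illustrates the technique only, with assumed class and method names; it is not the SiliconFS implementation.

```python
# Conceptual sketch of the directory_tree_object lookup (not the real on-disk format).
import bisect
import zlib

def name_hash(name: str) -> int:
    # Lowercase the name, then take a crc32 of it, as described above.
    return zlib.crc32(name.lower().encode("utf-8"))

class DirectoryTree:
    def __init__(self, entries: dict):
        # entries maps a file/directory name to the OID of its object.
        self._pairs = sorted((name_hash(n), n, oid) for n, oid in entries.items())
        self._hashes = [h for h, _, _ in self._pairs]

    def lookup(self, name: str):
        h = name_hash(name)
        # Binary search over numeric hashes finds the starting position;
        # a short scan then resolves any hash collisions by exact name.
        i = bisect.bisect_left(self._hashes, h)
        while i < len(self._pairs) and self._pairs[i][0] == h:
            if self._pairs[i][1].lower() == name.lower():
                return self._pairs[i][2]
            i += 1
        return None

tree = DirectoryTree({"Report.DOC": 0x10, "slides.ppt": 0x11, "data.xls": 0x12})
print(hex(tree.lookup("report.doc")))  # case-insensitive hit -> 0x10
```

Because the comparison key is a fixed-width integer rather than a variable-length string, lookup cost stays low even as a directory grows to millions of entries.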


NVRAM Protection

The current FSB module contains 2GB of NVRAM for storing writes and returning fast acknowledgements to clients. The NVRAM is partitioned in half so that one half is receiving data while the other is flushed to disk (check-pointed). The NVRAM halves are organized into smaller pages, which are dynamically assigned to the various file systems based on how heavily they are being accessed.
[Figure: NVRAM on the FSB module, fed by an FPGA on the transmit and receive FastPath pipelines and mirrored to a cluster partner over the High Speed Cluster Interconnect.]

Check-pointing is the process of flushing writes to disk. At check-point time, either the Left or Right Root Onode is written to while the other Onode is frozen, becoming the atomically correct version. This process cycles back and forth every few seconds. In the event a file system recovery is needed later, this frozen version of the Root Onode is used to restore the file system quickly to a consistent check-pointed state. For example, in the case of a power outage, that atomically correct version of the Root Onode becomes critical. First, the alternate Root Onode is made to look exactly the same as the atomically correct version; this process is called a rollback. Then, the contents of NVRAM are replayed against the objects. In this way, customer data is guaranteed to be complete and intact at the end of the recovery. In a High-performance NAS cluster, the NVRAM is further partitioned: half stores local write data, and half stores the cluster partner's write data. In this way, even if one of the nodes fails completely, the remaining partner node can complete the recovery process.

Life of a Packet

To tie the hardware and software architecture together, it is a good exercise to understand how a typical read or write operation is handled through the system. The following diagram and the detailed steps walk through both a write and a read operation. The diagram is a simplification of the design, but it highlights the major blocks of the architecture. For simplicity, the inbound and outbound FPGAs are shown as a single block, but they are actually separate.


Write Example

1. A network packet is received from one of the GigE interfaces on the NIM.
2. The incoming packet is saved into memory on the NIM by the FPGA.
3. If the incoming packet is a network-only request, such as a TCP session setup, it is processed to completion and sent back out to the requesting client.
4. Otherwise, the FPGA will gather additional related incoming packets.
5. The complete request is passed over the LVDS FastPath to the FSB module.
6. The first FPGA on the FSB module stores the message in its own memory and then attempts to decode the incoming message, simultaneously notifying the FSA of the arrival in case exception processing will be required. While most requests are handled directly by this FPGA, the FSA module processor handles exception cases; however, only the header information required for decoding the request is sent to the FSA for processing.
7. Once the request is decoded, the Object Store takes over and this FPGA will send the data in parallel to the NVRAM, update the metadata cache, send an additional copy of the write request over the cluster interconnect pipeline if there is a cluster partner, begin the formulation of a response packet, and pass the request to the SIM module via the FastPath pipeline.
8. Once the NVRAM acknowledges that the data is safely stored, the response packet is shipped back to the NIM, letting it know that the data has been received and is protected (see Steps 12 and 13 below). This allows the client to go on processing without having to wait for the data to actually be put on disk.
9. In parallel with the above operations, an FPGA on the SIM receives the write request and updates the sector cache with the data. At a specified timed interval of just a few seconds, or when half of the NVRAM becomes full, the SIM will be told by the FSB module to flush any outstanding data it has to disk. This is done in such a way as to maximize large I/Os whenever possible in order to achieve the highest throughput to the storage.
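Step 7's parallel fan-out can be pictured with a small concurrency sketch. This is purely illustrative, using threads and hypothetical helper functions in place of the FPGA state machines that actually perform the work.

```python
# Illustrative only: modeling Step 7's parallel fan-out with threads.
from concurrent.futures import ThreadPoolExecutor

def write_nvram(data): ...          # capture the write so it survives a power loss
def mirror_to_partner(data): ...    # copy over the High Speed Cluster Interconnect
def send_to_sim(data): ...          # forward to the SIM's sector cache via FastPath
def build_response(data): ...       # pre-form the acknowledgement packet

def handle_write(data: bytes) -> None:
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn, data) for fn in
                   (write_nvram, mirror_to_partner, send_to_sim, build_response)]
        for f in futures:
            f.result()              # all four complete before the ack leaves the FSB
    # Only now is the response released to the NIM (Step 8).
```

The design point is that these four actions, which a conventional filer would perform one after another, complete in a single parallel step before the client is acknowledged.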

Read Example

Steps 1 through 6 are virtually the same as in the previous example.

7. Since this is a read request, NVRAM is not involved.
8. Certain kinds of read requests have to do with metadata lookups. In this case the Object Store on the FSB module will attempt to find the relevant data in its metadata cache and, if successful, will respond rapidly without having to retrieve the metadata from disk. Otherwise, the lookup mechanism kicks in as described earlier, taking advantage of the directory_tree_object BST method. At various points in this process the request will be passed to the SIM module for any data necessary to the lookup. Once the OID of the target object is found, processing moves to the SIM.
9. The SIM module has an ample sector cache. The FPGA on the SIM will check whether it can satisfy the read request from this cache. Otherwise, it will formulate a Fibre Channel request to retrieve the data from disk and store it in the sector cache. Both metadata and data requests are handled this way.
10. Once the relevant data has been retrieved, the SIM passes the data back to the FSB module.
11. The Object Store updates the metadata cache as necessary and re-applies the RPC layers in order to create a well-formed response.
12. The response packet is passed to the NIM module.
13. The FPGA on the NIM organizes the response into segments that comply with TCP/IP or UDP/IP, and of course Ethernet framing.
14. Finally, the NIM transmits the response out the Gigabit Ethernet interface.

This packet walk-through should help tie together the hardware and software architecture and show how data flows through High-performance NAS Platform.
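Steps 8 and 9 above amount to a two-level fall-through: metadata cache on the FSB, then sector cache on the SIM, then disk. A hedged sketch of that order of checks, with hypothetical names rather than the product's internals, might look like this:

```python
# Hypothetical illustration of the read-path fall-through in Steps 8-9.
metadata_cache: dict = {}   # FSB: file name -> OID (prior lookup results)
sector_cache: dict = {}     # SIM: OID -> raw blocks

def bst_lookup(name: str) -> int:
    # Placeholder for the directory_tree_object binary search described earlier.
    return abs(hash(name)) & 0xFFFFFFFFFFFFFFFF

def fetch_from_san(oid: int) -> bytes:
    # Placeholder for the Fibre Channel read issued by the SIM.
    return b"blocks-for-" + hex(oid).encode()

def read_file(name: str) -> bytes:
    oid = metadata_cache.get(name)
    if oid is None:                      # metadata cache miss: run the BST lookup
        oid = bst_lookup(name)
        metadata_cache[name] = oid
    if oid not in sector_cache:          # sector cache miss: go to disk
        sector_cache[oid] = fetch_from_san(oid)
    return sector_cache[oid]             # Steps 10-14 package and return this
```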


Benefits of the SiliconServer Architecture: Industry-leading Performance, Throughput, and Capacity

First and foremost, the benefits of the SiliconServer Architecture are high transaction performance, high throughput, and high capacity. High-performance NAS Platform delivers the highest performance of any single filer on the market, and it will continue to increase this advantage through its modular design. High-performance NAS Platform has achieved SPECsfs97_R1.v3 benchmark results exceeding those of any network storage solution utilizing a single node presenting a single file system. Published results for the High-performance NAS Platform are 98,131 ops/sec (SPECsfs97_R1.v3)* and an overall response time (ORT) of 2.34 ms. This is more than 272 percent higher throughput than high-end system results from other leading NAS vendors. A dual clustered High-performance NAS configuration presenting a single file system using cluster name space achieved 195,502 ops/sec* and an ORT of 2.37 ms, 286 percent higher than other dual clustered systems. The highly efficient High-performance NAS Platform clustering provided nearly linear scaling over the single-node results, with less than 1 percent loss in efficiency. These results are available at the SPEC.org website, http://www.spec.org/sfs97r1.
*SPECsfs benchmarks published under BlueArc Corporation using third-party storage systems.

These tests clearly demonstrate that the High-performance NAS Platform can sustain its responsiveness in both highly concurrent user and cluster computing environments, speeding applications such as life sciences research, 3D computer-generated imagery, visualization, Internet services, and other compute-intensive environments. While the test results are a proof point for transactional performance, raw throughput is also critical for many of today's large digital content applications. Here, High-performance NAS Platform also excels, providing 3.2Gb/sec of throughput. The combination of high transaction rates and high throughput allows High-performance NAS Platform to excel in mixed environments where users and batch processes demand both aspects of performance with minimal wait times. High-performance NAS Platform leads network storage performance in both dimensions.

High-performance NAS Platform currently supports up to 512TB under the cluster name space with file systems of up to 256TB; however, it is architected to support up to 2PB. The actual amount of storage supported will continue to scale as testing, memory, and requirements continue to grow. Although the unique Object-Store-based file system is the enabler for this capability, the hardware allows these large file systems to continue to be accessed at the highest performance levels even as the file system begins to fill with a high number of files. The combined hardware and software also enable the support of over a million files per directory, critical for larger file systems. Since these types of lookups are converted to a binary search, the system delivers exceptional responsiveness to users, regardless of the type of directory structures and files. This benefit allows storage administrators to consolidate their filers to reduce hardware, software license, and support costs. It also allows customers to purchase an initial system and grow it without buying additional filers and having to manage the migration or separation of data between multiple filers or tiers of storage.

Scalability

High-performance NAS Platform can scale to meet an organization's future requirements in any of the three key storage dimensions. First is the need to support higher performance in terms of IOPS, to feed high-speed applications and compute clusters. Second is the need to support higher bandwidth throughput in terms of Gbit/sec, as file sizes and the


number of users of these large data sets continue to grow. Third is the need to store significantly greater amounts of data in terms of terabytes, driven by growing file sizes, increased data sets, and changing regulatory requirements to retain data for longer periods of time.

[Figure: Three dimensions of scalability: throughput (Gbit/sec), capacity (TB), and performance (IOPS).]

To cope with all three dimensions of scalability, most customers have been forced to do forklift upgrades or deploy many individual or HA-clustered filers. However, these approaches only lead to filer proliferation, increased complexity, and higher support costs. Filer proliferation causes many of the same challenges as DAS, such as unused space on some volumes, not enough on others, excessive headroom across all the volumes, and management issues. With High-performance NAS Platform, data sets no longer have to be separated between different filers to support greater aggregate capacity, performance, and throughput demands, and because of this, clients no longer need to determine which filer to access. Clustering storage is another solution; however, with the lower performance of other systems, this often requires as many as 8-10 cluster storage nodes or more. With storage clustering software still in its infancy, and with higher node counts increasing complexity and losing efficiency, fewer nodes are clearly an advantage.

High-performance NAS Platform was designed to address these three dimensions of scalability today as well as into the future. First, High-performance NAS Platform was designed with the highest performance, capacity, and throughput, all of which meet or exceed most organizations' current requirements. These requirements will continue to grow as data sets grow and compute clusters become more prevalent and put more demands on the storage system; business and organizational requirements could nearly double year over year in the very near future. The SiliconServer Architecture was designed with this in mind. From a design perspective, the long-term benefit of the architecture is that it allows the engineering team to increase performance and throughput of each product generation by approximately 100 percent. This compares with approximately 30 percent achieved by conventional CPU-based appliance vendors. This equates to the architecture having an ever-increasing advantage at the atomic (single filer node) level, doubling the advantage each product cycle at current rates. This is critical both for organizations that wish to stay on single and high-availability clustered systems and for organizations that want to eventually deploy a storage-grid. Both the current High-performance NAS Platform and future module upgrades will continue to provide the fastest file serving nodes and are designed to be the most capable and simple building blocks for a storage-grid. For those who require scalability beyond the capabilities of a single node, High-


performance NAS Platform is the ideal foundation for larger storage-grids, in that fewer nodes will be required to achieve the desired performance. As explained in the previous sections, High-performance NAS Platform can be clustered in 2-way, 3-way, or 4-way configurations with almost no loss of efficiency, providing the highest SPECsfs IOPS of any single-name-space solution. Other solutions require 6-way, 8-way, or larger clusters of file servers to achieve similar results, and this is often theoretical, as many clusters do not support these higher node counts and lose efficiency as they scale. Building a storage-grid from larger, faster nodes such as High-performance NAS Platform reduces hardware, software, and support costs, as well as significantly reducing the complexity required to achieve the desired capacity and performance. Fewer storage cluster nodes reduce the back-end inter-cluster communications and data transfers. Smaller storage clusters or grids with faster nodes will provide a more efficient and cost-effective scalable storage solution. The engineering team is continuing to drive toward higher node-count cluster storage solutions; however, with High-performance NAS Platform's significant advantage in per-node performance, their task is significantly less daunting than that of competitors. Competitive engineering efforts must cluster more nodes built on standard PC-based architectures, which is significantly more complex and less efficient.

Features

The SiliconServer Architecture delivers advanced features, layered onto the hardware file system, without significantly impacting performance, as is often the case with a CPU-based, shared-memory appliance. Features such as snapshot, policy-based migration, mirroring, and other data mover features are executed in hardware, operating on objects within the object store, allowing them to be done at a binary level within the hardware. This significantly reduces the overhead of these features. Combined with the virtualized storage and back-end SAN implementation of the architecture, this capability enables many storage functions to be handled without affecting the performance and throughput of the system. This is especially true in scenarios such as a drive failure, where a CPU-based filer would have to get involved in rebuilding the hot spare, while High-performance NAS Platform offloads this function to the hardware storage systems.

The SiliconServer Architecture allows both Fibre Channel and SATA disk drives to be used. This feature, called Multi-Tiered Storage (MTS), departs from other vendors' offerings, which require a separate filer to handle different tiers of storage and disk, causing a proliferation of filers. High-performance NAS Platform further delivers the capability to migrate data between these storage tiers, simplifying storage management and providing a more cost-controlled environment, as administrators can provide the right type of disk, performance, and throughput based on the application requirements. To preserve the user or application experience, migration between tiers can also be accomplished transparently using NAS Data Migrator, allowing simplified data management functionality that does not affect end users or applications. High-performance NAS Platform delivers multiprotocol access into its MTS, including NFS, CIFS, and even block-level iSCSI. Block-level access is enabled through the object design of the file system, allowing even a block-level partition to be viewed as an object.
For management, High-performance NAS Platform supports SSL, HTTP, SSH, SMTP, and SNMP, as well as remote scripting capability via utilities provided at no additional license fee to the customer. Going forward, Hitachi Data Systems and BlueArc will continue to innovate in both the hardware and software areas. The modular design of the High-performance NAS Platform will allow organizations to have the purchase protection of an upgradeable system, increasing throughput, IOPS, and capacity with simple blade changes. In terms of software, the


foundation of the SiliconFS and the virtualized storage allows advanced features such as Virtual Servers, NAS Data Migrator, and remote block-level replication.

Conclusion

The fastest, most reliable, and longest-lived technology in the data center is typically the network switch, whether it is Ethernet or Fibre Channel. A switch purchased five years ago is still fast and useful because it was built with scalability in mind. The speed, reliability, and scalability of the network switch are directly attributable to the parallelism inherent in the hardware-accelerated implementation, the high-speed backplane, and the replaceable blade design. The High-performance NAS Platform has delivered on the promise of hardware-accelerated file services and will capitalize on this unique capability to enable its customers to continually scale their storage infrastructure as their requirements grow.


Corporate Headquarters: 750 Central Expressway, Santa Clara, California 95050-2627 USA; Contact: +1 408 970 1000; www.hds.com / info@hds.com
Asia Pacific and Americas: 750 Central Expressway, Santa Clara, California 95050-2627 USA; Contact: +1 408 970 1000; www.hds.com / info@hds.com
Europe Headquarters: Sefton Park, Stoke Poges, Buckinghamshire SL2 4HD United Kingdom; Contact: +44 (0) 1753 618000; www.hds.com / info.uk@hds.com

Hitachi is a registered trademark of Hitachi, Ltd., and/or its affiliates in the United States and other countries. Hitachi Data Systems is a registered trademark and service mark of Hitachi, Ltd., in the United States and other countries. HiCommand is a registered trademark of Hitachi, Ltd. BlueArc is a registered trademark of BlueArc Corporation in the United States and/or other countries. All other trademarks, service marks, and company names are properties of their respective owners.

Notice: This document is for informational purposes only, and does not set forth any warranty, express or implied, concerning any equipment or service offered or to be offered by Hitachi Data Systems. This document describes some capabilities that are conditioned on a maintenance contract with Hitachi Data Systems being in effect, and that may be configuration-dependent, and features that may not be currently available. Contact your local Hitachi Data Systems sales office for information on feature and product availability. Hitachi Data Systems sells and licenses its products subject to certain terms and conditions, including limited warranties. To see a copy of these terms and conditions prior to purchase or license, please go to http://www.hds.com/products_services/support/warranty.html or call your local sales representative to obtain a printed copy. If you purchase or license the product, you are deemed to have accepted these terms and conditions.

© Hitachi Data Systems Corporation 2007. Adapted with the permission of BlueArc Corporation. All Rights Reserved. DISK-623-01 DG July 2007
