
Isilon IQ Clustered Storage with Small Files

Isilon OneFS Storage Utilization Best Practices Guide

An Isilon Systems Technical Whitepaper

April 2008

ISILON SYSTEMS

Table of Contents

1. Introduction
2. OneFS Storage Utilization
3. OneFS File Layout
4. OneFS Protection Overhead
5. Small Files Best Practices
6. Online Capacity Analyzer
7. Summary

Appendix A: File Size Reporting Tools
Appendix B: Sample Storage Utilization Graphs


1. Introduction

Isilon has become the clear leader in clustered storage by demonstrating superiority in storing and servicing data sets primarily containing large files of unstructured data. While this data is dominant in our core verticals, such as media & entertainment and Web 2.0, Isilon has also been very successful in providing storage for a wide variety of data sets with a mix of small, medium and large files. In comparison to traditional RAID-based file systems, OneFS can be misconstrued as inefficient at storing data sets with small files. The purpose of this paper is to correct this false impression by introducing key concepts and demonstrating how to properly assess the overhead of the OneFS data protection architecture.

By further educating the market on OneFS, we hope to remove any ambiguity associated with storage protection overhead and to examine how the benefits of the unique OneFS architecture extend to a wide variety of data sets. As part of these guidelines, a new online tool is introduced to help storage professionals calculate and analyze the overhead of varying data sets and to eliminate the guesswork in determining where OneFS is applicable.

2. OneFS Storage Utilization


A key requirement of any enterprise-class storage system is to ensure the reliability of stored data in the event of system failure. Most storage vendors achieve system reliability by utilizing RAID technology to distribute data across multiple disks. These disk aggregates, known as volumes, are further partitioned and then presented to the file system as LUNs. Typical RAID configurations include either RAID-1 for mirroring user data, or RAID-5 for striping user data and parity blocks (or a combination thereof). Each of these configurations, like other less common ones, is designed to meet different performance and reliability requirements and, consequently, incurs a different level of storage overhead. However, regardless of the RAID configuration, the administrator must manage multiple layers of abstraction: disks, volumes, LUNs, and file systems.

Most people understand that data reliability comes with a storage cost, but because traditional RAID is not an integral part of the file system, additional overhead and scalability limitations are encountered. Among them are:

• RAID stripes data across entire volumes. Each LUN must be set with a fixed RAID level across all disks, and this level cannot be changed thereafter. As a result, all data in the volume incurs the same protection overhead, while users may desire varying levels of protection for different classes of data.
• Expanding a volume is a complex administrative process consuming limited CPU resources. Most volumes and file systems do not expand beyond 16TB. As a result, administrators often have to manage a large set of volumes and file systems.
• Additional storage must be allocated for hot spare disks per volume. Snapshot copies require reserved space per volume.
• File systems cannot cross volume boundaries, limiting their scalability.

Considering these limitations, a typical RAID-5 storage system may incur between 40% and 50% overhead, and even more for RAID-6 (double parity protection). In contrast to traditional storage architectures, which consist of three separate, non-integrated software layers (the file system, volume management and RAID control), OneFS integrates all three layers into the OS. This seamless integration eliminates the management pain and scalability limitations mentioned above. By combining these three layers, OneFS provides the following storage benefits:


• Granular, flexible file protection. OneFS directly controls the layout of each file across the cluster, allowing administrators to set protection and performance settings on a file-by-file basis. These settings can be modified at any given moment, simultaneously modifying the protection overhead accordingly.
• One single pool of storage and one file system. Administrators do not need to manage complex RAID group settings or many volumes and file systems. In OneFS, once new nodes are added, existing data is automatically balanced by striping or mirroring data onto the new set of nodes.
• Support for protection levels up to N+4 or 8x. OneFS is able to sustain multiple simultaneous failures of nodes or disks without losing data by striping or mirroring data across multiple nodes. Since OneFS stripes user data across different nodes, data protection extends from individual drive failures to complete node failures.
• No hot spare disks required. All disks across nodes in the cluster are available for rebuilding protection data anywhere in the cluster.
• No snapshot reservation required. Snapshots use the same pool of storage as the live file system.
• Industry leading scalability. OneFS presents all available storage as one single pool of storage.
• Up to 90% storage utilization. With proper cluster configuration, certain user data sets may require only 10% of physical storage for protection.

Because OneFS protects data at the file level, protection overhead is calculated on a per-file basis, as opposed to the flat overhead charge associated with RAID volumes. Files of different sizes and protection settings incur varying protection overhead. Consequently, when certain individual files are examined in isolation, their overhead may seem excessive, leading to an incorrect assessment of overall storage utilization.

3. OneFS File Layout


In order to determine overall storage utilization, a detailed description of how OneFS writes and reads data to disk is necessary. Clients initiate a file write request by connecting to any one of the nodes in the cluster; that node is known as the transaction coordinator. The coordinator communicates with a group of peer nodes in the cluster to establish a transaction group of participants. The group participants will each contribute blocks of disk space to hold a portion of the file as part of a stripe. In this manner, as a file is written, its blocks are distributed across the cluster nodes in stripes, each containing sequences of user data blocks and sequences of protection data blocks. Stripes are formed as follows:

1. A sequence of 16 data blocks is grouped together and written contiguously to a disk on one of the participating nodes. Subsequent sequences in a stripe are written to disks on other participating nodes. The number of participating nodes, and thus sequences, in a stripe is based on the protection type, protection level and cluster size. This will be discussed in a later section.
2. OneFS uses an 8KB block size, so each full sequence in a stripe, comprising 16 blocks, is 128KB long. Data is always written to disk using 8KB blocks, whether as part of a full or partial sequence.
3. For each stripe, OneFS computes protection parity data, written as similar size sequences to disks on participant nodes, much like data sequences. The number of protection sequences depends on the protection type and level.
4. The location of data and protection sequences is tracked in the file's meta-data structure on disk. These meta-data structures are always mirrored to meet the file's protection level.
5. If the user data in the last sequence in a stripe is fewer than 16 blocks, a partial sequence is written along with the computed protection sequence.
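The block and sequence arithmetic in steps 1, 2 and 5 can be sketched in a few lines of Python. This is only an illustration of the sizes involved, not OneFS code; the function name `data_layout` and its return shape are made up for this example, and it counts user data blocks only.

```python
BLOCK_SIZE = 8 * 1024         # OneFS block size stated in the text (8KB)
BLOCKS_PER_SEQUENCE = 16      # blocks written contiguously to one node


def data_layout(file_size):
    """Return (total_blocks, full_sequences, blocks_in_partial_sequence).

    Hypothetical helper: it ignores protection sequences and meta-data.
    """
    # Round up: a partially filled block still occupies a whole 8KB block.
    total_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
    full_sequences, partial_blocks = divmod(total_blocks, BLOCKS_PER_SEQUENCE)
    return total_blocks, full_sequences, partial_blocks


print(data_layout(1024 * 1024))   # a 1MB file
print(data_layout(10 * 1024))     # a 10KB file
```

Under these assumptions a 1MB file comes out as 128 blocks in 8 full sequences, and a 10KB file as 2 blocks in a single partial sequence.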


Once the write transaction has been committed, a response is returned to the client. There are other elements to writing data on OneFS, such as asynchronous writes, caching, journaling, and locking, that are outside the scope of this discussion.

• Data is evenly distributed across nodes as it is written
• OneFS block size is 8KB
• 16 blocks written contiguously to each node
• 8KB block size x 16 blocks = 128KB stripe unit size

[Figure 1: Schematic representation of striping data during a write operation. Blocks 0-15 are written to node A, blocks 16-31 to node B, and so on through blocks 144-159 on node J, after which the next stripe begins.]

File protection is defined in a policy using one of two protection types:

Forward Error Correction (FEC)
This protection type is presented as N+M, where N is the maximum number of nodes in a stripe containing user data and M is the number of nodes with parity data. M, which ranges from 1 to 4, represents how many nodes may fail while user data remains fully available. The sum N+M is also known as the stripe width. The stripe width will never exceed the number of nodes in a cluster; otherwise, one or more nodes in a stripe would hold more than one sequence, and those sequences would not be protected if that node became unavailable. For example, a five node cluster with an N+1 protection policy will write files in stripes with N=4 (4 data sequences) and M=1 (1 protection sequence). At the N+1 protection level, N is capped at 9 to produce a maximum stripe width of 10, and at N+4, N is capped at 16 to produce a maximum stripe width of 20. See Table 1 below for the full range of stripe widths.

Mirroring
This protection type is presented as Mx, where M is between 2 and 8. With this protection type, OneFS protects files by creating between 2 and 8 copies of the data in a stripe. Each additional mirrored copy adds 100% overhead relative to the user data. All mirrored copies are identical and reside on different nodes. The number of mirrored copies cannot exceed the number of nodes in a cluster.

OneFS stores each file's protection policy in the file's meta-data. When new files are created, they inherit the protection setting from their parent directory.
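How the FEC stripe width follows from cluster size can be sketched as below. The maximum widths are the ones listed in Table 1; the function name and the error raised for very small clusters are assumptions of this sketch. In particular, the `n < m` cut-off is inferred from Table 2 later in this paper, where such clusters fall back to mirroring, and is not a documented OneFS rule.

```python
# Maximum stripe width (N+M) per FEC protection level, from Table 1.
MAX_STRIPE_WIDTH = {1: 10, 2: 14, 3: 18, 4: 20}


def fec_stripe(nodes, m):
    """Return (n, m) for a +m policy on a cluster with `nodes` nodes.

    Hypothetical helper, not a OneFS API.
    """
    if m not in MAX_STRIPE_WIDTH:
        raise ValueError("FEC protection level must be between 1 and 4")
    # The stripe width never exceeds the cluster size or the per-level cap.
    width = min(nodes, MAX_STRIPE_WIDTH[m])
    n = width - m
    if n < m:
        # Inferred from Table 2: these clusters are shown using mirroring.
        raise ValueError("cluster too small for an N+M layout at this level")
    return n, m


print(fec_stripe(5, 1))    # five node cluster at +1
print(fec_stripe(30, 4))   # large cluster at +4
```

For the five node example in the text this yields 4+1, and for any cluster of 20 nodes or more at +4 it yields 16+4.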

4. OneFS Protection Overhead


With an understanding of the OneFS file layout scheme, we can now look at the actual protection overhead on files with varying protection settings and cluster configurations. For our purposes we define protection overhead as the ratio of the disk space consumed beyond the user data (protection data plus any block padding) to the user data alone, represented as a percentage. First we'll look at protection overhead on large files as a baseline, and then we'll see how small files incur additional overhead. Ultimately our goal is to accurately determine the total cluster protection overhead of any mix of data sets.

Mirroring Protection Overhead
Calculating mirroring protection overhead is straightforward. A file protected at the 2x protection setting incurs 100% overhead to account for the mirrored copy of user data. Similarly, a file at 3x incurs 200%, and so forth. A 1MB file (8*128KB) would consume 2MB of disk space at 2x protection and 3MB at 3x.

But when file data doesn't fit into an exact number of 8KB blocks, additional overhead is incurred because a full block is used for the remaining data. This is true but less noticeable in large files. For example, a file of 1,056,769 bytes protected at 2x (one byte more than 129 full blocks, or 1MB + 8KB + 1 byte) occupies 130 blocks per copy, or 2,129,920 bytes on disk, which results in 101.55% overhead instead of the 100% of an exact 1MB file. A file small enough to be on the order of one block, however, will incur a larger overhead percentage. A 10KB file at 2x must be written in two 8KB blocks which are mirrored, using an overall disk space of 2*16KB = 32KB instead of 20KB. So the protection overhead for a 10KB file is (32KB - 10KB) / 10KB = 220% instead of the baseline of 100%. At 3x the file will occupy 48KB of disk space at 380% protection overhead instead of the baseline of 200%.

FEC Protection Overhead
Most commonly, files are protected using the FEC scheme. The FEC protection overhead calculation is a bit more complicated. In addition to block size, the stripe width becomes a factor.
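Before turning to the FEC arithmetic, the mirroring figures quoted above can be reproduced with a short Python sketch. The function names are made up for this example; the only inputs taken from the text are the 8KB block size and the overhead definition (disk space beyond the user data, divided by the user data).

```python
BLOCK_SIZE = 8 * 1024   # OneFS block size (8KB)


def mirrored_bytes(file_size, copies):
    """Disk space used by a file stored as `copies` identical mirrors."""
    # Each copy is rounded up to a whole number of 8KB blocks.
    blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
    return blocks * BLOCK_SIZE * copies


def mirror_overhead(file_size, copies):
    """Protection overhead as a fraction of the user data."""
    return (mirrored_bytes(file_size, copies) - file_size) / file_size


for size, copies in [(1024 * 1024, 2), (1056769, 2), (10 * 1024, 2), (10 * 1024, 3)]:
    print(size, copies, mirrored_bytes(size, copies),
          f"{mirror_overhead(size, copies):.2%}")
```

The loop covers the four cases discussed above: an exact 1MB file at 2x, the 1,056,769 byte file at 2x, and the 10KB file at 2x and 3x.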
In general, the larger the stripe width, the less protection overhead a file incurs, because within a stripe more data sequences share the same number of protection sequences. In a 4 node cluster with files protected at +1, the overhead on large files is 1/3 = 33%. But in a 6 node cluster at +1 protection, the overhead is 1/5 = 20%. We refer to these calculations as the baseline protection overhead (BPO): BPO = M/N. The BPO measures the ideal storage utilization given a FEC protection level (M) and a stripe width (N+M). The following table shows typical FEC baseline protection overhead values.

FEC Protection (M)    Max Stripe Width (N+M)    BPO
+1                    10                        11%
+2                    14                        17%
+3                    18                        20%
+4                    20                        25%

Table 1 - Typical Baseline Protection Overhead
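The BPO values above follow directly from BPO = M/N with N taken as the stripe width minus M. A minimal Python check, using a hypothetical helper name, might look like this:

```python
def baseline_protection_overhead(m, stripe_width):
    """BPO = M/N, where N = stripe_width - M data sequences per stripe."""
    n = stripe_width - m
    return m / n


# Maximum stripe widths per FEC level, as listed in Table 1.
for m, width in [(1, 10), (2, 14), (3, 18), (4, 20)]:
    print(f"+{m}: width {width}, BPO {baseline_protection_overhead(m, width):.0%}")

# The two smaller clusters mentioned in the text, both at +1.
print(f"{baseline_protection_overhead(1, 4):.0%}")   # 4 node cluster
print(f"{baseline_protection_overhead(1, 6):.0%}")   # 6 node cluster
```

The same function covers clusters smaller than the maximum stripe width by passing the cluster size as the width, which is how the 4 node and 6 node examples are computed.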

The BPO in Table 1 is only an optimal baseline to compare against. It matches actual protection overhead only for files that have an exact number of 8KB blocks and that are large enough to occupy the maximum stripe width for the given FEC protection level. For example, at +3, to achieve optimal BPO, the cluster must have at least 18 nodes and the file must be at least 15x128KB = 1.875MB in size, consuming 2.25MB of disk storage (20% protection overhead).

As in the mirrored protection cases, small files and files that do not hold an exact number of 8KB blocks will increase the overall protection overhead. For example, a 10KB file set at +1 on a 5 node cluster will actually take 32KB. While OneFS will report the protection policy as +1, the actual protection overhead will be 220% compared with the 25% baseline protection overhead of a 4+1 stripe.

Actual Protection vs. Protection Policy
There are also situations where a file cannot be protected according to its protection policy because OneFS cannot use the ideal protection layout. In these cases OneFS will attempt to provide the same level of file protection using the lowest mirroring protection scheme. For example, on a 3 node cluster a file with a +2 policy will be protected using a 3x mirror, because on a 3 node cluster two distinct protection sequences in a stripe cannot be generated. The following table maps policy to actual protection level. Green indicates that the desired protection policy is met, yellow indicates a changed layout that preserves the desired level of protection, and red indicates a lower level of protection than the desired protection policy.

Cluster Size    Policy +1    Policy +2    Policy +3    Policy +4
30              9+1          12+2         15+3         16+4
20              9+1          12+2         15+3         16+4
10              9+1          8+2          7+3          6+4
9               8+1          7+2          6+3          5+4
8               7+1          6+2          5+3          4+4
7               6+1          5+2          4+3          5x
6               5+1          4+2          3+3          5x
5               4+1          3+2          4x           5x
4               3+1          2+2          4x           4x
3               2+1          3x           3x           3x

Table 2 - File protection for various cluster configurations and protection policies

Mixed Data Sets
As indicated, smaller files incur additional protection overhead. This extra overhead is noticeable in data sets with only very small files, but most data sets include files in a variety of sizes. The total cluster protection overhead mainly depends on the ratio between small and large files. In most cases the presence of a few large files nullifies the extra protection overhead of many small files. For example, on a 5 node cluster with a +1 protection setting, if there are a hundred 10K files and one 10MB file is added, the total protection overhead shifts dramatically from 150% to 36% (compared to a baseline of 25%).

It is important to understand that it is not the average file size in the cluster that matters for calculating total protection overhead, but the ratio between small and large files. In a data set with many small files and very few large files, the average file size may be very close to the size of the small files, but the total protection overhead may be much closer to that of the large files. Using average file size produces incorrect storage utilization assessments. In the example above the average file size is about 111KB. Calculating the cluster protection overhead assuming all files are 111KB in size would produce 105% protection overhead, compared with the 36% of actual overhead.

The following graph shows the total data set protection overhead in the rightmost column and the overhead of each class of file sizes in the columns to its left. The graph shows that the total additional overhead in bytes produced by the 100 small files is small in proportion to the overhead of the one large file. That is why the total additional overhead in the rightmost column remains relatively small.

[Figure 2: Comparison of protection overhead between different classes of file sizes. Chart title: "Actual Utilization as a Function of File Size". Y-axis: Bytes (0 to 20,000,000). X-axis: File Size, with a Total column. Series: Logical Space, Ideal Overhead, Additional Overhead (Penalty).]
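The mixed data set effect described in this section can be explored with a simplified model of the FEC layout from section 3. The sketch below is an assumption-laden approximation, not the calculation used by Isilon: it assumes each stripe holds N data sequences of up to 16 blocks, and that each of the M parity sequences is as long as the longest data sequence in its stripe. It ignores meta-data and anything else the real file system accounts for, and all function names are hypothetical.

```python
BLOCK_SIZE = 8 * 1024
BLOCKS_PER_SEQUENCE = 16


def fec_physical_bytes(file_size, n, m):
    """Disk space for one file at N+M under the simplified model."""
    blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
    full_stripes, rest = divmod(blocks, n * BLOCKS_PER_SEQUENCE)
    # Full stripes carry m full parity sequences each.
    parity_blocks = full_stripes * m * BLOCKS_PER_SEQUENCE
    if rest:
        # Partial stripe: parity matches its longest data sequence.
        parity_blocks += m * min(rest, BLOCKS_PER_SEQUENCE)
    return (blocks + parity_blocks) * BLOCK_SIZE


def total_overhead(files, n, m):
    """files: list of (file_size_bytes, count). Returns overhead as a fraction."""
    logical = sum(size * count for size, count in files)
    physical = sum(fec_physical_bytes(size, n, m) * count for size, count in files)
    return (physical - logical) / logical


# 5 node cluster at +1, i.e. a 4+1 stripe.
small = [(10 * 1024, 100)]
mixed = small + [(10 * 1024 * 1024, 1)]

small_only = total_overhead(small, n=4, m=1)
with_large = total_overhead(mixed, n=4, m=1)

# The misleading shortcut: pretend every file has the average size.
file_count = sum(count for _, count in mixed)
average_size = sum(size * count for size, count in mixed) // file_count
from_average = total_overhead([(average_size, file_count)], n=4, m=1)

print(f"small files only:  {small_only:.0%}")
print(f"with one 10MB file: {with_large:.0%}")
print(f"average-size guess: {from_average:.0%}")
```

In this simplified model the small files alone come out at 220% (consistent with the 32KB-per-file figure given earlier), the mix at roughly 42%, and the average-size estimate at roughly 101%. These differ from the 150%, 36% and 105% quoted in the text, which may reflect factors this sketch does not model, but the pattern is the same: one large file pulls the total far below what the average file size would suggest.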

5. Small Files Best Practices


Using the above information, we can address questions regarding small file storage utilization with the following points:

• OneFS protects data at the file level, not at the volume level. It does not need to manage RAID volumes and, consequently, does not require pre-allocating storage for hot spare disks or snapshots per volume. Typical RAID systems start off by claiming 40%-60% of storage for protection before any data has been written.
• In OneFS, different file sizes incur different protection overhead. While small files incur higher protection overhead, it takes very few large files in a data set to eliminate this small file effect. Most data sets include a mix of small and large files.
• To accurately assess a data set's total protection overhead, always look at the ratio of small and large files and not the average file size in the data set. Calculating total storage overhead based on average file size leads to significantly overestimating storage consumption.
• Not all small files are created equal. Small files should be grouped into categories such as 1K, 10K, 100K, and 500K. These groups should then be matched with other large files to offset storage overhead. The actual size of large files with ideal storage utilization depends on cluster size and stripe width. It is safe to use files over 2MB to offset against small files. The online analyzer tool described below can provide an accurate and granular analysis of storage utilization.
• The Isilon IQ single pool of storage makes it ideal for consolidating storage islands. Storage consolidation has the extra benefit of creating data sets with mixed file sizes, which further reduces total cluster protection overhead.
• Assuming some large files are present, always consider adding nodes to reduce storage overhead. Maximizing the number of nodes in a cluster, per desired protection level, minimizes protection overhead as more nodes participate in each file stripe. This technique directly minimizes the protection overhead of single large files (2MB and up) and, as described above, leads to significantly lower total cluster protection overhead.
• Use the desired vs. actual protection policy in Table 2 to ensure the cluster size can support the desired protection policies. Try to avoid policy settings and cluster size combinations that produce suboptimal protection levels, as indicated in the yellow cells, or degraded protection levels, as indicated in the red cells.
• Isilon is a certified data storage partner for VMware ESX and Virtual Infrastructure. As an NFS data store, VM virtual disks are stored on an Isilon cluster as .vmdk files. Regardless of the size and number of files a VM guest OS manages, the entire data set is stored as a single large file on the Isilon cluster. As a result, space utilization can be dramatically improved by data set virtualization.
• Investigate usage and access patterns of small files. If small files are rarely accessed after being created or after reaching a certain age, explore ways to archive those files. Isilon partners with various archive vendors such as Symantec.
• An online tool is available for Isilon staff to calculate total data set protection overhead. The next section describes this tool.


6. Small Files Online Analyzer


The purpose of this tool is to provide Isilon sales engineers with an accurate account of total overhead given a data set with mixed file sizes. The tool is available internally at: http://seadev01.isilon.com/calculators/storage_usage.php

[Figure 3: Online storage usage analyzer]

The Storage Usage Analyzer accepts cluster and file level settings as user input, and generates storage utilization output.


Cluster Settings
• The user selects the cluster size and the protection level.
• At any point the user can change the global settings, and the data below will be recalculated.

Data set input and utilization metrics
• The user adds multiple entries of file size and file count combinations representing a data set.
• For each line entry the calculator generates the following storage utilization metrics:
  o Logical Space: the total space representing the file data.
  o Physical Space: the actual space used to store the data on disk.
  o Ideal Physical Space: what the cluster would use ideally, if there were no small file overhead.
  o Ideal Overhead: the overhead for protection in an ideal physical space allocation.
  o Additional Overhead: the extra space used because of small file space allocation.

Data Set Analysis
• Actual vs Ideal % of Storage Used for Protection: the ratio of protection overhead divided by total storage allocation for the data set. The comparison between actual and ideal ratios allows you to see how much storage is needed for protection and the extra cost of protecting smaller files.
• Actual vs Ideal Parity Tax: the ratio of parity protection data divided by user data. The comparison between actual and ideal ratios allows you to see the protection overhead per data set and the extra cost of small files.
• Small File Penalty: the ratio of the small files' additional storage divided by ideal physical space. It provides a view of the cost associated with small files compared with the ideal cost.
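The metric definitions above can be written out as simple ratios. The Python sketch below is one reading of those definitions, not the analyzer's actual implementation: the function name, the dictionary keys, and the choice to count all space beyond the logical data as protection are assumptions. The sample input reuses the earlier example of one hundred 10KB files at +1 on a 5 node cluster (32KB on disk each), with the ideal physical space assumed to be the logical space plus the 25% baseline overhead.

```python
def analyze(logical, physical, ideal_physical):
    """Derive analyzer-style metrics for one data set (any consistent unit)."""
    ideal_overhead = ideal_physical - logical
    additional_overhead = physical - ideal_physical
    protection = physical - logical   # everything beyond the user data
    return {
        "ideal_overhead": ideal_overhead,
        "additional_overhead": additional_overhead,
        "actual_pct_protection": protection / physical,
        "ideal_pct_protection": ideal_overhead / ideal_physical,
        "actual_parity_tax": protection / logical,
        "ideal_parity_tax": ideal_overhead / logical,
        "small_file_penalty": additional_overhead / ideal_physical,
    }


# One hundred 10KB files, sizes in KB.
metrics = analyze(logical=100 * 10, physical=100 * 32, ideal_physical=100 * 10 * 1.25)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Keeping the actual and ideal variants side by side mirrors the way the tool presents them, so the cost attributable to small files is the gap between each pair.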

7. Summary
OneFS storage utilization is file based, not volume based. This unique aspect of OneFS must be taken into account when assessing the storage utilization of various data sets. OneFS storage utilization compares competitively against RAID-based storage systems while adding the benefits of unmatched reliability and scalability. Using the online analyzer, the storage utilization of varying data sets and cluster configurations can be assessed.

About Isilon Systems

Isilon Systems is the worldwide leader in clustered storage systems and software for digital content and unstructured data, enabling enterprises to transform data into information - and information into breakthroughs. Isilon's award-winning family of IQ clustered storage systems combines Isilon's OneFS operating system software with the latest advances in industry-standard hardware to deliver modular, pay-as-you-grow, enterprise-class storage systems. Isilon's clustered storage solutions speed access to critical business information while dramatically reducing the cost and complexity of storing it. Information about Isilon can be found at http://www.isilon.com.


Appendix A: File Size Reporting


The table below lists command line file size reporting tools and output per specified option. The Base column describes how the tool calculates megabytes, gigabytes, etc. Base-10 means 1GB equals 1000*1000*1000 bytes. Base-2 means 1GB equals 1024*1024*1024 bytes.

Command             Base        Includes sparse    Includes Protection

On Cluster
du                  1024        no                 yes
du -h               1000*       no                 yes
du -l               No units    no                 no
du -lh              1000*       no                 no
df                  1000*       no                 yes+
ls                  1000*       yes                no
stat                1000*       yes                no
isi quota           1000*       no                 depends on quota

NFS Client
du                  1024        no                 yes
df                  1024        no                 yes+
ls                  1024        yes                no
stat                1024        yes                no

Windows Client
File Properties     1024        yes                no
Share Properties    1024        no                 yes+

Notes:
* All Base-10 numbers on the cluster are expected to change to Base-2 in OneFS 5.0.
+ It is possible to set up a logical quota, with a hard threshold and the container flag, that would generate a value that excludes protection overhead.
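The difference between the two bases in the Base column is easy to underestimate when comparing tool output. A small Python illustration follows; the helper name `to_gb` is made up for this example.

```python
def to_gb(num_bytes, base):
    """Convert bytes to gigabytes using base 1000 (Base-10) or 1024 (Base-2)."""
    return num_bytes / base ** 3


one_gib = 1024 ** 3   # 1GB as a Base-2 tool defines it

print(to_gb(one_gib, 1024))   # size as reported by a Base-2 tool
print(to_gb(one_gib, 1000))   # same bytes as reported by a Base-10 tool
```

The same byte count reads about 7% larger in Base-10 gigabytes than in Base-2 gigabytes, which is worth keeping in mind when reconciling numbers from the cluster with numbers from a client.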


Appendix B: Sample Storage Utilization Graphs

Unlike the storage utilization of large files, the storage utilization of small files is not affected by cluster size. Below are sample graphs of two different clusters that show the storage utilization of different file sizes. Cluster size affects the maximum storage utilization of large files, while protection level affects the storage utilization of both small and large files. See the examples below.

On this 3 node cluster with +1 protection, maximum space utilization of 66% is achieved with 4MB files. A 128KB file has about 50% space utilization and a 2KB file is at 10% utilization.
[Graph: Utilization as a Function of File Size. Y-axis: utilization, 0% to 100%. X-axis: File Size.]

On this 6 node cluster with +1 protection level, maximum space utilization of 83% is achieved at about 16MB per file. A 128KB file has about 50% space utilization and a 2KB file is at 10% utilization.
[Graph: Utilization as a Function of File Size. Y-axis: utilization, 0% to 100%. X-axis: File Size.]

On this 6 node cluster with +2 protection level, maximum space utilization of 66% is achieved at about 2MB per file. A 128KB file has about 50% space utilization, and a 2KB file is at 10% utilization.
[Graph: Utilization as a Function of File Size. Y-axis: utilization, 0% to 100%. X-axis: File Size.]

Cluster size affects maximum storage utilization of large files while the storage overhead for small files is mainly affected by the file protection level.
