Ceph: A Scalable, High-Performance Distributed File System
Sage A. Weil
Scott A. Brandt
Ethan L. Miller
Darrell D. E. Long
Carlos Maltzahn
University of California, Santa Cruz
{sage, scott, elm, darrell, carlosm}@cs.ucsc.edu
Abstract
We have developed Ceph, a distributed file system that
provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data
and metadata management by replacing allocation tables with a pseudo-random data distribution function
(CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We
leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous
OSDs running a specialized local object file system. A
dynamic distributed metadata cluster provides extremely
efficient metadata management and seamlessly adapts to
a wide range of general purpose and scientific computing file system workloads. Performance measurements
under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations
per second.
1 Introduction
System designers have long sought to improve the performance of file systems, which have proved critical to
the overall performance of an exceedingly broad class of
applications. The scientific and high-performance computing communities in particular have driven advances
in the performance and scalability of distributed storage systems, typically predicting more general purpose
needs by a few years. Traditional solutions, exemplified
by NFS [20], provide a straightforward model in which
a server exports a file system hierarchy that clients can
map into their local name space. Although widely used,
the centralization inherent in the client/server model has
proven a significant obstacle to scalable performance.
More recent distributed file systems have adopted architectures based on object-based storage, in which conventional hard disks are replaced with intelligent object storage devices (OSDs), which combine a CPU, network interface, and local cache with an underlying disk or RAID.
[Figure 1: System architecture. Applications (e.g., ls, bash, myproc) reach a Ceph client instance either through the Linux kernel VFS and FUSE (via libfuse) or by linking to the client directly; clients send metadata operations to the metadata server cluster and perform file I/O directly against the object storage cluster, which also holds metadata storage.]
2 System Overview
The Ceph file system has three main components: the
client, each instance of which exposes a near-POSIX file
system interface to a host or process; a cluster of OSDs,
which collectively stores all data and metadata; and a
metadata server cluster, which manages the namespace
(file names and directories) while coordinating security,
consistency and coherence (see Figure 1). We say the
Ceph interface is near-POSIX because we find it appropriate to extend the interface and selectively relax consistency semantics in order to better align with the needs
of applications and to improve system performance.
The primary goals of the architecture are scalability (to
hundreds of petabytes and beyond), performance, and reliability. Scalability is considered in a variety of dimensions, including the overall storage capacity and throughput of the system, and performance in terms of individual clients, directories, or files. Our target workload may
include such extreme cases as tens or hundreds of thousands of hosts concurrently reading from or writing to
the same file or creating files in the same directory. Such
scenarios, common in scientific applications running on
supercomputing clusters, are increasingly indicative of
tomorrow's general purpose workloads. More importantly, we recognize that distributed file system workloads are inherently dynamic, with significant variation
in data and metadata access as active applications and
data sets change over time. Ceph directly addresses the
issue of scalability while simultaneously achieving high
performance, reliability and availability through three
fundamental design features: decoupled data and metadata, dynamic distributed metadata management, and reliable autonomic distributed object storage.
Decoupled Data and Metadata
Ceph maximizes the
separation of file metadata management from the storage
of file data. Metadata operations (open, rename, etc.)
are collectively managed by a metadata server cluster,
while clients interact directly with OSDs to perform file
I/O (reads and writes). Object-based storage has long
promised to improve the scalability of file systems by
delegating low-level block allocation decisions to individual devices. However, in contrast to existing objectbased file systems [4, 7, 8, 32] which replace long per-file
block lists with shorter object lists, Ceph eliminates allocation lists entirely. Instead, file data is striped onto predictably named objects, while a special-purpose data distribution function called CRUSH [29] assigns objects to
storage devices. This allows any party to calculate (rather
than look up) the name and location of objects comprising a file's contents, eliminating the need to maintain and
distribute object lists, simplifying the design of the system, and reducing the metadata cluster workload.
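To make the calculation concrete, the following Python sketch shows how a client could compute, rather than look up, the object holding a given byte of a file and the OSDs responsible for it. It is illustrative only: the stripe size, PG count, and hash-based placement function are hypothetical stand-ins for Ceph's actual striping parameters and for CRUSH, which additionally accounts for device weights and failure domains.

    import hashlib

    STRIPE_SIZE = 4 * 1024 * 1024   # assumed per-object stripe unit
    NUM_PGS = 4096                  # assumed number of placement groups
    NUM_REPLICAS = 3

    def object_for(ino, offset):
        # File data is striped onto predictably named objects: (inode, object number).
        ono = offset // STRIPE_SIZE
        return f"{ino:x}.{ono:08x}"

    def pg_for(oid):
        # Objects map to placement groups by hashing the object name.
        return int(hashlib.sha1(oid.encode()).hexdigest(), 16) % NUM_PGS

    def osds_for(pgid, osd_ids):
        # Stand-in for CRUSH(pgid): deterministically rank OSDs and take the first n.
        ranked = sorted(osd_ids, key=lambda osd: hashlib.sha1(f"{pgid}:{osd}".encode()).digest())
        return ranked[:NUM_REPLICAS]

    # Any party holding the (compact) cluster map can locate data without consulting
    # an allocation table or object list.
    oid = object_for(ino=0x1234, offset=10 * 1024 * 1024)
    print(oid, pg_for(oid), osds_for(pg_for(oid), list(range(20))))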
Dynamic Distributed Metadata Management
Because file system metadata operations make up as
much as half of typical file system workloads [22],
effective metadata management is critical to overall
system performance. Ceph utilizes a novel metadata
cluster architecture based on Dynamic Subtree Partitioning [30] that adaptively and intelligently distributes
responsibility for managing the file system directory
hierarchy among tens or even hundreds of MDSs. A
(dynamic) hierarchical partition preserves locality in
each MDS's workload, facilitating efficient updates
and aggressive prefetching to improve performance
for common workloads. Significantly, the workload
distribution among metadata servers is based entirely
on current access patterns, allowing Ceph to effectively
utilize available MDS resources under any workload and
achieve near-linear scaling in the number of MDSs.
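As a toy illustration of load-driven partitioning, the sketch below decays per-directory popularity counters and picks the subtree whose load best matches an MDS's surplus relative to its target share. All names, the decay policy, and the selection heuristic are hypothetical simplifications; the real balancer must also consider replication of hot metadata and hashing of hot directories.

    class DirFrag:
        def __init__(self, path, children=()):
            self.path = path
            self.children = list(children)
            self.temperature = 0.0          # exponentially decayed popularity counter

        def record_op(self):
            self.temperature += 1.0

        def decay(self, factor=0.5):
            self.temperature *= factor
            for child in self.children:
                child.decay(factor)

        def subtree_load(self):
            return self.temperature + sum(c.subtree_load() for c in self.children)

    def pick_subtree_to_export(root, my_load, target_load):
        # Choose the subtree whose load best matches this MDS's surplus.
        surplus = my_load - target_load
        if surplus <= 0:
            return None
        best, best_gap = None, float("inf")
        stack = [root]
        while stack:
            node = stack.pop()
            gap = abs(node.subtree_load() - surplus)
            if gap < best_gap:
                best, best_gap = node, gap
            stack.extend(node.children)
        return best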
Reliable Autonomic Distributed Object Storage
Large systems composed of many thousands of devices
are inherently dynamic: they are built incrementally, they
grow and contract as new storage is deployed and old devices are decommissioned, device failures are frequent
and expected, and large volumes of data are created,
moved, and deleted. All of these factors require that the
distribution of data evolve to effectively utilize available
resources and maintain the desired level of data replication. Ceph delegates responsibility for data migration,
replication, failure detection, and failure recovery to the
cluster of OSDs that store the data, while at a high level,
OSDs collectively provide a single logical object store
to clients and metadata servers. This approach allows
Ceph to more effectively leverage the intelligence (CPU
and memory) present on each OSD to achieve reliable,
highly available object storage with linear scaling.
We describe the operation of the Ceph client, metadata
server cluster, and distributed object store, and how they
are affected by the critical features of our architecture.
We also describe the status of our prototype.
3 Client Operation
We introduce the overall operation of Ceph's components and their interaction with applications by describing Ceph's client operation.

Metadata read operations (e.g., readdir, stat) and updates (e.g., unlink, chmod) are synchronously applied by the MDS to ensure serialization, consistency, correct security, and safety. For simplicity, no metadata locks or leases are issued to clients.
For HPC workloads in particular, callbacks offer minimal upside at a high potential cost in complexity.
Instead, Ceph optimizes for the most common metadata access scenarios. A readdir followed by a stat of
each file (e.g., ls -l) is an extremely common access
pattern and notorious performance killer in large directories. A readdir in Ceph requires only a single MDS
request, which fetches the entire directory, including inode contents. By default, if a readdir is immediately
followed by one or more stats, the briefly cached information is returned; otherwise it is discarded. Although
this relaxes coherence slightly in that an intervening inode modification may go unnoticed, we gladly make this
trade for vastly improved performance. This behavior
is explicitly captured by the readdirplus [31] extension,
which returns lstat results with directory entries (as some
OS-specific implementations of getdir already do).
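A minimal client-side sketch of this behavior, assuming a hypothetical MDS interface (mds_readdir returning directory entries together with inode attributes, and mds_stat as the synchronous fallback):

    import time

    READDIR_CACHE_TTL = 1.0   # assumed lifetime of the briefly cached inode data (seconds)

    class Client:
        def __init__(self, mds):
            self.mds = mds
            self._readdir_cache = {}          # dir path -> (timestamp, {name: inode attrs})

        def readdir(self, path):
            entries = self.mds.mds_readdir(path)               # one MDS request: names + inodes
            self._readdir_cache[path] = (time.time(), entries)
            return sorted(entries)                             # names only, like readdir(3)

        def stat(self, path):
            dirname, _, name = path.rpartition("/")
            cached = self._readdir_cache.get(dirname)
            if cached:
                ts, entries = cached
                if time.time() - ts < READDIR_CACHE_TTL and name in entries:
                    return entries[name]                       # served from the briefly cached readdir
                del self._readdir_cache[dirname]               # otherwise discard the stale entries
            return self.mds.mds_stat(path)                     # fall back to a synchronous MDS stat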
Ceph could allow consistency to be further relaxed by
caching metadata longer, much like earlier versions of
NFS, which typically cache for 30 seconds. However,
this approach breaks coherency in a way that is often critical to applications, such as those using stat to determine
if a file has been updated: they either behave incorrectly,
or end up waiting for old cached values to time out.
We opt instead to again provide correct behavior and
extend the interface in instances where it adversely affects performance. This choice is most clearly illustrated
by a stat operation on a file currently opened by multiple
clients for writing. In order to return a correct file size
and modification time, the MDS revokes any write capabilities to momentarily stop updates and collect up-to-date size and mtime values from all writers. The highest
values are returned with the stat reply, and capabilities
are reissued to allow further progress. Although stopping multiple writers may seem drastic, it is necessary to
ensure proper serializability. (For a single writer, a correct value can be retrieved from the writing client without
interrupting progress.) Applications for which coherent
behavior is unnecessary (victims of a POSIX interface that doesn't align with their needs) can use statlite [31],
which takes a bit mask specifying which inode fields are
not required to be coherent.
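The multi-writer stat path can be summarized with the following hedged sketch; the session methods (revoke_write_caps, report_size_mtime, reissue_write_caps) are assumed for illustration and do not name Ceph's actual capability protocol.

    def coherent_stat(inode, writer_sessions):
        if not writer_sessions:
            return inode.size, inode.mtime
        if len(writer_sessions) == 1:
            # A single writer can report correct values without being interrupted.
            return writer_sessions[0].report_size_mtime(inode)

        # Momentarily revoke write capabilities so no update races with collection.
        for s in writer_sessions:
            s.revoke_write_caps(inode)

        # Collect up-to-date size and mtime values from all writers; the highest
        # values are what the stat reply carries.
        reports = [s.report_size_mtime(inode) for s in writer_sessions]
        inode.size = max(size for size, _ in reports)
        inode.mtime = max(mtime for _, mtime in reports)

        # Reissue capabilities so the writers can make further progress.
        for s in writer_sessions:
            s.reissue_write_caps(inode)

        return inode.size, inode.mtime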
[Figure 2 diagram: a directory tree rooted at Root, partitioned across MDS 0 through MDS 4.]
Figure 2: Ceph dynamically maps subtrees of the directory hierarchy to metadata servers based on the current
workload. Individual directories are hashed across multiple nodes only when they become hot spots.
[Figure 3 diagram: file data is striped across objects, with (ino, ono) determining each oid; objects are grouped into placement groups (pgid), and CRUSH(pgid) maps each PG to an ordered list of OSDs (e.g., (osd1, osd2)) grouped by failure domain.]
5.2 Replication
In contrast to systems like Lustre [4], which assume one
can construct sufficiently reliable OSDs using mechanisms like RAID or fail-over on a SAN, we assume that
in a petabyte or exabyte system failure will be the norm
rather than the exception, and at any point in time several
OSDs are likely to be inoperable. To maintain system
availability and ensure data safety in a scalable fashion,
RADOS manages its own replication of data using a variant of primary-copy replication [2], while taking steps to
minimize the impact on performance.
Data is replicated in terms of placement groups, each
of which is mapped to an ordered list of n OSDs (for
n-way replication). Clients send all writes to the first
non-failed OSD in an object's PG (the primary), which
assigns a new version number for the object and PG and
forwards the write to any additional replica OSDs. After
each replica has applied the update and responded to the
primary, the primary applies the update locally and the
write is acknowledged to the client. Reads are directed
at the primary. This approach spares the client from any of
the complexity surrounding synchronization or serialization between replicas, which can be onerous in the presence of other writers or failure recovery. It also shifts the
bandwidth consumed by replication from the client to the
OSD cluster's internal network, where we expect greater
resources to be available. Intervening replica OSD failures are ignored, as any subsequent recovery (see Section 5.5) will reliably restore replica consistency.
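The write path can be sketched as follows; the classes and the synchronous method calls standing in for network messages are illustrative, not RADOS's actual interfaces.

    class ReplicaOSD:
        def __init__(self):
            self.objects = {}                     # oid -> (version, data)

        def apply_update(self, oid, version, data):
            self.objects[oid] = (version, data)
            return "ack"

    class PrimaryOSD(ReplicaOSD):
        def __init__(self, replicas):
            super().__init__()
            self.replicas = replicas
            self.pg_version = 0

        def handle_client_write(self, oid, data):
            # The primary assigns a new version number for the object and PG...
            self.pg_version += 1
            version = self.pg_version

            # ...forwards the write to the additional replicas and waits for each to apply it...
            for replica in self.replicas:
                assert replica.apply_update(oid, version, data) == "ack"

            # ...then applies the update locally and acknowledges the client.
            self.apply_update(oid, version, data)
            return "ack"

    # Reads are directed at the primary; replication traffic stays on the OSD
    # cluster's internal network rather than consuming client bandwidth.
    primary = PrimaryOSD(replicas=[ReplicaOSD(), ReplicaOSD()])
    primary.handle_client_write("1234.00000000", b"hello")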
[Figure 4 diagram: timeline of a replicated write between a client, the primary, and two replicas. The client sends the write to the primary, which forwards it to the replicas; each replica applies the update and acks, the primary applies it and acks the client, and a final commit follows once the update is committed to disk.]
should be, even if their locally stored object set may not
match. Only after the primary determines the correct PG
state and shares it with any replicas is I/O to objects in
the PG permitted. OSDs are then independently responsible for retrieving missing or outdated objects from their
peers. If an OSD receives a request for a stale or missing
object, it delays processing and moves that object to the
front of the recovery queue.
For example, suppose osd1 crashes and is marked
down, and osd2 takes over as primary for pgA. If osd1
recovers, it will request the latest map on boot, and a
monitor will mark it as up. When osd2 receives the resulting map update, it will realize it is no longer primary
for pgA and send the pgA version number to osd1.
osd1 will retrieve recent pgA log entries from osd2,
tell osd2 its contents are current, and then begin processing requests while any updated objects are recovered
in the background.
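A small sketch of the recovery behavior in this example, with the peer and queue interfaces assumed purely for illustration: the recovering OSD begins serving requests immediately, pulls missing objects in the background, and moves any object that is requested while still missing to the front of its recovery queue.

    from collections import deque

    class RecoveringPG:
        def __init__(self, missing_objects, peer):
            self.peer = peer                            # OSD holding up-to-date copies (e.g., osd2)
            self.recovery_queue = deque(missing_objects)
            self.local = {}                             # oid -> data already recovered

        def handle_read(self, oid):
            if oid not in self.local:
                # A stale or missing object was requested: move it to the front of
                # the recovery queue and recover it before processing the request.
                if oid in self.recovery_queue:
                    self.recovery_queue.remove(oid)
                self.recovery_queue.appendleft(oid)
                self.recover_next()
            return self.local[oid]

        def recover_next(self):
            # Pull one missing object from a peer; normally driven in the background.
            if self.recovery_queue:
                oid = self.recovery_queue.popleft()
                self.local[oid] = self.peer.fetch_object(oid)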
Because failure recovery is driven entirely by individual OSDs, each PG affected by a failed OSD will recover in parallel to (very likely) different replacement
OSDs. This approach, based on the Fast Recovery Mechanism (FaRM) [37], decreases recovery times and improves overall data safety.
6 Performance and Scalability Evaluation

We evaluate our prototype under a range of microbenchmarks to demonstrate its performance, reliability, and scalability. In all tests, clients, OSDs, and MDSs are user processes running on a dual-processor Linux cluster with SCSI disks and communicating using TCP. In general, each OSD or MDS runs on its own host, while tens or hundreds of client instances may share the same host while generating workload.
[Figure 7 plots: write latency vs. write size (4 KB to 4096 KB) for no replication, 2x, and 3x replication; the right panel contrasts synchronous writes with sync lock, async write.]
Figure 7: Write latency for varying write sizes and replication. More than two replicas incurs minimal additional
cost for small writes because replicated updates occur
concurrently. For large synchronous writes, transmission times dominate. Clients partially mask that latency
for writes over 128 KB by acquiring exclusive locks and
asynchronously flushing the data.
[Plot residue: per-OSD throughput (MB/sec) vs. I/O size (KB) for reads and writes, with legends comparing ebofs, ext3, reiserfs, and xfs, and comparing crush (32K and 4K PGs), hash (32K and 4K PGs), and linear placement.]
Figure 6: Performance of EBOFS compared to general-purpose file systems. Although small writes suffer from
coarse locking in our prototype, EBOFS nearly saturates
the disk for writes larger than 32 KB. Since EBOFS lays
out data in large extents when it is written in large increments, it has significantly better read performance.
write out large files, striped over 16 MB objects, and read
them back again. Although small read and write performance in EBOFS suffers from coarse threading and
locking, EBOFS very nearly saturates the available disk
bandwidth for write sizes larger than 32 KB, and significantly outperforms the others for read workloads because
data is laid out in extents on disk that match the write
sizes, even when they are very large. Performance was
measured using a fresh file system. Experience with an
earlier EBOFS design suggests it will experience significantly lower fragmentation than ext3, but we have not yet
evaluated the current implementation on an aged file system. In any case, we expect the performance of EBOFS
after aging to be no worse than the others.
6.1.2 Write Latency
Figure 7 shows the synchronous write latency (y) for a single writer with varying write sizes (x) and replication.
[Plot residue: an x-axis of OSD cluster size (10 to 26); metadata update latency vs. metadata replication (0 to 4x) for diskless and local-disk configurations; and cumulative stat, readdir, and readdirplus times for fresh and primed caches with 1 and 10 files per directory.]
(a) Metadata update latency for an MDS with and without a local disk; zero corresponds to no journaling. (b) Cumulative time consumed during a file system walk.
[Figure 10 plot: per-MDS throughput (0 to 5000 ops/sec) vs. MDS cluster size (0 to 128 nodes) for the makedirs, makefiles, openshared, openssh+include, and openssh+lib workloads.]
Figure 10: Per-MDS throughput under a variety of workloads and cluster sizes. As the cluster grows to 128
nodes, efficiency drops no more than 50% below perfect
linear (horizontal) scaling for most workloads, allowing
vastly improved performance over existing systems.
primed MDS cache reduces readdir times. Subsequent
stats are not affected, because inode contents are embedded in directories, allowing the full directory contents to
be fetched into the MDS cache with a single OSD access. Ordinarily, cumulative stat times would dominate
for larger directories. Subsequent MDS interaction can
be eliminated by using readdirplus, which explicitly bundles stat and readdir results in a single operation, or by
relaxing POSIX to allow stats immediately following a
readdir to be served from client caches (the default).
6.2.3 Metadata Scaling
We evaluate metadata scalability using a 430-node partition of the alc Linux cluster at Lawrence Livermore
National Laboratory (LLNL). Figure 10 shows per-MDS
throughput (y) as a function of MDS cluster size (x),
such that a horizontal line represents perfect linear scaling. In the makedirs workload, each client creates a tree
[Plot residue: latency (ms, 0 to 50) vs. per-MDS throughput (0 to 2000 ops/sec) for 4, 16, and 128 MDSs.]
7 Experiences
We were pleasantly surprised by the extent to which replacing file allocation metadata with a distribution function became a simplifying force in our design. Although this placed greater demands on the function itself, once we realized exactly what those requirements
were, CRUSH was able to deliver the necessary scalability, flexibility, and reliability. This vastly simplified
our metadata workload while providing both clients and
OSDs with complete and independent knowledge of the
data distribution. The latter enabled us to delegate responsibility for data replication, migration, failure detection, and recovery to OSDs, distributing these mechanisms in a way that effectively leveraged their bundled
CPU and memory. RADOS has also opened the door to
a range of future enhancements that elegantly map onto
our OSD model, such as bit error detection (as in the
Google File System [7]) and dynamic replication of data
based on workload (similar to AutoRAID [34]).
Although it was tempting to use existing kernel file
systems for local object storage (as many other systems
have done [4, 7, 9]), we recognized early on that a file
system tailored for object workloads could offer better
performance [27]. What we did not anticipate was the
disparity between the existing file system interface and
our requirements, which became evident while developing the RADOS replication and reliability mechanisms.
EBOFS was surprisingly quick to develop in user-space,
offered very satisfying performance, and exposed an interface perfectly suited to our requirements.
One of the largest lessons in Ceph was the importance
of the MDS load balancer to overall scalability, and the
complexity of choosing what metadata to migrate where
and when. Although in principle our design and goals
seem quite simple, the reality of distributing an evolving workload over a hundred MDSs highlighted additional subtleties. Most notably, MDS performance has
8 Related Work
High-performance scalable file systems have long been
a goal of the HPC community, which tends to place a
heavy load on the file system [18, 27]. Although many
file systems attempt to meet this need, they do not provide the same level of scalability that Ceph does. Large-scale systems like OceanStore [11] and Farsite [1] are
designed to provide petabytes of highly reliable storage,
and can provide simultaneous access to thousands of separate files to thousands of clients, but cannot provide
high-performance access to a small set of files by tens
of thousands of cooperating clients due to bottlenecks in
subsystems such as name lookup. Conversely, parallel
file and storage systems such as Vesta [6], Galley [17],
PVFS [12], and Swift [5] have extensive support for
striping data across multiple disks to achieve very high
transfer rates, but lack strong support for scalable metadata access or robust data distribution for high reliability.
For example, Vesta permits applications to lay their data
out on disk, and allows independent access to file data on
each disk without reference to shared metadata. However, like many other parallel file systems, Vesta does
not provide scalable support for metadata lookup. As a
result, these file systems typically provide poor performance on workloads that access many small files or require many metadata operations. They also typically suffer from block allocation issues: blocks are either allocated centrally or via a lock-based mechanism, preventing them from scaling well for write requests from thousands of clients to thousands of disks. GPFS [24] and
StorageTank [16] partially decouple metadata and data
9 Future Work
Some core Ceph elements have not yet been implemented, including MDS failure recovery and several
POSIX calls. Two security architecture and protocol
variants are under consideration, but neither has yet been implemented [13, 19]. We also plan to investigate the practicality of client callbacks on namespace-to-inode translation metadata. For static regions of the file
system, this could allow opens (for read) to occur without MDS interaction. Several other MDS enhancements
are planned, including the ability to create snapshots of
arbitrary subtrees of the directory hierarchy [28].
Although Ceph dynamically replicates metadata when
flash crowds access single directories or files, the same
is not yet true of file data. We plan to allow OSDs to
dynamically adjust the level of replication for individual
objects based on workload, and to distribute read traffic
across multiple OSDs in the placement group. This will
allow scalable access to small amounts of data, and may
facilitate fine-grained OSD load balancing using a mechanism similar to D-SPTF [15].
Finally, we are working on developing a quality-of-service architecture to allow both aggregate class-based traffic prioritization and OSD-managed reservation-based bandwidth and latency guarantees. In addition to supporting applications with QoS requirements,
this will help balance RADOS replication and recovery
operations with regular workload. A number of other
EBOFS enhancements are planned, including improved
allocation logic, data scouring, and checksums or other
bit-error detection mechanisms to improve data safety.
10 Conclusions
Ceph addresses three critical challenges of storage systems (scalability, performance, and reliability) by occupying a unique point in the design space. By shedding design assumptions like allocation lists found in
nearly all existing systems, we maximally separate data
from metadata management, allowing them to scale independently. This separation relies on CRUSH, a data distribution function that generates a pseudo-random distribution, allowing clients to calculate object locations instead of looking them up. CRUSH enforces data replica
separation across failure domains for improved data
safety while efficiently coping with the inherently dynamic nature of large storage clusters, where devices failures, expansion and cluster restructuring are the norm.
RADOS leverages intelligent OSDs to manage data
replication, failure detection and recovery, low-level disk
allocation, scheduling, and data migration without encumbering any central server(s). Although objects can be
considered files and stored in a general-purpose file system, EBOFS provides more appropriate semantics and
superior performance by addressing the specific workloads and interface requirements present in Ceph.
Finally, Ceph's metadata management architecture addresses one of the most vexing problems in highly scalable storage: how to efficiently provide a single uniform directory hierarchy obeying POSIX semantics with performance that scales with the number of metadata servers. Ceph's dynamic subtree partitioning is a
uniquely scalable approach, offering both efficiency and
the ability to adapt to varying workloads.
Ceph is licensed under the LGPL and is available at
http://ceph.sourceforge.net/.
Acknowledgments
This work was performed under the auspices of the U.S.
Department of Energy by the University of California,
Lawrence Livermore National Laboratory under Contract W-7405-Eng-48. Research was funded in part by
the Lawrence Livermore, Los Alamos, and Sandia National Laboratories. We would like to thank Bill Loewe,
Tyce McLarty, Terry Heidelberg, and everyone else at
LLNL who talked to us about their storage trials and
tribulations, and who helped facilitate our two days of
dedicated access time on alc. We would also like to
References
[1] A. Adya, W. J. Bolosky, M. Castro, R. Chaiken, G. Cermak, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer,
and R. Wattenhofer. FARSITE: Federated, available, and
reliable storage for an incompletely trusted environment.
In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), Boston, MA,
Dec. 2002. USENIX.
[2] P. A. Alsberg and J. D. Day. A principle for resilient
sharing of distributed resources. In Proceedings of the
2nd International Conference on Software Engineering,
pages 562–570. IEEE Computer Society Press, 1976.
[3] A. Azagury, V. Dreizin, M. Factor, E. Henis, D. Naor,
N. Rinetzky, O. Rodeh, J. Satran, A. Tavory, and
L. Yerushalmi. Towards an object store. In Proceedings
of the 20th IEEE / 11th NASA Goddard Conference on
Mass Storage Systems and Technologies, pages 165–176,
Apr. 2003.
[4] P. J. Braam.
The Lustre storage architecture.
http://www.lustre.org/documentation.html, Cluster File
Systems, Inc., Aug. 2004.
[5] L.-F. Cabrera and D. D. E. Long. Swift: Using distributed
disk striping to provide high I/O data rates. Computing
Systems, 4(4):405–436, 1991.
[6] P. F. Corbett and D. G. Feitelson. The Vesta parallel
file system. ACM Transactions on Computer Systems,
14(3):225–264, 1996.
[7] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google
file system. In Proceedings of the 19th ACM Symposium
on Operating Systems Principles (SOSP '03), Bolton
Landing, NY, Oct. 2003. ACM.
[8] G. A. Gibson, D. F. Nagle, K. Amiri, J. Butler, F. W.
Chang, H. Gobioff, C. Hardin, E. Riedel, D. Rochberg,
and J. Zelenka. A cost-effective, high-bandwidth storage architecture. In Proceedings of the 8th International
Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS), pages 92–103, San Jose, CA, Oct. 1998.
[9] D. Hildebrand and P. Honeyman. Exporting storage systems in a scalable manner with pNFS. Technical Report
CITI-05-1, CITI, University of Michigan, Feb. 2005.
[10] D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin,
and R. Panigrahy. Consistent hashing and random trees:
Distributed caching protocols for relieving hot spots on
the World Wide Web. In ACM Symposium on Theory of
Computing, pages 654–663, May 1997.
[11] J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels,
R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer,
C. Wells, and B. Zhao. OceanStore: An architecture for
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25] M. Szeredi.
File System in User Space.
http://fuse.sourceforge.net, 2006.
[26] H. Tang, A. Gulbeden, J. Zhou, W. Strathearn, T. Yang,
and L. Chu. A self-organizing storage cluster for parallel data-intensive applications. In Proceedings of the
2004 ACM/IEEE Conference on Supercomputing (SC '04), Pittsburgh, PA, Nov. 2004.
[27] F. Wang, Q. Xin, B. Hong, S. A. Brandt, E. L. Miller,
D. D. E. Long, and T. T. McLarty. File system workload
analysis for large scale scientific computing applications.
In Proceedings of the 21st IEEE / 12th NASA Goddard
Conference on Mass Storage Systems and Technologies,
pages 139–152, College Park, MD, Apr. 2004.
[28] S. A. Weil. Scalable archival data and metadata management in object-based file systems. Technical Report
SSRC-04-01, University of California, Santa Cruz, May
2004.
[29] S. A. Weil, S. A. Brandt, E. L. Miller, and C. Maltzahn.
CRUSH: Controlled, scalable, decentralized placement
of replicated data. In Proceedings of the 2006 ACM/IEEE
Conference on Supercomputing (SC '06), Tampa, FL,
Nov. 2006. ACM.
[30] S. A. Weil, K. T. Pollack, S. A. Brandt, and E. L. Miller.
Dynamic metadata management for petabyte-scale file
systems. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04). ACM, Nov. 2004.
[31] B. Welch. POSIX IO extensions for HPC. In Proceedings of the 4th USENIX Conference on File and Storage
Technologies (FAST), Dec. 2005.
[32] B. Welch and G. Gibson. Managing scalability in object
storage systems for HPC Linux clusters. In Proceedings
of the 21st IEEE / 12th NASA Goddard Conference on
Mass Storage Systems and Technologies, pages 433–445,
Apr. 2004.
[33] B. S. White, M. Walker, M. Humphrey, and A. S.
Grimshaw. LegionFS: A secure and scalable file system supporting cross-domain high-performance applications. In Proceedings of the 2001 ACM/IEEE Conference
on Supercomputing (SC '01), Denver, CO, 2001.
[34] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The
HP AutoRAID hierarchical storage system. In Proceedings of the 15th ACM Symposium on Operating Systems
Principles (SOSP '95), pages 96–108, Copper Mountain,
CO, 1995. ACM Press.
[35] T. M. Wong, R. A. Golding, J. S. Glider, E. Borowsky,
R. A. Becker-Szendy, C. Fleiner, D. R. Kenchammana-Hosekote, and O. A. Zaki. Kybos: self-management for distributed brick-based storage. Research Report RJ
10356, IBM Almaden Research Center, Aug. 2005.
[36] J. C. Wu and S. A. Brandt. The design and implementation of AQuA: an adaptive quality of service aware
object-based storage device. In Proceedings of the 23rd
IEEE / 14th NASA Goddard Conference on Mass Storage
Systems and Technologies, pages 209–218, College Park,
MD, May 2006.
[37] Q. Xin, E. L. Miller, and T. J. E. Schwarz. Evaluation
of distributed recovery in large-scale storage systems. In
Proceedings of the 13th IEEE International Symposium
on High Performance Distributed Computing (HPDC),
pages 172–181, Honolulu, HI, June 2004.