Figure 1. A view of content files after several checkpoints (.content0 holds A0 and B0, .content1 holds C1 and D1, .content2 holds A2 and C2). The following scenario creates this pattern of content files: write pages A0 and B0, checkpoint, write pages C1 and D1, checkpoint, and finally write pages A2 and C2. Filled areas correspond to empty areas of the sparse files.

Unfortunately, this is not enough to allow versioning of file state: when an empty area of a sparse file is read, zero bytes are returned, so an empty area cannot be distinguished from an area that was explicitly written with zero bytes.
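The following minimal C program illustrates this ambiguity; the file name mirrors the content files above, and the offsets are arbitrary. A hole left in a sparse file and a page explicitly filled with zeros read back identically:

    /* A hole and an explicitly written zero page are indistinguishable
     * when read back through the POSIX interface. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGE 4096

    int main(void)
    {
        char buf[PAGE], zeros[PAGE];
        int fd = open(".content0", O_RDWR | O_CREAT | O_TRUNC, 0600);

        memset(zeros, 0, PAGE);
        pwrite(fd, zeros, PAGE, 0);        /* page 0: zeros written explicitly   */
        pwrite(fd, "data", 4, 2 * PAGE);   /* page 2: data; page 1 stays a hole  */

        pread(fd, buf, PAGE, 0 * PAGE);    /* returns PAGE zero bytes            */
        pread(fd, buf, PAGE, 1 * PAGE);    /* hole: also returns PAGE zero bytes */
        /* Nothing in the file itself tells whether page 1 belongs to this
         * version or was simply never written. */
        close(fd);
        return 0;
    }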
Thus, our model stores an additional piece of information: which content file holds a given version of each page of the file. This information is kept in extent-like structures referring to a contiguous range of objects [8]. A single structure can therefore reference a number of contiguous pages, which reduces the space needed to store this information when many contiguous pages share a common version number. To allow fast search, insertion and deletion of these extents, they are stored in a B-tree sorted by the starting page number. A new B-tree is created at each checkpoint, so there is one B-tree per version of the file. An example of B-trees along with the corresponding content files is shown in Figure 2.
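The C sketch below shows one possible shape for this metadata. The structure and function names are illustrative only, and a sorted array stands in for the per-checkpoint B-tree, which is likewise keyed by the starting page number:

    #include <stdint.h>
    #include <stddef.h>

    /* An extent maps a run of contiguous pages to the content file
     * (version) that stores them. */
    struct extent {
        uint64_t start_page;  /* first logical page covered       */
        uint64_t nr_pages;    /* length of the contiguous run     */
        uint32_t version;     /* pages live in .content<version>  */
    };

    /* Find which content file holds 'page' in a given checkpoint.
     * Returns the version number, or -1 if the page was never written. */
    static int lookup_version(const struct extent *tree, size_t n, uint64_t page)
    {
        for (size_t i = 0; i < n; i++)     /* ordered B-tree search in reality */
            if (page >= tree[i].start_page &&
                page < tree[i].start_page + tree[i].nr_pages)
                return (int)tree[i].version;
        return -1;
    }

    /* Checkpoint 1 of Figure 2: pages 0-1 come from .content0 (V0),
     * pages 2-3 from .content1 (V1). */
    static const struct extent ckpt1[] = {
        { 0, 2, 0 },
        { 2, 2, 1 },
    };
    /* lookup_version(ckpt1, 2, 3) == 1: page 3 is read from .content1. */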
This architecture allows checkpointed file content to be retained in the file system even if the application overwrites it. If an application removes a file, the parent directory is versioned in the same way. The metadata and content files of the deleted file are not removed: they can be used when a process is restarted.

When a volatile state checkpoint is deleted, the corresponding persistent state is not useful anymore and should be removed from the file system. While a block model would allow us to release unused data by freeing the corresponding blocks, the plain files we rely on must be accessed through the standard file interface. The POSIX API does not provide a way to remove part of a file unless it is located at the end (in that case truncate can be used), and overwriting data with zero bytes does not make the file sparse again on the ext2/ext3 file systems, and probably on most others (this feature could be implemented in a specific file system, but it would break our portability goal).

A way to solve this problem would be to mimic the block model with many small files, but this raises two issues. First, choosing the best file size is difficult: it depends on the application's file access pattern. Second, using many small files on the native file system for each versioned file creates a large number of inodes, which makes fsck longer to scan and repair the file system, wastes disk space (although an inode is small, it still occupies some space on disk), may hit the limit on the number of inodes, and more generally hurts performance.

To solve this issue, we copy the still-used pages of the content file to be removed into a temporary sparse file, and then replace the content file with this temporary file. This can be expensive, but since checkpoint removal is not a high-priority operation (it can be done in the background or when the system is idle), this is not a major issue. However, a downside of keeping old content files for pages that are still in use is that the number of content files grows over time. A cleaner functionality (fsck-like) could be added to copy still-used pages to a more recent content file and fix the corresponding references in the B-trees.
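A minimal sketch of this cleanup step follows; the function name, the representation of the still-used page list, and the temporary-file naming are ours and purely illustrative. Only referenced pages are copied, at their original offsets, so the temporary file stays sparse and the blocks of unreferenced pages are released when the old file is replaced:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096

    int compact_content_file(const char *path, const uint64_t *live_pages,
                             size_t nr_live)
    {
        char tmp[4096], buf[PAGE_SIZE];
        int in, out;

        snprintf(tmp, sizeof(tmp), "%s.tmp", path);
        in  = open(path, O_RDONLY);
        out = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (in < 0 || out < 0) {
            if (in >= 0) close(in);
            if (out >= 0) close(out);
            return -1;
        }

        for (size_t i = 0; i < nr_live; i++) {
            off_t off = (off_t)live_pages[i] * PAGE_SIZE;
            /* copy the page at the same offset: untouched areas stay holes */
            if (pread(in, buf, PAGE_SIZE, off) != PAGE_SIZE ||
                pwrite(out, buf, PAGE_SIZE, off) != PAGE_SIZE)
                goto fail;
        }
        close(in);
        if (fsync(out) < 0 || close(out) < 0)
            return -1;
        return rename(tmp, path);   /* atomically replace the old content file */
    fail:
        close(in);
        close(out);
        unlink(tmp);
        return -1;
    }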
3.3. Stability of Checkpoint Data

Data retained by a process checkpoint mechanism must be saved on stable storage: when failures happen, checkpointed process states must remain accessible so that applications can be restarted.

Most existing persistent state checkpointing systems do not take this issue into account and assume that an external subsystem solves it. They typically use a central NFS server for file storage, which is assumed to be stable thanks to traditional redundancy techniques such as RAID [10].

In the context of a distributed file system where several nodes are used in parallel to store data, fault tolerance must be a feature of the file system stack in order to provide truly stable storage.

A straightforward solution would be, for instance, to use synchronous replication (similar to RAID1) at the network level, as illustrated in Figure 3. However, this approach has a serious drawback: duplicating writes increases the load on the network and the disks, potentially decreasing performance. Moreover, if the cluster runs applications that are not checkpointed (e.g. applications with real-time constraints for which restarting from a previous state makes no sense), their performance could decrease as well.
Figure 2. B-trees of successive checkpoints along with the corresponding content files. Checkpoint 0: the file is 2 pages (8 KB), one extent (0; 2; V0) pointing to .content0 (A0, B0). Checkpoint 1: 2 new pages were appended, extents (0; 2; V0) and (2; 2; V1), the new pages C1 and D1 being stored in .content1. Checkpoint 2: pages A and C were modified, extents (0; 1; V2), (1; 1; V0), (2; 1; V2) and (3; 1; V1), the new versions A2 and C2 being stored in .content2.
Figure 3. Multiple write operations using a replication model at the network level. Page A is modified three times with versions A0, A1, and A2; a synchronous replication model sends data over the network for each write operation.

In our context of an HPC cluster designed to execute applications reliably, our goal is not to make file data resilient to failures but to run applications successfully while minimizing the time lost in case of failure. When a crash happens, applications are restarted from the most recent checkpoint, and all file modifications made since this checkpoint are discarded. Since a synchronous replication mechanism provides fault tolerance for this kind of data, it adds I/O overhead without any benefit.

The model we propose consists in replicating file state on a remote node only at checkpoint time. This design, based on periodic replication, is close to the one suggested for disaster recovery of enterprise data [11]. Coupling this kind of replication with process checkpoint/restart makes it possible to provide fault tolerance for HPC applications. When a checkpoint is taken, our model replicates the data created or modified by the process since the last checkpoint, if this data is still valid (i.e. not overwritten or deleted). For example, in case of multiple writes to the same page of a file, only the last version needs to be replicated, as illustrated in Figure 4. If data is written and invalidated between two checkpoints, it does not need to be replicated at all: files created and deleted between two checkpoints are never transferred over the network. This can be particularly interesting for applications doing large amounts of I/O on temporary files that are later removed. The replication can also be done in the background to keep its performance impact on the network and the disks low.
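The sketch below illustrates the bookkeeping this behaviour implies; the names and the fixed-size dirty set are ours and stand in for the actual implementation. Between two checkpoints only a set of dirty pages is maintained; the data itself is read and sent to the replica when the checkpoint is taken, so intermediate versions and files deleted in the meantime are never transferred:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    #define MAX_DIRTY 1024   /* bounded for the sketch */

    struct dirty_page { uint32_t file_id; uint64_t page; bool valid; };

    static struct dirty_page dirty[MAX_DIRTY];
    static size_t nr_dirty;

    /* Called on every write: remember the page once, however many times it
     * is written; only its final content will be replicated. */
    void note_write(uint32_t file_id, uint64_t page)
    {
        for (size_t i = 0; i < nr_dirty; i++)
            if (dirty[i].file_id == file_id && dirty[i].page == page) {
                dirty[i].valid = true;
                return;
            }
        if (nr_dirty < MAX_DIRTY)
            dirty[nr_dirty++] = (struct dirty_page){ file_id, page, true };
    }

    /* Called when a file is deleted before the next checkpoint: its pages
     * become invalid and are not sent over the network at all. */
    void note_delete(uint32_t file_id)
    {
        for (size_t i = 0; i < nr_dirty; i++)
            if (dirty[i].file_id == file_id)
                dirty[i].valid = false;
    }

    /* Called at checkpoint time, possibly in the background: ship the last
     * version of each still-valid page to the replica, then reset the set. */
    void replicate_checkpoint(void (*send_page)(uint32_t, uint64_t))
    {
        for (size_t i = 0; i < nr_dirty; i++)
            if (dirty[i].valid)
                send_page(dirty[i].file_id, dirty[i].page);
        nr_dirty = 0;
    }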
When a node fails, the checkpoint data it stores can be accessed through its replica. To remain fault tolerant, at least two copies of checkpoint data should always be available in the cluster, so a new replica should be initialized after a failure. Multiple replicas could be used to cope with several failures; in that case, quorum-based mechanisms could improve performance by avoiding the synchronization of all replicas at each checkpoint. Discussing the pros and cons of different replica man-
Figure 4. Multiple write operations using our replication model. Page A is modified three times with versions A0, A1, and A2; the replication, triggered by the checkpoint, only sends the last version over the network.

Performance plot: write (per byte), write (per block), and rewrite performance as a function of the checkpoint period (10, 30, 60, 120, 300 s), compared with checkpointing disabled.
Table: number of write operations and size of written data for each Bonnie++ step.

Plot: file creation and file deletion performance (operations per second).

Plot: size of data sent over the network (GB) with RAID1-like replication compared to our replication model.

6. Future Work

6.1. Checkpoint Replication in Grids
7. Conclusion

This paper presents a persistent state checkpointing framework that can be used to coherently restart applications using files.

The first part of this mechanism is an efficient and portable file versioning mechanism that can be implemented above a large number of existing file systems. We showed through experimentation with a standard I/O benchmark that the performance impact of this mechanism is negligible.

The second part of our framework is a replication model for distributed environments that takes advantage of process checkpoint/restart to replicate only relevant data. Depending on the file access pattern of the application and on the checkpoint frequency, this model can significantly reduce the amount of data sent over the network. This enables increased I/O performance compared to a traditional synchronous replication strategy. The model is particularly interesting in the context of grid computing, where network constraints are several orders of magnitude higher than in a cluster.

References

[1] TOP500. http://www.top500.org/, November 2008.
[2] A. Calderon, F. Garcia-Carballeira, J. Carretero, J. M. Perez, and L. M. Sanchez. A Fault Tolerant MPI-IO Implementation using the Expand Parallel File System. In PDP '05: Proceedings of the 13th Euromicro Conference on Parallel, Distributed and Network-Based Processing, pages 274–281, 2005.
[3] R. Coker. The Bonnie++ hard drive and file system benchmark suite. http://www.coker.com.au/bonnie++/.
[4] T. Cortes, C. Franke, Y. Jégou, T. Kielmann, D. Laforenza, B. Matthews, C. Morin, L. P. Prieto, and A. Reinefeld. XtreemOS: a Vision for a Grid Operating System. White paper, 2008.
[5] J. Duell, P. Hargrove, and E. Roman. The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart. Technical report, Lawrence Berkeley National Laboratory, 2002.
[6] D. Hitz, J. Lau, and M. Malcolm. File System Design for an NFS File Server Appliance. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 235–246, 1994.
[7] A. Lèbre, R. Lottiaux, E. Focht, and C. Morin. Reducing Kernel Development Complexity in Distributed Environments. In Euro-Par 2008: Proceedings of the 14th International Euro-Par Conference, pages 576–586, 2008.
[8] L. W. McVoy and S. R. Kleiman. Extent-like Performance from a UNIX File System. In Proceedings of the USENIX Winter 1991 Technical Conference, pages 33–43, 1991.
[9] K.-K. Muniswamy-Reddy, C. P. Wright, A. Himmer, and E. Zadok. A Versatile and User-Oriented Versioning File System. In FAST '04: Proceedings of the 3rd USENIX Conference on File and Storage Technologies, pages 115–128, 2004.
[10] D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, pages 109–116, 1988.
[11] R. H. Patterson, S. Manley, M. Federwisch, D. Hitz, S. Kleiman, and S. Owara. SnapMirror: File-System-Based Asynchronous Mirroring for Disaster Recovery. In FAST '02: Proceedings of the Conference on File and Storage Technologies, pages 117–129, 2002.
[12] D. Pei, D. Wang, M. Shen, and W. Zheng. Design and Implementation of a Low-Overhead File Checkpointing Approach. In IEEE International Conference on High Performance Computing in the Asia-Pacific Region, pages 439–441, 2000.
[13] J. C. Sancho, F. Petrini, K. Davis, R. Gioiosa, and S. Jiang. Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance. In IPDPS '05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, 2005.
[14] S. Savage and J. Wilkes. AFRAID: A Frequently Redundant Array of Independent Disks. In Proceedings of the USENIX 1996 Annual Technical Conference, 1996.
[15] O. Tatebe, Y. Morita, S. Matsuoka, N. Soda, and S. Sekiguchi. Grid Datafarm Architecture for Petascale Data Intensive Computing. In CCGRID '02: Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 102–110, 2002.
[16] G. Vallée, R. Lottiaux, D. Margery, and C. Morin. Ghost Process: a Sound Basis to Implement Process Duplication, Migration and Checkpoint/Restart in Linux Clusters. In ISPDC '05: Proceedings of the 4th International Symposium on Parallel and Distributed Computing, pages 97–104, 2005.
[17] G. Vallée, R. Lottiaux, L. Rilling, J.-Y. Berthou, I. D. Malhen, and C. Morin. A Case for Single System Image Cluster Operating Systems: The Kerrighed Approach. Parallel Processing Letters, 13(2):95–122, 2003.
[18] Y.-M. Wang, P. Y. Chung, Y. Huang, and E. N. Elnozahy. Integrating Checkpointing with Transaction Processing. In FTCS '97: Proceedings of the 27th International Symposium on Fault-Tolerant Computing, pages 304–308, 1997.
[19] Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala. Checkpointing and Its Applications. In IEEE Fault-Tolerant Computing Symposium, pages 22–31, 1995.
[20] E. Zadok, R. Iyer, N. Joukov, G. Sivathanu, and C. P. Wright. On Incremental File System Development. ACM Transactions on Storage, 2(2):161–196, 2006.