Fault tolerance: Validating a mathematical model via a case study of RAxML, an HPC community code
Researchers have noted that the three most difficult and growing problems in the future of high-performance computing will be avoiding, coping with, and recovering from failures. As the scale of computing increases, the Mean Time to Failure (MTTF) of the entire system decreases and, therefore, system resilience and fault tolerance techniques become mandatory. One of the most commonly used fault tolerance schemes is checkpoint/restart; however, it has been predicted that the current checkpoint/restart approach is not scalable. Thus, current research seeks to find scalable fault tolerance techniques as well as to extend the scalability of checkpoint/restart. The checkpoint interval, that is, the period between successive checkpoint operations, plays an important role in application execution time and I/O performance: it can significantly affect both the execution time and the number of checkpoint I/O operations performed by the application. The frequency of checkpoint I/O operations performed by the application, along with its productive I/O, determines the demand made by the application on the I/O bandwidth of a massively parallel processing (MPP) system. Analytical models exist for finding both the checkpoint interval that minimizes wall-clock execution time and the checkpoint interval that minimizes the number of checkpoint I/O operations generated. This thesis presents a study that quantitatively measures the effect of the checkpoint interval on workload execution time and the number of checkpoint I/O operations generated. The study is based on the execution behavior of RAxML 7.2.6, a popular community code, and RAxML-Light 1.0.6 on an HPC system, as well as on simulations of workloads executed on an HPC system. Parameter values for the HPC runs and the simulations are derived from an analysis of historical failure data for 22 systems made available by the Computer Failure Data Repository (CFDR). This analysis is also presented in the thesis.
Our research showed that increasing the checkpoint interval above the value that is optimal with respect to execution time results in a significant decrease in the number of checkpoint I/O operations with only a marginal increase in execution time. This shows that the associated analytical model holds for the cases studied.
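The trade-off described above can be illustrated with a small sketch. The abstract does not name the analytical model it validates, so this example assumes Young's first-order approximation for the execution-time-optimal interval, sqrt(2 x checkpoint cost x MTTF), together with a simple first-order estimate of expected wall-clock time. The function names and the parameter values (100 hours of work, a 5-minute checkpoint, a 24-hour MTTF) are hypothetical and are not taken from the thesis or from the CFDR data.

```python
import math


def young_interval(checkpoint_cost, mttf):
    """Young's first-order approximation of the checkpoint interval
    that minimizes wall-clock time (assumed model, not the thesis's)."""
    return math.sqrt(2.0 * checkpoint_cost * mttf)


def checkpoint_count(solve_time, interval):
    """Number of checkpoints written while doing solve_time of useful work."""
    return math.floor(solve_time / interval)


def expected_wall_time(solve_time, interval, checkpoint_cost, mttf,
                       restart_cost=0.0):
    """First-order estimate of expected wall-clock time.

    Assumptions (illustrative only): failures arrive at rate 1/mttf, and
    each failure loses on average half of one work-plus-checkpoint segment
    plus the restart cost.
    """
    # Time to finish with no failures: useful work plus checkpoint overhead.
    fault_free = solve_time * (1.0 + checkpoint_cost / interval)
    # Average time lost per failure.
    loss_per_failure = (interval + checkpoint_cost) / 2.0 + restart_cost
    # Solve T = fault_free + (T / mttf) * loss_per_failure for T.
    lost_fraction = loss_per_failure / mttf
    if lost_fraction >= 1.0:
        raise ValueError("interval too long relative to MTTF: no progress")
    return fault_free / (1.0 - lost_fraction)


# Hypothetical parameters, in seconds.
SOLVE_TIME = 100 * 3600      # 100 hours of useful computation
CHECKPOINT_COST = 5 * 60     # 5 minutes per checkpoint
MTTF = 24 * 3600             # 24-hour system mean time to failure

tau_opt = young_interval(CHECKPOINT_COST, MTTF)
for factor in (1.0, 2.0):
    tau = factor * tau_opt
    hours = expected_wall_time(SOLVE_TIME, tau, CHECKPOINT_COST, MTTF) / 3600
    count = checkpoint_count(SOLVE_TIME, tau)
    print(f"interval = {tau:.0f} s, checkpoints = {count}, "
          f"estimated wall time = {hours:.1f} h")
```

Under these assumed values, doubling the interval halves the number of checkpoints while the estimated wall-clock time grows by only a few percent, which matches the qualitative behavior reported above. The figures themselves are illustrative and are not results from the study.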
Chakraborty, Bidisha, "Fault tolerance: Validating a mathematical model via a case study of RAxML, an HPC community code" (2012). ETD Collection for University of Texas, El Paso. AAI1512557.