The scale of current and next-generation massively parallel processing (MPP) systems presents significant challenges for fault tolerance. In particular, the standard approach to fault tolerance, application-directed checkpointing, puts considerable strain on the storage system and the interconnection network. The resulting overhead on the application severely impacts performance and scalability. Checkpoint overhead can be reduced by decreasing the checkpoint latency, which is the time to write a checkpoint file, or by increasing the checkpoint interval, which is the compute time between writing checkpoint files. However, increasing the checkpoint interval may increase execution time in the presence of failures, because more work is lost and must be recomputed after each failure. The relationship among the mean time to interruption (MTTI), the checkpoint parameters, and the expected application execution time can be explored using a model, e.g., the model developed by researchers at Los Alamos National Laboratory (LANL). Such models may be used to calculate the optimal periodic checkpoint interval. In this paper, we use the LANL checkpointing model and, through mathematical analysis, show the impact of a change in the checkpoint latency on the optimal checkpoint interval and on the overall execution time of the application.
For checkpoint latencies d1 and d2 and the corresponding optimal checkpoint intervals t1 and t2, our analysis shows the following results: (1) For a given MTTI, if d1 is greater than d2, then t1 is greater than or equal to t2. (2) When the checkpoint interval is fixed, a decrease in checkpoint latency results in a decrease in application execution time. (3) A reduction in checkpoint latency from d1 to d2, accompanied by a change of the checkpoint interval from t1 (the optimal interval for d1) to t2 (the optimal interval for d2), translates to reduced application execution time when the difference between t1 and t2 exceeds a certain threshold value, which can be as large as 12% of t_opt.
In terms of application execution time, the error introduced by approximating the optimal checkpoint interval is not significant. However, when we consider other performance metrics of the application, such as network bandwidth consumption and I/O bandwidth consumption, we conjecture that the information obtained from the analysis presented in this report could be of value in reducing resource consumption.
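Results (1) and (2) above can be illustrated numerically with a minimal sketch. The sketch assumes a widely used exponential-failure model in which the expected execution time is M * exp(R/M) * (exp((t + d)/M) - 1) * Ts / t, where M is the MTTI, R the restart time, d the checkpoint latency, t the checkpoint interval, and Ts the failure-free solve time. Whether this form matches the LANL model used in this work in every detail is an assumption, and the function names and parameter values are hypothetical.

```python
import math


def expected_runtime(solve_time, interval, latency, mtti, restart_time=0.0):
    """Expected wall-clock time under the ASSUMED exponential-failure model.

    All arguments must use the same time unit (e.g., minutes).
    """
    return (mtti * math.exp(restart_time / mtti)
            * (math.exp((interval + latency) / mtti) - 1.0)
            * solve_time / interval)


def optimal_interval(latency, mtti, tol=1e-9):
    """Interval that minimizes expected_runtime for a positive latency.

    Setting the derivative with respect to the interval t to zero gives
    exp(-(t + d)/M) = 1 - t/M, which has a single root in (0, M).
    The root is located by bisection.
    """
    def g(t):
        return math.exp(-(t + latency) / mtti) - 1.0 + t / mtti

    lo, hi = 0.0, mtti          # g(lo) < 0 and g(hi) > 0 when latency > 0
    while hi - lo > tol * mtti:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid            # runtime still decreasing: root is to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    mtti = 1440.0               # hypothetical MTTI: 24 hours, in minutes
    d1, d2 = 10.0, 5.0          # hypothetical checkpoint latencies, in minutes
    t1 = optimal_interval(d1, mtti)
    t2 = optimal_interval(d2, mtti)
    print("t1 =", t1, "t2 =", t2)
    # Fixed interval t1, latency reduced from d1 to d2.
    print(expected_runtime(1000.0, t1, d1, mtti),
          expected_runtime(1000.0, t1, d2, mtti))
```

Under this assumed model, the larger latency yields the larger optimal interval, and lowering the latency at a fixed interval lowers the expected execution time, which is consistent with results (1) and (2). The threshold in result (3) is not reproduced here.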