A case study towards the verification of the utility of analytical models in selecting checkpoint intervals
As high performance computing (HPC) systems grow larger and incorporate more components, failures become more common. Codes that use large numbers of nodes and run for long periods must account for such failures and adopt fault tolerance mechanisms to avoid lost computation and the resulting loss of system utilization. One such mechanism is checkpoint/restart. Although analytical models exist to guide users in selecting an appropriate checkpoint interval, these models rest on assumptions that may not always hold. This thesis examines some of these assumptions, in particular the consistency of parameters such as Mean Time To Interrupt (MTTI), checkpoint latency, and restart time, and explores the utility of the models, which assume an exponential failure distribution. The experiments use checkpoint and restart data collected from NAMD, a widely used biomolecular simulation code, and failure data from Los Alamos National Lab (LANL), where failure distributions are not exponential. The thesis also presents preliminary work on spatio-temporal clustering of HPC failure data, aimed at determining the degree to which failures that occur in HPC centers are related. The experimental results confirm that Daly's execution-time and Arunagiri's defensive-I/O checkpoint/restart models hold for NAMD, which shows that these models have utility even when failures do not follow an exponential distribution. The clustering results indicate that for some systems failures fall into easily recognized clusters, while for others failures fall into small clusters, showing that they occur in close spatial and temporal proximity. These results do not, however, establish that the clustered failures are related events, because random events sometimes cluster as well. Spatio-temporal autocorrelation is recommended as a continuation of this research to determine the degree to which failure events are related.
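To make concrete the kind of guidance such analytical models give, the following minimal sketch computes a checkpoint interval from a checkpoint cost and an MTTI. It uses the commonly cited closed forms of Young's first-order approximation and Daly's higher-order estimate, both of which assume exponentially distributed failures. The function names and the input values are illustrative assumptions, not taken from the thesis, and the sketch is not the thesis's experimental code.

```python
import math


def young_interval(checkpoint_cost, mtti):
    """Young's first-order approximation: sqrt(2 * delta * M).

    checkpoint_cost (delta) and mtti (M) must use the same time unit.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtti)


def daly_interval(checkpoint_cost, mtti):
    """Daly's higher-order estimate of the compute time between checkpoints.

    For delta < 2M:
        sqrt(2*delta*M) * (1 + (1/3)*sqrt(delta/(2M)) + (1/9)*(delta/(2M))) - delta
    Otherwise the estimate falls back to M.
    """
    if checkpoint_cost >= 2.0 * mtti:
        return mtti
    ratio = checkpoint_cost / (2.0 * mtti)
    first_order = math.sqrt(2.0 * checkpoint_cost * mtti)
    return first_order * (1.0 + math.sqrt(ratio) / 3.0 + ratio / 9.0) - checkpoint_cost


if __name__ == "__main__":
    # Illustrative values only: a 5-minute checkpoint and a 24-hour MTTI, in seconds.
    delta, mtti = 300.0, 86400.0
    print(f"Young: {young_interval(delta, mtti):.1f} s")
    print(f"Daly:  {daly_interval(delta, mtti):.1f} s")
```

Both functions take the MTTI and checkpoint cost as fixed constants, which is one of the assumptions whose consistency the thesis examines.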
Computer Engineering|Computer science
Harney, Michael Joseph, "A case study towards the verification of the utility of analytical models in selecting checkpoint intervals" (2013). ETD Collection for University of Texas, El Paso. AAI1539944.