Date of Award: 2012
Degree Name: Master of Science
Advisor: Patricia J. Teller
Researchers have identified avoiding, coping with, and recovering from failures as three of the most difficult and growing problems in the future of high-performance computing. As the scale of computing increases, the Mean Time to Failure (MTTF) of the entire system decreases and, therefore, system resilience and fault tolerance techniques become mandatory. One of the most commonly used fault tolerance schemes is checkpoint/restart; however, it has been predicted that the current checkpoint/restart approach is not scalable. Thus, current research seeks both to find new scalable fault tolerance techniques and to extend the scalability of checkpoint/restart.
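The shrinking system MTTF can be made concrete with a standard simplification (not taken from the thesis): if node failures are independent and exponentially distributed, the failure rates add, so system MTTF is the node MTTF divided by the node count. A minimal sketch:

```python
def system_mttf(node_mttf_hours: float, nodes: int) -> float:
    """MTTF of an n-node system, assuming independent,
    exponentially distributed node failures: the system fails at
    rate n / node_mttf, so its MTTF is node_mttf / n."""
    return node_mttf_hours / nodes

# A node MTTF of five years (~43,800 hours) spread over 100,000 nodes
# leaves the full machine with an MTTF of under half an hour.
print(system_mttf(43_800, 100_000))  # 0.438 hours, about 26 minutes
```

Under this model, every tenfold increase in node count cuts the whole-system MTTF tenfold, which is why checkpoint frequency must rise with scale.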
The periodicity of the checkpointing operation, otherwise known as the checkpoint interval, can have a significant impact on application execution time and on the number of checkpoint I/O operations the application performs. The frequency of these checkpoint I/O operations, along with the application's productive I/O, determines the demand the application places on the I/O bandwidth of a massively parallel processing (MPP) system. Analytical models exist both for the optimal checkpoint interval that minimizes wall-clock execution time and for the optimal checkpoint interval that minimizes the number of checkpoint I/O operations generated.
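One widely used analytical model of this kind is Young's first-order approximation of the execution-time-optimal checkpoint interval; whether it is the exact model the thesis validates is an assumption here, but it illustrates the form such models take:

```python
import math

def young_interval(checkpoint_cost: float, mttf: float) -> float:
    """Young's first-order approximation of the optimal checkpoint
    interval: tau_opt = sqrt(2 * C * MTTF), where C is the time to
    write one checkpoint and MTTF is the mean time to failure.
    Both arguments must be in the same time unit."""
    return math.sqrt(2.0 * checkpoint_cost * mttf)

# e.g. a 5-minute checkpoint cost and a 24-hour (1440-minute) MTTF:
print(young_interval(5.0, 24 * 60))  # 120.0 minutes
```

Note the square-root dependence: as MTTF shrinks with system scale, the optimal interval shrinks only as its square root, so checkpoint I/O traffic grows faster than failures alone would suggest.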
This thesis presents a study that quantitatively measures the effect of the checkpoint interval on workload execution time and on the number of checkpoint I/O operations generated. The study is based on the execution behavior of RAxML 7.2.6, a popular community code, and RAxML-Light 1.0.6 on an HPC system, as well as on simulations of workloads executed on an HPC system. Parameter values for the HPC runs and the simulations are the product of an analysis of historic failure data from 22 systems made available by the Computer Failure Data Repository (CFDR); this analysis is also presented in the thesis.
Our research showed that increasing the checkpoint interval above the optimal value with respect to execution time results in a significant decrease in the number of checkpoint I/O operations at the cost of only a marginal increase in execution time. This shows that the associated analytical model holds for the cases studied.
Received from ProQuest
Chakraborty, Bidisha, "Fault Tolerance: Validating A Mathematical Model Via A Case Study Of RAxML, An HPC Community Code" (2012). Open Access Theses & Dissertations. 2057.