Towards a Spaceworthy COTS Graphics Processing Unit: Hardware Performance Counter-Based Symptomatic Fault Detection
Abstract
Ionizing radiation remains an obstacle to bringing graphics processing units (GPUs) to space. Because radiation-hardened GPU chips are not technically feasible at present, emphasis has been placed on adapting commercial off-the-shelf (COTS) GPUs to the space domain. Current GPU error detection methods require redundant computation. This thesis explores the use of hardware performance counters, special registers that monitor internal GPU hardware events, for lightweight, symptom-based error detection. Hardware performance counters are used successfully to detect single-event upsets in the L0 instruction cache, the load/store unit, the arithmetic and logic unit, the fused multiply-add pipeline, and the address divergence unit of a GPU. These upsets are detected using both supervised and unsupervised shallow machine learning models. The results indicate a viable alternative to redundancy-based methods for detecting and handling single-event upsets in a subset of the components of a GPU architecture.
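To illustrate the general idea of symptom-based detection from counter readings, the following is a minimal sketch and not the method used in the thesis. It swaps in a plain per-counter z-score threshold in place of the shallow machine learning models the abstract mentions. The counter names, the synthetic values, and the threshold are all hypothetical; on real hardware the readings would have to come from a vendor profiling interface.

```python
import random
import statistics

# Hypothetical counter names, for illustration only (not taken from the thesis).
COUNTERS = ["inst_executed", "lsu_requests", "fma_ops"]


def fit_baseline(runs):
    """Compute the per-counter mean and sample standard deviation of fault-free runs."""
    baseline = {}
    for name in COUNTERS:
        values = [run[name] for run in runs]
        baseline[name] = (statistics.mean(values), statistics.stdev(values))
    return baseline


def is_anomalous(run, baseline, threshold=4.0):
    """Return True if any counter lies more than `threshold` standard deviations from its mean."""
    for name, (mean, std) in baseline.items():
        if std == 0:
            # A counter that never varied in the baseline: any change is a symptom.
            if run[name] != mean:
                return True
            continue
        if abs(run[name] - mean) / std > threshold:
            return True
    return False


def synthetic_run(rng):
    """Generate made-up counter readings for one fault-free kernel launch."""
    return {
        "inst_executed": rng.gauss(1_000_000, 500),
        "lsu_requests": rng.gauss(250_000, 200),
        "fma_ops": rng.gauss(400_000, 300),
    }


rng = random.Random(0)
baseline = fit_baseline([synthetic_run(rng) for _ in range(200)])

# A run sitting exactly at the baseline means, and a copy with one counter
# shifted by ten standard deviations as a stand-in for a fault symptom.
nominal = {name: baseline[name][0] for name in COUNTERS}
faulty = dict(nominal)
faulty["lsu_requests"] += 10 * baseline["lsu_requests"][1]

print(is_anomalous(nominal, baseline), is_anomalous(faulty, baseline))
```

The appeal of this style of check is that it only compares a handful of counter values after a kernel finishes, instead of executing the kernel a second time. A learned model, as described in the abstract, could capture relationships between counters that a per-counter threshold cannot.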
Subject Area
Electrical engineering|Computer science|Artificial intelligence
Recommended Citation
Teijeiro, Antonio Emilio, "Towards a Spaceworthy COTS Graphics Processing Unit: Hardware Performance Counter-Based Symptomatic Fault Detection" (2023). ETD Collection for University of Texas, El Paso. AAI30819693.
https://scholarworks.utep.edu/dissertations/AAI30819693