Towards a Spaceworthy COTS Graphics Processing Unit: Hardware Performance Counter-Based Symptomatic Fault Detection

Antonio Emilio Teijeiro, University of Texas at El Paso

Abstract

Ionizing radiation remains an obstacle to bringing graphics processing units (GPUs) to space. Because radiation-hardened GPU chips are currently technically infeasible, emphasis has been placed on adapting commercial off-the-shelf (COTS) GPUs to the space domain. At present, GPU error detection methods require redundant computation. This thesis explores the use of hardware performance counters, special registers that monitor internal GPU hardware events, for lightweight, symptom-based error detection. Hardware performance counters are successfully used to detect anomalous single-event upsets in the L0 instruction cache, the load-store unit, the arithmetic and logic unit, the fused multiply-add pipeline, and the address divergence unit of a GPU. These upsets are detected using both supervised and unsupervised shallow machine learning models. The results indicate a viable alternative to redundancy-based computational methods for detecting and handling single-event upsets in a subset of the components of a GPU architecture.

Subject Area

Electrical engineering|Computer science|Artificial intelligence

Recommended Citation

Teijeiro, Antonio Emilio, "Towards a Spaceworthy COTS Graphics Processing Unit: Hardware Performance Counter-Based Symptomatic Fault Detection" (2023). ETD Collection for University of Texas, El Paso. AAI30819693.
https://scholarworks.utep.edu/dissertations/AAI30819693
