Hardware for Quantized Mixed-Precision Deep Neural Networks

Andres Rios, University of Texas at El Paso


Recently, there has been a push to perform deep learning (DL) computations on the edge rather than in the cloud because of latency, network connectivity, energy consumption, and privacy concerns. However, state-of-the-art deep neural networks (DNNs) require vast amounts of computational power, data, and energy, resources that are limited on edge devices. This limitation has created a need for domain-specific architectures (DSAs) that implement DL-specific hardware optimizations. Traditionally, DNNs have run on 32-bit floating-point numbers; however, a body of research has shown that DNNs are surprisingly robust and do not require all 32 bits. Instead, using quantization, networks can run at extremely low bit widths (1-8 bits) with fair accuracy. This suggests that edge devices can handle low-bit-width DNNs, saving computation and energy at some cost in accuracy. It has also been shown that not all layers within a network require the same precision. A further optimization is therefore to use per-layer mixed-precision quantization rather than uniform quantization. This thesis conducts a comparative study of the effects of mixed-precision quantization using "simulated quantization" in software. Furthermore, a mixed-precision multiplier that can be configured at run time is designed to support mixed-precision quantized DNNs in hardware, and it is compared against a full-precision implementation.
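As a rough illustration of what "simulated quantization" with per-layer bit widths can look like in software, the sketch below rounds floating-point weights onto a signed integer grid and immediately maps them back to floats, so the rest of the computation stays in floating point while carrying the quantization error. This is not the thesis's implementation: the symmetric max-based scaling, the function names (fake_quantize, max_error), and the layer names, weights, and bit widths are all assumptions chosen for illustration. The 1-bit case needs a different scheme (such as binarization) and is not covered.

```python
def fake_quantize(values, bits):
    """Round floats onto a signed `bits`-bit grid, then map back to floats.

    Assumption: symmetric uniform quantization scaled by the largest magnitude.
    """
    if bits < 2:
        raise ValueError("this sketch assumes at least 2 bits")
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)                 # nothing to scale
    scale = max_abs / qmax                  # size of one quantization step
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def max_error(values, bits):
    """Largest absolute difference between the original and quantized values."""
    quantized = fake_quantize(values, bits)
    return max(abs(q - v) for q, v in zip(quantized, values))


# Hypothetical per-layer weights and bit widths (mixed precision).
layer_weights = {
    "conv1": [0.9, -0.4, 0.05, 0.3],
    "fc": [0.2, -0.7, 0.65, -0.1],
}
layer_bits = {"conv1": 8, "fc": 4}

for name, weights in layer_weights.items():
    bits = layer_bits[name]
    print(name, bits, max_error(weights, bits))
```

With this scaling, the rounding error of each value is at most half a quantization step, so lowering a layer's bit width enlarges the step and the worst-case error for that layer. A per-layer table such as layer_bits is one simple way to express a mixed-precision assignment.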

Subject Area

Computer engineering | Computer science | Electrical engineering | Artificial intelligence

Recommended Citation

Rios, Andres, "Hardware for Quantized Mixed-Precision Deep Neural Networks" (2021). ETD Collection for University of Texas, El Paso. AAI28715051.