A Novel Set of Weight Initialization Techniques for Deep Learning Architectures
The importance of weight initialization when building a deep learning model is often underappreciated. Even though it is usually treated as a minor detail in the model-creation cycle, this step has been shown to have a strong impact on both the training time of a network and the quality of the resulting model. The consequences of a poor initialization scheme range from producing a poorly performing model to preventing optimization techniques, such as stochastic gradient descent, from converging at all.

In this work, we introduce and evaluate a set of novel weight initialization techniques for deep learning architectures. These techniques use an initialization data set (extracted from the training data set), together with properties of the problem and the network architecture, to compute the initial values of a layer's weights. The first scheme in our set targets dense and convolutional layers that use the ReLU activation function. It aims to maximize neuron heterogeneity within a layer while 1) keeping the standard deviation of the neurons' outputs uniform and 2) controlling the initial number of active neurons. Our second technique is inspired by the observation that the weights learned by convolutional neural networks often closely resemble Gabor filters; we therefore initialize regular convolutional layers with weights chosen so that the resulting filters approximate Gabor filters. The third technique also targets convolutional layers: it selects weight values that allow the layer's filters to activate only when processing interesting regions of the input space, as identified by feature extraction techniques such as SIFT and FAST. The fourth technique in our set initializes recurrent layers.
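The abstract does not give the exact parameterization used in the dissertation, but the Gabor-based idea can be illustrated with a minimal sketch: fill a convolutional weight tensor with randomly parameterized Gabor filters (the standard sinusoidal carrier under a Gaussian envelope). The function names, parameter ranges, and per-filter normalization below are illustrative assumptions, not the author's method.

```python
import numpy as np

def gabor_kernel(size, theta, sigma, lambd, gamma=0.5, psi=0.0):
    """Sample one real Gabor filter on a size x size grid (size should be odd)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotate coordinates by theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * xr / lambd + psi)
    return envelope * carrier

def gabor_init(out_channels, in_channels, size, rng=None):
    """Fill a conv weight tensor with randomly parameterized Gabor filters."""
    rng = np.random.default_rng(rng)
    w = np.empty((out_channels, in_channels, size, size))
    for o in range(out_channels):
        for i in range(in_channels):
            theta = rng.uniform(0.0, np.pi)          # random orientation
            sigma = rng.uniform(1.0, size / 2.0)     # envelope width (assumed range)
            lambd = rng.uniform(2.0, float(size))    # carrier wavelength (assumed range)
            k = gabor_kernel(size, theta, sigma, lambd)
            w[o, i] = k / (np.linalg.norm(k) + 1e-8)  # unit-norm per filter
    return w

weights = gabor_init(8, 3, 5, rng=0)
print(weights.shape)  # (8, 3, 5, 5)
```

The resulting array has the usual (out_channels, in_channels, height, width) layout and could be copied into a framework's conv layer before training.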
Recurrent layers reuse the same weight matrix in successive matrix multiplications, which can amplify or dilute output values and gradients. We mitigate this problem by selecting weight matrices that maintain a uniform output response across the range of possible inputs. All of these initialization techniques add an extra step to initialize the output layer: we use the ground-truth information in the initialization data set to select weights that minimize the network's initial loss value. Experimentally, we show that our initialization schemes outperform state-of-the-art techniques (Glorot, He, and LSUV) by a considerable margin. Our methods yield models that train faster and perform up to 40% better than models initialized with those techniques.
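The abstract does not specify how the recurrent weight matrices are chosen; one standard way to keep a repeatedly applied matrix from amplifying or diluting its output is orthogonal initialization, which preserves the norm of the hidden state under linear recurrence. The sketch below is that common technique, shown for comparison, not the dissertation's scheme.

```python
import numpy as np

def orthogonal_recurrent_init(n, rng=None):
    """Draw a random orthogonal n x n matrix via QR decomposition of a
    Gaussian matrix. Repeated multiplication by an orthogonal matrix leaves
    the norm of the hidden state unchanged, so activations and gradients
    neither explode nor vanish in the linear part of the recurrence."""
    rng = np.random.default_rng(rng)
    a = rng.standard_normal((n, n))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))  # fix column signs so Q is uniformly distributed
    return q

W = orthogonal_recurrent_init(64, rng=0)
h = np.ones(64)               # hidden state with norm sqrt(64) = 8.0
for _ in range(50):
    h = W @ h                 # norm is preserved at every step
print(round(np.linalg.norm(h), 6))  # 8.0
```

By contrast, a Gaussian matrix with even slightly too-large variance would grow the norm exponentially over 50 steps, which is exactly the amplification problem described above.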
Aguirre, Diego, "A Novel Set of Weight Initialization Techniques for Deep Learning Architectures" (2019). ETD Collection for University of Texas, El Paso. AAI27667567.