In most applications of data processing, we select the parameters that minimize the mean square approximation error. The same Least Squares approach has been used in the traditional neural networks. However, for deep learning, it turns out that an alternative idea works better -- namely, minimizing the Kullback-Leibler (KL) divergence. The use of KL divergence is justified if we predict probabilities, but the use of this divergence has been successful in other situations as well. In this paper, we provide a possible explanation for this empirical success. Namely, the Least Square approach is optimal when the approximation error is normally distributed -- and can lead to wrong results when the actual distribution is different from normal. The need to have a robust criterion, i.e., a criterion that does not depend on the corresponding distribution, naturally leads to the KL divergence.