Quantized Convolutional Neural Networks Through the Lens of Partial Differential Equations

Ido Ben-Yair, Gil Ben Shalom
Moshe Eliasof and Eran Treister

Ben-Gurion University of the Negev, Israel

Springer Research in the Mathematical Sciences, Volume 9, December 2022

Paper | arXiv | Code

Abstract. Quantization of Convolutional Neural Networks (CNNs) is a common approach to ease the computational burden involved in the deployment of CNNs, especially on low-resource edge devices. However, fixed-point arithmetic is not natural to the type of computations involved in neural networks. In this work, we explore ways to improve quantized CNNs using PDE-based perspective and analysis. First, we harness the total variation (TV) approach to apply edge-aware smoothing to the feature maps throughout the network. This aims to reduce outliers in the distribution of values and promote piece-wise constant maps, which are more suitable for quantization. Secondly, we consider symmetric and stable variants of common CNNs for image classification, and Graph Convolutional Networks (GCNs) for graph node-classification. We demonstrate through several experiments that the property of forward stability preserves the action of a network under different quantization rates. As a result, stable quantized networks behave similarly to their non-quantized counterparts even though they rely on fewer parameters. We also find that at times, stability even aids in improving accuracy. These properties are of particular interest for sensitive, resource-constrained, low-power or real-time applications like autonomous driving.

Stable Architectures for Quantized Residual Networks

Quantization of neural network weights and activation maps are useful for network compression and for running on resource-constrainged edge devices such as autonomous vehicles.

But, quantization is a source of noise for our networks.

By treating the network as a discretization of a partial differential equation, we can treat this noise as error and apply PDE stability theory to mitigate its impact.

We show that we get lighter models without harming performance by much.

The symmetric variant of a ResNet layer together with the activation quantization operator is given as follows:

$$\mathbf{x}_{j+1} = Q_b(\mathbf{x}_j - h \mathbf{K}_j^\top\ Q_b(\sigma(\mathbf{K}_j\mathbf{x}_j))),$$

As well as a symmetric variant of MobileNetV2:

$$\mathbf{x}_{j+1} = \mathbf{x}_j - (\mathbf{K}_{1, j})^\top((\mathbf{K}_{2, j})^\top\sigma (\mathbf{K}_{2, j}\mathbf{K}_{1, j}\mathbf{x}_j)),$$

This is due to the observation that the Jacobian of the layer, $\mathbf{J}_j = \mathbf{I} -h\mathbf{K}_j^\top\mathbf{\Omega}\mathbf{K}_j$, propagates the error to the next layer, therefore for a proper choice of h, we obtain a positive semi-definite Jacobian.

We also study the importance of stability for quantized Graph Convolution Networks. We use a diffusive PDE-GCN architecture, which operates on unstructured graphs:

$$\mathbf{x}_{j+1} = \mathbf{x}_j - h \mathbf{S}_j^{\top} \mathbf{K}_j^{\top} \sigma(\mathbf{K}_j \mathbf{S}_j \mathbf{x}_j).$$

As opposed to a non-symmetric diffusive residual layer:

$$\mathbf{x}_{j+1} = \mathbf{x}_j - h \mathbf{S}_j^{\top} \mathbf{K}_{j_2} \sigma(\mathbf{K}_{j_{1}} \mathbf{S}_j \mathbf{x}_j)$$

Stable ResNet Block — Schematic diagram of our symmetric ResNet block which yields a symmetric and positive-definite layer.

Quantization-Aware Training

We restrict the values of the weights and activations to a smaller set, so that after training, the calculation of a prediction by the network can be carried out in fixed-point integer arithmetic using a quantization operator:

$$q_b(t) = \frac{\mbox{round}((2^b - 1) \cdot t)}{2^b - 1},$$ $$w_b = Q_b(w) = \alpha_w q_{b-1}\left(\mbox{clip}\left(\frac{w}{\alpha_w}, -1, 1\right)\right),$$ $$x_b = Q_b(x) = \alpha_x q_{b}\left(\mbox{clip}\left(\frac{x}{\alpha_x}, 0, 1\right)\right).$$

Treating Activation Error as Noise Using Total Variation

One of the most popular and effective approaches for image denoising is the Total Variation method:

$$||u||_{TV(\Omega)} = \int_{\Omega}{\|\nabla u(x)\| dx}.$$

We augment deep neural networks with the TV method as a non-pointwise activation function at each layer. We minimize the TV norm to encourage piecewise-constant images, thereby obtaining smaller quantization error when quantizing the images.

Smoothing with Total Variation reduces outliers and creates a more uniform distribution of values in the image.

In summary, at each nonlinear activation in the network, we apply the edge-aware denoising step:

$$S(\mathbf{x}) = \mathbf{x} - \gamma^2(\mathbf{D}_x + \mathbf{D}_y)\mathbf{x}.$$

To test the theory, we applied the TV method to the ResNet50, ResNet20 and DeepLab architectures.

Example image before TV denoising — An example of a feature map from the 4th layer of the ResNet50 encoder for an image. The feature map is shown before and after 3 iterations of the TV-smoothing operator. Fine details are preserved after smoothing.

Example image after TV denoising — An example of a feature map from the 4th layer of the ResNet50 encoder for an image. The feature map is shown before and after 3 iterations of the TV-smoothing operator. Fine details are preserved after smoothing.

Experimental Results

Results for stability experiments — Per-layer MSE between activation maps of symmetric and non-symmetric network pairs. Each line represents a pair of networks where one has quantized activation maps and the other does not. The values are normalized per-layer to account for the different dimensions of each layer. In all cases, the symmetric variants (in red) exhibit a bounded divergence from full-precision activations, while the non-symmetric networks diverge as the information propagates through the layers (in blue). Hence, they are unstable. Top to bottom: ResNet56/CIFAR-10, ResNet56/CIFAR-100 and MobileNetV2/CIFAR-100. Both networks in each pair achieve comparable classification accuracy.

Results for TV experiments — The value distributions for a feature map from the 4th layer of a ResNet50 encoder, with and without TV smoothing. The smoothing eliminates outliers in addition to denoising the image. The distributions in (c) have been smoothed slightly for improved visual clarity.