QGen: On the Ability to Generalize in Quantization Aware Training (2024)

MohammadHossein AskariHemmat¹¹footnotemark: 1 mohammad@deeplite.ai
DeepliteAhmadreza Jeddi ¹¹footnotemark: 1 ahmadreza.jeddi@gmail.com
DeepliteReyhane Askari Hemmat reyhane.askari.hemmat@umontreal.ca
University of Montreal and Mila, Quebec AI InstituteIvan Lazarevich ivan.lazarevich@deeplite.ai
DeepliteAlexander Hoffman alexander.hoffman@deeplite.ai
DeepliteSudhakar Sah sudhakar@deeplite.ai
DeepliteEhsan Saboori ehsan@deeplite.ai
DeepliteYvon Savaria yvon.savaria@polymtl.ca
École Polytechnique de MontréalJean-Pierre David jean-pierre.david@polymtl.ca
École Polytechnique de MontréalEqual Contribution.

Abstract

Quantization lowers memory usage, computational requirements, and latency by utilizing fewer bits to represent model weights and activations. In this work, we investigate the generalization properties of quantized neural networks, a characteristic that has received little attention despite its implications on model performance. In particular, first, we develop a theoretical model for quantization in neural networks and demonstrate how quantization functions as a form of regularization. Second, motivated by recent work connecting the sharpness of the loss landscape and generalization, we derive an approximate bound for the generalization of quantized models conditioned on the amount of quantization noise. We then validate our hypothesis by experimenting with over 2000 models trained on CIFAR-10, CIFAR-100, and ImageNet datasets on convolutional and transformer-based models.

1 Introduction

The exceptional growth of technology involving deep learning has made it one of the most promising technologies for applications such as computer vision, natural language processing, and speech recognition. The ongoing advances in these models consistently enhance their capabilities, yet their improving performances often come at the price of growing complexity and an increased number of parameters. The increasing complexity of these models poses a challenge to the deployment in production systems due to higher operating costs, greater memory requirements, and longer response times. Quantization Courbariaux etal. (2015); Hubara etal. (2016); Polino etal. (2018); Zhou etal. (2017); Jacob etal. (2018); Krishnamoorthi (2018) is one of the prominent techniques that have been developed to reduce the model sizes and/or reduce latency. Model quantization represents full precision model weights and/or activations using fewer bits, resulting in models with lower memory usage, energy consumption, and faster inference. Quantization has gained significant attention in academia and industry. Especially with the emergence of the transformer Vaswani etal. (2017) model, quantization has become a standard technique to reduce memory and computation requirements.

The impact on accuracy and the benefits of quantization, such as memory footprint and latency, is well studied Courbariaux etal. (2015); Li etal. (2017); Gholami etal. (2021). These studies are mainly driven by the fact that modern hardware is faster and more energy efficient in low-precision (byte, sub-byte) arithmetic compared to the floating point counterpart. Despite its numerous benefits, quantization may adversely impact accuracy.

Hence, substantial research efforts on quantization revolve around addressing the accuracy degradation resulting from lower bit representation. This involves analyzing the model’s convergence qualities for various numerical precisions and studying their impacts on gradients and network updates Li etal. (2017); Hou etal. (2019).

In this work, we delve into the generalization properties of quantized neural networks. This key aspect has received limited attention despite its significant implications for the performance of models on unseen data. This factor becomes particularly important for safety-critical Gambardella etal. (2019); Zhang etal. (2022b) applications. While prior research has explored the performance of neural networks in adversarial settings Gorsline etal. (2021); Lin etal. (2019); Galloway etal. (2017) and investigated how altering the number of quantization bits during inference can affect model performanceChmiel etal. (2020); Bai etal. (2021), there is a lack of systematic studies on the generalization effects of quantization using standard measurement techniques.

This work studies the effects of different quantization levels on model generalization, training accuracy, and training loss. First, in Section 3, we model quantization as a form of noise added to the network weights. Subsequently, we demonstrate that this introduced noise serves as a regularizing agent, with its degree of regularization directly related to the bit precision. Consistent with other regularization methods, our empirical studies further support the claim that each model requires precise tuning of its quantization level, as models achieve optimal generalization at varying quantization levels. On the generalization side, in Section 4, we show that quantization could help the optimization process convergence to minima with lower sharpness when the scale of quantization noise is bounded. This is motivated by recent works of Foret etal. (2021); Keskar etal. (2016), which establish connections between the sharpness of the loss landscape and generalization. We then leverage a variety of recent advances in the field of generalization measurement Jiang etal. (2019); Dziugaite etal. (2020), particularly sharpness-based measures Keskar etal. (2016); Dziugaite & Roy (2017); Neyshabur etal. (2017), to verify our hypothesis for a wide range of vision problems with different setups and model architectures. Finally, in this section, we present visual demonstrations illustrating that models subjected to quantization have a flatter loss landscape.

After establishing that lower-bit-quantization results in improved flatness in the loss landscape, we study the connection between the achieved flatness of the lower-bit-quantized models and generalization. Our method estimates the model’s generalization on a given data distribution by measuring the difference between the loss of the model on training data and test data. To do so, we train a pool of almost 2000 models on CIFAR-10 and CIFAR-100 Krizhevsky (2009) and Imagenet-1K Deng etal. (2009) datasets, and report the estimated generalization gap. Furthermore, we conclude our experiments by showing a practical use case of model generalization, in which we evaluate the vision models under severe cases when the input to the model is corrupted. This is achieved by measuring the generalization gap for quantized and full precision models when different types of input noise are used, as introduced in Hendrycks & Dietterich (2019).Our main contributions can be summarized as follows:

•
We theoretically show that quantization can be seen as a regularizer.
•
We empirically show that there exists a quantization level at which the quantized model converges to a flatter minimum than its full-precision model.
•
We empirically demonstrate that quantized models show a better generalization gap on distorted data.

2 Related Works

2.1 Regularization Effects of Quantization

Since the advent of BinaryConnect Courbariaux etal. (2015) and Binarized Neural Networks Hubara etal. (2016), which were the first works on quantization, the machine learning community has been aware of the generalization effects of quantization, and the observed generalization gains have commonly been attributed to the implicit regularization effects that the quantization process may impose. This pattern is also observed in more recent works such as Mishchenko etal. (2019); Xu etal. (2018); Chen etal. (2021). Even though these studies have empirically reported some performance gain as a side-product of quantization, they lack a well-formed analytical study.

Viewing quantization simply as regularization is relatively intuitive, and to the best of our knowledge, the only work so far that has tried to study this behavior formally is the recent work done in Zhang etal. (2022a), where the authors provide an analytical study on how models with stochastic binary quantization can have a smaller generalization gap compared to their full precision counterparts. The authors propose a quasi-neural network to approximate the effect of binarization on neural networks. They then derive the neural tangent kernel Jacot etal. (2018); Bach (2017) for the proposed quasi-neural network approximation. With this formalization, the authors show that binary neural networks have lower capacity, hence lower training accuracy, and a smaller generalization gap than their full precision counterparts. However, this work is limited to the case of simplified binarized networks and does not study the wider quantization space, and their supporting empirical studies are done on MNIST and Fashion MNIST datasets with no studies done on larger scale more realistic problems. Furthermore, the Neural Tangent Kernel (NTK) analysis requires strong assumptions such as an approximately linear behaviour of the model during training which may not hold in practical setups.

2.2 Generalization and Complexity Measures

Generalization refers to the ability of machine learning models to perform well on unseen data beyond the training set. Despite the remarkable success and widespread adoption of deep neural networks across various applications, the factors influencing their generalization capabilities and the extent to which they generalize effectively are still unclear Jiang etal. (2019); Recht etal. (2019).

Minimization of the common loss functions (e.g., cross=entropy and its variants) on the training data does not necessarily mean the model would generalize well Foret etal. (2021); Recht etal. (2019), especially since the recent models are heavily over-parameterized and they can easily overfit the training data. InZhang etal. (2021), the authors demonstrate neural networks’ vulnerability to poor generalization by showing they can perfectly fit randomly labeled training data. This is due to the complex and non-convex landscape of the training loss. Numerous works have tried to either explicitly or implicitly solve this overfitting issue using optimizer algorithms Kingma & Ba (2014); Martens & Grosse (2015), data augmentation techniques Cubuk etal. (2018), and batch normalization Ioffe & Szegedy (2015), to name a few.

So the question remains: what is the best indicator of a model’s generalization ability? Proving upper bounds on the test error Neyshabur etal. (2017); Bartlett etal. (2017) has been the most direct way of studying the ability of models to generalize; however, the current bounds are not tight enough to indicate the model’s ability to generalize Jiang etal. (2019). Therefore, several recent works have preferred the more empirical approaches of studying generalization Keskar etal. (2016); Liang etal. (2019). These works introduce a complexity measure, a quantity that monotonically relates to some aspect of generalization. Specifically, lower complexity measures correspond to neural networks with improved generalization capacity. Many complexity measures are introduced in the literature, but each of them has typically targeted a limited set of models on toy problems. However, recent work in Jiang etal. (2019) followed by Dziugaite etal. (2020) performed an exhaustive set of experiments on the CIFAR-10 and SVHN Netzer etal. (2011) datasets with different model backbones and hyper-parameters to identify the measures that correlate best with generalization. Both of these large-scale studies show that sharpness-based measures are the most effective. The sharpness-based measures are derived either from measuring the average flatness around a minimum through adding Gaussian perturbations (PAC-Bayesian bounds McAllester (1999); Dziugaite & Roy (2017)) or from measuring the worst-case loss, i.e., sharpness Keskar etal. (2016); Dinh etal. (2017).

The effectiveness of sharpness-based measures has also inspired new training paradigms that penalize the loss of landscape sharpness during training Foret etal. (2021); Du etal. (2022); Izmailov etal. (2018). In particular, Foret etal. (2021) introduced the Sharpness-Aware-Minimization (SAM), which is a scalable and differentiable algorithm that helps models to converge and reduce the model sharpness. It is also worth mentioning here that some recent works Liu etal. (2021); Wang etal. (2022) assume that the discretization and gradient estimation processes, which are common in quantization techniques, might cause loss fluctuations that could result in a sharper loss landscape. Then they couple quantization with SAM and report improved results; however, our findings in Section 4 suggest the opposite. The quantized models in our experiments exhibit improved loss landscape flatness compared to their full precision counterparts.

Acknowledgments

Use unnumbered third level headings for the acknowledgments. Allacknowledgments, including those to funding agencies, go at the end of the paper.Only add this information once your submission is accepted and deanonymized.

3 Mathematical Model for Quantization

Throughout this paper, we will denote vectors as $\bm{x}$ , the scalars as $x$ , and the sets as $\mathscr{X}$ . Furthermore, $\perp\!\!\!\!\perp$ denotes independence. Given a distribution $\mathcal{D}$ for the data space, our training dataset $\mathscr{S}$ is a set of i.i.d. samples drawn from $\mathcal{D}$ . The typical ML task tries to learn models $f(.)$ parametrized by weights $\bm{w}$ that can minimize the training set loss $\mathcal{L}_{\mathscr{S}}(\bm{w})=\frac{1}{|\mathscr{S}|}\sum_{i=1}^{|\mathscr%{S}|}l(f(\bm{w},\bm{x}_{i}),\bm{y}_{i})$ given a loss function $l(.)$ and $(\bm{x}_{i},\bm{y}_{i})$ pairs in the training data.

To quantize our deep neural networks, we utilize Quantization Aware Training (QAT) methods similar to Learned Step-size Quantization (LSQ) Esser etal. (2020) for CNNs and Variation-aware Vision Transformer Quantization (VVTQ) XijieHuang & Cheng (2023) for ViT models. Specifically, we apply the per-layer quantization approach, in which, for each target quantization layer, we learn a step size s to quantize the layer weights. Therefore, given the weights $\bm{w}$ , scaling factor $s\in\mathbb{R}$ and $b$ bits to quantize, the quantized weight tensor $\hat{\bm{w}}$ and the quantization noise $\Delta$ can be calculated as below:

$\displaystyle\bar{\bm{w}}$	$\displaystyle=\lfloor clip(\frac{\bm{w}}{s},-2^{b-1},2^{b-1}-1)\rceil$	(1)
$\displaystyle\hat{\bm{w}}$	$\displaystyle=\bar{\bm{w}}\times s$	(2)
$\displaystyle\Delta$	$\displaystyle=\bm{w}-\hat{\bm{w}},$	(3)

where the $\lfloor\bm{z}\rceil$ rounds the input vector $\bm{z}$ to the nearest integer vector, $clip(r,z_{1},z_{2})$ function returns $r$ with values below $z_{1}$ set to $z_{1}$ and values above $z_{2}$ set to $z_{2}$ , and $\hat{\bm{w}}$ shows a quantized representation of the weights at the same scale as $\bm{w}$ .

3.1 Theoretical Analysis

For simplicity, let us consider a regression problem where the mean square error loss is defined as,

\mathcal{L}=\mathbb{E}_{p(\bm{x},\bm{y})}[\|\bm{\hat{y}}-\bm{y}\|_{2}^{2}],

(4)

where $\bm{y}$ is the target, and $\bm{\hat{y}}=f(\bm{x},\bm{w})$ is the output of the network $f$ parameterized by $\bm{w}$ .

For uniform quantization, the quantization noise $\Delta$ can be approximated by the uniform distribution ${\bm{\Delta}\sim\mathcal{U}[\frac{-\delta}{2},\frac{\delta}{2}]}$ where $\delta$ is the width of the quantization bin and $\mathcal{U}$ is the uniform distribution Défossez etal. (2021); Widrow etal. (1996); Agustsson & Theis (2020).

Consequently, a quantized neural network effectively has the following loss,

\tilde{\mathcal{L}}=\mathbb{E}_{p(\bm{x},\bm{y},\bm{\Delta})}[\|\bm{\hat{y^{q}%}}-\bm{y}\|_{2}^{2}],

(5)

where $\bm{\hat{y}}^{q}=f(\bm{x},\bm{w}+\bm{\Delta})$ .

4 Analyzing Loss Landscapes in Quantized Models and Implications for Generalization

A low generalization gap is a desirable characteristic of deep neural networks. It is common in practice to estimate the population loss of the data distribution $\mathcal{D}$ , i.e. $\mathcal{L}_{\mathcal{D}}(\bm{w})=\mathbb{E}_{(\bm{x},\bm{y})\sim\mathcal{D}}[%l(f(\bm{w},\bm{x}),\bm{y})]$ , by utilizing $\mathcal{L}_{\mathscr{S}}(\bm{w})$ as a proxy, and then minimizing it by gradient descent-based optimizers. However, given that modern neural networks are highly over-parameterized and $\mathcal{L}_{\mathscr{S}}(\bm{w})$ is commonly non-convex in $\bm{w}$ , the optimization process can converge to local or even global minima that could adversely affect the generalization of the model (i.e. with a significant gap between $\mathcal{L}_{\mathscr{S}}(\bm{w})$ and $\mathcal{L}_{\mathcal{D}}(\bm{w})$ ) Foret etal. (2021).

Motivated by the connection between the sharpness of the loss landscape and generalization Keskar etal. (2016), in Foret etal. (2021) the authors proposed the Sharpness-Aware-Minimization (SAM) technique, in which they propose to learn the weights $\bm{w}$ that result in a flat minimum with a neighborhood of low training loss values characterized by $\rho$ . Especially, inspired by the PAC-Bayesian generalization bounds, they were able to prove that for any $\rho>0$ , with high probability over the training dataset $\mathscr{S}$ , the following inequality holds:

\mathcal{L}_{\mathcal{D}}(\bm{w})\leq\max_{||\bm{\epsilon}||_{2}\leq\rho}%\mathcal{L}_{\mathscr{S}}(\bm{w}+\bm{\epsilon})\>+h(||\bm{w}||_{2}^{2}/\rho^{2%}),

(8)

where $h:\mathbb{R}_{+}\rightarrow\mathbb{R}_{+}$ is a strictly increasing function. Even though the above theorem is for the case where the $L2$ -norm of $\bm{\epsilon}$ is bounded by $\rho$ and the adversarial perturbations are utilized to achieve the worst-case loss, the authors empirically show that in practice, other norms in $[1,\infty]$ and random perturbations for $\bm{\epsilon}$ can also achieve some levels of flatness; however, they may not be as effective as the $L2$ -norm coupled with the adversarial perturbations.

Extending on the empirical studies of Foret etal. (2021), we relax the $L2$ -norm condition of Equation 8, and consider the $L_{\infty}$ -norm instead, resulting in:

\mathcal{L}_{\mathcal{D}}(\bm{w})\leq\max_{||\bm{\epsilon}||_{\infty}\leq\rho}%\mathcal{L}_{\mathscr{S}}(\bm{w}+\bm{\epsilon})\>+h(||\bm{w}||_{2}^{2}/\rho^{2})

(9)

Furthermore, given small values of $\rho>0$ , for any noise vector $\bm{\delta}$ such that $||\bm{\delta}||_{\infty}\leq\rho$ , the following inequality holds in practice for a local minimum characterized by $\bm{w}$ , as also similarly depicted in Equation 7 where $\bm{\delta}$ corresponds to the quantization noise, $\bm{\Delta}$ ; however, this inequality may not necessarily hold for every $\bm{w}$ :

\mathcal{L}_{\mathscr{S}}(\bm{w})\leq\mathcal{L}_{\mathscr{S}}(\bm{w}+\bm{%\delta})\leq\max_{||\bm{\epsilon}||_{\infty}\leq\rho}\mathcal{L}_{\mathscr{S}}%(\bm{w}+\bm{\epsilon}),

(10)

For small values of $\rho$ close to 0, and a given $\bm{w}$ we can approximate,

\max_{||\bm{\epsilon}||_{\infty}\leq\rho}\mathcal{L}_{\mathscr{S}}(\bm{w}+\bm{%\epsilon})

(11)

in Equation 9 with $\mathcal{L}_{\mathscr{S}}(\bm{w}+\bm{\delta})$ . As a result, for small positive values of $\rho$ , we have:

\mathcal{L}_{\mathcal{D}}(\bm{w})\leq\mathcal{L}_{\mathscr{S}}(\bm{w}+\bm{%\delta})\>+h(||\bm{w}||_{2}^{2}/\rho^{2}),

(12)

Hypothesis 1 (H1)

Let $\bm{w}$ be the set of weights in the model, $\bm{w^{q}}$ be the set of quantized weights, $\delta$ be the width of quantization bin and $g(.)$ be a function that measures the sharpness of a minima, we have,

1.
Having a bounded $\bm{\Delta}$ with $||\bm{\Delta}||_{\infty}\leq\frac{\delta}{2}$ , there exist a $\bm{\Delta}$ where, for quantized model parameterized by $w^{1}$ obtained through QAT and full precision model parameterized by $w^{2}$ we have: $g(w^{1})\leq g(w^{2})$

(1) implies that quantization helps the model converge to flatter minima with lower sharpness. As discussed in Section 3 and illustrated in Figure 1, since lower bit quantization corresponds to higher $\delta$ , therefore, lower bit resolution quantization results in better flatness around the minima. However, as described by 7, the $\delta$ is a hyperparameter for the induced regularization. Hence, not all quantization levels will result in flatter minima and improved generalization.

In the rest of this Section, we report the results of our exhaustive set of empirical studies regarding the generalization qualities of quantized models; in Section 4.1, for different datasets and different backbones (models) we study the flatness of the loss landscape of the deep neural networks under different quantization regimens, in Section 4.2 we measure and report the generalization gap of the quantized models for a set of almost 2000 vision models, and finally in Section 4.3 using corrupted datasets, we study the real-world implications that the generalization quality can have and how different levels of quantization perform under such scenarios.

4.1 Flatness of Minima and Generalization

In this section, we conduct experiments that demonstrate quantized neural networks enjoy better flatness in their loss landscape compared to their full-precision counterparts; this finding is contrary to the assumption of some of the recent studies Liu etal. (2021); Wang etal. (2022). In those studies, it is assumed that quantization results in sharper minima. We believe that the root of this assumption might be that the authors of those works have not considered the magnitude of the network weights in measuring the sharpness. However, as Jiang etal. (2019) and Dziugaite etal. (2020) show, the flatness measures that take the magnitude of the parameters into account Keskar etal. (2016), are better indicators of generalization.

Dataset	Model	Int2	Int4	Int8	FP32
CIFAR-10	NiN (4x10)	47.263	54.291	53.804	130.686
	NiN (4x12)	43.039	46.523	46.750	73.042
	ResNet-18	44.264	48.227	47.368	59.474
	ResNet-50	45.011	238.117	48.149	97.856
CIFAR-100	NiN (5x10)	60.981	60.707	60.905	190.414
	NiN (5x12)	82.230	87.931	87.307	163.768
	ResNet-18	48.120	55.027	54.735	125.164
	ResNet-50	75.739	82.788	79.603	148.298
ImageNet-1K	ResNet-18	78.291	84.472	85.162	415.004
ImageNet-1K	ResNet-50	214.055	213.035	212.624	379.465

As Table 2 shows, for a given backbone, the magnitude of the network parameters (calculated as the $L2$ -norm of weights) are very different among different quantization levels; therefore, simply measuring the loss landscape flatness using sharpness or PAC-Bayesian bounds without considering the magnitude of the weights could be misleading.

PAC-BayesianSharpnessDatasetModelPrecisionInitOrigMag-InitMag-OrigInitOrigMag-InitMag-OrigFP322.2642.27.6357.5940.5890.5728.2198.181Int84.2043.6266.4356.1760.2920.2526.8536.610Int42.4822.1436.4196.1621.4441.2476.8266.586NiN (4x10)Int21.5881.326.1715.8331.1520.9586.4546.131FP321.4691.3287.9747.7700.3590.3249.2169.040Int86.0574.8667.7187.2560.2590.2088.6558.245Int42.6582.1317.7657.3021.3351.078.568.142NiN (4x12)Int21.9181.4937.6547.1190.7810.6088.5138.034FP321.1861.1353.6593.6171.4471.3834.3994.364Int80.8930.8343.3553.2850.4260.3984.1124.055Int41.7861.6733.2913.2230.4330.4054.0373.981ResNet-18Int21.3681.2673.2383.1560.8190.7594.0744.012FP321.8031.6475.3045.1931.2371.136.3036.210Int83.9112.9014.7294.3000.9880.7335.4725.106Int48.9378.7936.3946.3773.713.656.6846.669CIFAR10ResNet-50Int21.9911.4314.6384.1511.9261.3855.4995.094FP324.2664.1929.339.3020.8590.84410.46710.443Int87.3546.3397.4517.1550.4740.4098.0847.812Int43.1012.6737.3997.1010.4730.4088.0327.759NiN (5x10)Int22.251.9396.1385.7760.3130.277.7747.491FP323.5053.40910.95810.9040.7770.75512.04111.992Int81.7121.5619.1758.9630.5820.5319.9569.761Int43.9343.5959.2749.0690.5810.5319.7949.599NiN (5x12)Int24.3433.9229.4799.2520.5570.5039.8289.609FP323.5353.4954.2434.2343.4293.394.7954.786Int86.0315.6963.6853.6311.1941.1284.2324.185Int42.3812.253.5913.5361.1661.1024.1174.069ResNet-18Int23.7043.4433.5383.46527.98327.2474.654.611FP324.3964.2655.9185.8834.7324.5916.7976.768Int85.5834.2794.7754.3852.4451.8745.6135.285Int43.0762.3975.2734.9452.3861.8596.8096.558CIFAR100ResNet-50Int229.72729.5315.2535.24737.89338.1248.3438.339FP3211.69411.58412.37812.355349.235345.96220.06920.055Int87.8365.30310.18.902104.9170.99418.41617.786Int44.6153.10810.0728.853104.55770.41918.4117.772ResNet-18Int216.39710.56311.0049.770101.3165.26618.36217.649FP327.9427.14422.16321.8265.0674.55627.74627.418Int820.39814.34417.59716.27211.2087.88120.10418.995Int435.01124.63717.80916.503258.118181.63619.16218.833ResNet-50Int2245.654173.28717.95417.023258.722182.50524.05124.007FP328.6538.12319.65118.2267.0176.75331.92431.5Int826.54427.35218.23217.5637.4455.12622.52423.432Int435.78633.98217.98316.1485.9274.67220.12223.765ImageNet-1KDeiT-TInt2236.322171.23419.86518.982218.621169.97232.11433.763

To capture the flatness around the local minima of a given network $f(\bm{w})$ , we utilize the PAC-Bayesian bounds McAllester (1999) and sharpness measures Keskar etal. (2016). The former adds Gaussian perturbations to the network parameters and captures the average (expected) flatness within a bound, while the latter captures the worst-case flatness, i.e. sharpness, through adding adversarial worst-case perturbations to the parameters. We use the same formulation and implementation as specified by Jiang etal. (2019) and Dziugaite etal. (2020); in particular, similar to their approach, we measure and report these metrics by considering the trained model and the network-at-initialization parameters as the origin and initialization tensors, respectively. Moreover, as discussed in the above paragraph, and indicated in Table 2, the magnitude-aware versions of these metrics are the most reliable way of capturing the flatness of a network, hence we also report the magnitude-aware measurements, and they will be the main measure of loss landscape flatness. Details about these metrics and their formulation are in the supplementary Section A.

As shown in Table 3, for the 3 datasets of CIFAR-10, CIFAR-100 , and ImageNet-1k, and over a variety of network backbones, the quantized models enjoy flatter loss landscapes which is an indicator of better generalization to unseen data. An important observation from the experiments reported in Table 3 is that relying solely on sharpness or PAC-Bayesian measures without considering the magnitude of the network parameters might create the assumption that quantization does increase the network sharpness. We suspect that this might have indeed been the cause of this assumption in the works of Liu etal. (2021); Wang etal. (2022) which assume worse sharpness for quantized models and then propose Sharpness-Aware-Minimization (SAM) coupled with Quantization-Aware-Training (QAT). However, our empirical studies demonstrate that when the magnitude of the parameters is taken into account, quantization does actually improve flatness, and the finding that SAM can help quantized models achieve further flatness does not necessarily mean that quantized models have sharper minima compared to the non-quantized counterparts.

4.1.1 Loss Landscape Visualization

The loss landscape of quantized neural networks can be effectively visualized. Using the technique outlined in Li etal. (2018), we projected the loss landscape of quantized models and the full precision ResNet-18 models trained on the CIFAR-10 dataset onto a three-dimensional plane. The visual representation, as illustrated in Figure 2, clearly demonstrates that the loss landscape associated with quantized models is comparatively flatter. This observation confirms the findings presented in Table 3.

4.2 Measuring the Generalization Gap

To study the generalization behaviors of quantized models, we have trained almost 2000 models on the CIFAR-10, CIFAR-100 and ImageNet-1K datasets. Our goal is to measure the $\mathcal{L}_{\mathcal{D}}(\bm{w})\>-\mathcal{L}_{\mathscr{S}}(\bm{w})$ , i.e., the generalization gap, by utilizing the data that is unseen during the training process (the test data). Without loss of generality, herein, we will refer to the difference between test data loss and training data loss as the generalization gap.

Following the guidelines of Jiang etal. (2019) and Dziugaite etal. (2020) to remove the effect of randomness from our analysis of generalization behavior, for smaller datasets (CIFAR-10 and CIFAR100), we construct a pool of trained models by varying 5 commonly used hyperparameters over the fully convolutional "Network-in-Network" architecture Lin etal. (2013). The hyperparameter list includes learning rate, weight decay, optimization algorithm, architecture depth, and layer width. In our experiments, each hyperparameter has 3 choices; therefore, the number of trained models per quantization level is $3^{5}=243$ , with the number of bits considered being selected from the following values: 2, 4, and 8, and the resulting models are compared with their full-precision counterpart. Thus, in total, we will have $4\times 243=992$ models trained per dataset, over CIFAR-10 and CIFAR-100 datasets, which gives us almost 2000 trained models. For more details regarding hyperparameter choices and model specifications, please refer to the supplementary material Section B. Lastly, for ImageNet-1k, we measured the generalization gap on both CNN and ViT models.

In Jiang etal. (2019), to measure the generalization gap of a model, the authors first train the model until the training loss converges to a threshold (0.01). Here, we argue that this approach might not be optimal when quantization enters the picture. First, lower bit-resolution quantized models have lower learning capacity compared to the higher bit-resolution quantized or the full-precision ones; our proof in Equation 7 also indicates that the learning capabilities of a given network diminish as the number of quantization bits decreases. Second, early stopping of the training process may hinder the trained models from appropriately converging to flatter local minima, which quantized models enjoy in their loss landscape. Therefore, we apply a different training approach. Each model is trained for 300 epochs by lowering the learning rate by a factor of 10 at epochs 100 and 200, and at the end, the model corresponding to the lowest training loss is chosen.

Table 1 summarizes the results of these experiments. The accuracy-generalization trade-off is demonstrated through these experiments. The training loss and training accuracy of lower-resolution quantized models are negatively impacted. However, they enjoy better generalization. Some additional interesting results can be inferred from Table 1. Notably, 8-bit quantization is almost on par with the full-precision counterpart on all the metrics. This is also evident in Table 3, where we studied the sharpness-based measures. The other interesting observation is that although training losses vary among the models, the test loss is almost the same among all; this, in turn, indicates that full-precision and high-resolution quantized models have a higher degree of overfitting, which could result from converging to sharper local minima.

Severity 1Severity 2Severity 3Severity 4Severity 5ModelAugmentationFP32Int8Int4Int2FP32Int8Int4Int2FP32Int8Int4Int2FP32Int8Int4Int2FP32Int8Int4Int2Gaussian Noise1.0670.860.9391.211.9131.461.6292.2013.4392.522.7963.6585.5293.964.3525.1157.6535.626.1176.093Shot Noise1.2380.961.0541.3332.2891.721.8872.4213.8012.752.9873.6856.4384.44.7395.3047.9195.415.8056.003Impulse Noise2.2351.782.0582.3243.1772.352.6363.2794.0612.93.1854.0016.0964.264.5955.3247.7815.575.9956.114Defocus Noise0.9790.890.8580.8221.4321.371.3251.3022.3942.342.2632.2173.2853.193.0992.9343.9833.923.8083.498Glass Blue1.181.071.0310.9851.9691.861.8121.8043.8223.833.7273.544.2214.254.153.9094.6524.624.5424.139Motion Blur0.6870.550.5420.5091.371.221.2451.2612.5212.422.4542.3813.73.643.6783.4124.2854.234.2753.875Zoom Blur1.5181.381.3821.3862.2192.12.1192.0942.6882.582.5882.5183.2133.123.1373.0283.6663.593.5993.437Snow1.4011.010.9981.1243.2152.362.3742.6432.9692.072.0942.3153.972.812.8693.0744.5153.483.5173.572Frost0.9490.660.6260.6332.0931.681.6811.8122.9782.532.5532.6883.1412.742.7662.8933.7133.313.3623.447Fog0.8090.420.4440.4051.2140.680.7340.7741.8571.181.2731.4312.3471.681.7621.9583.773.033.1413.275Brightness0.1210.040.0190.1550.2210.10.080.0620.3780.190.1840.0840.6310.370.360.3230.9860.620.6260.672Contrast0.5230.240.2320.130.8670.40.4130.3961.6270.810.8611.0313.612.372.5292.9215.2644.634.7654.479Elastic0.5380.430.4060.2872.0261.951.9111.8331.1161.030.9690.8841.9971.941.8441.7554.1124.113.9573.57Pixelate0.6120.50.4920.4160.5990.510.5060.4651.8891.721.7341.9583.0462.932.883.3063.3693.323.3133.51ResNet-18JPEG0.590.480.4680.3750.8010.680.6740.6270.9720.850.8410.8241.5991.461.4461.4912.6152.432.4052.487Gaussian Noise1.0410.760.8572.781.9231.3821.5363.7553.4252.52.7625.0095.2514.0654.5186.1826.9975.8156.4237.124Shot Noise1.1320.8431.0272.9262.2141.5911.8463.9753.672.6243.0135.0585.8914.3635.0116.367.0455.4186.1376.96Impulse Noise1.6351.4831.5853.0432.5972.2232.2844.1443.4232.7512.9014.9615.3024.1714.5916.2966.9795.7536.3157.126Defocus Noise0.8630.7991.0053.8561.3261.2661.5194.2862.232.2132.4344.8583.0592.9833.3285.1193.7843.6554.25.303Glass Blue1.2231.1411.5093.5382.1152.0392.4434.2574.014.0034.3094.9434.3754.3534.5645.0394.6684.6014.7095.141Motion Blur0.6430.530.6413.0681.3351.2091.3543.7682.3922.2822.4354.3493.53.4183.5884.7744.1084.0284.2354.983Zoom Blur1.5391.4231.6073.5642.2822.1852.3583.972.7742.6852.8864.3353.3173.2633.4924.5673.7973.7384.0354.817Snow1.2530.9041.1682.3773.0742.4292.6944.0172.8382.162.4973.8693.7752.9453.2864.7974.5623.7763.9965.151Frost0.9410.6580.82.2362.1931.7842.0213.7133.1542.6732.9694.6823.3452.9043.2234.943.9563.4843.8355.46Fog0.6990.3540.8223.8741.0840.6241.244.4541.7151.1451.8024.9292.2561.6752.1114.9693.7923.1163.2985.371Brightness0.0340.050.0191.3420.1430.0080.0891.3590.3150.1170.2111.5490.5950.3030.4221.9871.0020.5910.7522.663Contrast0.4820.1880.8163.7170.8460.3941.5134.5711.6240.8993.0955.5563.662.7255.8166.3965.4114.8816.5056.59Elastic0.4420.3470.4812.3961.9731.8712.1054.0120.9620.871.1132.4741.9131.8072.212.9594.1063.9824.6934.036Pixelate0.9260.6530.8721.8831.4441.020.9341.8382.1551.8222.4682.1723.1113.0643.7732.7553.9793.9933.9883.301MoblieNet V2JPEG0.4910.3820.5541.7540.6750.5520.7841.8480.8260.6930.9671.9281.3571.1651.5552.1822.1821.9022.4622.545Gaussian Noise0.9380.9140.9280.9731.4371.2821.1121.5712.3632.0472.5132.893.7193.2553.7544.885.1344.9995.3397.828Shot Noise0.9610.9460.9571.0261.5851.4081.2461.832.4482.1662.0233.2154.0843.6973.8875.7484.9244.9195.2297.656Impulse Noise1.7031.6521.6761.7892.0131.8741.4642.3732.5642.2951.9953.2113.9623.5073.4585.4355.284.9425.1058.123Defocus Noise1.0591.0420.9110.8691.4411.4141.3091.2982.3442.3112.2862.2313.2443.2253.2263.1644.0524.0494.0193.994Glass Blue1.3491.271.0111.4572.2972.1691.8292.0884.6134.5514.1854.3465.0575.0094.8154.7935.3995.3765.345.102Motion Blur0.7310.6380.6230.5781.3141.3071.191.2382.5632.5512.3372.5014.1484.0573.6723.965.0485.0334.4044.75Zoom Blur1.5091.4731.2611.3612.2522.1872.0372.1342.8222.7362.5712.6973.443.3373.1393.3254.0393.9493.6763.916Snow1.2291.131.0481.1432.622.5292.4812.9332.3752.3172.1932.4783.1273.0162.9413.3473.6973.4373.84.209Frost0.8450.8370.6740.6531.7691.7261.5061.7852.5632.5222.2452.5882.7612.722.4352.8163.3223.2912.9723.427Fog0.6910.6850.5820.5010.8970.8760.7840.9261.3051.2481.1671.3371.811.6571.6351.9093.2612.962.9793.569Brightness0.3450.250.210.0810.3820.3140.2590.4310.4530.4460.3410.2220.5840.5140.4730.3670.7870.7160.6740.609Contrast0.5450.4490.420.3080.690.6770.5680.4941.0470.9980.8671.0512.3872.1731.852.6244.6864.3943.6734.914Elastic0.6550.6250.5170.4222.2742.2431.9082.1221.6021.5711.2421.4162.7042.6712.2092.5915.5985.5224.6475.348Pixelate0.8710.710.7270.6841.0421.0211.0150.7781.9711.5751.5561.9743.3733.2293.2083.4314.0384.0143.9964.179ResNet-50JPEG0.7930.7010.670.5220.9570.980.8320.7041.0921.0230.960.8491.5371.4251.3561.3942.2362.1181.9672.228

4.3 Generalization Under Distorted Data

In addition to assessing the generalization metrics outlined in the previous sections, we sought to investigate some real-world implications that the generalization quality of the quantized models would have. To this end, we evaluate the performance of vision models in the case when the input images are distorted or corrupted by common types of perturbations. We take advantage of the comprehensive benchmark provided in Hendrycks & Dietterich (2019) where they identify 15 types of common distortions and measure the performance of the models under different levels of severity for each distortion. Table 4 presents the generalization gaps as calculated by the difference of loss on the corrupted dataset and the loss on training data for 5 levels of severity for ResNet-18 and ResNet-50 trained on ImageNet-1K. By augmenting the test dataset in this way, we are unlocking more unseen data for evaluating the generalization of our models. As is evident through these experiments, quantized models maintain their superior generalization under most of the distortions. Accuracies of the models on the distorted dataset, as well as results and discussions on more architectures and datasets and details on conducting our experiments, are available in the supplementary material Section C.

5 Conclusion

In this work, we investigated the generalization properties of quantized neural networks, which have received limited attention despite their significant impact on model performance. We demonstrated that quantization has a regularization effect and it leads to improved generalization capabilities. We empirically show that quantization could facilitate the convergence of models to flatter minima. Lastly, on distorted data, we provided empirical evidence that quantized models exhibit improved generalization compared to their full-precision counterparts across various experimental setups. Through the findings of this study, we hope that the inherent generalization capabilities of quantized models can be used to further improve their performance.

References

(1)Models and pre-trained weights and mdash; Torchvision 0.15 documentation.https://pytorch.org/vision/stable/models.html.[Online; accessed 2023-05-19].
Agustsson & Theis (2020)Eirikur Agustsson and Lucas Theis.Universally quantized neural compression.Advances in neural information processing systems, 33:12367–12376, 2020.
Bach (2017)Francis Bach.Breaking the curse of dimensionality with convex neural networks.The Journal of Machine Learning Research, 18(1):629–681, 2017.
Bai etal. (2021)Haoping Bai, Meng Cao, Ping Huang, and Jiulong Shan.Batchquant: Quantized-for-all architecture search with robust quantizer.Advances in Neural Information Processing Systems, 34:1074–1085, 2021.
Bartlett etal. (2017)PeterL Bartlett, DylanJ Foster, and MatusJ Telgarsky.Spectrally-normalized margin bounds for neural networks.Advances in neural information processing systems, 30, 2017.
Chen etal. (2021)Wentao Chen, Hailong Qiu, Jian Zhuang, Chutong Zhang, YuHu, Qing Lu, Tianchen Wang, Yiyu Shi, Meiping Huang, and Xiaowe Xu.Quantization of deep neural networks for accurate edge computing.ACM Journal on Emerging Technologies in Computing Systems (JETC), 17(4):1–11, 2021.
Chmiel etal. (2020)Brian Chmiel, Ron Banner, Gil Shomron, Yury Nahshan, Alex Bronstein, Uri Weiser, etal.Robust quantization: One model to rule them all.Advances in neural information processing systems, 33:5308–5317, 2020.
Courbariaux etal. (2015)Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David.Binaryconnect: Training deep neural networks with binary weights during propagations.Advances in neural information processing systems, 28, 2015.
Cubuk etal. (2018)EkinD Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and QuocV Le.Autoaugment: Learning augmentation policies from data.arXiv preprint arXiv:1805.09501, 2018.
Défossez etal. (2021)Alexandre Défossez, Yossi Adi, and Gabriel Synnaeve.Differentiable model compression via pseudo quantization noise.arXiv preprint arXiv:2104.09987, 2021.
Deng etal. (2009)Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and LiFei-Fei.Imagenet: A large-scale hierarchical image database.In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009.doi: 10.1109/CVPR.2009.5206848.
Dinh etal. (2017)Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio.Sharp minima can generalize for deep nets.In International Conference on Machine Learning, pp. 1019–1028. PMLR, 2017.
Du etal. (2022)Jiawei Du, Daquan Zhou, Jiashi Feng, Vincent Tan, and JoeyTianyi Zhou.Sharpness-aware training for free.Advances in Neural Information Processing Systems, 35:23439–23451, 2022.
Dziugaite & Roy (2017)GintareKarolina Dziugaite and DanielM Roy.Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data.arXiv preprint arXiv:1703.11008, 2017.
Dziugaite etal. (2020)GintareKarolina Dziugaite, Alexandre Drouin, Brady Neal, Nitarshan Rajkumar, Ethan Caballero, Linbo Wang, Ioannis Mitliagkas, and DanielM Roy.In search of robust measures of generalization.Advances in Neural Information Processing Systems, 33:11723–11733, 2020.
Esser etal. (2020)StevenK. Esser, JeffreyL. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and DharmendraS. Modha.Learned step size quantization.In International Conference on Learning Representations, 2020.URL https://openreview.net/forum?id=rkgO66VKDS.
Foret etal. (2021)Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur.Sharpness-aware minimization for efficiently improving generalization.In International Conference on Learning Representations, 2021.URL https://openreview.net/forum?id=6Tm1mposlrM.
Galloway etal. (2017)Angus Galloway, GrahamW Taylor, and Medhat Moussa.Attacking binarized neural networks.arXiv preprint arXiv:1711.00449, 2017.
Gambardella etal. (2019)Giulio Gambardella, Johannes Kappauf, Michaela Blott, Christoph Doehring, Martin Kumm, Peter Zipf, and Kees Vissers.Efficient error-tolerant quantized neural network accelerators.In 2019 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), pp. 1–6. IEEE, 2019.
Gholami etal. (2021)Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, MichaelW Mahoney, and Kurt Keutzer.A survey of quantization methods for efficient neural network inference.arXiv preprint arXiv:2103.13630, 2021.
Gorsline etal. (2021)Micah Gorsline, James Smith, and Cory Merkel.On the adversarial robustness of quantized neural networks.In Proceedings of the 2021 on Great Lakes Symposium on VLSI, pp. 189–194, 2021.
Hendrycks & Dietterich (2019)Dan Hendrycks and Thomas Dietterich.Benchmarking neural network robustness to common corruptions and perturbations.Proceedings of the International Conference on Learning Representations, 2019.
Hou etal. (2019)LuHou, Ruiliang Zhang, and JamesT Kwok.Analysis of quantized models.In International Conference on Learning Representations, 2019.
Hubara etal. (2016)Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio.Binarized neural networks.Advances in neural information processing systems, 29, 2016.
Ioffe & Szegedy (2015)Sergey Ioffe and Christian Szegedy.Batch normalization: Accelerating deep network training by reducing internal covariate shift.In International conference on machine learning, pp. 448–456. pmlr, 2015.
Izmailov etal. (2018)Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and AndrewGordon Wilson.Averaging weights leads to wider optima and better generalization.arXiv preprint arXiv:1803.05407, 2018.
Jacob etal. (2018)Benoit Jacob, Skirmantas Kligys, BoChen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko.Quantization and training of neural networks for efficient integer-arithmetic-only inference.In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
Jacot etal. (2018)Arthur Jacot, Franck Gabriel, and Clément Hongler.Neural tangent kernel: Convergence and generalization in neural networks.Advances in neural information processing systems, 31, 2018.
Jiang etal. (2019)Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio.Fantastic generalization measures and where to find them.arXiv preprint arXiv:1912.02178, 2019.
Keskar etal. (2016)NitishShirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping TakPeter Tang.On large-batch training for deep learning: Generalization gap and sharp minima.arXiv preprint arXiv:1609.04836, 2016.
Kingma & Ba (2014)DiederikP Kingma and Jimmy Ba.Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014.
Krishnamoorthi (2018)Raghuraman Krishnamoorthi.Quantizing deep convolutional networks for efficient inference: A whitepaper.CoRR, abs/1806.08342, 2018.URL http://arxiv.org/abs/1806.08342.
Krizhevsky (2009)Alex Krizhevsky.Learning multiple layers of features from tiny images.pp. 32–33, 2009.URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
Li etal. (2017)Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein.Training quantized nets: A deeper understanding.Advances in Neural Information Processing Systems, 30, 2017.
Li etal. (2018)Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein.Visualizing the loss landscape of neural nets.In Neural Information Processing Systems, 2018.
Liang etal. (2019)Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes.Fisher-rao metric, geometry, and complexity of neural networks.In The 22nd international conference on artificial intelligence and statistics, pp. 888–896. PMLR, 2019.
Lin etal. (2019)JiLin, Chuang Gan, and Song Han.Defensive quantization: When efficiency meets robustness.arXiv preprint arXiv:1904.08444, 2019.
Lin etal. (2013)Min Lin, Qiang Chen, and Shuicheng Yan.Network in network.arXiv preprint arXiv:1312.4400, 2013.
Liu etal. (2021)Jing Liu, Jianfei Cai, and Bohan Zhuang.Sharpness-aware quantization for deep neural networks.arXiv preprint arXiv:2111.12273, 2021.
Martens & Grosse (2015)James Martens and Roger Grosse.Optimizing neural networks with kronecker-factored approximate curvature.In International conference on machine learning, pp. 2408–2417. PMLR, 2015.
McAllester (1999)DavidA McAllester.Pac-bayesian model averaging.In Proceedings of the twelfth annual conference on Computational learning theory, pp. 164–170, 1999.
Mishchenko etal. (2019)Yuriy Mishchenko, Yusuf Goren, Ming Sun, Chris Beauchene, Spyros Matsoukas, Oleg Rybakov, and Shiv NagaPrasad Vitaladevuni.Low-bit quantization and quantization-aware training for small-footprint keyword spotting.In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pp. 706–711. IEEE, 2019.
Netzer etal. (2011)Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, BoWu, and AndrewY. Ng.Reading digits in natural images with unsupervised feature learning.In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.URL http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf.
Neyshabur etal. (2017)Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro.A pac-bayesian approach to spectrally-normalized margin bounds for neural networks.arXiv preprint arXiv:1707.09564, 2017.
Polino etal. (2018)Antonio Polino, Razvan Pascanu, and Dan Alistarh.Model compression via distillation and quantization.In International Conference on Learning Representations, 2018.URL https://openreview.net/forum?id=S1XolQbRW.
Recht etal. (2019)Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar.Do imagenet classifiers generalize to imagenet?In International conference on machine learning, pp. 5389–5400. PMLR, 2019.
Vaswani etal. (2017)Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, AidanN Gomez, Łukasz Kaiser, and Illia Polosukhin.Attention is all you need.Advances in neural information processing systems, 30, 2017.
Wang etal. (2022)Zheng Wang, JunchengB Li, Shuhui Qu, Florian Metze, and Emma Strubell.Squat: Sharpness-and quantization-aware training for bert.arXiv preprint arXiv:2210.07171, 2022.
Widrow etal. (1996)Bernard Widrow, Istvan Kollar, and Ming-Chang Liu.Statistical theory of quantization.IEEE Transactions on instrumentation and measurement, 45(2):353–361, 1996.
XijieHuang & Cheng (2023)ZhiqiangShen XijieHuang and Kwang-Ting Cheng.Variation-aware vision transformer quantization.arXiv preprint arXiv:2307.00331, 2023.
Xu etal. (2018)Chen Xu, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao, Zhirong Wang, and Hongbin Zha.Alternating multi-bit quantization for recurrent neural networks.arXiv preprint arXiv:1802.00150, 2018.
Zhang etal. (2021)Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals.Understanding deep learning (still) requires rethinking generalization.Communications of the ACM, 64(3):107–115, 2021.
Zhang etal. (2022a)Kaiqi Zhang, Ming Yin, and Yu-Xiang Wang.Why quantization improves generalization: Ntk of binary weight neural networks, 2022a.URL https://arxiv.org/abs/2206.05916.
Zhang etal. (2022b)Yedi Zhang, FuSong, and Jun Sun.Qebverif: Quantization error bound verification of neural networks.arXiv preprint arXiv:2212.02781, 2022b.
Zhou etal. (2017)Aojun Zhou etal.Incremental network quantization: Towards lossless cnns with low-precision weights.CoRR, abs/1702.03044, 2017.

Appendix A Flatness Landscape

The PAC-Bayesian and sharpness generalization measures both make use of the PAC-Bayes bounds, which estimate the bounds of the generalization error of a predictor (i.e. neural network). In our case, the PAC-Bayes bound is a function of the KL divergence of the prior distribution and posterior distribution of the model parameters, where the prior distribution is drawn without knowledge of the dataset and the posterior distribution is a perturbation on the trained parameters. It has been shown that when both distributions are isotropic Gaussian distributions, then PAC-Bayesian bounds are a good measure of generalization in small-scale experiments. We refer the reader to Jiang etal. (2019) for more detailed analysis and derivations, which we summarize here. The PAC-Bayes generalization measures are defined below:

	$\displaystyle\mu_{\text{pac-bayes-init}}(f_{\bm{w}}))$	$\displaystyle=\frac{\|\|\bm{w}-\bm{w}^{0}\|\|_{2}^{2}}{4\sigma^{2}}+\log(\frac{m}{%\sigma})+10$		(15)
	$\displaystyle\mu_{\text{pac-bayes-orig}}(f_{\bm{w}}))$	$\displaystyle=\frac{\|\|\bm{w}\|\|_{2}^{2}}{4\sigma^{2}}+\log(\frac{m}{\delta})+10$		(16)

Where $\sigma$ is chosen to be the largest number such that $\mathbb{E}_{\bm{u}\sim\mathcal{N}(\mu,\sigma^{2}I)}[\hat{\mathcal{L}}\left(f_{%\bm{w}+\bm{u}})\right]\leq 0.1$ , and $m$ is the sample size of the dataset

From the same PAC-Bayesian bound framework, we can also derive the sharpness measure, by using the worst-case noise $\alpha$ rather than the Gaussian sampled noise.

	$\displaystyle\mu_{\text{sharpness-init}}(f_{\bm{w}}))$	$\displaystyle=\frac{\|\|\bm{w}-\bm{w}^{0}\|\|_{2}^{2}\log(2\omega)}{4\alpha^{2}}+%\log(\frac{m}{\sigma})+10$		(17)
	$\displaystyle\mu_{\text{sharpness-orig}}(f_{\bm{w}}))$	$\displaystyle=\frac{\|\|\bm{w}\|\|_{2}^{2}\log(2\omega)}{4\alpha^{2}}+\log(\frac{m%}{\delta})+10$		(18)

Where $\alpha$ is chosen to be the largest number such that $\text{max}_{|u_{i}|\leq\alpha}\hat{\mathcal{L}}(f_{\bm{w}+\bm{u}})\leq 0.1$ and $\omega$ is the number of parameters in the model.

For magnitude-aware measures Keskar etal. (2016), the ratio of the magnitude of the perturbation to the magnitude of the parameter is bound by a constant $\alpha^{\prime}$ . By bounding the ratio of perturbation to parameter magnitude, we prevent parameters from changing signs. This change leads to the following magnitude-aware generalization measures:

	$\displaystyle\mu_{pac-bayes-mag-init}(f_{\bm{w}})$	$\displaystyle=\frac{1}{4}\sum_{i=1}^{\omega}{\log\left(\frac{\epsilon^{2}+(%\sigma^{\prime 2}+1)\|\|\bm{w}-\bm{w^{0}}\|\|_{2}^{2}/\omega}{\epsilon^{2}+\sigma^%{\prime 2}\|w_{i}-w_{i}^{0}\|^{2}}\right)}+\log(\frac{m}{\delta})+10$		(19)
	$\displaystyle\mu_{pac-bayes-mag-orig}(f_{\bm{w}})$	$\displaystyle=\frac{1}{4}\sum_{i=1}^{\omega}{\log\left(\frac{\epsilon^{2}+(%\sigma^{\prime 2}+1)\|\|\bm{w}\|\|_{2}^{2}/\omega}{\epsilon^{2}+\sigma^{\prime 2}\|%w_{i}-w_{i}^{0}\|^{2}}\right)}+\log(\frac{m}{\delta})+10$		(20)

	$\displaystyle\mu_{sharpness-mag-init}(f_{\bm{w}})$	$\displaystyle=\frac{1}{4}\sum_{i=1}^{\omega}{\log\left(\frac{\epsilon^{2}+(%\alpha^{\prime 2}+4\log(2\omega/\delta)\|\|\bm{w}-\bm{w^{0}}\|\|_{2}^{2}/\omega}{%\epsilon^{2}+\alpha^{\prime 2}\|w_{i}-w_{i}^{0}\|^{2}}\right)}+\log(\frac{m}{%\delta})+10$		(21)
	$\displaystyle\mu_{sharpness-mag-orig}(f_{\bm{w}})$	$\displaystyle=\frac{1}{4}\sum_{i=1}^{\omega}{\log\left(\frac{\epsilon^{2}+(%\alpha^{\prime 2}+4\log(2\omega/\delta)\|\|\bm{w}\|\|_{2}^{2}/\omega}{\epsilon^{2}%+\alpha^{\prime 2}\|w_{i}-w_{i}^{0}\|^{2}}\right)}+\log(\frac{m}{\delta})+10$		(22)

Where $\epsilon=0.001$ and $\sigma$ is chosen to be the largest number such that $\mathbb{E}_{\bm{u}}\left[\hat{\mathcal{L}}(f_{\bm{w}+\bm{u}})\right]\leq 0.1$ ,

Appendix B Experiment Setup For Measuring Sharpness-based Metrics

B.1 Training Setup

We used different models and datasets to compute the generalization gap using proxy metrics described in Section A.

Our experiments employed the LSQ method Esser etal. (2020) for weight quantization. The CIFAR-10, CIFAR-100, and ImageNet datasets were utilized for testing purposes. We applied three distinct quantization levels for quantized models: 2, 4, and 8 bits. The CIFAR-10 and CIFAR-100 NiN models are trained with a base width of 25, and they are trained for 300 epochs, with an SGD optimizer, an initial learning rate of 0.1, momentum of 0.9, and a weight decay of 0.0001. We utilize a multi-step scheduler with steps at epochs 100 and 200, and the gamma is 0.1. The ResNet models that we use for these two datasets have a base width of 16 and use the same optimizer as the NiN network. However, these models are trained for 200 epochs, and the steps happen at epochs 80 and 160. The ResNet models we utilize for comparing sharpness-based measures for the ImageNet dataset have a base width of 64. We again use the same optimizer only with a different learning rate of 0.01. We fine-tune the models from Pytorch pre-trained weights for 120 epochs, and the steps happen at epochs 30, 60, and 90.

B.2 Measuring the Metrics

To measure the PAC-Bayesian and sharpness measures, we measure these metrics for the cases of magnitude aware and the normal for each quantization level. In each case, we run the search for finding the maximum amount of possible noise ( $\sigma$ ), for 15 iterations, and within each iteration we calculate the mean of the accuracy on the training data over 10 runs to remove the effect of randomness. As an additional step in calculating the sharpness measures, we perform the gradient ascent step to maximize the loss value for 20 iterations. We use a learning rate of 0.0001 with an SGD optimizer for the gradient ascent process.

B.3 Measuring Generalization Gaps

In our experiments for measuring the generalization gaps, we trained almost 2000 CIFAR-10 and CIFAR-100 models. The main backbone in all these experiments was NiN. We trained the models over the variation of hyperparameter values for 5 hyperparameter, and each hyperparameter had 3 choices. For the case of CIFAR-10, here are the values for hyperparameters:

•
Optimizer algorithm: {SGD, ADAM, RMSProp}
•
Learning rate: {0.1, 0.05, 0.01} for SGD, {0.001, 0.0005, 0.0001} for ADAM and RMSProp
•
Weight decay: {0.0, 0.0001, 0.0002}
•
Width multiplier: {8, 10, 12}
•
Depth multiplier: {2, 3, 4}

For CIFAR-100 everything is the same with the minor difference of depth multipliers being in the set of {3, 4, 5}.

Each NiN training instance is trained for 300 epochs, in every case a step scheduler with steps at the 100th and 200th epoch and a gamma of 0.1 is utilized. The model with the lowest loss on training data is used with no information about the test data. Then the statistics in 1 are generated.

B.4 Computation Requirements

To train the NiN models for each quantization level, we use one NVIDIA A100 GPU with a batch size of 128. Each experiment takes almost 6 days to run, which on average is equivalent to 35 minutes per model training. We use 8 GPUs, 4 for CIFAR-10 and 4 for CIFAR-100.

For evaluating the sharpness measures, the main bottleneck is for ImageNet models, as evaluating the sharpness measures for each quantization level requires almost 600 evaluations on the training data in the worst. Running each quantization level on one NVIDIA A100 GPU requires 33 hours on average.

Appendix C Distortion Experiments

These are the extended results for investigating the generalization gap under distortion. We provide a generalization gap of quantized and full precision models on augmenteddatasets.

C.1 Training Setup

For full precision models, we used pre-trained models publicly available on the Pytorch website Pyt .For quantized models, we use weight quantization using LSQ Esser etal. (2020) method. We use CIFAR-100 and ImageNet datasets in our tests. We use three different quantization levels for quantized models: 2, 4, and 8 bits. We use a multi-step scheduler with steps at 30, 60, and 90 with an initial learning rate of 0.01 and gamma of 0.1. We use weight decay of 1e-4 and SGD optimizer. We trained all models for 120 epochs. Finally, we used pre-trained models from Pytorch to initialize weights for LSQ quantization.

C.2 Data Preparation

For augmented datasets, we use the corrupted Imagenet-C and CIFAR100-C datasets proposed in Hendrycks & Dietterich (2019).Table 6 presents the results of the experiments performed on the ResNet-18, MoobileNet V2 and ResNet-50 models trained on theImageNet dataset, and Table 5 present the results for ResNet-18, MobileNet V1, and VGG-19 models on CIFAR-100. These tables show the effect of distortion on the generalization gap of quantized models when various types and severity levels of distortions are used.Specifically, 15 different types of distortions were applied to the models. For each distortion type, the generalization gap was computed by subtracting the test loss on the distorted dataset from the loss on the original ImageNet training dataset.

C.3 Computation Setup

For these experiments, we used 8 NVIDIA A100 GPUs with 40 GB of RAM to train ImageNeta and CIFAR-100 models. With the above training hardware and submitted code, each ImageNet model takes 18 hours to train on average. CIFAR-100 models take much less time, and on average, one can train each model in less than an hour.

C.4 Results on CIFAR-100 Dataset

For the CIFAR-100 dataset, we computed the generalization gap ( by subtracting test loss on augmented CIFAR100-C data from train loss on the original CIFAR-100 dataset) for ResNet-18, MobileNet-V1, and VGG-19. Unlike the Imagenet-C dataset, the CIFAR100-C dataset comes with only one distortion severity level. Table 5 shows the result of our experiment on the CIFAR-100 dataset. Compared to full-precision models, quantized models show smaller generalization gaps in all cases.

ResNet-18MobileNet-V1VGG-19AugmentationFP32Int8Int4Int2FP32Int8Int4Int2FP32Int8Int4Int2Gaussian Noise2.7821.5461.6141.2254.3270.0220.0270.1914.261.6870.910.497Shot Noise2.0271.3361.41.0413.3570.0230.7730.1473.2671.2990.7010.427Impulse Noise2.0041.1551.2521.0042.9210.0230.6330.1564.2051.3690.9640.563Defocus Noise1.0040.7520.8150.5981.8890.0230.4790.0571.890.5010.4010.267Glass Blue4.6492.4452.4791.8895.8490.0220.1860.238.7092.3691.4570.797Motion Blur1.3730.9230.9710.7062.3860.0251.0010.0672.4130.6870.540.335Zoom Blur1.490.9150.9790.7362.6570.0220.2240.0682.5790.7880.5810.351Snow1.4090.7690.8530.6662.4330.0220.260.0852.6530.7090.4450.256Frost1.4730.4870.5810.42.4650.0230.2450.0032.5060.4390.1150.005Fog0.9910.5050.5830.4091.8580.0220.0560.0071.880.3380.2380.145Brightness0.9960.5430.6090.4351.8690.0220.0480.011.8780.3490.2290.112Contrast1.0180.5320.6180.4341.9050.0220.0810.0151.9070.3650.2690.179Elastic1.4150.9961.0480.8282.4930.0220.0550.1132.5070.80.6460.429Pixelate1.1850.8880.9820.7672.1710.0220.3360.0872.1360.7130.4790.33JPEG1.71.2171.3131.0022.750.0220.3090.1192.8681.120.7030.427

C.5 Results on ImageNet Dataset

For the ImageNet dataset, we computed the generalization gap ( by subtracting test loss on augmented data from train loss on the original ImageNet dataset) for ResNet-18, MobileNet V2, and ResNet-50.Table 6 shows the full list of experiments. As seen, compared to the full precision model and unlike the CIFAR-100 dataset, not all quantization levels show a better generalization gap. Especially for the MobileNet-V2 model, Int2 quantization shows the worst generalization gap in most distortion types and severity levels. But in general, Int8 and Int4 show better generalization gaps in almost all models, distortion types, and levels.

	$\displaystyle\tilde{\mathcal{L}}$	$\displaystyle=\mathbb{E}_{p(\bm{x},\bm{y},\bm{\Delta})}[(\bm{\hat{y}}^{q}-\bm{%{y}})^{2}]$
		$\displaystyle=\mathbb{E}_{p(\bm{x},\bm{y},\bm{\Delta})}[((\bm{\hat{y}}+{\bm{%\Delta}}^{\top}\nabla_{\bm{w}}\bm{\hat{y}})-\bm{{y}})^{2}]$
		$\displaystyle=\mathbb{E}_{p(\bm{x},\bm{y},\bm{\Delta})}[(\bm{\hat{y}}-\bm{{y}}%)^{2}+\\|{\bm{\Delta}}^{\top}\nabla_{\bm{w}}\hat{\bm{y}}\\|^{2}_{2}+2(\bm{\hat{y%}}-\bm{{y}})({\bm{\Delta}}^{\top}\nabla_{\bm{w}}\bm{\hat{y}})]$
		$\displaystyle=\mathcal{L}+\mathbb{E}_{p(\bm{x},\bm{y},\bm{\Delta})}[\\|{\bm{%\Delta}}^{\top}\nabla_{\bm{w}}\hat{\bm{y}}\\|^{2}_{2}]+\mathbb{E}_{p(\bm{x},\bm%{y},\bm{\Delta})}[2(\bm{\hat{y}}-\bm{y})({\bm{\Delta}}^{\top}\nabla_{\bm{w}}%\bm{\hat{y}})]$