QGen: On the Ability to Generalize in Quantization Aware Training (2024)

MohammadHossein AskariHemmat* mohammad@deeplite.ai
Deeplite
Ahmadreza Jeddi* ahmadreza.jeddi@gmail.com
Deeplite
Reyhane Askari Hemmat reyhane.askari.hemmat@umontreal.ca
University of Montreal and Mila, Quebec AI Institute
Ivan Lazarevich ivan.lazarevich@deeplite.ai
Deeplite
Alexander Hoffman alexander.hoffman@deeplite.ai
Deeplite
Sudhakar Sah sudhakar@deeplite.ai
Deeplite
Ehsan Saboori ehsan@deeplite.ai
Deeplite
Yvon Savaria yvon.savaria@polymtl.ca
École Polytechnique de Montréal
Jean-Pierre David jean-pierre.david@polymtl.ca
École Polytechnique de Montréal
*Equal contribution.

Abstract

Quantization lowers memory usage, computational requirements, and latency by using fewer bits to represent model weights and activations. In this work, we investigate the generalization properties of quantized neural networks, a characteristic that has received little attention despite its implications for model performance. In particular, we first develop a theoretical model of quantization in neural networks and demonstrate how quantization functions as a form of regularization. Second, motivated by recent work connecting the sharpness of the loss landscape and generalization, we derive an approximate bound on the generalization of quantized models conditioned on the amount of quantization noise. We then validate our hypothesis by experimenting with over 2000 models trained on the CIFAR-10, CIFAR-100, and ImageNet datasets, using both convolutional and transformer-based architectures.

1 Introduction

The exceptional growth of deep learning has made it one of the most promising technologies for applications such as computer vision, natural language processing, and speech recognition. The ongoing advances in these models consistently enhance their capabilities, yet their improved performance often comes at the price of growing complexity and an increased number of parameters. This increasing complexity poses a challenge for deployment in production systems due to higher operating costs, greater memory requirements, and longer response times. Quantization Courbariaux et al. (2015); Hubara et al. (2016); Polino et al. (2018); Zhou et al. (2017); Jacob et al. (2018); Krishnamoorthi (2018) is one of the prominent techniques developed to reduce model size and/or latency. Model quantization represents full-precision model weights and/or activations using fewer bits, resulting in models with lower memory usage and energy consumption, and faster inference. Quantization has gained significant attention in academia and industry; especially with the emergence of the transformer Vaswani et al. (2017), it has become a standard technique for reducing memory and computation requirements.

The benefits of quantization, such as a reduced memory footprint and lower latency, as well as its impact on accuracy, are well studied Courbariaux et al. (2015); Li et al. (2017); Gholami et al. (2021). These studies are mainly driven by the fact that modern hardware is faster and more energy efficient for low-precision (byte, sub-byte) arithmetic than for its floating-point counterpart. Despite its numerous benefits, quantization may adversely impact accuracy.

Hence, substantial research efforts on quantization revolve around addressing the accuracy degradation resulting from lower-bit representations. This involves analyzing the model's convergence properties at various numerical precisions and studying their impact on gradients and network updates Li et al. (2017); Hou et al. (2019).

In this work, we delve into the generalization properties of quantized neural networks. This key aspect has received limited attention despite its significant implications for the performance of models on unseen data, and it becomes particularly important for safety-critical applications Gambardella et al. (2019); Zhang et al. (2022b). While prior research has explored the performance of quantized neural networks in adversarial settings Gorsline et al. (2021); Lin et al. (2019); Galloway et al. (2017) and investigated how altering the number of quantization bits during inference can affect model performance Chmiel et al. (2020); Bai et al. (2021), there is a lack of systematic studies on the generalization effects of quantization using standard measurement techniques.

This work studies the effects of different quantization levels on model generalization, training accuracy, and training loss. First, in Section 3, we model quantization as a form of noise added to the network weights. We then demonstrate that this noise serves as a regularizing agent, with its degree of regularization directly related to the bit precision. Consistent with other regularization methods, our empirical studies further support the claim that each model requires precise tuning of its quantization level, as models achieve optimal generalization at different quantization levels. On the generalization side, in Section 4, we show that quantization can help the optimization process converge to minima with lower sharpness when the scale of the quantization noise is bounded. This is motivated by the recent works of Foret et al. (2021); Keskar et al. (2016), which establish connections between the sharpness of the loss landscape and generalization. We then leverage recent advances in generalization measurement Jiang et al. (2019); Dziugaite et al. (2020), particularly sharpness-based measures Keskar et al. (2016); Dziugaite & Roy (2017); Neyshabur et al. (2017), to verify our hypothesis across a wide range of vision problems with different setups and model architectures. Finally, we present visual demonstrations illustrating that quantized models have a flatter loss landscape.

After establishing that lower-bit quantization results in improved flatness of the loss landscape, we study the connection between the achieved flatness of lower-bit-quantized models and generalization. Our method estimates a model's generalization on a given data distribution by measuring the difference between its loss on the training data and on the test data. To do so, we train a pool of almost 2000 models on the CIFAR-10, CIFAR-100 Krizhevsky (2009), and ImageNet-1K Deng et al. (2009) datasets and report the estimated generalization gap. We conclude our experiments by showing a practical use case of model generalization, in which we evaluate vision models under severe conditions where the input to the model is corrupted. This is achieved by measuring the generalization gap for quantized and full-precision models under different types of input noise, as introduced in Hendrycks & Dietterich (2019). Our main contributions can be summarized as follows:

  • We theoretically show that quantization can be seen as a regularizer.

  • We empirically show that there exists a quantization level at which the quantized model converges to a flatter minimum than its full-precision counterpart.

  • We empirically demonstrate that quantized models show a smaller generalization gap on distorted data.

2 Related Works

2.1 Regularization Effects of Quantization

Since the advent of BinaryConnect Courbariaux et al. (2015) and Binarized Neural Networks Hubara et al. (2016), the first works on quantization, the machine learning community has been aware of the generalization effects of quantization, and the observed generalization gains have commonly been attributed to the implicit regularization that the quantization process may impose. This pattern is also observed in more recent works such as Mishchenko et al. (2019); Xu et al. (2018); Chen et al. (2021). Even though these studies have empirically reported some performance gain as a side product of quantization, they lack a well-formed analytical study.

Viewing quantization simply as regularization is relatively intuitive, and to the best of our knowledge, the only work so far that has tried to study this behavior formally is the recent work of Zhang et al. (2022a), where the authors provide an analytical study of how models with stochastic binary quantization can have a smaller generalization gap than their full-precision counterparts. The authors propose a quasi-neural network to approximate the effect of binarization on neural networks. They then derive the neural tangent kernel Jacot et al. (2018); Bach (2017) for the proposed quasi-neural network approximation. With this formalization, the authors show that binary neural networks have lower capacity, hence lower training accuracy, and a smaller generalization gap than their full-precision counterparts. However, this work is limited to the case of simplified binarized networks and does not study the wider quantization space, and its supporting empirical studies are done on the MNIST and Fashion-MNIST datasets, with no experiments on larger-scale, more realistic problems. Furthermore, the Neural Tangent Kernel (NTK) analysis requires strong assumptions, such as an approximately linear behavior of the model during training, which may not hold in practical setups.

2.2 Generalization and Complexity Measures

Generalization refers to the ability of machine learning models to perform well on unseen data beyond the training set. Despite the remarkable success and widespread adoption of deep neural networks across various applications, the factors influencing their generalization capabilities, and the extent to which they generalize effectively, are still unclear Jiang et al. (2019); Recht et al. (2019).

Minimizing common loss functions (e.g., cross-entropy and its variants) on the training data does not necessarily mean the model will generalize well Foret et al. (2021); Recht et al. (2019), especially since recent models are heavily over-parameterized and can easily overfit the training data. In Zhang et al. (2021), the authors demonstrate neural networks' vulnerability to poor generalization by showing that they can perfectly fit randomly labeled training data. This is due to the complex and non-convex landscape of the training loss. Numerous works have tried to either explicitly or implicitly address this overfitting issue using optimization algorithms Kingma & Ba (2014); Martens & Grosse (2015), data augmentation techniques Cubuk et al. (2018), and batch normalization Ioffe & Szegedy (2015), to name a few.

So the question remains: what is the best indicator of a model's ability to generalize? Proving upper bounds on the test error Neyshabur et al. (2017); Bartlett et al. (2017) has been the most direct way of studying the ability of models to generalize; however, the current bounds are not tight enough to indicate a model's ability to generalize Jiang et al. (2019). Therefore, several recent works have preferred more empirical approaches to studying generalization Keskar et al. (2016); Liang et al. (2019). These works introduce a complexity measure, a quantity that monotonically relates to some aspect of generalization; specifically, lower complexity measures correspond to neural networks with improved generalization capacity. Many complexity measures have been introduced in the literature, but each has typically targeted a limited set of models on toy problems. However, recent work in Jiang et al. (2019), followed by Dziugaite et al. (2020), performed an exhaustive set of experiments on the CIFAR-10 and SVHN Netzer et al. (2011) datasets with different model backbones and hyper-parameters to identify the measures that correlate best with generalization. Both of these large-scale studies show that sharpness-based measures are the most effective. Sharpness-based measures are derived either from measuring the average flatness around a minimum through Gaussian perturbations (PAC-Bayesian bounds McAllester (1999); Dziugaite & Roy (2017)) or from measuring the worst-case loss, i.e., sharpness Keskar et al. (2016); Dinh et al. (2017).

The effectiveness of sharpness-based measures has also inspired new training paradigms that penalize loss-landscape sharpness during training Foret et al. (2021); Du et al. (2022); Izmailov et al. (2018). In particular, Foret et al. (2021) introduced Sharpness-Aware Minimization (SAM), a scalable and differentiable algorithm that helps models converge to minima with reduced sharpness. It is also worth mentioning that some recent works Liu et al. (2021); Wang et al. (2022) assume that the discretization and gradient estimation processes, which are common in quantization techniques, might cause loss fluctuations that could result in a sharper loss landscape. They therefore couple quantization with SAM and report improved results; however, our findings in Section 4 suggest the opposite: the quantized models in our experiments exhibit improved loss-landscape flatness compared to their full-precision counterparts.


3 Mathematical Model for Quantization

Throughout this paper, we denote vectors as $\bm{x}$, scalars as $x$, and sets as $\mathscr{X}$. Furthermore, $\perp\!\!\!\perp$ denotes independence. Given a distribution $\mathcal{D}$ over the data space, our training dataset $\mathscr{S}$ is a set of i.i.d. samples drawn from $\mathcal{D}$. The typical ML task tries to learn a model $f(\cdot)$, parametrized by weights $\bm{w}$, that minimizes the training set loss $\mathcal{L}_{\mathscr{S}}(\bm{w}) = \frac{1}{|\mathscr{S}|}\sum_{i=1}^{|\mathscr{S}|} l(f(\bm{w}, \bm{x}_i), \bm{y}_i)$, given a loss function $l(\cdot)$ and $(\bm{x}_i, \bm{y}_i)$ pairs in the training data.

To quantize our deep neural networks, we utilize Quantization Aware Training (QAT) methods similar to Learned Step-size Quantization (LSQ) Esser et al. (2020) for CNNs and Variation-aware Vision Transformer Quantization (VVTQ) Xijie Huang & Cheng (2023) for ViT models. Specifically, we apply a per-layer quantization approach in which, for each target quantization layer, we learn a step size $s$ to quantize the layer weights. Therefore, given the weights $\bm{w}$, the scaling factor $s \in \mathbb{R}$, and $b$ bits to quantize, the quantized weight tensor $\hat{\bm{w}}$ and the quantization noise $\Delta$ can be calculated as below:

$$\bar{\bm{w}} = \left\lfloor \mathrm{clip}\!\left(\frac{\bm{w}}{s},\, -2^{b-1},\, 2^{b-1}-1\right) \right\rceil \quad (1)$$

$$\hat{\bm{w}} = \bar{\bm{w}} \times s \quad (2)$$

$$\Delta = \bm{w} - \hat{\bm{w}}, \quad (3)$$

where $\lfloor \bm{z} \rceil$ rounds the input vector $\bm{z}$ to the nearest integer vector, the $\mathrm{clip}(r, z_1, z_2)$ function returns $r$ with values below $z_1$ set to $z_1$ and values above $z_2$ set to $z_2$, and $\hat{\bm{w}}$ is the quantized representation of the weights at the same scale as $\bm{w}$.
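To make Equations 1 to 3 concrete, below is a minimal NumPy sketch of this per-layer uniform quantizer. The function name, tensor shapes, and example step size are our own illustrative choices, not taken from the LSQ or VVTQ implementations; in QAT the step size $s$ would additionally be learned with a straight-through gradient estimator.

```python
import numpy as np

def quantize_weights(w: np.ndarray, s: float, b: int):
    """Uniform per-layer weight quantization following Equations 1-3.

    w : full-precision weight tensor
    s : step size (width of the quantization bin)
    b : number of bits
    Returns (w_hat, delta): dequantized weights and the quantization noise.
    """
    q_min, q_max = -2 ** (b - 1), 2 ** (b - 1) - 1
    w_bar = np.round(np.clip(w / s, q_min, q_max))   # Eq. 1: clip, then round to nearest integer
    w_hat = w_bar * s                                # Eq. 2: rescale back to the range of w
    delta = w - w_hat                                # Eq. 3: quantization noise
    return w_hat, delta

# Example: 4-bit quantization of a random layer; |delta| is bounded by s/2
# for weights that fall inside the clipping range.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(64, 32))
w_hat, delta = quantize_weights(w, s=0.01, b=4)
print(np.abs(delta).max())
```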

3.1 Theoretical Analysis

For simplicity, let us consider a regression problem where the mean square error loss is defined as,

$$\mathcal{L} = \mathbb{E}_{p(\bm{x},\bm{y})}\left[\|\hat{\bm{y}} - \bm{y}\|_2^2\right], \quad (4)$$

where $\bm{y}$ is the target and $\hat{\bm{y}} = f(\bm{x}, \bm{w})$ is the output of the network $f$ parameterized by $\bm{w}$.

For uniform quantization, the quantization noise $\bm{\Delta}$ can be approximated by the uniform distribution $\bm{\Delta} \sim \mathcal{U}[-\frac{\delta}{2}, \frac{\delta}{2}]$, where $\delta$ is the width of the quantization bin and $\mathcal{U}$ is the uniform distribution Défossez et al. (2021); Widrow et al. (1996); Agustsson & Theis (2020).

Consequently, a quantized neural network effectively has the following loss,

$$\tilde{\mathcal{L}} = \mathbb{E}_{p(\bm{x},\bm{y},\bm{\Delta})}\left[\|\hat{\bm{y}}^{q} - \bm{y}\|_2^2\right], \quad (5)$$

where $\hat{\bm{y}}^{q} = f(\bm{x}, \bm{w} + \bm{\Delta})$.

We can apply a first-order Taylor approximation,

$$f(\bm{x}, \bm{w} + \bm{\Delta}) \approx f(\bm{x}, \bm{w}) + \bm{\Delta}^{\top} \nabla_{\bm{w}} f(\bm{x}, \bm{w}) \quad (6)$$

Thus, $\hat{\bm{y}}^{q} \approx \hat{\bm{y}} + \bm{\Delta}^{\top} \nabla_{\bm{w}} \hat{\bm{y}}$. Rewriting the expectation in $\tilde{\mathcal{L}}$,

\begin{align*}
\tilde{\mathcal{L}} &= \mathbb{E}_{p(\bm{x},\bm{y},\bm{\Delta})}\big[(\hat{\bm{y}}^{q} - \bm{y})^2\big] \\
&= \mathbb{E}_{p(\bm{x},\bm{y},\bm{\Delta})}\big[((\hat{\bm{y}} + \bm{\Delta}^{\top}\nabla_{\bm{w}}\hat{\bm{y}}) - \bm{y})^2\big] \\
&= \mathbb{E}_{p(\bm{x},\bm{y},\bm{\Delta})}\big[(\hat{\bm{y}} - \bm{y})^2 + \|\bm{\Delta}^{\top}\nabla_{\bm{w}}\hat{\bm{y}}\|_2^2 + 2(\hat{\bm{y}} - \bm{y})(\bm{\Delta}^{\top}\nabla_{\bm{w}}\hat{\bm{y}})\big] \\
&= \mathcal{L} + \mathbb{E}_{p(\bm{x},\bm{y},\bm{\Delta})}\big[\|\bm{\Delta}^{\top}\nabla_{\bm{w}}\hat{\bm{y}}\|_2^2\big] + \mathbb{E}_{p(\bm{x},\bm{y},\bm{\Delta})}\big[2(\hat{\bm{y}} - \bm{y})(\bm{\Delta}^{\top}\nabla_{\bm{w}}\hat{\bm{y}})\big]
\end{align*}

Since $\bm{\Delta} \perp\!\!\!\perp \nabla_{\bm{w}}\hat{\bm{y}}$ and $\mathbb{E}_{p(\bm{\Delta})}[\bm{\Delta}] = 0$ (note that we only require the quantization noise distribution $\bm{\Delta}$ to satisfy $\mathbb{E}_{p(\bm{\Delta})}[\bm{\Delta}] = 0$; we do not explicitly use the assumption that $\bm{\Delta}$ comes from a uniform distribution, so the proof holds for any zero-mean noise distribution), the last term on the right-hand side is zero. Thus we have,

$$\tilde{\mathcal{L}} = \mathcal{L} + \mathbb{E}_{p(\bm{x},\bm{y},\bm{\Delta})}\big[\|\bm{\Delta}^{\top}\nabla_{\bm{w}}\hat{\bm{y}}\|_2^2\big] = \mathcal{L} + \mathcal{R}(\bm{\Delta}), \quad (7)$$

where $\mathcal{R}(\bm{\Delta})$ can be viewed as a regularization function. This means that minimizing $\tilde{\mathcal{L}}$ is equivalent to minimizing the loss of a non-quantized neural network with gradient-norm regularization. Given a quantization method like LSQ or VVTQ, we know that the quantization error $\bm{\Delta}$ is a function of the quantization level (through the bin width $\delta$). As a result, $\mathcal{R}$ is also a function of the quantization level. Thus, the quantization level should be viewed as a hyper-parameter that controls the degree of regularization. Similar to other regularization methods in deep learning, this hyper-parameter should be carefully tuned for the best generalization performance.
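As a quick sanity check of Equation 7 (our own illustration, not an experiment from the paper), the sketch below compares a Monte Carlo estimate of the noisy loss $\tilde{\mathcal{L}}$ with $\mathcal{L} + \mathcal{R}(\bm{\Delta})$ for a linear model, where the first-order Taylor expansion is exact and $\mathcal{R}(\bm{\Delta})$ has the closed form $(\delta^2/12)\,\mathbb{E}[\|\bm{x}\|_2^2]$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression setup: f(x, w) = w^T x with squared-error loss.
n, d = 512, 16
X = rng.normal(size=(n, d))
w = rng.normal(size=d)
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def mse(w_vec):
    return np.mean((X @ w_vec - y) ** 2)

delta_width = 0.05      # quantization bin width (delta in the text)
num_samples = 2000      # Monte Carlo draws of the uniform weight noise

# Left-hand side of Eq. 7: expected loss under Delta ~ U[-delta/2, delta/2].
noisy = np.mean([
    mse(w + rng.uniform(-delta_width / 2, delta_width / 2, size=d))
    for _ in range(num_samples)
])

# Right-hand side: clean loss plus the gradient-norm regularizer R(Delta).
# For f = w^T x, grad_w f = x, so R = (delta^2 / 12) * E[||x||^2].
reg = (delta_width ** 2 / 12) * np.mean(np.sum(X ** 2, axis=1))
print(f"noisy loss ~ {noisy:.4f}, clean loss + R ~ {mse(w) + reg:.4f}")
```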

[Figure 1: Width of the quantization bin (learned step size $s$) per layer for three different architectures trained on ImageNet-1K at different quantization bit widths.]

To study the relation between $\mathcal{R}(\bm{\Delta})$ and the quantization level, we ran a set of experiments. Figure 1 illustrates the width of the quantization bin per layer for three different architectures trained on ImageNet Deng et al. (2009). As can be seen, the lower the number of quantization bits, the larger the scale of the step size $s$, and, as Equations 1 to 3 indicate, $s$ is equivalent to the width of the quantization bin. Hence, lower-bit quantization causes the quantization bins to be wider as the number of possible representations becomes limited, which results in stronger regularization and higher training losses. In our experiments, this trend was consistent across various vision tasks and model architectures, allowing us to affirm that lower-bit-resolution quantization (with greater $\delta$) results in increased training losses, as shown in Equation 7. This indicates that the level of quantization dictates the degree of regularization introduced to the network. Furthermore, our empirical investigation, encompassing nearly 2000 models trained on the CIFAR-10, CIFAR-100, and ImageNet-1K datasets, confirms this observation. The findings are detailed in Table 1.

Table 1: Train/test accuracy, train/test loss, and the generalization gap (test loss − train loss) for different quantization levels.

| Dataset | Model | Precision | Train Acc | Test Acc | Train Loss | Test Loss | Generalization |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CIFAR-10 | NiN | FP32 | 97.61 | 88.05 | 0.103 | 0.405 | 0.302 |
| CIFAR-10 | NiN | Int8 | 97.5 | 88.01 | 0.106 | 0.407 | 0.301 |
| CIFAR-10 | NiN | Int4 | 96.9 | 87.7 | 0.125 | 0.413 | 0.288 |
| CIFAR-10 | NiN | Int2 | 93.4 | 86.11 | 0.222 | 0.446 | 0.224 |
| CIFAR-100 | NiN | FP32 | 95.28 | 63.48 | 0.207 | 1.687 | 1.48 |
| CIFAR-100 | NiN | Int8 | 95.17 | 63.44 | 0.211 | 1.685 | 1.469 |
| CIFAR-100 | NiN | Int4 | 93.5 | 63.19 | 0.271 | 1.648 | 1.38 |
| CIFAR-100 | NiN | Int2 | 81.21 | 62.11 | 0.676 | 1.537 | 0.859 |
| ImageNet-1K | DeiT-T | FP32 | 73.75 | 71.38 | 1.38 | 2.48 | 1.1 |
| ImageNet-1K | DeiT-T | Int8 | 76.3 | 75.54 | 0.99 | 1.98 | 0.98 |
| ImageNet-1K | DeiT-T | Int4 | 74.71 | 72.31 | 1.08 | 2.07 | 0.99 |
| ImageNet-1K | DeiT-T | Int2 | 59.73 | 55.35 | 1.83 | 2.81 | 0.98 |
| ImageNet-1K | Swin-T | FP32 | 83.39 | 80.96 | 0.516 | 1.48 | 0.964 |
| ImageNet-1K | Swin-T | Int8 | 85.21 | 82.48 | 0.756 | 1.56 | 0.80 |
| ImageNet-1K | Swin-T | Int4 | 84.82 | 82.42 | 0.764 | 1.59 | 0.82 |
| ImageNet-1K | Swin-T | Int2 | 78.76 | 77.66 | 0.941 | 1.84 | 0.89 |
| ImageNet-1K | ResNet-18 | FP32 | 69.96 | 71.49 | 1.18 | 2.23 | 1.05 |
| ImageNet-1K | ResNet-18 | Int8 | 73.23 | 73.32 | 1.28 | 2.10 | 0.82 |
| ImageNet-1K | ResNet-18 | Int4 | 71.34 | 71.74 | 1.26 | 2.18 | 0.92 |
| ImageNet-1K | ResNet-18 | Int2 | 67.1 | 68.58 | 1.38 | 2.16 | 0.78 |

4 Analyzing Loss Landscapes in Quantized Models and Implications for Generalization

A low generalization gap is a desirable characteristic of deep neural networks. It is common in practice to estimate the population loss of the data distribution $\mathcal{D}$, i.e., $\mathcal{L}_{\mathcal{D}}(\bm{w}) = \mathbb{E}_{(\bm{x},\bm{y})\sim\mathcal{D}}[l(f(\bm{w},\bm{x}),\bm{y})]$, by utilizing $\mathcal{L}_{\mathscr{S}}(\bm{w})$ as a proxy and then minimizing it with gradient descent-based optimizers. However, given that modern neural networks are highly over-parameterized and $\mathcal{L}_{\mathscr{S}}(\bm{w})$ is commonly non-convex in $\bm{w}$, the optimization process can converge to local or even global minima that adversely affect the generalization of the model (i.e., with a significant gap between $\mathcal{L}_{\mathscr{S}}(\bm{w})$ and $\mathcal{L}_{\mathcal{D}}(\bm{w})$) Foret et al. (2021).

Motivated by the connection between the sharpness of the loss landscape and generalization Keskar et al. (2016), the authors of Foret et al. (2021) proposed the Sharpness-Aware Minimization (SAM) technique, in which they learn weights $\bm{w}$ that result in a flat minimum with a neighborhood of low training loss values characterized by $\rho$. In particular, inspired by PAC-Bayesian generalization bounds, they prove that for any $\rho > 0$, with high probability over the training dataset $\mathscr{S}$, the following inequality holds:

$$\mathcal{L}_{\mathcal{D}}(\bm{w}) \leq \max_{\|\bm{\epsilon}\|_2 \leq \rho} \mathcal{L}_{\mathscr{S}}(\bm{w} + \bm{\epsilon}) + h\!\left(\|\bm{w}\|_2^2 / \rho^2\right), \quad (8)$$

where $h: \mathbb{R}_{+} \rightarrow \mathbb{R}_{+}$ is a strictly increasing function. Even though the above theorem is stated for the case where the $L_2$-norm of $\bm{\epsilon}$ is bounded by $\rho$ and adversarial perturbations are utilized to achieve the worst-case loss, the authors empirically show that, in practice, other norms in $[1, \infty]$ and random perturbations of $\bm{\epsilon}$ can also achieve some level of flatness; however, they may not be as effective as the $L_2$-norm coupled with adversarial perturbations.

Extending the empirical studies of Foret et al. (2021), we relax the $L_2$-norm condition of Equation 8 and consider the $L_{\infty}$-norm instead, resulting in:

$$\mathcal{L}_{\mathcal{D}}(\bm{w}) \leq \max_{\|\bm{\epsilon}\|_{\infty} \leq \rho} \mathcal{L}_{\mathscr{S}}(\bm{w} + \bm{\epsilon}) + h\!\left(\|\bm{w}\|_2^2 / \rho^2\right) \quad (9)$$

Furthermore, given small values of $\rho > 0$, for any noise vector $\bm{\delta}$ such that $\|\bm{\delta}\|_{\infty} \leq \rho$, the following inequality holds in practice for a local minimum characterized by $\bm{w}$ (as similarly depicted in Equation 7, where $\bm{\delta}$ corresponds to the quantization noise $\bm{\Delta}$); however, this inequality may not necessarily hold for every $\bm{w}$:

$$\mathcal{L}_{\mathscr{S}}(\bm{w}) \leq \mathcal{L}_{\mathscr{S}}(\bm{w} + \bm{\delta}) \leq \max_{\|\bm{\epsilon}\|_{\infty} \leq \rho} \mathcal{L}_{\mathscr{S}}(\bm{w} + \bm{\epsilon}), \quad (10)$$

For small values of $\rho$ close to 0 and a given $\bm{w}$, we can approximate

$$\max_{\|\bm{\epsilon}\|_{\infty} \leq \rho} \mathcal{L}_{\mathscr{S}}(\bm{w} + \bm{\epsilon}) \quad (11)$$

in Equation 9 with $\mathcal{L}_{\mathscr{S}}(\bm{w} + \bm{\delta})$. As a result, for small positive values of $\rho$, we have:

$$\mathcal{L}_{\mathcal{D}}(\bm{w}) \leq \mathcal{L}_{\mathscr{S}}(\bm{w} + \bm{\delta}) + h\!\left(\|\bm{w}\|_2^2 / \rho^2\right), \quad (12)$$

and finally, moving $\mathcal{L}_{\mathscr{S}}(\bm{w} + \bm{\delta})$ to the left-hand side of Equation 12 gives us:

$$\mathcal{L}_{\mathcal{D}}(\bm{w}) - \mathcal{L}_{\mathscr{S}}(\bm{w} + \bm{\delta}) \leq h\!\left(\|\bm{w}\|_2^2 / \rho^2\right). \quad (13)$$

The above inequality formulates an approximate bound, for values of $\rho > 0$ close to 0, on the generalization gap of a model parametrized by $\bm{w}$; given the nature of the function $h(\cdot)$, the higher the value of $\rho$, the tighter the generalization bound becomes.

As shown in Section 3, for quantization techniques with a constant quantization bin width, we have $\|\bm{\Delta}\|_{\infty} \leq \frac{\delta}{2}$, where $\bm{\Delta}$ is the quantization noise and $\delta$ is the width of the quantization bin. Replacing the quantization-equivalent terms in Equation 13 yields:

$$\mathcal{L}_{\mathcal{D}}(\bm{w}) - \mathcal{L}_{\mathscr{S}}(\bm{w} + \bm{\Delta}) \leq h\!\left(4\|\bm{w}\|_2^2 / \delta^2\right). \quad (14)$$
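The worst-case term $\max_{\|\bm{\epsilon}\|_{\infty} \leq \rho} \mathcal{L}_{\mathscr{S}}(\bm{w} + \bm{\epsilon})$ that appears in Equations 9 to 11 can be probed empirically. The sketch below is a simplified illustration under our own assumptions (toy model, random batch, and a projected sign-gradient ascent heuristic), not the paper's evaluation code.

```python
import copy
import torch
import torch.nn.functional as F

def worst_case_loss(model, loss_fn, x, y, rho=0.05, steps=5):
    """Approximate max over ||eps||_inf <= rho of L(w + eps) via projected
    sign-gradient ascent on a copy of the model."""
    perturbed = copy.deepcopy(model)
    originals = [p.detach().clone() for p in perturbed.parameters()]
    step = rho / steps
    for _ in range(steps):
        loss = loss_fn(perturbed(x), y)
        grads = torch.autograd.grad(loss, list(perturbed.parameters()))
        with torch.no_grad():
            for p, p0, g in zip(perturbed.parameters(), originals, grads):
                p.add_(step * g.sign())                  # ascend the loss
                p.copy_(p0 + (p - p0).clamp(-rho, rho))  # project into the L_inf ball
    with torch.no_grad():
        return loss_fn(perturbed(x), y).item()

# Usage on a toy model and random batch (shapes are illustrative).
model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))
print(F.cross_entropy(model(x), y).item())            # batch loss L_S(w)
print(worst_case_loss(model, F.cross_entropy, x, y))  # approximates the max term in Eqs. 9-11
```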

We now state the following hypothesis for quantization techniques based on Equation 14:

Hypothesis 1 (H1)

Let $\bm{w}$ be the set of weights of the model, $\bm{w}^{q}$ be the set of quantized weights, $\delta$ be the width of the quantization bin, and $g(\cdot)$ be a function that measures the sharpness of a minimum. We have:

  1. Having a bounded $\bm{\Delta}$ with $\|\bm{\Delta}\|_{\infty} \leq \frac{\delta}{2}$, there exists a $\bm{\Delta}$ such that, for a quantized model parameterized by $\bm{w}^{1}$ obtained through QAT and a full-precision model parameterized by $\bm{w}^{2}$, we have $g(\bm{w}^{1}) \leq g(\bm{w}^{2})$.

(1) implies that quantization helps the model converge to flatter minima with lower sharpness. As discussed in Section 3 and illustrated in Figure 1, lower-bit quantization corresponds to a higher $\delta$; therefore, lower bit-resolution quantization results in better flatness around the minima. However, as described by Equation 7, $\delta$ acts as a hyperparameter for the induced regularization. Hence, not all quantization levels will result in flatter minima and improved generalization.

In the rest of this section, we report the results of our exhaustive set of empirical studies on the generalization qualities of quantized models. In Section 4.1, for different datasets and backbones, we study the flatness of the loss landscape of deep neural networks under different quantization regimes; in Section 4.2, we measure and report the generalization gap of quantized models for a set of almost 2000 vision models; and finally, in Section 4.3, using corrupted datasets, we study the real-world implications of generalization quality and how different levels of quantization perform under such scenarios.

4.1 Flatness of Minima and Generalization

In this section, we conduct experiments demonstrating that quantized neural networks enjoy better flatness in their loss landscape than their full-precision counterparts; this finding is contrary to the assumption of some recent studies Liu et al. (2021); Wang et al. (2022), in which it is assumed that quantization results in sharper minima. We believe the root of this assumption might be that the authors of those works did not consider the magnitude of the network weights when measuring sharpness. However, as Jiang et al. (2019) and Dziugaite et al. (2020) show, flatness measures that take the magnitude of the parameters into account Keskar et al. (2016) are better indicators of generalization.

Table 2: $L_2$-norm of the network weights for different quantization levels.

| Dataset | Model | Int2 | Int4 | Int8 | FP32 |
| --- | --- | --- | --- | --- | --- |
| CIFAR-10 | NiN (4x10) | 47.263 | 54.291 | 53.804 | 130.686 |
| CIFAR-10 | NiN (4x12) | 43.039 | 46.523 | 46.750 | 73.042 |
| CIFAR-10 | ResNet-18 | 44.264 | 48.227 | 47.368 | 59.474 |
| CIFAR-10 | ResNet-50 | 45.011 | 238.117 | 48.149 | 97.856 |
| CIFAR-100 | NiN (5x10) | 60.981 | 60.707 | 60.905 | 190.414 |
| CIFAR-100 | NiN (5x12) | 82.230 | 87.931 | 87.307 | 163.768 |
| CIFAR-100 | ResNet-18 | 48.120 | 55.027 | 54.735 | 125.164 |
| CIFAR-100 | ResNet-50 | 75.739 | 82.788 | 79.603 | 148.298 |
| ImageNet-1K | ResNet-18 | 78.291 | 84.472 | 85.162 | 415.004 |
| ImageNet-1K | ResNet-50 | 214.055 | 213.035 | 212.624 | 379.465 |

As Table 2 shows, for a given backbone, the magnitude of the network parameters (calculated as the $L_2$-norm of the weights) differs considerably across quantization levels; therefore, simply measuring loss-landscape flatness using sharpness or PAC-Bayesian bounds without considering the magnitude of the weights could be misleading.

Table 3: PAC-Bayesian and sharpness measures (Init, Orig, and their magnitude-aware variants Mag-Init and Mag-Orig) for different quantization levels.

| Dataset | Model | Precision | PAC-Bayes Init | PAC-Bayes Orig | PAC-Bayes Mag-Init | PAC-Bayes Mag-Orig | Sharpness Init | Sharpness Orig | Sharpness Mag-Init | Sharpness Mag-Orig |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CIFAR-10 | NiN (4x10) | FP32 | 2.264 | 2.2 | 7.635 | 7.594 | 0.589 | 0.572 | 8.219 | 8.181 |
| CIFAR-10 | NiN (4x10) | Int8 | 4.204 | 3.626 | 6.435 | 6.176 | 0.292 | 0.252 | 6.853 | 6.610 |
| CIFAR-10 | NiN (4x10) | Int4 | 2.482 | 2.143 | 6.419 | 6.162 | 1.444 | 1.247 | 6.826 | 6.586 |
| CIFAR-10 | NiN (4x10) | Int2 | 1.588 | 1.32 | 6.171 | 5.833 | 1.152 | 0.958 | 6.454 | 6.131 |
| CIFAR-10 | NiN (4x12) | FP32 | 1.469 | 1.328 | 7.974 | 7.770 | 0.359 | 0.324 | 9.216 | 9.040 |
| CIFAR-10 | NiN (4x12) | Int8 | 6.057 | 4.866 | 7.718 | 7.256 | 0.259 | 0.208 | 8.655 | 8.245 |
| CIFAR-10 | NiN (4x12) | Int4 | 2.658 | 2.131 | 7.765 | 7.302 | 1.335 | 1.07 | 8.56 | 8.142 |
| CIFAR-10 | NiN (4x12) | Int2 | 1.918 | 1.493 | 7.654 | 7.119 | 0.781 | 0.608 | 8.513 | 8.034 |
| CIFAR-10 | ResNet-18 | FP32 | 1.186 | 1.135 | 3.659 | 3.617 | 1.447 | 1.383 | 4.399 | 4.364 |
| CIFAR-10 | ResNet-18 | Int8 | 0.893 | 0.834 | 3.355 | 3.285 | 0.426 | 0.398 | 4.112 | 4.055 |
| CIFAR-10 | ResNet-18 | Int4 | 1.786 | 1.673 | 3.291 | 3.223 | 0.433 | 0.405 | 4.037 | 3.981 |
| CIFAR-10 | ResNet-18 | Int2 | 1.368 | 1.267 | 3.238 | 3.156 | 0.819 | 0.759 | 4.074 | 4.012 |
| CIFAR-10 | ResNet-50 | FP32 | 1.803 | 1.647 | 5.304 | 5.193 | 1.237 | 1.13 | 6.303 | 6.210 |
| CIFAR-10 | ResNet-50 | Int8 | 3.911 | 2.901 | 4.729 | 4.300 | 0.988 | 0.733 | 5.472 | 5.106 |
| CIFAR-10 | ResNet-50 | Int4 | 8.937 | 8.793 | 6.394 | 6.377 | 3.71 | 3.65 | 6.684 | 6.669 |
| CIFAR-10 | ResNet-50 | Int2 | 1.991 | 1.431 | 4.638 | 4.151 | 1.926 | 1.385 | 5.499 | 5.094 |
| CIFAR-100 | NiN (5x10) | FP32 | 4.266 | 4.192 | 9.33 | 9.302 | 0.859 | 0.844 | 10.467 | 10.443 |
| CIFAR-100 | NiN (5x10) | Int8 | 7.354 | 6.339 | 7.451 | 7.155 | 0.474 | 0.409 | 8.084 | 7.812 |
| CIFAR-100 | NiN (5x10) | Int4 | 3.101 | 2.673 | 7.399 | 7.101 | 0.473 | 0.408 | 8.032 | 7.759 |
| CIFAR-100 | NiN (5x10) | Int2 | 2.25 | 1.939 | 6.138 | 5.776 | 0.313 | 0.27 | 7.774 | 7.491 |
| CIFAR-100 | NiN (5x12) | FP32 | 3.505 | 3.409 | 10.958 | 10.904 | 0.777 | 0.755 | 12.041 | 11.992 |
| CIFAR-100 | NiN (5x12) | Int8 | 1.712 | 1.561 | 9.175 | 8.963 | 0.582 | 0.531 | 9.956 | 9.761 |
| CIFAR-100 | NiN (5x12) | Int4 | 3.934 | 3.595 | 9.274 | 9.069 | 0.581 | 0.531 | 9.794 | 9.599 |
| CIFAR-100 | NiN (5x12) | Int2 | 4.343 | 3.922 | 9.479 | 9.252 | 0.557 | 0.503 | 9.828 | 9.609 |
| CIFAR-100 | ResNet-18 | FP32 | 3.535 | 3.495 | 4.243 | 4.234 | 3.429 | 3.39 | 4.795 | 4.786 |
| CIFAR-100 | ResNet-18 | Int8 | 6.031 | 5.696 | 3.685 | 3.631 | 1.194 | 1.128 | 4.232 | 4.185 |
| CIFAR-100 | ResNet-18 | Int4 | 2.381 | 2.25 | 3.591 | 3.536 | 1.166 | 1.102 | 4.117 | 4.069 |
| CIFAR-100 | ResNet-18 | Int2 | 3.704 | 3.443 | 3.538 | 3.465 | 27.983 | 27.247 | 4.65 | 4.611 |
| CIFAR-100 | ResNet-50 | FP32 | 4.396 | 4.265 | 5.918 | 5.883 | 4.732 | 4.591 | 6.797 | 6.768 |
| CIFAR-100 | ResNet-50 | Int8 | 5.583 | 4.279 | 4.775 | 4.385 | 2.445 | 1.874 | 5.613 | 5.285 |
| CIFAR-100 | ResNet-50 | Int4 | 3.076 | 2.397 | 5.273 | 4.945 | 2.386 | 1.859 | 6.809 | 6.558 |
| CIFAR-100 | ResNet-50 | Int2 | 29.727 | 29.531 | 5.253 | 5.247 | 37.893 | 38.124 | 8.343 | 8.339 |
| ImageNet-1K | ResNet-18 | FP32 | 11.694 | 11.584 | 12.378 | 12.355 | 349.235 | 345.962 | 20.069 | 20.055 |
| ImageNet-1K | ResNet-18 | Int8 | 7.836 | 5.303 | 10.1 | 8.902 | 104.91 | 70.994 | 18.416 | 17.786 |
| ImageNet-1K | ResNet-18 | Int4 | 4.615 | 3.108 | 10.072 | 8.853 | 104.557 | 70.419 | 18.41 | 17.772 |
| ImageNet-1K | ResNet-18 | Int2 | 16.397 | 10.563 | 11.004 | 9.770 | 101.31 | 65.266 | 18.362 | 17.649 |
| ImageNet-1K | ResNet-50 | FP32 | 7.942 | 7.144 | 22.163 | 21.826 | 5.067 | 4.556 | 27.746 | 27.418 |
| ImageNet-1K | ResNet-50 | Int8 | 20.398 | 14.344 | 17.597 | 16.272 | 11.208 | 7.881 | 20.104 | 18.995 |
| ImageNet-1K | ResNet-50 | Int4 | 35.011 | 24.637 | 17.809 | 16.503 | 258.118 | 181.636 | 19.162 | 18.833 |
| ImageNet-1K | ResNet-50 | Int2 | 245.654 | 173.287 | 17.954 | 17.023 | 258.722 | 182.505 | 24.051 | 24.007 |
| ImageNet-1K | DeiT-T | FP32 | 8.653 | 8.123 | 19.651 | 18.226 | 7.017 | 6.753 | 31.924 | 31.5 |
| ImageNet-1K | DeiT-T | Int8 | 26.544 | 27.352 | 18.232 | 17.563 | 7.445 | 5.126 | 22.524 | 23.432 |
| ImageNet-1K | DeiT-T | Int4 | 35.786 | 33.982 | 17.983 | 16.148 | 5.927 | 4.672 | 20.122 | 23.765 |
| ImageNet-1K | DeiT-T | Int2 | 236.322 | 171.234 | 19.865 | 18.982 | 218.621 | 169.972 | 32.114 | 33.763 |

To capture the flatness around the local minima of a given network $f(\bm{w})$, we utilize the PAC-Bayesian bounds McAllester (1999) and sharpness measures Keskar et al. (2016). The former adds Gaussian perturbations to the network parameters and captures the average (expected) flatness within a bound, while the latter captures the worst-case flatness, i.e., sharpness, by adding adversarial worst-case perturbations to the parameters. We use the same formulation and implementation as specified by Jiang et al. (2019) and Dziugaite et al. (2020); in particular, similar to their approach, we measure and report these metrics by considering the trained model and the network-at-initialization parameters as the origin and initialization tensors, respectively. Moreover, as discussed above and indicated in Table 2, the magnitude-aware versions of these metrics are the most reliable way of capturing the flatness of a network; hence, we also report the magnitude-aware measurements, and they serve as our main measure of loss-landscape flatness. Details about these metrics and their formulation are in the supplementary Section A.
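As a rough illustration of the difference between a plain and a magnitude-aware flatness probe (the exact formulations follow Jiang et al. (2019) and are given in supplementary Section A; the sketch below is a simplified stand-in with illustrative scales), one can compare the expected loss increase under isotropic Gaussian weight perturbations against perturbations whose per-parameter scale grows with the parameter magnitude:

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def expected_loss_increase(model, loss_fn, x, y, sigma=0.01, magnitude_aware=False, trials=20):
    """Average loss increase under Gaussian weight perturbations.

    With magnitude_aware=True, each parameter is perturbed with a standard
    deviation proportional to its own magnitude (plus 1), so layers with large
    weights are not unfairly reported as 'flat'."""
    base = loss_fn(model(x), y).item()
    increases = []
    for _ in range(trials):
        noisy = copy.deepcopy(model)
        for p in noisy.parameters():
            scale = sigma * (p.abs() + 1.0) if magnitude_aware else sigma
            p.add_(torch.randn_like(p) * scale)
        increases.append(loss_fn(noisy(x), y).item() - base)
    return sum(increases) / trials

# Toy usage with a random model and batch (shapes are illustrative).
model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))
print(expected_loss_increase(model, F.cross_entropy, x, y))
print(expected_loss_increase(model, F.cross_entropy, x, y, magnitude_aware=True))
```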

As shown in Table 3, for the three datasets (CIFAR-10, CIFAR-100, and ImageNet-1K) and over a variety of network backbones, the quantized models enjoy flatter loss landscapes, which is an indicator of better generalization to unseen data. An important observation from the experiments reported in Table 3 is that relying solely on sharpness or PAC-Bayesian measures, without considering the magnitude of the network parameters, might create the impression that quantization increases network sharpness. We suspect that this may indeed be the origin of this assumption in the works of Liu et al. (2021); Wang et al. (2022), which assume worse sharpness for quantized models and then propose Sharpness-Aware Minimization (SAM) coupled with Quantization-Aware Training (QAT). However, our empirical studies demonstrate that when the magnitude of the parameters is taken into account, quantization actually improves flatness, and the finding that SAM can help quantized models achieve further flatness does not necessarily mean that quantized models have sharper minima than their non-quantized counterparts.

4.1.1 Loss Landscape Visualization

The loss landscape of quantized neural networks can be effectively visualized. Using the technique outlined in Li et al. (2018), we projected the loss landscape of quantized and full-precision ResNet-18 models trained on the CIFAR-10 dataset onto a three-dimensional plane. The visual representation, illustrated in Figure 2, clearly demonstrates that the loss landscape associated with quantized models is comparatively flatter. This observation confirms the findings presented in Table 3.
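For reference, a simplified sketch of that projection technique, assuming two filter-normalized random directions evaluated on a 2-D grid as described by Li et al. (2018) (the handling of biases and the grid parameters are our own simplifications), looks as follows:

```python
import copy
import numpy as np
import torch

def filter_normalized_direction(params):
    """Random direction rescaled so each filter (output row) matches the norm
    of the corresponding filter in the trained weights (Li et al., 2018)."""
    d = [torch.randn_like(p) for p in params]
    for di, p in zip(d, params):
        if p.dim() > 1:   # conv / linear weight tensors
            pn = p.flatten(1).norm(dim=1).view(-1, *([1] * (p.dim() - 1)))
            dn = di.flatten(1).norm(dim=1).view(-1, *([1] * (p.dim() - 1)))
            di.mul_(pn / (dn + 1e-10))
        else:             # biases, normalization parameters: match the overall norm
            di.mul_(p.norm() / (di.norm() + 1e-10))
    return d

@torch.no_grad()
def loss_surface(model, loss_fn, x, y, span=1.0, steps=21):
    """Loss values on a 2-D slice of weight space spanned by two random
    filter-normalized directions around the trained weights."""
    probe = copy.deepcopy(model)
    w0 = [p.detach().clone() for p in model.parameters()]
    d1, d2 = filter_normalized_direction(w0), filter_normalized_direction(w0)
    alphas = np.linspace(-span, span, steps)
    surface = np.zeros((steps, steps))
    for i, a in enumerate(alphas):
        for j, b in enumerate(alphas):
            for p, p0, u, v in zip(probe.parameters(), w0, d1, d2):
                p.copy_(p0 + a * u + b * v)
            surface[i, j] = loss_fn(probe(x), y).item()
    return alphas, surface  # visualize with matplotlib contour / plot_surface
```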

4.2 Measuring the Generalization Gap

To study the generalization behavior of quantized models, we have trained almost 2000 models on the CIFAR-10, CIFAR-100, and ImageNet-1K datasets. Our goal is to measure $\mathcal{L}_{\mathcal{D}}(\bm{w}) - \mathcal{L}_{\mathscr{S}}(\bm{w})$, i.e., the generalization gap, by utilizing data that is unseen during the training process (the test data). Without loss of generality, herein we refer to the difference between the test-data loss and the training-data loss as the generalization gap.

Following the guidelines of Jiang et al. (2019) and Dziugaite et al. (2020), to remove the effect of randomness from our analysis of generalization behavior, for the smaller datasets (CIFAR-10 and CIFAR-100) we construct a pool of trained models by varying 5 commonly used hyperparameters over the fully convolutional "Network-in-Network" architecture Lin et al. (2013). The hyperparameter list includes learning rate, weight decay, optimization algorithm, architecture depth, and layer width. In our experiments, each hyperparameter has 3 choices; therefore, the number of trained models per quantization level is $3^5 = 243$, with the number of bits selected from the values 2, 4, and 8, and the resulting models are compared with their full-precision counterparts. Thus, in total, we train $4 \times 243 = 972$ models per dataset over the CIFAR-10 and CIFAR-100 datasets, which gives us almost 2000 trained models. For more details regarding hyperparameter choices and model specifications, please refer to the supplementary material Section B. Lastly, for ImageNet-1K, we measured the generalization gap on both CNN and ViT models.
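As an illustration of how such a grid is enumerated (the hyperparameter values below are placeholders, not the exact choices listed in supplementary Section B), the combinatorics work out as follows:

```python
from itertools import product

# Three illustrative choices per hyperparameter (the actual values are listed
# in supplementary Section B of the paper).
grid = {
    "lr": [0.1, 0.01, 0.001],
    "weight_decay": [0.0, 1e-4, 5e-4],
    "optimizer": ["sgd", "momentum", "adam"],
    "depth": [2, 3, 4],
    "width": [6, 8, 10],
}
precisions = ["fp32", "int8", "int4", "int2"]

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))                    # 3**5 = 243 runs per precision
print(len(configs) * len(precisions))  # 4 * 243 = 972 runs per dataset
```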

In Jiang et al. (2019), to measure the generalization gap of a model, the authors first train the model until the training loss converges to a threshold (0.01). Here, we argue that this approach might not be optimal when quantization enters the picture. First, lower bit-resolution quantized models have lower learning capacity than higher bit-resolution quantized or full-precision ones; our proof in Equation 7 also indicates that the learning capability of a given network diminishes as the number of quantization bits decreases. Second, early stopping of the training process may prevent the trained models from appropriately converging to the flatter local minima that quantized models enjoy in their loss landscape. Therefore, we apply a different training approach: each model is trained for 300 epochs, lowering the learning rate by a factor of 10 at epochs 100 and 200, and at the end, the checkpoint with the lowest training loss is chosen.
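A minimal PyTorch sketch of this selection protocol (with an illustrative SGD optimizer and data loader; not the exact training code used in the paper) is:

```python
import copy
import torch
import torch.nn.functional as F

def train_select_best(model, train_loader, epochs=300, lr=0.1):
    """Train for a fixed budget, decay the LR by 10x at epochs 100 and 200, and
    keep the checkpoint with the lowest training loss (rather than stopping at
    a fixed loss threshold)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[100, 200], gamma=0.1)
    best_loss, best_state = float("inf"), None
    for _ in range(epochs):
        running, count = 0.0, 0
        for x, y in train_loader:
            opt.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
            running += loss.item() * x.size(0)
            count += x.size(0)
        sched.step()
        epoch_loss = running / count
        if epoch_loss < best_loss:
            best_loss, best_state = epoch_loss, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model, best_loss
```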

Table 1 summarizes the results of these experiments and demonstrates the accuracy-generalization trade-off. The training loss and training accuracy of lower-resolution quantized models are negatively impacted; however, they enjoy better generalization. Some additional interesting results can be inferred from Table 1. Notably, 8-bit quantization is almost on par with the full-precision counterpart on all metrics. This is also evident in Table 3, where we studied the sharpness-based measures. Another interesting observation is that although training losses vary among the models, the test loss is almost the same across all of them; this, in turn, indicates that full-precision and high-resolution quantized models have a higher degree of overfitting, which could result from converging to sharper local minima.

Severity 1Severity 2Severity 3Severity 4Severity 5ModelAugmentationFP32Int8Int4Int2FP32Int8Int4Int2FP32Int8Int4Int2FP32Int8Int4Int2FP32Int8Int4Int2Gaussian Noise1.0670.860.9391.211.9131.461.6292.2013.4392.522.7963.6585.5293.964.3525.1157.6535.626.1176.093Shot Noise1.2380.961.0541.3332.2891.721.8872.4213.8012.752.9873.6856.4384.44.7395.3047.9195.415.8056.003Impulse Noise2.2351.782.0582.3243.1772.352.6363.2794.0612.93.1854.0016.0964.264.5955.3247.7815.575.9956.114Defocus Noise0.9790.890.8580.8221.4321.371.3251.3022.3942.342.2632.2173.2853.193.0992.9343.9833.923.8083.498Glass Blue1.181.071.0310.9851.9691.861.8121.8043.8223.833.7273.544.2214.254.153.9094.6524.624.5424.139Motion Blur0.6870.550.5420.5091.371.221.2451.2612.5212.422.4542.3813.73.643.6783.4124.2854.234.2753.875Zoom Blur1.5181.381.3821.3862.2192.12.1192.0942.6882.582.5882.5183.2133.123.1373.0283.6663.593.5993.437Snow1.4011.010.9981.1243.2152.362.3742.6432.9692.072.0942.3153.972.812.8693.0744.5153.483.5173.572Frost0.9490.660.6260.6332.0931.681.6811.8122.9782.532.5532.6883.1412.742.7662.8933.7133.313.3623.447Fog0.8090.420.4440.4051.2140.680.7340.7741.8571.181.2731.4312.3471.681.7621.9583.773.033.1413.275Brightness0.1210.040.0190.1550.2210.10.080.0620.3780.190.1840.0840.6310.370.360.3230.9860.620.6260.672Contrast0.5230.240.2320.130.8670.40.4130.3961.6270.810.8611.0313.612.372.5292.9215.2644.634.7654.479Elastic0.5380.430.4060.2872.0261.951.9111.8331.1161.030.9690.8841.9971.941.8441.7554.1124.113.9573.57Pixelate0.6120.50.4920.4160.5990.510.5060.4651.8891.721.7341.9583.0462.932.883.3063.3693.323.3133.51ResNet-18JPEG0.590.480.4680.3750.8010.680.6740.6270.9720.850.8410.8241.5991.461.4461.4912.6152.432.4052.487Gaussian Noise1.0410.760.8572.781.9231.3821.5363.7553.4252.52.7625.0095.2514.0654.5186.1826.9975.8156.4237.124Shot Noise1.1320.8431.0272.9262.2141.5911.8463.9753.672.6243.0135.0585.8914.3635.0116.367.0455.4186.1376.96Impulse Noise1.6351.4831.5853.0432.5972.2232.2844.1443.4232.7512.9014.9615.3024.1714.5916.2966.9795.7536.3157.126Defocus Noise0.8630.7991.0053.8561.3261.2661.5194.2862.232.2132.4344.8583.0592.9833.3285.1193.7843.6554.25.303Glass Blue1.2231.1411.5093.5382.1152.0392.4434.2574.014.0034.3094.9434.3754.3534.5645.0394.6684.6014.7095.141Motion Blur0.6430.530.6413.0681.3351.2091.3543.7682.3922.2822.4354.3493.53.4183.5884.7744.1084.0284.2354.983Zoom Blur1.5391.4231.6073.5642.2822.1852.3583.972.7742.6852.8864.3353.3173.2633.4924.5673.7973.7384.0354.817Snow1.2530.9041.1682.3773.0742.4292.6944.0172.8382.162.4973.8693.7752.9453.2864.7974.5623.7763.9965.151Frost0.9410.6580.82.2362.1931.7842.0213.7133.1542.6732.9694.6823.3452.9043.2234.943.9563.4843.8355.46Fog0.6990.3540.8223.8741.0840.6241.244.4541.7151.1451.8024.9292.2561.6752.1114.9693.7923.1163.2985.371Brightness0.0340.050.0191.3420.1430.0080.0891.3590.3150.1170.2111.5490.5950.3030.4221.9871.0020.5910.7522.663Contrast0.4820.1880.8163.7170.8460.3941.5134.5711.6240.8993.0955.5563.662.7255.8166.3965.4114.8816.5056.59Elastic0.4420.3470.4812.3961.9731.8712.1054.0120.9620.871.1132.4741.9131.8072.212.9594.1063.9824.6934.036Pixelate0.9260.6530.8721.8831.4441.020.9341.8382.1551.8222.4682.1723.1113.0643.7732.7553.9793.9933.9883.301MoblieNet V2JPEG0.4910.3820.5541.7540.6750.5520.7841.8480.8260.6930.9671.9281.3571.1651.5552.1822.1821.9022.4622.545Gaussian Noise0.9380.9140.9280.9731.4371.2821.1121.5712.3632.0472.5132.893.7193.2553.7544.885.1344.9995.3397.828Shot Noise0.9610.9460.9571.0261.5851.4081.2461.832.4482.1662.0233.2154.0843.6973.8875.7484.9244.9195.2297.656Impulse 
Noise1.7031.6521.6761.7892.0131.8741.4642.3732.5642.2951.9953.2113.9623.5073.4585.4355.284.9425.1058.123Defocus Noise1.0591.0420.9110.8691.4411.4141.3091.2982.3442.3112.2862.2313.2443.2253.2263.1644.0524.0494.0193.994Glass Blue1.3491.271.0111.4572.2972.1691.8292.0884.6134.5514.1854.3465.0575.0094.8154.7935.3995.3765.345.102Motion Blur0.7310.6380.6230.5781.3141.3071.191.2382.5632.5512.3372.5014.1484.0573.6723.965.0485.0334.4044.75Zoom Blur1.5091.4731.2611.3612.2522.1872.0372.1342.8222.7362.5712.6973.443.3373.1393.3254.0393.9493.6763.916Snow1.2291.131.0481.1432.622.5292.4812.9332.3752.3172.1932.4783.1273.0162.9413.3473.6973.4373.84.209Frost0.8450.8370.6740.6531.7691.7261.5061.7852.5632.5222.2452.5882.7612.722.4352.8163.3223.2912.9723.427Fog0.6910.6850.5820.5010.8970.8760.7840.9261.3051.2481.1671.3371.811.6571.6351.9093.2612.962.9793.569Brightness0.3450.250.210.0810.3820.3140.2590.4310.4530.4460.3410.2220.5840.5140.4730.3670.7870.7160.6740.609Contrast0.5450.4490.420.3080.690.6770.5680.4941.0470.9980.8671.0512.3872.1731.852.6244.6864.3943.6734.914Elastic0.6550.6250.5170.4222.2742.2431.9082.1221.6021.5711.2421.4162.7042.6712.2092.5915.5985.5224.6475.348Pixelate0.8710.710.7270.6841.0421.0211.0150.7781.9711.5751.5561.9743.3733.2293.2083.4314.0384.0143.9964.179ResNet-50JPEG0.7930.7010.670.5220.9570.980.8320.7041.0921.0230.960.8491.5371.4251.3561.3942.2362.1181.9672.228

4.3 Generalization Under Distorted Data

In addition to assessing the generalization metrics outlined in the previous sections, we investigate some real-world implications of the generalization quality of quantized models. To this end, we evaluate the performance of vision models when the input images are distorted or corrupted by common types of perturbations. We take advantage of the comprehensive benchmark provided in Hendrycks & Dietterich (2019), which identifies 15 types of common distortions and measures the performance of models under different levels of severity for each distortion. Table 4 presents the generalization gaps, calculated as the difference between the loss on the corrupted dataset and the loss on the training data, for 5 levels of severity for ResNet-18 and ResNet-50 trained on ImageNet-1K. By augmenting the test dataset in this way, we unlock more unseen data for evaluating the generalization of our models. As is evident from these experiments, quantized models maintain their superior generalization under most of the distortions. Accuracies of the models on the distorted dataset, as well as results and discussions on more architectures and datasets, and details on how we conducted these experiments, are available in the supplementary material Section C.
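For one corruption type and severity level, the gap reported in Table 4 can be computed as in the following sketch (assuming data loaders for the clean training split and the corrupted test split, e.g., built from the Hendrycks & Dietterich (2019) ImageNet-C release; the loader names are illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generalization_gap(model, train_loader, corrupted_loader, loss_fn=F.cross_entropy):
    """Gap in the style of Table 4: mean loss on a corrupted test split minus
    mean loss on the clean training split."""
    def mean_loss(loader):
        total, count = 0.0, 0
        for x, y in loader:
            total += loss_fn(model(x), y).item() * x.size(0)
            count += x.size(0)
        return total / count
    model.eval()
    return mean_loss(corrupted_loader) - mean_loss(train_loader)
```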

5 Conclusion

In this work, we investigated the generalization properties of quantized neural networks, which have received limited attention despite their significant impact on model performance. We demonstrated that quantization has a regularization effect and leads to improved generalization capabilities. We empirically showed that quantization can facilitate the convergence of models to flatter minima. Lastly, on distorted data, we provided empirical evidence that quantized models exhibit improved generalization compared to their full-precision counterparts across various experimental setups. Through the findings of this study, we hope that the inherent generalization capabilities of quantized models can be used to further improve their performance.

References

  • Models and pre-trained weights - Torchvision 0.15 documentation. https://pytorch.org/vision/stable/models.html. [Online; accessed 2023-05-19].
  • Agustsson & Theis (2020) Eirikur Agustsson and Lucas Theis. Universally quantized neural compression. Advances in Neural Information Processing Systems, 33:12367–12376, 2020.
  • Bach (2017) Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
  • Bai et al. (2021) Haoping Bai, Meng Cao, Ping Huang, and Jiulong Shan. BatchQuant: Quantized-for-all architecture search with robust quantizer. Advances in Neural Information Processing Systems, 34:1074–1085, 2021.
  • Bartlett et al. (2017) Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30, 2017.
  • Chen et al. (2021) Wentao Chen, Hailong Qiu, Jian Zhuang, Chutong Zhang, Yu Hu, Qing Lu, Tianchen Wang, Yiyu Shi, Meiping Huang, and Xiaowei Xu. Quantization of deep neural networks for accurate edge computing. ACM Journal on Emerging Technologies in Computing Systems (JETC), 17(4):1–11, 2021.
  • Chmiel et al. (2020) Brian Chmiel, Ron Banner, Gil Shomron, Yury Nahshan, Alex Bronstein, Uri Weiser, et al. Robust quantization: One model to rule them all. Advances in Neural Information Processing Systems, 33:5308–5317, 2020.
  • Courbariaux et al. (2015) Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. Advances in Neural Information Processing Systems, 28, 2015.
  • Cubuk et al. (2018) Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
  • Défossez et al. (2021) Alexandre Défossez, Yossi Adi, and Gabriel Synnaeve. Differentiable model compression via pseudo quantization noise. arXiv preprint arXiv:2104.09987, 2021.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.
  • Dinh et al. (2017) Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pp. 1019–1028. PMLR, 2017.
  • Du et al. (2022) Jiawei Du, Daquan Zhou, Jiashi Feng, Vincent Tan, and Joey Tianyi Zhou. Sharpness-aware training for free. Advances in Neural Information Processing Systems, 35:23439–23451, 2022.
  • Dziugaite & Roy (2017) Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
  • Dziugaite et al. (2020) Gintare Karolina Dziugaite, Alexandre Drouin, Brady Neal, Nitarshan Rajkumar, Ethan Caballero, Linbo Wang, Ioannis Mitliagkas, and Daniel M. Roy. In search of robust measures of generalization. Advances in Neural Information Processing Systems, 33:11723–11733, 2020.
  • Esser et al. (2020) Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. Learned step size quantization. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgO66VKDS.
  • Foret et al. (2021) Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=6Tm1mposlrM.
  • Galloway et al. (2017) Angus Galloway, Graham W. Taylor, and Medhat Moussa. Attacking binarized neural networks. arXiv preprint arXiv:1711.00449, 2017.
  • Gambardella et al. (2019) Giulio Gambardella, Johannes Kappauf, Michaela Blott, Christoph Doehring, Martin Kumm, Peter Zipf, and Kees Vissers. Efficient error-tolerant quantized neural network accelerators. In 2019 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), pp. 1–6. IEEE, 2019.
  • Gholami et al. (2021) Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630, 2021.
  • Gorsline et al. (2021) Micah Gorsline, James Smith, and Cory Merkel. On the adversarial robustness of quantized neural networks. In Proceedings of the 2021 Great Lakes Symposium on VLSI, pp. 189–194, 2021.
  • Hendrycks & Dietterich (2019) Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In Proceedings of the International Conference on Learning Representations, 2019.
  • Hou et al. (2019) Lu Hou, Ruiliang Zhang, and James T. Kwok. Analysis of quantized models. In International Conference on Learning Representations, 2019.
  • Hubara et al. (2016) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. Advances in Neural Information Processing Systems, 29, 2016.
  • Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. PMLR, 2015.
  • Izmailov et al. (2018) Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
  • Jacob et al. (2018) Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.
  • Jiang et al. (2019) Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. arXiv preprint arXiv:1912.02178, 2019.
  • Keskar et al. (2016) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
  • Kingma & Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krishnamoorthi (2018) Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. CoRR, abs/1806.08342, 2018. URL http://arxiv.org/abs/1806.08342.
  • Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. pp. 32–33, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
  • Li et al. (2017) Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein. Training quantized nets: A deeper understanding. Advances in Neural Information Processing Systems, 30, 2017.
  • Li et al. (2018) Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Neural Information Processing Systems, 2018.
  • Liang et al. (2019) Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-Rao metric, geometry, and complexity of neural networks. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 888–896. PMLR, 2019.
  • Lin et al. (2019) Ji Lin, Chuang Gan, and Song Han. Defensive quantization: When efficiency meets robustness. arXiv preprint arXiv:1904.08444, 2019.
  • Lin et al. (2013) Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
  • Liu et al. (2021) Jing Liu, Jianfei Cai, and Bohan Zhuang. Sharpness-aware quantization for deep neural networks. arXiv preprint arXiv:2111.12273, 2021.
  • Martens & Grosse (2015) James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417. PMLR, 2015.
  • McAllester (1999) David A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pp. 164–170, 1999.
  • Mishchenko et al. (2019) Yuriy Mishchenko, Yusuf Goren, Ming Sun, Chris Beauchene, Spyros Matsoukas, Oleg Rybakov, and Shiv Naga Prasad Vitaladevuni. Low-bit quantization and quantization-aware training for small-footprint keyword spotting. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 706–711. IEEE, 2019.
  • Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011. URL http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf.
  • Neyshabur et al. (2017) Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.
  • Polino et al. (2018) Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1XolQbRW.
  • Recht et al. (2019) Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pp. 5389–5400. PMLR, 2019.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • Wang et al. (2022) Zheng Wang, Juncheng B. Li, Shuhui Qu, Florian Metze, and Emma Strubell. SQuAT: Sharpness- and quantization-aware training for BERT. arXiv preprint arXiv:2210.07171, 2022.
  • Widrow et al. (1996) Bernard Widrow, Istvan Kollar, and Ming-Chang Liu. Statistical theory of quantization. IEEE Transactions on Instrumentation and Measurement, 45(2):353–361, 1996.
  • XijieHuang & Cheng (2023) Xijie Huang, Zhiqiang Shen, and Kwang-Ting Cheng. Variation-aware vision transformer quantization. arXiv preprint arXiv:2307.00331, 2023.
  • Xu et al. (2018) Chen Xu, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao, Zhirong Wang, and Hongbin Zha. Alternating multi-bit quantization for recurrent neural networks. arXiv preprint arXiv:1802.00150, 2018.
  • Zhang et al. (2021) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
  • Zhang et al. (2022a) Kaiqi Zhang, Ming Yin, and Yu-Xiang Wang. Why quantization improves generalization: NTK of binary weight neural networks, 2022a. URL https://arxiv.org/abs/2206.05916.
  • Zhang et al. (2022b) Yedi Zhang, Fu Song, and Jun Sun. QEBVerif: Quantization error bound verification of neural networks. arXiv preprint arXiv:2212.02781, 2022b.
  • Zhou et al. (2017) Aojun Zhou et al. Incremental network quantization: Towards lossless CNNs with low-precision weights. CoRR, abs/1702.03044, 2017.

Appendix A Flatness Landscape

The PAC-Bayesian and sharpness generalization measures both build on PAC-Bayes bounds, which bound the generalization error of a predictor (i.e., a neural network). In our case, the PAC-Bayes bound is a function of the KL divergence between the prior and posterior distributions of the model parameters, where the prior is chosen without knowledge of the dataset and the posterior is a perturbation of the trained parameters. It has been shown that when both distributions are isotropic Gaussians, PAC-Bayesian bounds are a good measure of generalization in small-scale experiments. We refer the reader to Jiang et al. (2019) for detailed analysis and derivations, which we summarize here. The PAC-Bayes generalization measures are defined below:

\mu_{\text{pac-bayes-init}}(f_{\bm{w}}) = \frac{\|\bm{w}-\bm{w}^{0}\|_{2}^{2}}{4\sigma^{2}} + \log\left(\frac{m}{\delta}\right) + 10 \qquad (15)
\mu_{\text{pac-bayes-orig}}(f_{\bm{w}}) = \frac{\|\bm{w}\|_{2}^{2}}{4\sigma^{2}} + \log\left(\frac{m}{\delta}\right) + 10 \qquad (16)

where \sigma is chosen to be the largest number such that \mathbb{E}_{\bm{u}\sim\mathcal{N}(\mu,\sigma^{2}I)}\left[\hat{\mathcal{L}}(f_{\bm{w}+\bm{u}})\right] \leq 0.1, and m is the sample size of the dataset.
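Once \sigma has been found, Equations 15 and 16 reduce to simple arithmetic on the flattened weight vectors. The snippet below is a minimal sketch of that arithmetic; the confidence parameter delta is not fixed in the text, so its default here is an assumption.

```python
import math
import torch

def pac_bayes_measures(w, w0, sigma, m, delta=0.05):
    """Eq. 15 (init) and Eq. 16 (orig); w and w0 are flattened weight tensors."""
    dist_sq = torch.sum((w - w0) ** 2).item()   # ||w - w0||_2^2
    norm_sq = torch.sum(w ** 2).item()          # ||w||_2^2
    init = dist_sq / (4 * sigma ** 2) + math.log(m / delta) + 10
    orig = norm_sq / (4 * sigma ** 2) + math.log(m / delta) + 10
    return init, orig
```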

From the same PAC-Bayesian bound framework, we can also derive the sharpness measure by using the worst-case noise \alpha rather than the Gaussian-sampled noise.

\mu_{\text{sharpness-init}}(f_{\bm{w}}) = \frac{\|\bm{w}-\bm{w}^{0}\|_{2}^{2}\,\log(2\omega)}{4\alpha^{2}} + \log\left(\frac{m}{\delta}\right) + 10 \qquad (17)
\mu_{\text{sharpness-orig}}(f_{\bm{w}}) = \frac{\|\bm{w}\|_{2}^{2}\,\log(2\omega)}{4\alpha^{2}} + \log\left(\frac{m}{\delta}\right) + 10 \qquad (18)

where \alpha is chosen to be the largest number such that \max_{|u_{i}|\leq\alpha}\hat{\mathcal{L}}(f_{\bm{w}+\bm{u}}) \leq 0.1, and \omega is the number of parameters in the model.

For the magnitude-aware measures Keskar et al. (2016), the ratio of the perturbation magnitude to the parameter magnitude is bounded by a constant \alpha'. Bounding this ratio prevents parameters from changing sign. This change leads to the following magnitude-aware generalization measures:

\mu_{\text{pac-bayes-mag-init}}(f_{\bm{w}}) = \frac{1}{4}\sum_{i=1}^{\omega}\log\left(\frac{\epsilon^{2}+(\sigma'^{2}+1)\|\bm{w}-\bm{w}^{0}\|_{2}^{2}/\omega}{\epsilon^{2}+\sigma'^{2}|w_{i}-w_{i}^{0}|^{2}}\right) + \log\left(\frac{m}{\delta}\right) + 10 \qquad (19)
\mu_{\text{pac-bayes-mag-orig}}(f_{\bm{w}}) = \frac{1}{4}\sum_{i=1}^{\omega}\log\left(\frac{\epsilon^{2}+(\sigma'^{2}+1)\|\bm{w}\|_{2}^{2}/\omega}{\epsilon^{2}+\sigma'^{2}|w_{i}-w_{i}^{0}|^{2}}\right) + \log\left(\frac{m}{\delta}\right) + 10 \qquad (20)
\mu_{\text{sharpness-mag-init}}(f_{\bm{w}}) = \frac{1}{4}\sum_{i=1}^{\omega}\log\left(\frac{\epsilon^{2}+\bigl(\alpha'^{2}+4\log(2\omega/\delta)\bigr)\|\bm{w}-\bm{w}^{0}\|_{2}^{2}/\omega}{\epsilon^{2}+\alpha'^{2}|w_{i}-w_{i}^{0}|^{2}}\right) + \log\left(\frac{m}{\delta}\right) + 10 \qquad (21)
\mu_{\text{sharpness-mag-orig}}(f_{\bm{w}}) = \frac{1}{4}\sum_{i=1}^{\omega}\log\left(\frac{\epsilon^{2}+\bigl(\alpha'^{2}+4\log(2\omega/\delta)\bigr)\|\bm{w}\|_{2}^{2}/\omega}{\epsilon^{2}+\alpha'^{2}|w_{i}-w_{i}^{0}|^{2}}\right) + \log\left(\frac{m}{\delta}\right) + 10 \qquad (22)

where \epsilon = 0.001 and \sigma is chosen to be the largest number such that \mathbb{E}_{\bm{u}}\left[\hat{\mathcal{L}}(f_{\bm{w}+\bm{u}})\right] \leq 0.1.
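Equation 19 can likewise be evaluated directly from flattened weight tensors once \sigma' has been determined. A minimal sketch follows; the function name and the default delta are our own choices and m, \sigma' are supplied by the caller.

```python
import math
import torch

def pac_bayes_mag_init(w, w0, sigma_prime, m, delta=0.05, eps=1e-3):
    """Eq. 19: per-parameter log-ratio, summed and scaled by 1/4."""
    omega = w.numel()
    dist_sq = torch.sum((w - w0) ** 2)                      # ||w - w0||_2^2
    num = eps ** 2 + (sigma_prime ** 2 + 1) * dist_sq / omega
    den = eps ** 2 + sigma_prime ** 2 * (w - w0) ** 2       # element-wise denominator
    return 0.25 * torch.log(num / den).sum().item() + math.log(m / delta) + 10
```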

Appendix B Experiment Setup For Measuring Sharpness-based Metrics

B.1 Training Setup

We used different models and datasets to compute the generalization gap using the proxy metrics described in Appendix A.

Our experiments employed the LSQ method Esser et al. (2020) for weight quantization. We used the CIFAR-10, CIFAR-100, and ImageNet datasets and applied three quantization levels to the quantized models: 2, 4, and 8 bits. The CIFAR-10 and CIFAR-100 NiN models have a base width of 25 and are trained for 300 epochs with an SGD optimizer, an initial learning rate of 0.1, momentum of 0.9, and a weight decay of 0.0001. We use a multi-step scheduler with steps at epochs 100 and 200 and a gamma of 0.1. The ResNet models for these two datasets have a base width of 16 and use the same optimizer as the NiN network; however, they are trained for 200 epochs, with steps at epochs 80 and 160. The ResNet models used to compare sharpness-based measures on ImageNet have a base width of 64 and use the same optimizer with a learning rate of 0.01. We fine-tune these models from PyTorch pre-trained weights for 120 epochs, with steps at epochs 30, 60, and 90.
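The optimizer and schedule above map directly onto standard PyTorch components. The sketch below illustrates the CIFAR NiN configuration with a stand-in model; the actual NiN architecture and training loop are not shown.

```python
import torch
import torch.nn as nn

# Stand-in model; the paper uses NiN with a base width of 25 for CIFAR.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 100))

# SGD with the hyperparameters reported for the CIFAR NiN runs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

# Multi-step schedule: decay by gamma=0.1 at epochs 100 and 200 (300 epochs total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[100, 200], gamma=0.1)

for epoch in range(300):
    # ... one pass over the training set (forward, backward, optimizer.step()) ...
    scheduler.step()
```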

B.2 Measuring the Metrics

To compute the PAC-Bayesian and sharpness measures, we evaluate both the magnitude-aware and the standard variants of each metric at every quantization level. In each case, we run the search for the largest admissible noise magnitude (σ) for 15 iterations, and within each iteration we average the accuracy on the training data over 10 runs to reduce the effect of randomness. As an additional step when computing the sharpness measures, we perform 20 iterations of gradient ascent to maximize the loss, using an SGD optimizer with a learning rate of 0.0001.
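The noise-magnitude search can be realized as a simple bisection: grow σ while the averaged perturbed-training metric stays within the admissible bound, shrink it otherwise. The sketch below is one possible implementation under those assumptions; `eval_fn` (e.g. mean training loss or error) and the search interval are placeholders, not values prescribed by the paper.

```python
import copy
import torch

def perturbed_metric(model, sigma, eval_fn, n_draws=10):
    """Mean of eval_fn over n_draws Gaussian weight perturbations of scale sigma."""
    vals = []
    for _ in range(n_draws):
        noisy = copy.deepcopy(model)
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(torch.randn_like(p) * sigma)
        vals.append(eval_fn(noisy))
    return sum(vals) / len(vals)

def search_sigma(model, eval_fn, bound=0.1, iters=15, lo=0.0, hi=1.0):
    """Bisection for the largest sigma whose perturbed metric stays within `bound`."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if perturbed_metric(model, mid, eval_fn) <= bound:
            lo = mid   # perturbation still acceptable; try larger noise
        else:
            hi = mid   # too much degradation; shrink the noise
    return lo
```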

B.3 Measuring Generalization Gaps

In our experiments for measuring the generalization gaps, we trained almost 2000 CIFAR-10 and CIFAR-100 models. The backbone in all of these experiments was NiN. We trained models over variations of 5 hyperparameters, each with 3 choices (a sketch of the resulting grid is given after the list). For CIFAR-10, the hyperparameter values are:

  • Optimizer algorithm: {SGD, ADAM, RMSProp}

  • Learning rate: {0.1, 0.05, 0.01} for SGD, {0.001, 0.0005, 0.0001} for ADAM and RMSProp

  • Weight decay: {0.0, 0.0001, 0.0002}

  • Width multiplier: {8, 10, 12}

  • Depth multiplier: {2, 3, 4}

For CIFAR-100 everything is the same, with the minor difference that the depth multipliers are in the set {3, 4, 5}.
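The resulting grid is the Cartesian product of these choices, 3^5 = 243 configurations per dataset. A minimal sketch of enumerating it (variable names are ours):

```python
from itertools import product

optimizers    = ["SGD", "ADAM", "RMSProp"]
weight_decays = [0.0, 1e-4, 2e-4]
widths        = [8, 10, 12]
depths        = [2, 3, 4]          # {3, 4, 5} for CIFAR-100
lrs = {"SGD": [0.1, 0.05, 0.01],
       "ADAM": [1e-3, 5e-4, 1e-4],
       "RMSProp": [1e-3, 5e-4, 1e-4]}

configs = [
    dict(opt=o, lr=lr, wd=wd, width=w, depth=d)
    for o in optimizers
    for lr in lrs[o]
    for wd, w, d in product(weight_decays, widths, depths)
]
print(len(configs))   # 243 configurations
```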

Each NiN instance is trained for 300 epochs; in every case we use a step scheduler with steps at epochs 100 and 200 and a gamma of 0.1. The checkpoint with the lowest training loss is selected, using no information about the test data. The statistics in Table 1 are then generated.

B.4 Computation Requirements

To train the NiN models for each quantization level, we use one NVIDIA A100 GPU with a batch size of 128. Each experiment takes almost 6 days to run, which corresponds on average to 35 minutes per model. We use 8 GPUs in total: 4 for CIFAR-10 and 4 for CIFAR-100.

For evaluating the sharpness measures, the main bottleneck is the ImageNet models, as evaluating these measures for each quantization level requires almost 600 evaluations on the training data in the worst case. Running each quantization level on one NVIDIA A100 GPU takes 33 hours on average.

Appendix C Distortion Experiments

These are the extended results of our investigation of the generalization gap under distortion. We report the generalization gaps of quantized and full-precision models on augmented datasets.

C.1 Training Setup

For the full-precision models, we used the pre-trained models publicly available on the PyTorch website (Pyt). For the quantized models, we quantize weights with the LSQ method Esser et al. (2020). We use the CIFAR-100 and ImageNet datasets in our tests and three quantization levels for the quantized models: 2, 4, and 8 bits. We use a multi-step scheduler with steps at epochs 30, 60, and 90, an initial learning rate of 0.01, and a gamma of 0.1, together with a weight decay of 1e-4 and an SGD optimizer. All models are trained for 120 epochs. Finally, we initialize the weights for LSQ quantization from the PyTorch pre-trained models.
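LSQ learns a per-layer step size s jointly with the weights, quantizing a weight w to round(clamp(w/s, Q_N, Q_P))·s and passing gradients through a straight-through estimator with a gradient scale on s. The sketch below is a simplified symmetric weight quantizer in that spirit, assuming the detach-based STE formulation common in LSQ re-implementations; it is not the exact code used in our experiments.

```python
import torch
import torch.nn as nn

def grad_scale(x, scale):
    # Forward value of x, gradient scaled by `scale` (LSQ step-size gradient scaling).
    return (x - x * scale).detach() + x * scale

def round_ste(x):
    # Round in the forward pass, identity gradient in the backward pass (STE).
    return (x.round() - x).detach() + x

class LSQWeightQuantizer(nn.Module):
    """Simplified symmetric LSQ-style weight quantizer (sketch, assumptions noted above)."""
    def __init__(self, bits=4):
        super().__init__()
        self.qn = -(2 ** (bits - 1))
        self.qp = 2 ** (bits - 1) - 1
        self.step = nn.Parameter(torch.tensor(1.0))

    def init_step(self, w):
        # Common LSQ initialization: 2 * E|w| / sqrt(Qp).
        self.step.data = 2 * w.abs().mean() / (self.qp ** 0.5)

    def forward(self, w):
        g = 1.0 / ((w.numel() * self.qp) ** 0.5)      # step-size gradient scale
        s = grad_scale(self.step, g)
        return round_ste(torch.clamp(w / s, self.qn, self.qp)) * s
```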

C.2 Data Preparation

For the augmented datasets, we use the corrupted ImageNet-C and CIFAR-100-C datasets proposed in Hendrycks & Dietterich (2019). Table 6 presents the results for the ResNet-18, MobileNet V2, and ResNet-50 models trained on the ImageNet dataset, and Table 5 presents the results for the ResNet-18, MobileNet-V1, and VGG-19 models on CIFAR-100. These tables show the effect of distortion on the generalization gap of quantized models under various types and severity levels of distortion. Specifically, 15 different types of distortion were applied. For each distortion type, the generalization gap was computed as the test loss on the distorted dataset minus the loss on the original ImageNet training dataset.
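Concretely, the evaluation loops over every corruption type and severity level and computes the gap against the clean training loss. The sketch below assumes the standard ImageNet-C directory layout (<root>/<corruption>/<severity>/) and reuses a `mean_loss` helper such as the one sketched in Section 4.3; it is an illustration, not our exact evaluation script.

```python
import torchvision
from torch.utils.data import DataLoader

# Folder names as distributed with ImageNet-C (Hendrycks & Dietterich, 2019).
CORRUPTIONS = ["gaussian_noise", "shot_noise", "impulse_noise", "defocus_blur",
               "glass_blur", "motion_blur", "zoom_blur", "snow", "frost", "fog",
               "brightness", "contrast", "elastic_transform", "pixelate",
               "jpeg_compression"]

def corruption_gaps(model, train_loss, root, transform, batch_size=256):
    """Generalization gap per (corruption, severity): corrupted loss - clean train loss."""
    gaps = {}
    for corruption in CORRUPTIONS:
        for severity in range(1, 6):
            folder = f"{root}/{corruption}/{severity}"
            dataset = torchvision.datasets.ImageFolder(folder, transform=transform)
            loader = DataLoader(dataset, batch_size=batch_size, num_workers=8)
            gaps[(corruption, severity)] = mean_loss(model, loader) - train_loss
    return gaps
```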

C.3 Computation Setup

For these experiments, we used 8 NVIDIA A100 GPUs with 40 GB of memory each to train the ImageNet and CIFAR-100 models. With this hardware and the submitted code, each ImageNet model takes about 18 hours to train on average. CIFAR-100 models take much less time; on average, each model trains in less than an hour.

C.4 Results on CIFAR-100 Dataset

For the CIFAR-100 dataset, we computed the generalization gap (the test loss on the augmented CIFAR-100-C data minus the training loss on the original CIFAR-100 dataset) for ResNet-18, MobileNet-V1, and VGG-19. Unlike the ImageNet-C dataset, the CIFAR-100-C dataset comes with only one distortion severity level. Table 5 shows the results of our experiment on the CIFAR-100 dataset. Compared to the full-precision models, the quantized models show smaller generalization gaps in all cases.

Table 5: Generalization gaps on CIFAR-100-C for ResNet-18, MobileNet-V1, and VGG-19 at FP32, Int8, Int4, and Int2 precision, across the 15 corruption types of Hendrycks & Dietterich (2019).

C.5 Results on ImageNet Dataset

For the ImageNet dataset, we computed the generalization gap (the test loss on the augmented data minus the training loss on the original ImageNet dataset) for ResNet-18, MobileNet V2, and ResNet-50. Table 6 shows the full list of experiments. As shown, compared to the full-precision models, and unlike on the CIFAR-100 dataset, not all quantization levels yield a better generalization gap. In particular, for the MobileNet V2 model, Int2 quantization shows the worst generalization gap for most distortion types and severity levels. In general, however, Int8 and Int4 show better generalization gaps across almost all models, distortion types, and severity levels.

Table 6: Generalization gaps on ImageNet-C for ResNet-18, MobileNet V2, and ResNet-50 at FP32, Int8, Int4, and Int2 precision, across the 15 corruption types of Hendrycks & Dietterich (2019) and 5 severity levels.
