I. Reference Documentation

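PyTorch provides two quantization modes:

  • Eager Mode Quantization: fusion is done manually, and the positions of quantization and dequantization are specified by hand
  • FX Graph Mode Quantization: automatic (fusion and quantize/dequantize placement are handled by the framework)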

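II. Quantization Types Supported in Eager Mode

  • PTQ (post training quantization) supports: static, dynamic
  • QAT (quantization aware training) supports: static
  • Dynamic quantization is typically used for NLP models
  • Static quantization is typically used in computer vision, mainly for CNNs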

1. Post Training Dynamic Quantization

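This is the simplest form of quantization: the weights are quantized ahead of time, while the inputs (activations) are quantized dynamically during inference.
Activations are read from and written to memory in floating point format.
PTDQ API: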

import torch

# define a floating point model
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        x = self.fc(x)
        return x

# create a model instance
model_fp32 = M()
# create a quantized model instance
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,  # the original model
    {torch.nn.Linear},  # a set of layers to dynamically quantize
    dtype=torch.qint8)  # the target dtype for quantized weights

# run the model
input_fp32 = torch.randn(4, 4, 4, 4)
res = model_int8(input_fp32)

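Post Training Dynamic Quantization is usually shortened to Dynamic Quantization, and is sometimes also called weight-only quantization.

It can reach higher accuracy, because the clipping range is calibrated exactly for each input.
Currently only linear layers (Linear) and recurrent layers (LSTM, GRU, RNN) support dynamic quantization. Calibrating and quantizing the activations of every layer at runtime also adds computational overhead.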

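1.1 How dynamic quantization is computed

By default only some ops are converted: Linear, LSTM, LSTMCell, RNNCell, GRUCell.

  • The PlaceholderObserver used for activations is just a placeholder and does nothing.
  • The MinMaxObserver used for weights records the minimum and maximum of its input tensor, which are then used to compute scale and zero point (zp). min_val and max_val are the minimum and maximum of the op's weight data / input tensor distribution; qmin and qmax are the minimum and maximum of the quantized value range (-128 and 127). The symmetric quantization formula is used.
  • So the weight quantization is in fact "static". The name "dynamic quantization" comes from the fact that the float input tensor is converted to a quantized tensor on the fly during the forward pass.
  • During the forward pass of dynamic quantization, nnqd.Linear calls torch.ops.quantized.linear_dynamic. Its inputs are the already packed quantized weight and the floating-point bias. PyTorch eventually dispatches linear_dynamic to the C++ function apply_dynamic_impl, which quantizes the input with the following logic: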
Tensor q_input = at::quantize_per_tensor(input_contig, q_params.scale, q_params.zero_point, c10::kQUInt8);

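The essence of dynamic quantization is to determine the scale used to quantize the input at runtime, based on the observed range of the data, so that the scale of the input tensor is tuned to the actual input. The model parameters, by contrast, are converted to INT8 ahead of time. Once the input has been quantized as well, the arithmetic in the network is carried out with vectorized INT8 instructions. When the current layer produces its output, the result still has to be dequantized back to float32.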
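The runtime step described above can be sketched in Python. This is only an illustration of the idea, not the code path PyTorch actually runs: the helper name dynamic_quantize_input is made up here, and the quint8 range 0..255 without reduce_range is an assumption.

```python
import torch

def dynamic_quantize_input(x: torch.Tensor) -> torch.Tensor:
    """Quantize a float tensor to quint8 using its own min/max (illustrative only)."""
    qmin, qmax = 0, 255  # assumed quint8 range, no reduce_range
    # Extend the observed range to include 0 so that zero is exactly representable.
    min_val = min(x.min().item(), 0.0)
    max_val = max(x.max().item(), 0.0)
    scale = max((max_val - min_val) / (qmax - qmin), 1e-8)
    zero_point = int(round(qmin - min_val / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    # Same call shape as the C++ line above: per-tensor quantization to quint8.
    return torch.quantize_per_tensor(x, scale, zero_point, torch.quint8)

x = torch.tensor([-1.0, 0.0, 0.5, 3.0])
q = dynamic_quantize_input(x)
print(q.q_scale(), q.q_zero_point())
print(q.dequantize())  # close to x, up to quantization error
```

Because scale and zero point are recomputed from every input, a different batch would give different quantization parameters, which is exactly what makes this scheme "dynamic".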

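2. Post Training Static Quantization

Both weights and activations are quantized statically, and activations are fused into the preceding layers. Calibration with a dataset is required in order to determine the best quantization parameters for the activations.

What it shares with dynamic quantization: both convert the network weights from float32 to int8. What differs: the training set, or data with a distribution similar to it, has to be fed through the model (without backpropagation), and the quantization parameters of the activations are then computed from the distribution of each op's input. This step is called calibration. Static quantization therefore includes activation quantization, i.e. the processing applied after each op's forward computation.

PTSQ API: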

import torch

# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval mode for static quantization logic to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('x86')

# Fuse the activations to preceding layers, where applicable.
# This needs to be done manually depending on the model architecture.
# Common fusions include `conv + relu` and `conv + batchnorm + relu`
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])

# Prepare the model for static quantization. This inserts observers in
# the model that will observe activation tensors during calibration.
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)

# calibrate the prepared model to determine quantization parameters for activations
# in a real world setting, the calibration would be done with a representative dataset
input_fp32 = torch.randn(4, 1, 4, 4)
model_fp32_prepared(input_fp32)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, and replaces key operators with quantized
# implementations.
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)

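As the API above shows, static quantization has five main steps:

  • 1. fuse_model: merge some layers to improve speed and accuracy
  • 2. Set qconfig: an instance of QConfig, which holds the quantization observers
    – The two observers held by the default qconfig for each backend are listed in the table below: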

Quantization backend                 | activation                             | weight
fbgemm                               | HistogramObserver (reduce_range=True)  | PerChannelMinMaxObserver (default_per_channel_weight_observer)
qnnpack                              | HistogramObserver (reduce_range=False) | MinMaxObserver (default_weight_observer)
default (neither fbgemm nor qnnpack) | MinMaxObserver (default_observer)      | MinMaxObserver (default_weight_observer)

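  • 3. prepare: insert an Observer into each submodule to collect statistics for calibration.
  • 4. Feed data: this is not training. The goal is to capture the distribution of the data so that the scale and zp of the activations can be computed well. Feed at least a few hundred iterations of data.
  • 5. Convert the model: this works much like dynamic quantization. In essence it looks up the type of every op in the model; if the type is a key of the dictionary DEFAULT_STATIC_QUANT_MODULE_MAPPINGS (note that this is a different dictionary from the one used for dynamic quantization), the op is replaced with the value mapped to that key.
  • Activations are not calibrated at runtime; instead the clipping range is calibrated in advance on validation data and then fixed (hence "static").
  • Static quantization gives faster inference than dynamic quantization, because it removes the float/int conversion overhead between layers.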
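Step 4 (feeding data) can be sketched as a plain forward-only loop. The helper name calibrate is hypothetical, and the model and random batches below are stand-ins; in practice you would pass the output of torch.ao.quantization.prepare and a representative dataset.

```python
import torch

def calibrate(prepared_model: torch.nn.Module, batches) -> int:
    """Run forward passes only, so inserted observers can record activation ranges."""
    prepared_model.eval()
    seen = 0
    with torch.no_grad():  # no backward pass: this is calibration, not training
        for batch in batches:
            prepared_model(batch)
            seen += 1
    return seen

# Stand-in model and data, for illustration only.
demo_model = torch.nn.Sequential(torch.nn.Conv2d(1, 1, 1), torch.nn.ReLU())
demo_batches = [torch.randn(4, 1, 4, 4) for _ in range(8)]
num_batches = calibrate(demo_model, demo_batches)
print(num_batches)
```

Eight batches are used here only to keep the sketch short; the text above suggests a few hundred iterations for real calibration.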

2.1 Computing scale and zero point in static quantization

PyTorch's logic for computing scale and zero point:

# when qscheme is torch.per_tensor_symmetric or torch.per_channel_symmetric
max_val = torch.max(-min_val, max_val)
scale = max_val / (float(qmax - qmin) / 2)
scale = torch.max(scale, torch.tensor(self.eps, device=device, dtype=scale.dtype))
if self.dtype == torch.quint8:
    zero_point = zero_point.new_full(zero_point.size(), 128)

# when qscheme is torch.per_tensor_affine
scale = (max_val - min_val) / float(qmax - qmin)
scale = torch.max(scale, torch.tensor(self.eps, device=device, dtype=scale.dtype))
zero_point = qmin - torch.round(min_val / scale)
zero_point = torch.max(zero_point, torch.tensor(qmin, device=device, dtype=zero_point.dtype))
zero_point = torch.min(zero_point, torch.tensor(qmax, device=device, dtype=zero_point.dtype))
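  • scale and zp of QuantStub: computed with asymmetric (affine) quantization. QuantStub uses a HistogramObserver. For an input distributed over [-3, 3], the HistogramObserver yields min_val = -3 and max_val = 2.9971, while qmin and qmax are 0 and 127.
  • scale and zp of the conv activation: the tensor after convolution uses the asymmetric quantization formula. The observer (quint8) is a HistogramObserver with reduce_range enabled, so qmin, qmax = 0, 127; min_val and max_val are determined from the input data and the weights according to the L2Norm criterion.
  • scale and zp of the conv weight: the convolution weight tensor uses the symmetric quantization formula. The weight observer (qint8) is a PerChannelMinMaxObserver without reduce_range, so qmin, qmax = -128, 127; min_val and max_val are the minimum and maximum of the data passed to the observer.
  • scale and zp of the fc activation: computed the same way as for conv
  • scale and zp of the fc weight:
  • scale and zp of the relu activation: computed with asymmetric quantization
  • Suppose that in the conv the weight is -0.7898 and the first value of the input tensor is -0.9912. The convolution should give -0.7898 x -0.9912 = 0.7828, but the actual result is 0.7801, so an error (about 0.34%) has already been introduced. Because every layer introduces a similar error, fuse_modules can improve accuracy.
  • The biggest difference between static and dynamic quantization: in static quantization the float input must pass through QuantStub to become int, and everything stays int from there until the output; in dynamic quantization the float input is quantized to int with a scale and zp computed on the fly, and each op converts its output back to float.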
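To make the two formulas concrete, here is a small pure-Python restatement applied to the QuantStub numbers quoted above (min_val = -3, max_val = 2.9971, qmin = 0, qmax = 127). The function names are made up for this sketch, the weight range in the second call is invented, and the real observers do some extra bookkeeping (for example, extending the range to include zero) that is left out here.

```python
def symmetric_qparams(min_val, max_val, qmin=-128, qmax=127, eps=1e-8):
    """scale / zero_point for a symmetric qint8 scheme (zero_point is always 0)."""
    max_abs = max(-min_val, max_val)
    scale = max(max_abs / ((qmax - qmin) / 2), eps)
    return scale, 0

def affine_qparams(min_val, max_val, qmin=0, qmax=127, eps=1e-8):
    """scale / zero_point for the per-tensor affine (asymmetric) scheme."""
    scale = max((max_val - min_val) / (qmax - qmin), eps)
    zero_point = qmin - round(min_val / scale)
    zero_point = max(qmin, min(qmax, zero_point))  # clamp into [qmin, qmax]
    return scale, int(zero_point)

# QuantStub example from the text: reduce_range quint8, so qmin, qmax = 0, 127.
act_scale, act_zp = affine_qparams(-3.0, 2.9971)
# A weight tensor whose values span [-0.7898, 0.5] (made-up range for illustration).
w_scale, w_zp = symmetric_qparams(-0.7898, 0.5)
print(act_scale, act_zp)
print(w_scale, w_zp)
```

The affine case spends its 128 levels on the interval [min_val, max_val] and shifts zero to level 64, while the symmetric case keeps zero at level 0 and sizes the range by the largest absolute value.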

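3. Static Quantization Aware Training

All weights and biases are stored in FP32. During the forward pass, quantization is simulated internally by FakeQuantize modules (data is quantized and then immediately dequantized).
QAT API: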

import torch

# define a floating point model where some layers could benefit from QAT
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval for fusion to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')

# fuse the activations to preceding layers, where applicable
# this needs to be done manually depending on the model architecture
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32,
    [['conv', 'bn', 'relu']])

# Prepare the model for QAT. This inserts observers and fake_quants in
# the model that will observe weight and activation tensors during calibration.
# The model needs to be set to train mode for the QAT logic to work.
model_fp32_prepared = torch.ao.quantization.prepare_qat(model_fp32_fused.train())

# run the training loop (not shown); training_loop stands for user-defined training code
training_loop(model_fp32_prepared)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, fuses modules where appropriate,
# and replaces key operators with quantized implementations.
model_fp32_prepared.eval()
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
input_fp32 = torch.randn(4, 1, 4, 4)  # example input matching Conv2d(1, 1, 1)
res = model_int8(input_fp32)
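  • 1. Set qconfig: before setting it, first put the model in training mode.
    – In the QAT qconfig, the observers for both activations and weights become FakeQuantize modules (FakeQuantize "has a" observer, i.e. it contains one), and their arguments differ (qmin, qmax, dtype, qscheme, reduce_range).
    – The observer contained in FakeQuantize is MovingAverageMinMaxObserver, which inherits from the MinMaxObserver mentioned earlier but computes the minimum and maximum slightly differently.
  • 2. fuse_modules: same as in static quantization
  • 3. prepare_qat: the prepare_qat API is used here. There are two main differences: prepare_qat attaches the qconfig to every op, and the qconfig itself is different (see step 1 of the five steps); prepare_qat also has the extra job of converting submodules, replacing some submodules of the model in place. The replacement maps each key of DEFAULT_QAT_MODULE_MAPPINGS to its value, and this dictionary is defined differently as well.
  • 4. Feed data: completely unlike static quantization, in QAT this step is actual training. The weight of each op is processed by self.weight_fake_quant and its output by self.activation_post_process. Both are FakeQuantize instances; only the observer they contain differs.
    fake_quantize_per_channel_or_tensor_affine in the FakeQuantize forward function implements quantize followed by dequantize. As a formula: out = (clamp(round(x/scale + zero_point), quant_min, quant_max) - zero_point) * scale. In other words, the quantization error is injected into the training loss.
    – In QAT, all weights and activations are therefore fake quantized as above and take part in the forward and backward computation of training. Float values are rounded to (simulated) int8 values, but all computation is still done in float. As a result, every weight is aware of the effect of quantization during optimization, hence the name quantization aware training (supported on CPU and CUDA), and accuracy is higher as a consequence.
  • 5. convert: same as in static quantization. Note that in QAT some modules were already replaced with new modules during prepare.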
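The fake quantization formula from step 4 can be checked directly. The sketch below writes the formula out by hand (the function name fake_quantize is made up) and compares it with torch.fake_quantize_per_tensor_affine; scale = 0.1 and zero_point = 0 are arbitrary values chosen for illustration.

```python
import torch

def fake_quantize(x, scale, zero_point, quant_min, quant_max):
    """Quantize and immediately dequantize, staying in floating point throughout."""
    q = torch.clamp(torch.round(x / scale + zero_point), quant_min, quant_max)
    return (q - zero_point) * scale

x = torch.tensor([-1.0, -0.33, 0.0, 0.27, 0.84, 20.0])
manual = fake_quantize(x, scale=0.1, zero_point=0, quant_min=-128, quant_max=127)
builtin = torch.fake_quantize_per_tensor_affine(x, 0.1, 0, -128, 127)
print(manual)
print(builtin)
```

Values are snapped to the nearest multiple of the scale, and 20.0 is clipped to quant_max * scale = 12.7. The output is still a float tensor, so the rest of the network keeps computing in float while seeing the quantization error.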

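IV. Quantization Stack

Components used in the quantization workflow:

  • Observer and FakeQuantize
    – Observer: collects tensor statistics, such as the minimum and maximum of a tensor, and computes quantization parameters
    – FakeQuantize: the fake quantization module
  • QConfig: a namedtuple of Observer or FakeQuantize module classes, and it is configurable
    – different types of Observer/FakeQuantize
    – supports separate configuration for weights and activations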

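V. Quantization API

Official documentation: https://pytorch.org/docs/stable/quantization-support.html
Reference: https://zhuanlan.zhihu.com/p/299108528

1. Top-level API

1.1 quantize & quantize_dynamic & quantize_qat: apply post training static quantization / dynamic (weight-only) quantization / quantization aware training.

Fine-grained control is set through qconfig.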

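  • quantize: the model must first be prepared and calibrated. The API is:
torch.ao.quantization.quantize(model, run_fn, run_args, mapping=None, inplace=False)
  • quantize_dynamic: in essence it looks up the type of every op in the model; if the type is a key of the dictionary DEFAULT_DYNAMIC_QUANT_MODULE_MAPPINGS, the op is replaced with the value mapped to that key. The API is given after the notes on qconfig_spec:
    – The qconfig_spec argument specifies a set of qconfigs, i.e. which op (operation; the various operations in a CNN such as conv, linear, batchnorm) uses which qconfig
    – Each qconfig is an instance of the QConfig class and wraps two observers;
    – The two observers are the weight observer and the activation observer
    qconfig_spec=None gives the default behavior
    qconfig_spec can be a set, e.g. {nn.LSTM, nn.Linear}, which names the layer types in the current model to be dynamically quantized;
    qconfig_spec can be a dict whose keys are submodule names or types and whose values are QConfigDynamic instances
torch.ao.quantization.quantize_dynamic(model, qconfig_spec=None, dtype=torch.qint8, mapping=None, inplace=False)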

The dictionary DEFAULT_DYNAMIC_QUANT_MODULE_MAPPINGS:

# Default map for swapping dynamic modules
DEFAULT_DYNAMIC_QUANT_MODULE_MAPPINGS = {
    nn.GRUCell: nnqd.GRUCell,
    nn.Linear: nnqd.Linear,
    nn.LSTM: nnqd.LSTM,
    nn.LSTMCell: nnqd.LSTMCell,
    nn.RNNCell: nnqd.RNNCell,
}
  • When a type is swapped from key to value, the new type has to be instantiated while reusing the existing weights. This is usually done through from_float().
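The module swap can be observed on a toy model. TinyNet below is a hypothetical model written for this sketch, and the example assumes a PyTorch build with a quantized engine available (fbgemm/x86 or qnnpack); passing a set as qconfig_spec selects which layer types are swapped.

```python
import torch

class TinyNet(torch.nn.Module):  # hypothetical model, for illustration only
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 2)
        self.act = torch.nn.ReLU()

    def forward(self, x):
        return self.act(self.fc(x))

model_fp32 = TinyNet().eval()
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8)

# fc is replaced by its dynamically quantized counterpart;
# ReLU is not a key of the mapping, so it is left untouched.
print(type(model_int8.fc))
print(type(model_int8.act))
out = model_int8(torch.randn(3, 4))  # float in, float out
```

Since inplace defaults to False, model_fp32 keeps its original nn.Linear and only the returned copy is modified.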

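1.2 prepare & prepare_qat: prepare a copy of the model for calibration or quantization aware training. The .qconfig attribute must be set first.

  • Post training static quantization (PTQ) uses prepare: it inserts observer modules so that activation tensors can be observed during calibration;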
torch.ao.quantization.prepare(model, inplace=False, allow_list=None, observer_non_leaf_module_list=None, prepare_custom_config_dict=None)

model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)

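  • Quantization aware training (QAT) uses prepare_qat: it inserts observer and fake_quant modules that observe weight and activation tensors during training. The model must be in train() mode for this to work.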
torch.ao.quantization.prepare_qat(model, mapping=None, inplace=False)

model_fp32_prepared = torch.ao.quantization.prepare_qat(model_fp32_fused.train())

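1.3 convert: converts submodules of the input module into different modules according to the mapping, by calling the from_float method of the target module class. If remove_qconfig is True, the qconfig is removed at the end.

In QAT, all computation is carried out in floating point; when training ends, convert turns the floating point values into quantized data.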

torch.ao.quantization.convert(module, mapping=None, inplace=False, remove_qconfig=True, is_reference=False, convert_custom_config_dict=None)

model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

  • from_float()
    The from_float method of the nnqat.Linear module is as follows:
    @classmethod
    def from_float(cls, mod):
        r"""Create a qat module from a float module or qparams_dict
            Args: `mod` a float module, either produced by torch.ao.quantization utilities
            or directly from user
        """
        assert type_before_parametrizations(mod) == cls._FLOAT_MODULE, (
            " qat."
            + cls.__name__
            + ".from_float only works for "
            + cls._FLOAT_MODULE.__name__
        )
        assert hasattr(mod, "qconfig"), "Input float module must have qconfig defined"
        assert mod.qconfig, "Input float module must have a valid qconfig"
        if type_before_parametrizations(mod) == LinearReLU:
            mod = mod[0]

        qconfig = mod.qconfig
        qat_linear = cls(mod.in_features, mod.out_features, bias=mod.bias is not None, qconfig=qconfig)

        if is_parametrized(mod, "weight"):
            transfer_parametrizations_and_params(mod, qat_linear, "weight")
        else:
            qat_linear.weight = mod.weight

        if is_parametrized(mod, "bias"):
            transfer_parametrizations_and_params(mod, qat_linear, "bias")
        else:
            qat_linear.bias = mod.bias

        return qat_linear

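This method constructs an instance of the qat_linear class.
For a quantized module such as nnqd.Linear, from_float() mainly does the following:

  • Uses a MinMaxObserver to compute the minimum and maximum of the weight tensors of the ops in the model (only the Linear op in this example), which narrows the range of original values to be quantized and improves quantization accuracy;
  • Takes min_val and max_val from the four-tuple obtained in the previous step, combines them with the qmin and qmax fixed by the algorithm to compute scale and zp, and then computes the quantized weight;
  • Instantiates nnqd.Linear, then uses qlinear.set_weight_bias to attach the quantized weight and the original bias to the new layer. This last step also involves packing the weight and bias, which looks like this in the source code: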
#ifdef USE_FBGEMM
    if (ctx.qEngine() == at::QEngine::FBGEMM) {
      return PackedLinearWeight::prepack(std::move(weight), std::move(bias));
    }
#endif

#ifdef USE_PYTORCH_QNNPACK
    if (ctx.qEngine() == at::QEngine::QNNPACK) {
      return PackedLinearWeightsQnnp::prepack(std::move(weight), std::move(bias));
    }
#endif
    TORCH_CHECK(false,"Didn't find engine for operation quantized::linear_prepack ",toString(ctx.qEngine()));

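This simply relies on backends such as FBGEMM and QNNPACK.

2. Preparation before quantization

2.1 fuse_modules: fuses modules. Common fusions include "conv+ReLU" and "conv+BN+ReLU"; they have to be done manually according to the model structure.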

model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'bn', 'relu']])

model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])
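What fusion does to the module tree can be seen on a two-layer toy model. ConvBlock is a hypothetical class written for this sketch; the submodule names 'conv' and 'relu' passed to fuse_modules must match the attribute names in the model.

```python
import torch

class ConvBlock(torch.nn.Module):  # hypothetical model, for illustration only
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))

model = ConvBlock().eval()
fused = torch.ao.quantization.fuse_modules(model, [['conv', 'relu']])

# The first module in the list becomes the fused module;
# the remaining ones are replaced with Identity.
print(type(fused.conv).__name__)
print(type(fused.relu).__name__)

x = torch.randn(2, 1, 4, 4)
same = torch.allclose(model(x), fused(x))  # fusion does not change float results
```

The fused model still computes in float at this point; the benefit shows up after convert, when the fused conv+relu is executed as a single quantized op.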

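2.2 QuantStub & DeQuantStub: quantization and dequantization

They must be inserted into the CNN structure by hand.

  • QuantStub: the quantize stub module. Before calibration it behaves like an observer; convert swaps it for nnq.Quantize;
  • DeQuantStub: the dequantize stub module. During prepare it acts as Identity; convert swaps it for nnq.DeQuantize.

3. torch.ao.quantization.observer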

  • ObserverBase
  • MinMaxObserver
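As a minimal sketch of how an observer is used on its own: call it on tensors to record statistics, then ask it for the quantization parameters. The input values are arbitrary, and the quint8 / per_tensor_affine settings are chosen only for illustration.

```python
import torch
from torch.ao.quantization.observer import MinMaxObserver

observer = MinMaxObserver(dtype=torch.quint8, qscheme=torch.per_tensor_affine)

# Each call updates the running min/max; the input is passed through unchanged.
observer(torch.tensor([-1.0, 0.0, 2.0]))
observer(torch.tensor([0.5, 3.0]))

scale, zero_point = observer.calculate_qparams()
print(observer.min_val, observer.max_val)
print(scale, zero_point)
```

With a running range of [-1, 3] and the quint8 range 0..255, the scale is 4/255 and the zero point lands at 64.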

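4. torch.ao.quantization.qconfig

Defines the QConfig object, which configures the quantization settings of an individual operation.

4.1 QConfig: describes how to quantize a layer or part of the network by supplying observer classes for activations and weights separately.

It must contain observer classes (such as MinMaxObserver) or callables that return an instance when called, not concrete observer instances.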
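A custom QConfig might be built as follows. The particular dtype and qscheme choices are just one plausible configuration, not a recommendation; the point is that both fields hold factories, and observers are only created when those factories are called.

```python
import torch
from torch.ao.quantization import QConfig
from torch.ao.quantization.observer import MinMaxObserver

my_qconfig = QConfig(
    # Factories (class + constructor args), not observer instances.
    activation=MinMaxObserver.with_args(
        dtype=torch.quint8, qscheme=torch.per_tensor_affine),
    weight=MinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_tensor_symmetric),
)

# prepare() calls these factories once per module; done by hand here.
act_observer = my_qconfig.activation()
weight_observer = my_qconfig.weight()
print(type(act_observer).__name__, act_observer.dtype)
print(type(weight_observer).__name__, weight_observer.dtype)
```

Assigning this object to model.qconfig before calling prepare would apply it to the whole model; assigning it to a submodule's qconfig restricts it to that submodule.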

4.2 default_qconfig: the default qconfig

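VI. Quantization .qconfig

The function that returns the qconfig is defined below. Two backends are commonly used: fbgemm quantizes weights per channel, and qnnpack quantizes weights per tensor. "fbgemm" can now be replaced by "x86", which is the recommended default.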

def get_default_qconfig(backend='fbgemm', version=0):
    """
    Returns the default PTQ qconfig for the specified backend.
    Args:
      * `backend`: a string representing the target backend. Currently supports `fbgemm`,
        `qnnpack` and `onednn`.
    Return:
        qconfig
    """
    if version == 0:
        if backend == 'fbgemm':
            qconfig = QConfig(activation=HistogramObserver.with_args(reduce_range=True),
                              weight=default_per_channel_weight_observer)
        elif backend == 'qnnpack':
            qconfig = QConfig(activation=HistogramObserver.with_args(reduce_range=False),
                              weight=default_weight_observer)
        elif backend == 'onednn':
            qconfig = QConfig(activation=HistogramObserver.with_args(reduce_range=False),
                              weight=default_per_channel_weight_observer)
        else:
            qconfig = default_qconfig
    else:
        raise AssertionError("Version number: " + str(version) +
                             " in get_default_qconfig is not supported. Version number must be 0")

    return qconfig

myModel.qconfig = torch.quantization.default_qconfig
per_channel_quantized_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
qat_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')

It calls with_args, which is defined as follows:

with_args = classmethod(_with_args)

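with_args is used when you need to create classes that share the same constructor arguments but produce different instances. Here classmethod, normally used as a decorator, turns _with_args into a class method, so it can be called directly on the observer class.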
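The "same arguments, different instances" behavior can be checked with a short sketch. MinMaxObserver and the dtype argument are arbitrary choices for illustration; any observer class exposing with_args should behave the same way.

```python
import torch
from torch.ao.quantization.observer import MinMaxObserver

# A reusable factory: the class plus pre-bound constructor arguments.
observer_factory = MinMaxObserver.with_args(dtype=torch.qint8)

obs_a = observer_factory()
obs_b = observer_factory()

# Feeding data to one instance leaves the other untouched.
obs_a(torch.tensor([-1.0, 2.0]))
print(obs_a is obs_b)
print(obs_a.min_val, obs_b.min_val)
```

This is why a single QConfig can be shared by many layers: each layer gets its own observer with its own statistics, all built from the same arguments.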