I. Reference Documents
- PyTorch official documentation
  - Quantization: https://pytorch.org/docs/stable/quantization.html?highlight=quantization
  - Introduction to Quantization on PyTorch: https://pytorch.org/blog/introduction-to-quantization-on-pytorch/
- Reference article: Gemfield, "PyTorch的量化" (PyTorch Quantization)
PyTorch provides two quantization modes:
- Eager Mode Quantization: fusion is done manually, and the quantize/dequantize points must be specified by hand
- FX Graph Mode Quantization: automated (fusion and quantize/dequantize placement are handled for you)
II. Quantization Types Supported in Eager Mode
- PTQ supports: static, dynamic
- QAT supports: static
- Dynamic quantization is mostly used for NLP models
- Static quantization is mostly used in computer vision, mainly for CNNs
1. Post Training Dynamic Quantization
This is the simplest form of quantization: the weights are quantized statically (ahead of time), while the inputs are quantized dynamically during inference.
Activations are read from and written to memory in floating-point format.
PTDQ API:
```python
import torch

# define a floating point model
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        x = self.fc(x)
        return x

# create a model instance
model_fp32 = M()

# create a quantized model instance
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,          # the original model
    {torch.nn.Linear},   # a set of layers to dynamically quantize
    dtype=torch.qint8)   # the target dtype for quantized weights

# run the model
input_fp32 = torch.randn(4, 4, 4, 4)
res = model_int8(input_fp32)
```
Post Training Dynamic Quantization, usually shortened to Dynamic Quantization, is also known as weight-only quantization.
It can achieve higher accuracy because the clipping range for activations is calibrated precisely at runtime for each input.
Dynamic quantization currently supports only Linear and recurrent (LSTM, GRU, RNN) layers, and calibrating and quantizing each layer's activations at runtime adds computational overhead.
1.1 Computation in Dynamic Quantization
By default only a subset of ops is converted: Linear, LSTM, LSTMCell, RNNCell, GRUCell.
- The PlaceholderObserver used for activations is literally a placeholder: it does nothing;
- The MinMaxObserver used for weights records the minimum and maximum of the observed tensor and uses them to compute the scale and zero point (zp). min_val and max_val are the minimum and maximum of the op's weight / input tensor distribution; qmin and qmax are the bounds of the quantized range (-128 and 127). The symmetric quantization formula is used.
- So the weight part is in fact quantized "statically"; the scheme is called "dynamic quantization" because the float input tensor is converted to a quantized tensor on the fly during forward inference.
- During forward inference of a dynamically quantized model, nnqd.Linear calls torch.ops.quantized.linear_dynamic, whose inputs are the packed quantized weights from above plus the floating-point bias. linear_dynamic is eventually dispatched by PyTorch to the C++ apply_dynamic_impl function, which quantizes the input with the following logic:
```cpp
Tensor q_input = at::quantize_per_tensor(input_contig, q_params.scale, q_params.zero_point, c10::kQUInt8);
```
The essence of dynamic quantization is to determine the scale for quantizing the input dynamically, based on the data range observed at runtime, so that the input tensor's scale is tuned to the actual data; the model weights, in contrast, are converted to INT8 ahead of time. Once the inputs are quantized too, the operations in the network run on vectorized INT8 instructions, and each layer still dequantizes its result back to float32 on output.
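As a rough illustration of that runtime step, the sketch below mimics in Python how a dynamic scale and zero point could be derived from the observed input range and then applied; the helper name dynamic_quantize_input and the plain affine qparams are assumptions for illustration, not the exact kernel logic.

```python
import torch

def dynamic_quantize_input(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: observe the runtime range of the float input ...
    min_val, max_val = x.min(), x.max()
    qmin, qmax = 0, 255                      # quint8 range used for activations
    # ... derive affine qparams from that observed range ...
    scale = (max_val - min_val).clamp(min=1e-8) / float(qmax - qmin)
    zero_point = int((qmin - torch.round(min_val / scale)).clamp(qmin, qmax))
    # ... and quantize, mirroring the at::quantize_per_tensor call above
    return torch.quantize_per_tensor(x, scale.item(), zero_point, torch.quint8)

x = torch.randn(4, 4)
qx = dynamic_quantize_input(x)
print(qx.q_scale(), qx.q_zero_point())
```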
2. Post Training Static Quantization
Both weights and activations are quantized statically; activations are fused into the preceding layers where possible, and after preparation a calibration dataset is needed to determine the best quantization parameters for the activations.
What it shares with dynamic quantization: both convert the network's weights from float32 to int8. The difference: the training set (or data with a similar distribution) is fed through the model (with no backpropagation), and the quantization parameters of the activations are computed from the distribution of each op's inputs; this step is called calibration. Static quantization therefore also quantizes activations, i.e., it handles each op's output after its forward pass.
PTSQ API:
```python
import torch

# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval mode for static quantization logic to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('x86')

# Fuse the activations to preceding layers, where applicable.
# This needs to be done manually depending on the model architecture.
# Common fusions include `conv + relu` and `conv + batchnorm + relu`
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])

# Prepare the model for static quantization. This inserts observers in
# the model that will observe activation tensors during calibration.
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)

# calibrate the prepared model to determine quantization parameters for activations
# in a real world setting, the calibration would be done with a representative dataset
input_fp32 = torch.randn(4, 1, 4, 4)
model_fp32_prepared(input_fp32)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, and replaces key operators with quantized
# implementations.
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)
```
As the API above shows, static quantization consists of five main steps:
- 1. fuse_model: fuse certain layers to improve both speed and accuracy
- 2. set the qconfig: an instance of QConfig that holds the quantization observers
  – the two observers held by default_qconfig are listed in the table below:

| Quantization backend | activation | weight |
| --- | --- | --- |
| fbgemm | HistogramObserver (reduce_range=True) | PerChannelMinMaxObserver (default_per_channel_weight_observer) |
| qnnpack | HistogramObserver (reduce_range=False) | MinMaxObserver (default_weight_observer) |
| default (neither fbgemm nor qnnpack) | MinMaxObserver (default_observer) | MinMaxObserver (default_weight_observer) |
- 3. prepare: insert an Observer into each submodule to collect and calibrate data.
- 4. feed data: this is not training; the goal is to capture the distribution of the data so that the activation scale and zp can be computed well. At least a few hundred batches should be fed (see the sketch after this list).
- 5. convert the model: this works like dynamic quantization; essentially the type of each op is looked up, and if it is a key of the DEFAULT_STATIC_QUANT_MODULE_MAPPINGS dictionary (note: a different dictionary from the dynamic one), the op is replaced with the corresponding value.
- Activations are not calibrated in real time; instead the clipping range is pre-calibrated on validation data and then fixed (hence "static").
- Static quantization gives faster inference than dynamic quantization because the float/int conversion overhead between layers is eliminated.
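A minimal calibration loop for step 4 might look like the sketch below; calib_loader is a placeholder for a representative data loader and is not defined in this article.

```python
import torch

# Assumes model_fp32_prepared came from torch.ao.quantization.prepare(...)
# and calib_loader yields representative inputs.
model_fp32_prepared.eval()
with torch.no_grad():                       # calibration is forward-only, no backprop
    for i, (images, _labels) in enumerate(calib_loader):
        model_fp32_prepared(images)         # observers record activation ranges
        if i >= 200:                        # a few hundred batches is usually enough
            break
```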
2.1 Computing scale and zero point in static quantization
PyTorch's logic for computing scale and zero point:
```python
# when qscheme is torch.per_tensor_symmetric or torch.per_channel_symmetric
max_val = torch.max(-min_val, max_val)
scale = max_val / (float(qmax - qmin) / 2)
scale = torch.max(scale, torch.tensor(self.eps, device=device, dtype=scale.dtype))
if self.dtype == torch.quint8:
    zero_point = zero_point.new_full(zero_point.size(), 128)

# when qscheme is torch.per_tensor_affine
scale = (max_val - min_val) / float(qmax - qmin)
scale = torch.max(scale, torch.tensor(self.eps, device=device, dtype=scale.dtype))
zero_point = qmin - torch.round(min_val / scale)
zero_point = torch.max(zero_point, torch.tensor(qmin, device=device, dtype=zero_point.dtype))
zero_point = torch.min(zero_point, torch.tensor(qmax, device=device, dtype=zero_point.dtype))
```
- QuantStub's scale and zp: computed with the asymmetric quantization formula. QuantStub uses a HistogramObserver; for an input distributed over [-3, 3], the HistogramObserver yields min_val = -3 and max_val = 2.9971, while qmin and qmax are 0 and 127.
- Conv activation's scale and zp: the tensor after the convolution is quantized with the asymmetric formula. Its observer (quint8) is a HistogramObserver with reduce_range, so qmin, qmax = 0, 127; min_val and max_val are determined from the input data and the weights via the histogram's L2Norm search.
- Conv weight's scale and zp: the convolution weight tensor is quantized with the symmetric formula. The weight observer (qint8) is a PerChannelMinMaxObserver without reduce_range, so qmin, qmax = -128, 127; min_val and max_val are simply the minimum and maximum of the observed weight tensor.
- FC activation's scale and zp: computed the same way as for conv.
- FC weight's scale and zp: computed the same way as for conv weights.
- ReLU activation's scale and zp: computed with the asymmetric formula.
- During the conv, suppose a weight value is -0.7898 and the first element of the input tensor is -0.9912; the convolution should yield -0.7898 × -0.9912 = 0.7828, but the actual result is 0.7801, so an error of about 0.34% has already crept in. Every layer introduces a similar error, which is one reason fuse_modules improves accuracy (fewer quantized boundaries, fewer rounding steps).
- The biggest difference between static and dynamic quantization: in static quantization the float input must pass through QuantStub to become int and then stays int all the way to the output; in dynamic quantization the float input is quantized to int with a dynamically computed scale and zp, and the op converts its output back to float.
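The rounding error mentioned above can be reproduced with a tiny round-trip experiment; note the scale and zero-point values below are made-up illustrative numbers, not the ones PyTorch would compute for that model.

```python
import torch

w_fp, x_fp = -0.7898, -0.9912
exact = w_fp * x_fp                                   # ~0.7828

# Made-up but plausible qparams: symmetric int8 weight, affine uint8 activation
w_q = torch.quantize_per_tensor(torch.tensor([w_fp]), scale=0.0075, zero_point=0, dtype=torch.qint8)
x_q = torch.quantize_per_tensor(torch.tensor([x_fp]), scale=0.0235, zero_point=128, dtype=torch.quint8)

# Multiplying the dequantized values already shows the rounding error
approx = (w_q.dequantize() * x_q.dequantize()).item()
print(exact, approx, abs(exact - approx) / exact)     # sub-percent relative error
```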
3. Static Quantization Aware Training (QAT)
All weights and biases are stored in FP32; during the forward pass, quantization is simulated internally by FakeQuantize modules (the data is quantized and immediately dequantized).
QAT API:
```python
import torch

# define a floating point model where some layers could benefit from QAT
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval for fusion to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qat_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')

# fuse the activations to preceding layers, where applicable
# this needs to be done manually depending on the model architecture
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32,
    [['conv', 'bn', 'relu']])

# Prepare the model for QAT. This inserts observers and fake_quants in
# the model that will observe weight and activation tensors during calibration.
# The model needs to be set to train mode for QAT logic to work.
model_fp32_prepared = torch.ao.quantization.prepare_qat(model_fp32_fused.train())

# run the training loop (not shown)
training_loop(model_fp32_prepared)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, fuses modules where appropriate,
# and replaces key operators with quantized implementations.
model_fp32_prepared.eval()
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
input_fp32 = torch.randn(4, 1, 4, 4)
res = model_int8(input_fp32)
```
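The training_loop referenced in the snippet is not defined there; a minimal stand-in (purely illustrative: random data matching the 4x1x4x4 input shape, MSE loss, plain SGD) could look like the sketch below, and would of course have to be defined before the call above.

```python
import torch

def training_loop(model, steps=100):
    # Illustrative stand-in for a real training loop
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()
    model.train()
    for _ in range(steps):
        x = torch.randn(4, 1, 4, 4)          # random stand-in data
        target = torch.randn(4, 1, 4, 4)
        optimizer.zero_grad()
        loss = loss_fn(model(x), target)     # fake-quant error flows into the loss
        loss.backward()
        optimizer.step()
```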
- 1. Set the qconfig: before this, the model must be put into training mode.
  – In the QAT qconfig, the observers for both activations and weights become FakeQuantize modules (a FakeQuantize has-a observer, i.e., it contains one), and the parameters differ (qmin, qmax, dtype, qscheme, reduce_range, and so on).
  – The observer contained in FakeQuantize is MovingAverageMinMaxObserver, which inherits from the MinMaxObserver described earlier but computes the minimum and maximum slightly differently.
- 2. fuse_modules: same as in static quantization.
- 3. prepare_qat: uses the prepare_qat API. Two main differences from prepare: prepare_qat attaches a qconfig to every op, and the qconfig contents themselves are different (see step 1 above); prepare_qat also replaces some submodules in place, following the key-to-value mapping in DEFAULT_QAT_MODULE_MAPPINGS, which is again a different dictionary.
- 4. Feed data: completely different from static quantization; in QAT this step is actual training. Each op's weights are processed by self.weight_fake_quant and its outputs by self.activation_post_process; both are FakeQuantize instances, just with different observers inside.
  – fake_quantize_per_channel_or_tensor_affine in FakeQuantize's forward implements quantize followed by dequantize; as a formula: out = (clamp(round(x/scale + zero_point), quant_min, quant_max) - zero_point) * scale. In other words, the quantization error is injected into the training loss (see the sketch after this list).
  – In this way all weights and activations in QAT are fake-quantized as above and participate in both the forward and backward passes of training. Float values are rounded to (simulated) int8 values, but all computation is still carried out in float. The weights therefore feel the effect of quantization throughout optimization, hence the name quantization aware training (supported on CPU and CUDA), and accuracy is correspondingly higher.
- 5. convert: same as in static quantization; note that in QAT some modules were already swapped to new modules during prepare.
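The quantize-then-dequantize formula above can be written out directly; this is only a sketch of the math (PyTorch exposes the same computation as torch.fake_quantize_per_tensor_affine), not the autograd-aware FakeQuantize module, which additionally uses a straight-through estimator for gradients.

```python
import torch

def fake_quantize(x, scale, zero_point, quant_min=-128, quant_max=127):
    # out = (clamp(round(x/scale + zero_point), quant_min, quant_max) - zero_point) * scale
    q = torch.clamp(torch.round(x / scale + zero_point), quant_min, quant_max)
    return (q - zero_point) * scale

x = torch.randn(2, 3)
x_fq = fake_quantize(x, scale=0.02, zero_point=0)
print((x - x_fq).abs().max())   # the simulated quantization error seen during QAT
```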
IV. Quantization Stack
Components used in the quantization flow:
- Observer and FakeQuantize
  – Observer: collects tensor statistics, e.g. the minimum and maximum values, and computes the quantization parameters
  – FakeQuantize: the fake-quantization module
- QConfig: a namedtuple of Observer and FakeQuantize module classes, which can be configured
  – different kinds of Observer / FakeQuantize can be selected
  – weights and activations are configured separately
V. Quantization APIs
Official docs: https://pytorch.org/docs/stable/quantization-support.html Reference: https://zhuanlan.zhihu.com/p/299108528
1. Top-level APIs
1.1 quantize & quantize_dynamic & quantize_qat: post-training static quantization / dynamic (weights-only) quantization / quantization aware training;
finer-grained control is configured through qconfig.
- quantize: the model must be prepared and calibrated first (the run_fn argument runs calibration); the API is
torch.ao.quantization.quantize(model, run_fn, run_args, mapping=None, inplace=False)
- quantize_dynamic: essentially looks up the type of each op in the model; if that type is a key of the DEFAULT_DYNAMIC_QUANT_MODULE_MAPPINGS dictionary, the op is replaced by the corresponding value. The API is shown below.
  – the qconfig_spec argument specifies a set of qconfigs, i.e. which op (operation: the various ops in a CNN such as conv, linear, batchnorm, ...) gets which qconfig
  – each qconfig is an instance of the QConfig class and wraps two observers;
  – the two observers are for the weights and the activations respectively
  – qconfig_spec=None selects the default behavior
  – qconfig_spec can be a set, e.g. {nn.LSTM, nn.Linear}, meaning those layer types in the current model should be dynamically quantized;
  – qconfig_spec can be a dict whose keys are submodule names or types and whose values are QConfigDynamic instances (a sketch of both forms appears at the end of this subsection)
torch.ao.quantization.quantize_dynamic(model, qconfig_spec=None, dtype=torch.qint8, mapping=None, inplace=False)
The DEFAULT_DYNAMIC_QUANT_MODULE_MAPPINGS dictionary:
```python
# Default map for swapping dynamic modules
DEFAULT_DYNAMIC_QUANT_MODULE_MAPPINGS = {
    nn.GRUCell: nnqd.GRUCell,
    nn.Linear: nnqd.Linear,
    nn.LSTM: nnqd.LSTM,
    nn.LSTMCell: nnqd.LSTMCell,
    nn.RNNCell: nnqd.RNNCell,
}
```
- When a type is swapped from key to value, the new type has to be instantiated while reusing the previous weights; this is generally done through from_float().
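For illustration, the two qconfig_spec forms described above could be used roughly as in the sketch below; TinyNet is a made-up model, and default_dynamic_qconfig is used as the per-module config.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic, default_dynamic_qconfig

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 8)
        self.rnn = nn.LSTM(8, 8)

    def forward(self, x):
        out, _ = self.rnn(self.fc(x))
        return out

# Form 1: a set of module types -> quantize every Linear and LSTM with the default dynamic qconfig
m_a = quantize_dynamic(TinyNet(), {nn.Linear, nn.LSTM}, dtype=torch.qint8)

# Form 2: a dict keyed by submodule name or type, valued by a dynamic qconfig
m_b = quantize_dynamic(TinyNet(), {'fc': default_dynamic_qconfig, nn.LSTM: default_dynamic_qconfig})
```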
1.2 prepare & prepare_qat: prepare a copy of the model for calibration or quantization aware training; .qconfig must be configured first.
- prepare is used in post-training static quantization (PTQ): it inserts observer modules so that activation tensors are observed during calibration;
torch.ao.quantization.prepare(model, inplace=False, allow_list=None, observer_non_leaf_module_list=None, prepare_custom_config_dict=None)
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)
- prepare_qat is used in quantization aware training (QAT): it inserts observer and fake_quant modules; the model must be in train() mode, and weight and activation tensors are observed during calibration.
torch.ao.quantization.prepare_qat(model, mapping=None, inplace=False)
model_fp32_prepared = torch.ao.quantization.prepare_qat(model_fp32_fused.train())
1.3 convert: converts the submodules of the input module into different modules according to the mapping, by calling the from_float method of the target module class. If remove_qconfig is True, the qconfig attributes are removed at the end.
In QAT the whole computation runs in floating point; at the end of training, convert turns the float model into the quantized one.
torch.ao.quantization.convert(module, mapping=None, inplace=False, remove_qconfig=True, is_reference=False, convert_custom_config_dict=None)
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)
- from_float()
The from_float method of the nnqat.Linear module looks like this:
```python
@classmethod
def from_float(cls, mod):
    r"""Create a qat module from a float module or qparams_dict

    Args: `mod` a float module, either produced by torch.ao.quantization utilities
    or directly from user
    """
    assert type_before_parametrizations(mod) == cls._FLOAT_MODULE, (
        " qat."
        + cls.__name__
        + ".from_float only works for "
        + cls._FLOAT_MODULE.__name__
    )
    assert hasattr(mod, "qconfig"), "Input float module must have qconfig defined"
    assert mod.qconfig, "Input float module must have a valid qconfig"
    if type_before_parametrizations(mod) == LinearReLU:
        mod = mod[0]

    qconfig = mod.qconfig
    qat_linear = cls(mod.in_features, mod.out_features, bias=mod.bias is not None, qconfig=qconfig)

    if is_parametrized(mod, "weight"):
        transfer_parametrizations_and_params(mod, qat_linear, "weight")
    else:
        qat_linear.weight = mod.weight

    if is_parametrized(mod, "bias"):
        transfer_parametrizations_and_params(mod, qat_linear, "bias")
    else:
        qat_linear.bias = mod.bias

    return qat_linear
```
This method constructs a qat_linear instance of the class.
from_float() mainly does the following:
- uses a MinMaxObserver to compute the minimum and maximum of the op's weight tensor (in this example there is only a Linear op), narrowing the value range to improve quantization accuracy;
- from the min_val and max_val obtained above, together with the qmin and qmax determined by the algorithm, computes scale and zp and then the quantized weight;
- instantiates nnqd.Linear and sets the quantized weight and the original bias on the new layer via qlinear.set_weight_bias. The last step also packs the weight and bias; in the source code it looks like this:
```cpp
#ifdef USE_FBGEMM
  if (ctx.qEngine() == at::QEngine::FBGEMM) {
    return PackedLinearWeight::prepack(std::move(weight), std::move(bias));
  }
#endif
#ifdef USE_PYTORCH_QNNPACK
  if (ctx.qEngine() == at::QEngine::QNNPACK) {
    return PackedLinearWeightsQnnp::prepack(std::move(weight), std::move(bias));
  }
#endif
  TORCH_CHECK(false,
      "Didn't find engine for operation quantized::linear_prepack ",
      toString(ctx.qEngine()));
```
In other words, it relies on backends such as FBGEMM and QNNPACK.
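To make the packing step concrete, here is a small sketch (assuming a PyTorch build with the FBGEMM or QNNPACK backend available) that quantizes a Linear weight roughly the way the observer-driven flow would and then packs it with the quantized::linear_prepack op implemented by the C++ above:

```python
import torch

fp_linear = torch.nn.Linear(4, 4)

# Symmetric per-tensor int8 quantization of the weight (simplified stand-in for the observer logic)
w = fp_linear.weight.detach()
scale = (w.abs().max() / 127.0).clamp(min=1e-8).item()
q_weight = torch.quantize_per_tensor(w, scale, 0, torch.qint8)

# Pack the quantized weight together with the float bias for the backend kernel
packed = torch.ops.quantized.linear_prepack(q_weight, fp_linear.bias.detach())
```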
2. Preparation Before Quantization
2.1 fuse_modules: fuse modules. Common fusions are conv + ReLU and conv + BN + ReLU; this has to be done manually according to the model architecture.
```python
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'bn', 'relu']])
```
2.2 QuantStub & DeQuantStub: quantization and dequantization stubs
They need to be inserted into the CNN structure manually.
- QuantStub: the quantize stub module; before calibration it behaves the same as an observer, and convert swaps it for nnq.Quantize;
- DeQuantStub: the dequantize stub module; in the prepare stage it is equivalent to Identity, and convert swaps it for nnq.DeQuantize.
3. torch.ao.quantization.observer
- ObserverBase
- MinMaxObserver
- and others (e.g. HistogramObserver, PerChannelMinMaxObserver, MovingAverageMinMaxObserver)
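A minimal standalone use of an observer, to see what it records and what it returns:

```python
import torch
from torch.ao.quantization.observer import MinMaxObserver

obs = MinMaxObserver(dtype=torch.qint8, qscheme=torch.per_tensor_symmetric)
obs(torch.randn(16, 16))                    # the forward call only records min/max
scale, zero_point = obs.calculate_qparams() # qparams derived from the recorded range
print(obs.min_val, obs.max_val, scale, zero_point)
```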
4. torch.ao.quantization.qconfig
Defines the QConfig objects used to configure the quantization settings of individual operations.
4.1 QConfig: describes how to quantize a layer or a part of the network by specifying observer classes for activations and weights separately.
It must hold observer classes (such as MinMaxObserver) or callables that return observer instances when called, not concrete observer instances themselves.
4.2 default_qconfig: the default qconfig.
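For example, a QConfig similar in spirit to the fbgemm/x86 default could be assembled by hand as below; this is a sketch, and in practice get_default_qconfig('x86') does the equivalent for you.

```python
import torch
from torch.ao.quantization import QConfig
from torch.ao.quantization.observer import HistogramObserver, PerChannelMinMaxObserver

my_qconfig = QConfig(
    # activation/weight are observer factories (classes or with_args partials), not instances
    activation=HistogramObserver.with_args(reduce_range=True),
    weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8,
                                              qscheme=torch.per_channel_symmetric),
)
# model_fp32.qconfig = my_qconfig   # then fuse / prepare / calibrate / convert as usual
```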
VI. Quantization .qconfig
The function that returns the default qconfig is defined below. Two backends are commonly used: fbgemm quantizes weights per channel, while qnnpack quantizes per tensor (per layer). Nowadays 'fbgemm' can be replaced by 'x86', which is the recommended default.
```python
def get_default_qconfig(backend='fbgemm', version=0):
    """
    Returns the default PTQ qconfig for the specified backend.

    Args:
      * `backend`: a string representing the target backend. Currently supports `fbgemm`,
        `qnnpack` and `onednn`.

    Return:
        qconfig
    """
    if version == 0:
        if backend == 'fbgemm':
            qconfig = QConfig(activation=HistogramObserver.with_args(reduce_range=True),
                              weight=default_per_channel_weight_observer)
        elif backend == 'qnnpack':
            qconfig = QConfig(activation=HistogramObserver.with_args(reduce_range=False),
                              weight=default_weight_observer)
        elif backend == 'onednn':
            qconfig = QConfig(activation=HistogramObserver.with_args(reduce_range=False),
                              weight=default_per_channel_weight_observer)
        else:
            qconfig = default_qconfig
    else:
        raise AssertionError("Version number: " + str(version) +
                             " in get_default_qconfig is not supported. Version number must be 0")

    return qconfig
```
```python
# typical ways of attaching a qconfig to a model
myModel.qconfig = torch.quantization.default_qconfig
per_channel_quantized_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
qat_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
```
The with_args used above is defined as
with_args = classmethod(_with_args)
It is used when you need multiple instances created with the same constructor arguments; it behaves like a factory that binds the constructor arguments (in the spirit of functools.partial) and returns a fresh instance each time it is called.
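A quick sketch of how with_args behaves in practice:

```python
import torch
from torch.ao.quantization.observer import MinMaxObserver

# Bind constructor arguments once, then create fresh observer instances on demand
factory = MinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_tensor_symmetric)
obs_a, obs_b = factory(), factory()
print(type(obs_a).__name__, obs_a is obs_b)   # MinMaxObserver False
```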