


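NVIDIA has released new TensorRT optimizations for BERT that allow you to perform inference in 2.2 ms* on T4 GPUs. This is 17x faster than CPU-only platforms and well within the 10 ms latency budget required for conversational AI applications. These optimizations make it practical to use BERT in production, for example, as part of a conversational AI service.

TensorRT is a platform for high-performance deep learning inference. It includes an optimizer and a runtime that minimize latency and maximize throughput in production. With TensorRT, you can optimize models trained in all major frameworks, calibrate for lower precision with high accuracy, and finally deploy in production.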
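The latency figures quoted in the introduction can be sanity-checked with a little arithmetic. The sketch below uses only the three numbers from the text (2.2 ms, 17x, 10 ms). The CPU latency it prints is derived from the quoted speedup, not a separately measured value, and the helper name within_budget is ours.

```python
# Figures quoted in the article.
GPU_LATENCY_MS = 2.2       # BERT inference on a T4 GPU with TensorRT
SPEEDUP_VS_CPU = 17        # "17x faster than CPU-only platforms"
LATENCY_BUDGET_MS = 10.0   # budget for conversational AI applications

# Derived, not measured: the CPU latency implied by the quoted speedup.
implied_cpu_latency_ms = GPU_LATENCY_MS * SPEEDUP_VS_CPU


def within_budget(latency_ms, budget_ms=LATENCY_BUDGET_MS):
    """Return True if one inference call fits inside the latency budget."""
    return latency_ms <= budget_ms


print(f"Implied CPU latency: {implied_cpu_latency_ms:.1f} ms")
print(f"GPU within budget: {within_budget(GPU_LATENCY_MS)}")
print(f"CPU within budget: {within_budget(implied_cpu_latency_ms)}")
print(f"Headroom on GPU: {LATENCY_BUDGET_MS - GPU_LATENCY_MS:.1f} ms")
```

The headroom matters because a conversational AI service typically spends part of the same budget on other stages around the language model.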


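Question answering (QA), or reading comprehension, is a very popular way to test the ability of models to understand context. The SQuAD leaderboard tracks the top performers for this task, along with the dataset and test set they provide. QA capabilities have developed rapidly over the last few years, with global contributions from academia and companies. In this article, we demonstrate how to create a simple question answering application in Python, powered by the TensorRT-optimized BERT code we are releasing today. The example provides an API to input passages and questions, and it returns responses generated by the BERT model.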

Let’s start by reviewing the steps to perform training and inference with TensorRT for BERT.

BERT Training and Inference Pipeline







  1. Create a TensorRT engine by passing the fine-tuned weights and network definition to the TensorRT builder.
  2. Start the TensorRT runtime with this engine.
  3. Feed a passage and a question to the TensorRT runtime and receive as output the answer predicted by the network.

This entire workflow is outlined in Figure 2.
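Step 3 feeds a passage and a question to the runtime. As a rough illustration of what that input looks like, the sketch below packs the two strings into the fixed-length layout that BERT models expect: [CLS] question [SEP] passage [SEP], together with segment IDs and an input mask. It is a simplification under stated assumptions: encode_qa_input is a hypothetical helper, not part of the sample, and whitespace splitting stands in for the WordPiece tokenization that BERT performs with its vocabulary file.

```python
def encode_qa_input(question, passage, max_seq_length=384):
    """Pack a question and a passage into fixed-length BERT-style inputs.

    Assumption: whitespace splitting replaces real WordPiece tokenization.
    """
    q_tokens = question.lower().split()
    p_tokens = passage.lower().split()

    # Reserve three slots for [CLS] and the two [SEP] markers.
    room_for_passage = max_seq_length - len(q_tokens) - 3
    if room_for_passage < 0:
        raise ValueError("question is too long for the sequence length")
    p_tokens = p_tokens[:room_for_passage]  # truncate an over-long passage

    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + p_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and the first [SEP];
    # segment 1 covers the passage and the final [SEP].
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(p_tokens) + 1)
    input_mask = [1] * len(tokens)  # 1 marks real tokens, 0 marks padding

    padding = max_seq_length - len(tokens)
    tokens += ["[PAD]"] * padding
    segment_ids += [0] * padding
    input_mask += [0] * padding
    return tokens, segment_ids, input_mask


tokens, segment_ids, input_mask = encode_qa_input(
    "What is TensorRT?", "TensorRT is a platform", max_seq_length=16
)
print(tokens)
print(segment_ids)
print(input_mask)
```

The fixed length is why the sequence length chosen when building the engine (128 or 384) has to match the fine-tuned model: every input is padded or truncated to that size.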




Figure 1: Generating BERT TensorRT engine from pretrained checkpoints





Figure 2: Workflow to perform inference with TensorRT runtime engine for BERT QA task

Let’s Run the Sample!

Set up your environment to perform BERT inference with the steps below:

  1. Create a Docker image with the prerequisites
  2. Compile TensorRT optimized plugins
  3. Build the TensorRT engine from the fine-tuned weights
  4. Perform inference given a passage and a query

We use scripts to perform these steps, which you can find in the TensorRT BERT sample repo. While we describe several options you can pass to each script, you could also execute the code below at the command prompt to get started quickly:

# Clone the TensorRT repository, check out the specific release, and navigate to the BERT demo directory

git clone --recursive https://github.com/NVIDIA/TensorRT && cd TensorRT/ && git checkout release/5.1 && cd demo/BERT


# Create and launch the Docker image

sh python/create_docker_container.sh


# Build the plugins and download the fine-tuned models

cd TensorRT/demo/BERT && sh python/build_examples.sh


# Build the TensorRT runtime engine

python python/bert_builder.py -m /workspace/models/fine-tuned/bert_tf_v2_base_fp16_384_v2/model.ckpt-8144 -o bert_base_384.engine -b 1 -s 384 -c /workspace/models/fine-tuned/bert_tf_v2_base_fp16_384_v2

Now, give it a passage and see how much information it can decipher by asking it a few questions.

python python/bert_inference.py -e bert_base_384.engine -p "TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps." -q "What is TensorRT?" -v /workspace/models/fine-tuned/bert_tf_v2_base_fp16_384_v2/vocab.txt -b 1


Passage: TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps.


Question: What is TensorRT?


Answer: 'a high performance deep learning inference platform'

--- Given the same passage with a different question ---

Question: What is included in TensorRT?


Answer: 'parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference'

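The answers provided by the model are accurate based on the text of the passage provided. The sample uses FP16 precision for performing inference with TensorRT. This helps achieve the highest performance possible on the Tensor Cores in NVIDIA GPUs. In our tests, we measured the accuracy of TensorRT to be comparable to in-framework inference with FP16 precision.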
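To see how answers like the ones above can be read off the network output, recall that a BERT model fine-tuned for SQuAD scores every token as a possible start and a possible end of the answer. The sketch below picks the highest-scoring valid span. It is an illustration rather than the sample's post-processing code: best_answer_span and max_answer_length are hypothetical names, and the logits are made-up numbers.

```python
def best_answer_span(start_logits, end_logits, max_answer_length=30):
    """Return ((start, end), score) for the best span with start <= end."""
    best_span = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length cap.
        last = min(start + max_answer_length, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best_span = (start, end)
    return best_span, best_score


# Made-up logits for a nine-token passage (illustrative values only).
tokens = ["tensorrt", "is", "a", "high", "performance",
          "inference", "platform", "for", "gpus"]
start_logits = [0.1, 0.0, 2.5, 1.0, 0.2, 0.1, 0.3, 0.0, 0.1]
end_logits = [0.0, 0.1, 0.2, 0.3, 0.5, 0.9, 3.0, 0.2, 0.4]

(start, end), score = best_answer_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(f"Answer: '{answer}' (score {score:.1f})")
```

Restricting the search to spans where the end does not precede the start keeps the result a contiguous piece of the passage, which is why the answers are always quoted verbatim from the text.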


sh create_docker_container.sh


Usage: sh build_examples.sh [base | large] [ft-fp16 | ft-fp32] [128 | 384]

  • base | large – determine whether to download a BERT-base or BERT-large model to optimize
  • ft-fp16 | ft-fp32 – determine whether to download a BERT model fine-tuned with precision FP16 or FP32
  • 128 | 384 – determine whether to download a BERT model for sequence length 128 or 384


# Running with default parameters

sh build_examples.sh


# Running with custom parameters (BERT-large, FP32 fine-tuned weights, 128 sequence length)

sh build_examples.sh large ft-fp32 128

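This script first uses the code from the sample repository to build the TensorRT plugins for BERT inference. Next, it downloads and installs the NGC CLI and uses it to download a fine-tuned model from NVIDIA's NGC model repository. The command-line parameters passed to build_examples.sh specify the model to optimize with TensorRT. By default, it downloads a fine-tuned BERT-base model with FP16 precision and a sequence length of 384.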
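If you drive build_examples.sh from Python, for instance to prepare several model variants, it can help to validate the three positional options before launching anything. The following is a minimal sketch under that assumption. build_examples_command is a hypothetical helper, and it always passes all three positions explicitly rather than relying on the script's defaults.

```python
import shlex

# Allowed values, taken from the usage line of build_examples.sh.
MODEL_SIZES = ("base", "large")
FINE_TUNE_PRECISIONS = ("ft-fp16", "ft-fp32")
SEQUENCE_LENGTHS = (128, 384)


def build_examples_command(size="base", precision="ft-fp16", seq_len=384):
    """Return the argument list for build_examples.sh after validating it."""
    if size not in MODEL_SIZES:
        raise ValueError(f"size must be one of {MODEL_SIZES}")
    if precision not in FINE_TUNE_PRECISIONS:
        raise ValueError(f"precision must be one of {FINE_TUNE_PRECISIONS}")
    if seq_len not in SEQUENCE_LENGTHS:
        raise ValueError(f"seq_len must be one of {SEQUENCE_LENGTHS}")
    return ["sh", "build_examples.sh", size, precision, str(seq_len)]


# Defaults match the documented behaviour: BERT-base, FP16, length 384.
print(shlex.join(build_examples_command()))
# Custom variant: BERT-large, FP32 fine-tuned weights, length 128.
print(shlex.join(build_examples_command("large", "ft-fp32", 128)))
```

The returned list can be handed to subprocess.run(cmd, check=True) from the directory that contains the script.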



Usage: python bert_builder.py -m <checkpoint> -o <bert.engine> -b <batch size> -s <sequence length> -c <config_file_directory>

  • -m – checkpoint file for the fine-tuned model
  • -o – path for the output TensorRT engine file (e.g. bert.engine)
  • -b – batch size for inference (default=1)
  • -s – sequence length matching the downloaded BERT fine-tuned model
  • -c – directory containing the configuration file for BERT parameters (attention heads, hidden layers, etc.)


python python/bert_builder.py -m /workspace/models/fine-tuned/bert_tf_v2_base_fp16_384_v2/model.ckpt-8144 -o bert_base_384.engine -b 1 -s 384 -c /workspace/models/fine-tuned/bert_tf_v2_base_fp16_384_v2
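The checkpoint path and the configuration directory in this command share the same model directory, so the invocation is easy to assemble programmatically. The sketch below does that with the flags documented above. bert_builder_command is a hypothetical helper of ours, and the paths are the ones used in this article.

```python
import shlex


def bert_builder_command(model_dir, checkpoint, engine_path,
                         batch_size=1, seq_len=384):
    """Assemble the bert_builder.py invocation as an argument list."""
    return [
        "python", "python/bert_builder.py",
        "-m", f"{model_dir}/{checkpoint}",  # fine-tuned checkpoint
        "-o", engine_path,                  # output TensorRT engine file
        "-b", str(batch_size),              # batch size for inference
        "-s", str(seq_len),                 # must match the fine-tuned model
        "-c", model_dir,                    # directory with the BERT config file
    ]


cmd = bert_builder_command(
    model_dir="/workspace/models/fine-tuned/bert_tf_v2_base_fp16_384_v2",
    checkpoint="model.ckpt-8144",
    engine_path="bert_base_384.engine",
)
print(shlex.join(cmd))
# To launch it from Python: subprocess.run(cmd, check=True)
```

Keeping the sequence length as a single argument makes it harder to build an engine for 384 tokens from a model that was fine-tuned for 128, or the reverse.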

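You should now have a TensorRT engine (for example, bert.engine) to use in the inference script (bert_inference.py) for QA. We describe the process of building the TensorRT engine in later sections. You can now provide a passage and a query to bert_inference.py and see whether the model answers your queries correctly. There are a few ways to interact with the inference script: the passage and question can be provided as command-line arguments (using the --passage and --question flags), or they can be passed in from files (using the --passage_file and --question_file flags). If neither flag is given, the user is prompted to enter the passage and question after execution begins. The parameters for the bert_inference.py script are as follows: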

Usage: python bert_inference.py --bert_engine <bert.engine> [--passage | --passage_file] [--question | --question_file] --vocab_file <vocabulary file> --batch_size <batch_size>

  • -e, --bert_engine – path to the TensorRT engine created in the previous step
  • -p, --passage – text for paragraph/passage for BERT QA
  • -pf, --passage_file – file containing text for paragraph/passage
  • -q, --question – text for query/question for BERT QA
  • -qf, --question_file – file containing text for query/question
  • -v, --vocab_file – file containing entire dictionary of words
  • -b, --batch_size – batch size for inference
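The flags above map naturally onto Python's argparse module. The following sketch shows one way to declare them and to resolve the passage and question in the order flag, file, interactive prompt. It is not the sample's actual source: make_parser and resolve_text are hypothetical names, and the default batch size of 1 is an assumption based on the commands shown earlier.

```python
import argparse


def make_parser():
    """Build a parser that mirrors the flags listed above (illustrative only)."""
    parser = argparse.ArgumentParser(description="BERT QA inference (sketch)")
    parser.add_argument("-e", "--bert_engine", required=True,
                        help="path to the TensorRT engine")
    parser.add_argument("-p", "--passage", help="passage text")
    parser.add_argument("-pf", "--passage_file",
                        help="file containing the passage")
    parser.add_argument("-q", "--question", help="question text")
    parser.add_argument("-qf", "--question_file",
                        help="file containing the question")
    parser.add_argument("-v", "--vocab_file", required=True,
                        help="vocabulary file")
    # Assumption: default to batch size 1, as in the commands shown earlier.
    parser.add_argument("-b", "--batch_size", type=int, default=1,
                        help="batch size for inference")
    return parser


def resolve_text(inline, path, prompt, ask=input):
    """Take the text from the flag, then the file, then an interactive prompt."""
    if inline is not None:
        return inline
    if path is not None:
        with open(path, encoding="utf-8") as handle:
            return handle.read().strip()
    return ask(prompt)


# Parse an explicit argument list here instead of sys.argv.
args = make_parser().parse_args([
    "-e", "bert_base_384.engine",
    "-p", "TensorRT is a high performance deep learning inference platform.",
    "-q", "What is TensorRT?",
    "-v", "vocab.txt",
])
passage = resolve_text(args.passage, args.passage_file, "Passage: ")
question = resolve_text(args.question, args.question_file, "Question: ")
print(f"Passage: {passage}")
print(f"Question: {question}")
```

Passing the prompt function as a parameter keeps the fallback testable: a script can substitute its own callable where a person would otherwise type the text.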