Python OpenCV CUDA acceleration: running OpenCV on the GPU

Reposted

doscommand 2024-02-03 22:59:36



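In this tutorial, you will learn how to use OpenCV's "dnn" module with an NVIDIA GPU to speed up object detection (YOLO and SSD) and instance segmentation (Mask R-CNN) by up to 1,549%.

Last week, we covered how to configure and install OpenCV and its "deep neural network" (dnn) module so that inference runs on an NVIDIA GPU.

With OpenCV's GPU-optimized dnn module, pushing a network's computation from the CPU to the GPU takes just three lines of code: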

# load the model from disk and set the backend target to a
# CUDA-enabled GPU
net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

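Today we will walk through complete code examples in more detail. By the end of this tutorial, you will be able to run:

Single Shot Detectors (SSDs) at 65.90 FPS
YOLO object detection at 11.87 FPS
Mask R-CNN instance segmentation at 11.05 FPS

To learn how to use OpenCV's dnn module and an NVIDIA GPU for faster object detection and instance segmentation, keep reading!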

OpenCV 'dnn' with NVIDIA GPUs: 1,549% faster YOLO, SSD, and Mask R-CNN

In this tutorial, you will learn how to implement Single Shot Detectors, YOLO, and Mask R-CNN using OpenCV's "deep neural network" (dnn) module and an NVIDIA/CUDA-enabled GPU.

Compiling OpenCV's "dnn" module with NVIDIA GPU support


The project structure is as follows:

$ tree --dirsfirst
.
├── example_videos
│   ├── dog_park.mp4
│   ├── guitar.mp4
│   └── janie.mp4
├── opencv-ssd-cuda
│   ├── MobileNetSSD_deploy.caffemodel
│   ├── MobileNetSSD_deploy.prototxt
│   └── ssd_object_detection.py
├── opencv-yolo-cuda
│   ├── yolo-coco
│   │   ├── coco.names
│   │   ├── yolov3.cfg
│   │   └── yolov3.weights
│   └── yolo_object_detection.py
├── opencv-mask-rcnn-cuda
│   ├── mask-rcnn-coco
│   │   ├── colors.txt
│   │   ├── frozen_inference_graph.pb
│   │   ├── mask_rcnn_inception_v2_coco_2018_01_28.pbtxt
│   │   └── object_detection_classes_coco.txt
│   └── mask_rcnn_segmentation.py
└── output_videos
7 directories, 15 files

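In today's tutorial, we will review three Python scripts:

ssd_object_detection.py: Performs Caffe-based MobileNet SSD object detection with CUDA on 20 object classes (the PASCAL VOC label set hard-coded in the script).
yolo_object_detection.py: Performs YOLO V3 object detection with CUDA on 80 COCO classes.
mask_rcnn_segmentation.py: Performs TensorFlow-based Mask R-CNN (Inception V2) instance segmentation with CUDA on 90 COCO classes.

Each model file and class-name file lives in its own folder, with the exception of MobileNet SSD, whose class names are hard-coded in a Python list inside the script. Let's look at the folder names in the order we will use them today: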

opencv-ssd-cuda/
opencv-yolo-cuda/
opencv-mask-rcnn-cuda/

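As all three directory names indicate, we will be using OpenCV's DNN module compiled with CUDA support. If your OpenCV build was not compiled with CUDA support for your NVIDIA GPU, configure your system first by following the instructions in last week's tutorial.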

Implementing Single Shot Detectors (SSDs) using OpenCV's NVIDIA GPU-enabled 'dnn' module


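The first object detector we will look at is the Single Shot Detector (SSD), which we originally covered back in 2017:

Object detection with deep learning and OpenCV
Real-time object detection with deep learning and OpenCV

Back then we could only run these SSDs on a CPU; today, however, I will show you how to use an NVIDIA GPU to improve inference speed by up to 211%.

Open the ssd_object_detection.py file in the project directory structure and insert the following code: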

# import the necessary packages
from imutils.video import FPS
import numpy as np
import argparse
import imutils
import cv2
# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-p", "--prototxt", required=True,
	help="path to Caffe 'deploy' prototxt file")
ap.add_argument("-m", "--model", required=True,
	help="path to Caffe pre-trained model")
ap.add_argument("-i", "--input", type=str, default="",
	help="path to (optional) input video file")
ap.add_argument("-o", "--output", type=str, default="",
	help="path to (optional) output video file")
ap.add_argument("-d", "--display", type=int, default=1,
	help="whether or not output frame should be displayed")
ap.add_argument("-c", "--confidence", type=float, default=0.2,
	help="minimum probability to filter weak detections")
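# type=int rather than type=bool: bool("0") is True, so "--use-gpu 0"
# would otherwise switch the GPU on
ap.add_argument("-u", "--use-gpu", type=int, default=0,
	help="whether the CUDA GPU should be used (1 = yes, 0 = no)")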
args = vars(ap.parse_args())

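Here we import our packages. Notice that no special CUDA imports are needed: the CUDA functionality is built into the cv2 import (the last import in the block) thanks to last week's compile.

Next, let's review the command line arguments: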

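--prototxt: Path to our pre-trained Caffe MobileNet SSD "deploy" prototxt file.
--model: Path to our pre-trained Caffe MobileNet SSD model.
--input: Optional path to our input video file. If it is not supplied, your first camera is used by default.
--output: Optional path to our output video file.
--display: Optional integer flag (0 or 1) indicating whether to display output frames in an OpenCV GUI window. Displaying frames costs CPU cycles, so for a true benchmark you may want to turn display off (it is on by default).
--confidence: Minimum probability threshold for filtering weak detections. The default is 20%, but you can override it if you wish.
--use-gpu: Flag indicating whether the CUDA GPU should be used. It is off by default. To use your NVIDIA CUDA-enabled GPU for object detection with OpenCV, pass a value of 1 to this argument.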
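
One caveat with on/off flags like --use-gpu: when an argument is declared with type=bool, argparse calls bool() on the raw string, and any non-empty string, including "0", is truthy. The following is a minimal sketch of that pitfall and one way around it. The str_to_bool helper is a hypothetical name introduced here for illustration and is not part of the original scripts.

```python
import argparse


def str_to_bool(value):
    """Hypothetical helper: map common on/off spellings to a real bool."""
    text = value.strip().lower()
    if text in ("1", "true", "yes", "on"):
        return True
    if text in ("0", "false", "no", "off"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


# the pitfall: bool() on any non-empty string is True, even for "0"
naive = argparse.ArgumentParser()
naive.add_argument("-u", "--use-gpu", type=bool, default=False)
naive_args = vars(naive.parse_args(["--use-gpu", "0"]))

# a safer variant that parses the text explicitly
safe = argparse.ArgumentParser()
safe.add_argument("-u", "--use-gpu", type=str_to_bool, default=False)
safe_args = vars(safe.parse_args(["--use-gpu", "0"]))

print(naive_args["use_gpu"])  # True, although the user asked for 0
print(safe_args["use_gpu"])   # False
```

Declaring the flag with type=int and testing its truthiness, as the --display argument already does, is an equally simple alternative.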

Next we specify our classes and associated random colors:

# initialize the list of class labels MobileNet SSD was trained to
# detect, then generate a set of bounding box colors for each class
CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
	"bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
	"dog", "horse", "motorbike", "person", "pottedplant", "sheep",
	"sofa", "train", "tvmonitor"]
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))

Then we load our Caffe-based model:

# load our serialized model from disk
net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
# check if we are going to use GPU
if args["use_gpu"]:
	# set CUDA as the preferable backend and target
	print("[INFO] setting preferable backend and target to CUDA...")
	net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
	net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

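As the first statement shows, we use OpenCV's dnn module to load our Caffe object detection model.

We then check whether an NVIDIA CUDA-enabled GPU should be used and, if so, set the backend and target accordingly.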
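
The backend selection can also be wrapped in a small function that first asks OpenCV whether any CUDA device is visible. The sketch below is one possible shape for such a helper, not code from the original scripts. The name prefer_cuda is hypothetical, and the FakeNet class and fake_cv2 namespace are stand-ins so the example runs on a machine without OpenCV or a GPU. It assumes cv2.cuda.getCudaEnabledDeviceCount() reports 0 when no usable CUDA device is present.

```python
from types import SimpleNamespace


def prefer_cuda(net, cv2_module):
    """Select the CUDA backend/target on `net` if a CUDA device is visible.

    `cv2_module` is the imported cv2 module, passed in explicitly so the
    function can be exercised without a GPU. Returns True when CUDA was
    selected and False when the network was left on its default path.
    """
    if cv2_module.cuda.getCudaEnabledDeviceCount() < 1:
        return False
    net.setPreferableBackend(cv2_module.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(cv2_module.dnn.DNN_TARGET_CUDA)
    return True


class FakeNet:
    """Stand-in for a cv2.dnn network; it only records what was requested."""

    def __init__(self):
        self.backend = None
        self.target = None

    def setPreferableBackend(self, backend):
        self.backend = backend

    def setPreferableTarget(self, target):
        self.target = target


def make_fake_cv2(device_count):
    """Stand-in for the cv2 module exposing only the names used above."""
    return SimpleNamespace(
        cuda=SimpleNamespace(getCudaEnabledDeviceCount=lambda: device_count),
        dnn=SimpleNamespace(DNN_BACKEND_CUDA="cuda-backend",
                            DNN_TARGET_CUDA="cuda-target"),
    )


cpu_only_net = FakeNet()
gpu_net = FakeNet()
print(prefer_cuda(cpu_only_net, make_fake_cv2(0)))  # False
print(prefer_cuda(gpu_net, make_fake_cv2(1)))       # True
```

In a real script the call would be prefer_cuda(net, cv2), guarded by the --use-gpu flag.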

Let's go ahead and start processing frames and performing object detection, on the GPU if the --use-gpu command line argument is turned on:

# initialize the video stream and pointer to output video file, then
# start the FPS timer
print("[INFO] accessing video stream...")
vs = cv2.VideoCapture(args["input"] if args["input"] else 0)
writer = None
fps = FPS().start()
# loop over the frames from the video stream
while True:
	# read the next frame from the file
	(grabbed, frame) = vs.read()
	# if the frame was not grabbed, then we have reached the end
	# of the stream
	if not grabbed:
		break
	# resize the frame, grab the frame dimensions, and convert it to
	# a blob
	frame = imutils.resize(frame, width=400)
	(h, w) = frame.shape[:2]
	blob = cv2.dnn.blobFromImage(frame, 0.007843, (300, 300), 127.5)
	# pass the blob through the network and obtain the detections and
	# predictions
	net.setInput(blob)
	detections = net.forward()
	# loop over the detections
	for i in np.arange(0, detections.shape[2]):
		# extract the confidence (i.e., probability) associated with
		# the prediction
		confidence = detections[0, 0, i, 2]
		# filter out weak detections by ensuring the `confidence` is
		# greater than the minimum confidence
		if confidence > args["confidence"]:
			# extract the index of the class label from the
			# `detections`, then compute the (x, y)-coordinates of
			# the bounding box for the object
			idx = int(detections[0, 0, i, 1])
			box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
			(startX, startY, endX, endY) = box.astype("int")
			# draw the prediction on the frame
			label = "{}: {:.2f}%".format(CLASSES[idx],
				confidence * 100)
			cv2.rectangle(frame, (startX, startY), (endX, endY),
				COLORS[idx], 2)
			y = startY - 15 if startY - 15 > 15 else startY + 15
			cv2.putText(frame, label, (startX, y),
				cv2.FONT_HERSHEY_SIMPLEX, 0.5, COLORS[idx], 2)

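Here we access our video stream. Note that the code is designed to work with both video files and live video streams, which is why I chose not to use the threaded VideoStream class.

Looping over frames, we:

Read and preprocess the incoming frame.
Construct a blob from the frame.
Detect objects using the Single Shot Detector, running on the GPU if the --use-gpu flag is set.
Filter the detections so that only high-confidence objects pass through.
Annotate bounding boxes, class labels, and probabilities.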
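
Two bits of arithmetic in the loop are easy to gloss over. blobFromImage subtracts the mean (127.5) and then multiplies by the scale factor (0.007843, roughly 1/127.5), which maps pixel values from [0, 255] to about [-1, 1]. The detector then returns box corners as fractions of the frame size, which are multiplied back by the frame width and height. The plain-Python sketch below mirrors both steps for single values. The function names are illustrative and not part of OpenCV.

```python
def normalize_pixel(value, scale=0.007843, mean=127.5):
    """Per-pixel arithmetic matching the SSD blob: subtract mean, then scale."""
    return (value - mean) * scale


def scale_box(relative_box, frame_width, frame_height):
    """Convert fractional (startX, startY, endX, endY) to integer pixels."""
    start_x, start_y, end_x, end_y = relative_box
    return (int(start_x * frame_width), int(start_y * frame_height),
            int(end_x * frame_width), int(end_y * frame_height))


darkest = normalize_pixel(0)
middle = normalize_pixel(127.5)
brightest = normalize_pixel(255)
print(round(darkest, 3), round(middle, 3), round(brightest, 3))

# a box covering the central half of a 400x300 frame horizontally
print(scale_box((0.25, 0.5, 0.75, 1.0), 400, 300))
```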

Finally, we wrap up:

	# check to see if the output frame should be displayed to our
	# screen
	if args["display"] > 0:
		# show the output frame
		cv2.imshow("Frame", frame)
		key = cv2.waitKey(1) & 0xFF
		# if the `q` key was pressed, break from the loop
		if key == ord("q"):
			break
	# if an output video file path has been supplied and the video
	# writer has not been initialized, do so now
	if args["output"] != "" and writer is None:
		# initialize our video writer
		fourcc = cv2.VideoWriter_fourcc(*"MJPG")
		writer = cv2.VideoWriter(args["output"], fourcc, 30,
			(frame.shape[1], frame.shape[0]), True)
	# if the video writer is not None, write the frame to the output
	# video file
	if writer is not None:
		writer.write(frame)
	# update the FPS counter
	fps.update()
# stop the timer and display FPS information
fps.stop()
print("[INFO] elasped time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))

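In the remaining lines, we:

Display the annotated video frame if required.
Capture key presses if we are displaying.
Write the annotated output frame to a video file on disk.
Update, calculate, and print out FPS statistics.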
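
The FPS bookkeeping is simple enough to sketch with the standard library alone. The class below is a simplified stand-in that exposes the same method names the scripts call on imutils' FPS object (start, update, stop, elapsed, fps). It is an illustration of the idea and not imutils' actual implementation.

```python
import time


class SimpleFPS:
    """Minimal frames-per-second counter: count updates between start and stop."""

    def __init__(self):
        self._start = None
        self._end = None
        self._frames = 0

    def start(self):
        self._start = time.perf_counter()
        return self

    def update(self):
        # call once per processed frame
        self._frames += 1

    def stop(self):
        self._end = time.perf_counter()

    def elapsed(self):
        # seconds between start() and stop()
        return self._end - self._start

    def fps(self):
        return self._frames / self.elapsed()


counter = SimpleFPS().start()
for _ in range(5):
    time.sleep(0.01)  # stands in for the per-frame inference work
    counter.update()
counter.stop()
print("elapsed: {:.2f}s, approx. FPS: {:.2f}".format(
    counter.elapsed(), counter.fps()))
```

Because the timer spans the whole loop, the reported figure includes video decoding, drawing, and writing, not just the network's forward pass.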

Great job developing your SSD + OpenCV + CUDA script. In the next section, we will analyze results using both our GPU and CPU.

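Single Shot Detectors: 211% faster object detection using OpenCV's 'dnn' module and an NVIDIA GPU

To see our Single Shot Detector in action, make sure you use the "Downloads" section of this tutorial to download (1) the source code and (2) the pre-trained models compatible with OpenCV's dnn module.

From there, execute the following command to obtain a baseline for our SSD by running it on our CPU: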

$ python ssd_object_detection.py \
	--prototxt MobileNetSSD_deploy.prototxt \
	--model MobileNetSSD_deploy.caffemodel \
	--input ../example_videos/guitar.mp4 \
	--output ../output_videos/ssd_guitar.avi \
	--display 0
[INFO] accessing video stream...
[INFO] elasped time: 11.69
[INFO] approx. FPS: 21.13

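Here we obtain ~21 FPS on our CPU, which is quite good for an object detector!

To see the detector really fly, let's supply the --use-gpu 1 command line argument, instructing OpenCV to push the dnn computation to our NVIDIA Tesla V100 GPU: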

$ python ssd_object_detection.py \
	--prototxt MobileNetSSD_deploy.prototxt \
	--model MobileNetSSD_deploy.caffemodel \
	--input ../example_videos/guitar.mp4 \
	--output ../output_videos/ssd_guitar.avi \
	--display 0 \
	--use-gpu 1
[INFO] setting preferable backend and target to CUDA...
[INFO] accessing video stream...
[INFO] elasped time: 3.75
[INFO] approx. FPS: 65.90


Using our NVIDIA GPU, we now reach ~66 FPS, which improves our frames-per-second throughput by over 211%! And as the video demonstration shows, our SSD is quite accurate.
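
The percentage quoted here is a relative increase over the CPU baseline, which can be checked from the two FPS figures printed above. A small sketch of the arithmetic (the function name is illustrative):

```python
def percent_increase(baseline_fps, new_fps):
    """Relative throughput gain over the baseline, expressed in percent."""
    return (new_fps - baseline_fps) / baseline_fps * 100.0


ssd_gain = percent_increase(21.13, 65.90)
ssd_ratio = 65.90 / 21.13
print("SSD: {:.1f}% increase, {:.2f}x the CPU throughput".format(
    ssd_gain, ssd_ratio))
```

Note that an increase of a little over 211% means the GPU run is about 3.1 times as fast as the CPU run, not 2.1 times.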

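Note: As discussed in this comment by Yashas, MobileNet SSD may perform poorly because cuDNN does not have optimized kernels for depthwise convolutions on all NVIDIA GPUs. If your GPU results look similar to your CPU results, this is likely the cause.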

Implementing YOLO object detection for OpenCV's NVIDIA GPU/CUDA-enabled 'dnn' module


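While YOLO is certainly one of the fastest deep learning-based object detectors, the YOLO model included with OpenCV is anything but: on a CPU, YOLO struggled to break 3 FPS.

Therefore, if you intend to use YOLO with OpenCV's dnn module, you had better be using a GPU.

Let's look at how to use the YOLO object detector (yolo_object_detection.py) with OpenCV's CUDA-enabled dnn module: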

# import the necessary packages
from imutils.video import FPS
import numpy as np
import argparse
import cv2
import os
# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-y", "--yolo", required=True,
	help="base path to YOLO directory")
ap.add_argument("-i", "--input", type=str, default="",
	help="path to (optional) input video file")
ap.add_argument("-o", "--output", type=str, default="",
	help="path to (optional) output video file")
ap.add_argument("-d", "--display", type=int, default=1,
	help="whether or not output frame should be displayed")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
	help="minimum probability to filter weak detections")
ap.add_argument("-t", "--threshold", type=float, default=0.3,
	help="threshold when applying non-maxima suppression")
# type=int rather than type=bool: bool("0") is True, so "--use-gpu 0"
# would otherwise switch the GPU on
ap.add_argument("-u", "--use-gpu", type=int, default=0,
	help="whether the CUDA GPU should be used (1 = yes, 0 = no)")
args = vars(ap.parse_args())

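Our imports are nearly the same as in the previous script, with one swap. In this script we don't need imutils, but we do need Python's os module for file I/O. Again, the CUDA functionality is baked into our custom-compiled OpenCV installation.

Let's review our command line arguments:

--yolo: Base path to your pre-trained YOLO model directory.
--input: Optional path to our input video file. If it is not supplied, your first camera is used by default.
--output: Optional path to our output video file.
--display: Optional integer flag (0 or 1) indicating whether to display output frames in an OpenCV GUI window. Displaying frames costs CPU cycles, so for a true benchmark you may want to turn display off (it is on by default).
--confidence: Minimum probability threshold for filtering weak detections. The default is 50%, but you can override it if you wish.
--threshold: The non-maxima suppression (NMS) threshold, set to 30% by default.
--use-gpu: Flag indicating whether the CUDA GPU should be used. It is off by default. To use your NVIDIA CUDA-enabled GPU for object detection with OpenCV, pass a value of 1 to this argument.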

Next we load our class labels and assign random colors:

# load the COCO class labels our YOLO model was trained on
labelsPath = os.path.sep.join([args["yolo"], "coco.names"])
LABELS = open(labelsPath).read().strip().split("\n")
# initialize a list of colors to represent each possible class label
np.random.seed(42)
COLORS = np.random.randint(0, 255, size=(len(LABELS), 3),
	dtype="uint8")

We load the class labels from the coco.names file and assign a random color to each one.
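
Reading the label file with a context manager closes the file handle promptly, which the one-line open(...).read() form leaves to the garbage collector. The sketch below shows that variant. The load_labels name and the demo file are made up for illustration, and the demo writes a throwaway file so the example runs without the real coco.names.

```python
import os
import tempfile


def load_labels(path):
    """Return one class label per line, ignoring trailing blank lines."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read().strip().splitlines()


# demo with a temporary three-class file instead of the real coco.names
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.names")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("person\nbicycle\ncar\n")
    labels = load_labels(demo_path)

print(labels)  # ['person', 'bicycle', 'car']
```

The position of each label in the list matters, since the class ID predicted by the network is used as an index into it.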

Now we are ready to load our YOLO model from disk, setting the GPU backend/target if required:

# derive the paths to the YOLO weights and model configuration
weightsPath = os.path.sep.join([args["yolo"], "yolov3.weights"])
configPath = os.path.sep.join([args["yolo"], "yolov3.cfg"])
# load our YOLO object detector trained on COCO dataset (80 classes)
print("[INFO] loading YOLO from disk...")
net = cv2.dnn.readNetFromDarknet(configPath, weightsPath)
# check if we are going to use GPU
if args["use_gpu"]:
	# set CUDA as the preferable backend and target
	print("[INFO] setting preferable backend and target to CUDA...")
	net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
	net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

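The first two statements derive the paths to our pre-trained YOLO weights and model configuration.

We then load the model from disk and, if the --use-gpu command line flag is set, select CUDA as the backend and target.

Moving on, we start performing object detection with YOLO: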

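# determine only the *output* layer names that we need from YOLO;
# flattening handles both the older (N, 1) and the newer 1-D return
# shapes of getUnconnectedOutLayers()
ln = net.getLayerNames()
ln = [ln[i - 1] for i in np.array(net.getUnconnectedOutLayers()).flatten()]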
# initialize the width and height of the frames in the video file
W = None
H = None
# initialize the video stream and pointer to output video file, then
# start the FPS timer
print("[INFO] accessing video stream...")
vs = cv2.VideoCapture(args["input"] if args["input"] else 0)
writer = None
fps = FPS().start()
# loop over frames from the video file stream
while True:
	# read the next frame from the file
	(grabbed, frame) = vs.read()
	# if the frame was not grabbed, then we have reached the end
	# of the stream
	if not grabbed:
		break
	# if the frame dimensions are empty, grab them
	if W is None or H is None:
		(H, W) = frame.shape[:2]
	# construct a blob from the input frame and then perform a forward
	# pass of the YOLO object detector, giving us our bounding boxes
	# and associated probabilities
	blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
		swapRB=True, crop=False)
	net.setInput(blob)
	layerOutputs = net.forward(ln)

The first two statements grab only the output layer names from the YOLO model. We need these to perform inference with YOLO in OpenCV.
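
The indices returned by getUnconnectedOutLayers() are 1-based, and their shape differs between OpenCV releases: older builds return one-element rows, while newer ones return plain integers. The helper below is a sketch that tolerates both. The function name and the four layer names are invented for the demo.

```python
def output_layer_names(layer_names, out_layer_ids):
    """Map 1-based output layer ids to names, for nested or flat id lists."""
    names = []
    for entry in out_layer_ids:
        # older builds: each entry is a one-element sequence; newer: an int
        index = entry[0] if hasattr(entry, "__len__") else entry
        names.append(layer_names[int(index) - 1])
    return names


layers = ["conv_0", "yolo_1", "conv_2", "yolo_3"]
nested = output_layer_names(layers, [[2], [4]])
flat = output_layer_names(layers, [2, 4])
print(nested)  # ['yolo_1', 'yolo_3']
print(flat)    # ['yolo_1', 'yolo_3']
```

The network object also offers getUnconnectedOutLayersNames(), which returns the names directly and sidesteps the indexing altogether.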

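We then initialize the frame dimensions, our video stream, and the FPS counter. From there, we loop over frames and begin YOLO object detection. Inside the loop, we:

Grab a frame.
Construct a blob from the frame.
Compute predictions (i.e., perform YOLO inference on the blob).

Continuing on, we process the results: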

	# initialize our lists of detected bounding boxes, confidences,
	# and class IDs, respectively
	boxes = []
	confidences = []
	classIDs = []
	# loop over each of the layer outputs
	for output in layerOutputs:
		# loop over each of the detections
		for detection in output:
			# extract the class ID and confidence (i.e., probability)
			# of the current object detection
			scores = detection[5:]
			classID = np.argmax(scores)
			confidence = scores[classID]
			# filter out weak predictions by ensuring the detected
			# probability is greater than the minimum probability
			if confidence > args["confidence"]:
				# scale the bounding box coordinates back relative to
				# the size of the image, keeping in mind that YOLO
				# actually returns the center (x, y)-coordinates of
				# the bounding box followed by the boxes' width and
				# height
				box = detection[0:4] * np.array([W, H, W, H])
				(centerX, centerY, width, height) = box.astype("int")
				# use the center (x, y)-coordinates to derive the
				# top-left corner of the bounding box
				x = int(centerX - (width / 2))
				y = int(centerY - (height / 2))
				# update our list of bounding box coordinates,
				# confidences, and class IDs
				boxes.append([x, y, int(width), int(height)])
				confidences.append(float(confidence))
				classIDs.append(classID)
	# apply non-maxima suppression to suppress weak, overlapping
	# bounding boxes
	idxs = cv2.dnn.NMSBoxes(boxes, confidences, args["confidence"],
		args["threshold"])
	# ensure at least one detection exists
	if len(idxs) > 0:
		# loop over the indexes we are keeping
		for i in idxs.flatten():
			# extract the bounding box coordinates
			(x, y) = (boxes[i][0], boxes[i][1])
			(w, h) = (boxes[i][2], boxes[i][3])
			# draw a bounding box rectangle and label on the frame
			color = [int(c) for c in COLORS[classIDs[i]]]
			cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
			text = "{}: {:.4f}".format(LABELS[classIDs[i]],
				confidences[i])
			cv2.putText(frame, text, (x, y - 5),
				cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

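Still inside our loop, we now:

Initialize the results lists.
Loop over the detections and accumulate outputs while filtering out low-confidence detections.
Apply non-maxima suppression (NMS).
Annotate the output frame with each object's bounding box, class label, and confidence value.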
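
To make the post-processing concrete, here is a plain-Python sketch of the two steps: converting YOLO's center-based boxes to top-left form, then greedy non-maxima suppression over [x, y, width, height] boxes. This is the textbook greedy procedure written out for illustration. It is not claimed to match the internals of cv2.dnn.NMSBoxes, and all function names and sample values are made up.

```python
def center_to_top_left(center_x, center_y, width, height):
    """Convert a center-based box to [x, y, width, height] with top-left x, y."""
    return [int(center_x - width / 2), int(center_y - height / 2),
            int(width), int(height)]


def iou(box_a, box_b):
    """Intersection over union of two [x, y, width, height] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, score_threshold, iou_threshold):
    """Keep the best-scoring boxes, dropping any that overlap a kept box."""
    order = sorted(
        (i for i, score in enumerate(scores) if score > score_threshold),
        key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [
    center_to_top_left(100, 100, 80, 80),  # strong detection
    center_to_top_left(104, 102, 80, 80),  # near-duplicate of the first
    center_to_top_left(300, 200, 60, 40),  # separate object
]
scores = [0.9, 0.75, 0.6]
kept = greedy_nms(boxes, scores, score_threshold=0.5, iou_threshold=0.3)
print(kept)  # [0, 2]
```

The two thresholds play the same roles as the --confidence and --threshold arguments: the first discards weak boxes outright, and the second decides how much overlap counts as a duplicate.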

We close out our frame processing loop and perform cleanup next:

	# check to see if the output frame should be displayed to our
	# screen
	if args["display"] > 0:
		# show the output frame
		cv2.imshow("Frame", frame)
		key = cv2.waitKey(1) & 0xFF
		# if the `q` key was pressed, break from the loop
		if key == ord("q"):
			break
	# if an output video file path has been supplied and the video
	# writer has not been initialized, do so now
	if args["output"] != "" and writer is None:
		# initialize our video writer
		fourcc = cv2.VideoWriter_fourcc(*"MJPG")
		writer = cv2.VideoWriter(args["output"], fourcc, 30,
			(frame.shape[1], frame.shape[0]), True)
	# if the video writer is not None, write the frame to the output
	# video file
	if writer is not None:
		writer.write(frame)
	# update the FPS counter
	fps.update()
# stop the timer and display FPS information
fps.stop()
print("[INFO] elasped time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))

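The remaining lines handle display, key presses, printing FPS statistics, and cleanup. Our YOLO + OpenCV + CUDA script was more involved to implement than the SSD script, so well done getting through it. In the next section, we will analyze the results.

YOLO: 380% faster object detection using OpenCV's NVIDIA GPU-enabled 'dnn' module

We are now ready to test our YOLO object detector. Make sure you have used the "Downloads" section of this tutorial to download the source code and the pre-trained models compatible with OpenCV's dnn module. From there, execute the following command to obtain a baseline for YOLO on our CPU: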

$ python yolo_object_detection.py --yolo yolo-coco \
	--input ../example_videos/janie.mp4 \
	--output ../output_videos/yolo_janie.avi \
	--display 0
[INFO] loading YOLO from disk...
[INFO] accessing video stream...
[INFO] elasped time: 51.11
[INFO] approx. FPS: 2.47

On our CPU, YOLO obtains a rather pitiful 2.47 FPS.

But by pushing the computation to our NVIDIA V100 GPU, we now reach 11.87 FPS, a 380% improvement:

$ python yolo_object_detection.py --yolo yolo-coco \
	--input ../example_videos/janie.mp4 \
	--output ../output_videos/yolo_janie.avi \
	--display 0 \
	--use-gpu 1
[INFO] loading YOLO from disk...
[INFO] setting preferable backend and target to CUDA...
[INFO] accessing video stream...
[INFO] elasped time: 10.61
[INFO] approx. FPS: 11.87


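As I discussed in my original YOLO + OpenCV blog post, I am not really sure why YOLO obtains such a low frames-per-second throughput rate. YOLO is consistently cited as one of the fastest object detectors.

That said, it appears that something is off with either the converted model or the way OpenCV handles inference. Unfortunately, I don't know what the exact problem is, but I welcome feedback in the comments section.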

Implementing Mask R-CNN instance segmentation for OpenCV's CUDA-enabled 'dnn' module


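At this point we have looked at SSDs and YOLO, two different types of deep learning-based object detectors. But what about instance segmentation networks such as Mask R-CNN? Can we use our NVIDIA GPU with OpenCV's CUDA-enabled dnn module to improve the frames-per-second processing rate of Mask R-CNN?

Open mask_rcnn_segmentation.py in your directory structure to find out how: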

# import the necessary packages
from imutils.video import FPS
import numpy as np
import argparse
import cv2
import os
# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-m", "--mask-rcnn", required=True,
	help="base path to mask-rcnn directory")
ap.add_argument("-i", "--input", type=str, default="",
	help="path to (optional) input video file")
ap.add_argument("-o", "--output", type=str, default="",
	help="path to (optional) output video file")
ap.add_argument("-d", "--display", type=int, default=1,
	help="whether or not output frame should be displayed")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
	help="minimum probability to filter weak detections")
ap.add_argument("-t", "--threshold", type=float, default=0.3,
	help="minimum threshold for pixel-wise mask segmentation")
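# type=int rather than type=bool: bool("0") is True, so "--use-gpu 0"
# would otherwise switch the GPU on
ap.add_argument("-u", "--use-gpu", type=int, default=0,
	help="whether the CUDA GPU should be used (1 = yes, 0 = no)")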
args = vars(ap.parse_args())

First we handle our imports. They are identical to those in our previous YOLO script.

From there we review the command line arguments:

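--mask-rcnn: Base path to your pre-trained Mask R-CNN model directory.
--input: Optional path to our input video file. If it is not supplied, your first camera is used by default.
--output: Optional path to our output video file.
--display: Optional integer flag (0 or 1) indicating whether to display output frames in an OpenCV GUI window. Displaying frames costs CPU cycles, so for a true benchmark you may want to turn display off (it is on by default).
--confidence: Minimum probability threshold for filtering weak detections. The default is 50%, but you can override it if you wish.
--threshold: Minimum threshold for the pixel-wise mask segmentation. The default is 30%.
--use-gpu: Flag indicating whether the CUDA GPU should be used. It is off by default. To use your NVIDIA CUDA-enabled GPU for instance segmentation with OpenCV, pass a value of 1 to this argument.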

With our imports and command line arguments in place, we now load our class labels and assign random colors:

# load the COCO class labels our Mask R-CNN was trained on
labelsPath = os.path.sep.join([args["mask_rcnn"],
	"object_detection_classes_coco.txt"])
LABELS = open(labelsPath).read().strip().split("\n")
# initialize a list of colors to represent each possible class label
np.random.seed(42)
COLORS = np.random.randint(0, 255, size=(len(LABELS), 3),
	dtype="uint8")

Next, we load the model:

# derive the paths to the Mask R-CNN weights and model configuration
weightsPath = os.path.sep.join([args["mask_rcnn"],
	"frozen_inference_graph.pb"])
configPath = os.path.sep.join([args["mask_rcnn"],
	"mask_rcnn_inception_v2_coco_2018_01_28.pbtxt"])
# load our Mask R-CNN trained on the COCO dataset (90 classes)
# from disk
print("[INFO] loading Mask R-CNN from disk...")
net = cv2.dnn.readNetFromTensorflow(weightsPath, configPath)
# check if we are going to use GPU
if args["use_gpu"]:
	# set CUDA as the preferable backend and target
	print("[INFO] setting preferable backend and target to CUDA...")
	net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
	net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

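Here we derive the paths to our pre-trained Mask R-CNN weights and model configuration.

We then load the model from disk and, if the --use-gpu command line flag is set, set the backend and target to the GPU. On the CPU alone, segmentation will be slow as molasses. With the --use-gpu flag set, you will process your input video or camera stream at warp speed.

Let's begin processing frames: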

# initialize the video stream and pointer to output video file, then
# start the FPS timer
print("[INFO] accessing video stream...")
vs = cv2.VideoCapture(args["input"] if args["input"] else 0)
writer = None
fps = FPS().start()
# loop over frames from the video file stream
while True:
	# read the next frame from the file
	(grabbed, frame) = vs.read()
	# if the frame was not grabbed, then we have reached the end
	# of the stream
	if not grabbed:
		break
	# construct a blob from the input frame and then perform a
	# forward pass of the Mask R-CNN, giving us (1) the bounding box
	# coordinates of the objects in the image along with (2) the
	# pixel-wise segmentation for each specific object
	blob = cv2.dnn.blobFromImage(frame, swapRB=True, crop=False)
	net.setInput(blob)
	(boxes, masks) = net.forward(["detection_out_final",
		"detection_masks"])

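After grabbing a frame, we convert it to a blob and perform a forward pass through our network to predict object boxes and masks. Now we are ready to process our results: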

	# loop over the number of detected objects
	for i in range(0, boxes.shape[2]):
		# extract the class ID of the detection along with the
		# confidence (i.e., probability) associated with the
		# prediction
		classID = int(boxes[0, 0, i, 1])
		confidence = boxes[0, 0, i, 2]
		# filter out weak predictions by ensuring the detected
		# probability is greater than the minimum probability
		if confidence > args["confidence"]:
			# scale the bounding box coordinates back relative to the
			# size of the frame and then compute the width and the
			# height of the bounding box
			(H, W) = frame.shape[:2]
			box = boxes[0, 0, i, 3:7] * np.array([W, H, W, H])
			(startX, startY, endX, endY) = box.astype("int")
			boxW = endX - startX
			boxH = endY - startY
			# extract the pixel-wise segmentation for the object,
			# resize the mask such that it's the same dimensions of
			# the bounding box, and then finally threshold to create
			# a *binary* mask
			mask = masks[i, classID]
			mask = cv2.resize(mask, (boxW, boxH),
				interpolation=cv2.INTER_CUBIC)
			mask = (mask > args["threshold"])
			# extract the ROI of the image but *only* extracted the
			# masked region of the ROI
			roi = frame[startY:endY, startX:endX][mask]
			# grab the color used to visualize this particular class,
			# then create a transparent overlay by blending the color
			# with the ROI
			color = COLORS[classID]
			blended = ((0.4 * color) + (0.6 * roi)).astype("uint8")
			# store the blended ROI in the original frame
			frame[startY:endY, startX:endX][mask] = blended
			# draw the bounding box of the instance on the frame
			color = [int(c) for c in color]
			cv2.rectangle(frame, (startX, startY), (endX, endY),
				color, 2)
			# draw the predicted label and associated probability of
			# the instance segmentation on the frame
			text = "{}: {:.4f}".format(LABELS[classID], confidence)
			cv2.putText(frame, text, (startX, startY - 5),
				cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

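Looping over the detected objects, we:

Filter them based on confidence.
Resize each mask and draw it as a transparent colored overlay on the object.
Annotate bounding boxes, labels, and probabilities on the output frame.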
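
The overlay step boils down to two operations per pixel: threshold the soft mask value to decide whether the pixel belongs to the object, then mix 40% of the class color with 60% of the original pixel. The sketch below spells that out with plain Python lists in place of NumPy arrays. The function names and the tiny 2x2 region are invented for illustration.

```python
def blend_pixel(color, pixel, color_weight=0.4, pixel_weight=0.6):
    """Weighted mix of a class color and an image pixel, truncated to ints."""
    return tuple(int(color_weight * c + pixel_weight * p)
                 for c, p in zip(color, pixel))


def apply_mask(region, soft_mask, color, threshold=0.3):
    """Blend `color` into every pixel whose mask value exceeds `threshold`."""
    result = []
    for pixel_row, mask_row in zip(region, soft_mask):
        result.append([
            blend_pixel(color, pixel) if value > threshold else pixel
            for pixel, value in zip(pixel_row, mask_row)
        ])
    return result


# a 2x2 gray region; only the left column is confidently "object"
region = [[(100, 100, 100), (100, 100, 100)],
          [(100, 100, 100), (100, 100, 100)]]
soft_mask = [[0.9, 0.1],
             [0.6, 0.2]]
overlay = apply_mask(region, soft_mask, color=(0, 255, 0))
for row in overlay:
    print(row)
```

Pixels in the right column come back unchanged, which is why the overlay follows the object's outline rather than filling the whole bounding box.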

From there we close out our loop, calculate FPS statistics, and clean up:

	# check to see if the output frame should be displayed to our
	# screen
	if args["display"] > 0:
		# show the output frame
		cv2.imshow("Frame", frame)
		key = cv2.waitKey(1) & 0xFF
		# if the `q` key was pressed, break from the loop
		if key == ord("q"):
			break
	# if an output video file path has been supplied and the video
	# writer has not been initialized, do so now
	if args["output"] != "" and writer is None:
		# initialize our video writer
		fourcc = cv2.VideoWriter_fourcc(*"MJPG")
		writer = cv2.VideoWriter(args["output"], fourcc, 30,
			(frame.shape[1], frame.shape[0]), True)
	# if the video writer is not None, write the frame to the output
	# video file
	if writer is not None:
		writer.write(frame)
	# update the FPS counter
	fps.update()
# stop the timer and display FPS information
fps.stop()
print("[INFO] elasped time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))

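Great job developing your Mask R-CNN + OpenCV + CUDA script! In the next section, we will compare CPU versus GPU results. For more details on the implementation, refer to this blog post on Mask R-CNN with OpenCV.

Mask R-CNN: 1,549% faster instance segmentation using OpenCV's 'dnn' NVIDIA GPU module

Our final test compares Mask R-CNN performance on a CPU versus an NVIDIA GPU. Make sure you have used the "Downloads" section of this tutorial to download the source code and the pre-trained OpenCV model files. You can then open a command line and benchmark the Mask R-CNN model on the CPU: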

$ python mask_rcnn_segmentation.py \
	--mask-rcnn mask-rcnn-coco \
	--input ../example_videos/dog_park.mp4 \
	--output ../output_videos/mask_rcnn_dog_park.avi \
	--display 0
[INFO] loading Mask R-CNN from disk...
[INFO] accessing video stream...
[INFO] elasped time: 830.65
[INFO] approx. FPS: 0.67

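The Mask R-CNN architecture is incredibly computationally expensive, so a result of 0.67 FPS on the CPU is to be expected.

But what about a GPU? Can a GPU push our Mask R-CNN to near real-time performance?

To answer that question, just supply the --use-gpu 1 command line argument to the mask_rcnn_segmentation.py script: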

$ python mask_rcnn_segmentation.py \
	--mask-rcnn mask-rcnn-coco \
	--input ../example_videos/dog_park.mp4 \
	--output ../output_videos/mask_rcnn_dog_park.avi \
	--display 0 \
	--use-gpu 1
[INFO] loading Mask R-CNN from disk...
[INFO] setting preferable backend and target to CUDA...
[INFO] accessing video stream...
[INFO] elasped time: 50.21
[INFO] approx. FPS: 11.05


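On my NVIDIA Tesla V100, our Mask R-CNN model now reaches 11.05 FPS, a massive 1,549% improvement!

Making nearly any model compatible with OpenCV's 'dnn' module run on an NVIDIA GPU

If you have been paying attention to each source code example in today's post, you will have noticed that each one follows a particular pattern to push computation to an NVIDIA CUDA-enabled GPU:

Load the trained model from disk.
Set OpenCV's backend to CUDA.
Push the computation to the CUDA-enabled device.

These three points neatly translate into three lines of code: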

net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

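In general, you can follow the same recipe when using OpenCV's dnn module: if you have a model that is compatible with OpenCV and dnn, it can likely be used for GPU inference simply by setting CUDA as the backend and target. All you really need to do is swap out the cv2.dnn.readNetFromCaffe function for whichever method you use to load your network from disk, including: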

cv2.dnn.readNet
cv2.dnn.readNetFromDarknet
cv2.dnn.readNetFromModelOptimizer
cv2.dnn.readNetFromONNX
cv2.dnn.readNetFromTensorflow
cv2.dnn.readNetFromTorch
cv2.dnn.readTensorFromONNX (this one reads a single tensor from a file rather than a network)

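You will need to check which framework your model was trained with to confirm whether it is compatible with OpenCV's dnn library. I hope to cover such tutorials in the future as well.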
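
One way to organize the choice of loader is a lookup keyed on the weight file's extension. The table below is a hedged sketch: the helper name reader_name_for is hypothetical, the mapping covers only the formats used in this post plus ONNX and Torch, and the extension-to-reader pairs reflect common conventions rather than an exhaustive rule. It returns the function name as a string so the example runs without OpenCV installed.

```python
import os

# assumed mapping from weight-file extension to cv2.dnn reader name
READER_BY_EXTENSION = {
    ".caffemodel": "readNetFromCaffe",
    ".weights": "readNetFromDarknet",
    ".pb": "readNetFromTensorflow",
    ".onnx": "readNetFromONNX",
    ".t7": "readNetFromTorch",
}


def reader_name_for(model_path):
    """Return the name of the cv2.dnn reader suited to `model_path`."""
    extension = os.path.splitext(model_path)[1].lower()
    try:
        return READER_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError("no known reader for %r" % extension)


for path in ("MobileNetSSD_deploy.caffemodel",
             "yolov3.weights",
             "frozen_inference_graph.pb"):
    print(path, "->", reader_name_for(path))
```

With OpenCV available, getattr(cv2.dnn, reader_name_for(path)) would yield the function itself. Keep in mind that the readers do not share one argument order: as the scripts above show, the Caffe and Darknet readers take the configuration file first, while the TensorFlow reader takes the weights first.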

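Summary

In this tutorial you learned how to apply OpenCV's "deep neural network" (dnn) module for GPU-optimized inference. Until the release of OpenCV 4.2, the dnn module had extremely limited compute capability: most readers were left running inference on their CPU, which is far from ideal. However, thanks to Davis King of dlib, Yashas Samaga (who implemented OpenCV's "dnn" NVIDIA GPU support), and the Google Summer of Code 2019 initiative, OpenCV now enjoys NVIDIA GPU and CUDA support, making it easier than ever to apply state-of-the-art networks to your own projects.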

Translated from: "opencv-dnn-with-nvidia-gpus-1549-faster-yolo-ssd-and-mask-r-cnn". If you are interested, see the original post.
