Using PyTorch DataParallel for distributed training

  • 1. Overview of the nn.DataParallel workflow
  • 2. nn.DataParallel parameters explained
  • 3. Code walkthrough
    • 1. Details of using DataParallel
    • 2. Full code
  • 4. Summary



In deep learning we often need to train on large amounts of data, and training on a single GPU is frequently too slow, so multi-GPU training becomes very important. PyTorch offers two ways to train on multiple GPUs:

  • nn.DataParallel
  • nn.parallel.DistributedDataParallel

The former is the training method covered in this article. nn.DataParallel can only be used for single-machine multi-GPU training, whereas nn.parallel.DistributedDataParallel supports both single-machine multi-GPU and multi-machine multi-GPU training; for the details, refer to the official PyTorch tutorial.

1. Overview of the nn.DataParallel workflow


  1. The model and a minibatch are loaded from disk onto GPU 0
  2. The minibatch is split into sub-minibatches according to the number of GPUs, and one sub-minibatch is sent to each GPU
  3. The model is replicated from GPU 0 to the other GPUs
  4. Each GPU runs a forward pass on its sub-minibatch and produces an output
  5. The outputs from all GPUs are gathered onto GPU 0, and the loss is computed for each output
  6. The losses are scattered back to the GPUs, which run backward passes to compute gradients
  7. The gradients from all GPUs are gathered onto GPU 0 and averaged
  8. The averaged gradient is used to update the model parameters on GPU 0, and the updated parameters are then copied to the other GPUs (steps 2–5 are sketched in code right after this list)
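
The following is a minimal sketch of steps 2–5 (not the library's actual code path), written with PyTorch's functional primitives nn.parallel.scatter / replicate / parallel_apply / gather, in the spirit of the official multi-GPU examples; the helper name data_parallel_forward is made up here, and at least two visible GPUs are assumed.

import torch
import torch.nn as nn

def data_parallel_forward(module, inputs, device_ids, output_device=None):
    if output_device is None:
        output_device = device_ids[0]
    # Step 2: split the minibatch along dim 0 and send one chunk to each GPU
    scattered = nn.parallel.scatter(inputs, device_ids)
    # Step 3: copy the model from device_ids[0] to the other GPUs
    replicas = nn.parallel.replicate(module, device_ids[:len(scattered)])
    # Step 4: run the forward pass on every GPU in parallel
    outputs = nn.parallel.parallel_apply(replicas, scattered)
    # Step 5: gather the sub-outputs back onto the output device (GPU 0 by default)
    return nn.parallel.gather(outputs, output_device)

if torch.cuda.device_count() >= 2:
    model = nn.Linear(8, 4).to("cuda:0")           # step 1: model on GPU 0
    batch = torch.randn(32, 8, device="cuda:0")    # step 1: minibatch on GPU 0
    out = data_parallel_forward(model, batch, device_ids=[0, 1])
    print(out.shape)   # torch.Size([32, 4]), gathered on cuda:0

Steps 6–8 (backward pass, gradient reduction and the parameter update) are handled by autograd and the optimizer during loss.backward() and optimizer.step(); the sketch only covers the forward side.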

2. nn.DataParallel parameters explained

class DataParallel(Module):
	"""
    Args:
        module (Module): module to be parallelized
        device_ids (list of int or torch.device): CUDA devices (default: all devices)
        output_device (int or torch.device): device location of output (default: device_ids[0])

    Attributes:
        module (Module): the module to be parallelized

    Example::

        >>> net = torch.nn.DataParallel(model, device_ids=[0, 1, 2])
        >>> output = net(input_var)  # input_var can be on any device, including CPU
    """

    def __init__(self, module, device_ids=None, output_device=None, dim=0):
        super(DataParallel, self).__init__()
        torch._C._log_api_usage_once("torch.nn.parallel.DataParallel")
        device_type = _get_available_device_type()
        if device_type is None:
            self.module = module
            self.device_ids = []
            return

The above is excerpted from the definition of the DataParallel class:

  • device_ids lists the ids of the GPUs to parallelize over
  • output_device is the GPU on which the loss is computed and the gradients are aggregated (by default GPU 0, i.e. device_ids[0], matching the workflow above); see the example below
  • module needs little explanation: it is the model to be data-parallelized
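
As a small example of these arguments (assuming a machine with at least two visible GPUs), passing output_device=1 gathers the outputs, and therefore the loss computation, onto GPU 1 instead of the default device_ids[0]:

import torch
import torch.nn as nn

model = nn.Linear(28 * 28, 10)
if torch.cuda.device_count() >= 2:
    # replicate over GPUs 0 and 1, but gather the outputs on GPU 1
    model = nn.DataParallel(model, device_ids=[0, 1], output_device=1)
    model.to("cuda:0")   # the parameters must live on device_ids[0]

    x = torch.randn(64, 28 * 28, device="cuda:0")
    out = model(x)
    print(out.device)    # cuda:1, because output_device=1

If you change output_device, remember to move the targets to that same device before computing the loss.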

3. Code walkthrough

1. Details of using DataParallel

1) Add the following settings to the import section of the file:

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,3"

These lines select which physical GPUs the training run may use. The code above selects GPU 0 and GPU 3; inside the process they are renumbered to device_ids 0 and 1, which matters for the code below.
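
A quick way to check the renumbering (a sketch that assumes the machine actually has a physical GPU 3) is to print the visible devices after setting the environment variables and before any other CUDA call:

import os
# must be set before CUDA is initialized, i.e. before the first CUDA call
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,3"

import torch
print(torch.cuda.device_count())       # 2: only two devices are visible
print(torch.cuda.get_device_name(0))   # physical GPU 0, now cuda:0
print(torch.cuda.get_device_name(1))   # physical GPU 3, now cuda:1
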
2) Next, wrap the defined model in DataParallel:

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model, device_ids=[0, 1])
# model.to(device) moves the model onto the GPU that device refers to (cuda:0 here)
model.to(device)

3) Move the data onto device (cuda:0, i.e. the GPU whose device_id is 0, which is physical GPU 0 here):

for X, y in dataloader:
    X, y = X.to(device), y.to(device)

2. Full code

This example builds a simple fully connected network for a classification task on the FashionMNIST dataset, with a cross-entropy loss and the SGD optimizer. The full code is below; if you want to reproduce it, it is best to download the dataset locally first.

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor, Lambda, Compose
import os
import matplotlib.pyplot as plt
import time
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,3"
# Download training data from open datasets.


# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits


def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

if __name__ == '__main__':


    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    path = '/home/ymx/PycharmProjects/pytorch_reinforce_example/fashion-mnist-master/data/fashion'

    training_data = datasets.FashionMNIST(
        root='data',
        train=True,
        download=True,
        transform=ToTensor(),
    )

    test_data = datasets.FashionMNIST(
        root='data',
        train=False,
        download=True,
        transform=ToTensor(),
    )

    batch_size = 1024

    # Create data loaders.
    train_dataloader = DataLoader(training_data, batch_size=batch_size)
    test_dataloader = DataLoader(test_data, batch_size=batch_size)



    model = NeuralNetwork()
    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        # dim=0: a batch of shape [30, ...] is split into three chunks of [10, ...] on 3 GPUs
        model = nn.DataParallel(model, device_ids=[0, 1])
    model.to(device)

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=2e-3)
    epochs = 5
    T1 = time.time()
    for t in range(epochs):
        print(f"Epoch {t+1}\n-------------------------------")
        train(train_dataloader, model, loss_fn, optimizer)
        test(test_dataloader, model, loss_fn)
    T2 = time.time()
    print('time is %s ms' % ((T2 - T1) * 1000))
    print("Done!")

4. Summary

  • Using DataParallel for single-machine multi-GPU training can speed up training, and the required code changes are minimal. Note, however, that when the amount of data is small, data parallelism will not speed things up: the communication overhead becomes dominant and actually slows training down, so the result can be worse than training on a single GPU.
  • If you have multiple machines with multiple GPUs, it is strongly recommended to use parallel.DistributedDataParallel for distributed training; the speedup it brings is far greater than that of this kind of data parallelism (a minimal launch sketch follows below).
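
For reference, here is a minimal, heavily simplified sketch of a DistributedDataParallel setup launched with torchrun; the file name ddp_example.py is made up, and the real training loop, DistributedSampler and dataset handling are omitted. See the official PyTorch tutorial for the full recipe.

# launch with: torchrun --nproc_per_node=2 ddp_example.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])       # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(28 * 28, 10).to(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across processes

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=2e-3)
x = torch.randn(64, 28 * 28, device=local_rank)
y = torch.randint(0, 10, (64,), device=local_rank)

loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

dist.destroy_process_group()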