Contents
I. Methods for single-machine multi-GPU training
1. nn.DataParallel
2. torch.distributed
3. Things to watch out for
II. Hands-on practice
III. An acceleration trick for single-machine multi-GPU training: gradient accumulation
Multi-GPU training speeds up model training and also makes it possible to train models that will not fit on a single card by spreading them over several smaller cards. Multi-GPU training comes in two flavours: single-machine multi-GPU and multi-machine multi-GPU. The latter is distributed training proper; it is more cumbersome to set up and has many more performance pitfalls to watch, and the common advice online is to stick to a single machine whenever you can. This article therefore focuses on how to implement single-machine multi-GPU training and the points that deserve attention, followed by a hands-on example.
I. Methods for single-machine multi-GPU training
1. nn.DataParallel
This is the simplest and most direct approach: one extra line of code is enough to train on multiple GPUs of a single machine, and everything else stays the same as ordinary single-GPU training.
model = nn.DataParallel(model.cuda(), device_ids=gpus, output_device=gpus[0])
After the model has been initialized, this one line is all that is needed: device_ids=gpus specifies the GPU indices to use, and output_device=gpus[0] specifies the GPU on which gradients are gathered.
The advantage of this approach is that it is extremely simple. The drawback is just as obvious: for every batch the model weights are updated in a single process and then broadcast to the other GPUs, which creates a GPU communication bottleneck, so GPU utilization is not very high and training is not particularly fast.
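Putting it together, a minimal sketch of the DataParallel pattern (MyModel, train_loader, and the GPU ids are placeholders here, not code from this article's project):
import torch
import torch.nn as nn

gpus = [0, 1]                                   # GPUs to use
model = MyModel()                               # any nn.Module; placeholder name
model = nn.DataParallel(model.cuda(), device_ids=gpus, output_device=gpus[0])

loss_function = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

for inputs, labels in train_loader:             # an ordinary DataLoader, no special sampler needed
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()
    output = model(inputs)                      # the batch is split across the GPUs automatically
    loss = loss_function(output, labels)
    loss.backward()                             # gradients are gathered on gpus[0]
    optimizer.step()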
2. torch.distributed
This approach is intended to alleviate the low GPU utilization of nn.DataParallel. It balances GPU memory allocation across the cards better, and because it runs one process per GPU instead of a single process driving all the cards, GPU utilization is naturally higher.
The first step is to initialize the process group with init_process_group, declaring NCCL as the GPU communication backend:
torch.distributed.init_process_group(backend='nccl')
Next, because multiple processes are involved, data loading and model wrapping have to be adapted as follows:
train_data = ReadDataSet('train.tsv',args,sentences_count = None)
train_sample = torch.utils.data.distributed.DistributedSampler(train_data)
train_loader = DataLoader(dataset=train_data, batch_size=args.batch_size, shuffle=(train_sample is None),sampler=train_sample)
model = nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], find_unused_parameters=True)  # multi-process, multi-GPU parallelism
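One extra detail worth noting (not shown in the snippet above, nor in the full script in Section II): for the DistributedSampler to reshuffle the data differently in each epoch, its set_epoch method is usually called at the start of every epoch, roughly like this:
for epoch in range(epochs):
    train_sample.set_epoch(epoch)  # makes the sampler's shuffling depend on the epoch
    for step, batch in enumerate(train_loader):
        ...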
3. Things to watch out for
First, the way the code is launched from a bash script is different. The command must be written like this:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train_textBert_multi_gpu_distributedDataParallel.py \
--batch_size 12 \
--model_path ./pretrain_model/Chinese-BERT-wwm \
--requires_grad true \
--data_file_path data_set/patent \
--max_sentence_length 400
The visible GPUs are specified with CUDA_VISIBLE_DEVICES=0,1 before the python keyword; after the python keyword, -m torch.distributed.launch --nproc_per_node=2 turns on distributed launching and sets the number of processes per node, which should equal the number of GPUs being used.
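torch.distributed.launch passes a --local_rank argument to every process it spawns, so the training script must accept it and bind the process to the corresponding GPU (this is exactly what the full script in Section II does):
parser.add_argument('--local_rank', default=-1, type=int,
                    help='node rank for distributed training')
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)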
Second, log printing: if you simply print in the code, every message will be printed nproc_per_node times, so the process rank has to be checked first.
if (step+1)%200 == 0 and args.local_rank==0:
print('Train Epoch[{}/{}],step[{}/{}],tra_acc{:.6f} %,loss:{:.6f}'.format(epoch,epochs,step,len(train_iter),two_pro_train_acc*100,two_pro_loss))
This way only the messages from process 0 are printed.
Third, the loss and accuracy have to be merged: with several processes, the different loss and accuracy values each process computes for the same batch need to be combined. This can be done as follows:
def reduce_tensor(tensor: torch.Tensor):
rt = tensor.clone()
dist.all_reduce(rt,op=dist.ReduceOp.SUM)
    rt /= dist.get_world_size()  # total number of processes
return rt
Each process's loss or accuracy is summed with a distributed all-reduce and then divided by the number of processes to obtain the average.
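In the training loop it is used roughly like this (the same pattern appears in the full script below):
loss = loss_function(output, label)
loss_avg = reduce_tensor(loss).item()  # averaged over all processes, for logging only
loss.backward()                        # backward still uses the local, un-reduced loss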
Finally, on setting batch_size and lr: a common rule of thumb is batch_size = n * batch_size_base, while lr is scaled by a factor between 1 and n times lr_base.
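A small illustrative sketch of that rule of thumb (batch_size_base, lr_base, and the chosen values are illustrative, not taken from the code in this article):
import torch

batch_size_base, lr_base = 8, 3e-5              # single-GPU baseline settings (illustrative values)
n_gpus = torch.cuda.device_count()              # number of GPUs / processes
global_batch_size = n_gpus * batch_size_base    # batch_size = n * batch_size_base
lr = 2 * lr_base                                # scaled somewhere between 1x and n x lr_base, tuned empirically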
II. Hands-on practice
Here is the code for the last approach, multi-process multi-GPU training together with the metric computation, followed by a comparison of the different training modes:
from DataReader.ReadDataSet import ReadDataSet
from torch.utils.data import DataLoader
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau,StepLR
from tqdm import tqdm
import torch.nn.functional as F
from model.TextBert import TextBert
from tensorboardX import SummaryWriter
from transformers import AdamW,get_linear_schedule_with_warmup
import argparse
import time
import os
import numpy as np
import torch.nn as nn
import torch.distributed as dist
"""
During training we still need to monitor the accuracy on the dev set.
"""
writer = SummaryWriter('runs/exp')
def train(model,train_iter,dev_iter,args):
model.cuda(args.local_rank)
    model = nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], find_unused_parameters=True)  # multi-process, multi-GPU parallelism
loss_function = nn.CrossEntropyLoss().cuda(args.local_rank)
if args.requires_grad:
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
        # set up weight decay for the model parameters
optimizer_grouped_parameters = [
{'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
'weight_decay': 0.01},
{'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
# t_total = len(train_iter)
        # # learning-rate schedule setup
# optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5, eps=1e-8)
# scheduler = get_linear_schedule_with_warmup(
# optimizer, num_warmup_steps=100, num_training_steps=t_total
# )
        # AdamW is the mainstream optimizer nowadays
optimizer = AdamW(optimizer_grouped_parameters, lr=3e-5, eps=1e-8)
scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.8, min_lr=1e-7, patience=5, verbose=True,
                                      eps=1e-8)  # mode='max': the lr is reduced when the monitored metric stops increasing; 'min': when it stops decreasing. We monitor dev_acc here, so 'max' is the right choice
else:
        # initial learning rate
optimizer_params = {'lr': 1e-3, 'eps': 1e-8}
optimizer = AdamW(model.parameters(), **optimizer_params)
scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.8, min_lr=1e-6, patience=2, verbose=True,
                                      eps=1e-8)  # mode='max': the lr is reduced when the monitored metric stops increasing; 'min': when it stops decreasing. We monitor dev_acc here, so 'max' is the right choice
early_stop_step = 100000
epochs = 3
    last_improve = 0  # the step at which the last improvement was seen
    flag = False  # whether there has been no improvement for a long time
dev_best_acc = 0
show_dev_loss = 5.0
correct = 0
total = 0
global_step = 0
t1 = time.time()
for epoch in range(epochs):
for step,batch in enumerate(tqdm(train_iter,desc='Train iteration:')):
two_pro_loss = 0
two_pro_dev_loss = 0
two_pro_train_acc = 0
two_pro_dev_acc =0
global_step += 1
optimizer.zero_grad()
# batch = tuple(t.to('cuda') for t in batch)
batch = tuple(t.cuda(args.local_rank, non_blocking=True) for t in batch)
input_ids = batch[0]
input_mask = batch[1]
label = batch[2]
model.train()
output = model(input_ids,input_mask)
loss = loss_function(output,label)
            two_pro_loss += reduce_tensor(loss).item()  # several processes: sum the losses from rank 0 and rank 1 and average them
loss.backward()
optimizer.step()
total += label.size(0)
_,predict = torch.max(output,1)
correct += (predict==label).sum().item()
train_acc = correct / total
# train_acc = torch.tensor(train_acc).cuda(args.local_rank)
two_pro_train_acc += reduce_tensor(torch.tensor(train_acc).cuda(args.local_rank)).item()
if (step + 1) % 200 == 0:
print('*' * 100)
print('Train Epoch[{}/{}],step[{}/{}],tra_acc{:.6f} %,loss:{:.6f}'.format(epoch, epochs, step,
len(train_iter),
train_acc * 100,
loss))
if (step+1)%200 == 0 and args.local_rank==0:
print('Train Epoch[{}/{}],step[{}/{}],tra_acc{:.6f} %,loss:{:.6f}'.format(epoch,epochs,step,len(train_iter),two_pro_train_acc*100,two_pro_loss))
print('*' * 100)
            if (step+1)%(int(len(train_iter)/5))==0:  # evaluate a few times per epoch (step must not be shifted by +1 here; relates to how dev_acc and dev_loss are initialized)
dev_acc,dev_loss = dev(model, dev_iter,args)
two_pro_dev_loss += reduce_tensor(dev_loss)
show_dev_loss = two_pro_dev_loss.item()
two_pro_dev_acc += reduce_tensor(torch.tensor(dev_acc).cuda(args.local_rank))
if dev_best_acc < two_pro_dev_acc:
dev_best_acc = two_pro_dev_acc
path = 'savedmodel/pytorch_model.bin'
if args.local_rank==0:
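                        # note: a common alternative is to save model.module.state_dict() so the checkpoint is not tied to the DDP wrapper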
torch.save(model,path)
last_improve = global_step
print('&'*100)
print(
"DEV Epoch[{}/{}],step[{}/{}],tra_acc{:.6f} %,dev_acc{:.6f} %,best_dev_acc{:.6f} %,train_loss:{:.6f},dev_loss:{:.6f}".format(
                        epoch, epochs, step, len(train_iter), train_acc * 100, dev_acc * 100,
dev_best_acc * 100, loss, dev_loss))
                if args.local_rank==0:  # only print from process 0
print("DEV Epoch[{}/{}],step[{}/{}],tra_acc{:.6f} %,dev_acc{:.6f} %,best_dev_acc{:.6f} %,train_loss:{:.6f},dev_loss:{:.6f}".format(epoch, epochs, step, len(train_loader), two_pro_train_acc * 100, two_pro_dev_acc * 100,dev_best_acc*100,two_pro_loss,two_pro_dev_loss))
print('&' * 100)
if global_step-last_improve >= early_stop_step:
print("No optimization for a long time, auto-stopping...")
flag = True
break
if args.local_rank==0:
writer.add_scalar('train_loss', two_pro_loss, global_step=global_step)
writer.add_scalar('dev_loss', show_dev_loss, global_step=global_step)
writer.add_scalar('train_acc', two_pro_train_acc, global_step=global_step)
writer.add_scalar('dev_best_acc', dev_best_acc, global_step=global_step)
scheduler.step(dev_best_acc)
if flag:
break
writer.close()
t2 = time.time()
    print('train and eval model time is %.4f' % (t2 - t1))
def reduce_tensor(tensor: torch.Tensor):
rt = tensor.clone()
dist.all_reduce(rt,op=dist.ReduceOp.SUM)
    rt /= dist.get_world_size()  # total number of processes
return rt
def dev(model, dev_iter,args):
model.eval()
loss_total = 0
with torch.no_grad():
correct = 0
total = 0
for step,batch in enumerate(tqdm(dev_iter,desc='dev iteration:')):
batch = tuple(t.cuda(args.local_rank, non_blocking=True) for t in batch)
input_ids = batch[0]
input_mask = batch[1]
label = batch[2]
output = model(input_ids,input_mask)
loss = F.cross_entropy(output, label)
loss_total += loss
total += label.size(0)
_, predict = torch.max(output, 1)
correct += (predict == label).sum().item()
res = correct/total
return res,loss_total/len(dev_iter)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='init params configuration')
parser.add_argument('--batch_size',type=int,default=100)
parser.add_argument('--model_path',type=str,default='./pretrain_model')
    parser.add_argument('--requires_grad', type=lambda s: str(s).lower() in ('true', '1'), default=True)  # note: type=bool would treat any non-empty string (including "false") as True
parser.add_argument('--data_file_path',type=str,default='data_set/patent')
parser.add_argument('--max_sentence_length',type=int,default=400)
parser.add_argument('--local_rank', default=-1, type=int,
help='node rank for distributed training')
args = parser.parse_args()
print(args)
dist.init_process_group(backend='nccl')
torch.cuda.set_device(args.local_rank)
train_data = ReadDataSet('train.tsv',args,sentences_count = None)
train_sample = torch.utils.data.distributed.DistributedSampler(train_data)
train_loader = DataLoader(dataset=train_data, batch_size=args.batch_size, shuffle=(train_sample is None),sampler=train_sample)
dev_data = ReadDataSet('dev.tsv',args,sentences_count = None)
dev_sample = torch.utils.data.distributed.DistributedSampler(dev_data)
dev_loader = DataLoader(dataset=dev_data, batch_size=args.batch_size, shuffle=(dev_sample is None),sampler=dev_sample)
model = TextBert(args)
train(model,train_loader,dev_loader,args)
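The script above is started with the torch.distributed.launch command shown in Section I.3.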
The comparison results are as follows:
1. Single GPU, batch_size = 8, lr = lr_base
Time: 208 s; best dev-set accuracy: 59.2%.
2. Multi-GPU (nn.DataParallel), batch_size = 16, lr = [1, n] * lr_base
In each forward pass the per-GPU batch size is effectively still 8; by the usual rule of thumb the lr should be scaled up by a corresponding factor.
nn.DataParallel(model)
Total time: 140 s; best dev-set accuracy: 61.4%. Compared with a single GPU, speed improved by 48.6% and the elapsed time dropped by 32.7%.
3. Multi-process multi-GPU (DistributedDataParallel), batch_size = 8, lr = lr_base
model=nn.parallel.DistributedDataParallel(model,device_ids=[args.local_rank])
The lr was left unchanged here.
Time: 118 s, a 76.3% speedup and a 43.3% reduction in elapsed time; dev-set accuracy 62.8%.
So for single-machine multi-GPU training, nn.parallel.DistributedDataParallel is clearly the way to go: it is the fastest, and within a limited time budget it also gives the best results.
III. An acceleration trick for single-machine multi-GPU training: gradient accumulation
Ways to accelerate single-machine multi-GPU training include mixed precision, gradient accumulation, and so on. Gradient accumulation only produces a speedup when training on multiple cards; on a single card it does not accelerate anything. The intuition is simple: multi-GPU training requires a gradient synchronization step, i.e. the GPUs communicate with each other for every batch, and that communication time leaves training waiting. Gradient accumulation effectively enlarges the batch size and reduces the number of update steps, which reduces the inter-GPU communication and therefore speeds training up. The code change is also straightforward. The normal training code:
for i, (inputs, labels) in enumerate(training_set):
    loss = model(inputs, labels)     # compute the loss
    optimizer.zero_grad()            # clear the gradients
    loss.backward()                  # back-propagate to compute gradients
    optimizer.step()                 # update the parameters
The code with gradient accumulation:
for i, (inputs, labels) in enumerate(training_set):
    loss = model(inputs, labels)          # compute the loss
    loss = loss / accumulation_steps      # normalize the loss (if it is averaged)
    loss.backward()                       # back-propagate; gradients accumulate on top of the previous ones
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                  # update the parameters
        model.zero_grad()                 # clear the gradients
Of course, the benefit depends on the model size: the larger the model, the larger the gain.
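Note that with DistributedDataParallel, loss.backward() still triggers a gradient all-reduce on every backward pass by default. To make sure the inter-GPU communication is actually skipped on the accumulation steps, DDP's no_sync() context manager can be combined with gradient accumulation; a minimal sketch using the same notation as above (a supplement, not part of the experiment code timed below):
for i, (inputs, labels) in enumerate(training_set):
    loss = model(inputs, labels) / accumulation_steps
    if (i + 1) % accumulation_steps != 0:
        with model.no_sync():            # skip the gradient all-reduce on pure accumulation steps
            loss.backward()
    else:
        loss.backward()                  # this backward synchronizes the accumulated gradients
        optimizer.step()
        optimizer.zero_grad()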
Below are the timings for a binary text-classification task using a roberta_large model on two RTX 3090 cards (10k training samples, 1k dev samples):
Single machine, multiple GPUs
--accumulation_steps 2
roberta_large textclassification task train and dev 2 epochs with grad accmulation time is 686.4237
tra_acc73.325002 %,dev_acc76.400002 %,best_dev_acc76.400002 %
*******************************************************************************
--accumulation_steps 5
roberta_large textclassification task train and dev 2 epochs with grad accmulation time is 578.8834
tra_acc73.329997 %,dev_acc75.500000 %,best_dev_acc76.100006 %
*******************************************************************************
--accumulation_steps 10
roberta_large textclassification task train and dev 2 epochs with grad accmulation time is 579.5692
tra_acc71.015000 %,dev_acc75.400402 %,best_dev_acc77.300002 %
*******************************************************************************
--accumulation_steps 20
roberta_large textclassification task train and dev 2 epochs with grad accmulation time is 613.6300s
tra_acc64.775002 %,dev_acc78.199997 %,best_dev_acc78.199997 %
*******************************************************************************
--accumulation_steps 20
roberta_large textclassification task train and dev 2 epochs with grad accmulation time is 580.7058
tra_acc64.754999 %,dev_acc77.400002 %,best_dev_acc77.400002 %
*******************************************************************************
--accumulation_steps 50
roberta_large textclassification task train and dev 2 epochs with grad accmulation time is 621.0073
tra_acc53.034997 %,dev_acc71.900002 %,best_dev_acc71.900002 %
*******************************************************************************
--accumulation_steps 80
roberta_large textclassification task train and dev 2 epochs with grad accmulation time is 568.5933
tra_acc43.325001 %,dev_acc67.199997 %,best_dev_acc67.199997 %
*******************************************************************************
--accumulation_steps 80
roberta_large textclassification task train and dev 2 epochs with grad accmulation time is 575.0746
tra_acc44.005001 %,dev_acc67.500000 %,best_dev_acc67.500000 %
*******************************************************************************
--accumulation_steps 0
roberta_large textclassification task train and dev 2 epochs time is 718.4363s
tra_acc74.285001 %,dev_acc73.199997 %,best_dev_acc73.199997 %
*******************************************************************************
--accumulation_steps 0
roberta_large textclassification task train and dev 2 epochs time is 694.9744
tra_acc74.559999 %,dev_acc74.000000 %,best_dev_acc74.000000 %
Single machine, single GPU
*******************************************************************************
trian and eval model time is 1023.3577s
tra_acc64.715000 %,dev_acc71.400000 %,best_dev_acc71.400000 %
*******************************************************************************
trian and eval model time is 1034.7063
tra_acc72.760000 %,dev_acc74.300000 %,best_dev_acc74.300000 %
*******************************************************************************
Conclusions:
Single 3090: 1029 s
Dual 3090: 707 s, a 45.5% speedup
Dual 3090 + gradient accumulation: 580 s, a 77.4% speedup over the single card and a 21.9% speedup over plain dual-card training
With this dataset and model, accumulation_steps = 10 gives the best result: roughly a 77.4% speedup relative to a single card, and dual-card training with gradient accumulation is about 21.9% faster than dual-card training without it, with no drop in accuracy. That makes this trick well worth using.