Summarized from CS231n

Tensor: Like a numpy array, but can run on GPU

Autograd: Package for building computational graphs out of Tensors, and automatically computing gradients

Module: A neural network layer; may store state or learnable weights
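
The three abstractions above can be seen side by side in a few lines. This is only a minimal sketch on the CPU; the values and layer sizes are arbitrary choices made for illustration.

```python
import torch

# Tensor: behaves much like a numpy array (and could be moved to a GPU).
a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Autograd: the operations applied to `a` are recorded in a graph ...
b = (a ** 2).sum()
# ... and backward() fills a.grad with db/da = 2 * a.
b.backward()
print(a.grad)  # tensor([2., 4., 6.])

# Module: a layer that owns its learnable weights.
layer = torch.nn.Linear(3, 2)
print([tuple(p.shape) for p in layer.parameters()])  # [(2, 3), (2,)]
```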

torch.clamp(input, min, max, out=None) → Tensor

Clamps every element of the tensor input into the range [min, max] and returns the result as a new tensor.

$$
y_i = \begin{cases}
\text{min}, & x_i < \text{min} \\
x_i, & \text{min} \le x_i \le \text{max} \\
\text{max}, & x_i > \text{max}
\end{cases}
$$
  • input (Tensor) – the input tensor
  • min (Number) – lower bound of the range
  • max (Number) – upper bound of the range
  • out (Tensor, optional) – the output tensor
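
As a quick illustration of the definition above, the sketch below clamps a small hand-picked tensor, first on both sides and then with only a lower bound. The one-sided form is what the network code later in these notes uses as a ReLU. The input values are arbitrary.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.7, 3.0])

# Two-sided clamp: values below -1 become -1, values above 1 become 1.
both = torch.clamp(x, min=-1.0, max=1.0)

# One-sided clamp: only a lower bound, which is exactly ReLU.
relu = x.clamp(min=0)

print(both)
print(relu)
print(x)  # the input is left unchanged; clamp returns a new tensor
```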

Example: building a neural network with PyTorch


The most basic code
import torch

device = torch.device('cpu')

# y_pred = w2 * max(w1*x, 0)
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in, device = device)
w1 = torch.randn(D_in, H, device = device)
w2 = torch.randn(H, D_out, device = device)
y = torch.randn(N, D_out, device = device)

learning_rate = 1e-6
for t in range(500):
    h = x.mm(w1)
    h_relu = h.clamp(min = 0)  # max(x * w1, 0)
    y_pred = h_relu.mm(w2)
    loss = (y_pred - y).pow(2).sum()  # L2 loss
    
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()  # copy, so the masking on the next line does not overwrite grad_h_relu
    grad_h[h < 0] = 0  # ReLU passes no gradient where its input was negative
    grad_w1 = x.t().mm(grad_h)
    
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
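
One way to gain confidence in the hand-derived backward pass above is to compare it with autograd on a small problem. The sketch below does that. The small sizes, the fixed seed, and the use of float64 are assumptions made only to keep the check fast and numerically tight.

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 8, 20, 10, 5  # small sizes, chosen only for this check
x = torch.randn(N, D_in, dtype=torch.float64)
y = torch.randn(N, D_out, dtype=torch.float64)
w1 = torch.randn(D_in, H, dtype=torch.float64, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=torch.float64, requires_grad=True)

# Forward pass, identical to the loop body above.
h = x.mm(w1)
h_relu = h.clamp(min=0)
y_pred = h_relu.mm(w2)
loss = (y_pred - y).pow(2).sum()

# Hand-derived backward pass (no graph needed for these steps).
with torch.no_grad():
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h = grad_y_pred.mm(w2.t()).clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

# Autograd's answer for the same loss.
loss.backward()
print(torch.allclose(grad_w1, w1.grad), torch.allclose(grad_w2, w2.grad))
```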
Code improvement 1

When creating the weight tensors, set requires_grad = True so that autograd tracks the operations applied to them.

import torch

device = torch.device('cuda:0')  # run on the first GPU (requires a CUDA device)

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in, device = device)
w1 = torch.randn(D_in, H, device = device, requires_grad = True)  # requires_grad = True: autograd records the operations on w1
w2 = torch.randn(H, D_out, device = device, requires_grad = True)  # requires_grad = True: autograd records the operations on w2
y = torch.randn(N, D_out, device = device)

learning_rate = 1e-6
for t in range(500):
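    y_pred = x.mm(w1).clamp(min = 0).mm(w2)  # autograd keeps the intermediate values in the graph, so we need not store them by hand
    loss = (y_pred - y).pow(2).sum()
    
    loss.backward()  # backpropagate: use the graph to compute the gradients of loss w.r.t. w1 and w2
    with torch.no_grad():  # do not build a graph for the update itself
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()  # methods ending in an underscore modify the tensor in place
        w2.grad.zero_()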
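
The zero_() calls in the loop above matter because backward() adds to .grad rather than overwriting it. The following minimal sketch, with a toy tensor chosen only for illustration, shows the accumulation and the in-place reset.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()   # d/dw sum(3w) = [3, 3]

(w * 3).sum().backward() # no zero_() in between
second = w.grad.clone()  # the new gradient was added: [6, 6]

w.grad.zero_()           # in-place reset, as in the loop above
print(first, second, w.grad)
```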
Code improvement 2: using torch.nn

torch.nn: a higher-level wrapper for building neural networks

import torch
import torch.nn as nn

device = torch.device('cuda:0')  # run on the first GPU (requires a CUDA device)

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in, device = device)
y = torch.randn(N, D_out, device = device)

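model = nn.Sequential(  # define the model as a sequence of layers
    nn.Linear(D_in, H),  # each Linear layer holds learnable weights; this one stores a weight of shape (H, D_in) and a bias of shape (H,)
    nn.ReLU(),
    nn.Linear(H, D_out)  # weight of shape (D_out, H), bias of shape (D_out,)
).to(device)  # move the parameters to the same device as x and y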

learning_rate = 1e-2
for t in range(500):
    y_pred = model(x)  # run the model on the input x to get y_pred
    loss = nn.functional.mse_loss(y_pred, y)  # library function mse_loss; see the notes after the code
    
    loss.backward()  # compute the gradients of loss w.r.t. the parameters of both linear layers; these are torch Tensors with requires_grad = True
    
    with torch.no_grad():
        for param in model.parameters():  # model.parameters() yields every learnable parameter of the model
            param -= learning_rate * param.grad
        model.zero_grad()  # reset the gradients of all model parameters
nn.functional.mse_loss(y_pred, y, reduce = True, size_average = True)

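Many loss functions take two boolean parameters, size_average and reduce. A loss function usually operates on a whole batch at once, so before any reduction its result is a tensor of individual losses rather than a single number (for mse_loss, a tensor with the same shape as y_pred).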

  1. If reduce = False, size_average is ignored and the unreduced loss tensor is returned.
  2. If reduce = True, the loss is returned as a scalar:
     a) if size_average = True, loss.mean() is returned;
     b) if size_average = False, loss.sum() is returned.

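Note: by default, reduce = True and size_average = True. Recent PyTorch releases deprecate both parameters in favor of a single reduction argument, which takes 'none', 'mean' (the default), or 'sum'.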
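
The three behaviours described above can be reproduced with the reduction argument of mse_loss. The sketch below uses a tiny hand-made batch so the numbers can be checked by eye: the squared errors are [[1, 0], [0, 4]], which sum to 5 and average to 1.25.

```python
import torch
import torch.nn.functional as F

y_pred = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
y = torch.tensor([[0.0, 2.0], [3.0, 6.0]])

# No reduction: one squared error per element, same shape as y_pred.
per_element = F.mse_loss(y_pred, y, reduction='none')
# Sum of all squared errors (the old reduce=True, size_average=False).
total = F.mse_loss(y_pred, y, reduction='sum')
# Mean of all squared errors (the old defaults).
mean = F.mse_loss(y_pred, y, reduction='mean')

print(per_element)
print(total.item(), mean.item())  # 5.0 1.25
```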

Code improvement 3: using an optimizer
import torch
import torch.optim as optim
import torch.nn as nn

device = torch.device('cuda:0')  # run on the first GPU (requires a CUDA device)

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in, device = device)
y = torch.randn(N, D_out, device = device)

model = nn.Sequential(
    nn.Linear(D_in, H),
    nn.ReLU(),
    nn.Linear(H, D_out)
).to(device)  # move the parameters to the same device as x and y

learning_rate = 1e-6
optimizer = optim.Adam(model.parameters(), lr = learning_rate)  # instantiate an Adam optimizer; see the notes after the code

for t in range(500):
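    y_pred = model(x)
    loss = nn.functional.mse_loss(y_pred, y)
    
    loss.backward()  # compute .grad for every parameter in model.parameters() (all of them learnable)
    
    optimizer.step()  # update the learnable parameters using the gradients just computed
    optimizer.zero_grad()  # reset the gradients of the parameters held by the optimizer; see the notes after the code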
class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
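  1. params (iterable): an iterable of the parameters to optimize, or of dicts defining parameter groups
  2. lr (float, optional): learning rate (default: 1e-3)
  3. betas (Tuple[float, float], optional): coefficients used for computing running averages of the gradient and of its square (default: (0.9, 0.999))
  4. eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
  5. weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
  • step(closure=None): performs a single optimization step
  • closure (callable, optional): a closure that re-evaluates the model and returns the loss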
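
The params entry above mentions dicts that define parameter groups. As a minimal sketch of what that looks like, the example below gives the second linear layer its own learning rate while the first layer falls back to the optimizer-wide default. The layer sizes and learning rates are arbitrary values chosen for illustration.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

optimizer = optim.Adam(
    [
        {'params': model[0].parameters()},              # uses the default lr below
        {'params': model[2].parameters(), 'lr': 1e-2},  # overrides lr for this group
    ],
    lr=1e-3,
)

# Every group carries its own copy of the hyperparameters listed above.
for group in optimizer.param_groups:
    print(group['lr'], group['betas'], group['eps'], group['weight_decay'])
```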
model.zero_grad()
optimizer.zero_grad()

First, both calls reset the gradients of the model's parameters (to zero or to None, depending on the PyTorch version).

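When the optimizer was built over all of the model's parameters, i.e. optimizer = optim.Optimizer(net.parameters()), the two calls are equivalent. Here Optimizer stands for Adam, SGD, or any other optimizer.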
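
The equivalence stops holding when the optimizer manages only some of the parameters. The sketch below illustrates this with a hypothetical setup in which the optimizer is given just the last layer. The helper cleared is an assumption introduced here so the check works whether gradients are reset to zeros or to None.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def cleared(p):
    """True if p.grad was reset, either to None or to all zeros."""
    return p.grad is None or not bool(p.grad.any())

model = nn.Sequential(nn.Linear(4, 3), nn.Linear(3, 2))
# Hypothetical setup: the optimizer is given only the last layer's parameters.
optimizer = optim.SGD(model[1].parameters(), lr=0.1)

model(torch.ones(5, 4)).sum().backward()
first_layer_grad = model[0].weight.grad.clone()

optimizer.zero_grad()  # touches only the parameters the optimizer manages
print(cleared(model[1].weight))                             # True
print(torch.equal(model[0].weight.grad, first_layer_grad))  # True: left untouched

model.zero_grad()      # touches every parameter of the model
print(all(cleared(p) for p in model.parameters()))          # True
```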

Code improvement 4: using torch.nn.Module

Define a subclass of torch.nn.Module

import torch
import torch.nn as nn
import torch.optim as optim


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = nn.Linear(D_in, H)
        self.linear2 = nn.Linear(H, D_out)
       
    def forward(self, x):  # x: the input data
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred
    # No backward() method is needed; autograd derives it automatically

device = torch.device('cuda:0')  # run on the first GPU (requires a CUDA device)

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in, device = device)
y = torch.randn(N, D_out, device = device)

model = TwoLayerNet(D_in, H, D_out).to(device)  # move the parameters to the same device as x and y

optimizer = optim.SGD(model.parameters(), lr = 1e-4)
for t in range(500):
    y_pred = model(x)
    loss = nn.functional.mse_loss(y_pred, y)
    
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
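
Assigning layers as attributes in __init__ is enough for the Module to register their weights, which is why model.parameters() can hand them to the optimizer. The sketch below repeats the TwoLayerNet class from above, only so that the block runs on its own, and lists what was registered. It runs on the CPU for simplicity.

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = nn.Linear(D_in, H)
        self.linear2 = nn.Linear(H, D_out)

    def forward(self, x):
        h_relu = self.linear1(x).clamp(min=0)
        return self.linear2(h_relu)

N, D_in, H, D_out = 64, 1000, 100, 10
model = TwoLayerNet(D_in, H, D_out)

# Parameters are named after the attributes they were assigned to.
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
for name, shape in shapes.items():
    print(name, shape)

# Calling the model invokes forward() on a batch of N inputs.
out = model(torch.randn(N, D_in))
print(tuple(out.shape))  # (64, 10)
```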