Summarized from CS231n
Tensor: Like a numpy array, but can run on GPU
Autograd: Package for building computational graphs out of Tensors, and automatically computing gradients
Module: A neural network layer; may store state or learnable weights
torch.clamp(input, min, max, out=None) → Tensor
Clamps every element of the input tensor into the range [min, max] and returns the result as a new tensor.
      | min, if x_i < min
y_i = | x_i, if min <= x_i <= max
      | max, if x_i > max
- input (Tensor) – the input tensor
- min (Number) – lower bound of the clamping range
- max (Number) – upper bound of the clamping range
- out (Tensor, optional) – the output tensor
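A quick sketch of clamp's behavior on a small tensor:

```python
import torch

t = torch.tensor([-1.5, 0.0, 0.5, 2.0])
clamped = torch.clamp(t, min=0.0, max=1.0)  # elementwise: values below 0 become 0, above 1 become 1
# clamped is [0.0, 0.0, 0.5, 1.0]
```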
Building a neural network with PyTorch, by example
The most basic version
import torch
device = torch.device('cpu')
# y_pred = w2 * max(w1*x, 0)
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in, device = device)
w1 = torch.randn(D_in, H, device = device)
w2 = torch.randn(H, D_out, device = device)
y = torch.randn(N, D_out, device = device)
learning_rate = 1e-6
for t in range(500):
    # Forward pass
    h = x.mm(w1)
    h_relu = h.clamp(min = 0)        # max(w1 * x, 0)
    y_pred = h_relu.mm(w2)
    loss = (y_pred - y).pow(2).sum() # L2 loss

    # Backward pass: gradients derived by hand
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()     # copy, so the next line can zero entries in place
    grad_h[h < 0] = 0                # ReLU passes gradient only where its input was positive
    grad_w1 = x.t().mm(grad_h)

    # Gradient descent step
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
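The hand-derived gradients above can be checked against autograd on a tiny instance of the same two-layer net (a sketch with small shapes so it runs instantly):

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 3)
y = torch.randn(4, 2)
w1 = torch.randn(3, 5, requires_grad=True)
w2 = torch.randn(5, 2, requires_grad=True)

# Forward pass (autograd records the graph)
h = x.mm(w1)
h_relu = h.clamp(min=0)
y_pred = h_relu.mm(w2)
loss = (y_pred - y).pow(2).sum()

# Manual backward pass, exactly as in the notes
with torch.no_grad():
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

# Autograd backward pass fills w1.grad and w2.grad; they should match
loss.backward()
```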
Improvement 1: autograd
Set requires_grad = True when creating the weight tensors
import torch
device = torch.device('cuda:0') # run on the GPU
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in, device = device)
w1 = torch.randn(D_in, H, device = device, requires_grad = True) # with requires_grad = True, PyTorch builds a computational graph for w1
w2 = torch.randn(H, D_out, device = device, requires_grad = True) # with requires_grad = True, PyTorch builds a computational graph for w2
y = torch.randn(N, D_out, device = device)
learning_rate = 1e-6
for t in range(500):
    y_pred = x.mm(w1).clamp(min = 0).mm(w2) # PyTorch stores the intermediate values in the graph, so we no longer save them by hand
    loss = (y_pred - y).pow(2).sum()
    loss.backward()           # backprop: use the graph to compute the gradients of loss w.r.t. w1 and w2
    with torch.no_grad():     # do not build a graph for this part
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()       # methods ending in an underscore modify their caller in place
        w2.grad.zero_()
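Why the zero_() calls matter: .grad accumulates across backward() calls, so without resetting it, the next update would use the sum of the old and new gradients. A minimal sketch:

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

loss = (w * 3.0).sum()
loss.backward()
first_grad = w.grad.clone()   # d(3w)/dw = 3

loss = (w * 3.0).sum()
loss.backward()               # accumulates: grad is now 3 + 3 = 6
accumulated = w.grad.clone()

w.grad.zero_()                # in-place reset, as in the training loop
```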
Improvement 2: using torch.nn
torch.nn: a higher-level abstraction over neural networks
import torch
import torch.nn as nn
device = torch.device('cuda:0') # run on the GPU
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in, device = device)
y = torch.randn(N, D_out, device = device)
model = nn.Sequential(  # define the model as a sequence of layers
    nn.Linear(D_in, H), # each layer holds learnable weights; this one's weight has shape D_in x H
    nn.ReLU(),
    nn.Linear(H, D_out) # w2 has shape H x D_out
).to(device)            # move the model's parameters onto the same device as x and y
learning_rate = 1e-2
for t in range(500):
    y_pred = model(x)                        # run the model on input x to get y_pred
    loss = nn.functional.mse_loss(y_pred, y) # library loss function; see the notes after the code block
    loss.backward()                          # compute the gradients of loss w.r.t. both linear layers' weights; these learnable parameters are tensors created with requires_grad = True
    with torch.no_grad():
        for param in model.parameters():     # model.parameters() yields all learnable parameters in the model
            param -= learning_rate * param.grad
    model.zero_grad()                        # reset the gradients of the model's parameters to 0
nn.functional.mse_loss(y_pred, y, reduce = True, size_average = True)
Many loss functions take the two boolean arguments size_average and reduce. A loss function normally evaluates a whole batch at once, so the raw result is a vector of shape (batch_size,).
- If reduce = False, size_average is ignored and the loss is returned in vector form
- If reduce = True, a scalar is returned:
  a) if size_average = True, loss.mean() is returned;
  b) if size_average = False, loss.sum() is returned.
Note: by default, reduce = True and size_average = True. In current PyTorch both flags are deprecated in favor of a single reduction argument ('none', 'mean', or 'sum').
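A sketch of the three reduction modes, written with the modern reduction argument ('none' corresponds to reduce=False, 'mean' to the old default, 'sum' to reduce=True with size_average=False):

```python
import torch
import torch.nn.functional as F

y_pred = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([1.0, 2.0, 5.0])

per_element = F.mse_loss(y_pred, y, reduction='none')  # vector of squared errors: [0, 0, 4]
mean_loss = F.mse_loss(y_pred, y, reduction='mean')    # (0 + 0 + 4) / 3
sum_loss = F.mse_loss(y_pred, y, reduction='sum')      # 0 + 0 + 4
```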
Improvement 3: using an optimizer
import torch
import torch.optim as optim
import torch.nn as nn
device = torch.device('cuda:0') # run on the GPU
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in, device = device)
y = torch.randn(N, D_out, device = device)
model = nn.Sequential(
    nn.Linear(D_in, H),
    nn.ReLU(),
    nn.Linear(H, D_out)
).to(device)
learning_rate = 1e-6
optimizer = optim.Adam(model.parameters(), lr = learning_rate) # instantiate an Adam optimizer; see the notes below the code block
for t in range(500):
    y_pred = model(x)
    loss = nn.functional.mse_loss(y_pred, y)
    loss.backward()       # compute the .grad of every (learnable) parameter in model.parameters()
    optimizer.step()      # update the learnable parameters using the gradients just computed
    optimizer.zero_grad() # reset the gradients of the learnable parameters to 0; see the notes below the code block
class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
- params (iterable): an iterable of parameters to optimize, or dicts defining parameter groups
- lr (float, optional): learning rate (default: 1e-3)
- betas (Tuple[float, float], optional): coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999))
- eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
- step(closure=None): performs a single optimization step
  - closure (callable, optional): a closure that re-evaluates the model and returns the loss
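A minimal end-to-end use of this constructor and step(): Adam minimizing the toy objective f(w) = (w - 5)^2, with the defaults written out explicitly.

```python
import torch
import torch.optim as optim

w = torch.tensor([0.0], requires_grad=True)
optimizer = optim.Adam([w], lr=0.1, betas=(0.9, 0.999), eps=1e-8)

for _ in range(500):
    optimizer.zero_grad()
    loss = (w - 5.0).pow(2).sum()
    loss.backward()
    optimizer.step()  # w ends up near the minimizer, 5.0
```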
model.zero_grad()
optimizer.zero_grad()
Both of these reset the gradients of the model's parameters to 0.
When optimizer = optim.Optimizer(net.parameters()), the two are equivalent, where Optimizer is any optimizer such as Adam or SGD.
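A small check of that equivalence (note that recent PyTorch versions may set .grad to None rather than filling it with zeros):

```python
import torch
import torch.nn as nn
import torch.optim as optim

net = nn.Linear(3, 1)
optimizer = optim.SGD(net.parameters(), lr=0.1)

net(torch.randn(2, 3)).sum().backward()
had_grads = all(p.grad is not None for p in net.parameters())

optimizer.zero_grad()  # same effect here as net.zero_grad()
cleared = all(p.grad is None or p.grad.abs().sum().item() == 0.0
              for p in net.parameters())
```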
Improvement 4: using torch.nn.Module
Define a subclass of torch.nn.Module
import torch
import torch.nn as nn
import torch.optim as optim
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = nn.Linear(D_in, H)
        self.linear2 = nn.Linear(H, D_out)

    def forward(self, x): # x: the input data
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred
    # no need to define backward(); autograd generates it automatically
device = torch.device('cuda:0') # run on the GPU
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in, device = device)
y = torch.randn(N, D_out, device = device)
model = TwoLayerNet(D_in, H, D_out).to(device) # move the model's parameters onto the same device as x and y
optimizer = optim.SGD(model.parameters(), lr = 1e-4)
for t in range(500):
    y_pred = model(x)
    loss = nn.functional.mse_loss(y_pred, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
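A self-contained check (CPU, small shapes) of what makes the Module subclass work: submodules assigned in __init__ are registered automatically, so parameters() yields both layers' weights and biases, and calling the model invokes forward():

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = nn.Linear(D_in, H)  # registered as a submodule
        self.linear2 = nn.Linear(H, D_out)

    def forward(self, x):
        return self.linear2(self.linear1(x).clamp(min=0))

model = TwoLayerNet(4, 3, 2)
num_param_tensors = len(list(model.parameters()))  # 4: two weights + two biases
out = model(torch.randn(5, 4))                     # model(x) calls forward(x)
```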