What initial values the weights are given often determines how quickly a neural network converges, and sometimes whether learning succeeds at all.
Can the initial weights be set to 0?
Weight decay, a technique that learns while keeping the weight values small, effectively suppresses overfitting and improves generalization. Since small weights are what we want, it is natural to start from small initial values as well, for example 0.01 * np.random.randn(10, 100), which generates Gaussian-distributed weights with a standard deviation of 0.01.
If small is good, why not go all the way and set the initial weights to 0?
Think about what happens in backpropagation. Take a 2-layer network, for example, and suppose the weights of the first and second layers are all 0. In the forward pass, because the weights coming out of the input layer are 0, every neuron in the second layer is passed exactly the same value. And if all the second-layer neurons receive the same input, then in backpropagation all of the second-layer weights receive exactly the same update.
The weights are therefore updated to identical values and stay duplicated.
This defeats the purpose of having many different weights, so the initial values must be generated randomly: the symmetry between neurons has to be broken.
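To make the symmetry argument concrete, here is a minimal sketch (my own illustration, not code from the book) of a tiny 2-layer network whose weights all start at 0. After one forward and backward pass, every row of the second layer's weight gradient is identical, so gradient descent can never make the hidden neurons differ from one another:

```python
import numpy as np

# Minimal sketch: a 2-layer network whose weights are all initialized to 0.
np.random.seed(0)
x = np.random.randn(1, 3)                # one input sample with 3 features
W1 = np.zeros((3, 4))                    # input -> hidden weights
W2 = np.zeros((4, 2))                    # hidden -> output weights

h = 1 / (1 + np.exp(-x.dot(W1)))         # hidden activations: all exactly 0.5
y = h.dot(W2)                            # outputs: all 0
dy = y - np.array([[1.0, 0.0]])          # gradient of a squared-error-style loss

dW2 = h.T.dot(dy)                        # every row of dW2 is identical
dh = dy.dot(W2.T) * h * (1 - h)          # all zeros, because W2 is all zeros
dW1 = x.T.dot(dh)                        # so dW1 is all zeros as well

print(dW2)  # identical rows: every hidden neuron receives the same update
```

However many update steps you run from this starting point, the hidden neurons remain exact copies of each other, which is precisely the loss of expressiveness described above.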
The distribution of activations in the hidden layers
Consider an experiment from Stanford's CS231n course: build a 5-layer neural network with 100 neurons in each layer, with the initial weights drawn from a Gaussian distribution with a standard deviation of 1. Then randomly generate 1000 data points from a Gaussian distribution as input and feed them through the 5-layer network. The activation function is the sigmoid, and we call the outputs of the activation function the "activations". The histograms of each layer's activations are shown below.
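Here is a minimal sketch of the experiment just described (close in spirit to the code that accompanies the book): 1000 Gaussian inputs are pushed through five 100-neuron sigmoid layers whose weights are drawn from a Gaussian with standard deviation 1, and a histogram of each layer's activations is plotted.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.random.randn(1000, 100)      # 1000 input samples with 100 features each
node_num = 100                      # neurons per layer
hidden_layer_size = 5
activations = {}

for i in range(hidden_layer_size):
    if i != 0:
        x = activations[i - 1]      # output of the previous layer becomes the input
    w = np.random.randn(node_num, node_num) * 1   # std = 1 (try 0.01 here as well)
    activations[i] = sigmoid(np.dot(x, w))

# With std = 1 the histograms pile up at 0 and 1; with std = 0.01 they pile up at 0.5.
for i, a in activations.items():
    plt.subplot(1, hidden_layer_size, i + 1)
    plt.title(str(i + 1) + "-layer")
    plt.hist(a.flatten(), 30, range=(0, 1))
plt.show()
```

Changing the 1 in the weight line to 0.01 reproduces the second experiment discussed below.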
As the figure shows, the activations in every layer are concentrated near 0 and 1. The sigmoid used here is an S-shaped function, and as its output approaches 0 (or 1), its derivative approaches 0. A distribution biased toward 0 and 1 therefore causes the gradients in backpropagation to keep shrinking until they disappear. This is known as the vanishing gradient problem.
In deep networks with many layers, vanishing gradients become even more serious. Next, set the standard deviation of the weights to 0.01 and run the same experiment; the result is shown below.
This time the activations concentrate around 0.5, so vanishing gradients do not occur. However, the distribution is strongly biased, which signals a serious problem with expressive power.
Why? Because if many neurons output almost the same value, there is no point in having many of them: if 100 neurons all output nearly the same value, one neuron could express essentially the same thing. A biased activation distribution therefore leads to the problem of "limited expressiveness".
So how do we solve this? Xavier Glorot and his colleagues worked this out long ago.
To make the activations of every layer show a distribution of similar spread, they derived the following rule: if the previous layer has n nodes, initialize the weights from a distribution with a standard deviation of 1/√n (the Xavier initial value).
With the Xavier initial value, the activations in each layer become much more spread out.
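In code, switching the histogram experiment above to the Xavier initial value only changes the line that generates the weights; node_num is the number of nodes in the previous layer (100 in this experiment):

```python
# Xavier initial value: standard deviation 1 / sqrt(n), where n is the number
# of nodes in the previous layer.
w = np.random.randn(node_num, node_num) / np.sqrt(node_num)
```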
Initial weight values for ReLU
The Xavier initial value is derived on the assumption that the activation function is linear. Because sigmoid and tanh are symmetric about their center and roughly linear near it, the Xavier initial value suits them well.
When the activation function is ReLU, however, the usual recommendation is a ReLU-specific initial value, the one proposed by Kaiming He et al., known as the "He initial value". When the previous layer has n nodes, the He initial value uses a Gaussian distribution with a standard deviation of √(2/n).
Intuitively, because ReLU sets the whole negative region to 0, a coefficient twice as large is needed to keep the activations spread out.
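The corresponding change for the He initial value is again a single line, using the same node_num as above:

```python
# He initial value: standard deviation sqrt(2 / n), where n is the number of
# nodes in the previous layer -- twice the variance of the Xavier initial value.
w = np.random.randn(node_num, node_num) * np.sqrt(2.0 / node_num)
```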
After switching the activation function to ReLU and testing the three distributions (std = 0.01, the Xavier initial value, and the He initial value), we get the following results:
Looking at the results, when std = 0.01 the activations in every layer are extremely small. When only tiny values flow through the network, the weight gradients in backpropagation are just as tiny. This is a serious problem, and in practice learning barely makes any progress.
Next are the results with the Xavier initial value: here the bias in the distribution grows little by little as the layers deepen, the activations become increasingly skewed, and vanishing gradients appear during training.
With the He initial value, by contrast, the spread of the distribution stays the same in every layer. Because the spread of the data is preserved even as the layers deepen, values of an appropriate size also flow backward during backpropagation.
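As a quick numeric check of these three outcomes (a sketch of my own that complements the histograms), run the same 5-layer, 100-neuron setup with ReLU activations and print the standard deviation of the final layer's activations for each weight scale:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

# The three weight scales compared above, applied to the same 5-layer,
# 100-neuron setup, but with ReLU activations this time.
scales = {
    "std = 0.01": lambda n: np.random.randn(n, n) * 0.01,
    "Xavier":     lambda n: np.random.randn(n, n) / np.sqrt(n),
    "He":         lambda n: np.random.randn(n, n) * np.sqrt(2.0 / n),
}

for name, make_w in scales.items():
    x = np.random.randn(1000, 100)
    for _ in range(5):
        x = relu(np.dot(x, make_w(100)))
    # std = 0.01 collapses toward zero, Xavier shrinks layer by layer,
    # and He keeps the spread of the activations roughly constant.
    print(name, x.std())
```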
To sum up: when the activation function is ReLU, use the He initial value; when it is an S-shaped function such as sigmoid or tanh, use the Xavier initial value.
While recently revisiting the basics of deep learning, I looked through a lot of books. Naturally, the first thing to check is the table of contents, and I suddenly came across one whose contents cover only fully connected networks and basic convolutional networks, which struck me as unusual.
Many authors, wanting their books to look rich and comprehensive, chase breadth and write about everything from installing Python to perceptrons, fully connected networks, convolutions, object detection, and even generative adversarial networks, while skipping many of the details, which makes it hard to truly understand anything.
I have to praise the Japanese spirit of craftsmanship here: 《深度学习入门:基于Python的理论与实现》 (Deep Learning from Scratch) by 斋藤康毅 (Koki Saitoh) is an excellent introductory book, concrete, detailed, and easy to follow, well suited both for complete beginners and for anyone revisiting the fundamentals. Its table of contents is reproduced below.
Chapter 3 Neural Networks
3.1 From Perceptrons to Neural Networks
3.1.1 An Example of a Neural Network
3.1.2 Reviewing the Perceptron
3.1.3 Enter the Activation Function
3.2 Activation Functions
3.2.1 The sigmoid Function
3.2.2 Implementing the Step Function
3.2.3 Graph of the Step Function
3.2.4 Implementing the sigmoid Function
3.2.5 Comparing the sigmoid Function and the Step Function
3.2.6 Nonlinear Functions
3.2.7 The ReLU Function
3.3 Operations on Multidimensional Arrays
3.3.1 Multidimensional Arrays
3.3.2 Matrix Multiplication
3.3.3 Dot Products in a Neural Network
3.4 Implementing a 3-Layer Neural Network
3.4.1 Notation
3.4.2 Implementing Signal Transmission Between Layers
3.4.3 Implementation Summary
3.5 Designing the Output Layer
3.5.1 The Identity Function and the softmax Function
3.5.2 Issues When Implementing the softmax Function
3.5.3 Characteristics of the softmax Function
3.5.4 Number of Neurons in the Output Layer
3.6 Handwritten Digit Recognition
3.6.1 The MNIST Dataset
3.6.2 Inference with a Neural Network
3.6.3 Batch Processing
3.7 Summary
Chapter 4 Training Neural Networks
4.1 Learning from Data
4.1.1 Data-Driven
4.1.2 Training Data and Test Data
4.2 Loss Functions
4.2.1 Mean Squared Error
4.2.2 Cross-Entropy Error
4.2.3 Mini-Batch Learning
4.2.4 Implementing the Mini-Batch Version of Cross-Entropy Error
4.2.5 Why Define a Loss Function?
4.3 Numerical Differentiation
4.3.1 Derivatives
4.3.2 Examples of Numerical Differentiation
4.3.3 Partial Derivatives
4.4 Gradients
4.4.1 The Gradient Method
4.4.2 Gradients of a Neural Network
4.5 Implementing the Learning Algorithm
4.5.1 A Class for a 2-Layer Neural Network
4.5.2 Implementing Mini-Batch Training
4.5.3 Evaluation with Test Data
4.6 Summary
Chapter 5 Backpropagation
5.1 Computational Graphs
5.1.1 Solving Problems with Computational Graphs
5.1.2 Local Computation
5.1.3 Why Use Computational Graphs?
5.2 The Chain Rule
5.2.1 Backward Pass of a Computational Graph
5.2.2 What Is the Chain Rule?
5.2.3 The Chain Rule and Computational Graphs
5.3 Backpropagation
5.3.1 Backpropagation at an Addition Node
5.3.2 Backpropagation at a Multiplication Node
5.3.3 The Apple Example
5.4 Implementing Simple Layers
5.4.1 Implementing a Multiplication Layer
5.4.2 Implementing an Addition Layer
5.5 Implementing Activation Function Layers
5.5.1 The ReLU Layer
5.5.2 The Sigmoid Layer
5.6 Implementing the Affine and Softmax Layers
5.6.1 The Affine Layer
5.6.2 The Batch Version of the Affine Layer
5.6.3 The Softmax-with-Loss Layer
5.7 Implementing Backpropagation
5.7.1 The Overall Picture of Neural Network Training
5.7.2 Implementing a Neural Network That Supports Backpropagation
5.7.3 Gradient Check for Backpropagation
5.7.4 Training with Backpropagation
5.8 Summary
Chapter 6 Techniques Related to Training
6.1 Updating Parameters
6.1.1 The Story of an Explorer
6.1.2 SGD
6.1.3 Drawbacks of SGD
6.1.4 Momentum
6.1.5 AdaGrad
6.1.6 Adam
6.1.7 Which Update Method Should We Use?
6.1.8 Comparing Update Methods on the MNIST Dataset
6.2 Initial Weight Values
6.2.1 Can the Initial Weights Be Set to 0?
6.2.2 Distribution of Activations in the Hidden Layers
6.2.3 Initial Weight Values for ReLU
6.2.4 Comparing Weight Initializations on the MNIST Dataset
6.3 Batch Normalization
6.3.1 The Batch Normalization Algorithm
6.3.2 Evaluating Batch Normalization
6.4 Regularization
6.4.1 Overfitting
6.4.2 Weight Decay
6.4.3 Dropout
6.5 Validating Hyperparameters
6.5.1 Validation Data
6.5.2 Optimizing Hyperparameters
6.5.3 Implementing Hyperparameter Optimization
6.6 Summary
Chapter 7 Convolutional Neural Networks
7.1 Overall Architecture
7.2 The Convolution Layer
7.2.1 Problems with Fully Connected Layers
7.2.2 The Convolution Operation
7.2.3 Padding
7.2.4 Stride
7.2.5 Convolution on 3-Dimensional Data
7.2.6 Thinking in Blocks
7.2.7 Batch Processing
7.3 The Pooling Layer
7.4 Implementing the Convolution and Pooling Layers
7.4.1 4-Dimensional Arrays
7.4.2 Expansion with im2col
7.4.3 Implementing the Convolution Layer
7.4.4 Implementing the Pooling Layer
7.5 Implementing a CNN
7.6 Visualizing a CNN
7.6.1 Visualizing the First-Layer Weights
7.6.2 Hierarchical Information Extraction
7.7 Representative CNNs
7.7.1 LeNet
7.7.2 AlexNet
7.8 Summary
Chapter 8 Deep Learning
8.1 Going Deeper
8.1.1 Toward Deeper Networks
8.1.2 Further Improving Recognition Accuracy
8.1.3 Motivations for Deeper Layers
8.2 A Brief History of Deep Learning
8.2.1 ImageNet
8.2.2 VGG
8.2.3 GoogLeNet
8.2.4 ResNet
8.3 Speeding Up Deep Learning
8.3.1 Problems to Tackle
8.3.2 Acceleration with GPUs
8.3.3 Distributed Training
8.3.4 Reducing the Bit Width of Arithmetic Precision
8.4 Applications of Deep Learning
8.4.1 Object Detection
8.4.2 Image Segmentation
8.4.3 Generating Image Captions
8.5 The Future of Deep Learning
8.5.1 Image Style Transfer
8.5.2 Image Generation
8.5.3 Autonomous Driving
8.5.4 Deep Q-Networks (Reinforcement Learning)
8.6 Summary
Appendix A The Computational Graph of the Softmax-with-Loss Layer
A.1 Forward Propagation
A.2 Backpropagation
A.3 Summary