Table of contents
- The Reuters dataset
- 3-12 Loading the dataset
- 3-13 Decoding indices back into newswire text
- 3-14 Encoding the data
- Vectorizing the data
- Vectorizing the labels
- 3-15 Defining the multiclass model
- 3-16 Compiling the model
- 3-17 Validating the approach
- 3-21 Training the network with fewer epochs
- Comparing the result with a completely random classifier
- 3-22 Generating new predictions on the test set
- 3-23 A model with an information bottleneck
- Comparison and conclusions
- Closing remarks
The Reuters dataset
For this newswire dataset, the task is a multiclass classification problem.
Dataset characteristics:
- It is a text-classification dataset.
- It contains 46 different topics.
- Every topic has at least 10 examples in the training set.
- The dataset ships with Keras and can be loaded directly.
The difference between multiclass classification and binary (0/1) classification, by example (a sketch of the corresponding output layers follows right after this list):
- Binary: is this object a person? A: yes, B: no
- Multiclass: which of the following categories does this object belong to? A: person, B: car, C: airplane, D: 🐟 (fish), ...
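A minimal sketch of how the final layer and loss are typically chosen in each setting; the model names and hidden-layer sizes below are arbitrary illustrations, not code from this walkthrough:
from keras import models, layers
# Binary ("is it X or not?"): a single sigmoid unit + binary_crossentropy
binary_model = models.Sequential()
binary_model.add(layers.Dense(16, activation = 'relu', input_shape = (10000, )))
binary_model.add(layers.Dense(1, activation = 'sigmoid'))
binary_model.compile(optimizer = 'rmsprop', loss = 'binary_crossentropy', metrics = ['accuracy'])
# Multiclass ("which of 46 topics?"): 46 softmax units + categorical_crossentropy
multiclass_model = models.Sequential()
multiclass_model.add(layers.Dense(64, activation = 'relu', input_shape = (10000, )))
multiclass_model.add(layers.Dense(46, activation = 'softmax'))
multiclass_model.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy', metrics = ['accuracy'])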
3-12 Loading the dataset
from keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words = 10000)
# Check that loading succeeded
print(len(train_data))
print(len(test_data))
# print(train_data[10])
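A quick optional sanity check on what was loaded; the expected values assume the standard Reuters split with num_words = 10000:
# num_words = 10000 caps every word index below 10000
print(max(max(sequence) for sequence in train_data))   # at most 9999
print(train_data[10][:10])    # each sample is a list of word indices
print(train_labels[10])       # each label is an integer between 0 and 45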
3-13 Decoding indices back into newswire text
word_index = reuters.get_word_index()
# Reverse the mapping: integer index -> word
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
# Decode the newswire; indices are offset by 3 because 0, 1 and 2 are reserved
# for "padding", "start of sequence" and "unknown"
decoded_newswire = ' '.join(reverse_word_index.get(i-3,'?') for i in train_data[0])
print(decoded_newswire)
print(train_labels[10])  # topic label of sample 10, an integer between 0 and 45
? ? ? said as a result of its december acquisition of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year should be 2 50 to three dlrs reuter 3
3
- Note
(The decoded text reads a little awkwardly, but it is clearly a short news story; the leading '?' marks stand for the reserved/unknown indices mentioned above.)
3-14 Encoding the data
Vectorizing the data
import numpy as np
# Turn each sequence of word indices into a 10,000-dimensional multi-hot vector
def vectorize_sequence(sequences, dimension = 10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results
# Vectorize the samples
x_train = vectorize_sequence(train_data)  # training data
x_test = vectorize_sequence(test_data)    # test data
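After vectorization every sample is a 10,000-dimensional multi-hot vector; a quick check of the result (the shapes assume the full split loaded above):
print(x_train.shape)    # (8982, 10000)
print(x_test.shape)     # (2246, 10000)
# each entry is 1.0 if that word index occurs in the newswire, 0.0 otherwise
print(x_train[0].max(), x_train[0].min())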
Vectorizing the labels
With the data vectorized, we one-hot encode the labels (also called categorical encoding).
def to_one_hot(labels, dimension = 46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.
    return results
# One-hot encode the labels
one_hot_train_labels = to_one_hot(train_labels)  # training set
one_hot_test_labels = to_one_hot(test_labels)    # test set
# The same thing, using Keras' built-in helper
from keras.utils.np_utils import to_categorical
one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)
3-15 Defining the multiclass model
Each newswire in this problem can belong to any of 46 classes. The 16-unit hidden layers used for the previous (binary) dataset would probably be too small to separate that many classes: a layer that narrow can become an information bottleneck, permanently dropping information that later layers need. For that reason we use wider, 64-unit layers in this example.
# Model definition
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(64, activation = 'relu', input_shape = (10000, )))
model.add(layers.Dense(64, activation = 'relu'))
model.add(layers.Dense(46, activation = 'softmax'))  # 46 possible output classes
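As a quick check of the architecture, model.summary() lists the layers and their parameter counts; for a Dense layer the count is simply (input_dim + 1) * units:
model.summary()
# Dense(64): (10000 + 1) * 64 = 640,064 parameters
# Dense(64): (64 + 1) * 64    = 4,160 parameters
# Dense(46): (64 + 1) * 46    = 2,990 parameters
# total: 647,214 trainable parameters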
3-16 Compiling the model
model.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy', metrics = ['accuracy'])
3-17 Validating the approach
# 3-17-1 Set aside 1,000 samples from the training data as a validation set
# (plain Python slicing)
x_val = x_train[:1000]               # the first 1,000 samples form the validation set
partial_x_train = x_train[1000:]     # everything after that is used for training
y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]
# 3-17-2 Train the model
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs = 20,
                    batch_size = 512,
                    validation_data = (x_val, y_val))
Train on 7982 samples, validate on 1000 samples
Epoch 1/20
7982/7982 [==============================] - 1s 182us/step - loss: 2.5544 - accuracy: 0.5014 - val_loss: 1.7032 - val_accuracy: 0.6170
Epoch 2/20
7982/7982 [==============================] - 1s 100us/step - loss: 1.4136 - accuracy: 0.6994 - val_loss: 1.2892 - val_accuracy: 0.7080
Epoch 3/20
7982/7982 [==============================] - 1s 107us/step - loss: 1.0453 - accuracy: 0.7741 - val_loss: 1.1258 - val_accuracy: 0.7470
Epoch 4/20
7982/7982 [==============================] - 1s 95us/step - loss: 0.8166 - accuracy: 0.8254 - val_loss: 1.0081 - val_accuracy: 0.7840
Epoch 5/20
7982/7982 [==============================] - 1s 94us/step - loss: 0.6438 - accuracy: 0.8631 - val_loss: 0.9416 - val_accuracy: 0.8130
Epoch 6/20
7982/7982 [==============================] - 1s 91us/step - loss: 0.5099 - accuracy: 0.8968 - val_loss: 0.8941 - val_accuracy: 0.8200
Epoch 7/20
7982/7982 [==============================] - 1s 104us/step - loss: 0.4163 - accuracy: 0.9136 - val_loss: 0.8934 - val_accuracy: 0.8080
Epoch 8/20
7982/7982 [==============================] - 1s 117us/step - loss: 0.3332 - accuracy: 0.9290 - val_loss: 0.8881 - val_accuracy: 0.8150
Epoch 9/20
7982/7982 [==============================] - 1s 93us/step - loss: 0.2763 - accuracy: 0.9392 - val_loss: 0.8839 - val_accuracy: 0.8200
Epoch 10/20
7982/7982 [==============================] - 1s 85us/step - loss: 0.2371 - accuracy: 0.9464 - val_loss: 0.8970 - val_accuracy: 0.8140
Epoch 11/20
7982/7982 [==============================] - 1s 96us/step - loss: 0.2001 - accuracy: 0.9505 - val_loss: 0.9158 - val_accuracy: 0.8110
Epoch 12/20
7982/7982 [==============================] - 1s 88us/step - loss: 0.1777 - accuracy: 0.9518 - val_loss: 0.9198 - val_accuracy: 0.8110
Epoch 13/20
7982/7982 [==============================] - 1s 81us/step - loss: 0.1604 - accuracy: 0.9543 - val_loss: 0.9159 - val_accuracy: 0.8190
Epoch 14/20
7982/7982 [==============================] - 1s 92us/step - loss: 0.1455 - accuracy: 0.9569 - val_loss: 0.9516 - val_accuracy: 0.8130
Epoch 15/20
7982/7982 [==============================] - 1s 86us/step - loss: 0.1388 - accuracy: 0.9559 - val_loss: 0.9443 - val_accuracy: 0.8190
Epoch 16/20
7982/7982 [==============================] - 1s 80us/step - loss: 0.1306 - accuracy: 0.9546 - val_loss: 1.0283 - val_accuracy: 0.7990
Epoch 17/20
7982/7982 [==============================] - 1s 96us/step - loss: 0.1217 - accuracy: 0.9587 - val_loss: 1.0271 - val_accuracy: 0.8100
Epoch 18/20
7982/7982 [==============================] - 1s 88us/step - loss: 0.1173 - accuracy: 0.9578 - val_loss: 1.0426 - val_accuracy: 0.8070
Epoch 19/20
7982/7982 [==============================] - 1s 86us/step - loss: 0.1151 - accuracy: 0.9563 - val_loss: 1.0390 - val_accuracy: 0.8090
Epoch 20/20
7982/7982 [==============================] - 1s 142us/step - loss: 0.1093 - accuracy: 0.9583 - val_loss: 1.0477 - val_accuracy: 0.8090
# Plot the training and validation loss
import matplotlib.pyplot as plt
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)
plt.plot(epochs, loss, 'bo', label = 'Training loss')       # 'bo' = blue dots
plt.plot(epochs, val_loss, 'b', label = 'Validation loss')  # 'b'  = solid blue line
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
[Figure: training and validation loss (output_17_0.png)]
plt.clf()  # clear the figure
# Note: the commented-out lines below use the keys from the book
# acc = history.history['acc']
# val_acc = history.history['val_acc']
# In my version of Keras, 'acc' has been renamed to 'accuracy'
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
plt.plot(epochs, acc, 'bo', label = 'Training acc')
plt.plot(epochs, val_acc, 'b', label = 'Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
[Figure: training and validation accuracy (output_18_0.png)]
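Both plots show the validation loss bottoming out around epoch 9 and rising afterwards, i.e. the network starts overfitting. Instead of hand-picking the epoch count (as done in the next section), one could also let Keras stop automatically when the validation loss stops improving; a minimal sketch using an EarlyStopping callback on a fresh copy of the model (model_es is just an illustrative name):
from keras.callbacks import EarlyStopping
model_es = models.Sequential()
model_es.add(layers.Dense(64, activation = 'relu', input_shape = (10000, )))
model_es.add(layers.Dense(64, activation = 'relu'))
model_es.add(layers.Dense(46, activation = 'softmax'))
model_es.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy', metrics = ['accuracy'])
# stop once val_loss has not improved for 2 consecutive epochs
early_stop = EarlyStopping(monitor = 'val_loss', patience = 2)
model_es.fit(partial_x_train, partial_y_train,
             epochs = 20, batch_size = 512,
             validation_data = (x_val, y_val),
             callbacks = [early_stop])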
3-21 Training the network with fewer epochs
model_2 = models.Sequential()
model_2.add(layers.Dense(64, activation = 'relu', input_shape = (10000, )))
model_2.add(layers.Dense(64, activation = 'relu'))
model_2.add(layers.Dense(46, activation = 'softmax'))
model_2.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy',
metrics = ['accuracy'])
model_2.fit(partial_x_train, partial_y_train, epochs = 9,
batch_size = 512, validation_data = (x_val, y_val))
# Note: this evaluates `model`, the network trained for 20 epochs in section 3-17
results = model.evaluate(x_test, one_hot_test_labels)
print(results)
Train on 7982 samples, validate on 1000 samples
Epoch 1/9
7982/7982 [==============================] - 1s 97us/step - loss: 2.4856 - accuracy: 0.5234 - val_loss: 1.6474 - val_accuracy: 0.6420
Epoch 2/9
7982/7982 [==============================] - 1s 95us/step - loss: 1.3877 - accuracy: 0.6994 - val_loss: 1.2912 - val_accuracy: 0.7030
Epoch 3/9
7982/7982 [==============================] - 1s 92us/step - loss: 1.0489 - accuracy: 0.7699 - val_loss: 1.1425 - val_accuracy: 0.7600
Epoch 4/9
7982/7982 [==============================] - 1s 85us/step - loss: 0.8293 - accuracy: 0.8182 - val_loss: 1.0284 - val_accuracy: 0.7830
Epoch 5/9
7982/7982 [==============================] - 1s 99us/step - loss: 0.6605 - accuracy: 0.8543 - val_loss: 0.9781 - val_accuracy: 0.7820
Epoch 6/9
7982/7982 [==============================] - 1s 83us/step - loss: 0.5260 - accuracy: 0.8887 - val_loss: 0.9659 - val_accuracy: 0.7810
Epoch 7/9
7982/7982 [==============================] - 1s 84us/step - loss: 0.4247 - accuracy: 0.9138 - val_loss: 0.9053 - val_accuracy: 0.8040
Epoch 8/9
7982/7982 [==============================] - 1s 102us/step - loss: 0.3496 - accuracy: 0.9253 - val_loss: 0.8785 - val_accuracy: 0.8150
Epoch 9/9
7982/7982 [==============================] - 1s 100us/step - loss: 0.2851 - accuracy: 0.9394 - val_loss: 0.8921 - val_accuracy: 0.8210
2246/2246 [==============================] - 0s 157us/step
[1.251534348179587, 0.7853962779045105]
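The evaluate call above scores model, the 20-epoch network from section 3-17, not the 9-epoch model_2 that was just trained. To score model_2 itself, a minimal sketch (its numbers are not shown above):
# Score the 9-epoch network on the test set
results_2 = model_2.evaluate(x_test, one_hot_test_labels)
print(results_2)   # [test loss, test accuracy]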
Comparing the result with a completely random classifier
import copy
# Build a randomly shuffled copy of the test labels
test_label_copy = copy.copy(test_labels)
np.random.shuffle(test_label_copy)
# Fraction of positions where the shuffled label happens to match the true one
hits_array = np.array(test_labels) == np.array(test_label_copy)
float(np.sum(hits_array)) / len(test_labels)
0.18655387355298308
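The ~18.7% baseline is not 1/46 ≈ 2% because the Reuters topics are very imbalanced and the shuffled copy keeps the same class frequencies; its expected accuracy is the sum of the squared class frequencies. A quick check of that expectation:
# Expected accuracy of the shuffled-label baseline: sum over classes of p_i^2
class_freqs = np.bincount(test_labels, minlength = 46) / len(test_labels)
print(np.sum(class_freqs ** 2))   # roughly 0.18-0.19, in line with the shuffle above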
3-22 Generating new predictions on the test set
prediction = model.predict(x_test)
# Each element of prediction is a vector of length 46
print(prediction[0].shape)
# Its entries sum to 1 (the probabilities of the 46 classes)
print(np.sum(prediction[0]))
# Print the most likely class
print(np.argmax(prediction[0]))
(46,)
1.0000001
3
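The same predictions can be used to recompute the overall test accuracy by hand, taking the argmax of each probability vector and comparing it with the integer test labels; a short sketch:
# Test accuracy computed directly from the predicted probabilities
predicted_classes = np.argmax(prediction, axis = 1)
print(np.mean(predicted_classes == test_labels))   # should roughly match the evaluate result above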
3-23 A model with an information bottleneck
To verify the earlier claim that an intermediate Dense layer with too few output dimensions chokes the flow of information through the network, we now deliberately shrink the middle layer and compare the resulting accuracy.
# Model definition: note the middle Dense layer is shrunk to 4 units
model_3 = models.Sequential()
model_3.add(layers.Dense(64, activation = 'relu', input_shape = (10000,)))
# Try changing this number (e.g. 8, 32, 64) and see how the results change
# --------------------------------------------------
model_3.add(layers.Dense(4, activation = 'relu'))
# --------------------------------------------------
model_3.add(layers.Dense(46, activation = 'softmax'))
# Training -- note that, as written, the calls below re-compile and keep training
# `model` (already fit for 20 epochs above) rather than the new `model_3`,
# which is why the log starts from a very low training loss
model.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy',
              metrics = ['accuracy'])
model.fit(partial_x_train, partial_y_train,
          epochs = 20, batch_size = 128,
          validation_data = (x_val, y_val))
Train on 7982 samples, validate on 1000 samples
Epoch 1/20
7982/7982 [==============================] - 1s 176us/step - loss: 0.0738 - accuracy: 0.9588 - val_loss: 2.5090 - val_accuracy: 0.7760
Epoch 2/20
7982/7982 [==============================] - 1s 133us/step - loss: 0.0698 - accuracy: 0.9597 - val_loss: 2.6685 - val_accuracy: 0.7700
Epoch 3/20
7982/7982 [==============================] - 1s 130us/step - loss: 0.0676 - accuracy: 0.9607 - val_loss: 2.7785 - val_accuracy: 0.7700
Epoch 4/20
7982/7982 [==============================] - 1s 129us/step - loss: 0.0675 - accuracy: 0.9584 - val_loss: 2.9254 - val_accuracy: 0.7700
Epoch 5/20
7982/7982 [==============================] - 1s 133us/step - loss: 0.0670 - accuracy: 0.9588 - val_loss: 3.0094 - val_accuracy: 0.7710
Epoch 6/20
7982/7982 [==============================] - 1s 129us/step - loss: 0.0666 - accuracy: 0.9578 - val_loss: 3.0232 - val_accuracy: 0.7680
Epoch 7/20
7982/7982 [==============================] - 1s 168us/step - loss: 0.0665 - accuracy: 0.9590 - val_loss: 3.0974 - val_accuracy: 0.7730
Epoch 8/20
7982/7982 [==============================] - 1s 122us/step - loss: 0.0659 - accuracy: 0.9587 - val_loss: 3.2057 - val_accuracy: 0.7670
Epoch 9/20
7982/7982 [==============================] - 1s 133us/step - loss: 0.0648 - accuracy: 0.9599 - val_loss: 3.2828 - val_accuracy: 0.7650
Epoch 10/20
7982/7982 [==============================] - 1s 124us/step - loss: 0.0644 - accuracy: 0.9602 - val_loss: 3.1684 - val_accuracy: 0.7700
Epoch 11/20
7982/7982 [==============================] - 1s 131us/step - loss: 0.0647 - accuracy: 0.9577 - val_loss: 3.2552 - val_accuracy: 0.7650
Epoch 12/20
7982/7982 [==============================] - 1s 138us/step - loss: 0.0627 - accuracy: 0.9578 - val_loss: 3.4422 - val_accuracy: 0.7710
Epoch 13/20
7982/7982 [==============================] - 1s 135us/step - loss: 0.0631 - accuracy: 0.9594 - val_loss: 3.3429 - val_accuracy: 0.7610
Epoch 14/20
7982/7982 [==============================] - 1s 148us/step - loss: 0.0636 - accuracy: 0.9585 - val_loss: 3.6921 - val_accuracy: 0.7660
Epoch 15/20
7982/7982 [==============================] - 1s 160us/step - loss: 0.0632 - accuracy: 0.9597 - val_loss: 3.4518 - val_accuracy: 0.7640
Epoch 16/20
7982/7982 [==============================] - 1s 186us/step - loss: 0.0616 - accuracy: 0.9590 - val_loss: 3.7733 - val_accuracy: 0.7620
Epoch 17/20
7982/7982 [==============================] - 1s 139us/step - loss: 0.0621 - accuracy: 0.9590 - val_loss: 3.7500 - val_accuracy: 0.7610
Epoch 18/20
7982/7982 [==============================] - 1s 132us/step - loss: 0.0610 - accuracy: 0.9589 - val_loss: 3.9891 - val_accuracy: 0.7540
Epoch 19/20
7982/7982 [==============================] - 1s 150us/step - loss: 0.0622 - accuracy: 0.9588 - val_loss: 3.8385 - val_accuracy: 0.7500
Epoch 20/20
7982/7982 [==============================] - 1s 134us/step - loss: 0.0602 - accuracy: 0.9599 - val_loss: 4.1126 - val_accuracy: 0.7570
<keras.callbacks.callbacks.History at 0x1b52af48be0>
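As noted in the code comments above, this log comes from continuing to train model rather than from the bottleneck network model_3. A minimal sketch of the experiment as described in the text, training and scoring model_3 itself (its numbers will differ from the log above):
# Train and evaluate the bottleneck model itself
model_3.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy',
                metrics = ['accuracy'])
model_3.fit(partial_x_train, partial_y_train,
            epochs = 20, batch_size = 128,
            validation_data = (x_val, y_val))
print(model_3.evaluate(x_test, one_hot_test_labels))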
Comparison and conclusions
# Print the test results (note: this again evaluates `model`)
results = model.evaluate(x_test, one_hot_test_labels)
print(results)
2246/2246 [==============================] - 0s 89us/step
[4.692387377058727, 0.75200355052948]
- If you are classifying data points among N classes, the network should end with a Dense layer of size N.
- For a single-label, multiclass problem, the last layer should use a softmax activation.
- With labels encoded categorically (one-hot), the loss should be categorical_crossentropy (an alternative that keeps integer labels is sketched below).
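As an aside, one-hot encoding is not the only way to handle the labels: Keras also accepts plain integer class indices together with the sparse_categorical_crossentropy loss, which optimizes the same objective. A minimal sketch (model_sparse is just an illustrative name):
# Same architecture, trained directly on the integer labels
model_sparse = models.Sequential()
model_sparse.add(layers.Dense(64, activation = 'relu', input_shape = (10000, )))
model_sparse.add(layers.Dense(64, activation = 'relu'))
model_sparse.add(layers.Dense(46, activation = 'softmax'))
model_sparse.compile(optimizer = 'rmsprop',
                     loss = 'sparse_categorical_crossentropy',  # expects integer targets
                     metrics = ['accuracy'])
model_sparse.fit(partial_x_train, train_labels[1000:],
                 epochs = 9, batch_size = 512,
                 validation_data = (x_val, train_labels[:1000]))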
Closing remarks
My knowledge here is still shallow; if you spot any mistakes, corrections are warmly welcomed.