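Language models fall into two families: grammar-based language models and statistical language models. The term usually refers to the latter.
A statistical language model treats language (a sequence of words) as a random event and assigns it a probability that describes how likely it is to belong to a given language.
Its job is to assign a probability P(w1, w2, ..., wm) to a string of m words, where w1 through wm are the words of the text in order. Put simply, it computes how probable a sentence is.
Such a model can be used to judge how plausible a sentence is: the higher the probability, the closer the sentence is to something people would naturally say. A second use is that these methods preserve some word-order information, which gives each word its context.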
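To make "the probability of a sentence" concrete, here is a minimal sketch of a bigram model. It is an illustration only: the toy corpus, the `<s>`/`</s>` padding tokens, the add-one smoothing, and the function names are all assumptions, not part of the original text.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded[:-1], padded[1:]))
    return unigrams, bigrams

def sentence_log_prob(tokens, unigrams, bigrams):
    """log P(w1, ..., wm) under a bigram approximation with add-one smoothing."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    log_p = 0.0
    for prev, cur in zip(padded[:-1], padded[1:]):
        log_p += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return log_p

# Hypothetical toy corpus, already tokenized.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
unigrams, bigrams = train_bigram(corpus)
natural = sentence_log_prob(["the", "cat", "sat"], unigrams, bigrams)
scrambled = sentence_log_prob(["sat", "cat", "the"], unigrams, bigrams)
print(natural, scrambled)
```

The sentence in natural word order receives a higher log-probability than the scrambled one, which is the sense in which the model measures plausibility and retains word-order information.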
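Word vectors explained. A neural network captures meaning by describing how words are distributed relative to one another. Both this representation and the one_hot representation can be called word vectors, but the former has another name: word embedding. One way to understand it: change each element of a one_hot vector from an integer to a float so that it ranges over the real numbers, then compress the original huge, sparse dimension into a much smaller space in which the distances between vectors are meaningful.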
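The relationship between a one_hot vector and an embedding can be sketched in a few lines. This is an assumption-laden toy: the three-word vocabulary, the 4-dimensional embedding, and the random values stand in for a learned matrix.

```python
import random

vocab = ["king", "queen", "apple"]   # hypothetical vocabulary
embedding_dim = 4
random.seed(0)
# Embedding matrix: one dense float row per word (random here, learned in practice).
embeddings = [[random.uniform(-1.0, 1.0) for _ in range(embedding_dim)] for _ in vocab]

def one_hot(index, size):
    """Sparse integer vector with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]

def matvec(one_hot_vec, matrix):
    """Multiply a one-hot row vector by the embedding matrix."""
    rows, cols = len(matrix), len(matrix[0])
    return [sum(one_hot_vec[r] * matrix[r][c] for r in range(rows)) for c in range(cols)]

idx = vocab.index("queen")
via_matmul = matvec(one_hot(idx, len(vocab)), embeddings)
via_lookup = embeddings[idx]   # the same row, fetched directly
print(via_matmul)
print(via_lookup)
```

Multiplying the one_hot vector by the matrix selects exactly one row, so an embedding lookup can be read as a cheap replacement for that multiplication.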
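The word embedding mapping rests on the distributional hypothesis: a word's meaning is determined by its context, so words with similar contexts have similar meanings. Building word vectors involves two core steps:
(1) Choose a way to describe the context.
(2) Choose a model that captures the relationship between a word (called the "target word" below) and its context.
In practice, the language models introduced above are used for this task. Their main advantage is that they can represent complex contexts.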
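Step (1) can be illustrated with the simplest possible context description, a fixed window of neighboring words. The sketch below is only one choice among many; the window size, the sample sentence, and the function name are assumptions.

```python
from collections import Counter, defaultdict

def context_counts(tokens, window=1):
    """Map each target word to a Counter of the words seen within `window` positions."""
    contexts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                contexts[target][tokens[j]] += 1
    return contexts

tokens = "i drink tea i drink coffee".split()
contexts = context_counts(tokens, window=1)
print(dict(contexts["tea"]))
print(dict(contexts["coffee"]))
```

Here "tea" and "coffee" both occur next to "drink"; under the distributional hypothesis that shared context is the evidence that their meanings are related.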
Training word vectors. When a neural network trains a word embedding, all embeddings are usually initialized randomly and the embedding matrix is updated throughout training. TensorFlow maps each word ID to its vector with tf.nn.embedding_lookup.
For example:
with tf.device('/cpu:0'):
    embeddings = tf.get_variable("embedding", [words_size, embedding_size])
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)  # pinned to the CPU (no GPU implementation at the time of writing)
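Candidate sampling. Language-modeling problems are at heart multi-class classification. The usual approach is to give the last layer one node per class, compute the loss from the label of each input sample, and backpropagate to optimize the parameters. With a large vocabulary, however, the number of classes is enormous: the last layer needs one node per word (for example, a vocabulary of over a hundred million entries), and a probability has to be computed for every one of them. This makes training so slow that the task becomes impractical.
Candidate sampling evaluates only a small subset of the classes at each step, so the last layer computes scores only for that subset. Because training is supervised, the correct label (the positive sample) is known; the additional classes drawn into the subset, which carry label 0, are called negative samples. Training this way is far cheaper and still gives good results.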
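The idea of scoring only a small subset can be sketched without any framework. The code below is a simplified illustration, not TensorFlow's implementation: it draws negatives uniformly, uses a plain logistic loss, and omits the sampling-probability correction that a real NCE loss applies. All sizes and names are assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sampled_logistic_loss(hidden, weights, true_class, num_sampled, rng):
    """Score only the true class plus `num_sampled` uniformly drawn negatives."""
    candidates = [c for c in range(len(weights)) if c != true_class]
    negatives = rng.sample(candidates, num_sampled)
    loss = -math.log(sigmoid(dot(hidden, weights[true_class])))       # positive sample, label 1
    for neg in negatives:
        loss += -math.log(1.0 - sigmoid(dot(hidden, weights[neg])))   # negative sample, label 0
    return loss, negatives

rng = random.Random(0)
num_classes, dim = 1000, 8   # hypothetical vocabulary size and hidden size
weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_classes)]
hidden = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
loss, negatives = sampled_logistic_loss(hidden, weights, true_class=42, num_sampled=5, rng=rng)
print(loss, negatives)
```

Each step touches 6 output rows instead of 1000, which is where the speed-up comes from.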
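Applying word vectors. word2vec is a word embedding tool (or collection of algorithms) from Google. It combines two models (CBOW and Skip-Gram) with two training methods (negative sampling and hierarchical softmax); the most common combination is Skip-Gram with negative sampling. It is widely known for being fast and effective, and its vectors can be used directly in many settings.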
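The two models differ in which side of the (context, target) relationship is the input. A small sketch, assuming a symmetric window of one word and integer word IDs (the function name is hypothetical):

```python
def training_examples(token_ids, window=1):
    """Build CBOW and Skip-Gram training examples from a sequence of word IDs."""
    cbow, skipgram = [], []
    for i, target in enumerate(token_ids):
        context = [token_ids[j]
                   for j in range(max(0, i - window), min(len(token_ids), i + window + 1))
                   if j != i]
        cbow.append((context, target))                  # context words -> target word
        skipgram.extend((target, c) for c in context)   # target word -> each context word
    return cbow, skipgram

cbow, skipgram = training_examples([10, 20, 30], window=1)
print(cbow)
print(skipgram)
```

CBOW produces one example per position, with the whole context as input; Skip-Gram produces one example per (target, context word) pair.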
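TensorFlow provides several candidate sampling functions for the sampling step of the loss computation. Each one wraps a different sampling rule:
- tf.nn.uniform_candidate_sampler: samples the class subset uniformly.
- tf.nn.log_uniform_candidate_sampler: samples from a log-uniform (Zipfian) distribution. A Zipfian distribution is one in which a few words are used very often and most words are used rarely.
- tf.nn.learned_unigram_candidate_sampler: samples according to the class distribution observed in the training data.
- tf.nn.fixed_unigram_candidate_sampler: samples according to a probability distribution supplied by the user.
In practice you first find out, from statistics or some other source, which distribution the classes follow, and then pick the matching function (or pass the distribution to fixed_unigram_candidate_sampler). If the class distribution is unknown, learned_unigram_candidate_sampler is an option. It initializes an array covering [0, range_max) with every element set to 1; each time a class is seen during training, the corresponding element is incremented by 1, and each draw samples from the probabilities obtained by normalizing the array.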
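The counting scheme described for learned_unigram_candidate_sampler can be sketched as follows. This mirrors the description in the text only; it is not TensorFlow's code, and the class and method names are hypothetical.

```python
import random

class LearnedUnigramSampler:
    """Toy sampler that learns class probabilities from the classes it observes."""

    def __init__(self, range_max, seed=0):
        self.counts = [1] * range_max   # every class in [0, range_max) starts at 1
        self.rng = random.Random(seed)

    def observe(self, class_id):
        """Record one occurrence of a class seen during training."""
        self.counts[class_id] += 1

    def probabilities(self):
        """Normalize the counts into a sampling distribution."""
        total = sum(self.counts)
        return [c / total for c in self.counts]

    def sample(self, k):
        """Draw k class IDs (with replacement) in proportion to the counts."""
        return self.rng.choices(range(len(self.counts)), weights=self.counts, k=k)

sampler = LearnedUnigramSampler(range_max=4)
for class_id in [0, 0, 0, 2]:
    sampler.observe(class_id)
probs = sampler.probabilities()
draws = sampler.sample(5)
print(probs, draws)
```

After four observations the counts are [4, 1, 2, 1], so class 0 is drawn with probability 0.5. Starting every count at 1 keeps unseen classes from having zero probability.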
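Note: in language tasks, words sorted by decreasing frequency follow a Zipfian distribution. The classes are therefore usually sorted by decreasing frequency first, and log_uniform_candidate_sampler is then used.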
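The TensorFlow documentation for log_uniform_candidate_sampler gives the sampling probability as P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1). A short sketch that evaluates this formula (the value of `range_max` is an arbitrary assumption) shows why the IDs must be sorted by decreasing frequency:

```python
import math

def log_uniform_prob(class_id, range_max):
    """Probability that the log-uniform sampler picks `class_id` in [0, range_max)."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)

range_max = 1000   # hypothetical vocabulary size
probs = [log_uniform_prob(k, range_max) for k in range(range_max)]
print(probs[0], probs[1], probs[-1], sum(probs))
```

The probabilities decrease monotonically with the class ID and sum to 1, so the sampler matches the data only if ID 0 is the most frequent word, ID 1 the next most frequent, and so on.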
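The nce_loss function uses log_uniform_candidate_sampler by default. That sampler assumes class IDs are assigned in order of decreasing frequency, so more frequent words have smaller IDs and are more likely to be drawn as negative samples. Combined with an optimizer, nce_loss tunes the weights of the last layer and, more importantly, adjusts the word embedding vectors in the same way so that their spatial relationships become more meaningful.
tf.nn.nce_loss(weights=nce_weights,           # weights, shape (N, K)
               biases=nce_biases,             # biases, shape (N)
               labels=train_labels,           # label data, shape (batch_size, num_true)
               inputs=embed,                  # input data, shape (batch_size, K)
               num_true=num_true,             # number of positive samples per example
               num_sampled=num_sampled,       # number of negative samples to draw
               num_classes=words_size,        # number of classes N
               sampled_values=None,           # pre-drawn negatives; None means the default sampler is used, which favors high-frequency words
               remove_accidental_hits=False,  # whether to drop a sampled negative that happens to be a positive
               partition_strategy='mod')      # lookup strategy used by embedding_lookup when weights is partitioned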
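Note: TensorFlow also has a function similar to nce_loss, sampled_softmax_loss, which is called in the same way. The difference is internal. nce_loss can handle multi-label classification, where the labels are not mutually exclusive, because it attaches a logistic binary classifier to each output class. sampled_softmax_loss handles only single-label classification, where the classes are mutually exclusive, because it feeds all class outputs into a single multi-class (softmax) operation.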
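The difference between the two losses can be seen on a handful of candidate logits. The sketch below is a simplified illustration with made-up logits: it leaves out the sampling-probability correction that both TensorFlow functions apply, and the function names are hypothetical.

```python
import math

def logistic_loss(logits, labels):
    """Independent binary cross-entropy per candidate class (the nce_loss view)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total

def softmax_loss(logits, true_index):
    """One cross-entropy over all candidates together (the sampled_softmax_loss view)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[true_index]

logits = [2.0, -1.0, 0.5, -0.3]                      # hypothetical scores for four candidates
multi_label = logistic_loss(logits, [1, 0, 1, 0])   # two classes can be positive at once
single_label = softmax_loss(logits, true_index=0)   # exactly one class is the answer
print(multi_label, single_label)
```

Because each logistic term is independent, any number of labels can be 1; the softmax normalizes across candidates, so raising one class necessarily lowers the others.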
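Example: training your own word2vec with the CBOW model. This example trains word2vec, visualizes the distribution of the learned word vectors, and along the way exercises nce_loss and word embedding. Note that generate_batch as written pairs each center word with one of its context words, which is the input/label arrangement of Skip-Gram rather than CBOW.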
import numpy as np
import tensorflow as tf
import random
import collections
from collections import Counter
import jieba
from sklearn.manifold import TSNE
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams['font.sans-serif'] = ['SimHei']  # lets matplotlib render Chinese labels
mpl.rcParams['font.family'] = 'STSong'
mpl.rcParams['font.size'] = 20
training_file = '人体阴阳与电能.txt'  # any text file of your own will do
def get_ch_label(txt_file):
    """Read the whole file into one string."""
    labels = ""
    with open(txt_file, 'r', encoding='utf-8') as f:
        for label in f:
            labels = labels + label
    return labels
training_data = get_ch_label(training_file)
print("总字数",len(training_data))
# Word segmentation
def fenci(training_data):
    seg_list = jieba.cut(training_data)  # accurate mode is the default
    training_ci = " ".join(seg_list)
    training_ci = training_ci.split()  # split the string on whitespace
    training_ci = np.array(training_ci)
    training_ci = np.reshape(training_ci, [-1, ])
    return training_ci
training_ci =fenci(training_data)
#print(training_ci)
print("总词数",len(training_ci))
# Build the dataset
def build_dataset(words, n_words):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary
training_label, count, dictionary, words = build_dataset(training_ci, 3500)
words_size = len(dictionary)
print("字典词数",words_size)
#print(training_label)  # the text as a sequence of word IDs
#print(words)  # the word for each ID
#print(dictionary)  # the ID for each word
#print(count)  # the occurrence count for each word
print('Sample data', training_label[:100], [words[i] for i in training_label[:100]])
data_index = 0
# Generate one batch of training pairs
def generate_batch(data, batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [skip_window target skip_window]
    buffer = collections.deque(maxlen=span)
    if data_index + span > len(data):
        data_index = 0  # wrap around to the start of the data
    buffer.extend(data[data_index : data_index + span])
    data_index += span
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span-1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        if data_index == len(data):
            buffer.extend(data[:span])  # extend keeps the deque and its maxlen; plain assignment would turn it into a list
            data_index = span
        else:
            buffer.append(data[data_index])
            data_index += 1
    # Backtrack a little to avoid skipping words at the end of a batch
    data_index = (data_index + len(data) - span) % len(data)
    return batch, labels
batch, labels = generate_batch(training_label, batch_size=8, num_skips=2, skip_window=1)
for i in range(8):  # each center word appears twice, labeled with its two neighbors in random order
    print(batch[i], words[batch[i]], '->', labels[i, 0], words[labels[i, 0]])
batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window =np.int32( words_size/2 ) # Only pick dev samples in the head of the distribution.
print("valid_window",valid_window)
valid_examples = np.random.choice(valid_window, valid_size, replace=False)  # pick 16 distinct IDs from [0, words_size/2)
num_sampled = 64 # Number of negative examples to sample.
tf.reset_default_graph()
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
# Word vectors
with tf.device('/cpu:0'):
    embeddings = tf.Variable(tf.random_uniform([words_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)
# Variables for the NCE loss
nce_weights = tf.Variable(tf.truncated_normal([words_size, embedding_size],
                                              stddev=1.0/tf.sqrt(np.float32(embedding_size))))
nce_biases = tf.Variable(tf.zeros([words_size]))
# Compute the average NCE loss for the batch
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                   labels=train_labels, inputs=embed,
                   num_sampled=num_sampled, num_classes=words_size)
)
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
# Compute the cosine similarity
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings/norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
print("________________________",similarity.shape)
# Begin training.
num_steps = 100001
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print('Initialized')
    average_loss = 0
    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(training_label, batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
        # Perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run())
        _, loss_val = sess.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val
        # Printing embed at this point shows its values being adjusted step by step
        # emv = sess.run(embed,feed_dict = {train_inputs: [37,18]})
        # print("emv-------------------",emv[0])
        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step ', step, ': ', average_loss)
            average_loss = 0
        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval(session=sess)  # cosine between each validation word and every word in the dictionary
            for i in range(valid_size):
                valid_word = words[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]  # argsort returns indices in ascending order of value
                log_str = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = words[nearest[k]]
                    log_str = '%s,%s' % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()  # must run while the session is still open
# Visualization
import matplotlib
matplotlib.matplotlib_fname()
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), 'More labels than embeddings'
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                     ha='right', va='bottom')
    plt.savefig(filename)
try:
    # pylint: disable=g-import-not-at-top
    tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
    plot_only = 80  # plot the first 80 IDs, i.e. the most frequent words
    low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
    labels = [words[i] for i in range(plot_only)]
    plot_with_labels(low_dim_embs, labels)
except ImportError:
    print('Please install sklearn, matplotlib, and scipy to show embeddings.')
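The "Nearest to ..." lines printed during training come from cosine similarity between normalized embeddings. A framework-free sketch of the same computation, using hypothetical 3-dimensional vectors rather than trained ones:

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def nearest(word, embeddings, top_k=2):
    """Return the top_k words whose vectors have the highest cosine with `word`."""
    unit = {w: normalize(v) for w, v in embeddings.items()}
    query = unit[word]
    sims = {w: sum(a * b for a, b in zip(query, v)) for w, v in unit.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:top_k]

# Made-up embeddings for illustration only.
toy = {
    "cat": [1.0, 0.9, 0.0],
    "dog": [0.9, 1.0, 0.1],
    "car": [0.0, 0.1, 1.0],
    "bus": [0.1, 0.0, 0.9],
}
print(nearest("cat", toy, top_k=1))
print(nearest("car", toy, top_k=1))
```

The listing above gets the same effect by taking `argsort` of the negated similarity row and skipping index 0, which is the query word itself.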
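Example: training word2vec with specified candidate samples. The previous example used the default candidate sampling of nce_loss; this one extends it so that the candidate samples used for the loss are specified by hand. Candidates are generated from word-frequency data that we supply, and the loss is then computed on them. The approach is more general: the model is no longer limited to data that follows a Zipfian distribution, and for any other distribution it is enough to configure the sampler accordingly. The steps are as follows:
Generate the word-frequency data: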
import numpy as np
import tensorflow as tf
import random
import collections
from collections import Counter
import jieba
from sklearn.manifold import TSNE
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams['font.sans-serif'] = ['SimHei']  # lets matplotlib render Chinese labels
mpl.rcParams['font.family'] = 'STSong'
mpl.rcParams['font.size'] = 20
training_file = '人体阴阳与电能.txt'
# Read the Chinese text
def get_ch_lable(txt_file):
    labels = ""
    with open(txt_file, 'rb') as f:
        for label in f:
            # use label.decode('utf-8') instead if the file is UTF-8 encoded
            labels = labels + label.decode('gb2312')
    return labels
# Word segmentation
def fenci(training_data):
    seg_list = jieba.cut(training_data)  # accurate mode is the default
    training_ci = " ".join(seg_list)
    training_ci = training_ci.split()  # split the string on whitespace
    training_ci = np.array(training_ci)
    training_ci = np.reshape(training_ci, [-1, ])
    return training_ci
def build_dataset(words, n_words):
    """Process raw inputs into a dataset."""
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    vocab_freqs = []  # occurrence count per word ID, in ID order
    for word, nvocab in count:
        dictionary[word] = len(dictionary)
        vocab_freqs.append(nvocab)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary, vocab_freqs
training_data =get_ch_lable(training_file)
print("总字数",len(training_data))
training_ci =fenci(training_data)
#print(training_ci)
print("总词数",len(training_ci))
training_label, count, dictionary, words,vocab_freqs = build_dataset(training_ci, 350)
words_size = len(dictionary)
print("字典词数",words_size)
#print(training_label)  # the text as a sequence of word IDs
#print(words)  # the word for each ID
#print(dictionary)  # the ID for each word
#print(count)  # the occurrence count for each word
####################################################
print('Sample data', training_label[:10], [words[i] for i in training_label[:10]])
data_index = 0
def generate_batch(data, batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    if data_index + span > len(data):
        data_index = 0
    buffer.extend(data[data_index:data_index + span])
    data_index += span
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        if data_index == len(data):
            buffer.extend(data[:span])  # extend keeps the deque and its maxlen; plain assignment would turn it into a list
            data_index = span
        else:
            buffer.append(data[data_index])
            data_index += 1
    # Backtrack a little bit to avoid skipping words in the end of a batch
    data_index = (data_index + len(data) - span) % len(data)
    return batch, labels
batch, labels = generate_batch(training_label, batch_size=8, num_skips=2, skip_window=1)
for i in range(8):  # each center word appears twice, labeled with its two neighbors in random order
    print(batch[i], words[batch[i]], '->', labels[i, 0], words[labels[i, 0]])
batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window =np.int32( words_size/2 ) # Only pick dev samples in the head of the distribution.
print("valid_window",valid_window)
valid_examples = np.random.choice(valid_window, valid_size, replace=False)  # pick 16 distinct IDs from [0, words_size/2)
num_sampled = 64 # Number of negative examples to sample.
tf.reset_default_graph()
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
# Ops and variables pinned to the CPU because of missing GPU implementation
with tf.device('/cpu:0'):
    # Look up embeddings for inputs.
    embeddings = tf.Variable(tf.random_uniform([words_size, embedding_size], -1.0, 1.0))  # words_size rows, each an embedding_size-dimensional vector
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)
    # Construct the variables for the NCE loss
    nce_weights = tf.Variable(tf.truncated_normal([words_size, embedding_size],
                                                  stddev=1.0 / tf.sqrt(np.float32(embedding_size))))
    nce_biases = tf.Variable(tf.zeros([words_size]))
vocab_freqs[0] = 90  # UNK was recorded with the placeholder count -1; replace it with a positive frequency
sampled = tf.nn.fixed_unigram_candidate_sampler(
    true_classes=tf.cast(train_labels, tf.int64),
    num_true=1,
    num_sampled=num_sampled,
    unique=True,
    range_max=words_size,
    unigrams=vocab_freqs)
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=nce_weights, biases=nce_biases,
                               labels=train_labels, inputs=embed,
                               num_sampled=num_sampled, num_classes=words_size,
                               sampled_values=sampled))
# Construct the SGD optimizer using a learning rate of 1.0.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
# Compute the cosine similarity between minibatch examples and all embeddings.
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
print("________________________",similarity.shape)
# Begin training.
num_steps = 100001
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print('Initialized')
    average_loss = 0
    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(training_label, batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
        # Perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run())
        _, loss_val = sess.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val
        # Printing embed at this point shows its values being adjusted step by step
        # emv = sess.run(embed,feed_dict = {train_inputs: [37,18]})
        # print("emv-------------------",emv[0])
        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step ', step, ': ', average_loss)
            average_loss = 0
        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval(session=sess)
            for i in range(valid_size):
                valid_word = words[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]  # argsort returns indices in ascending order of value
                log_str = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = words[nearest[k]]
                    log_str = '%s,%s' % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()  # must run while the session is still open
def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), 'More labels than embeddings'
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                     ha='right', va='bottom')
    plt.savefig(filename)
try:
    # pylint: disable=g-import-not-at-top
    tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
    plot_only = 80  # plot the first 80 IDs, i.e. the most frequent words
    low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
    labels = [words[i] for i in range(plot_only)]
    plot_with_labels(low_dim_embs, labels)
except ImportError:
    print('Please install sklearn, matplotlib, and scipy to show embeddings.')
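What the second example asks of the sampler, drawing distinct class IDs in proportion to supplied frequencies, can be pictured with a small sketch. It is an illustration under assumptions, not TensorFlow's implementation: the frequencies are made up and the function name is hypothetical. Rejection of repeats is used because the TensorFlow documentation describes `unique=True` as sampling with rejection.

```python
import random

def sample_unique_by_frequency(freqs, num_sampled, rng):
    """Draw `num_sampled` distinct class IDs with probability proportional to freqs."""
    if num_sampled > sum(1 for f in freqs if f > 0):
        raise ValueError("not enough classes with positive frequency")
    ids = range(len(freqs))
    chosen, seen = [], set()
    while len(chosen) < num_sampled:
        candidate = rng.choices(ids, weights=freqs, k=1)[0]
        if candidate not in seen:   # reject repeats so that every draw is unique
            seen.add(candidate)
            chosen.append(candidate)
    return chosen

freqs = [90, 50, 30, 5, 1]   # hypothetical counts; index 0 plays the role of UNK
picked = sample_unique_by_frequency(freqs, num_sampled=3, rng=random.Random(0))
print(picked)
```

Weights must be non-negative for this kind of draw, which is why the listing overwrites the placeholder -1 stored for UNK in `vocab_freqs[0]` before passing the list as `unigrams`.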