Contents

  • Word2Vec
  • Overview
  • Environment setup
  • Commonly used APIs
  • Practice
  • GloVe
  • Overview
  • Environment setup
  • Practice


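When working on NLP tasks, the first problem to solve is how to represent words (or characters) in a computer. A good word (or character) representation should accurately capture both semantic and syntactic features.

Two methods are commonly used today to train word embeddings:

  1. word2vec;
  2. GloVe.

This article shows how to train your own word vectors with the word2vec and GloVe algorithms.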

Word2Vec

Overview

Gensim provides a solid implementation of the word2vec algorithm; you can train your own word vectors (word embeddings) with the gensim.models.word2vec module.

Environment setup

  1. Set up a Python environment;
  2. Install gensim: pip install gensim

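Commonly used APIs

See the official gensim API documentation.

1. LineSentence: this class iterates over a file containing sentences, one sentence per line; each sentence must already be preprocessed and tokenized, with tokens separated by spaces.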

class gensim.models.word2vec.LineSentence(object):
    def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):

Key parameters:

  • source: the file containing the sentences;
  • max_sentence_length: the maximum sentence length; the default is 10000.
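
To make the expected file format concrete, here is a minimal pure-Python sketch of what iterating over such a file produces. It is not gensim's implementation: iter_line_sentences is a hypothetical helper written for this illustration, and the sample corpus is made up. It assumes each line is split on whitespace and that lines longer than max_sentence_length are cut into chunks.

```python
import os
import tempfile


def iter_line_sentences(path, max_sentence_length=10000):
    """Yield token lists, one per line, cut into chunks of at most max_sentence_length."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()  # the file is assumed to be space-tokenized already
            for start in range(0, len(tokens), max_sentence_length):
                yield tokens[start:start + max_sentence_length]


# Hypothetical corpus: one pre-tokenized sentence per line, plus an empty line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("the quick brown fox jumps\nword vectors are useful\n\n")
    corpus_path = tmp.name

sentences = list(iter_line_sentences(corpus_path))
short_chunks = list(iter_line_sentences(corpus_path, max_sentence_length=3))
os.remove(corpus_path)

print(sentences)
print(short_chunks)
```

In real training code you would pass the file path to gensim's LineSentence instead; the sketch only shows the shape of the data that Word2Vec receives.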

2. Word2Vec: this class implements the word2vec algorithm and trains the word vectors.

class gensim.models.word2vec.Word2Vec(BaseWordEmbeddingsModel):
    def __init__(self, sentences=None, corpus_file=None, size=100, alpha=0.025, 
                 window=5, min_count=5, max_vocab_size=None, sample=1e-3, seed=1, 
                 workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, 
                 ns_exponent=0.75, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
                 trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH,
                 compute_loss=False, callbacks=(),max_final_vocab=None
                ):

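Key parameters:

  • sentences: an iterable of sentences; a LineSentence object works;
  • size: dimensionality of the word vectors; the default is 100 (gensim 4.0 renamed this parameter to vector_size);
  • window: the maximum distance between the current word and the predicted word within a sentence; the default is 5;
  • min_count: words whose frequency is below this value are left out of the vocabulary; the default is 5;
  • workers: the number of worker threads; the default is 3.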
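
The effect of min_count and window is easier to see on a toy corpus. The sketch below is only an illustration of what the two parameters mean, not gensim's training loop (which also does things such as subsampling frequent words via sample); build_vocab and context_pairs are hypothetical helpers, and the sentences are made up.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


def context_pairs(sentence, vocab, window):
    """Return (current word, context word) pairs at most `window` positions apart."""
    kept = [word for word in sentence if word in vocab]  # rare words are dropped first
    pairs = []
    for i, current in enumerate(kept):
        lo = max(0, i - window)
        hi = min(len(kept), i + window + 1)
        pairs.extend((current, kept[j]) for j in range(lo, hi) if j != i)
    return pairs


sentences = [["a", "b", "c", "d"], ["a", "b", "c", "rare"]]
vocab = build_vocab(sentences, min_count=2)  # "d" and "rare" occur only once
pairs = context_pairs(sentences[1], vocab, window=1)

print(sorted(vocab))
print(pairs)
```

A larger window adds more distant context pairs, and a larger min_count shrinks the vocabulary, which is why both are worth tuning for a small corpus.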

Practice

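import multiprocessing

from gensim.models.word2vec import LineSentence, Word2Vec

import file_util  # the author's own project helper, not shown in this post

hidden_dims = 100  # embedding dimension (assumed value; not given in the original)


def train_word_vec():
    hanzi_sentences = LineSentence(file_util.get_project_path() + './data/hanzi.txt')
    # gensim < 4.0 API; in gensim >= 4.0 the `size` argument is named `vector_size`.
    hanzi_model = Word2Vec(hanzi_sentences, size=hidden_dims, window=5, min_count=3,
                           workers=multiprocessing.cpu_count())
    hanzi_model.save(file_util.get_project_path() + './data/word2vec_models/hanzi_embedding_{}.model'.format(hidden_dims))
    hanzi_model.wv.save_word2vec_format(file_util.get_project_path() + './data/word2vec_models/hanzi_embedding_{}.txt'.format(hidden_dims))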

Where:

  • hanzi_model.save: saves the full model, which can be loaded later to continue training;
  • hanzi_model.wv.save_word2vec_format: saves only the trained word vectors.
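
The file written by save_word2vec_format in its default text mode is plain text: a header line with the vocabulary size and the vector dimension, followed by one word and its values per line. With gensim you would normally read it back with KeyedVectors.load_word2vec_format (and the full model with Word2Vec.load). As a rough sketch of the format itself, the hypothetical read_word2vec_text helper below parses a made-up two-word example without gensim.

```python
import io


def read_word2vec_text(handle):
    """Parse the word2vec text format: a 'count dim' header, then 'word v1 ... vN' lines."""
    vocab_size, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # tolerate trailing blank lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError("unexpected vector length for {!r}".format(word))
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError("header says {} words, found {}".format(vocab_size, len(vectors)))
    return vectors, dim


# Made-up file contents: 2 words, 3 dimensions.
sample = io.StringIO("2 3\nfoo 0.1 0.2 0.3\nbar -0.5 0.0 0.5\n")
vectors, dim = read_word2vec_text(sample)

print(dim)
print(vectors["bar"])
```

Because this format is plain text, the vectors can be consumed by tools that know nothing about gensim, whereas the .model file is only useful to gensim itself.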

GloVe

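Overview

GloVe (Global Vectors for Word Representation) is a word-vector algorithm proposed by Jeffrey Pennington, Richard Socher, and colleagues at Stanford University. It is a word representation algorithm based on global word-frequency statistics (count-based, using overall statistics). It combines the strengths of two approaches: global matrix factorization and local context windows.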
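
To give a feel for the "count-based" part, the following sketch builds a small co-occurrence table from a local context window and shows the weighting function described in the GloVe paper. It is an illustration under stated assumptions, not the C implementation: cooccurrence_counts and glove_weight are hypothetical helpers, each pair is weighted by 1/distance as the paper describes, alpha = 0.75 is the paper's value, and x_max is the cutoff that the glove tool exposes as -x-max.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window_size):
    """Accumulate symmetric co-occurrence weights; a pair d tokens apart adds 1/d."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window_size), i):
            weight = 1.0 / (i - j)
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight
    return counts


def glove_weight(x, x_max=10.0, alpha=0.75):
    """Weighting function f(x) from the GloVe paper: grows with x, capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window_size=2)

print(counts[("cat", "the")])
print(counts[("on", "cat")])
print(glove_weight(5.0))
```

The counts come from local windows, while the model is then fitted to the whole table at once, which is the sense in which GloVe combines the two approaches.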

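Environment setup

The official GloVe tool from Stanford currently runs only on Linux.

  1. Download the GloVe source code: git clone http://github.com/stanfordnlp/glove ./
  2. Build it: make. Note: some compiler warnings may appear during this step and can be ignored. When the build finishes, a build folder is created in the current directory.

Practice

GloVe trains word vectors through the demo.sh script, and all parameters are configured in that script. The script is annotated below: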

#!/bin/bash
set -e

# Makes programs, downloads sample data, trains a GloVe model, and then evaluates it.
# One optional argument can specify the language used for eval script: matlab, octave or [default] python

make
# Download the sample corpus; comment this block out when training on your own data
if [ ! -e text8 ]; then
  if hash wget 2>/dev/null; then
    wget http://mattmahoney.net/dc/text8.zip
  else
    curl -O http://mattmahoney.net/dc/text8.zip
  fi
  unzip text8.zip
  rm text8.zip
fi
# Corpus file
CORPUS=text8
# Generated vocabulary file
VOCAB_FILE=vocab.txt
# Binary co-occurrence matrix file
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
# File name for the word vectors (written as vectors.txt)
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
# Words with a frequency below this value are left out of the vocabulary
VOCAB_MIN_COUNT=5
# Word vector dimensionality
VECTOR_SIZE=50
MAX_ITER=15
# Context window size used during training
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10

echo
echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
if [ "$CORPUS" = 'text8' ]; then
   if [ "$1" = 'matlab' ]; then
       matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2 
   elif [ "$1" = 'octave' ]; then
       octave < ./eval/octave/read_and_evaluate_octave.m 1>&2
   else
       echo "$ python eval/python/evaluate.py"
       python eval/python/evaluate.py
   fi
fi

Once the configuration is done, run the script: ./demo.sh
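
After the script finishes, the vectors can be inspected from Python. The sketch below assumes the text output (vectors.txt) contains one word per line followed by its values, with no header line; parse_glove_text and cosine are hypothetical helpers, and the three sample vectors are made up so that the example runs without a trained model.

```python
import math


def parse_glove_text(lines):
    """Parse lines of the form 'word v1 v2 ... vN' (assumed to have no header line)."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if parts:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors standing in for the contents of vectors.txt.
sample = ["king 1.0 0.0 1.0", "queen 1.0 0.0 0.9", "apple 0.0 1.0 0.0"]
vectors = parse_glove_text(sample)

king_queen = cosine(vectors["king"], vectors["queen"])
king_apple = cosine(vectors["king"], vectors["apple"])
print(king_queen > king_apple)
```

For a real run, open the output file and pass the file object to the parser, for example with open("vectors.txt", encoding="utf-8") as f: vectors = parse_glove_text(f).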