Word2Vec Training Dataset Download

Reposted

西洋无悔 2024-07-19 20:58:42

Table of Contents

Word2vec
Third-party libraries

gensim
nltk

Training Word2vec

Corpus
Preprocessing
Training with gensim

Loading Word2vec
Code

Word2vec

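In NLP, any attempt to process text runs into the question of how to represent words. Before Word2vec, words were commonly represented with one-hot encoding: each word is a vector of 0s and 1s whose length equals the vocabulary size, with a 1 at the position of that word and 0 everywhere else. This encoding has several drawbacks. One is that the Euclidean distance between any two distinct words is always the same, which is clearly unreasonable: apple and banana should be closer together, while apple and dog should be farther apart.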
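The distance problem can be checked directly. The following is a minimal sketch using only the standard library; the three-word vocabulary and the helper names `one_hot` and `euclidean` are made up for illustration.

```python
import math


def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy vocabulary, chosen only for illustration.
vocab = ["apple", "banana", "dog"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

d_apple_banana = euclidean(vectors["apple"], vectors["banana"])
d_apple_dog = euclidean(vectors["apple"], vectors["dog"])

# Both distances equal sqrt(2), so one-hot vectors carry no notion of similarity.
print(d_apple_banana, d_apple_dog)
```

Any two distinct one-hot vectors differ in exactly two positions, which is why the distance is the same for every pair regardless of vocabulary size.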

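As the name suggests, Word2vec converts a word into a vector, but unlike one-hot encoding, it captures the relationships between words much better. The dimensionality of the vectors is a parameter that must be set manually before training. A higher dimensionality can carry more information, but it also increases the time and memory cost of training.

Since this post focuses on how to train Word2vec, the theory behind the algorithm is not covered here.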

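Third-party libraries

gensim

gensim is a third-party Python toolkit for topic modeling, used mainly for natural language processing (NLP) and information retrieval (IR). It includes many efficient, easy-to-train algorithms, such as Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP), and word2vec. This post mainly uses gensim's word2vec model.

Installing gensim is simple: use either pip or conda.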

pip install gensim
conda install gensim

nltk

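nltk stands for Natural Language Toolkit, a third-party library for natural language processing. nltk provides more than 50 corpora and lexical resources, along with text processing methods such as classification, tokenization, and tagging. This post mainly uses nltk's tokenizer and stop word list.

nltk is installed the same way as gensim, with either pip or conda.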

pip install nltk
conda install nltk

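After installation, nltk's resources still need to be downloaded (its built-in corpora and models are downloaded separately). Start python from the command line and run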

>>> import nltk
>>> nltk.download('stopwords')
>>> nltk.download('punkt')

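On Windows, the downloaded content can be found in C:\Users\your-username\AppData\Roaming\nltk_data. With a poor network connection the download may fail. In that case, download the corresponding archives manually, place them in the appropriate folder, and unzip them there. The resources can then be used without the download method above.

There are two official download locations: http://www.nltk.org/nltk_data/ and https://github.com/nltk/nltk_data/tree/gh-pages/packages. Find the zip archives you need there.

Unzip stopwords.zip into C:\Users\your-username\AppData\Roaming\nltk_data\corpora and punkt.zip into C:\Users\your-username\AppData\Roaming\nltk_data\tokenizers.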
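The manual installation can also be scripted. The sketch below uses only the standard library; the function name `install_nltk_zip` and its parameters are hypothetical, and the paths in the example calls are placeholders to be replaced with the real locations.

```python
import zipfile
from pathlib import Path


def install_nltk_zip(zip_path, nltk_data_dir, category):
    """Extract a manually downloaded archive into <nltk_data_dir>/<category>/.

    `category` is the nltk_data subfolder, e.g. "corpora" or "tokenizers".
    """
    target = Path(nltk_data_dir) / category
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


# Example calls (placeholder paths, adjust to where the archives were saved):
# install_nltk_zip("stopwords.zip", r"C:\Users\your-username\AppData\Roaming\nltk_data", "corpora")
# install_nltk_zip("punkt.zip", r"C:\Users\your-username\AppData\Roaming\nltk_data", "tokenizers")
```

Creating the category folder first means the function also works on a machine where nltk_data does not exist yet.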

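Training Word2vec

Corpus

The corpus used in this post is 20 Newsgroups. It has 20 categories, each containing a number of short articles. The corpus can be downloaded online. A Word2vec model trained on this corpus can be used in downstream classification tasks.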

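Preprocessing

Before training, the text has to be preprocessed. gensim's word2vec model takes fully tokenized text as input, for example ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good'].

Take the file /alt/atheism/49960 in the dataset: its first 21 lines are not body text, and feeding them into the model would affect it to some degree (every file in the dataset has a similar header). So the text must be cleaned first. The cleaning requirements are:

No punctuation;
All words converted to lowercase;
No blank lines.

In short, what we want is the list of words that make up each article. Because real text is messy, the tokens may include numbers, strings made up of symbols, stop words, and so on, so additional filtering conditions are needed.

For this dataset, a simple cleaning rule is to check whether a line contains a colon (:). If it does, the line is treated as part of the header; otherwise it is treated as body text. Body lines that happen to contain a colon are dropped as well, but the effect on the body text is small. The code is as follows: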

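with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        # strip the trailing newline so that blank lines compare equal to ''
        line = line.strip().lower()
        if ':' in line or line == '':
            # header line or blank line: skip it
            continue
        else:
            pass  # body line: processed further in the next step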
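The colon rule also discards body lines that contain a colon. A different heuristic, not the one used in this post, relies on the convention that a newsgroup message separates its header from its body with the first blank line. The sketch below assumes the files follow that convention; `body_lines` and the sample message are made up for illustration.

```python
def body_lines(lines):
    """Yield the non-blank lines that come after the first blank line."""
    in_body = False
    for line in lines:
        line = line.strip()
        if not in_body:
            # Still in the header: the first blank line ends it.
            if line == "":
                in_body = True
            continue
        if line:
            yield line


# A made-up message in header / blank line / body layout.
sample = [
    "From: someone@example.com\n",
    "Subject: a made-up message\n",
    "\n",
    "Note: this body line contains a colon.\n",
    "\n",
    "Second paragraph.\n",
]
kept = list(body_lines(sample))
print(kept)
```

With this rule the body line starting with "Note:" is kept, whereas the colon rule would drop it.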

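Next, strip all Chinese and English punctuation, tokenize with nltk, and filter out purely numeric tokens and stop words. Since some files may contain bytes that cannot be decoded, exception handling is added. The code is as follows: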

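import re

import nltk
from nltk.corpus import stopwords

# English stop words from nltk, stored as a set for fast lookup
list_stopwords = set(stopwords.words('english'))


def fetch_tokens(file_path):
    result = []
    tot_words = []
    with open(file_path, 'r', encoding='utf-8') as file:
        try:
            for line in file:
                line = line.strip()
                # skip blank lines and header lines (those containing a colon)
                if line == '' or ':' in line:
                    continue
                # replace ASCII and full-width punctuation with spaces, then lowercase
                line = re.sub(r"[\s+\.\!\/_,$%^*(+\"\':<>\-)?]+|[+——！，。？、~@#￥%……&*（）]+", " ", line).lower()
                line = nltk.tokenize.word_tokenize(line)
                tot_words.extend(line)
        except UnicodeDecodeError:
            # some files contain bytes that are not valid UTF-8
            print("Error happened on file {}, PASS".format(file_path))
    # drop stop words and purely numeric tokens
    for word in tot_words:
        if word in list_stopwords or word.isdigit():
            continue
        else:
            result.append(word)
    return result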

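Given a file, this function returns all of its tokens. The filtering conditions can be adjusted further for specific cases. For example, some files contain extremely long words, which are either meaningless strings or so rare that they can be ignored.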
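One possible extra filter is sketched below. It drops tokens above a length limit as well as tokens that are not purely alphabetic; the function name `filter_tokens` and the limit of 20 characters are assumptions for illustration, not values taken from this post.

```python
def filter_tokens(tokens, max_len=20):
    """Keep only alphabetic tokens that are at most `max_len` characters long."""
    return [t for t in tokens if len(t) <= max_len and t.isalpha()]


# Made-up tokens: a 40-character junk string, a symbol run and a mixed token.
tokens = ["screen", "x" * 40, "----", "window", "3d"]
filtered = filter_tokens(tokens)
print(filtered)  # ['screen', 'window']
```

Requiring `isalpha()` is fairly aggressive, since it also removes mixed tokens such as "3d", so whether it is appropriate depends on the downstream task.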

Training with gensim

Training with gensim is simple: pass in the words of all the articles, with each article's tokens stored as a list. The code is as follows:

sentences = [['first', 'sentence'], ['second', 'sentence']]
model = gensim.models.Word2Vec(sentences, min_count=1)

The min_count parameter tells the model to ignore all words that occur fewer than this many times.
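The effect of this threshold can be imitated with a plain frequency count. The sketch below is an approximation for illustration only, not gensim's internal implementation, and `apply_min_count` is a hypothetical helper.

```python
from collections import Counter


def apply_min_count(sentences, min_count):
    """Drop every word whose total count across all sentences is below min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    kept_vocab = {word for word, count in counts.items() if count >= min_count}
    return [[word for word in sentence if word in kept_vocab] for sentence in sentences]


sentences = [['first', 'sentence'], ['second', 'sentence']]
pruned = apply_min_count(sentences, min_count=2)
print(pruned)  # [['sentence'], ['sentence']]
```

With min_count=1 nothing is removed, which is why the two-sentence example above needs that setting for every word to end up in the vocabulary.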

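When there are many sentences, they take up a lot of memory. A more memory-efficient approach is to use Python's yield statement. The preprocessing above can be rewritten in the following form: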

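import os
import re

import nltk
from nltk.corpus import stopwords


class MySentence:
    def __init__(self, dir_name):
        self.dir = dir_name
        self.dir_list = os.listdir(dir_name)
        # English stop words from nltk, stored as a set for fast lookup
        self.list_stopwords = set(stopwords.words('english'))

    def __iter__(self):
        # yield one token list per file, so only one file is in memory at a time
        for sub_dir in self.dir_list:
            sub_path = os.path.join(self.dir, sub_dir)
            if not os.path.isdir(sub_path):
                # skip stray files such as a previously saved model
                continue
            for file in os.listdir(sub_path):
                yield self.fetch_tokens(os.path.join(sub_path, file))

    def fetch_tokens(self, file_path):
        result = []
        tot_words = []
        with open(file_path, 'r', encoding='utf-8') as file:
            try:
                for line in file:
                    line = line.strip()
                    # skip blank lines and header lines (those containing a colon)
                    if line == '' or ':' in line:
                        continue
                    # replace ASCII and full-width punctuation with spaces, then lowercase
                    line = re.sub(r"[\s+\.\!\/_,$%^*(+\"\':<>\-)?]+|[+——！，。？、~@#￥%……&*（）]+", " ", line).lower()
                    line = nltk.tokenize.word_tokenize(line)
                    tot_words.extend(line)
            except UnicodeDecodeError:
                # some files contain bytes that are not valid UTF-8
                print("Error happened on file {}, PASS".format(file_path))
        # drop stop words and purely numeric tokens
        for word in tot_words:
            if word in self.list_stopwords or word.isdigit():
                continue
            else:
                result.append(word)
        return result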

It is used as follows:

sentences = MySentence(CORPUS_DIR)
model = gensim.models.Word2Vec(sentences)
model.save(SAVE_PATH)

This trains the word2vec model and saves it to SAVE_PATH.
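The corpus is wrapped in a class with an `__iter__` method rather than passed as a bare generator because gensim iterates over the sentences more than once (one pass to build the vocabulary, then one pass per training epoch), and a generator is exhausted after its first pass. The sketch below shows the difference with the standard library only; `token_generator` and `TokenStream` are made-up names.

```python
def token_generator(docs):
    """A plain generator: it can be consumed only once."""
    for doc in docs:
        yield doc.split()


class TokenStream:
    """A restartable iterable: every call to __iter__ starts a fresh pass."""

    def __init__(self, docs):
        self.docs = docs

    def __iter__(self):
        for doc in self.docs:
            yield doc.split()


docs = ["first sentence", "second sentence"]

gen = token_generator(docs)
gen_passes = [len(list(gen)), len(list(gen))]

stream = TokenStream(docs)
stream_passes = [len(list(stream)), len(list(stream))]

print(gen_passes)     # [2, 0]: the second pass sees nothing
print(stream_passes)  # [2, 2]: every pass sees the full corpus
```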

Loading Word2vec

A Word2vec model is loaded with gensim's load function:

model = gensim.models.Word2Vec.load(FILE_PATH)

Get the number of words in the Word2vec model:

# gensim 4.x; in gensim 3.x the vocabulary was model.wv.vocab
print(len(model.wv.key_to_index))

Look up the vector representation of a word:

# word vectors are accessed through model.wv
print(model.wv['screen'])

Find the words most similar to a given word:

# similarity queries are made through model.wv
print(model.wv.most_similar('screen'))
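gensim ranks the neighbours returned by this query by cosine similarity between word vectors. The sketch below reproduces the idea with the standard library on three hand-written toy vectors. The vector values and the helper names are invented for illustration and do not come from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Return the topn other words ranked by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 3-dimensional vectors, invented for this example.
toy_vectors = {
    "screen": [0.9, 0.1, 0.0],
    "monitor": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}
ranked = most_similar("screen", toy_vectors)
print(ranked)
```

Unlike the one-hot case, dense vectors give different similarity scores for different pairs, so "monitor" ranks above "banana" here.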

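Look up the index of every word in the Word2vec model:

# gensim 4.x maps each word to its index; gensim 3.x used model.wv.vocab and obj.index
for word, index in model.wv.key_to_index.items():
    print(word, index)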

Code

import gensim
import nltk
import re
import os
from nltk.corpus import stopwords

CORPUS_DIR = "./data/20_newsgroup"
CORPUS = '20_newsgroup'
SIZE = 200
WINDOW = 10


class MySentence:
    def __init__(self, dir_name):
        self.dir = dir_name
        self.dir_list = os.listdir(dir_name)
        # English stop words from nltk, stored as a set for fast lookup
        self.list_stopwords = set(stopwords.words('english'))

    def __iter__(self):
        # yield one token list per file, so only one file is in memory at a time
        for sub_dir in self.dir_list:
            sub_path = os.path.join(self.dir, sub_dir)
            if not os.path.isdir(sub_path):
                # skip stray files such as the model saved into CORPUS_DIR below
                continue
            for file in os.listdir(sub_path):
                yield self.fetch_tokens(os.path.join(sub_path, file))

    def fetch_tokens(self, file_path):
        result = []
        tot_words = []
        with open(file_path, 'r', encoding='utf-8') as file:
            try:
                for line in file:
                    line = line.strip()
                    # skip blank lines and header lines (those containing a colon)
                    if line == '' or ':' in line:
                        continue
                    # replace ASCII and full-width punctuation with spaces, then lowercase
                    line = re.sub(r"[\s+\.\!\/_,$%^*(+\"\':<>\-)?]+|[+——！，。？、~@#￥%……&*（）]+", " ", line).lower()
                    line = nltk.tokenize.word_tokenize(line)
                    tot_words.extend(line)
            except UnicodeDecodeError:
                # some files contain bytes that are not valid UTF-8
                print("Error happened on file {}, PASS".format(file_path))
        # drop stop words and purely numeric tokens
        for word in tot_words:
            if word in self.list_stopwords or word.isdigit():
                continue
            else:
                result.append(word)
        return result


if __name__ == '__main__':
    sentences = MySentence(CORPUS_DIR)
    # gensim 4.x names the dimensionality vector_size; gensim 3.x called it size
    model = gensim.models.Word2Vec(sentences,
                                   vector_size=SIZE,
                                   window=WINDOW,
                                   min_count=10)
    model.save("{}/word2vec_{}_{}".format(CORPUS_DIR, CORPUS, SIZE))

    # load Word2vec model
    model = gensim.models.Word2Vec.load("{}/word2vec_{}_{}".format(CORPUS_DIR, CORPUS, SIZE))
    # number of words in the Word2vec model (model.wv.vocab in gensim 3.x)
    print(len(model.wv.key_to_index))
    # fetch the vector representation of screen
    print(model.wv['screen'])
    # fetch the words most similar to screen
    print(model.wv.most_similar('screen'))
    # fetch the index of every word in the Word2vec model
    for word, index in model.wv.key_to_index.items():
        print(word, index)
