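Topic Model: LDA
Principle
LDA stands for Latent Dirichlet Allocation. Its goal is to identify topics: it factors the document-word matrix into a document-topic matrix (distribution) and a topic-word matrix (distribution).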
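Document generation process
- Choose a document $d_{i}$ according to the prior probability $P(d_{i})$
- Sample the topic distribution $\theta_{i}$ of document $i$ from the Dirichlet distribution with hyperparameter $\alpha$
- Sample the topic $z_{i,j}$ of the $j$-th word of document $i$ from the multinomial topic distribution $\theta_{i}$
- Sample the word distribution $\phi_{z_{i,j}}$ of topic $z_{i,j}$ from the Dirichlet distribution with parameter $\beta$
- Sample the word $w_{i,j}$ from the multinomial word distribution $\phi_{z_{i,j}}$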
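The generative steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (`sample_dirichlet`, `sample_categorical`, `generate_corpus`), the symmetric scalar hyperparameters, and the fixed document length are all assumptions made for the example. It also draws each topic's word distribution once up front, as standard LDA does, rather than redrawing it for every word.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index according to the probability vector probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_len, num_topics, vocab, alpha, beta, seed=0):
    """Generate documents with the LDA generative process (hypothetical helper)."""
    rng = random.Random(seed)
    # One word distribution phi_k per topic, each drawn from Dirichlet(beta).
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # Topic distribution theta_i of this document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)   # topic z_{i,j} of the j-th word
            w = sample_categorical(phi[z], rng)  # word w_{i,j} from topic z's word distribution
            words.append(vocab[w])
        docs.append(words)
    return docs, phi
```

For example, `generate_corpus(3, 5, 2, ["a", "b", "c"], alpha=0.5, beta=0.5)` returns three documents of five words each, together with the two topic-word distributions that produced them.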
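Conjugate prior distributions
The Dirichlet distribution is the conjugate prior of the multinomial distribution. If the posterior $p(\theta|x)$ and the prior $p(\theta)$ belong to the same family of distributions, the prior and posterior are called conjugate distributions, and the prior is called the conjugate prior of the likelihood function.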
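Conjugacy is what makes the update cheap: with a Dirichlet prior and multinomial (categorical) observations, the posterior is again a Dirichlet whose parameters are the prior parameters plus the observed counts. The sketch below shows only that update; the helper names `dirichlet_posterior` and `dirichlet_mean` are made up for this example.

```python
from collections import Counter


def dirichlet_posterior(alpha, observations, num_categories):
    """Posterior Dirichlet parameters: prior parameter plus observed count per category."""
    counts = Counter(observations)
    return [alpha[k] + counts.get(k, 0) for k in range(num_categories)]


def dirichlet_mean(params):
    """Mean of a Dirichlet distribution: each parameter divided by their sum."""
    total = sum(params)
    return [p / total for p in params]


# Prior Dirichlet(1, 1, 1); category 0 is observed twice and category 2 once.
posterior = dirichlet_posterior([1, 1, 1], [0, 0, 2], 3)
print(posterior)  # [3, 1, 2]
```

The posterior stays in the Dirichlet family, so the same update can be applied again whenever new observations arrive.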
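LDA parameter estimation
LDA parameters are estimated here with Gibbs sampling. Learning an LDA model amounts to estimating two unknown parameters: the topic distribution $\theta$ and the word distribution $\phi$. LDA is a generative model: given the hyperparameters $\alpha$ and $\beta$, integrating out the latent variables $\theta$ and $\phi$ yields the joint distribution $p(w,z)$:
$$p(z,w|\alpha, \beta)=p(w|z,\beta)p(z|\alpha)$$
$$p(w|z,\beta)= \int p(w|z, \phi)p(\phi|\beta)d \phi$$
$$p(z|\alpha)= \int p(z| \theta)p(\theta| \alpha)d \theta$$
Once the joint distribution is available, the topic distribution $\theta$ and the word distribution $\phi$ can be computed for the current document.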
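To make the estimation step concrete, here is a small collapsed Gibbs sampler in plain Python. It is a sketch under stated assumptions, not the procedure gensim uses: documents are lists of integer word ids, the hyperparameters are symmetric scalars, and the function name `gibbs_lda` and its defaults are chosen for illustration. Each word's topic is resampled with probability proportional to $(n_{d,k}+\alpha)(n_{k,w}+\beta)/(n_{k}+V\beta)$, and $\theta$ and $\phi$ are then read off the final counts.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iterations=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    n_k = [0] * num_topics                                # total words per topic
    z = []                                                # topic assignment of every word

    # Random initialization of the topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][j]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional p(z = t | all other assignments, w).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights, k=1)[0]
                # Record the new assignment.
                z[d][j] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Point estimates of theta (per document) and phi (per topic) from the counts.
    theta = [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha) for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta) for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return theta, phi
```

Calling `gibbs_lda([[0, 1, 0, 1], [2, 3, 2, 3]], num_topics=2, vocab_size=4)` returns one topic distribution per document and one word distribution per topic, each summing to 1.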
Code implementation
Model training
import json
from gensim import corpora, models
from gensim.corpora import Dictionary

with open(r'./data/data_specification/cn_software_data.json', 'r', encoding='utf8') as f:
    cn_software_data = json.load(f)

# Training texts; Dictionary expects a list of token lists
with open(r'./data/LDA_data/LDA_text.json', 'r', encoding='UTF8') as f:
    LDA_texts = json.load(f)

# Build and save the dictionary, then convert every document to bag-of-words
LDA_dict = Dictionary(LDA_texts)
LDA_dict.save(r'./data/LDA_data/LDA_dict')
LDA_corpus = [LDA_dict.doc2bow(text) for text in LDA_texts]
# Save the corpus in Matrix Market format so the perplexity script can load it with MmCorpus
corpora.MmCorpus.serialize(r'./data/LDA_data/LDA_corpus', LDA_corpus)

# LDA training parameters
num_topics = 500
iterations = 1000
workers = 3

# Multi-process LDA training
lda = models.ldamulticore.LdaMulticore(LDA_corpus, id2word=LDA_dict, num_topics=num_topics, iterations=iterations, workers=workers, batch=True)
# Saved as ./LDA_model/lda_500_1000.model, the path the perplexity script loads
lda.save(r'./LDA_model/lda_%s_%s.model' % (num_topics, iterations))
Computing perplexity
# -*- coding: utf-8 -*-
import logging
import math

from gensim import corpora, models

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s : ', level=logging.INFO)


def perplexity(ldamodel, testset, dictionary, size_dictionary, num_topics):
    """Calculate the perplexity of an LDA model on a test set."""
    # dictionary: {7822: 'deferment', 1841: 'circuitry', 19202: 'fabianism', ...}
    print('the info of this ldamodel: \n')
    print('num of testset: %s; size_dictionary: %s; num of topics: %s' % (len(testset), size_dictionary, num_topics))
    prob_doc_sum = 0.0
    # topic_word_list[k][word] = p(word | topic k)
    topic_word_list = []
    for topic_id in range(num_topics):
        topic_word = ldamodel.show_topic(topic_id, size_dictionary)
        topic_word_list.append({word: probability for word, probability in topic_word})
    # doc_topics_list[i][k] = (k, p(topic k | doc i))
    doc_topics_list = [ldamodel.get_document_topics(doc, minimum_probability=0) for doc in testset]
    testset_word_num = 0
    for i, doc in enumerate(testset):
        prob_doc = 0.0  # log-probability of the doc
        doc_word_num = 0  # number of words in the doc
        for word_id, num in doc:
            prob_word = 0.0  # probability of the word
            doc_word_num += num
            word = dictionary[word_id]
            for topic_id in range(num_topics):
                # p(w) = sum_z p(z) * p(w|z)
                prob_topic = doc_topics_list[i][topic_id][1]
                prob_topic_word = topic_word_list[topic_id][word]
                prob_word += prob_topic * prob_topic_word
            # log p(d) = sum_w n_w * log p(w); weighted by the count n_w to match doc_word_num
            prob_doc += num * math.log(prob_word)
        prob_doc_sum += prob_doc
        testset_word_num += doc_word_num
    # perplexity = exp(-sum_d log p(d) / sum_d N_d)
    prep = math.exp(-prob_doc_sum / testset_word_num)
    print("the perplexity of this ldamodel is : %s" % prep)
    return prep


if __name__ == '__main__':
    dictionary_path = r'./data/LDA_data/LDA_dict'
    corpus_path = r'./data/LDA_data/LDA_corpus'
    num_topics = 500
    ldamodel_path = './LDA_model/lda_{}_1000.model'.format(num_topics)
    dictionary = corpora.Dictionary.load(dictionary_path)
    corpus = corpora.MmCorpus(corpus_path)
    lda_multi = models.ldamodel.LdaModel.load(ldamodel_path)

    # Use every 300th document (1/300 of the corpus) as the test set
    testset = []
    for i in range(int(corpus.num_docs / 300)):
        testset.append(corpus[i * 300])
    prep = perplexity(lda_multi, testset, dictionary, len(dictionary.keys()), num_topics)
    with open('./LDA_model/lda_{}.txt'.format(num_topics), 'a', encoding='utf8') as f:
        f.write("the perplexity of K={} ldamodel is : {}\n".format(num_topics, prep))
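The perplexity formula is easier to check on a toy case than on a trained model. The following sketch assumes the topic mixtures and topic-word probabilities are already given as plain lists (the name `toy_perplexity` and the input layout are invented for this example) and computes $\exp(-\sum_d \log p(d) / \sum_d N_d)$ with $p(w)=\sum_z p(z)p(w|z)$.

```python
import math


def toy_perplexity(doc_topic, topic_word, docs):
    """Perplexity of bag-of-words docs given p(z|d) and p(w|z) as nested lists."""
    log_likelihood = 0.0
    num_words = 0
    for d, doc in enumerate(docs):
        for word_id, count in doc:
            # p(w|d) = sum over topics of p(z|d) * p(w|z)
            p_word = sum(
                doc_topic[d][k] * topic_word[k][word_id]
                for k in range(len(topic_word))
            )
            # Each occurrence of the word contributes log p(w|d).
            log_likelihood += count * math.log(p_word)
            num_words += count
    return math.exp(-log_likelihood / num_words)


# Two topics that are both uniform over a 4-word vocabulary.
doc_topic = [[0.5, 0.5]]
topic_word = [[0.25] * 4, [0.25] * 4]
docs = [[(0, 2), (3, 1)]]  # word 0 appears twice, word 3 once
print(toy_perplexity(doc_topic, topic_word, docs))
```

When every word has probability 1/4, the perplexity equals the vocabulary size, 4 (up to floating-point rounding), which matches the usual reading of perplexity as an effective number of equally likely choices per word.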
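Interview questions
Relationship between pLSA and LDA
Both pLSA and LDA look for a topic distribution and a word distribution. They differ in how they treat these two unknowns. pLSA finds the single parameter value (distribution) that best fits the text and takes it to be the true parameter. LDA instead holds that we cannot solve exactly for the topic and word distributions, so it treats them as random variables and makes them more "certain" by reducing their variance. In other words, rather than computing specific values for the topic and word distributions, LDA uses the observations they generate (the actual text) to infer which ranges of the parameters are likely and which are not. This is Bayesian analysis: although the true value cannot be pinned down, we can assign it a reasonable prior distribution based on experience and then derive the posterior from that prior.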
References
https://www.jianshu.com/p/b7033e792718
Further reading
ABAE model: https://www.jianshu.com/p/241cb238e21f