LDA Parameters
Derivation of LDA parameter estimation
A PDF by Xu Shuo (徐硕) of the Institute of Scientific and Technical Information of China gives a careful derivation of how to estimate the parameters of the LDA, TOT, and AT models with Gibbs sampling, and provides pseudocode based on the derived results.
Choosing the parameters alpha and beta
alpha is the parameter of a symmetric Dirichlet distribution; a larger value means a smoother (more regularized) distribution. When alpha is less than 1, the prior is peaked, with most topics receiving low probability in any given document; alpha = 1 represents a uniform prior, and alpha > 1 penalizes distributions that assign a high probability to any one topic in a specific document.
Guide 1: Appropriate values for ALPHA and BETA depend on the number of topics and the number of words in the vocabulary. For most applications, good results can be obtained by setting ALPHA = 50 / T and BETA = 200 / W.
Note (lz): alpha usually defaults to 50/k + 1, where 50/k is the common setting in LDA and the +1 follows the recommendation in "On Smoothing and Inference for Topic Models".
Guide 2: I see no advantage in Dirichlet parameters greater than 1 for topic models. I always choose the parameters as small as possible, e.g. ALPHA = 0.01/T and so on.
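A minimal sketch of applying these rules of thumb with gensim's LdaModel. It assumes a bag-of-words corpus named corpus and a dictionary named id2word already exist; note that gensim calls the beta prior eta:

from gensim.models import LdaModel

T = 100                          # number of topics
W = len(id2word)                 # vocabulary size
lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=T,
               alpha=50.0 / T,   # guide 1: ALPHA = 50 / T (guide 2 would use e.g. 0.01 / T)
               eta=200.0 / W)    # guide 1: BETA = 200 / W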
[PGM: Parameter estimation for Bayesian networks: determining the parameters of the prior distribution]
Choosing the number of topics
Sometimes the number of topics is chosen rather arbitrarily. If topic extraction is only an intermediate step of the pipeline, the exact number of topics has little effect on the final result for most users; in other words, as long as you extract enough topics, the final results are essentially the same.
However (lz note), the number of topics reflects the complexity of the model. If the number of topics is too small, the model's capacity to describe the data is limited; once the number of topics exceeds some threshold, the model is already sufficient for the data, and adding more topics brings no benefit while increasing training time.
lz suggests using cross-validation: run the model with different numbers of topics, test the sensitivity to this parameter, and choose the number of topics according to the accuracy of the final results, as sketched below.
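A minimal sketch of this selection loop, assuming a gensim dictionary id2word, a bag-of-words corpus, and the tokenized texts; it uses topic coherence as a stand-in for end-task accuracy:

from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

def pick_num_topics(corpus, id2word, texts, candidate_ks=(5, 10, 20, 40, 80)):
    # Train one LDA model per candidate topic count and score it with c_v coherence.
    scores = {}
    for k in candidate_ks:
        lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=k,
                       passes=10, random_state=42)
        cm = CoherenceModel(model=lda, texts=texts,
                            dictionary=id2word, coherence='c_v')
        scores[k] = cm.get_coherence()
    # Keep the k with the highest score; beyond some threshold the score usually flattens out.
    return max(scores, key=scores.get), scores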
HDP
Even so, you sometimes still need to decide how many topics to extract. The hierarchical Dirichlet process (HDP) learns this from the data, and it is implemented in Gensim.
import gensim
hdp = gensim.models.hdpmodel.HdpModel(mm, id2word)
The rest of the workflow is the same as with LDA, although this method takes longer to run.
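One possible way to inspect how many topics the HDP model actually ends up using (a small sketch; show_topics with num_topics=-1 returns all topics up to the model's truncation level):

all_topics = hdp.show_topics(num_topics=-1, formatted=False)   # -1 = return every topic
print(len(all_topics), "topics inferred from the data")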
Not using HDP
Hierarchical models can learn the number of topics from the data. In practice, however, the inferred topic counts and the resulting topics are often not what you expect. The optimal number of topics from a structural/syntactic point of view isn't necessarily optimal from a semantic point of view.
Thus in practice, running LDA with a number of different topic counts, evaluating the learned topics, and deciding whether the number of topics should be increased or decreased usually gives better results.
[Gensim: ndarray and vector usage, using LDA, and choosing the number of topics]
Updating the posterior distribution in practice [AMC]
After the burn-in phase ends, update the posterior distributions once every sample lag iterations.
When the data set is very small (e.g., 100 reviews), we only retain the last state of the Markov chain (i.e., sampleLag = -1). The reason is that this prevents the topics from being dominated by the most frequent words.
When the data set is not very small (e.g., 1000 reviews), we can set sampleLag to 20.
@Option(name = "-slag", usage = "Specify the length of interval to sample for calculating posterior distribution")
public int sampleLag = -1;
// Subject to change given the size of the data. When the data is very small
// (e.g., 100 reviews), we only retain the last Markov chain status (i.e., sampleLag = -1).
// The reason is that it avoids the topics being dominated by the most frequent words.
// When the data is not very small (e.g., 1000 reviews), we should set sampleLag as 20.
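A minimal Python sketch of how burn-in and sampleLag typically interact inside a Gibbs sampling loop. The three callbacks are hypothetical placeholders, not the actual AMC code:

def run_gibbs(gibbs_sweep, accumulate_posterior, normalize_posterior,
              n_iterations=2000, burn_in=500, sample_lag=20):
    # Skeleton of a collapsed Gibbs sampler with burn-in and sample lag.
    n_readouts = 0
    for it in range(n_iterations):
        gibbs_sweep()                      # resample every topic assignment once
        if it < burn_in:
            continue                       # discard samples taken before the chain mixes
        if sample_lag > 0 and (it - burn_in) % sample_lag == 0:
            accumulate_posterior()         # add the current counts to the running estimates
            n_readouts += 1
    if sample_lag <= 0:                    # sampleLag = -1: keep only the final chain state
        accumulate_posterior()
        n_readouts = 1
    normalize_posterior(n_readouts)        # average theta/phi over the collected readouts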
[Random sampling and stochastic simulation: Gibbs sampling — obtaining approximately independent samples despite the correlation between successive Gibbs samples]
Every run of LDA produces different topics
This is because LDA uses randomized training and inference steps.
How can we produce stable topics?
By resetting the numpy.random seed to the same fixed value before each run:
import numpy as np

SOME_FIXED_SEED = 42
# before training/inference:
np.random.seed(SOME_FIXED_SEED)
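If you are using gensim, an equivalent approach is to pass a fixed random_state directly when constructing the model; a small sketch, assuming the corpus and id2word from earlier:

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word,
                                      num_topics=100,
                                      random_state=SOME_FIXED_SEED)  # fixed seed -> reproducible topics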
Note: trying lots of random seeds until you hit the right model (as tested on a validation set) is a pretty standard trick.
[lda-model-generates-different-topics-everytime-i-train-on-the-same-corpus]
Naming the topics
1> The LDA topics are distributions over words, which naturally lends itself to a naming scheme: just take a number (for example 5-10) of the most probable words in the distribution as the topic descriptor (see the sketch after this list). This typically works quite well.
2> There are interesting approaches to improving topic naming, for example taking word centrality in the word network of the corpus into account, etc.
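A minimal sketch of scheme 1>, assuming a trained gensim LdaModel named lda:

def name_topics(lda, topn=8):
    # Label each topic with its topn most probable words.
    names = {}
    for topic_id in range(lda.num_topics):
        top_words = [word for word, _prob in lda.show_topic(topic_id, topn=topn)]
        names[topic_id] = " / ".join(top_words)
    return names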