An NLP model for building network relationships between users: NLP user profiling

Repost

mob64ca1412b28c 2024-05-20 20:51:48


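2016 CCF Sogou User Profiling

Team name: nice
Rank: 66/894

This competition is essentially a natural language processing (NLP) problem, or more specifically a text classification (TC) problem. Our main ideas came from a paper by Prof. Chengqing Zong's group at the Institute of Automation [Xia et al., 2011], along with some blog posts we found online.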

Brief introduction of our method
Conclusion

Brief introduction of our method

Ensemble of feature sets:

We used the classic bag-of-words (BoW) model. For feature processing, we represented each sentence or corpus with a combination of unigram and bigram features.

That is to say, given a user's query sentence, for example:

中国特色社会主义基本经济制度

we can obtain the unigram features via jieba word segmentation as:

中国/特色/社会主义/基本/经济/制度

and the bigram feature as:

中国特色/特色社会主义/社会主义基本/基本经济/经济制度

then the combined feature of the sentence becomes:

中国特色社会主义基本经济制度 -> 中国/特色/社会主义/基本/经济/制度/中国特色/特色社会主义/社会主义基本/基本经济/经济制度

The results show that combining unigram and bigram features, rather than using unigrams alone, improves accuracy (although only slightly).
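The combination step above can be sketched in a few lines. This is a minimal illustration, not the team's actual code: the function name `unigram_bigram_features` is made up here, and the token list is hard-coded (in practice it would come from a word segmenter such as jieba) so that the sketch has no dependencies.

```python
def unigram_bigram_features(tokens):
    """Return the unigrams followed by bigrams built from adjacent tokens."""
    bigrams = [a + b for a, b in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams


# Tokens as a segmenter would produce them for the example sentence
# (hard-coded here; the segmentation itself is not reproduced).
tokens = ["中国", "特色", "社会主义", "基本", "经济", "制度"]
features = unigram_bigram_features(tokens)
print("/".join(features))
```

A sentence with n tokens yields n unigrams plus n-1 bigrams, which is why the vocabulary grows quickly and some form of feature selection becomes useful.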

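After obtaining the features, we applied the chi-square test for feature selection. The chi-square test scores how much each word contributes to each class; the higher a word's chi-square score, the more representative or discriminative it is for a class. We put the 20,000 top-scoring words into the bag and used them as the vocabulary for building TF-IDF vectors. Chi-square feature selection proved clearly beneficial: in our experiments, when we skipped feature selection and simply put every word into the bag, accuracy dropped by nearly 1 percentage point. This suggests that the chi-square test is good at picking out the more discriminating words and reducing the effect of noisy words on classification accuracy.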
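To make the scoring concrete, here is a small dependency-free sketch of chi-square word scoring using the standard 2x2 contingency-table statistic on document presence/absence. It is an illustration under assumptions: the post does not show the team's implementation, the helper names and the toy documents are invented, and library versions (for example scikit-learn's `chi2`) compute the statistic on feature values rather than presence counts.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """For each word, return its best 2x2 chi-square score over all classes."""
    n = len(docs)
    class_count = Counter(labels)
    in_class = Counter()   # (word, class) -> docs of that class containing word
    word_df = Counter()    # word -> docs containing word
    for words, label in zip(docs, labels):
        for w in set(words):
            in_class[(w, label)] += 1
            word_df[w] += 1
    scores = {}
    for w, total in word_df.items():
        best = 0.0
        for cls, n_cls in class_count.items():
            a = in_class[(w, cls)]   # in class, contains w
            b = total - a            # other classes, contains w
            c = n_cls - a            # in class, lacks w
            d = n - n_cls - b        # other classes, lacks w
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[w] = best
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring words (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]


# Toy, made-up corpus: "新闻" appears everywhere and carries no class signal.
docs = [["足球", "比赛", "新闻"], ["足球", "游戏", "新闻"],
        ["化妆", "口红", "新闻"], ["化妆", "购物", "新闻"]]
labels = ["male", "male", "female", "female"]
scores = chi_square_scores(docs, labels)
top = select_top_k(scores, 2)
print(top)
```

The selected words would then serve as the fixed vocabulary for the TF-IDF step. Note that the scores depend on the labels, so the selection should be fitted on training data only.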

====================================================================================================================

A weird thing

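Commonly used single algorithms for text classification include:

Support Vector Machine (SVM)
Naive Bayes
Maximum Entropy model (Logistic Regression)
……

Then something very strange happened. Solving it gave me what I consider the most valuable lesson of this competition, which is why I want to share it.

First, take a look at the pipeline of our model:

The strange thing was this: shortly before the preliminary round ended, I happened to find that Multinomial Naive Bayes achieved an astonishingly high accuracy in cross-validation. Overjoyed, I hurriedly submitted it online, only to get a terrible score. Likewise, I found that increasing the number of features selected by the chi-square test raised the cross-validation accuracy considerably, yet the submitted score went down instead of up.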

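We fell into an overfitting-like trap!!

So we were stuck in an overfitting-like predicament that we could not make sense of. Some may say, "Hey, overfitting is as common as it gets in machine learning." But note my wording: overfitting-like. With ordinary overfitting, accuracy on the training set and on the test set differs, but the two are still positively correlated: if the algorithm improves on the training set, it should also improve on the test set. In our case, performance on the training set and on the test set moved in opposite directions. Moreover, under 10-fold cross-validation, overfitting should in theory not occur, or should at least be extremely unlikely.

This troubled us for a long time, because offline performance no longer told us anything about online performance (the two were not even correlated), so any offline improvement became meaningless...

In fact, Andrew Ng has an ICML paper devoted to this problem, Preventing "Overfitting" of Cross-Validation Data. He points out two causes:

First, the data used for cross-validation is contaminated by a lot of noise, and the algorithms end up fitting that noise.
Second, the chosen hypothesis class is too complex (its VC dimension is too high).

However, neither applied to our case.

At last, we found the solution in the video lectures for An Introduction to Statistical Learning (Trevor Hastie et al.) on YouTube! Details can be found here!

Going back to the figure above, the crux is that somewhere in our data processing we indirectly "peeked" at the test data, and that is exactly what produced the "fake" high offline accuracy!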

The best single algorithm：

before the trap: SVM
after the trap: Naive Bayes

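It is worth mentioning that, according to Trevor Hastie, even people with a lot of experience in machine learning and data mining can fall into this inconspicuous trap without noticing and obtain a pleasing-looking accuracy. It is not rare in published high-profile papers either, sometimes without the authors being aware of it.

So, throughout data processing, always watch for and avoid "peeking" at the labels. This goes a long way toward building a machine learning model whose offline and online results stay in step!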
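The effect can be reproduced on pure noise. The sketch below assumes the leak comes from running label-dependent feature selection (chi-square here) on the whole training set before cross-validation; the post does not say exactly which step leaked, so treat this as one plausible reading rather than a reconstruction of the team's pipeline. The data is random and the sizes are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.poisson(3.0, size=(60, 5000))  # pure-noise "word counts"
y = np.array([0, 1] * 30)              # labels unrelated to X

cv = StratifiedKFold(n_splits=10)

# Wrong: features are chosen using the labels of ALL samples, including
# those that later serve as validation folds.
X_leaky = SelectKBest(chi2, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(MultinomialNB(), X_leaky, y, cv=cv).mean()

# Right: selection is a pipeline step, so it is refit on each training fold only.
honest = make_pipeline(SelectKBest(chi2, k=20), MultinomialNB())
honest_acc = cross_val_score(honest, X, y, cv=cv).mean()

print("selection outside CV:", leaky_acc)
print("selection inside CV: ", honest_acc)
```

Since the labels carry no information about the features, any cross-validated accuracy well above chance in the first variant is an artifact of the selection step having seen the validation labels. Wrapping every label-dependent step in the pipeline keeps the offline estimate honest.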

====================================================================================================================

The ensemble of algorithms

Ensemble is an art more than a technique!

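On using ensembles: we tried many ensemble methods, and although they brought some improvement, the results were never exciting.

We started with voting (soft voting, which votes with predicted probabilities, and hard voting, which votes with predicted labels). Voting has simple, easy-to-understand theoretical support, and it did yield some improvement. Later we also tried Random Forest, AdaBoost, and similar methods, which, as [Xia et al., 2011] report, indeed performed poorly. One attempt did help: using SVM or Naive Bayes as the base classifier for bagging or boosting gave some improvement over the single classifier, though only a small one.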
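The difference between the two voting schemes is easiest to see on a case where they disagree. This is a minimal, dependency-free sketch with made-up probabilities; the function names are illustrative (scikit-learn's `VotingClassifier` offers the same two modes via `voting='hard'` and `voting='soft'`).

```python
def hard_vote(prob_rows):
    """Each classifier votes for its own most likely class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in prob_rows]
    return max(sorted(set(votes)), key=votes.count)


def soft_vote(prob_rows):
    """Average the class probabilities across classifiers, then take the argmax."""
    n_classes = len(prob_rows[0])
    mean = [sum(p[c] for p in prob_rows) / len(prob_rows) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


# Predicted class probabilities for one sample from three hypothetical classifiers.
probs = [
    [0.55, 0.45],  # weakly prefers class 0
    [0.60, 0.40],  # weakly prefers class 0
    [0.10, 0.90],  # strongly prefers class 1
]
print(hard_vote(probs), soft_vote(probs))
```

Two weak votes for class 0 outnumber one confident vote for class 1 under hard voting, while soft voting lets the confident classifier tip the average toward class 1.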

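Finally, building a meta classifier, i.e. stacking, caught our attention. But tuning and predicting with this method is very cumbersome (you first classify with several base classifiers, then feed their outputs as input to a higher-level classifier). The night before the competition ended we wrote a very rough linear stacking, using Max Entropy, SVM, and Naive Bayes as base classifiers and SVM as the meta classifier, and got a personal best of 0.66638. Then the preliminary round ended :)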

############# code for linear stacking ###############
# Assumes es_NB, es_ME, es_SVM (base classifiers exposing predict_proba),
# trainData_tfv (TF-IDF matrix) and trainLabel (NumPy label array) are
# defined earlier; they are not shown in this post.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

print('cross-validation now!')
n = 10
skf1 = StratifiedKFold(n_splits=n)      # outer folds: estimate stacking accuracy
skf2 = StratifiedKFold(n_splits=n - 1)  # inner folds: out-of-sample base predictions
clf_base_all = [es_NB, es_ME, es_SVM]
# The text mentions an SVM meta classifier; this listing uses
# LogisticRegression, with the SVM variant left commented out.
# clf_top = OneVsOneClassifier(svm.LinearSVC(C=0.5))
clf_top = LogisticRegression(C=1, n_jobs=-1)
# clf_top = clf_per   # 0.48456
max_count = 10  # maximum number of outer rounds to run
result = []
print('=============================================')
print(max_count, 'rounds on the road')
print('=============================================')
for count, (train, test) in enumerate(skf1.split(trainData_tfv, trainLabel), start=1):
    if count > max_count:
        break
    print('Round', count)
    train_data, test_data = trainData_tfv[train], trainData_tfv[test]
    train_label, test_label = trainLabel[train], trainLabel[test]
    oos_data, oos_label = [], []
    for i, (train_, test_) in enumerate(skf2.split(train_data, train_label), start=1):
        print('child round', i)
        # Fit each base classifier on the inner training part and place the
        # predicted class probabilities side by side as meta-features.
        preds = []
        for clf_base in clf_base_all:
            clf_base.fit(train_data[train_], train_label[train_])
            preds.append(clf_base.predict_proba(train_data[test_]))
        oos_data.append(np.hstack(preds))
        oos_label.append(train_label[test_])
    out_of_sample_data = np.vstack(oos_data)
    out_of_sample_label = np.hstack(oos_label)
    print(out_of_sample_data.shape)
    # Train the meta classifier on the out-of-sample predictions.
    clf_top.fit(out_of_sample_data, out_of_sample_label)
    # Refit the base classifiers on the full outer training fold.
    for clf_base in clf_base_all:
        clf_base.fit(train_data, train_label)
    comb_pred = np.hstack([clf_base.predict_proba(test_data) for clf_base in clf_base_all])
    print('predict in', comb_pred.shape)
    # Same value as 1 - hamming_loss for single-label multiclass targets.
    acc = accuracy_score(test_label, clf_top.predict(comb_pred))
    print('accuracy:', acc)
    result.append(acc)
    print('-------------------------------------------------------------')
print(result)
print(np.mean(result))

# ====================================================================================
# Final model: collect out-of-sample meta-features over the whole training set,
# fit the meta classifier on them, then refit the base classifiers on all data.
print('predict now!')
skf3 = StratifiedKFold(n_splits=10)
clf_base_all = [es_NB, es_ME, es_SVM]
# clf_top = OneVsOneClassifier(svm.LinearSVC(C=0.5))
clf_top = LogisticRegression(C=2, n_jobs=-1)
oos_data, oos_label = [], []
print('=============================================')
print('10 rounds on the road')
print('=============================================')
for count, (train, test) in enumerate(skf3.split(trainData_tfv, trainLabel), start=1):
    print('Round', count)
    train_data, test_data = trainData_tfv[train], trainData_tfv[test]
    train_label, test_label = trainLabel[train], trainLabel[test]
    for clf_base in clf_base_all:
        clf_base.fit(train_data, train_label)
    oos_data.append(np.hstack([clf_base.predict_proba(test_data) for clf_base in clf_base_all]))
    oos_label.append(test_label)
    print(np.vstack(oos_data).shape)
    print('-------------------------------------------------------------')

out_of_sample_data = np.vstack(oos_data)
out_of_sample_label = np.hstack(oos_label)
clf_top.fit(out_of_sample_data, out_of_sample_label)
for clf_base in clf_base_all:
    clf_base.fit(trainData_tfv, trainLabel)

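In the semi-final round we tried word2vec to improve feature extraction. After some tuning, our final rank settled at 66. Since the top five felt far out of reach and our time and energy were limited, we did not do much more after that.

Conclusion

At a glance, there are at least the following directions in which our approach could still be improved:

Semi-supervised Learning (the training data contains many samples with missing labels; semi-supervised learning would put them to use)
Imbalanced Data Processing (the classes are not balanced: males outnumber females, and the class sizes for age and education level also differ greatly, from seven or eight thousand samples down to only a few hundred)
Multi-label Learning (the three prediction targets, gender, age, and education level, are undeniably correlated, e.g. people with a university-level education mostly fall in the 18-22 age range, so multi-label learning could take this correlation into account)
……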

Reference:

[1] Rui Xia, Chengqing Zong, Shoushan Li, "Ensemble of feature sets and classification algorithms for sentiment classification," Information Sciences, Volume 181, Issue 6, 15 March 2011, Pages 1138–1152.

[2] Andrew Y. Ng, "Preventing 'overfitting' of cross-validation data," ICML, 1997, Pages 245–253.

[3] Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An Introduction to Statistical Learning: with Applications in R, Springer Texts in Statistics, 2013.
