In a previous post we walked through multi-label classification in machine learning and introduced some approaches for solving it:
In short, multi-label classification assigns a set of target labels to each sample. You can think of it as predicting multiple, mutually non-exclusive attributes of a data point: a 7-Eleven, for example, can be categorized both as a roadside convenience store and as a roadside snack shop. Among multi-label problems, multi-label text classification has wide practical application, such as tagging products on a shopping site or assigning a movie to one or more genres. This post shows how to solve multi-label text classification with Scikit-learn.
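To make the label representation concrete, here is a minimal sketch (with made-up label names) of how a set of tags per sample becomes the 0/1 indicator matrix that multi-label classifiers work with:
from sklearn.preprocessing import MultiLabelBinarizer

# Each sample carries a *set* of labels rather than a single class.
sample_labels = [{'convenience_store', 'snack_shop'}, {'convenience_store'}]
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(sample_labels)
print(mlb.classes_)  # ['convenience_store' 'snack_shop']
print(y)             # [[1 1]
                     #  [1 0]]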
Problem Description
Many of us have been insulted or harassed online, and the problem does not disappear just because you close the website or put down your phone. Researchers at Google are currently building tools to study toxic online comments. In this article I (the author, Susan Li) will build a multi-label model that detects different types of toxicity, such as threats, obscenity, and insults, using supervised classifiers and text representations. A toxic comment can be a threat, obscene, insulting, hateful, or any combination of these. The dataset comes from Kaggle:
Data Exploration
%matplotlib inline
import re
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

stop_words = set(stopwords.words('english'))

df = pd.read_csv("train 2.csv", encoding="ISO-8859-1")
df.head()
Number of comments per category
df_toxic = df.drop(['id', 'comment_text'], axis=1)
counts = []
categories = list(df_toxic.columns.values)
for i in categories:
    counts.append((i, df_toxic[i].sum()))
df_stats = pd.DataFrame(counts, columns=['category', 'number_of_comments'])
df_stats
df_stats.plot(x='category', y='number_of_comments', kind='bar', legend=False, grid=True, figsize=(8, 5))
plt.title("Number of comments per category")
plt.ylabel('# of Occurrences', fontsize=12)
plt.xlabel('category', fontsize=12)
Multi-Label
How many comments have more than one label?
rowsums = df.iloc[:,2:].sum(axis=1)
x = rowsums.value_counts()

# Plot the number of comments per label count
plt.figure(figsize=(8,5))
ax = sns.barplot(x=x.index, y=x.values)
plt.title("Multiple categories per comment")
plt.ylabel('# of Occurrences', fontsize=12)
plt.xlabel('# of categories', fontsize=12)

The vast majority of the comment texts are not labelled.
print('Percentage of comments that are not labelled:')
print(len(df[(df['toxic']==0) & (df['severe_toxic']==0) & (df['obscene']==0) & (df['threat']== 0) & (df['insult']==0) & (df['identity_hate']==0)]) / len(df))
Percentage of comments that are not labelled:
0.8983211235124177
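Since the six label columns are all 0/1, the same figure can also be computed more compactly; a small equivalent sketch (assuming the label columns are exactly these six):
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
# A comment is unlabelled when all six indicator columns are zero;
# the mean of that boolean mask is the unlabelled fraction.
print((df[label_cols].sum(axis=1) == 0).mean())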
Distribution of comment text lengths (in characters):
lens = df.comment_text.str.len()
lens.hist(bins=np.arange(0, 5000, 50))
Most of the comment texts are within 500 characters, with some outliers reaching up to 5,000 characters.
There are no missing comments in the comment_text column.
print('Number of missing comments in comment text:')
df['comment_text'].isnull().sum()
Number of missing comments in comment text:
0
A look at the first comment makes clear that the text data needs cleaning.
df['comment_text'][0]
“Explanation\rWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren’t vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don’t remove the template from the talk page since I’m retired now.89.205.38.27”
Data Preprocessing
Create a function to clean the text:
def clean_text(text):
    # Lowercase, expand common English contractions, then strip
    # non-word characters and collapse whitespace.
    text = text.lower()
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"\'scuse", " excuse ", text)
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.strip(' ')
    return text
Clean the comment_text column:
df['comment_text'] = df['comment_text'].map(clean_text)
df['comment_text'][0]
‘explanation why the edits made under my username hardcore metallica fan were reverted they were not vandalisms just closure on some gas after i voted at new york dolls fac and please do not remove the template from the talk page since i am retired now 89 205 38 27’
Much better than before!
Split the data into training and test sets:
categories = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
train, test = train_test_split(df, random_state=42, test_size=0.33, shuffle=True)
X_train = train.comment_text
X_test = test.comment_text
print(X_train.shape)
print(X_test.shape)
(106912,)
(52659,)
Classifier Training
Pipeline
Scikit-learn provides a Pipeline utility to help automate machine learning workflows. Pipelines are very common in machine learning systems, since there is a lot of data to manipulate and many data transformations to apply. Below we will use a Pipeline to train each classifier.
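To see exactly what the Pipeline automates, here is a minimal sketch (variable names are illustrative) of the manual two-step version it replaces, using the toxic column as an example target:
# Without a Pipeline, each step is fit and applied by hand:
vect = TfidfVectorizer(stop_words=stop_words)
X_train_tfidf = vect.fit_transform(X_train)       # learn the vocabulary on training text only
clf = LogisticRegression().fit(X_train_tfidf, train['toxic'])
prediction = clf.predict(vect.transform(X_test))  # reuse the fitted vocabulary

# With a Pipeline, one fit and one predict run the whole chain:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', LogisticRegression()),
])
pipe.fit(X_train, train['toxic'])
prediction = pipe.predict(X_test)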
OneVsRest multi-label strategy
This multi-label strategy works with a binary mask over the labels: each prediction comes back as an array of 0s and 1s marking which class labels apply to the corresponding input sample.
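The sections below train one binary problem per category, but for reference OneVsRestClassifier can also be fit on the whole 0/1 label matrix at once, in which case predict returns one such mask per comment. A small sketch (reusing the categories list defined above; the classifier choice here is illustrative):
# y is the (n_samples, 6) indicator matrix; one binary classifier is fit per column.
mask_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LogisticRegression())),
])
mask_pipeline.fit(X_train, train[categories])
# Each prediction row is a binary mask over the six labels,
# e.g. [1 0 1 0 1 0] means toxic, obscene and insult all apply.
print(mask_pipeline.predict(X_test)[:3])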
Naive Bayes
The OneVsRest strategy can be used for multi-label learning, where a classifier is used to predict multiple labels for an instance. Naive Bayes supports multi-class classification, but since this is a multi-label problem we wrap MultinomialNB inside OneVsRestClassifier.
# Define a pipeline combining a text feature extractor with a multi-label classifier
NB_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(MultinomialNB(
        fit_prior=True, class_prior=None))),
])
for category in categories:
    print('... Processing {}'.format(category))
    # Train the model on this category's labels
    NB_pipeline.fit(X_train, train[category])
    # Compute the test accuracy
    prediction = NB_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
... Processing toxic
Test accuracy is 0.9191401279933155
... Processing severe_toxic
Test accuracy is 0.9900112041626312
... Processing obscene
Test accuracy is 0.9514802787747584
... Processing threat
Test accuracy is 0.9971135038644866
... Processing insult
Test accuracy is 0.9517271501547694
... Processing identity_hate
Test accuracy is 0.9910556600011394
LinearSVC
SVC_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
for category in categories:
    print('... Processing {}'.format(category))
    # Train the model on this category's labels
    SVC_pipeline.fit(X_train, train[category])
    # Compute the test accuracy
    prediction = SVC_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
... Processing toxic
Test accuracy is 0.9599498661197516
... Processing severe_toxic
Test accuracy is 0.9906948479842003
... Processing obscene
Test accuracy is 0.9789019920621356
... Processing threat
Test accuracy is 0.9974173455629617
... Processing insult
Test accuracy is 0.9712299891756395
... Processing identity_hate
Test accuracy is 0.9919861752027194
Logistic Regression
LogReg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=1)),
])
for category in categories:
    print('... Processing {}'.format(category))
    # Train the model on this category's labels
    LogReg_pipeline.fit(X_train, train[category])
    # Compute the test accuracy
    prediction = LogReg_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
... Processing toxic
Test accuracy is 0.9548415275641391
... Processing severe_toxic
Test accuracy is 0.9910556600011394
... Processing obscene
Test accuracy is 0.9761104464573956
... Processing threat
Test accuracy is 0.9973793653506523
... Processing insult
Test accuracy is 0.9687612753755294
... Processing identity_hate
Test accuracy is 0.991758293928863
The three classifiers produced roughly the same results. With that, we have built a strong baseline for the toxic comment multi-label text classification problem.
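As a quick usage check, the fitted pipelines can tag a new comment. Note that the per-category loop above leaves each pipeline fitted on the last category only (identity_hate), so this sketch refits per category before predicting (the example comment is made up):
# Collect every category whose LinearSVC pipeline fires on a new comment.
new_comment = [clean_text("i will find you and hurt you")]  # hypothetical input
predicted_tags = []
for category in categories:
    SVC_pipeline.fit(X_train, train[category])  # refit for this category
    if SVC_pipeline.predict(new_comment)[0] == 1:
        predicted_tags.append(category)
print(predicted_tags)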