Spam SMS Classification with Naive Bayes
Classifying spam SMS (or email) with a Bayesian classifier is easily one of the top three introductory machine-learning exercises, yet there is a real gap between understanding an algorithm's principle and implementing it by hand. The principle of Bayesian classification can be summed up in one sentence: starting from an object's prior probability, use Bayes' formula to compute its posterior probability, i.e. the probability that the object belongs to each class, and assign the object to the class with the largest posterior. Expressed as a formula, it is essentially the conditional probability formula:
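P(c | x) = P(x | c) * P(c) / P(x)
Here P(c) is the prior probability of class c, P(x | c) is the likelihood of observing message x under that class, and P(c | x) is the posterior we want. Since P(x) is the same for every class, in practice it is enough to compare the numerators P(x | c) * P(c).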
Today's topic, however, is not the mathematical derivation of Bayes' formula; instead we will implement a Bayesian classifier by hand and use it to predict whether a message is spam.
1 Data Description
The data comes from the spam SMS recognition task in LintCode's AI module.
Link: https://www.lintcode.com/ai/
Sample data:
Label,Text
ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",,,
ham,Ok lar... Joking wif u oni...,,,
spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,,,
ham,U dun say so early hor... U c already then say...,,,
ham,"Nah I don't think he goes to usf, he lives around here though",,,
spam,"FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, �1.50 to rcv",,,
ham,Even my brother is not like to speak with me. They treat me like aids patent.,,,
ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune,,,
spam,WINNER!! As a valued network customer you have been selected to receivea �900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.,,,
spam,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030,,,
ham,"I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.",,,
spam,"SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info",,,
spam,"URGENT! You have won a 1 week FREE membership in our �100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18",,,
ham,I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.,,,
ham,I HAVE A DATE ON SUNDAY WITH WILL!!,,,
spam,"XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL",,,
ham,Oh k...i'm watching here:),,,
ham,Eh u remember how 2 spell his name... Yes i did. He v naughty make until i v wet.,,,
ham,Fine if that��s the way u feel. That��s the way its gota b,,,
2 Data Cleaning
Analyzing the data reveals the following patterns and the corresponding handling:
- can't, didn't --- expand contractions into their full forms (can not, do not); a quick check of these rules is shown in the snippet after this list
- that��s, I���m, �150 --- replace the garbled � character with an English apostrophe ' and then split as above
- Mixed upper and lower case --- convert everything to one case, all upper or all lower
- Punctuation and other special symbols --- remove everything that is not an English letter
- Words carrying suffixes such as -ing --- reduce them to their stems with a stemmer (for example, winning becomes win)
- Remove English stop words to improve classification quality
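As a quick sanity check of the contraction rules above (a hypothetical snippet, not part of the original pipeline):
import re

text = "I can't believe you don't like it".lower()
text = re.sub(r"can't", "can not ", text)
text = re.sub(r"don't", " do not ", text)
text = re.sub('[^a-z]', ' ', text)  # drop anything that is not a lowercase letter
print(text.split())
# ['i', 'can', 'not', 'believe', 'you', 'do', 'not', 'like', 'it']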
This article uses the following helper function to read the CSV file:
import csv

def readCSV(filename, header=False, dex=-1):
    """Read a CSV file; skip the first row if header=True.
    Return only column `dex` unless dex == -1 (all columns)."""
    data = []
    with open(filename, 'r', encoding='utf8') as f:
        reader = csv.reader(f)
        for line in reader:
            if header:
                header = False  # skip the header row once
                continue
            if dex == -1:
                data.append(line)       # keep the whole row
            else:
                data.append(line[dex])  # keep a single column
    return data
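A hypothetical call, reading every column of the training file and skipping the header row; the path matches the one used in the later sections:
rows = readCSV('./data/train.csv', header=True)
print(rows[0][0], rows[0][1][:30])  # label and the start of the text of the first message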
The data-cleaning function is as follows:
import re
from nltk.corpus import stopwords              # stop words
from nltk.stem.snowball import EnglishStemmer  # stemmer

def clean_data(filename):
    train_word_set = []   # training set: one list of words per message
    labels = []           # message labels
    vocablist = []        # vocabulary built from all messages
    s = EnglishStemmer()  # stemmer
    datas = readCSV(filename, header=True, dex=-1)
    for data in datas:
        # add the label
        if data[0] == 'ham':
            labels.append(0)
        elif data[0] == 'spam':
            labels.append(1)
        else:
            labels.append(data[0])
        # process the text
        text = data[1]
        text = re.sub('���', "'", text)
        text = re.sub('��', "'", text)
        text = re.sub('�', "'", text)
        text = text.lower()  # lowercase
        text = re.sub(r"\'s", " ", text)
        text = re.sub(r"\'ve", " have ", text)
        text = re.sub(r"can't", "can not ", text)
        text = re.sub(r"cannot", "can not ", text)
        text = re.sub(r"n't", " not ", text)
        text = re.sub(r"\'m", " am ", text)
        text = re.sub(r"\'re", " are ", text)
        text = re.sub(r"\'d", " will ", text)
        text = re.sub(r"ain\'t", " are not ", text)
        text = re.sub(r"aren't", " are not ", text)
        text = re.sub(r"couldn\'t", " can not ", text)
        text = re.sub(r"didn't", " do not ", text)
        text = re.sub(r"doesn't", " do not ", text)
        text = re.sub(r"don't", " do not ", text)
        text = re.sub(r"hadn't", " have not ", text)
        text = re.sub(r"hasn't", " have not ", text)
        text = re.sub(r"\'ll", " will ", text)
        text = re.sub('[^a-z]', ' ', text)  # remove non-letter characters
        text = text.split()
        text = [s.stem(word) for word in text]  # stemming
        train_word_set.append(text)
        vocablist.extend(text)
    stops = stopwords.words('english')
    vocablist = list(set(vocablist) - set(stops))  # vocabulary without stop words
    return train_word_set, labels, vocablist
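A quick, hypothetical check of what clean_data returns (assuming train.csv sits in ./data/ as in the later sections):
train_word_set, labels, vocablist = clean_data('./data/train.csv')
print(len(train_word_set), len(labels))  # one word list and one label per message
print(len(vocablist))                    # size of the stop-word-free vocabulary
print(train_word_set[0][:5], labels[0])  # first few stemmed words of the first message and its label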
3 Building Word Vectors
After cleaning, the data still cannot be used for training directly; the text first has to be turned into vectors. There are several ways to build these vectors: set-of-words, bag-of-words, and TF-IDF.
Set of words: the set formed by the words; any word that appears (no matter how many times) is recorded as 1, and a word that does not appear is recorded as 0.
Bag of words: if a word appears more than once in a document, count its number of occurrences (frequency); a word that never appears is recorded as 0.
TF-IDF (Term Frequency - Inverse Document Frequency). Main idea: if a word w appears frequently in document d but rarely in other documents, then w has strong discriminative power and is well suited to separating document d from the rest.
TF (term frequency): how often a given word occurs in the document, i.e. the count count(w, d) of word w in document d divided by the total number of words size(d) in d.
IDF (inverse document frequency): a measure of how important a word is in general. The IDF of a word is obtained by dividing the total number of documents by the number of documents that contain the word and taking the logarithm of that quotient, i.e. the log of the ratio between the total number of documents n and the number of documents docs(w, D) in which w appears.
The tf-idf model uses tf and idf to compute, for every document d and a query q consisting of keywords w[1]...w[k], a weight that expresses how well q matches d:
tf-idf(q, d)
= sum { i = 1..k | tf-idf(w[i], d) }
= sum { i = 1..k | tf(w[i], d) * idf(w[i]) }
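The formulas above translate directly into code. Below is a minimal sketch (not part of the original pipeline) that assumes docs is a list of word lists, such as the train_word_set returned by clean_data:
import math

def tf(word, doc):
    # term frequency: count of `word` in `doc` over the total number of words in `doc`
    return doc.count(word) / len(doc)

def idf(word, docs):
    # inverse document frequency: log(total documents / documents containing `word`)
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing) if n_containing else 0.0

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)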
3.1 Set-of-Words / Bag-of-Words
Implementation:
import numpy as np

def word2vec(word_list, vocablist):
    # args: list of documents (each a list of words), vocabulary list
    row = len(word_list)
    col = len(vocablist)
    data = np.zeros(shape=[row, col], dtype=np.float32)
    for i in range(row):
        for word in word_list[i]:
            if word in vocablist:
                data[i, vocablist.index(word)] = 1     # set-of-words
                # data[i, vocablist.index(word)] += 1  # bag-of-words
    return data
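A hypothetical usage of word2vec, reusing the output of clean_data. Note that vocablist.index(word) is a linear scan, so for a large vocabulary a word-to-index dictionary would be noticeably faster:
train_word_set, labels, vocablist = clean_data('./data/train.csv')
vectors = word2vec(train_word_set, vocablist)
print(vectors.shape)  # (number of messages, vocabulary size)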
3.2 TF-IDF
This can be implemented by hand or with the library that ships with sklearn; this article uses the sklearn implementation. Note: TfidfVectorizer takes a collection of raw text strings as input, not lists of individual words.
from sklearn.feature_extraction.text import TfidfVectorizer

def tdVector(x_train, x_test):
    tfid_vec = TfidfVectorizer()
    x_tfid_train = tfid_vec.fit_transform(x_train)
    x_tfid_test = tfid_vec.transform(x_test)
    return x_tfid_train, x_tfid_test
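A hypothetical call to tdVector: the word lists are first joined back into strings (the same trick used later in section 5), and the 4000-message split is an arbitrary number chosen only for illustration:
texts = [' '.join(words) for words in train_word_set]
x_tfid_train, x_tfid_test = tdVector(texts[:4000], texts[4000:])
print(x_tfid_train.shape, x_tfid_test.shape)  # rows = messages, columns = vocabulary learned from the first part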
4 Implementing Naive Bayes
The code is as follows:
class My_Beiyes(object):
    def fit(self, train_data_set, train_category):
        # size of the training set, i.e. the number of messages
        Train_size = len(train_data_set)
        # length of the vocabulary vector
        Num_words = len(train_data_set[0])
        # probability that a message is spam
        P_abusive = sum(train_category) / float(Train_size)
        # We will multiply many per-word probabilities to get the probability that a
        # document belongs to a class, i.e. p(w0|1)p(w1|1)p(w2|1)... If any of them
        # is 0 the whole product becomes 0, so initialize counts to avoid zeros:
        P_0_denom = P_1_denom = 2.0  # initialize the denominators to 2
        # count word frequencies per class: 0 = ham, 1 = spam
        # initialize every word count to 1
        P_0_num = np.ones(Num_words)
        P_1_num = np.ones(Num_words)
        # loop over the messages and accumulate counts per class
        for i in range(Train_size):
            if train_category[i] == 1:
                P_1_num += np.array(train_data_set[i])
                P_1_denom += sum(train_data_set[i])
            else:
                P_0_num += np.array(train_data_set[i])
                P_0_denom += sum(train_data_set[i])
        # turn the counts into per-class word probabilities
        P_1_vec = P_1_num / P_1_denom  # (p(x0=1|y=1), p(x1=1|y=1), ..., p(xn=1|y=1))
        P_0_vec = P_0_num / P_0_denom  # (p(x0=1|y=0), p(x1=1|y=0), ..., p(xn=1|y=0))
        # when computing the product p(w0|ci)p(w1|ci)...p(wN|ci), most factors are
        # tiny, so the product underflows or gives wrong answers; work in log space
        P_1_vec = np.log(P_1_vec)
        P_0_vec = np.log(P_0_vec)
        return P_0_vec, P_1_vec, P_abusive

    # the Naive Bayes classifier
    def predict(self, test_in, P_0_vec, P_1_vec, P_abusive):
        pred = []
        for item in test_in:
            P_0 = sum(item * P_0_vec) + np.log(1 - P_abusive)
            P_1 = sum(item * P_1_vec) + np.log(P_abusive)
            if P_1 > P_0:
                pred.append(1)
            else:
                pred.append(0)
        return np.array(pred)
5 Training and Testing
Testing the model:
from sklearn.model_selection import train_test_split

def train_mybeiyes():
    train_filename = './data/train.csv'
    train_word_list, train_labels, train_vocablist = clean_data(train_filename)
    train = word2vec(train_word_list, train_vocablist)
    X_train, X_test, y_train, y_test = train_test_split(train, train_labels, test_size=0.2, random_state=10)
    clf = My_Beiyes()
    P_0_vec, P_1_vec, P_abusive = clf.fit(X_train, y_train)
    pred = clf.predict(X_test, P_0_vec, P_1_vec, P_abusive)
    print("Hand-rolled Naive Bayes test accuracy:")
    print('\t', 1 - np.sum(np.abs(y_test - pred)) / len(y_test))
The TF-IDF variant needs a small tweak, because the matrices produced by the sklearn library cannot be fed directly into the hand-rolled Naive Bayes model. The implementation is as follows.
def train_mybeiyes_tf_idf():
    train_filename = './data/train.csv'
    train_word_list, train_labels, train_vocablist = clean_data(train_filename)
    train = []
    for item in train_word_list:
        train.append(' '.join(item))  # TfidfVectorizer expects raw text strings
    X_train, X_test, y_train, y_test = train_test_split(train, train_labels, test_size=0.2, random_state=10)
    x_tfid_train, x_tfid_test = tdVector(X_train, X_test)
    # copy the sparse matrices into plain lists so My_Beiyes can iterate over them
    train_data, test_data = [], []
    for i in range(x_tfid_train.shape[0]):
        temp = []
        for j in range(x_tfid_train.shape[1]):
            temp.append(x_tfid_train[i, j])
        train_data.append(temp)
    print("train data done")
    for i in range(x_tfid_test.shape[0]):
        temp = []
        for j in range(x_tfid_test.shape[1]):
            temp.append(x_tfid_test[i, j])
        test_data.append(temp)
    print("test data done")
    clf = My_Beiyes()
    P_0_vec, P_1_vec, P_abusive = clf.fit(train_data, y_train)
    pred = clf.predict(test_data, P_0_vec, P_1_vec, P_abusive)
    print("TF-IDF model test accuracy:")
    print('\t', 1 - np.sum(np.abs(y_test - pred)) / len(y_test))
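Copying the sparse matrices element by element as above works but is slow. TfidfVectorizer returns scipy sparse matrices, so a shorter alternative (a small variation on the original code) is:
train_data = x_tfid_train.toarray()  # dense numpy array, one row per message
test_data = x_tfid_test.toarray()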
6 Other Models
When training the other models I also brought the test data from the dataset in, so the results differ slightly from those above. A classifier can be chosen by name; since there is not much data, the accuracy of all of them is quite high.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC  # needed for the 'SVM' branch below
from sklearn.tree import DecisionTreeClassifier
def train(clfname):
    train_filename = './data/train.csv'
    train_word_list, train_labels, train_vocablist = clean_data(train_filename)
    test_filename = './data/test.csv'
    test_word_list, test_labels, test_vocablist = clean_data(test_filename)
    # merge the vocabularies of the training and test files
    train_vocablist.extend(test_vocablist)
    vocablist = list(set(train_vocablist))
    X = word2vec(train_word_list, vocablist)
    Y = train_labels
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=10)
    if clfname == 'GaussianNB':
        clf = GaussianNB()
    elif clfname == 'SVM':
        clf = SVC()
    elif clfname == 'LR':
        clf = LogisticRegression()
    elif clfname == 'RF':
        clf = RandomForestClassifier()
    elif clfname == 'SGD':
        clf = SGDClassifier()
    elif clfname == 'GBDT':
        clf = GradientBoostingClassifier()
    elif clfname == 'DT':
        clf = DecisionTreeClassifier()
    else:
        print("No such classifier!")
        return
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    acc = 1 - np.sum(np.abs(y_test - pred)) / len(y_test)
    print(acc)
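To run everything end to end, a hypothetical driver could look like the following; the classifier names correspond to the branches inside train above:
if __name__ == '__main__':
    train_mybeiyes()         # hand-rolled Naive Bayes on set-of-words vectors
    train_mybeiyes_tf_idf()  # hand-rolled Naive Bayes on TF-IDF features
    for name in ['GaussianNB', 'SVM', 'LR', 'RF', 'SGD', 'GBDT', 'DT']:
        print(name, end=': ')
        train(name)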