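2 Spam Classification
Many email services today provide spam filters that classify emails into spam and non-spam with high accuracy. In this part of the exercise, you will use SVMs to build your own spam filter.
2.1 Importing modules
Load the modules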
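import importlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.io as scio
from sklearn import svm
import processEmail as pe  # keyword extraction for emails
import emailFeatures as ef  # feature-vector extraction for emails
# Reload the module after editing it. This is convenient when developing in Jupyter,
# where changes to an already imported module are not picked up automatically.
# (importlib replaces the deprecated imp module.)
importlib.reload(ef)
plt.ion()
np.set_printoptions(formatter={'float': '{: 0.6f}'.format})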
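2.2 The processEmail function
This function normalizes the text of an email, extracts its keywords, and converts each keyword into its index in the vocabulary list.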
import re
import numpy as np
import nltk, nltk.stem.porter


def process_email(email_contents):
    vocab_list = get_vocab_list()
    # Reverse lookup: stemmed word -> index in vocab.txt
    word_to_index = {word: idx for idx, word in vocab_list.items()}
    word_indices = []
    # ===================== Preprocess Email =====================
    # Convert the whole email to lower case
    email_contents = email_contents.lower()
    # Strip HTML tags
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    # Replace every number with the word 'number'
    email_contents = re.sub(r'[0-9]+', 'number', email_contents)
    # Replace anything starting with http:// or https:// with 'httpaddr'
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)
    # Strings with '@' in the middle are treated as email addresses --> 'emailaddr'
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)
    # Replace the '$' sign with 'dollar'
    email_contents = re.sub(r'[$]+', 'dollar', email_contents)
    # ===================== Tokenize Email =====================
    # Output the email
    print('==== Processed Email ====')
    stemmer = nltk.stem.porter.PorterStemmer()
    # The hyphen is escaped so that '.-:' is not read as a character range,
    # and \s splits on any whitespace, not only on spaces
    tokens = re.split(r'[@$/#.\-:&*+=\[\]?!(){},\'">_<;%\s]', email_contents)
    for token in tokens:
        # Remove any non-alphanumeric characters
        token = re.sub(r'[^a-zA-Z0-9]', '', token)
        # Reduce the word to its stem
        token = stemmer.stem(token)
        if len(token) < 1:
            continue
        print(token)
        # Keep the word only if it is in the vocabulary
        if token in word_to_index:
            word_indices.append(word_to_index[token])
    print('==================')
    return np.array(word_indices, dtype=np.int64)


def get_vocab_list():
    # Each line of vocab.txt holds "<index> <word>"; the indices start at 1
    vocab_dict = {}
    with open('vocab.txt') as f:
        for line in f:
            (val, key) = line.split()
            vocab_dict[int(val)] = key
    return vocab_dict
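The normalization steps of process_email can be tried on their own, without NLTK or vocab.txt. The following is a minimal sketch that repeats the same regular expressions; the helper name normalize_email and the sample string are made up for illustration, and stemming is left out.

```python
import re


def normalize_email(text):
    """Apply the normalization regexes from process_email (no stemming)."""
    text = text.lower()                                        # lower-case everything
    text = re.sub(r'<[^<>]+>', ' ', text)                      # strip HTML tags
    text = re.sub(r'[0-9]+', 'number', text)                   # digits -> 'number'
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)  # URLs -> 'httpaddr'
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)         # addresses -> 'emailaddr'
    text = re.sub(r'[$]+', 'dollar', text)                     # '$' -> 'dollar'
    return text


# Hypothetical sample text, not taken from emailSample1.txt
sample = 'Visit <b>http://example.com/a1</b> or mail bob@example.com, only $100!'
print(normalize_email(sample).split())
```

The order of the substitutions matters: digits are replaced before URLs and addresses, so a URL such as http://example.com/a1 still collapses to a single httpaddr token.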
Call process_email to extract the keywords of a sample email
# ===================== Part 1: Email Preprocessing =====================
print('Preprocessing sample email (emailSample1.txt) ...')
with open('emailSample1.txt', 'r') as f:
    file_contents = f.read()
word_indices = pe.process_email(file_contents)
Preprocessing sample email (emailSample1.txt) ...
==== Processed Email ====
anyon
know
how
much
it
cost
to
host
a
web
portal
well
it
depend
on
how
mani
visitor
you
re
expect
thi
can
be
anywher
from
less
than
number
buck
a
month
to
a
coupl
of
dollarnumb
you
should
checkout
httpaddr
or
perhap
amazon
ecnumb
if
your
run
someth
big
to
unsubscrib
yourself
from
thi
mail
list
send
an
email
to
emailaddr
==================
Print the vocabulary indices of the words that were found in the email
# Print stats
print('Word Indices: ')
print(word_indices)
Word Indices:
[ 86 916 794 1077 883 370 1699 790 1822 1831 883 431 1171 794
1002 1893 1364 592 1676 238 162 89 688 945 1663 1120 1062 1699
375 1162 479 1893 1510 799 1182 1237 810 1895 1440 1547 181 1699
1758 1896 688 1676 992 961 1477 71 530 1699 531]
Convert the extracted word indices into a feature vector:
# ===================== Part 2: Feature Extraction =====================
print('Extracting Features from sample email (emailSample1.txt) ... ')
# Extract features
features = ef.email_features(word_indices)
# Print stats
print('Length of feature vector: {}'.format(features.size))
print('Number of non-zero entries: {}'.format(np.flatnonzero(features).size))# np.sum(features)
Extracting Features from sample email (emailSample1.txt) ...
Length of feature vector: 1900
Number of non-zero entries: 45
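The emailFeatures module is not listed in this post. As a rough idea of what such a function does, here is a sketch that assumes a binary bag-of-words vector with one entry per vocabulary word and 1-based indices from vocab.txt. The default vocab_size=1899 is an assumption chosen to match the number of columns in X; the run above reports a length of 1900, so the author's version presumably handles the index offset differently.

```python
def email_features(word_indices, vocab_size=1899):
    """Binary bag-of-words: entry j is 1 if vocabulary word j + 1 occurs in the email."""
    features = [0] * vocab_size
    for idx in word_indices:
        # vocab.txt numbers words from 1, list positions start at 0
        features[idx - 1] = 1
    return features


# A few indices from the sample email, with one repeated on purpose
demo = email_features([86, 916, 794, 794, 1699])
print(len(demo), sum(demo))
```

Repeated words set the same entry again, which is why the 53 word indices of the sample email produce only 45 non-zero entries.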
2.3 Training a linear SVM spam classifier
With C = 0.1, the accuracy on the training set reaches 99.825%.
# ===================== Part 3: Train Linear SVM for Spam Classification =====================
# Load the Spam Email dataset
# You will have X, y in your environment
data = scio.loadmat('spamTrain.mat')
X = data['X']
y = data['y'].flatten()
print('X.shape: ', X.shape, '\ny.shape: ', y.shape)
print('Training Linear SVM (Spam Classification)')
print('(this may take 1 to 2 minutes)')
c = 0.1
clf = svm.SVC(C=c, kernel='linear')  # C must be passed by keyword in current scikit-learn
clf.fit(X, y)
p = clf.predict(X)
print('Training Accuracy: {}'.format(np.mean(p == y) * 100))
X.shape: (4000, 1899)
y.shape: (4000,)
Training Linear SVM (Spam Classification)
(this may take 1 to 2 minutes)
Training Accuracy: 99.825
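For a linear kernel, the trained classifier boils down to a weight vector w and a bias b (exposed by scikit-learn as clf.coef_ and clf.intercept_), and an email is labelled spam when w·x + b is positive. The sketch below illustrates that rule in plain Python; the function name and the toy numbers are invented for illustration and are not taken from the trained model.

```python
def linear_svm_predict(weights, bias, x):
    """Return 1 (spam) if the linear score w.x + b is positive, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


# Toy model over a three-word vocabulary (made-up values)
toy_weights = [0.5, -0.3, 0.4]
toy_bias = -0.2

print(linear_svm_predict(toy_weights, toy_bias, [1, 0, 1]))  # words 1 and 3 present
print(linear_svm_predict(toy_weights, toy_bias, [0, 1, 0]))  # only word 2 present
```

Because the features are 0/1 indicators, each word present in the email simply adds its weight to the score.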
2.4 Evaluating the trained linear SVM on the test set
Accuracy on the test set reaches 98.9%, which is a good result.
# ===================== Part 4: Test Spam Classification =====================
# After training the classifier, we can evaluate it on a test set. We have
# included a test set in spamTest.mat
# Load the test dataset
data = scio.loadmat('spamTest.mat')
Xtest = data['Xtest']
ytest = data['ytest'].flatten()
print('Xtest.shape: ', Xtest.shape, '\nytest.shape: ', ytest.shape)
print('Evaluating the trained linear SVM on a test set ...')
p = clf.predict(Xtest)
print('Test Accuracy: {}'.format(np.mean(p == ytest) * 100))
Xtest.shape: (1000, 1899)
ytest.shape: (1000,)
Evaluating the trained linear SVM on a test set ...
Test Accuracy: 98.9
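Accuracy does not show which kind of mistake the filter makes, and for a spam filter a legitimate email flagged as spam is usually worse than a missed spam. As an optional extra check, the sketch below counts both error types and derives precision and recall; the helper names and the toy labels are assumptions for illustration, not part of the exercise.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating label 1 as spam."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1  # legitimate email flagged as spam
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1  # spam that slipped through
    return tp, fp, tn, fn


def precision_recall(tp, fp, fn):
    """Precision and recall for the spam class; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels, not the real test set
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn, precision_recall(tp, fp, fn))
```

With the real data, the same functions could be called as confusion_counts(ytest, p).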
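2.5 Which words are the strongest predictors of spam
Because the model we trained is a linear SVM, we can inspect the learned weights w to better understand how it decides whether an email is spam. The code below finds the words with the largest weights in the classifier. Informally, the classifier "thinks" these words are the strongest indicators of spam.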
vocab_list = pe.get_vocab_list()
weights = clf.coef_.flatten()
indices = np.argsort(weights)[::-1]
print(indices)
for i in range(15):
    # Column j of X corresponds to vocabulary key j + 1, because vocab.txt numbers its words from 1
    print('{} ({:0.6f})'.format(vocab_list[indices[i] + 1], weights[indices[i]]))
[1190 297 1397 ... 1764 1665 1560]
our (0.500614)
click (0.465916)
remov (0.422869)
guarante (0.383622)
visit (0.367710)
basenumb (0.345064)
dollar (0.323632)
will (0.269724)
price (0.267298)
pleas (0.261169)
most (0.257298)
nbsp (0.253941)
lo (0.253467)
ga (0.248297)
hour (0.246404)
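The ranking step can be illustrated on a toy model. The sketch below sorts the weight positions from largest to smallest and looks up the matching words; it assumes, as in vocab.txt, that the vocabulary dictionary is keyed from 1 while the weight positions start at 0. The function name, the four-word vocabulary, and the weights are invented for illustration.

```python
def top_words(weights, vocab, n=3):
    """Return the n (word, weight) pairs with the largest weights."""
    # Positions sorted by weight, largest first
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    # Position j (0-based) belongs to vocabulary key j + 1 (1-based)
    return [(vocab[j + 1], weights[j]) for j in order[:n]]


# Toy vocabulary and weights (made-up values)
toy_vocab = {1: 'aa', 2: 'click', 3: 'our', 4: 'thank'}
toy_word_weights = [-0.1, 0.47, 0.50, -0.3]

for word, weight in top_words(toy_vocab and toy_word_weights, toy_vocab):
    print('{} ({:0.6f})'.format(word, weight))
```

Looking up vocab[j] instead of vocab[j + 1] would silently report the alphabetical neighbour of each word, so the offset is worth checking against a word whose weight is easy to predict.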