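2 Spam Classification

Many email services today provide spam filters that can classify emails as spam or non-spam with high accuracy. In this part of the exercise, you will build your own spam filter using SVMs.

2.1 Importing modules

Load the modules: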

import matplotlib.pyplot as plt
import numpy as np
import scipy.io as scio
from sklearn import svm

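import processEmail as pe # extracts the keywords from an email
import emailFeatures as ef # turns the keywords into a feature vector

import importlib
importlib.reload(ef) # Reload the module: handy when developing in Jupyter, because edits to an already-imported module are not picked up until it is reloaded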
plt.ion()
np.set_printoptions(formatter={'float': '{: 0.6f}'.format})

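2.2 The processEmail function

This function extracts the words from an email after normalizing the text (lowercasing, stripping HTML, and replacing numbers, URLs, email addresses and dollar signs with placeholder tokens), and then maps each word to its index in the vocabulary list.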

import numpy as np
import re
import nltk, nltk.stem.porter


def process_email(email_contents):
    vocab_list = get_vocab_list()
    word_indices = np.array([], dtype=np.int64)

    # ===================== Preprocess Email =====================
    # Convert the whole email to lowercase
    email_contents = email_contents.lower()
    # Strip HTML tags
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)

    # Replace any run of digits with the string 'number'
    email_contents = re.sub(r'[0-9]+', 'number', email_contents)

    # Replace anything starting with http:// or https:// with 'httpaddr'
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)

    # Strings with "@" in the middle are considered email addresses --> 'emailaddr'
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)

    # Replace '$' signs with 'dollar'
    email_contents = re.sub(r'[$]+', 'dollar', email_contents)

    # ===================== Tokenize Email =====================

    # Output the email
    print('==== Processed Email ====')
    stemmer = nltk.stem.porter.PorterStemmer()

    # print('email contents : {}'.format(email_contents))

    # Split on punctuation and whitespace ('-' is escaped so it is a literal, not a range)
    tokens = re.split(r'[@$/#.\-:&*+=\[\]?!(){},\'">_<;%\s]', email_contents)

    for token in tokens:
        # Remove any non-alphanumeric characters
        token = re.sub('[^a-zA-Z0-9]', '', token)
        # Reduce the word to its stem
        token = stemmer.stem(token)

        if len(token) < 1:
            continue
        print(token)
        for k, v in vocab_list.items():
            if token == v:
                # The word is in the vocabulary, so record its index
                word_indices = np.append(word_indices, k)

    print('==================')
    return word_indices

def get_vocab_list():
    vocab_dict = {}
    with open('vocab.txt') as f:
        for line in f:
            (val, key) = line.split()
            vocab_dict[int(val)] = key

    return vocab_dict
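
The inner loop in process_email scans the whole vocabulary for every token. One possible alternative, sketched here with a hypothetical three-word vocabulary instead of vocab.txt, is to invert the dictionary once and then look each token up directly. The names toy_vocab, word_to_index and lookup_indices are illustrative only and are not part of the original module.

```python
# Hypothetical toy vocabulary in the same {index: word} layout as get_vocab_list()
toy_vocab = {1: 'aa', 2: 'ab', 3: 'abil'}

# Invert it once so each token lookup is a single dictionary access
word_to_index = {word: index for index, word in toy_vocab.items()}


def lookup_indices(tokens, mapping):
    """Return the vocabulary index of every token that is in the mapping."""
    return [mapping[token] for token in tokens if token in mapping]


toy_indices = lookup_indices(['abil', 'zzz', 'aa'], word_to_index)
print(toy_indices)
```

Tokens that are not in the vocabulary ('zzz' here) are simply skipped, which matches what the original loop does.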

Call processEmail to extract the keywords from an email:

# ===================== Part 1: Email Preprocessing =====================

print('Preprocessing sample email (emailSample1.txt) ...')

with open('emailSample1.txt', 'r') as f:
    file_contents = f.read()
word_indices = pe.process_email(file_contents)
Preprocessing sample email (emailSample1.txt) ...
==== Processed Email ====
anyon
know
how
much
it
cost
to
host
a
web
portal
well
it
depend
on
how
mani
visitor
you
re
expect
thi
can
be
anywher
from
less
than
number
buck
a
month
to
a
coupl
of
dollarnumb
you
should
checkout
httpaddr
or
perhap
amazon
ecnumb
if
your
run
someth
big
to
unsubscrib
yourself
from
thi
mail
list
send
an
email
to
emailaddr
==================
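
Tokens such as httpaddr, emailaddr and dollarnumb in the listing above come from the normalization regexes that run before tokenizing. The following is a minimal standalone sketch of that step only (no tokenizing or stemming). The function name normalize and the sample string are hypothetical.

```python
import re


def normalize(text):
    """Apply the same placeholder substitutions as process_email, in the same order."""
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)                     # strip HTML tags
    text = re.sub(r'[0-9]+', 'number', text)                  # digits -> 'number'
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text) # URLs -> 'httpaddr'
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)        # emails -> 'emailaddr'
    text = re.sub(r'[$]+', 'dollar', text)                    # '$' -> 'dollar'
    return text


sample = 'Visit <b>http://example.com/</b> or mail bob@example.com to save $100'
normalized_words = normalize(sample).split()
print(normalized_words)
```

Because digits are replaced before the dollar sign, "$100" becomes "dollarnumber", which the stemmer then shortens to "dollarnumb".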

Display the vocabulary indices of the words that were successfully matched:

# Print stats
print('Word Indices: ')
print(word_indices)
Word Indices: 
[  86  916  794 1077  883  370 1699  790 1822 1831  883  431 1171  794
 1002 1893 1364  592 1676  238  162   89  688  945 1663 1120 1062 1699
  375 1162  479 1893 1510  799 1182 1237  810 1895 1440 1547  181 1699
 1758 1896  688 1676  992  961 1477   71  530 1699  531]

Convert the extracted words into a feature vector:

# ===================== Part 2: Feature Extraction =====================

print('Extracting Features from sample email (emailSample1.txt) ... ')

# Extract features
features = ef.email_features(word_indices)

# Print stats
print('Length of feature vector: {}'.format(features.size))
print('Number of non-zero entries: {}'.format(np.flatnonzero(features).size))# np.sum(features)
Extracting Features from sample email (emailSample1.txt) ... 
Length of feature vector: 1900
Number of non-zero entries: 45
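
The source of emailFeatures is not shown here, so the following is only a sketch of what such a function might look like, not the author's implementation. It assumes a binary bag-of-words vector with 1900 slots in which word index i sets slot i, which would leave slot 0 unused if the vocabulary indices start at 1. The name email_features_sketch and the parameter n_slots are hypothetical.

```python
import numpy as np


def email_features_sketch(word_indices, n_slots=1900):
    """Binary bag-of-words: slot i is 1 if vocabulary index i occurs in the email."""
    features = np.zeros(n_slots)
    # Repeated indices set the same slot, so duplicates count only once
    features[np.asarray(word_indices, dtype=np.int64)] = 1
    return features


toy_features = email_features_sketch([86, 916, 86, 1699])
print(toy_features.size, np.flatnonzero(toy_features).size)
```

Under that assumption, a vector like this would need its unused first slot dropped (for example features[1:]) before it lines up with the 1899 columns of the training matrix used below.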

2.3 Training a linear SVM spam classifier

With C = 0.1, the accuracy on the training set reaches 99.825%.
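
For context, C is the penalty weight in the standard soft-margin SVM objective. With a linear kernel and the labels mapped to $y_i \in \{-1, +1\}$, it can be written as follows (the symbols $w$, $b$ and $\xi_i$ are the usual weight vector, bias and slack variables, and do not appear in the code):

```latex
\min_{w,\,b,\,\xi} \;\; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{m} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 .
```

A small value such as C = 0.1 tolerates more margin violations in exchange for a wider margin. A large C penalizes every violation heavily and fits the training set more tightly.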

# ===================== Part 3: Train Linear SVM for Spam Classification =====================

# Load the Spam Email dataset
# You will have X, y in your environment
data = scio.loadmat('spamTrain.mat')
X = data['X']
y = data['y'].flatten()

print('X.shape: ', X.shape, '\ny.shape: ', y.shape)

print('Training Linear SVM (Spam Classification)')
print('(this may take 1 to 2 minutes)')

c = 0.1
clf = svm.SVC(C=c, kernel='linear')  # C must be passed by keyword in current scikit-learn
clf.fit(X, y)

p = clf.predict(X)
print('Training Accuracy: {}'.format(np.mean(p == y) * 100))
X.shape:  (4000, 1899) 
y.shape:  (4000,)
Training Linear SVM (Spam Classification)
(this may take 1 to 2 minutes)
Training Accuracy: 99.825

2.4 Evaluating the trained linear SVM on the test set

The accuracy on the test set is 98.9%, which is quite good.

# ===================== Part 4: Test Spam Classification =====================
# After training the classifier, we can evaluate it on a test set. We have
# included a test set in spamTest.mat

# Load the test dataset
data = scio.loadmat('spamTest.mat')
Xtest = data['Xtest']
ytest = data['ytest'].flatten()

print('Xtest.shape: ', Xtest.shape, '\nytest.shape: ', ytest.shape)

print('Evaluating the trained linear SVM on a test set ...')

p = clf.predict(Xtest)

print('Test Accuracy: {}'.format(np.mean(p == ytest) * 100))
Xtest.shape:  (1000, 1899) 
ytest.shape:  (1000,)
Evaluating the trained linear SVM on a test set ...
Test Accuracy: 98.9
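
Accuracy alone does not show which kind of mistake the filter makes. As a sketch, the same comparison can be split into false positives (legitimate mail flagged as spam) and false negatives (spam that got through). It assumes that label 1 means spam and 0 means non-spam. The toy arrays and the name spam_error_breakdown are hypothetical, not taken from spamTest.mat.

```python
import numpy as np


def spam_error_breakdown(y_true, y_pred):
    """Return (accuracy in percent, false positives, false negatives)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    false_pos = int(np.sum((y_pred == 1) & (y_true == 0)))  # ham flagged as spam
    false_neg = int(np.sum((y_pred == 0) & (y_true == 1)))  # spam that was missed
    accuracy = float(np.mean(y_pred == y_true) * 100)
    return accuracy, false_pos, false_neg


toy_true = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
toy_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0, 1, 0])
toy_accuracy, toy_fp, toy_fn = spam_error_breakdown(toy_true, toy_pred)
print(toy_accuracy, toy_fp, toy_fn)
```

With the real data, the call would be spam_error_breakdown(ytest, p).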

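2.5 Which words are the strongest indicators of spam

Because the model we trained is a linear SVM, we can inspect the weights w it learned to better understand how it decides whether an email is spam. The code below finds the words with the largest weights in the classifier. Informally, the classifier "thinks" these words are the most likely indicators of spam.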

vocab_list = pe.get_vocab_list()
weights = clf.coef_.flatten()
indices = np.argsort(weights)[::-1]
print(indices)

for i in range(15):
    # Columns of coef_ are 0-based, while the keys of vocab_list start at 1
    print('{} ({:0.6f})'.format(vocab_list[indices[i] + 1], weights[indices[i]]))
[1190  297 1397 ... 1764 1665 1560]
our (0.500614)
click (0.465916)
remov (0.422869)
guarante (0.383622)
visit (0.367710)
basenumb (0.345064)
dollar (0.323632)
will (0.269724)
price (0.267298)
pleas (0.261169)
most (0.257298)
nbsp (0.253941)
lo (0.253467)
ga (0.248297)
hour (0.246404)
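
To make the link between weights and words concrete, here is a small self-contained sketch on made-up data. It assumes scikit-learn is installed, that label 1 means spam, and that vocabulary keys start at 1 as they do in vocab.txt, so column j of coef_ corresponds to key j + 1. The four-word vocabulary and the six toy emails are hypothetical.

```python
import numpy as np
from sklearn import svm

# Hypothetical 4-word vocabulary with 1-based keys, mimicking vocab.txt
toy_vocab = {1: 'click', 2: 'the', 3: 'and', 4: 'meet'}

# Rows are emails, columns are word-presence flags (column j <-> toy_vocab[j + 1])
X_toy = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])
y_toy = np.array([1, 1, 1, 0, 0, 0])  # assumed labels: 1 = spam, 0 = non-spam

toy_clf = svm.SVC(C=0.1, kernel='linear')
toy_clf.fit(X_toy, y_toy)

# coef_ has shape (1, n_features) for a binary linear SVC
toy_weights = toy_clf.coef_.flatten()
order = np.argsort(toy_weights)[::-1]  # largest weight first

top_word = toy_vocab[int(order[0]) + 1]      # strongest spam indicator
bottom_word = toy_vocab[int(order[-1]) + 1]  # strongest non-spam indicator
print(top_word, bottom_word)
```

Only 'click' appears exclusively in the spam rows and only 'meet' exclusively in the non-spam rows, so they end up at opposite ends of the ranking. The two words shared by both classes get weights near zero.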