I. Programming environment
Windows 10
Python 3.6
Jupyter Notebook
Graphviz (for an introduction and installation instructions, see https://www.jianshu.com/p/b559dc689b7f)
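II. Data source
http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html
Copy the table on this page into a CSV file and name it dataset_uncleaned.csv (the code below reads it from the Scraped-Data/ directory).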
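III. Cleaning the data
1 Put each disease and its symptoms into a dictionary, with the disease as the key and its list of symptoms as the value.
Note that some disease and symptom cells contain the special character '^'; replace it with '_' first, then split.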
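import csv
from collections import defaultdict

# Cells treated as empty. "\xc2\xa0" is the Python 2 byte form of a non-breaking space;
# "\xa0" is assumed to be how the same cell reads in Python 3 when the file decodes as UTF-8.
EMPTY_CELLS = ("", "\xa0", "\xc2\xa0")

def return_list(disease):
    # After replacing '^' with '_' and splitting on '_', every second item is kept as a name.
    disease_list = []
    match = disease.replace('^', '_').split('_')
    ctr = 1
    for group in match:
        if ctr % 2 == 0:
            disease_list.append(group)
        ctr = ctr + 1
    return disease_list

with open("Scraped-Data/dataset_uncleaned.csv") as csvfile:
    reader = csv.reader(csvfile)
    disease = ""
    weight = 0
    disease_list = []
    dict_wt = {}                # disease -> sample count
    dict_ = defaultdict(list)   # disease -> list of symptoms
    for row in reader:
        # A non-empty first column starts a new disease; the rows below it only add symptoms.
        if row[0] not in EMPTY_CELLS:
            disease = row[0]
            disease_list = return_list(disease)
            weight = row[1]
        if row[2] not in EMPTY_CELLS:
            symptom_list = return_list(row[2])
            for d in disease_list:
                for s in symptom_list:
                    dict_[d].append(s)
                dict_wt[d] = weight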
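To make the splitting step concrete, here is a minimal sketch of the same idea written with slicing. The function name split_names and the sample cell are made up for illustration; the sketch assumes a cell alternates code and name, joined by '_' and separated by '^'.

```python
def split_names(cell):
    """Return the names from a cell assumed to look like 'code_name^code_name'."""
    parts = cell.replace("^", "_").split("_")
    # parts alternates code, name, code, name, ...; keep positions 1, 3, 5, ...
    return parts[1::2]

# Made-up cell in the assumed shape (not copied from the dataset)
cell = "UMLS:C0000001_pain chest^UMLS:C0000002_shortness of breath"
names = split_names(cell)
print(names)
```

A name that itself contains '_' would be split incorrectly by this approach, so it only works while names are free of underscores.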
2 Write each disease, symptom and sample count to dataset_clean.csv. Note that each disease has one sample count and several symptoms.
with open("Scraped-Data/dataset_clean.csv", "w") as csvfile:
    writer = csv.writer(csvfile)
    for key, values in dict_.items():
        for v in values:
            # One row per (disease, symptom) pair, followed by the disease's sample count.
            # In Python 3 the keys are already str, so no encode/decode step is needed.
            writer.writerow([key, v, dict_wt[key]])
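Note: at this point every row in the CSV is followed by a blank line. On Windows this happens because the file was opened without newline=''. It does not need fixing here; a later step takes care of it.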
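As an alternative to cleaning up afterwards, the blank lines can be avoided when writing. The sketch below follows the csv module's recommendation to open the file with newline=""; the rows and the file name demo_clean.csv are made up for illustration.

```python
import csv
import os
import tempfile

# Made-up rows in the disease, symptom, sample-count layout
rows = [["disease a", "symptom x", "10"], ["disease a", "symptom y", "10"]]

path = os.path.join(tempfile.mkdtemp(), "demo_clean.csv")  # hypothetical file name

# newline="" lets the csv module control line endings itself,
# so no extra carriage return is added on Windows.
with open(path, "w", newline="") as csvfile:
    csv.writer(csvfile).writerows(rows)

with open(path, "rb") as f:
    raw = f.read()

with open(path, newline="") as csvfile:
    read_back = list(csv.reader(csvfile))

print(read_back)
```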
3 Add a column header to each column of dataset_clean.csv
import pandas as pd

columns = ['Source', 'Target', 'Weight']
# names= supplies the column headers, since the file has none yet.
data = pd.read_csv("Scraped-Data/dataset_clean.csv", names=columns, encoding="ISO-8859-1")
data.head()
data.to_csv("Scraped-Data/dataset_clean.csv", index=False)
The blank line under each row is now gone, because pd.read_csv skips blank lines by default.
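The effect of this step can be checked on a small in-memory example. The content below is made up; it only mimics the layout of a headerless file with a blank line after each row.

```python
import io
import pandas as pd

# Made-up content: no header row, and a blank line after each data row
raw = "disease a,symptom x,10\n\ndisease a,symptom y,10\n\n"

demo = pd.read_csv(io.StringIO(raw), names=["Source", "Target", "Weight"])
print(demo)

# Writing it back produces a header row and no blank lines
print(demo.to_csv(index=False))
```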
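4 Label the data and save it to nodetable.csv
The data has three columns. The first column, Id, is a disease or symptom name; the second column, Label, is identical to Id; the third column, Attribute, states whether the entry is a disease or a symptom.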
slist = []   # symptoms already written
dlist = []   # diseases already written
with open("Scraped-Data/nodetable.csv", "w") as csvfile:
    writer = csv.writer(csvfile)
    for key, values in dict_.items():
        for v in values:
            if v not in slist:
                writer.writerow([v, v, "symptom"])
                slist.append(v)
        if key not in dlist:
            writer.writerow([key, key, "disease"])
            dlist.append(key)

nt_columns = ['Id', 'Label', 'Attribute']
nt_data = pd.read_csv("Scraped-Data/nodetable.csv", names=nt_columns, encoding="ISO-8859-1")
nt_data.head()
nt_data.to_csv("Scraped-Data/nodetable.csv", index=False)
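IV. Analyzing the cleaned data
data = pd.read_csv("Scraped-Data/dataset_clean.csv", encoding="ISO-8859-1")
len(data['Source'].unique())   # number of distinct diseases
len(data['Target'].unique())   # number of distinct symptoms
df = pd.DataFrame(data)
df_1 = pd.get_dummies(df.Target)   # one indicator column per symptom
df_1
df
df_s = df['Source']
df_pivoted = pd.concat([df_s, df_1], axis=1)
df_pivoted.drop_duplicates(keep='first', inplace=True)
df_pivoted
len(df_pivoted)
# Keep only the symptom columns as features; the first column is 'Source', the label.
cols = df_pivoted.columns[1:]
print(cols)
# Collapse the rows of each disease into a single row of symptom indicators.
df_pivoted = df_pivoted.groupby('Source').sum()
df_pivoted = df_pivoted.reset_index()
df_pivoted
len(df_pivoted)
df_pivoted.to_csv("Scraped-Data/df_pivoted.csv")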
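This code mainly explores the data, for example how many distinct diseases and how many distinct symptoms there are. For each disease, the symptoms it has are marked 1 and all other symptoms are marked 0; the merged result is saved to df_pivoted.csv.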
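The pivot described above can be seen on a tiny example. The disease and symptom names below are made up; the sketch uses the same get_dummies, concat and groupby steps, with an explicit cast to int so the result is 0/1 regardless of pandas version.

```python
import pandas as pd

# Made-up disease/symptom pairs in the Source/Target layout
df = pd.DataFrame({
    "Source": ["flu", "flu", "cold", "cold"],
    "Target": ["fever", "cough", "cough", "sneezing"],
})

# One indicator column per symptom, one row per (disease, symptom) pair
indicators = pd.get_dummies(df["Target"]).astype(int)

# Collapse the pairs into one row per disease
pivoted = (
    pd.concat([df["Source"], indicators], axis=1)
    .groupby("Source")
    .sum()
    .reset_index()
)
print(pivoted)
```

Each row of the result is one disease, and each remaining column says whether that disease has the symptom.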
V. Training a model with Naive Bayes
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.naive_bayes import MultinomialNB
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead.
from sklearn.model_selection import train_test_split

x = df_pivoted[cols]        # features
y = df_pivoted['Source']    # label: the disease name
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
mnb = MultinomialNB()
mnb = mnb.fit(x_train, y_train)
mnb.score(x_test, y_test)
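The score is 0, which means the model has no predictive power on the test set.
This is because each of the 149 rows corresponds to a different disease, so every disease in the held-out third never appears in the training set, and the algorithm cannot predict a disease it has never seen.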
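The effect can be reproduced on made-up data where every class has exactly one sample. The identity matrix and the disease_N labels below are illustrative only and are not taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up data: 9 classes with one sample each
X = np.eye(9, dtype=int)
y = np.array(["disease_%d" % i for i in range(9)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

model = MultinomialNB().fit(X_train, y_train)
score = model.score(X_test, y_test)

# Every test label is absent from the training labels, so no prediction can match
unseen = set(y_test) - set(y_train)
print(score, len(unseen) == len(y_test))
```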
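Instead, train on all of the data and predict on all of the data. Note that the resulting score measures how well the model fits the data it was trained on, not how well it generalizes.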
mnb_tot = MultinomialNB()
mnb_tot = mnb_tot.fit(x, y)
mnb_tot.score(x, y)
The score is 0.8993288590604027.
Print the diseases that were predicted incorrectly:
disease_pred = mnb_tot.predict(x)
disease_real = y.values
for i in range(0, len(disease_real)):
    if disease_pred[i] != disease_real[i]:
        print('Pred: {0} Actual:{1}'.format(disease_pred[i].ljust(30), disease_real[i]))
Output:
Pred: HIV Actual:acquired immuno-deficiency syndrome
Pred: biliary calculus Actual:cholelithiasis
Pred: coronary arteriosclerosis Actual:coronary heart disease
Pred: depression mental Actual:depressive disorder
Pred: HIV Actual:hiv infections
Pred: carcinoma breast Actual:malignant neoplasm of breast
Pred: carcinoma of lung Actual:malignant neoplasm of lung
Pred: carcinoma prostate Actual:malignant neoplasm of prostate
Pred: carcinoma colon Actual:malignant tumor of colon
Pred: candidiasis Actual:oralcandidiasis
Pred: effusion pericardial Actual:pericardial effusion body substance
Pred: malignant neoplasms Actual:primary malignant neoplasm
Pred: sepsis (invertebrate) Actual:septicemia
Pred: sepsis (invertebrate) Actual:systemic infection
Pred: tonic-clonic epilepsy Actual:tonic-clonic seizures
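Many of the pairs above look like two names for the same condition. One possible explanation, which is an assumption and not something the original program states, is that such diseases end up with identical symptom rows, in which case no classifier can tell them apart. The sketch below lists groups of names that share the same symptom vector; the table is made up for illustration.

```python
import pandas as pd

# Made-up pivoted table: "name a" and "name b" have identical symptom rows
pivoted = pd.DataFrame({
    "Source": ["name a", "name b", "name c"],
    "fever":  [1, 1, 0],
    "cough":  [1, 1, 1],
})

symptom_cols = [c for c in pivoted.columns if c != "Source"]

# Group disease names by their full symptom vector
groups = pivoted.groupby(symptom_cols)["Source"].apply(list)

# Groups with more than one name cannot be separated by these features
ambiguous = [names for names in groups if len(names) > 1]
print(ambiguous)
```

Running the same grouping on df_pivoted would show whether the mispredicted diseases fall into such groups.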
VI. Training a model with a decision tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
dt = DecisionTreeClassifier()
clf_dt = dt.fit(x, y)
print("Accuracy: ", clf_dt.score(x, y))
The score is 0.8993288590604027, the same as the Naive Bayes result above.
Next, visualize the nodes of the decision tree.
1 Generate tree.dot
from sklearn import tree
from sklearn.tree import export_graphviz
export_graphviz(dt,
                out_file='DOT-files/tree.dot',
                feature_names=cols)
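The generated tree.dot file appears in the DOT-files directory under the project directory.
Open a cmd terminal, change into the directory that contains tree.dot (DOT-files/), and run the Graphviz dot command, for example:
dot -Tpng tree.dot -o tree.png
This produces tree.png.
If tree.dot is too large, however, dot may fail with an out-of-memory error: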
dot: failure to create cairo surface: out of memory
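One possible way to keep the DOT file small is to export only the top levels of the tree with the max_depth parameter of export_graphviz. Whether this is enough to avoid the error above is an assumption. The data and the symptom_N feature names in this sketch are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Made-up data: 16 classes with one sample each, so the fitted tree is deep
X = np.eye(16, dtype=int)
y = np.array(["disease_%d" % i for i in range(16)])
feature_names = ["symptom_%d" % i for i in range(16)]

dt_demo = DecisionTreeClassifier(random_state=0).fit(X, y)

# out_file=None returns the DOT source as a string instead of writing a file
full_dot = export_graphviz(dt_demo, out_file=None, feature_names=feature_names)

# max_depth limits how many levels are drawn; the fitted tree itself is unchanged
small_dot = export_graphviz(
    dt_demo, out_file=None, feature_names=feature_names, max_depth=2
)

print(len(full_dot), len(small_dot))
```

The truncated export only shows the first splits, so it trades completeness for a picture that is easier to render.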
2 Display tree.png in Jupyter Notebook
from IPython.display import Image
Image(filename='tree.png')
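VII. Copyright notice
The program comes from https://github.com/Aniruddha-Tapas/Predicting-Diseases-From-Symptoms
The author of this post only studied, analyzed and recorded it here; the copyright belongs to the original author.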