Decision Tree Classification
- Importing packages
- Dataset
- Information entropy
- Computing information entropy
- Splitting the data
- Finding the split that minimizes the entropy
- Building the full tree
- Full code
Importing packages
import pandas as pd
import numpy as np
# trees is a .py module I wrote myself, placed in the same directory; its code is listed later
import trees
from math import log
import operator
Dataset
file.txt
No. no surfacing flippers fish
1 L1 R1 yes
2 L1 R1 yes
3 L1 R0 no
4 L0 R1 no
5 L0 R1 no
6 L0 R1 yes
7 L0 R1 why
8 L2 R2 yes
9 L0 R0 why
The exact data here doesn't really matter. I recommend creating a new txt file; to make the later print output easier to follow, each column uses its own distinctive markers. Name it file.txt. You can change the data at any time to observe what each step of the code means.
Information entropy
There's no way around it: even though I only want to get something working, without a basic grasp of this part it is essentially impossible to understand what the code means. In short, information entropy measures how disordered a set of outcomes is. An event that is certain to happen is the purest, cleanest case and has an entropy of 0. Conversely, the more possible outcomes and the more uncertainty there is, the larger the entropy (entropy can be greater than 1). Entropy reaches its maximum when all outcomes are equally likely. For more, look up "Shannon entropy" on Baidu Baike; it is quite interesting.
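Concretely, if the i-th distinct label in a column appears with proportion p_i, the Shannon entropy (in bits) is
H = -Σ p_i · log2(p_i)
For example, the list [1,1,1,2,2,2] has two labels, each with p = 0.5, so H = -(0.5·log2(0.5) + 0.5·log2(0.5)) = 1.0 bit; a list containing only one label gives H = 0.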
Computing information entropy
Code in trees.py
def calcShannonEnt(dataSet):
    # accept a list / ndarray / Series / DataFrame and normalise it to a DataFrame
    dataSet = pd.DataFrame(dataSet)
    n = dataSet.shape[0]
    # count how often each label appears in the last column
    iset = dataSet.iloc[:, -1].value_counts()
    p = iset / n
    # Shannon entropy: -sum(p * log2(p))
    ent = (-p * np.log2(p)).sum()
    return ent
The code is short: it takes a list, ndarray or DataFrame and returns the entropy of the data (for multi-column data it only returns the entropy of the last column).
Test code:
print(trees.calcShannonEnt([1,1,1,2,2,2]))
print(trees.calcShannonEnt(np.array([1,1,2,2,2,2])))
data_file = pd.read_csv('file.txt', sep='\t')
print(trees.calcShannonEnt(data_file))
print(trees.calcShannonEnt(data_file['flippers']))
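With file.txt exactly as listed above, the four prints should come out roughly as:
1.0
0.9182958340544896
1.5305 (entropy of the last column, 'fish'; pandas will print more digits)
1.2244 (entropy of the 'flippers' column)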
Splitting the data
Code in trees.py
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]             # chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
Parameter 1: the dataset (<class 'list'>)
Parameter 2: the position of the column to split on (0 is the first column)
Parameter 3: the value to split on; only rows whose value in that column equals this are kept
Return value: the result of the split (<class 'list'>), i.e. the matching rows with the split column removed
Test code:
data_file = pd.read_csv('file.txt', sep='\t')
data_file = data_file.iloc[:,1:]
a = data_file.values
b = a.tolist()
print(trees.splitDataSet(b, 0, 'L1'))
print(trees.splitDataSet(b, 1, 'R1'))
print(trees.splitDataSet(b, 2, "no"))
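With the data above, the three prints should look roughly like this (only rows whose value in the chosen column matches are kept, with that column removed):
[['R1', 'yes'], ['R1', 'yes'], ['R0', 'no']]
[['L1', 'yes'], ['L1', 'yes'], ['L0', 'no'], ['L0', 'no'], ['L0', 'yes'], ['L0', 'why']]
[['L1', 'R0'], ['L0', 'R1'], ['L0', 'R1']]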
Finding the split that minimizes the entropy
A supplementary note. Material I found later puts it this way: "The ID3 algorithm uses information gain: after I add this feature, by how much does my entropy decrease? The more the uncertainty decreases, the more information we gain, so we pick the feature with the largest information gain as the node."
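For reference, this is the formula the loop below implements: if feature A splits dataset D into subsets D_v (one per distinct value v), the information gain is
Gain(D, A) = H(D) − Σ_v (|D_v| / |D|) · H(D_v)
where H is the Shannon entropy of the label column. chooseBestFeatureToSplit computes this for every feature and returns the index of the one with the largest gain.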
Code in trees.py
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      # the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):           # iterate over all the features
        featList = [example[i] for example in dataSet]   # create a list of all the examples of this feature
        uniqueVals = set(featList)         # get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            # In real use you can comment out these prints; the three prints here
            # are only meant to show what the function is doing.
            print(subDataSet)
            print(calcShannonEnt(subDataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
            print()
        infoGain = baseEntropy - newEntropy      # calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):            # compare this to the best gain so far
            bestInfoGain = infoGain              # if better than current best, set to best
            bestFeature = i
    return bestFeature                           # returns an integer
There is quite a lot of code here, so I copied it straight from the book. It builds on the two functions above, calcShannonEnt and splitDataSet. Its job is to find the split criterion that leaves the target column with the lowest (weighted) entropy, i.e. the split that makes the data look cleanest after the first pass. Let's demonstrate with code.
Test code:
The chooseBestFeatureToSplit above already contains the print statements, so just run:
data_file = pd.read_csv('file.txt', sep='\t')
data_file = data_file.iloc[:,1:]
print(data_file)
a = data_file.values
b = a.tolist()
best = trees.chooseBestFeatureToSplit(b)
print(best)
Output:
no surfacing flippers fish
0 L1 R1 yes
1 L1 R1 yes
2 L1 R0 no
3 L0 R1 no
4 L0 R1 no
5 L0 R1 yes
6 L0 R1 why
7 L2 R2 yes
8 L0 R0 why
[['R2', 'yes']]
0.0
[['R1', 'no'], ['R1', 'no'], ['R1', 'yes'], ['R1', 'why'], ['R0', 'why']]
1.5219280948873621
[['R1', 'yes'], ['R1', 'yes'], ['R0', 'no']]
0.9182958340544896
[['L1', 'yes'], ['L1', 'yes'], ['L0', 'no'], ['L0', 'no'], ['L0', 'yes'], ['L0', 'why']]
1.4591479170272448
[['L1', 'no'], ['L0', 'why']]
1.0
[['L2', 'yes']]
0.0
0
Let's use the test output to explain what this function does. First, splitting on the first column, 'no surfacing', divides the data into:
[['R2', 'yes']]
0.0
[['R1', 'no'], ['R1', 'no'], ['R1', 'yes'], ['R1', 'why'], ['R0', 'why']]
1.5219280948873621
[['R1', 'yes'], ['R1', 'yes'], ['R0', 'no']]
0.9182958340544896
Three groups in total. The entropy shown for each group is that of the last column, the 'fish' column; the three entropies are 0.0, 1.5219280948873621 and 0.9182958340544896, and their sum is 2.440223928941852.
Then, splitting on the second column, 'flippers', also divides the data into three groups:
[['L1', 'yes'], ['L1', 'yes'], ['L0', 'no'], ['L0', 'no'], ['L0', 'yes'], ['L0', 'why']]
1.4591479170272448
[['L1', 'no'], ['L0', 'why']]
1.0
[['L2', 'yes']]
0.0
The sum of these three entropies is 2.4591479170272448, slightly larger than for the first split, so splitting on the first column ('no surfacing') leaves the data more ordered. That is why the function finally prints 0, telling us the first column is best (0 means the first column).
You can also change the data in file.txt however you like and rerun the test to see exactly what this function does.
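One note on what the code actually computes: it does not add the raw entropies together, but weights each subset's entropy by its share of the rows (prob = len(subDataSet)/float(len(dataSet))) and subtracts that weighted sum from the base entropy to get the information gain. With the numbers printed above (assuming file.txt is exactly as listed), the base entropy of 'fish' is about 1.530; splitting on 'no surfacing' gives a weighted entropy of about 1/9·0 + 3/9·0.918 + 5/9·1.522 ≈ 1.152 (gain ≈ 0.379), while splitting on 'flippers' gives about 1/9·0 + 2/9·1.0 + 6/9·1.459 ≈ 1.195 (gain ≈ 0.336). The ranking is the same as with the plain sums, so column 0 still wins.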
Building the full tree
Code in trees.py
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]          # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:         # stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]        # copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
Time, energy (and ability) are limited, so I won't tear my hair out explaining these line by line. Briefly: majorityCnt just returns the most common class label (a majority vote), and createTree builds the tree recursively, stopping either when all remaining samples share the same class or when no features are left (in which case it falls back to the majority vote). Straight to the test code.
data_file = pd.read_csv('file.txt', sep='\t')
data_file = data_file.iloc[:,1:]
a = data_file.values
b = a.tolist()
label = list(data_file.keys()[:-1])
print(b)
print(label)
mytree = trees.createTree(b, label)
print(mytree)
The final code is simple; the output is:
no_surfacing flippers fish
0 L1 R1 yes
1 L1 R1 yes
2 L1 R0 no
3 L0 R1 no
4 L0 R1 no
5 L0 R1 yes
6 L0 R1 why
7 L2 R2 yes
8 L0 R0 why
[['L1', 'R1', 'yes'], ['L1', 'R1', 'yes'], ['L1', 'R0', 'no'], ['L0', 'R1', 'no'], ['L0', 'R1', 'no'], ['L0', 'R1', 'yes'], ['L0', 'R1', 'why'], ['L2', 'R2', 'yes'], ['L0', 'R0', 'why']]
['no_surfacing', 'flippers']
{'no_surfacing': {'L1': {'flippers': {'R0': 'no', 'R1': 'yes'}}, 'L0': {'flippers': {'R0': 'why', 'R1': 'no'}}, 'L2': 'yes'}}
The createTree() function is simple to call: the second argument is ['no_surfacing', 'flippers'], i.e. the classification labels (a list), which are the column names from the first row (excluding the result column at the end); the first argument is the whole dataset (a list).
The output is not very readable, so let's format it.
{'no_surfacing': {'L1': {'flippers': {'R0': 'no', 'R1': 'yes'}}, 'L0': {'flippers': {'R0': 'why', 'R1': 'no'}}, 'L2': 'yes'}}
Just paste it into the SoJson website and click format.
{
    'no_surfacing': {
        'L1': {
            'flippers': {
                'R0': 'no',
                'R1': 'yes'
            }
        },
        'L0': {
            'flippers': {
                'R0': 'why',
                'R1': 'no'
            }
        },
        'L2': 'yes'
    }
}
Now it's easy to read: the first split on 'no_surfacing' gives three branches. The L2 branch immediately yields a result, while the L0 and L1 branches are each split once more, which also gives their classification results.
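As a quick extra (this is my own minimal sketch, not part of the book's trees.py; the classify name and its arguments are hypothetical), a nested dict like this can be used to classify a new sample by walking it from the root:
def classify(tree, featLabels, sample):
    # tree: nested dict produced by createTree
    # featLabels: feature names in column order, e.g. ['no_surfacing', 'flippers']
    # sample: one feature vector, e.g. ['L1', 'R0']
    if not isinstance(tree, dict):              # reached a leaf: it is the class label
        return tree
    feature = next(iter(tree))                  # the feature this node splits on
    value = sample[featLabels.index(feature)]   # the sample's value for that feature
    return classify(tree[feature][value], featLabels, sample)
Note that createTree mutates the labels list it is given (del(labels[bestFeat])), so pass a fresh list here, e.g.:
print(classify(mytree, ['no_surfacing', 'flippers'], ['L1', 'R0']))   # should print 'no'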
Full code
trees.py
'''
Created on Oct 12, 2010
Decision Tree Source Code for Machine Learning in Action Ch. 3
@author: Peter Harrington
'''
from math import log
import operator
import numpy as np
import pandas as pd
def calcShannonEnt(dataSet):
    dataSet = pd.DataFrame(dataSet)
    n = dataSet.shape[0]
    iset = dataSet.iloc[:, -1].value_counts()
    p = iset / n
    ent = (-p * np.log2(p)).sum()
    return ent
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]             # chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      # the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):           # iterate over all the features
        featList = [example[i] for example in dataSet]   # create a list of all the examples of this feature
        uniqueVals = set(featList)         # get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy      # calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):            # compare this to the best gain so far
            bestInfoGain = infoGain              # if better than current best, set to best
            bestFeature = i
    return bestFeature                           # returns an integer
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]          # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:         # stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]        # copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
test.py
import pandas as pd
import numpy as np
import trees
from math import log
data_file = pd.read_csv('file.txt', sep='\t')
data_file = data_file.iloc[:,1:]
a = data_file.values
b = a.tolist()
label = list(data_file.keys()[:-1])
print(b)
print(label)
mytree = trees.createTree(b, label)
print(mytree)