Decision Tree Classification

  • Importing packages
  • Dataset
  • Information entropy
  • Computing information entropy
  • Splitting the data
  • Finding the split that minimizes entropy
  • Building the complete tree
  • Full code

Importing Packages

import pandas as pd
import numpy as np
# trees is a .py file we write ourselves, kept in the same directory; its code appears below
import trees
from math import log
import operator

Dataset

file.txt

No.	no_surfacing	flippers	fish
1	L1	R1	yes
2	L1	R1	yes
3	L1	R0	no
4	L0	R1	no
5	L0	R1	no
6	L0	R1	yes
7	L0	R1	why
8	L2	R2	yes
9	L0	R0	why

The exact data doesn't matter; I suggest creating a fresh txt file named file.txt. To make later print output easier to follow, every column uses distinctive markers, and you can change the data at any time to watch what each step of the code does.

Information Entropy

There's no way around it: I only wanted to get something working, but without a grasp of this part the code is basically impossible to follow. In short, information entropy measures disorder. An event that is certain to happen is the purest, cleanest case: its entropy is 0. Conversely, the more possible outcomes something has, the more uncertain it is and the larger its entropy (entropy can exceed 1). Entropy reaches its maximum when all outcomes are equally likely. For more, look up "Shannon entropy" (for example on Baidu Baike); it's quite interesting.
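Concretely, for a distribution with probabilities p1, ..., pn the Shannon entropy is H = -(p1·log2(p1) + ... + pn·log2(pn)). A minimal standalone sketch (independent of trees.py) that checks the claims above:

from math import log2

def entropy(probs):
    # H = -sum(p * log2(p)); outcomes with p == 0 contribute nothing
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1.0]))         # a certain event -> 0.0
print(entropy([0.5, 0.5]))    # two equally likely outcomes -> 1.0
print(entropy([0.25] * 4))    # four equally likely outcomes -> 2.0 (entropy can exceed 1)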

Computing Information Entropy

Code in trees.py:

def calcShannonEnt(dataSet):
    dataSet = pd.DataFrame(dataSet)              # accepts a list, array or DataFrame
    n = dataSet.shape[0]                         # number of rows
    iset = dataSet.iloc[:, -1].value_counts()    # class counts of the last column
    p = iset / n                                 # class probabilities
    ent = (-p * np.log2(p)).sum()                # H = -sum(p * log2(p))
    return ent

The code is short: it accepts a list, an array or a DataFrame and returns the entropy of the data (for multi-column data it returns the entropy of the last column only).

Test code:

print(trees.calcShannonEnt([1,1,1,2,2,2]))
print(trees.calcShannonEnt(np.array([1,1,2,2,2,2])))
data_file = pd.read_csv('file.txt', sep='\t')
print(trees.calcShannonEnt(data_file))
print(trees.calcShannonEnt(data_file['flippers']))
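With the file.txt above, these four calls should print roughly 1.0, 0.9183, 1.5305 and 1.2244 (hand-computed from the class counts; the last two will of course change if you edit the data).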

Splitting the Data

Code in trees.py:

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

Parameter 1: the dataset (<class 'list'>)
Parameter 2: the index of the column to split on (0 is the first column)
Parameter 3: the value to split on, i.e. keep only rows whose column equals this value
Return value: the matching rows, with the split column removed (<class 'list'>)

Test code:

data_file = pd.read_csv('file.txt', sep='\t')
data_file = data_file.iloc[:,1:]
a = data_file.values
b = a.tolist()
print(trees.splitDataSet(b, 0, 'L1'))
print(trees.splitDataSet(b, 1, 'R1'))
print(trees.splitDataSet(b, 2, "no"))
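With the data above, the first call returns [['R1', 'yes'], ['R1', 'yes'], ['R0', 'no']]: the three rows whose first column is 'L1', with that column stripped out. The other two calls work the same way.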

Finding the Split that Minimizes Entropy

A supplementary note. Material I found later puts it this way: "The ID3 algorithm uses information gain: once we add a feature, how much does the entropy drop? The more the uncertainty decreases, the more information we have gained, so we pick the feature with the largest information gain as the node." (There is a small sketch of this computation after the walkthrough below.)
Code in trees.py:

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):        #iterate over all the features
        featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
        uniqueVals = set(featList)       #get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            # In real use you can comment out these prints; the three print calls here are only to observe what the function does.
            print(subDataSet)
            print(calcShannonEnt(subDataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        print()
        infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):       #compare this to the best gain so far
            bestInfoGain = infoGain         #if better than current best, set to best
            bestFeature = i
    return bestFeature                      #returns an integer

This function is longer, so I copied it straight from the book. It builds on the two functions above, calcShannonEnt and splitDataSet. Its job is to find the split criterion that leaves the target column with the lowest entropy after splitting; in other words, the first split should make the data look as clean as possible. Let's demonstrate with code.

Test code:
chooseBestFeatureToSplit above already contains the print statements, so just run:

data_file = pd.read_csv('file.txt', sep='\t')
data_file = data_file.iloc[:,1:]
print(data_file)
a = data_file.values
b = a.tolist()

best = trees.chooseBestFeatureToSplit(b)
print(best)

Output:

no_surfacing flippers fish
0           L1       R1  yes
1           L1       R1  yes
2           L1       R0   no
3           L0       R1   no
4           L0       R1   no
5           L0       R1  yes
6           L0       R1  why
7           L2       R2  yes
8           L0       R0  why
[['R2', 'yes']]
0.0
[['R1', 'no'], ['R1', 'no'], ['R1', 'yes'], ['R1', 'why'], ['R0', 'why']]
1.5219280948873621
[['R1', 'yes'], ['R1', 'yes'], ['R0', 'no']]
0.9182958340544896

[['L1', 'yes'], ['L1', 'yes'], ['L0', 'no'], ['L0', 'no'], ['L0', 'yes'], ['L0', 'why']]
1.4591479170272448
[['L1', 'no'], ['L0', 'why']]
1.0
[['L2', 'yes']]
0.0

0

Let's walk through what this function does using the test output. The first pass uses the first column, 'no_surfacing', as the split criterion and divides the data into

[['R2', 'yes']]
0.0
[['R1', 'no'], ['R1', 'no'], ['R1', 'yes'], ['R1', 'why'], ['R0', 'why']]
1.5219280948873621
[['R1', 'yes'], ['R1', 'yes'], ['R0', 'no']]
0.9182958340544896

three groups in total. The entropy shown is that of the last column, the 'fish' column; the three groups have entropies 0.0, 1.5219280948873621 and 0.9182958340544896.
The function does not simply add these up: each group's entropy is weighted by its share of the rows, giving (1/9)·0.0 + (5/9)·1.5219 + (3/9)·0.9183 ≈ 1.1516.

Then, using the second column 'flippers' as the split criterion, the data is again divided into three groups:

[['L1', 'yes'], ['L1', 'yes'], ['L0', 'no'], ['L0', 'no'], ['L0', 'yes'], ['L0', 'why']]
1.4591479170272448
[['L1', 'no'], ['L0', 'why']]
1.0
[['L2', 'yes']]
0.0

Here the weighted entropy is (6/9)·1.4591 + (2/9)·1.0 + (1/9)·0.0 ≈ 1.1950, a little larger than the ≈1.1516 of the first split, so splitting on the first column (no_surfacing) leaves the data more ordered. That is why the function finally prints 0, telling us the first column is best (0 means the first column).
You can freely change the data in file.txt and rerun the test to see exactly what this function does.
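To tie this back to the information-gain wording quoted earlier, here is a minimal sketch (reusing the trees.py functions; not part of the book's code) that prints each feature's gain. Splitting on column 0 should show the larger gain:

import pandas as pd
import trees

data_file = pd.read_csv('file.txt', sep='\t')
b = data_file.iloc[:, 1:].values.tolist()

base = trees.calcShannonEnt(b)                     # entropy before any split
for i in range(len(b[0]) - 1):                     # every feature column
    weighted = sum(len(sub) / len(b) * trees.calcShannonEnt(sub)
                   for v in set(row[i] for row in b)
                   for sub in [trees.splitDataSet(b, i, v)])
    print(i, base - weighted)                      # information gain of feature i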

Building the Complete Tree

Code in trees.py:

def majorityCnt(classList):
    # return the most common class label (used when no features are left to split on)
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList): 
        return classList[0]#stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]       #copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree

Time and energy are limited (so is my ability), so I'll spare my hairline and skip the line-by-line walkthrough. Straight to the test code:

data_file = pd.read_csv('file.txt', sep='\t')
data_file = data_file.iloc[:,1:]
print(data_file)
a = data_file.values
b = a.tolist()

label = list(data_file.keys()[:-1])
print(b)
print(label)
mytree = trees.createTree(b, label)
print(mytree)
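One caveat: createTree deletes entries from the labels list it is given (the del(labels[bestFeat]) line), so label is modified after the call; pass in a copy such as list(label) if you still need it afterwards.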

The last bit of code is simple; it outputs:

no_surfacing flippers fish
0           L1       R1  yes
1           L1       R1  yes
2           L1       R0   no
3           L0       R1   no
4           L0       R1   no
5           L0       R1  yes
6           L0       R1  why
7           L2       R2  yes
8           L0       R0  why
[['L1', 'R1', 'yes'], ['L1', 'R1', 'yes'], ['L1', 'R0', 'no'], ['L0', 'R1', 'no'], ['L0', 'R1', 'no'], ['L0', 'R1', 'yes'], ['L0', 'R1', 'why'], ['L2', 'R2', 'yes'], ['L0', 'R0', 'why']]
['no_surfacing', 'flippers']
{'no_surfacing': {'L1': {'flippers': {'R0': 'no', 'R1': 'yes'}}, 'L0': {'flippers': {'R0': 'why', 'R1': 'no'}}, 'L2': 'yes'}}

The createTree() function is simple to call. The second argument, ['no_surfacing', 'flippers'], is the list of feature labels (a list), i.e. the column names from the first row excluding the final result column; the first argument is the whole dataset (a list).
The raw output is hard to read, so let's format it.

{'no_surfacing': {'L1': {'flippers': {'R0': 'no', 'R1': 'yes'}}, 'L0': {'flippers': {'R0': 'why', 'R1': 'no'}}, 'L2': 'yes'}}

Paste it into a site such as SoJson and click format:

{
	'no_surfacing': {
		'L1': {
			'flippers': {
				'R0': 'no',
				'R1': 'yes'
			}
		},
		'L0': {
			'flippers': {
				'R0': 'why',
				'R1': 'no'
			}
		},
		'L2': 'yes'
	}
}

Now it's easy to read: the first split, on 'no_surfacing', creates three branches. L2 yields a result immediately, while L0 and L1 are each split once more and then also reach their classifications.
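Once the tree is built, classifying a new sample is just a walk down the nested dict. A minimal sketch (classify is a hypothetical helper, not part of the trees.py above):

def classify(tree, featLabels, testVec):
    if not isinstance(tree, dict):                 # reached a leaf: it is the class label
        return tree
    feat = next(iter(tree))                        # feature name stored at this node
    subtree = tree[feat][testVec[featLabels.index(feat)]]
    return classify(subtree, featLabels, testVec)

print(classify(mytree, ['no_surfacing', 'flippers'], ['L1', 'R1']))   # -> yes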

Full Code

trees.py

'''
Created on Oct 12, 2010
Decision Tree Source Code for Machine Learning in Action Ch. 3
@author: Peter Harrington
'''
from math import log
import operator
import numpy as np
import pandas as pd

def calcShannonEnt(dataSet):
    dataSet = pd.DataFrame(dataSet)
    n = dataSet.shape[0]
    iset = dataSet.iloc[:, -1].value_counts()
    p = iset / n
    ent = (-p * np.log2(p)).sum()
    return ent
    
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
    
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):        #iterate over all the features
        featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
        uniqueVals = set(featList)       #get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):       #compare this to the best gain so far
            bestInfoGain = infoGain         #if better than current best, set to best
            bestFeature = i
    return bestFeature                      #returns an integer

def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList): 
        return classList[0]#stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]       #copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree

test.py

import pandas as pd
import numpy as np
import trees
from math import log

data_file = pd.read_csv('file.txt', sep='\t')
data_file = data_file.iloc[:,1:]
print(data_file)
a = data_file.values
b = a.tolist()

label = list(data_file.keys()[:-1])
print(b)
print(label)
mytree = trees.createTree(b, label)
print(mytree)