Decision Tree Classification
- Importing packages
- Dataset
- Information entropy
- Computing information entropy
- Splitting the data
- Finding the split that minimizes the entropy
- Building the full tree
- Full code
Importing packages
import pandas as pd
import numpy as np
# trees is a .py module I wrote myself, placed in the same directory; its code is listed later
import trees
from math import log
import operator
Dataset
file.txt
No. no surfacing flippers fish
1 L1 R1 yes
2 L1 R1 yes
3 L1 R0 no
4 L0 R1 no
5 L0 R1 no
6 L0 R1 yes
7 L0 R1 why
8 L2 R2 yes
9 L0 R0 why
The exact data here doesn't really matter. I recommend creating a new txt file; to make the later print output easier to follow, each column uses its own distinctive markers. Name it file.txt. You can change the data at any time to observe what each step of the code means.
Information entropy
There's no way around it: even though I only want to get something working, without a basic grasp of this part it is essentially impossible to understand what the code means. In short, information entropy measures how disordered a set of outcomes is. An event that is certain to happen is the purest, cleanest case and has an entropy of 0. Conversely, the more possible outcomes and the more uncertainty there is, the larger the entropy (entropy can be greater than 1). Entropy reaches its maximum when all outcomes are equally likely. For more, look up "Shannon entropy" on Baidu Baike; it is quite interesting.
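Concretely, if the i-th distinct label in a column appears with proportion p_i, the Shannon entropy (in bits) is
H = -Σ p_i · log2(p_i)
For example, the list [1,1,1,2,2,2] has two labels, each with p = 0.5, so H = -(0.5·log2(0.5) + 0.5·log2(0.5)) = 1.0 bit; a list containing only one label gives H = 0.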
Computing information entropy
Code in trees.py
def calcShannonEnt(dataSet):
    # accept a list / ndarray / Series / DataFrame and normalise it to a DataFrame
    dataSet = pd.DataFrame(dataSet)
    n = dataSet.shape[0]
    # count how often each label appears in the last column
    iset = dataSet.iloc[:, -1].value_counts()
    p = iset / n
    # Shannon entropy: -sum(p * log2(p))
    ent = (-p * np.log2(p)).sum()
    return ent
The code is short: it takes a list, ndarray or DataFrame and returns the entropy of the data (for multi-column data it only returns the entropy of the last column).
Test code:
print(trees.calcShannonEnt([1,1,1,2,2,2]))
print(trees.calcShannonEnt(np.array([1,1,2,2,2,2])))
data_file = pd.read_csv('file.txt', sep='\t')
print(trees.calcShannonEnt(data_file))
print(trees.calcShannonEnt(data_file['flippers']))
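With file.txt exactly as listed above, the four prints should come out roughly as:
1.0
0.9182958340544896
1.5305 (entropy of the last column, 'fish'; pandas will print more digits)
1.2244 (entropy of the 'flippers' column)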
Splitting the data
Code in trees.py
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]             # chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
Parameter 1: the dataset (<class 'list'>)
Parameter 2: the position of the column to split on (0 is the first column)
Parameter 3: the value to split on; only rows whose value in that column equals this are kept
Return value: the result of the split (<class 'list'>), i.e. the matching rows with the split column removed
Test code:
data_file = pd.read_csv('file.txt', sep='\t')
data_file = data_file.iloc[:,1:]
a = data_file.values
b = a.tolist()
print(trees.splitDataSet(b, 0, 'L1'))
print(trees.splitDataSet(b, 1, 'R1'))
print(trees.splitDataSet(b, 2, "no"))
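With the data above, the three prints should look roughly like this (only rows whose value in the chosen column matches are kept, with that column removed):
[['R1', 'yes'], ['R1', 'yes'], ['R0', 'no']]
[['L1', 'yes'], ['L1', 'yes'], ['L0', 'no'], ['L0', 'no'], ['L0', 'yes'], ['L0', 'why']]
[['L1', 'R0'], ['L0', 'R1'], ['L0', 'R1']]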
Finding the split that minimizes the entropy
A supplementary note. Material I found later puts it this way: "The ID3 algorithm uses information gain: after I add this feature, by how much does my entropy decrease? The more the uncertainty decreases, the more information we gain, so we pick the feature with the largest information gain as the node."
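For reference, this is the formula the loop below implements: if feature A splits dataset D into subsets D_v (one per distinct value v), the information gain is
Gain(D, A) = H(D) − Σ_v (|D_v| / |D|) · H(D_v)
where H is the Shannon entropy of the label column. chooseBestFeatureToSplit computes this for every feature and returns the index of the one with the largest gain.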
Code in trees.py
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      # the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):           # iterate over all the features
        featList = [example[i] for example in dataSet]   # create a list of all the examples of this feature
        uniqueVals = set(featList)         # get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            # In real use you can comment out these prints; the three prints here
            # are only meant to show what the function is doing.
            print(subDataSet)
            print(calcShannonEnt(subDataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
            print()
        infoGain = baseEntropy - newEntropy      # calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):            # compare this to the best gain so far
            bestInfoGain = infoGain              # if better than current best, set to best
            bestFeature = i
    return bestFeature                           # returns an integer
There is quite a lot of code here, so I copied it straight from the book. It builds on the two functions above, calcShannonEnt and splitDataSet. Its job is to find the split criterion that leaves the target column with the lowest (weighted) entropy, i.e. the split that makes the data look cleanest after the first pass. Let's demonstrate with code.
Test code:
The chooseBestFeatureToSplit above already contains the print statements, so just run:
data_file = pd.read_csv('file.txt', sep='\t')
data_file = data_file.iloc[:,1:]
print(data_file)
a = data_file.values
b = a.tolist()
best = trees.chooseBestFeatureToSplit(b)
print(best)
Output:
no surfacing flippers fish
0 L1 R1 yes
1 L1 R1 yes
2 L1 R0 no
3 L0 R1 no
4 L0 R1 no
5 L0 R1 yes
6 L0 R1 why
7 L2 R2 yes
8 L0 R0 why
[['R2', 'yes']]
0.0
[['R1', 'no'], ['R1', 'no'], ['R1', 'yes'], ['R1', 'why'], ['R0', 'why']]
1.5219280948873621
[['R1', 'yes'], ['R1', 'yes'], ['R0', 'no']]
0.9182958340544896
[['L1', 'yes'], ['L1', 'yes'], ['L0', 'no'], ['L0', 'no'], ['L0', 'yes'], ['L0', 'why']]
1.4591479170272448
[['L1', 'no'], ['L0', 'why']]
1.0
[['L2', 'yes']]
0.0
0
Let's use the test output to explain what this function does. First, splitting on the first column, 'no surfacing', divides the data into:
[['R2', 'yes']]
0.0
[['R1', 'no'], ['R1', 'no'], ['R1', 'yes'], ['R1', 'why'], ['R0', 'why']]
1.5219280948873621
[['R1', 'yes'], ['R1', 'yes'], ['R0', 'no']]
0.9182958340544896
Three groups in total. The entropy shown for each group is that of the last column, the 'fish' column; the three entropies are 0.0, 1.5219280948873621 and 0.9182958340544896, and their sum is 2.440223928941852.
Then, splitting on the second column, 'flippers', also divides the data into three groups:
[['L1', 'yes'], ['L1', 'yes'], ['L0', 'no'], ['L0', 'no'], ['L0', 'yes'], ['L0', 'why']]
1.4591479170272448
[['L1', 'no'], ['L0', 'why']]
1.0
[['L2', 'yes']]
0.0
The sum of these three entropies is 2.4591479170272448, slightly larger than for the first split, so splitting on the first column ('no surfacing') leaves the data more ordered. That is why the function finally prints 0, telling us the first column is best (0 means the first column).
You can also change the data in file.txt however you like and rerun the test to see exactly what this function does.
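One note on what the code actually computes: it does not add the raw entropies together, but weights each subset's entropy by its share of the rows (prob = len(subDataSet)/float(len(dataSet))) and subtracts that weighted sum from the base entropy to get the information gain. With the numbers printed above (assuming file.txt is exactly as listed), the base entropy of 'fish' is about 1.530; splitting on 'no surfacing' gives a weighted entropy of about 1/9·0 + 3/9·0.918 + 5/9·1.522 ≈ 1.152 (gain ≈ 0.379), while splitting on 'flippers' gives about 1/9·0 + 2/9·1.0 + 6/9·1.459 ≈ 1.195 (gain ≈ 0.336). The ranking is the same as with the plain sums, so column 0 still wins.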
Building the full tree
Code in trees.py
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]          # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:         # stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]        # copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
Time, energy (and ability) are limited, so I won't tear my hair out explaining these line by line. Briefly: majorityCnt just returns the most common class label (a majority vote), and createTree builds the tree recursively, stopping either when all remaining samples share the same class or when no features are left (in which case it falls back to the majority vote). Straight to the test code.
data_file = pd.read_csv('file.txt', sep='\t')
data_file = data_file.iloc[:,1:]
a = data_file.values
b = a.tolist()
label = list(data_file.keys()[:-1])
print(b)
print(label)
mytree = trees.createTree(b, label)
print(mytree)
The final code is simple; the output is:
no_surfacing flippers fish
0 L1 R1 yes
1 L1 R1 yes
2 L1 R0 no
3 L0 R1 no
4 L0 R1 no
5 L0 R1 yes
6 L0 R1 why
7 L2 R2 yes
8 L0 R0 why
[['L1', 'R1', 'yes'], ['L1', 'R1', 'yes'], ['L1', 'R0', 'no'], ['L0', 'R1', 'no'], ['L0', 'R1', 'no'], ['L0', 'R1', 'yes'], ['L0', 'R1', 'why'], ['L2', 'R2', 'yes'], ['L0', 'R0', 'why']]
['no_surfacing', 'flippers']
{'no_surfacing': {'L1': {'flippers': {'R0': 'no', 'R1': 'yes'}}, 'L0': {'flippers': {'R0': 'why', 'R1': 'no'}}, 'L2': 'yes'}}
The createTree() function is simple to call: the second argument is ['no_surfacing', 'flippers'], i.e. the classification labels (a list), which are the column names from the first row (excluding the result column at the end); the first argument is the whole dataset (a list).
The output is not very readable, so let's format it.
{'no_surfacing': {'L1': {'flippers': {'R0': 'no', 'R1': 'yes'}}, 'L0': {'flippers': {'R0': 'why', 'R1': 'no'}}, 'L2': 'yes'}}
Just paste it into the SoJson website and click format.
{
    'no_surfacing': {
        'L1': {
            'flippers': {
                'R0': 'no',
                'R1': 'yes'
            }
        },
        'L0': {
            'flippers': {
                'R0': 'why',
                'R1': 'no'
            }
        },
        'L2': 'yes'
    }
}
Now it's easy to read: the first split on 'no_surfacing' gives three branches. The L2 branch immediately yields a result, while the L0 and L1 branches are each split once more, which also gives their classification results.
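As a quick extra (this is my own minimal sketch, not part of the book's trees.py; the classify name and its arguments are hypothetical), a nested dict like this can be used to classify a new sample by walking it from the root:
def classify(tree, featLabels, sample):
    # tree: nested dict produced by createTree
    # featLabels: feature names in column order, e.g. ['no_surfacing', 'flippers']
    # sample: one feature vector, e.g. ['L1', 'R0']
    if not isinstance(tree, dict):              # reached a leaf: it is the class label
        return tree
    feature = next(iter(tree))                  # the feature this node splits on
    value = sample[featLabels.index(feature)]   # the sample's value for that feature
    return classify(tree[feature][value], featLabels, sample)
Note that createTree mutates the labels list it is given (del(labels[bestFeat])), so pass a fresh list here, e.g.:
print(classify(mytree, ['no_surfacing', 'flippers'], ['L1', 'R0']))   # should print 'no'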
Full code
trees.py
'''
Created on Oct 12, 2010
Decision Tree Source Code for Machine Learning in Action Ch. 3
@author: Peter Harrington
'''
from math import log
import operator
import numpy as np
import pandas as pd
def calcShannonEnt(dataSet):
    dataSet = pd.DataFrame(dataSet)
    n = dataSet.shape[0]
    iset = dataSet.iloc[:, -1].value_counts()
    p = iset / n
    ent = (-p * np.log2(p)).sum()
    return ent
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]             # chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      # the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):           # iterate over all the features
        featList = [example[i] for example in dataSet]   # create a list of all the examples of this feature
        uniqueVals = set(featList)         # get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy      # calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):            # compare this to the best gain so far
            bestInfoGain = infoGain              # if better than current best, set to best
            bestFeature = i
    return bestFeature                           # returns an integer
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]          # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:         # stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]        # copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
test.py
import pandas as pd
import numpy as np
import trees
from math import log
data_file = pd.read_csv('file.txt', sep='\t')
data_file = data_file.iloc[:,1:]
a = data_file.values
b = a.tolist()
label = list(data_file.keys()[:-1])
print(b)
print(label)
mytree = trees.createTree(b, label)
print(mytree)