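I. Introduction

The computation in gradAscent()

The data and the labels are both converted to NumPy matrices, so " * " denotes matrix multiplication.

Dimensions:

  • Data: 100 rows × 3 columns (a constant column has been added)
  • Labels: 100 rows × 1 column
  • Initial weights: 3 rows × 1 column

 

Steps in each iteration:

  1. Multiply the data matrix (100 × 3) by the weight matrix (3 × 1); the result is 100 × 1.
  2. Pass that product (100 × 1) through sigmoid(); the result, also 100 × 1, is the vector of predictions.
  3. Subtract the predictions (100 × 1) from the labels (100 × 1); the result is the 100 × 1 error vector.
  4. Add step size × transposed data matrix (3 × 100) × error (100 × 1) to the weight matrix (3 × 1); the result is the updated 3 × 1 weight matrix.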
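
The four steps above can be traced with plain NumPy arrays. The sketch below is only an illustration, not the code from this post: it uses random stand-in data instead of testSet.txt, the `@` operator on arrays instead of `*` on matrices, and the hypothetical names `one_iteration` and `rng`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_iteration(data, labels, weights, alpha=0.001):
    """Perform one gradient-ascent update; return the new weights and the error vector."""
    z = data @ weights                                # step 1: (100, 3) @ (3, 1) -> (100, 1)
    h = sigmoid(z)                                    # step 2: predictions, (100, 1)
    error = labels - h                                # step 3: labels minus predictions, (100, 1)
    new_weights = weights + alpha * data.T @ error    # step 4: (3, 100) @ (100, 1) -> (3, 1)
    return new_weights, error

rng = np.random.default_rng(0)                        # random stand-in data (assumption)
features = rng.normal(size=(100, 2))
data = np.hstack([np.ones((100, 1)), features])       # constant column plus two features
labels = rng.integers(0, 2, size=(100, 1)).astype(float)
weights = np.ones((3, 1))

new_weights, error = one_iteration(data, labels, weights)
print(data.shape, error.shape, new_weights.shape)     # (100, 3) (100, 1) (3, 1)
```

Printing the shapes is a cheap way to confirm that every product in the loop is dimensionally consistent before running all the iterations.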

 


II. Gradient ascent

Function:

$$f(x) = -x^2 + 4x$$

Derivative:

$$f'(x) = -2x + 4$$

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

x = np.linspace(0,4)
y = -x**2 + 4*x

plt.plot(x,y,'-k')

[Figure: plot of y = -x**2 + 4*x for x from 0 to 4]

# Writing it out makes the idea clear

def grad_ascent():
    def f_prime(x_old):
        return -2 * x_old + 4
    x_old = 4
    x_new = 0
    alpha = 0.01          # step size
    precision = 0.00000001
    # Stop once two successive estimates differ by no more than precision
    while abs(x_new - x_old) > precision:
        x_old = x_new
        x_new = x_old + alpha * f_prime(x_old)
    print(x_new)

grad_ascent()

Output:

1.999999515279857

 

Mathematical expression:

$$x_{new} = x_{old} + \alpha f'(x_{old})$$
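
The same update rule can be written as a reusable function. This is a minimal sketch under my own assumptions: the name `gradient_ascent_1d`, the `max_steps` safety cap, and returning the step count are not part of the original code.

```python
def gradient_ascent_1d(f_prime, x_start=0.0, alpha=0.01, precision=1e-8, max_steps=100_000):
    """Climb along f_prime until two successive estimates differ by at most `precision`."""
    x = x_start
    for step in range(1, max_steps + 1):
        x_next = x + alpha * f_prime(x)      # x_new = x_old + alpha * f'(x_old)
        if abs(x_next - x) <= precision:
            return x_next, step
        x = x_next
    return x, max_steps                      # cap reached without meeting the tolerance

# Derivative of f(x) = -x**2 + 4*x, whose maximum is at x = 2
x_max, steps = gradient_ascent_1d(lambda x: -2 * x + 4)
print(x_max, steps)
```

Passing the derivative in as an argument keeps the loop independent of the particular function, and the step cap guards against a step size that never converges.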

 

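III. Logistic regression formulas

 

Likelihood function:

$$L(w) = \prod_{i=1}^{m} h_w(x_i)^{y_i}\,\bigl(1 - h_w(x_i)\bigr)^{1 - y_i}, \qquad h_w(x) = \frac{1}{1 + e^{-w^T x}}$$

Gradient-ascent update for the log-likelihood $\ell(w) = \log L(w)$:

$$w := w + \alpha \nabla_w \ell(w) = w + \alpha X^T (y - h)$$

Here $X$ is the data matrix, $y$ the label vector, $h = \sigma(Xw)$ the vector of predictions and $\alpha$ the step size, matching the update performed in gradAscent().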
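
For readers who want the missing step, here is a short derivation of why the gradient of the log-likelihood reduces to the "data transposed times error" product used in gradAscent(). The notation ($h_i$, $x_{ij}$, $\ell$) is my own choice and may differ from the formulas originally shown in this post.

```latex
\begin{aligned}
\ell(w) &= \sum_{i=1}^{m} \Bigl[ y_i \log h_i + (1 - y_i) \log (1 - h_i) \Bigr],
  \qquad h_i = \sigma(w^T x_i), \quad \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr) \\
\frac{\partial \ell}{\partial w_j}
  &= \sum_{i=1}^{m} \left( \frac{y_i}{h_i} - \frac{1 - y_i}{1 - h_i} \right) h_i (1 - h_i)\, x_{ij}
   = \sum_{i=1}^{m} (y_i - h_i)\, x_{ij} \\
\nabla_w \ell(w) &= X^T (y - h)
\end{aligned}
```

The last line is exactly the term that gets multiplied by the step size and added to the weights in each iteration.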

 

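Gradient-descent update, applied to the loss $J(w) = -\ell(w)$, where $\ell(w)$ is the log of the likelihood function:

$$w := w - \alpha \nabla_w J(w)$$

Gradient ascent and gradient descent are really the same update here: ascent climbs the log-likelihood, while descent minimizes its negative, so the only difference is the minus sign placed in front of the function being differentiated.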
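
A small numerical check makes the equivalence concrete. The sketch below assumes the loss is the negative log-likelihood; the function names and the three-row toy data are mine, not from the original post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood_gradient(X, y, w):
    """Gradient of the log-likelihood: X^T (y - h)."""
    return X.T @ (y - sigmoid(X @ w))

def ascent_step(X, y, w, alpha):
    # Gradient ascent on the log-likelihood
    return w + alpha * log_likelihood_gradient(X, y, w)

def descent_step(X, y, w, alpha):
    # Gradient descent on the negative log-likelihood
    loss_gradient = -log_likelihood_gradient(X, y, w)
    return w - alpha * loss_gradient

# Toy data (made up): constant column plus two features
X = np.array([[1.0, 0.5, 2.0],
              [1.0, -1.0, 0.3],
              [1.0, 2.0, -0.7]])
y = np.array([[1.0], [0.0], [1.0]])
w = np.ones((3, 1))

same = np.allclose(ascent_step(X, y, w, 0.001), descent_step(X, y, w, 0.001))
print(same)  # True
```

The two minus signs in `descent_step` cancel, which is why both functions return the same weights.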

 

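IV. Code implementation

Notes:

1. loadDataSet():

A constant term (a leading 1.0) must be added to every sample.

For the computation, the mat() function is used to convert the data into a NumPy matrix.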
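
To make both points concrete, the sketch below prepends the constant column to two rows taken from the data listing and compares the matrix product with its plain-array equivalent. It is an illustration under assumptions: `np.asmatrix` is used instead of mat() because, as far as I know, the `mat` alias is no longer available in NumPy 2.x, and the all-ones weights are just the starting values.

```python
import numpy as np

# First two samples from the data listing (features only, labels omitted)
raw = [[-0.017612, 14.053064],
       [-1.395634, 4.662541]]
features = np.array(raw)

# Prepend the constant column so the first weight acts as the intercept
data = np.column_stack([np.ones(len(features)), features])
weights = np.ones(3)

# With matrix objects, * is a matrix product: (2, 3) * (3, 1) -> (2, 1)
as_matrix = np.asmatrix(data) * np.asmatrix(weights).T

# With plain arrays, * would be element-wise, so @ is used for the same product
as_array = data @ weights

print(data.shape, as_matrix.shape, as_array.shape)   # (2, 3) (2, 1) (2,)
```

Plain arrays with `@` give the same numbers and avoid the matrix class, at the cost of having to keep track of column versus flat shapes yourself.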

 

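2. The computation in gradAscent()

The data and the labels are both converted to NumPy matrices, so " * " denotes matrix multiplication. The dimensions and the per-iteration steps are the ones listed in the introduction.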


from numpy import *
filename = 'testSet.txt'
def loadDataSet():
    dataMat = []
    labelMat = []
    # The with-block closes the file automatically
    with open(filename) as fr:
        for line in fr:
            # strip() removes leading and trailing whitespace (including the newline);
            # split() with no argument splits on runs of whitespace
            lineArr = line.strip().split()
            # The leading 1.0 is the constant term: with two features X1, X2
            # the model needs three parameters, W1+W2*X1+W3*X2
            dataMat.append([1.0,float(lineArr[0]),float(lineArr[1])])
            labelMat.append(int(lineArr[2]))

    return dataMat,labelMat

def sigmoid(inX):
    return 1.0/(1+exp(-inX))

def gradAscent(dataMat,labelMat):
    # Converting to matrices with mat() makes * perform matrix multiplication (rows times columns).
    # mat() turns a flat list into a row matrix, so transpose() is needed to get a column matrix.
    # For reference: ones() builds an all-ones array, zeros() an all-zeros array, eye() an identity matrix.
    dataMatrix = mat(dataMat)                  # m x n
    classLabels = mat(labelMat).transpose()    # m x 1
    m,n = shape(dataMatrix)
    alpha = 0.001      # step size
    maxCycles = 500    # number of iterations
    weights = ones((n,1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix*weights)        # predictions, m x 1
        error = (classLabels-h)                # labels minus predictions, m x 1
        weights = weights + alpha*dataMatrix.transpose()*error   # updated weights, n x 1
    return weights

def plotBestFit(weights):
    import matplotlib.pyplot as plt
    dataMat,labelMat = loadDataSet()
    dataArr = array(dataMat)
    # print(dataArr)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i,1])
            ycord1.append(dataArr[i,2])
        else:
            xcord2.append(dataArr[i,1])
            ycord2.append(dataArr[i,2])

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1,ycord1,s=30,c='red',marker='s')
    ax.scatter(xcord2,ycord2,s=30,c='green')
    x = arange(-3.0,3.0,0.1)
    y = (-weights[0]-weights[1]*x)/weights[2]
    ax.plot(x,y)
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.show()
    

def main():
    dataMat,labelMat = loadDataSet()
    weights = gradAscent(dataMat,labelMat).getA()
    plotBestFit(weights)
  
if __name__ == "__main__":
    main()

[Figure: output of plotBestFit(), a scatter plot of the two classes (x1 against x2) with the fitted decision boundary]
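
The line drawn by plotBestFit() is the set of points where w0 + w1*x1 + w2*x2 = 0, that is, where the sigmoid outputs exactly 0.5. The sketch below shows this and how the same weights classify a point. The weight values are hypothetical, chosen only for illustration; they are not the result of running gradAscent(), and the helper names are mine.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def boundary_x2(weights, x1):
    """x2 on the decision boundary for a given x1 (same formula as in plotBestFit)."""
    w0, w1, w2 = weights
    return (-w0 - w1 * x1) / w2

def classify(weights, x1, x2):
    """Return 1 when the predicted probability exceeds 0.5, otherwise 0."""
    w0, w1, w2 = weights
    return 1 if sigmoid(w0 + w1 * x1 + w2 * x2) > 0.5 else 0

weights = (4.0, 0.5, -0.6)    # hypothetical weights, NOT the trained values

x1 = 1.0
x2 = boundary_x2(weights, x1)
on_line = sigmoid(weights[0] + weights[1] * x1 + weights[2] * x2)
print(on_line)                                   # approximately 0.5 on the boundary

# Two samples from the data listing, labeled 0 and 1 respectively
print(classify(weights, -0.017612, 14.053064))   # 0
print(classify(weights, 1.176813, 3.167020))     # 1
```

With real trained weights the same `classify` function could be used to count how many of the 100 samples fall on the correct side of the line.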

Data (contents of testSet.txt):

-0.017612   14.053064   0
-1.395634   4.662541    1
-0.752157   6.538620    0
-1.322371   7.152853    0
0.423363    11.054677   0
0.406704    7.067335    1
0.667394    12.741452   0
-2.460150   6.866805    1
0.569411    9.548755    0
-0.026632   10.427743   0
0.850433    6.920334    1
1.347183    13.175500   0
1.176813    3.167020    1
-1.781871   9.097953    0
-0.566606   5.749003    1
0.931635    1.589505    1
-0.024205   6.151823    1
-0.036453   2.690988    1
-0.196949   0.444165    1
1.014459    5.754399    1
1.985298    3.230619    1
-1.693453   -0.557540   1
-0.576525   11.778922   0
-0.346811   -1.678730   1
-2.124484   2.672471    1
1.217916    9.597015    0
-0.733928   9.098687    0
-3.642001   -1.618087   1
0.315985    3.523953    1
1.416614    9.619232    0
-0.386323   3.989286    1
0.556921    8.294984    1
1.224863    11.587360   0
-1.347803   -2.406051   1
1.196604    4.951851    1
0.275221    9.543647    0
0.470575    9.332488    0
-1.889567   9.542662    0
-1.527893   12.150579   0
-1.185247   11.309318   0
-0.445678   3.297303    1
1.042222    6.105155    1
-0.618787   10.320986   0
1.152083    0.548467    1
0.828534    2.676045    1
-1.237728   10.549033   0
-0.683565   -2.166125   1
0.229456    5.921938    1
-0.959885   11.555336   0
0.492911    10.993324   0
0.184992    8.721488    0
-0.355715   10.325976   0
-0.397822   8.058397    0
0.824839    13.730343   0
1.507278    5.027866    1
0.099671    6.835839    1
-0.344008   10.717485   0
1.785928    7.718645    1
-0.918801   11.560217   0
-0.364009   4.747300    1
-0.841722   4.119083    1
0.490426    1.960539    1
-0.007194   9.075792    0
0.356107    12.447863   0
0.342578    12.281162   0
-0.810823   -1.466018   1
2.530777    6.476801    1
1.296683    11.607559   0
0.475487    12.040035   0
-0.783277   11.009725   0
0.074798    11.023650   0
-1.337472   0.468339    1
-0.102781   13.763651   0
-0.147324   2.874846    1
0.518389    9.887035    0
1.015399    7.571882    0
-1.658086   -0.027255   1
1.319944    2.171228    1
2.056216    5.019981    1
-0.851633   4.375691    1
-1.510047   6.061992    0
-1.076637   -3.181888   1
1.821096    10.283990   0
3.010150    8.401766    1
-1.099458   1.688274    1
-0.834872   -1.733869   1
-0.846637   3.849075    1
1.400102    12.628781   0
1.752842    5.468166    1
0.078557    0.059736    1
0.089392    -0.715300   1
1.825662    12.693808   0
0.197445    9.744638    0
0.126117    0.922311    1
-0.679797   1.220530    1
0.677983    2.556666    1
0.761349    10.693862   0
-2.168791   0.143632    1
1.388610    9.341997    0
0.317029    14.739025   0

 

 

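V. Building logistic regression with scikit-learn (sklearn)

 

The task is to predict whether a horse with colic will die, based on its symptoms.

Original dataset download: http://archive.ics.uci.edu/ml/datasets/Horse+Colic

The data contains 368 samples and 28 features.

The original dataset was preprocessed and saved as two files: horseColicTest.txt and horseColicTraining.txt.

Sample of the data:

[Two screenshots of the data files; images not preserved]

 

As the code shows, the scikit-learn part comes down to just two lines: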

classifier = LogisticRegression(solver='liblinear',max_iter=10).fit(trainingSet, trainingLabels)
test_accurcy = classifier.score(testSet, testLabels) * 100

 

from sklearn.linear_model import LogisticRegression

def colicSklearn():
    trainingSet = []; trainingLabels = []
    testSet = []; testLabels = []
    # Each line holds tab-separated feature values followed by the class label
    with open('horseColicTraining.txt') as frTrain:                                 # training set
        for line in frTrain:
            currLine = line.strip().split('\t')
            trainingSet.append([float(v) for v in currLine[:-1]])
            trainingLabels.append(float(currLine[-1]))
    with open('horseColicTest.txt') as frTest:                                      # test set
        for line in frTest:
            currLine = line.strip().split('\t')
            testSet.append([float(v) for v in currLine[:-1]])
            testLabels.append(float(currLine[-1]))
    classifier = LogisticRegression(solver='liblinear',max_iter=10).fit(trainingSet, trainingLabels)
    # score() returns the mean accuracy as a fraction; scale it to a percentage
    test_accurcy = classifier.score(testSet, testLabels) * 100
    print('Accuracy: %f%%' % test_accurcy)

if __name__ == '__main__':
    colicSklearn()

Result:

Accuracy: 73.134328%
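
Because the training and test files are parsed in exactly the same way, the two loops could be factored into one helper. The sketch below is one possible shape for it: the name `parse_tab_separated` and the sample rows are made up for illustration and are not taken from the real data files.

```python
def parse_tab_separated(lines):
    """Split each tab-separated line into a list of feature floats and a float label."""
    features, labels = [], []
    for line in lines:
        fields = line.strip().split('\t')
        if fields == ['']:
            continue                      # skip blank lines
        features.append([float(v) for v in fields[:-1]])
        labels.append(float(fields[-1]))
    return features, labels

# Made-up rows for illustration (three features, then the label)
sample = ["2.0\t1.0\t38.5\t1.0\n",
          "1.0\t1.0\t39.2\t0.0\n"]
X, y = parse_tab_separated(sample)
print(X, y)

# With the real files, an open file object can be passed directly, for example:
# with open('horseColicTraining.txt') as f:
#     trainingSet, trainingLabels = parse_tab_separated(f)
```

Accepting any iterable of lines keeps the helper easy to test without touching the file system.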