Logistic regression solves a binary classification problem, not a regression problem, despite its name.
In practice, logistic regression (the simplest classifier) is often tried on a dataset first, before moving on to other classification algorithms.
Basic formula (the linear regression model that the derivation starts from):
$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$  (1)
$\epsilon^{(i)}$ is the error term of each sample; the errors are independent and identically distributed, following a Gaussian distribution with mean 0 and variance $\sigma^2$. That is:
$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$  (2)
Substituting (1) into (2):
$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$
Likelihood function:
$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$
Taking the logarithm of the likelihood turns the product into a sum:
$\log L(\theta) = \log \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$
The expression above expands to:
$= \sum_{i=1}^{m} \log\left[\frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)\right]$
$= m \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$
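As a quick sanity check of this expansion (a hypothetical sketch, not part of the original derivation; the synthetic data, theta and sigma below are made up for illustration), the two sides can be compared numerically:
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 50, 0.5
X = np.column_stack([np.ones(m), rng.normal(size=m)])   # design matrix with a bias column
theta = np.array([1.0, 2.0])
y = X @ theta + rng.normal(scale=sigma, size=m)          # y_i = theta^T x_i + epsilon_i

residuals = y - X @ theta
# direct form: sum of the log Gaussian densities p(y_i | x_i; theta)
direct = np.sum(np.log(1 / (np.sqrt(2 * np.pi) * sigma)
                       * np.exp(-residuals**2 / (2 * sigma**2))))
# expanded form: m*log(1/(sqrt(2*pi)*sigma)) - (1/sigma^2)*(1/2)*sum of squared residuals
expanded = m * np.log(1 / (np.sqrt(2 * np.pi) * sigma)) - residuals @ residuals / (2 * sigma**2)
print(np.isclose(direct, expanded))   # True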
Dropping the constant terms, maximizing the log-likelihood is equivalent to minimizing:
$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$
(this is the least-squares objective)
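As a small illustrative sketch (not from the original derivation), $J(\theta)$ can be evaluated in one vectorized step, assuming X stores one sample per row:
import numpy as np

def least_squares_cost(X, y, theta):
    # J(theta) = 1/2 * sum_i (y_i - theta^T x_i)^2
    residuals = y - X @ theta
    return 0.5 * residuals @ residuals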
Differentiating this expression with the rules of matrix calculus gives the gradient:
$\nabla_\theta J(\theta) = X^T X \theta - X^T y$
Setting $\nabla_\theta J(\theta) = 0$ gives:
$\theta = (X^T X)^{-1} X^T y$
There is a problem here: we do not know whether $X^T X$ is actually invertible.
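One common way around this (a sketch, not part of the original text) is to use the pseudo-inverse, or an equivalent least-squares solver, which also covers the case where $X^T X$ is singular:
import numpy as np

def normal_equation(X, y):
    # theta = (X^T X)^(-1) X^T y, with pinv standing in for the inverse when X^T X is singular
    return np.linalg.pinv(X.T @ X) @ X.T @ y
    # equivalently: np.linalg.lstsq(X, y, rcond=None)[0]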
Rather than rely on the closed-form solution, we switch to another method:
Gradient descent
Objective function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(y^{(i)} - h_\theta(x^{(i)})\right)^2$
There are three strategies for running gradient descent:
1. Batch gradient descent: very accurate, but too slow.
2. Stochastic gradient descent: very fast, but each step is not guaranteed to move toward the optimum.
3. Mini-batch gradient descent: a compromise between the two, and the one chosen here (a minimal sketch follows this list). With a batch of 10 samples the update rule is:
$\theta_j' = \theta_j - \alpha \frac{1}{10} \sum_{k=i}^{i+9} \left(h_\theta(x^{(k)}) - y^{(k)}\right) x_j^{(k)}$
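A minimal sketch of a single mini-batch update, assuming a linear hypothesis $h_\theta(x) = \theta^T x$ and a batch already sliced out of the data (the names here are illustrative, not from the original):
import numpy as np

def minibatch_step(X_batch, y_batch, theta, alpha):
    # theta_j <- theta_j - alpha * (1/batch size) * sum_k (h_theta(x_k) - y_k) * x_kj
    errors = X_batch @ theta - y_batch          # h_theta(x_k) - y_k for the whole batch
    grad = X_batch.T @ errors / len(X_batch)    # average gradient over the batch
    return theta - alpha * grad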
Learning rate (step size): usually chosen small. If the step is too large, the iterates can overshoot and settle on a poor local solution rather than the global optimum; if it is too small, training becomes inefficient and slow. (Mini-batch sizes are typically 16, 32, 64 or 128; programmers just like these numbers, don't ask why.)
The sigmoid function
$g(z) = \frac{1}{1 + e^{-z}}$
Prediction function: $h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$
where $\theta^T x = \sum_{i=0}^{n} \theta_i x_i = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$
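A quick numerical illustration (with hypothetical score values) of how $g$ turns the score $\theta^T x$ into a probability:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0.0))    # 0.5      -> on the decision boundary
print(sigmoid(6.0))    # ~0.9975  -> confidently class 1
print(sigmoid(-6.0))   # ~0.0025  -> confidently class 0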
For the classification task, the two label cases are folded into a single probability distribution:
$p(y \mid x; \theta) = h_\theta(x)^{y} \left(1 - h_\theta(x)\right)^{1 - y}$
When $y = 0$, only the factor $\left(1 - h_\theta(x)\right)^{1 - y}$ remains.
When $y = 1$, only $h_\theta(x)^{y}$ remains, i.e. $p(y \mid x; \theta) = h_\theta(x)^{y}$.
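A tiny check (with a made-up value of $h_\theta(x)$) that the combined expression really reduces to the two cases:
def bernoulli_likelihood(h, y):
    # p(y | x; theta) = h^y * (1 - h)^(1 - y) for y in {0, 1}
    return h**y * (1 - h)**(1 - y)

h = 0.8                               # hypothetical h_theta(x)
print(bernoulli_likelihood(h, 1))     # 0.8   -> only h^y survives when y = 1
print(bernoulli_likelihood(h, 0))     # ~0.2  -> only (1 - h)^(1 - y) survives when y = 0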
Likelihood function:
$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}$
As with the previous method, take the logarithm to simplify:
$l(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left( y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right)$
Differentiating $\frac{1}{m} l(\theta)$ with respect to $\theta_j$ (a few steps of algebra) gives:
$\frac{\partial}{\partial \theta_j}\left(\frac{1}{m} l(\theta)\right) = -\frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
(here $i$ runs over the $m$ samples and $j$ indexes the feature)
Update of the parameter $\theta$:
$\theta_j' = \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
Code implementation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
path = 'E:/test/tidu/data/LogiReg_data.txt'
pddata=pd.read_csv(path,header=None,names=['Exam 1','Exam 2','Admitted'])
positive=pddata[pddata['Admitted']==1]
negative=pddata[pddata['Admitted']==0]
fig,ax = plt.subplots(figsize=(10,5))
ax.scatter(positive['Exam 1'], positive['Exam 2'], s=30, c='b', marker='o', label='Admitted')
ax.scatter(negative['Exam 1'], negative['Exam 2'], s=30, c='r', marker='x', label='Not Admitted')
ax.legend()
ax.set_xlabel('Exam 1 Score')
ax.set_ylabel('Exam 2 Score')
pddata.insert(0, 'ones', 1)     # add a bias column of ones (x_0 = 1)
orig_data = pddata.values       # .as_matrix() was removed from pandas; .values returns the same ndarray
cols = orig_data.shape[1]
X = orig_data[:,0:cols-1]
y = orig_data[:,cols-1:cols]
theta=np.zeros([1,3])
STOP_ITER=0
STOP_COST=1
STOP_GRAD=2
def sigmoid(z):
    # build the sigmoid function g(z) = 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))

def model(X, theta):
    # prediction function h_theta(x) = g(theta^T x)
    return sigmoid(np.dot(X, theta.T))

def gradient(X, y, theta):
    # gradient of the cost: (1/m) * sum_i (h_theta(x_i) - y_i) * x_ij, for each theta_j
    grad = np.zeros(theta.shape)
    error = (model(X, theta) - y).ravel()
    for j in range(len(theta.ravel())):   # for each parameter
        term = np.multiply(error, X[:, j])
        grad[0, j] = np.sum(term) / len(X)
    return grad

def cost(X, y, theta):
    # average negative log-likelihood
    left = np.multiply(-y, np.log(model(X, theta)))
    right = np.multiply(1 - y, np.log(1 - model(X, theta)))
    return np.sum(left - right) / (len(X))

def stopcriterion(type, value, threshold):
    # three different stopping strategies
    if type == STOP_ITER:   return value > threshold
    elif type == STOP_COST: return abs(value[-1] - value[-2]) < threshold
    elif type == STOP_GRAD: return np.linalg.norm(value) < threshold

def shuffleData(data):
    # shuffle the rows, then split back into features X and labels y
    np.random.shuffle(data)
    cols = data.shape[1]
    X = data[:, 0:cols-1]
    y = data[:, cols-1:]
    return X, y
def descent(data, theta, batchSize, stopType, thresh, alpha):
    # solve by gradient descent
    init_time = time.time()
    i = 0  # iteration counter
    k = 0  # batch pointer
    X, y = shuffleData(data)
    grad = np.zeros(theta.shape)     # computed gradient
    costs = [cost(X, y, theta)]      # cost history
    while True:
        grad = gradient(X[k:k+batchSize], y[k:k+batchSize], theta)
        k += batchSize               # take batchSize samples per step
        if k >= n:
            k = 0
            X, y = shuffleData(data)     # reshuffle once the data is exhausted
        theta = theta - alpha * grad     # parameter update
        costs.append(cost(X, y, theta))  # record the new cost
        i += 1
        if stopType == STOP_ITER:   value = i
        elif stopType == STOP_COST: value = costs
        elif stopType == STOP_GRAD: value = grad
        if stopcriterion(stopType, value, thresh): break
    return theta, i-1, costs, grad, time.time() - init_time
def runExpe(data, theta, batchSize, stopType, thresh, alpha):
    #import pdb; pdb.set_trace();
    theta, iter, costs, grad, dur = descent(data, theta, batchSize, stopType, thresh, alpha)
    name = "Original" if (data[:, 1] > 2).sum() > 1 else "Scaled"
    name += " data - learning rate: {} - ".format(alpha)
    if batchSize == n:   strDescType = "Gradient"
    elif batchSize == 1: strDescType = "Stochastic"
    else:                strDescType = "Mini-batch ({})".format(batchSize)
    name += strDescType + " descent - Stop: "
    if stopType == STOP_ITER:   strStop = "{} iterations".format(thresh)
    elif stopType == STOP_COST: strStop = "costs change < {}".format(thresh)
    else:                       strStop = "gradient norm < {}".format(thresh)
    name += strStop
    print("***{}\nTheta: {} - Iter: {} - Last cost: {:03.2f} - Duration: {:03.2f}s".format(
        name, theta, iter, costs[-1], dur))
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.plot(np.arange(len(costs)), costs, 'r')
    ax.set_xlabel('Iterations')
    ax.set_ylabel('Cost')
    ax.set_title(name.upper() + ' - Error vs. Iteration')
    return theta
n = 100   # number of training samples; batchSize = n gives full-batch gradient descent
runExpe(orig_data, theta, n, STOP_COST, thresh=0.000001, alpha=0.001)
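For comparison, the other descent strategies and stopping rules can be exercised with the same helper; the thresholds and learning rates below are illustrative guesses rather than tuned values:
runExpe(orig_data, theta, n, STOP_ITER, thresh=5000, alpha=0.000001)   # full-batch, stop after a fixed number of iterations
runExpe(orig_data, theta, 1, STOP_ITER, thresh=5000, alpha=0.001)      # stochastic (batch size 1)
runExpe(orig_data, theta, 16, STOP_ITER, thresh=15000, alpha=0.001)    # mini-batch of 16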