Introduction to the Models
Choose an appropriate estimator for the problem at hand: classification (SVC, KNN, LR, NaiveBayes, …), regression (Lasso, ElasticNet, SVR, …), clustering (KMeans, …), or dimensionality reduction (PCA, …).
Machine learning models fall into two broad categories by the kind of data they can use: supervised and unsupervised learning.
- Supervised learning mainly covers models for classification and for regression:
  - Classification:
    - Linear classifiers (e.g. LR)
    - Support vector machines (SVM)
    - Naive Bayes (NB)
    - K-nearest neighbors (KNN)
    - Decision trees (DT)
    - Ensemble models (e.g. random forests, GBDT)
  - Regression:
    - Linear regression
    - Support vector machines (SVM)
    - K-nearest neighbors (KNN)
    - Regression trees (DT)
    - Ensemble models (ExtraTrees/RF/GBDT)
- Unsupervised learning mainly covers:
  - Clustering (K-means)
  - Dimensionality reduction (PCA)
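Whichever model you pick, scikit-learn estimators share the same basic interface: `fit` on training data, then `predict` and `score` on new data. A minimal sketch of that pattern (using the iris dataset and KNN purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Every estimator follows the same fit / predict / score pattern
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

clf = KNeighborsClassifier()       # could equally be SVC, LogisticRegression, ...
clf.fit(X_train, y_train)          # learn from the training set
pred = clf.predict(X_test)         # predict labels for unseen samples
acc = clf.score(X_test, y_test)    # mean accuracy on the test set
```

Because the interface is uniform, swapping one estimator for another usually means changing a single line.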
Training a Classification Model
Recognizing handwritten digits
Think of a picture as a grid of pixels. The figure below shows the image of a handwritten digit; in the computer it is stored as a two-dimensional matrix. In sklearn's digits dataset each element is an integer grayscale value from 0 to 16, where 0 is white, the maximum is black, and values in between are shades of gray.
Load the dataset and use plt to display the digit 0:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
# 导入手写数据集
from sklearn import datasets
digits = datasets.load_digits()
plt.figure()
plt.axis('off')
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
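To see the matrix behind the picture, print the first image directly. Each image in the dataset is an 8×8 grid, and `digits.data` holds the same images flattened into 64-element feature vectors:

```python
from sklearn import datasets

digits = datasets.load_digits()
print(digits.images[0].shape)   # (8, 8): each image is an 8x8 pixel grid
print(digits.images[0])         # integer grayscale values from 0 (white) to 16 (darkest)
print(digits.data.shape)        # (1797, 64): each image flattened into 64 features
```

The flattened `digits.data` is what the classifiers below are trained on; the 2D `digits.images` exists only for display.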
Split into training and test sets
# Use 2/3 of the samples for training and 1/3 for testing, shuffling the data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(digits.data,digits.target, test_size = 1/3)
(len(X_train),len(X_test))
OUT:(1198, 599)
Recognizing the digits with a decision tree
from sklearn.tree import DecisionTreeClassifier
treeclf = DecisionTreeClassifier()
treeclf.fit(X_train,y_train)
# print(treeclf.predict(X_test))
print(treeclf.predict(X_test) - y_test)  # zero wherever the prediction matches the true label
There are still quite a few nonzero entries, i.e. misclassified samples.
Check the model's score on the test set
# score() is defined by each estimator; its meaning differs for classification, regression, and clustering
treeclf.score(X_test,y_test)
OUT:
0.8313856427378965
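Beyond a single accuracy number, a confusion matrix shows which digits the tree confuses with which. A sketch using `sklearn.metrics` (exact scores vary between runs, since both `train_test_split` and `DecisionTreeClassifier` are randomized; a fixed `random_state` is used here only for reproducibility):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=1/3, random_state=0)

treeclf = DecisionTreeClassifier(random_state=0)
treeclf.fit(X_train, y_train)
pred = treeclf.predict(X_test)

# Rows are true digits, columns are predicted digits;
# off-diagonal entries are misclassifications
cm = confusion_matrix(y_test, pred)
print(cm)
print(accuracy_score(y_test, pred))  # same value score(X_test, y_test) returns
```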
Recognizing the digits with a random forest
# Random forest is an ensemble algorithm: many decision trees vote on the answer
# This markedly reduces the overfitting of a single decision tree
from sklearn.ensemble import RandomForestClassifier
rfclf = RandomForestClassifier()
rfclf.fit(X_train,y_train)
# print(rfclf.predict(X_test))
print(rfclf.predict(X_test) - y_test)
OUT:
[ 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 -3 0 0 0 0 0 -2 0 0 0 0 0 0 0 0 0 2 0 0 0 0
0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 -2 0 0 0 0 0 0
0 0 0 0 0 -8 -5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-6 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 -7 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 -8 0 0 0 0 0 0 0 0 0 0 0 0 0 -2 0
0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -7
0 0 0 0 0 0 0 0 -7 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -5
0 -5 0 0 -7 0 0 2 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 -7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-2 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 -2 0 0 0 0 -1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 -6 0 0 -1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 -5 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 -6
0 0 0 0 0 0 0 0 0 0 0 -4 0 -4 0 0 0 0 0 0 0 0 0 0 0
0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 -8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -2 0 0
0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
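Rather than scanning the printed difference array by eye, the number of misclassified samples can be counted directly with `np.count_nonzero` (exact counts vary from run to run; `random_state` is fixed here only for reproducibility):

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=1/3, random_state=0)

rfclf = RandomForestClassifier(random_state=0)
rfclf.fit(X_train, y_train)
diff = rfclf.predict(X_test) - y_test

# A nonzero difference means the prediction disagrees with the true label
n_errors = np.count_nonzero(diff)
print(n_errors, "errors out of", len(y_test))
```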
Check the model's score on the test set
rfclf.score(X_test,y_test)
OUT:
0.92988313856427374
As expected, better than the single decision tree.
Support vector machine (SVM) with a linear kernel
from sklearn import svm
svmclf = svm.SVC(kernel='linear')
svmclf.fit(X_train,y_train)
print(svmclf.predict(X_test))
print(svmclf.predict(X_test) - y_test)
svmclf.score(X_test,y_test)
OUT:
0.97829716193656091
The SVM beats even the random forest.
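A single train/test split is noisy, so a safer way to rank the three classifiers is 5-fold cross-validation with `cross_val_score` (a sketch; absolute scores depend on your scikit-learn version, but the ordering is typically stable):

```python
from sklearn import datasets, svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()
models = {
    'tree': DecisionTreeClassifier(random_state=0),
    'forest': RandomForestClassifier(random_state=0),
    'svm': svm.SVC(kernel='linear'),
}
scores = {}
for name, model in models.items():
    # cross_val_score refits the model on each of the 5 folds
    scores[name] = cross_val_score(model, digits.data, digits.target, cv=5).mean()
    print(name, scores[name])
```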
Training a Regression Model
We train regression models on the Boston housing dataset. (Note: load_boston was deprecated and removed in scikit-learn 1.2; with recent versions, substitute another regression dataset.)
Load the Boston housing dataset
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()
Split the data
# Use 2/3 of the samples for training and 1/3 for testing, shuffling the data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(boston.data,boston.target, test_size = 1/3,random_state=0)
(len(X_train),len(X_test))
OUT:
(337, 169)
Data preprocessing
Apply MinMax scaling to the features
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler(feature_range = (0,1))
scaler.fit(X_train)
X_train,X_test = scaler.transform(X_train),scaler.transform(X_test)
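MinMaxScaler applies x' = (x - min) / (max - min) column by column, using the min and max learned from the training set only, which is why fit is called on X_train alone. A small sketch verifying this on synthetic data:

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))
X_test = rng.normal(size=(10, 3))

scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
scaler.fit(X_train)  # learns per-column min and max from the training set only

# The manual formula matches scaler.transform on the training data
manual = (X_train - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
assert np.allclose(scaler.transform(X_train), manual)

# Test data is scaled with the *training* min/max, so it can fall outside [0, 1]
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled.min(), X_test_scaled.max())
```

Fitting the scaler on the training set only prevents information from the test set leaking into preprocessing.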
Fit with a decision tree
from sklearn import tree
treereg = tree.DecisionTreeRegressor()
treereg.fit(X_train,y_train)
print(treereg.predict(X_test))
print(np.abs(treereg.predict(X_test)/y_test-1))  # relative error of each prediction
treereg.score(X_test,y_test)
OUT:
0.69370259503061049
The decision tree doesn't do very well here.
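For regressors, score() returns the coefficient of determination R² = 1 - SS_res / SS_tot, not an accuracy: 1.0 is a perfect fit, 0.0 means no better than predicting the mean, and negative values are possible. A worked check against `sklearn.metrics.r2_score`, on made-up numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions, purely for illustration
y_true = np.array([3.0, 2.5, 4.0, 5.5])
y_pred = np.array([2.8, 2.7, 4.2, 5.0])

ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot
print(r2_manual)
```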
Fit with a random forest
from sklearn.ensemble import RandomForestRegressor
rfreg = RandomForestRegressor()
rfreg.fit(X_train,y_train)
print(rfreg.predict(X_test))
print(np.abs(rfreg.predict(X_test)/y_test-1))
rfreg.score(X_test,y_test)
OUT:
0.79064875519389821
The random forest again beats the decision tree.
Trying a few more models
LassoCV regression
from sklearn.linear_model import LassoCV
lareg = LassoCV()
lareg.fit(X_train,y_train)
lareg.predict(X_test)
lareg.score(X_test,y_test)
OUT:
0.62673225346320816
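LassoCV selects the regularization strength alpha by internal cross-validation; the chosen value is exposed as the `alpha_` attribute, and the fitted coefficients as `coef_`. A sketch on synthetic data from `make_regression` (standing in for the Boston set, which recent scikit-learn releases no longer ship):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic regression data with 13 features, mirroring Boston's feature count
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)

lareg = LassoCV(cv=5)
lareg.fit(X, y)
print(lareg.alpha_)   # regularization strength chosen by cross-validation
print(lareg.coef_)    # Lasso drives some coefficients toward exactly zero
```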
ElasticNetCV regression
from sklearn.linear_model import ElasticNetCV
netreg = ElasticNetCV()
netreg.fit(X_train,y_train)
netreg.predict(X_test)
netreg.score(X_test,y_test)
OUT:
0.60253841643302164
Trying RidgeCV
from sklearn.linear_model import RidgeCV
rgreg = RidgeCV()
rgreg.fit(X_train,y_train)
rgreg.predict(X_test)
rgreg.score(X_test,y_test)
OUT:
0.66898023014959174
Trying SVR
from sklearn.svm import SVR
svr = SVR(kernel = 'linear',C = 1000)
svr.fit(X_train,y_train)
svr.predict(X_test)
svr.score(X_test,y_test)
OUT:
0.63410277753801492
None of these do much better; the random forest remains the best performer here.
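The model-by-model comparison above can be condensed into one loop. A sketch on synthetic `make_regression` data (since load_boston is gone from recent scikit-learn releases; on this artificially linear data the linear models will tend to win, so the ranking need not match the Boston results above):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV, ElasticNetCV, RidgeCV
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=13, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# Same preprocessing as above: scale with statistics from the training set only
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

regressors = {
    'tree': DecisionTreeRegressor(random_state=0),
    'forest': RandomForestRegressor(random_state=0),
    'lasso': LassoCV(),
    'elasticnet': ElasticNetCV(),
    'ridge': RidgeCV(),
    'svr': SVR(kernel='linear', C=1000),
}
results = {}
for name, reg in regressors.items():
    reg.fit(X_train, y_train)
    results[name] = reg.score(X_test, y_test)  # R^2 on the held-out data
    print(name, round(results[name], 4))
```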