I. Other Distance Formulas
1. Standardized Euclidean Distance
2. Cosine Distance
3. Hamming Distance [for reference]
4. Jaccard Distance [for reference]
5. Mahalanobis Distance [for reference]
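The five distances listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the notes' own implementation: the function names, the toy data, and the choice of the sample standard deviation and sample covariance are all assumptions made for the example.

```python
import numpy as np

def standardized_euclidean(a, b, std):
    # Euclidean distance after scaling each feature difference by that feature's standard deviation
    return float(np.sqrt(np.sum(((a - b) / std) ** 2)))

def cosine_distance(a, b):
    # 1 - cosine similarity; 0 means the two vectors point in the same direction
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hamming_distance(a, b):
    # Number of positions at which two equal-length sequences differ
    return int(np.sum(np.asarray(a) != np.asarray(b)))

def jaccard_distance(a, b):
    # 1 - |A ∩ B| / |A ∪ B| for two sets
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def mahalanobis_distance(a, b, cov):
    # sqrt((a - b)^T S^-1 (a - b)), where S is the covariance matrix of the data
    diff = a - b
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Hypothetical toy data: four samples with two features each
data = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 4.0], [7.0, 8.0]])
std = data.std(axis=0, ddof=1)      # per-feature sample standard deviation
cov = np.cov(data, rowvar=False)    # sample covariance matrix
a, b = data[0], data[1]

print("standardized Euclidean:", standardized_euclidean(a, b, std))
print("cosine:", cosine_distance(a, b))
print("Hamming:", hamming_distance(list("1011101"), list("1001001")))
print("Jaccard:", jaccard_distance({1, 2, 3}, {2, 3, 4}))
print("Mahalanobis:", mahalanobis_distance(a, b, cov))
```

With an identity covariance matrix the Mahalanobis distance reduces to the ordinary Euclidean distance, which is a quick sanity check for the last function.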
II. Data Splitting Revisited
1. Hold-out method
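A hold-out split can be sketched with scikit-learn's `train_test_split`. The data below (`X_demo`, `y_demo`) and the 80/20 ratio are assumptions chosen for illustration; `stratify` keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, two balanced classes
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 10 + [1] * 10)

# Hold out 20% of the samples for testing, preserving the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

print("train shape:", X_train.shape, "| test shape:", X_test.shape)
print("train class counts:", np.bincount(y_train))
print("test class counts:", np.bincount(y_test))
```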
2. Cross-validation
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.array([
    [1, 2, 3, 4],
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
    [51, 52, 53, 54],
    [61, 62, 63, 64],
    [71, 72, 73, 74]
])
y = np.array([1, 1, 0, 0, 1, 1, 0, 0])

# random_state only takes effect when shuffle=True; recent scikit-learn
# versions raise a ValueError if it is set together with shuffle=False.
folder = KFold(n_splits=4, shuffle=False)
sfolder = StratifiedKFold(n_splits=4, shuffle=False)

# KFold splits by position and ignores the labels
for train, test in folder.split(X, y):
    print('train:%s | test:%s' % (train, test))
    print("")

# StratifiedKFold keeps the class ratio of y in every fold
for train, test in sfolder.split(X, y):
    print('train:%s | test:%s' % (train, test))
    print("")
3. Bootstrap method
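Bootstrap sampling can be sketched as drawing n indices with replacement from n samples and using the samples that were never drawn (the out-of-bag samples) as the test set. The sample size and the variable names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000                      # hypothetical dataset size
indices = np.arange(n_samples)

# Training set: n_samples draws WITH replacement, so duplicates are expected
train_idx = rng.choice(indices, size=n_samples, replace=True)

# Test set: the out-of-bag samples that were never drawn
test_idx = np.setdiff1d(indices, train_idx)

oob_fraction = len(test_idx) / n_samples
print("out-of-bag fraction:", oob_fraction)
print("theoretical limit (1 - 1/n)^n:", (1 - 1 / n_samples) ** n_samples)
```

For large n the out-of-bag fraction approaches 1/e, roughly 36.8%, so about a third of the data is left for testing without shrinking the training set.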
4. Summary
III. Another Derivation of the Normal Equation
1. Representing the loss
2. An alternative derivation
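Whatever route the derivation takes, the result can be checked numerically. The sketch below assumes the squared-error loss J(w) = ||Xw - y||^2, whose gradient 2·X^T(Xw - y) vanishes at w = (X^T X)^(-1) X^T y. The synthetic data and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 4 + 2*x1 - 1*x2 + 0.5*x3 + small noise
X_raw = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = X_raw @ true_w + 4.0 + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so the intercept is part of w
X_b = np.hstack([np.ones((100, 1)), X_raw])

# Normal equation: solve (X^T X) w = X^T y instead of inverting explicitly
w_normal = np.linalg.solve(X_b.T @ X_b, X_b.T @ y_demo)

# Reference solution from NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X_b, y_demo, rcond=None)

# The gradient of the loss should be (numerically) zero at the solution
grad = 2 * X_b.T @ (X_b @ w_normal - y_demo)

print("normal equation:", w_normal)
print("lstsq:", w_lstsq)
print("max |gradient|:", np.abs(grad).max())
```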
IV. Comparing Gradient Descent Algorithms and Further Optimization
1. Algorithm comparison
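One way to compare the variants is to note that full-batch, mini-batch, and stochastic gradient descent differ only in how many samples feed each update. The sketch below is an illustration on synthetic linear-regression data; the helper `gradient_descent`, the learning rate, and the batch sizes are assumptions, not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 1 + 3*x + small noise
X_raw = rng.normal(size=(200, 1))
y_demo = 3.0 * X_raw[:, 0] + 1.0 + rng.normal(scale=0.1, size=200)
X_b = np.hstack([np.ones((200, 1)), X_raw])  # column of ones for the intercept

def gradient_descent(X, y, batch_size, lr=0.05, epochs=100, seed=0):
    """batch_size == len(X): full-batch; 1: stochastic; in between: mini-batch."""
    local_rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        order = local_rng.permutation(n)          # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Gradient of the mean squared error on the current batch
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

results = {}
for name, batch_size in [("full-batch", 200), ("mini-batch", 20), ("stochastic", 1)]:
    results[name] = gradient_descent(X_b, y_demo, batch_size)
    print(name, results[name])
```

All three settle near the same weights here; the smaller the batch, the cheaper each update and the noisier the path.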
2. Gradient descent optimization algorithms
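As one common example of such an optimization, the sketch below adds a momentum term to plain gradient descent. Which optimizers the notes actually cover is not shown here, so the choice of momentum, the helper `gd`, the test function, and the hyperparameters are all assumptions made for illustration.

```python
import numpy as np

def gd(grad_fn, w0, lr=0.01, steps=100, momentum=0.0):
    """Gradient descent with an optional momentum (heavy-ball) term."""
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        # The velocity accumulates a decaying sum of past gradients
        v = momentum * v - lr * grad_fn(w)
        w = w + v
    return w

# Hypothetical ill-conditioned quadratic: f(w) = 0.5 * (w0^2 + 50 * w1^2), minimum at the origin
scales = np.array([1.0, 50.0])

def grad_fn(w):
    return scales * w

def loss(w):
    return 0.5 * np.sum(scales * w ** 2)

w_plain = gd(grad_fn, [5.0, 5.0], momentum=0.0)
w_momentum = gd(grad_fn, [5.0, 5.0], momentum=0.9)

print("plain GD loss:", loss(w_plain))
print("momentum loss:", loss(w_momentum))
```

On this function the learning rate is limited by the steep direction, so plain gradient descent crawls along the flat one; momentum speeds up progress along that flat direction.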
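V. Polynomial Regression
1. General form of polynomial regression
2. Implementing polynomial regression
Fitting a straight line
Using a polynomial equation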
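The two steps above can be sketched as follows: fit a straight line first, then expand the feature with `PolynomialFeatures` and fit again. The quadratic synthetic data (`X_demo`, `y_demo`) is an assumption made for the example and is not the notes' dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Hypothetical data: y = 0.5*x^2 + x + 2 + noise
X_demo = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y_demo = 0.5 * X_demo[:, 0] ** 2 + X_demo[:, 0] + 2 + rng.normal(scale=0.5, size=100)

# Step 1: straight-line fit on the raw feature
line = LinearRegression().fit(X_demo, y_demo)
mse_line = mean_squared_error(y_demo, line.predict(X_demo))

# Step 2: add x^2 as a second feature, then fit the same linear model
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_demo)
curve = LinearRegression().fit(X_poly, y_demo)
mse_poly = mean_squared_error(y_demo, curve.predict(X_poly))

print("line MSE:", mse_line)
print("degree-2 MSE:", mse_poly)
print("degree-2 coefficients:", curve.coef_, "intercept:", curve.intercept_)
```

Polynomial regression is still a linear model: only the features change, so the same `LinearRegression` estimator is reused.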
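3. Continually lowering the training error, and overfitting
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Define the model-training function
# (X and y are assumed to be the training data prepared earlier in the notes)
def try_degree(degree, X, y):
    poly_features_d = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly_d = poly_features_d.fit_transform(X)
    lin_reg_d = LinearRegression()
    lin_reg_d.fit(X_poly_d, y)
    return {'X_poly': X_poly_d, 'intercept': lin_reg_d.intercept_, 'coef': lin_reg_d.coef_}

degree2loss_paras = []
for i in range(2, 20):
    paras = try_degree(i, X, y)
    # Compute the predictions by hand: h = X_poly · coef^T + intercept
    h = np.dot(paras['X_poly'], paras['coef'].T) + paras['intercept']
    # mean_squared_error expects (y_true, y_pred)
    _loss = mean_squared_error(y, h)
    degree2loss_paras.append({'d': i, 'loss': _loss, 'coef': paras['coef'], 'intercept': paras['intercept']})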
Inspect the parameters of the model with the smallest loss:
min_index = np.argmin(np.array([i['loss'] for i in degree2loss_paras]))
min_loss_para = degree2loss_paras[min_index]
print(min_loss_para)
# Output
{'d': 12,
'loss': 3.8764202841976227e-23,
'coef': array([[ 1.17159189, 8.60674192, -4.91798703, -4.18378115, 3.79426131, -8.56026107, -6.94465715, 5.03891035, 4.08870088, -0.30369348, -0.6635493 , -0.11314395]]),
'intercept': array([1.63695924])}
VI. The Curse of Dimensionality
1. What is the curse of dimensionality
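One symptom of the curse of dimensionality is that distances concentrate: as the number of dimensions grows, the nearest and farthest neighbours of a point end up at almost the same distance. The sketch below is an illustration on uniformly random points; the helper `distance_contrast`, the point count, and the chosen dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n_points=500):
    """Relative gap between the farthest and nearest point from a random query."""
    points = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast = {dim: distance_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, value in contrast.items():
    print("dim=%d  contrast=%.3f" % (dim, value))
```

A shrinking contrast means distance-based methods such as k-nearest neighbours have less and less information to separate near from far.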
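2. The curse of dimensionality and overfitting
VII. Handling Class Imbalance in Classification
1. Introduction to class-imbalanced datasets
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

# Generate sample data with make_classification
X, y = make_classification(n_samples=5000,
                           n_features=2,  # number of features = n_informative + n_redundant + n_repeated
                           n_informative=2,  # number of informative features
                           n_redundant=0,  # redundant features: random linear combinations of the informative features
                           n_repeated=0,  # repeated features: drawn at random from the informative and redundant features
                           n_classes=3,  # number of classes
                           n_clusters_per_class=1,  # number of clusters that make up each class
                           weights=[0.01, 0.05, 0.94],  # list: proportion of samples assigned to each class
                           random_state=0)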
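To see how imbalanced the generated data is, the labels can be counted per class. The sketch below regenerates the data with the same parameters under the assumed names `X_demo` and `y_demo`, so that it runs on its own. The counts will not match the requested weights exactly, because `make_classification` randomly reassigns a small fraction of labels by default (`flip_y=0.01`).

```python
from collections import Counter

from sklearn.datasets import make_classification

# Same parameters as the snippet above, regenerated so this block is self-contained
X_demo, y_demo = make_classification(n_samples=5000,
                                     n_features=2,
                                     n_informative=2,
                                     n_redundant=0,
                                     n_repeated=0,
                                     n_classes=3,
                                     n_clusters_per_class=1,
                                     weights=[0.01, 0.05, 0.94],
                                     random_state=0)

# Count how many samples fall into each class
counts = Counter(y_demo.tolist())
for label in sorted(counts):
    share = counts[label] / len(y_demo)
    print("class %d: %d samples (%.1f%%)" % (label, counts[label], 100 * share))
```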