Linear Regression Prediction and Regression Analysis Algorithms in Data Mining

Reposted

jowvid 2023-08-07 01:11:55


1. Linear regression

Simple (one-variable) linear regression analysis
Multiple linear regression analysis

Linear regression model

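The objective (loss) function here is derived from maximum likelihood estimation: if the errors are assumed to follow a Gaussian distribution, maximizing the likelihood is equivalent to minimizing the sum of squared errors.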
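The maximum-likelihood argument can be sketched as follows. The notation is an assumption, since the original formulas were not preserved: $x^{(i)}$ is the feature vector of sample $i$, $y^{(i)}$ its target, $\theta$ the parameter vector, $m$ the number of samples, and the errors are taken to be independent Gaussians with variance $\sigma^2$.

```latex
\begin{aligned}
y^{(i)} &= \theta^{\top} x^{(i)} + \varepsilon^{(i)}, \qquad \varepsilon^{(i)} \sim \mathcal{N}(0, \sigma^{2}) \\
\log L(\theta) &= \sum_{i=1}^{m} \log \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\bigl(y^{(i)} - \theta^{\top} x^{(i)}\bigr)^{2}}{2\sigma^{2}} \right) \right] \\
&= m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^{2}} \sum_{i=1}^{m} \bigl(y^{(i)} - \theta^{\top} x^{(i)}\bigr)^{2} \\
\hat{\theta} &= \arg\max_{\theta} \log L(\theta) = \arg\min_{\theta} \; \frac{1}{2} \sum_{i=1}^{m} \bigl(y^{(i)} - \theta^{\top} x^{(i)}\bigr)^{2}
\end{aligned}
```

The first term of the log-likelihood does not depend on $\theta$, so only the squared-error sum matters when optimizing.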

2. Ridge regression

Ridge regression is a variant of linear regression that adds an L2 penalty on the coefficients to the least-squares loss.
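To make the effect of the penalty concrete, here is a minimal sketch that assumes the usual closed-form ridge solution without an intercept. The helper name `ridge_fit` and the synthetic data are illustrative and do not come from the original post.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge solution (X^T X + alpha * I)^(-1) X^T y, no intercept."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

w_ols = ridge_fit(X, y, alpha=0.0)     # alpha = 0 reduces to ordinary least squares
w_ridge = ridge_fit(X, y, alpha=10.0)  # a larger alpha shrinks the coefficients

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

The coefficient vector fitted with the penalty has a smaller norm than the unpenalized one, which is the shrinkage that ridge regression is meant to provide.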

3. Lasso regression

Lasso regression is another refinement of linear regression: it adds an L1 penalty on the coefficients, which helps prevent overfitting.
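For reference, the ridge and Lasso objectives are commonly written as below. The notation ($\theta$ for the parameters, $\lambda \ge 0$ for the penalty strength, $m$ samples, $n$ features) is assumed here because the original formulas were not preserved, and libraries may scale the two terms differently.

```latex
\begin{aligned}
J_{\text{ridge}}(\theta) &= \frac{1}{2} \sum_{i=1}^{m} \bigl(y^{(i)} - \theta^{\top} x^{(i)}\bigr)^{2} + \lambda \sum_{j=1}^{n} \theta_{j}^{2} \\
J_{\text{lasso}}(\theta) &= \frac{1}{2} \sum_{i=1}^{m} \bigl(y^{(i)} - \theta^{\top} x^{(i)}\bigr)^{2} + \lambda \sum_{j=1}^{n} \lvert \theta_{j} \rvert
\end{aligned}
```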

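4. Polynomial regression

The loss function of a polynomial model is the same as that of multiple linear regression: the least-squares error. Finding the best model again means finding the parameters that minimize the loss, which can still be done with gradient descent.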
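Because the loss is unchanged, polynomial regression can be treated as linear regression on expanded features. The sketch below assumes scikit-learn's `PolynomialFeatures` and fits with `LinearRegression` (a direct least-squares solver rather than gradient descent) on synthetic data made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60).reshape(-1, 1)
# Synthetic quadratic relationship with a little noise.
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.2, size=60)

# Plain linear fit versus a degree-2 polynomial fit on the same data.
linear = LinearRegression().fit(x, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

r2_linear = linear.score(x, y)
r2_poly = poly.score(x, y)
print(r2_linear, r2_poly)
```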

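5. Gradient descent

Batch gradient descent: every update uses the gradient computed over the whole training set.

Stochastic gradient descent: every update uses the gradient of a single sample.

Mini-batch gradient descent: every update uses the gradient of a small batch of samples.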
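The three variants differ only in how many samples feed each update. The following is a minimal NumPy sketch for a linear model with a half mean-squared-error loss; the function name `gradient_descent`, the learning rates, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np


def gradient_descent(X, y, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y ~ X @ w by minimizing the half mean squared error.

    batch_size == len(X) -> batch gradient descent
    batch_size == 1      -> stochastic gradient descent
    otherwise            -> mini-batch gradient descent
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle the samples every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = X[idx].T @ error / len(idx)  # gradient on this batch
            w -= lr * grad
    return w


rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])  # first column is the bias
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w_batch = gradient_descent(X, y, batch_size=len(X))
w_sgd = gradient_descent(X, y, batch_size=1, lr=0.01, epochs=50)
w_mini = gradient_descent(X, y, batch_size=32)
print(w_batch, w_sgd, w_mini)
```

All three runs should land close to `true_w`; the stochastic variant takes many noisy steps, so a smaller learning rate is used for it here.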

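6. Regularization

Of the L1 and L2 penalties, the L1 norm tends to produce sparse solutions, that is, many coefficients that are exactly zero.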
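A small experiment can illustrate the sparsity claim. This sketch assumes scikit-learn's `Lasso` and `Ridge` with an arbitrarily chosen `alpha=0.5`, on synthetic data where only two of ten features matter.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features actually influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("zero coefficients -> lasso:", n_zero_lasso, "ridge:", n_zero_ridge)
```

The L1 penalty sets most of the irrelevant coefficients to exactly zero, while the L2 penalty only shrinks them toward zero.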

7. Evaluation metrics
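The original list of metrics was not preserved, so the sketch below only assumes the metrics most commonly used for regression (MSE, RMSE, MAE, and R^2), computed with scikit-learn and from their definitions on made-up numbers.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

mse = mean_squared_error(y_true, y_pred)   # mean squared error
rmse = np.sqrt(mse)                        # root mean squared error
mae = mean_absolute_error(y_true, y_pred)  # mean absolute error
r2 = r2_score(y_true, y_pred)              # coefficient of determination

# The same quantities written out from their definitions.
residuals = y_true - y_pred
mse_manual = np.mean(residuals ** 2)
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(mse, rmse, mae, r2)
```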

8. Regression algorithms in practice

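Experiment overview

This experiment uses a Lasso regression model to predict car prices. Compared with ridge regression, Lasso is more likely to produce features with a weight of exactly 0, which suits the car price task: only a few key factors drive a car's price, so many features in the dataset can be left out.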

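Dataset

The task is car price prediction: predicting the price of a car from its attributes. The dataset contains three groups of fields:

General characteristics of the car.

symboling: insurance risk rating (-3, -2, -1, 0, 1, 2, 3).

normalized-losses: relative average loss payment per insured vehicle year.

Categorical attributes

make: manufacturer (Audi, BMW, ...)

fuel-type: diesel or gas (gasoline)

aspiration: whether the engine is turbocharged

num-of-doors: two or four doors

body-style: hardtop, sedan, hatchback, convertible, etc.

drive-wheels: drive wheels

engine-location: engine location

engine-type: engine type

num-of-cylinders: number of cylinders

fuel-system: fuel system

Continuous attributes

bore: continuous from 2.54 to 3.94.

stroke: continuous from 2.07 to 4.17.

compression-ratio: continuous from 7 to 23.

horsepower: continuous from 48 to 288.

peak-rpm: continuous from 4150 to 6600.

city-mpg: continuous from 13 to 49.

highway-mpg: continuous from 16 to 54.

price: price, from 5118 to 45400.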

Task 1: Import packages

Import the packages used in the rest of the experiment.

Input:

# Core packages
import numpy as np
import pandas as pd

# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno  # missing-data visualization toolkit


# Statistics utilities
from statsmodels.distributions.empirical_distribution import ECDF
from sklearn.metrics import mean_squared_error, r2_score

# Machine learning utilities
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression, Lasso, LassoCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Fix the random seed so the random numbers are the same on every run
seed = 100

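Task 2: Load the data

Load the dataset from a local file with pandas; replace the path with the actual location of the dataset.

Input:

csv_dir = '/data/dm/Auto-Data.csv'  # replace with the actual path
## Inspecting the CSV shows that missing values are written as '?'
## so na_values must be specified when reading with pandas; otherwise the missing-value visualization will not display correctly
data = pd.read_csv(csv_dir, na_values='?', engine='python')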

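Task 3: Explore the data

Understand the data types and the basic shape of the data.
Data quality check: mainly check the data for errors such as misspelled category values (for example, in a gender field, "female" misspelled as "fmale").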

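Step 1: Data overview

# Check the data types to see which columns are categorical and which are numeric;
# this is the basis for later type conversion
data.dtypes

Output: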
symboling              int64
normalized-losses    float64
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
dtype: object

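# Basic information about the data
## Returns the number of rows, the number of feature columns, and the dtype and non-null count of every column
data.info()

# Check the size of the data and preview the first 5 rows
print(data.shape)   # (205, 26)
data.head(5)

# List the feature columns
print(data.columns)

Output: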

Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')

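# Descriptive statistics for the numeric columns; returns a DataFrame
## DataFrame.describe(percentiles=None, include=None, exclude=None)
## Parameters:
## percentiles: optional list of numbers between 0 and 1 giving the percentiles to include in the output. Defaults to [.25, .5, .75], i.e. the 25th, 50th and 75th percentiles.
## include: whitelist of data types to include in the result.
###        'all': every input column is included in the output.
###        A list-like of dtypes: limits the result to those data types. Use numpy.number for numeric columns and object for object (string) columns. Strings in the style of select_dtypes also work (e.g. df.describe(include=['O'])).
###        Default (None): the result includes all numeric columns.
data_desc = data.describe()
print(data_desc)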

Step 2: Check the data

# All categorical features
classes = ['make', 'fuel-type', 'aspiration', 'num-of-doors', 
           'body-style', 'drive-wheels', 'engine-location',
           'engine-type', 'num-of-cylinders', 'fuel-system']

# For each categorical feature, use .unique() to list its distinct values
for each in classes:
    print(each + ':\n')
    print(data[each].unique())
    print('\n')

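Task 4: Data preprocessing

Data preprocessing is a critical step: clean, sensible data is key to a successful model. It mainly consists of the following stages:

Missing-value handling: handle missing values in numeric and categorical features.
Outlier handling: detect and handle outliers in numeric features.
Feature rework: rework numeric features, for example by removing highly correlated features.
Feature encoding: encode categorical features so the regression model can process them.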

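Step 1: Missing-value analysis and handling

Inspecting missing values: examine how much data is missing, either with the visualization tools provided by missingno or by counting missing values and their share of each column.

Handling missing values: 1. when only a few values are missing, the affected rows can simply be dropped; 2. when many values are missing, they can be filled with the mean or mode of the existing values; 3. a regression model fitted on the known values can be used to predict the missing ones.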
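The first two strategies can be sketched on a toy table. The DataFrame `toy` and its values are invented for illustration, and the third strategy (predicting missing values with a regression model) is not shown.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "doors": ["two", "four", None, "four", "four"],
    "bore": [3.1, np.nan, 3.5, 3.3, np.nan],
})

# 1. Drop rows that have a missing value in a given column.
dropped = toy.dropna(subset=["doors"])

# 2. Fill a numeric column with its mean and a categorical column with its mode.
filled = toy.copy()
filled["bore"] = filled["bore"].fillna(filled["bore"].mean())
filled["doors"] = filled["doors"].fillna(filled["doors"].mode()[0])

print(dropped)
print(filled)
```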

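Inspecting missing values

# Visualize missing values
# seaborn ships with five predefined themes for different use cases; options for the sns.set style parameter:
## darkgrid    dark background with grid (default)
## whitegrid   white background with grid
## dark        dark background
## white       white background
## ticks       white background with axis ticks
sns.set(style='ticks')  # set the seaborn style
msno.matrix(data)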

Output: (figure: missing-value matrix produced by msno.matrix)

# Missing-value statistics

# The plot above shows that only the normalized-losses column has many missing values; the other columns have very few
# Check exactly how much is missing
null_cols = ['normalized-losses', 'num-of-doors', 'bore', 'stroke', 'horsepower', 'peak-rpm', 'price']
total_rows = data.shape[0]
for each_col in null_cols:
    # Count nulls with .isnull().sum() and divide by the row count to get the missing ratio
    # print('{}:{}'.format(each_col,data[each_col].isnull().sum() / total_rows))
    print('{}:{}'.format(each_col, pd.isnull(data[each_col]).sum() / total_rows))

Output:

normalized-losses:0.2
num-of-doors:0.00975609756097561
bore:0.01951219512195122
stroke:0.01951219512195122
horsepower:0.00975609756097561
peak-rpm:0.00975609756097561
price:0.01951219512195122

# Handling missing values in normalized-losses

# Inspect the distribution of normalized-losses
sns.set(style='darkgrid')
plt.figure(figsize=(12,5))
plt.subplot(121)

# Empirical cumulative distribution curve (missing values are dropped first)
ecdf = ECDF(data['normalized-losses'].dropna())
# ECDF prepends -inf to x, so skip the first point when plotting
cdf = pd.DataFrame({'x': ecdf.x[1:], 'y': ecdf.y[1:]})
sns.lineplot(x="x", y="y", data=cdf)

Output: (figure: empirical cumulative distribution of normalized-losses)

plt.subplot(122)  # histogram
x = data['normalized-losses'].dropna()
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws the histogram plus a KDE curve
sns.histplot(x, kde=True, stat="density", element="step", fill=False, color="g")

Output: (figure: histogram and KDE curve of normalized-losses)

# Distribution of normalized-losses for each symboling value; symboling is the insurance risk rating (-3, -2, -1, 0, 1, 2, 3).
data.groupby('symboling')['normalized-losses'].describe()

Output: (table: descriptive statistics of normalized-losses per symboling value)

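# The other columns have few missing values, so drop those rows directly
sub_set = ['num-of-doors', 'bore', 'stroke', 'horsepower', 'peak-rpm', 'price']
## dropna removes the rows with missing values in these columns
## reset_index rebuilds the index; drop=True discards the old index
data = data.dropna(subset=sub_set).reset_index(drop=True)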

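# Fill with the group mean
## groupby: split the data into groups
### After groupby we usually summarize the data with aggregate, filter or apply
### An aggregation returns a reduced version of the data, whereas a transformation returns a transformed version of the full data that can be recombined.
### The output of a transformation has the same shape as its input. A common example is centering the data by subtracting the group mean.
## fillna: fills missing values
data['normalized-losses'] = data.groupby('symboling')['normalized-losses'].transform(lambda x: x.fillna(x.mean()))
print(data.shape) #(193, 26)
data.head()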
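The difference between aggregating and transforming is easier to see on a toy table. The values below are invented; the column names merely mirror the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "symboling": [0, 0, 0, 1, 1, 1],
    "normalized-losses": [100.0, np.nan, 120.0, 150.0, 170.0, np.nan],
})

# Aggregation returns one value per group ...
group_means = toy.groupby("symboling")["normalized-losses"].mean()

# ... while transform returns one value per original row, so it can be assigned back.
toy["filled"] = toy.groupby("symboling")["normalized-losses"].transform(
    lambda s: s.fillna(s.mean())
)
print(group_means)
print(toy)
```

Each missing value is replaced by the mean of its own `symboling` group rather than by the global mean.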

Output: (table: first five rows of the data after filling)

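Step 2: Outlier analysis and handling

Outlier detection: common approaches include statistics-based methods, clustering-based methods, and dedicated outlier-detection algorithms. Statistics-based methods are used most often:

Normal-distribution method: the data must be (approximately) normally distributed. Under the 3σ rule, a value more than 3 standard deviations from the mean is treated as an outlier.
Interquartile-range method: uses the interquartile range (IQR) of a box plot. The IQR is the difference between the upper quartile (Q3) and the lower quartile (Q1). With 1.5 × IQR as the threshold, points above Q3 + 1.5 × IQR or below Q1 - 1.5 × IQR are outliers; points above Q3 + 3 × IQR or below Q1 - 3 × IQR are extreme outliers. Some plotting tools draw the two kinds with different markers, such as '*' and 'O'.

Outlier handling: detected outliers are usually removed.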
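The interquartile-range rule can be sketched on a toy series. The helper name `iqr_bounds` and the numbers are illustrative and not part of the original experiment.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


s = pd.Series([10, 12, 11, 13, 12, 11, 14, 50])

lower, upper = iqr_bounds(s)                # 1.5 * IQR fences for ordinary outliers
ext_lower, ext_upper = iqr_bounds(s, k=3)   # 3 * IQR fences for extreme outliers

outliers = s[(s < lower) | (s > upper)]
extreme = s[(s < ext_lower) | (s > ext_upper)]
print(lower, upper, outliers.tolist(), extreme.tolist())
```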

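# Inspecting outliers

# All numeric feature columns
num = ['symboling', 'normalized-losses', 'length', 'width', 'height', 'horsepower', 'wheel-base',
       'bore', 'stroke','compression-ratio', 'peak-rpm','engine-size','highway-mpg']

# All box plots could be drawn at once, but the features are on different scales, so draw them separately.
# Plotting with sns would require handling missing values, so use the DataFrame plotting API directly.
for each in num:
    plt.figure()
    x = data[each]
    x.plot.box()
# Outlying points are directly visible in a box plot and should generally be removed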
# Outlier handling
data_outliers = data.copy()
for each in num:
    q1 = data_outliers[each].quantile(0.25)
    q3 = data_outliers[each].quantile(0.75)
    iqr = q3 - q1
    # Lower fence
    lower = q1 - 1.5 * iqr
    # Upper fence (measured from the upper quartile q3, not from q1)
    upper = q3 + 1.5 * iqr

    # Flag the rows that fall outside the fences
    is_outlier = (data_outliers[each] < lower) | (data_outliers[each] > upper)

    # Filter out the outliers
    data_outliers = data_outliers[~is_outlier]
    plt.figure()
    data_outliers[each].plot.box()

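Step 3: Correlation analysis and handling

For a model, more features are not always better; what matters is a compact feature set that carries as much information as possible. When some features are very strongly linearly correlated, it is enough to keep one of them to reduce redundancy.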

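# Correlation computation

# corr() computes pairwise correlations and returns a DataFrame that can be used directly
### Correlation coefficients lie in [-1, 1]. A value close to 1 indicates a strong positive correlation,
### a value close to -1 indicates a strong negative correlation,
### and a value close to 0 indicates little linear correlation.
# numeric_only=True skips the categorical (object) columns, which newer pandas versions would otherwise reject
cor_matrix = data_outliers.corr(numeric_only=True)
cor_matrix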

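# Visualize the correlations
# Build a boolean mask and set its upper triangle to True; np.tril_indices_from(mask) would select the lower triangle instead
# The matrix is symmetric, so showing one half removes the redundant cells
mask = np.zeros_like(cor_matrix, dtype=bool)  # np.bool was removed from NumPy; use the built-in bool
mask[np.triu_indices_from(mask)] = True
sns.heatmap(cor_matrix, 
            vmin=-1, vmax=1, 
            square=True, 
            cmap=sns.color_palette("RdBu_r", 100), 
            mask=mask, 
            linewidths=.5);

Output: (figure: correlation heat map)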

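# Inspect strongly correlated features
# Look at the most correlated pairs, analyze the relationship, and process the features accordingly.
## For example: remove one of two highly correlated features
## Or: merge logically related features into one
cor_matrix *= np.tri(*cor_matrix.values.shape, k=-1).T  # zero out the diagonal and lower triangle so each pair is kept once
cor_matrix = cor_matrix.stack()  # stack pivots the columns into an inner index level, giving one row per pair; unstack is the inverse
cor_matrix = cor_matrix.reindex(cor_matrix.abs().sort_values(ascending=False).index).reset_index()
cor_matrix.columns = ["FirstVariable", "SecondVariable", "Correlation"]
cor_matrix.head(10)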
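The same idea, listing each strongly correlated pair once, can be sketched on synthetic data. The column names `a`, `b`, `c` and the 0.9 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),  # nearly a rescaled copy of "a"
    "c": rng.normal(size=200),                     # unrelated to the others
})

corr = toy.corr()
# Keep only the strict upper triangle so each pair appears exactly once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
strong = pairs[pairs.abs() > 0.9]
print(strong)
```

Only the pair (`a`, `b`) passes the threshold, so one of those two columns would be a candidate for removal.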


# Based on the results
## 1. city-mpg and highway-mpg are too highly correlated; keep only one of them
## 2. city-mpg and curb-weight are also too highly correlated; keep only one of them
## 3. length, width and height are merged into a single volume feature (length * width * height)
# Data preprocessing
data2 = data_outliers.copy()
data2['volume'] = data2.length * data2.width * data2.height
# drop removes rows by default; pass axis=1 to remove columns
data2.drop(['width', 'length', 'height', 
           'curb-weight', 'city-mpg'], 
          axis = 1, # 1 for columns
          inplace = True) 
data2.info()

Step 4: Standardizing numeric features
Numeric features need to be standardized to reduce the effect on the model of features measured on different orders of magnitude.

# Target to predict
target = data2['price']
# Feature data
features = data2.drop(columns=['price'])

# Numeric features
num = ['symboling', 'normalized-losses', 'volume', 'horsepower', 'wheel-base',
       'bore', 'stroke','compression-ratio', 'peak-rpm','engine-size','highway-mpg']

# Standardize the numeric features
standard_scaler = StandardScaler()
features[num] = standard_scaler.fit_transform(features[num])
features.head(10)
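What `StandardScaler` does can be checked by hand on a tiny array. The numbers below are made up and loosely mimic two features on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales (think horsepower and peak-rpm).
X = np.array([[48.0, 4150.0],
              [100.0, 5000.0],
              [288.0, 6600.0]])

scaled = StandardScaler().fit_transform(X)

# The same transformation by hand: subtract the column mean, divide by the column standard deviation.
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(scaled)
```

After scaling, every column has mean 0 and standard deviation 1, so no feature dominates the penalty term just because of its units.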

# Draw box plots to inspect the distributions
# Uses the pandas plot.box function
# The data has been standardized, so all features can be shown in a single figure
features.plot.box(title="Auto-Car", vert=False)
plt.xticks(rotation=-20)

Output:

(array([-3., -2., -1., 0., 1., 2., 3.]), <a list of 7 Text xticklabel objects>)

Step 5: Encoding categorical features
Because this is a regression model, the categorical features must be encoded as numbers so the model can process them.

# One-hot encoding of the categorical attributes
## Feature columns to one-hot encode
classes = ['make', 'fuel-type', 'aspiration', 'num-of-doors',
           'body-style', 'drive-wheels', 'engine-location',
           'engine-type', 'num-of-cylinders', 'fuel-system']

## One-hot encode with pandas get_dummies
dummies = pd.get_dummies(features[classes])
print(dummies.columns)

## Feature data after one-hot encoding
features3 = features.join(dummies).drop(classes, axis = 1)
print(features3.columns)
features3.head()
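A toy table shows what the encoding produces. The rows are invented; only the column names echo the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "fuel-type": ["gas", "diesel", "gas"],
    "horsepower": [111, 68, 154],
})

# One indicator column per category value, named <column>_<value>.
dummies = pd.get_dummies(toy[["fuel-type"]])
encoded = toy.join(dummies).drop(columns=["fuel-type"])
print(encoded.columns.tolist())
```

Each category value becomes its own indicator column, and the original text column is dropped.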

Task 5: Modeling
Step 1: Split the dataset
Input:
# Randomly split into training and test sets with sklearn.model_selection.train_test_split
X_train, X_test, y_train, y_test = train_test_split(features3, target, test_size = 0.3, random_state = seed)

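Step 2: Regression model
The Lasso regression model has one hyperparameter to choose, the regularization strength alpha, and a suitable choice is an important factor in obtaining a good model. There are many ways to select hyperparameters; common ones are grid search and the cross-validation built into the scikit-learn machine learning toolkit.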
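The grid-search alternative mentioned above could look like the following sketch, which assumes scikit-learn's `GridSearchCV` and uses synthetic data with an arbitrary alpha grid rather than the car dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
# Only two of the eight features carry signal.
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=120)

# Candidate values for alpha; each one is scored with 5-fold cross-validation.
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Lasso(max_iter=10000), param_grid, cv=5, scoring="r2")
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_score = search.best_score_
print(best_alpha, best_score)
```

`LassoCV`, used below, performs the same kind of search but is specialized for Lasso and reuses computation along the alpha path.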

# LassoCV: Lasso with built-in cross-validation over alpha
# cv=10 splits the data into 10 folds: 9 for training and 1 for validation in each round
# alphas takes a sequence of candidate values; if omitted, LassoCV chooses its own grid of alphas
lassocv = LassoCV(cv=10, random_state=seed, alphas=(2, 3, 4, 5, 6, 7, 8, 9, 10, 11))
lassocv.fit(features3, target)                    # fit the model
lassocv_score = lassocv.score(features3, target)  # R^2 on the data used for fitting; the closer to 1, the better the fit
lassocv_alpha = lassocv.alpha_                    # best penalty coefficient alpha

plt.figure(figsize = (10, 4))
plt.plot(lassocv_alpha, lassocv_score, '-ko')

plt.axhline(lassocv_score, color = 'c')
plt.xlabel(r'$\alpha$')       # x-axis label
plt.ylabel('CV Score')        # y-axis label
plt.xscale('log', base = 2)   # base-2 log scale on the x axis (the old basex keyword was removed from matplotlib)

sns.despine(offset = 15)

print('CV results:', lassocv_score, lassocv_alpha)


Step 3: Inspect the training results
See which features matter and which do not. Because of how Lasso regression works, many features end up with a coefficient of exactly 0.

# Distribution of feature weights
# lassocv.coef_ is the parameter vector w: the learned coefficient of every feature
coefs = pd.Series(lassocv.coef_, index = features3.columns)
print(coefs)
# Print a summary
print("Lasso picked " + str(sum(coefs != 0)) + " features and eliminated the other " +  
      str(sum(coefs == 0)) + " features.")

# Visualize the feature weights
## Take the 5 most negative and the 5 most positive coefficients
coefs = pd.concat([coefs.sort_values().head(5), coefs.sort_values().tail(5)])   # concatenate the two Series end to end
## Plot
plt.figure(figsize = (10, 4))
coefs.plot(kind = "barh", color = 'c')
plt.title("Coefficients in the Lasso Model")
plt.show()

Step 4: Model testing

# Train the model
model_l1 = LassoCV(alphas=(2,3,4,5,6,7,8,9,10,11), cv=10, random_state=seed).fit(X_train, y_train)
# Predict
y_pred_l1 = model_l1.predict(X_test)
# Score the model (R^2 on the test set)
model_l1.score(X_test, y_test)

Output:

0.6181257534685929

# Compare predicted and true values
plt.rcParams['figure.figsize'] = (6.0, 6.0)
## Build a pandas DataFrame. preds: predicted values, true: true values, residuals: true minus predicted
preds = pd.DataFrame({"preds": model_l1.predict(X_train), "true": y_train})
preds["residuals"] = preds["true"] - preds["preds"]
## Plot the relationship between preds (predicted values) and residuals (true minus predicted)
sns.scatterplot(x='preds',y="residuals",data=preds)