文章目录
- 一、简要分析
- 二、缺失值处理
- 主要思路分析:
- 三、异常值处理
- 主要思路分析
- 四、深度清洗
- 主要思路分析
数据清洗工作是数据分析工作中不可缺少的步骤,这是因为数据清洗能够处理掉肮脏数据,如果不清洗数据的话,那么数据分析的结果准确率会变得极低。
一般来说,数据清理是将数据库精简以除去重复记录,并使剩余部分转换成标准可接收格式的过程。数据清理标准模型是将数据输入到数据清理处理器,通过一系列步骤“ 清理”数据,然后以期望的格式输出清理过的数据。数据清理从数据的准确性、完整性、一致性、惟一性、适时性、有效性几个方面来处理数据的丢失值、越界值、不一致代码、重复数据等问题。
一、简要分析
在任务一中,我们对于赛题、数据总体情况、缺失值、特征分布等信息做了简要的分析。在本次任务中就是基于任务一的分析做数据清理工作。
在一些场景中,任务一和任务二合并起来会被称作EDA(Exploratory Data Analysis-探索性数据分析)。当然真正的EDA包含的内容远不止这两份参考示例所展示了,大家可以自行学习尝试。
二、缺失值处理
由于调查、编码和录入误差,数据中可能存在一些无效值和缺失值,需要给予适当的处理。常用的处理方法有:估算,整例删除,变量删除和成对删除。
#coding:utf-8
#导入warnings包,利用过滤器来实现忽略警告语句。
import warnings
warnings.filterwarnings('ignore')
# GBDT
from sklearn.ensemble import GradientBoostingRegressor
# XGBoost
import xgboost as xgb
# LightGBM
import lightgbm as lgb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score
from sklearn.preprocessing import LabelEncoder
import pickle
import multiprocessing
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC,LinearRegression,LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
#载入数据
data_train = pd.read_csv('F:\\实验\\比赛\\房价预测\\数据集\\train_data.csv')
data_train['Type'] = 'Train'
data_test = pd.read_csv('F:\\实验\\比赛\\房价预测\\数据集\\test_a.csv')
data_test['Type'] = 'Test'
data_all = pd.concat([data_train, data_test], ignore_index=True)
主要思路分析:
虽然这步骤是缺失值处理,但还会涉及到一些最最基础的数据处理。
1.缺失值处理:
- 缺失值的处理手段大体可以分为:删除、填充、映射到高维(当做类别处理)。
- 根据任务一,直接找到的缺失值情况是pu和pv;
- 但是,根据特征nunique分布的分析,可以发现rentType存在"–"的情况,这也算是一种缺失值。
- 此外,诸如rentType的"未知方式";houseToward的"暂无数据"等,本质上也算是一种缺失值,但是对于这些缺失方式,我们可以把它当做是特殊的一类处理,而不需要去主动修改或填充值。
- 将rentType的"–"转换成"未知方式"类别;
- pv/pu的缺失值用均值填充;
- buildYear存在"暂无信息",将其用众数填充。
2.转换object类型数据:
- 这里直接采用LabelEncoder的方式编码,详细的编码方式请自行查阅相关资料学习。
3.时间字段的处理:
- buildYear由于存在"暂无信息",所以需要主动将其转换int类型;
- tradeTime,将其分割成月和日。
4.删除无关字段:
- ID是唯一码,建模无用,所以直接删除;
- city只有一个SH值,也直接删除;
- tradeTime已经分割成月和日,删除原来字段
def preprocessingData(data):
# 填充缺失值
data['rentType'][data['rentType'] == '--'] = '未知方式'
# 转换object类型数据
columns = ['rentType','communityName','houseType', 'houseFloor', 'houseToward', 'houseDecoration', 'region', 'plate']
for feature in columns:
data[feature] = LabelEncoder().fit_transform(data[feature])
# 将buildYear列转换为整型数据
buildYearmean = pd.DataFrame(data[data['buildYear'] != '暂无信息']['buildYear'].mode())
data.loc[data[data['buildYear'] == '暂无信息'].index, 'buildYear'] = buildYearmean.iloc[0, 0]
data['buildYear'] = data['buildYear'].astype('int')
# 处理pv和uv的空值
data['pv'].fillna(data['pv'].mean(), inplace=True)
data['uv'].fillna(data['uv'].mean(), inplace=True)
data['pv'] = data['pv'].astype('int')
data['uv'] = data['uv'].astype('int')
# 分割交易时间
def month(x):
month = int(x.split('/')[1])
return month
def day(x):
day = int(x.split('/')[2])
return day
data['month'] = data['tradeTime'].apply(lambda x: month(x))
data['day'] = data['tradeTime'].apply(lambda x: day(x))
# 去掉部分特征
data.drop('city', axis=1, inplace=True)
data.drop('tradeTime', axis=1, inplace=True)
data.drop('ID', axis=1, inplace=True)
return data
data_train = preprocessingData(data_train)
三、异常值处理
主要思路分析
这里主要针对area和tradeMoney两个维度处理。
针对tradeMoney,这里采用的是IsolationForest模型自动处理;
针对areahetotalFloor是主观+数据可视化的方式得到的结果。
参考资料:iForest (Isolation Forest)孤立森林 异常检测 入门篇
# clean data
def IF_drop(train):
IForest = IsolationForest(contamination=0.01)
IForest.fit(train["tradeMoney"].values.reshape(-1,1))
y_pred = IForest.predict(train["tradeMoney"].values.reshape(-1,1))
drop_index = train.loc[y_pred==-1].index
print(drop_index)
train.drop(drop_index,inplace=True)
return train
data_train = IF_drop(data_train)
结果:
def dropData(train):
# 丢弃部分异常值
train = train[train.area <= 200]
train = train[(train.tradeMoney <=16000) & (train.tradeMoney >=700)]
train.drop(train[(train['totalFloor'] == 0)].index, inplace=True)
return train
#数据集异常值处理
data_train = dropData(data_train)
# 处理异常值后再次查看面积和租金分布图
plt.figure(figsize=(15,5))
sns.boxplot(data_train.area)
plt.show()
plt.figure(figsize=(15,5))
sns.boxplot(data_train.tradeMoney),
plt.show()
结果:
四、深度清洗
主要思路分析
针对每一个region的数据,对area和tradeMoney两个维度进行深度清洗。 采用主观+数据可视化的方式。
def cleanData(data):
data.drop(data[(data['region']=='RG00001') & (data['tradeMoney']<1000)&(data['area']>50)].index,inplace=True)
data.drop(data[(data['region']=='RG00001') & (data['tradeMoney']>25000)].index,inplace=True)
data.drop(data[(data['region']=='RG00001') & (data['area']>250)&(data['tradeMoney']<20000)].index,inplace=True)
data.drop(data[(data['region']=='RG00001') & (data['area']>400)&(data['tradeMoney']>50000)].index,inplace=True)
data.drop(data[(data['region']=='RG00001') & (data['area']>100)&(data['tradeMoney']<2000)].index,inplace=True)
data.drop(data[(data['region']=='RG00002') & (data['area']<100)&(data['tradeMoney']>60000)].index,inplace=True)
data.drop(data[(data['region']=='RG00003') & (data['area']<300)&(data['tradeMoney']>30000)].index,inplace=True)
data.drop(data[(data['region']=='RG00003') & (data['tradeMoney']<500)&(data['area']<50)].index,inplace=True)
data.drop(data[(data['region']=='RG00003') & (data['tradeMoney']<1500)&(data['area']>100)].index,inplace=True)
data.drop(data[(data['region']=='RG00003') & (data['tradeMoney']<2000)&(data['area']>300)].index,inplace=True)
data.drop(data[(data['region']=='RG00003') & (data['tradeMoney']>5000)&(data['area']<20)].index,inplace=True)
data.drop(data[(data['region']=='RG00003') & (data['area']>600)&(data['tradeMoney']>40000)].index,inplace=True)
data.drop(data[(data['region']=='RG00004') & (data['tradeMoney']<1000)&(data['area']>80)].index,inplace=True)
data.drop(data[(data['region']=='RG00006') & (data['tradeMoney']<200)].index,inplace=True)
data.drop(data[(data['region']=='RG00005') & (data['tradeMoney']<2000)&(data['area']>180)].index,inplace=True)
data.drop(data[(data['region']=='RG00005') & (data['tradeMoney']>50000)&(data['area']<200)].index,inplace=True)
data.drop(data[(data['region']=='RG00006') & (data['area']>200)&(data['tradeMoney']<2000)].index,inplace=True)
data.drop(data[(data['region']=='RG00007') & (data['area']>100)&(data['tradeMoney']<2500)].index,inplace=True)
data.drop(data[(data['region']=='RG00010') & (data['area']>200)&(data['tradeMoney']>25000)].index,inplace=True)
data.drop(data[(data['region']=='RG00010') & (data['area']>400)&(data['tradeMoney']<15000)].index,inplace=True)
data.drop(data[(data['region']=='RG00010') & (data['tradeMoney']<3000)&(data['area']>200)].index,inplace=True)
data.drop(data[(data['region']=='RG00010') & (data['tradeMoney']>7000)&(data['area']<75)].index,inplace=True)
data.drop(data[(data['region']=='RG00010') & (data['tradeMoney']>12500)&(data['area']<100)].index,inplace=True)
data.drop(data[(data['region']=='RG00004') & (data['area']>400)&(data['tradeMoney']>20000)].index,inplace=True)
data.drop(data[(data['region']=='RG00008') & (data['tradeMoney']<2000)&(data['area']>80)].index,inplace=True)
data.drop(data[(data['region']=='RG00009') & (data['tradeMoney']>40000)].index,inplace=True)
data.drop(data[(data['region']=='RG00009') & (data['area']>300)].index,inplace=True)
data.drop(data[(data['region']=='RG00009') & (data['area']>100)&(data['tradeMoney']<2000)].index,inplace=True)
data.drop(data[(data['region']=='RG00011') & (data['tradeMoney']<10000)&(data['area']>390)].index,inplace=True)
data.drop(data[(data['region']=='RG00012') & (data['area']>120)&(data['tradeMoney']<5000)].index,inplace=True)
data.drop(data[(data['region']=='RG00013') & (data['area']<100)&(data['tradeMoney']>40000)].index,inplace=True)
data.drop(data[(data['region']=='RG00013') & (data['area']>400)&(data['tradeMoney']>50000)].index,inplace=True)
data.drop(data[(data['region']=='RG00013') & (data['area']>80)&(data['tradeMoney']<2000)].index,inplace=True)
data.drop(data[(data['region']=='RG00014') & (data['area']>300)&(data['tradeMoney']>40000)].index,inplace=True)
data.drop(data[(data['region']=='RG00014') & (data['tradeMoney']<1300)&(data['area']>80)].index,inplace=True)
data.drop(data[(data['region']=='RG00014') & (data['tradeMoney']<8000)&(data['area']>200)].index,inplace=True)
data.drop(data[(data['region']=='RG00014') & (data['tradeMoney']<1000)&(data['area']>20)].index,inplace=True)
data.drop(data[(data['region']=='RG00014') & (data['tradeMoney']>25000)&(data['area']>200)].index,inplace=True)
data.drop(data[(data['region']=='RG00014') & (data['tradeMoney']<20000)&(data['area']>250)].index,inplace=True)
data.drop(data[(data['region']=='RG00005') & (data['tradeMoney']>30000)&(data['area']<100)].index,inplace=True)
data.drop(data[(data['region']=='RG00005') & (data['tradeMoney']<50000)&(data['area']>600)].index,inplace=True)
data.drop(data[(data['region']=='RG00005') & (data['tradeMoney']>50000)&(data['area']>350)].index,inplace=True)
data.drop(data[(data['region']=='RG00006') & (data['tradeMoney']>4000)&(data['area']<100)].index,inplace=True)
data.drop(data[(data['region']=='RG00006') & (data['tradeMoney']<600)&(data['area']>100)].index,inplace=True)
data.drop(data[(data['region']=='RG00006') & (data['area']>165)].index,inplace=True)
data.drop(data[(data['region']=='RG00012') & (data['tradeMoney']<800)&(data['area']<30)].index,inplace=True)
data.drop(data[(data['region']=='RG00007') & (data['tradeMoney']<1100)&(data['area']>50)].index,inplace=True)
data.drop(data[(data['region']=='RG00004') & (data['tradeMoney']>8000)&(data['area']<80)].index,inplace=True)
data.loc[(data['region']=='RG00002')&(data['area']>50)&(data['rentType']=='合租'),'rentType']='整租'
data.loc[(data['region']=='RG00014')&(data['rentType']=='合租')&(data['area']>60),'rentType']='整租'
data.drop(data[(data['region']=='RG00008')&(data['tradeMoney']>15000)&(data['area']<110)].index,inplace=True)
data.drop(data[(data['region']=='RG00008')&(data['tradeMoney']>20000)&(data['area']>110)].index,inplace=True)
data.drop(data[(data['region']=='RG00008')&(data['tradeMoney']<1500)&(data['area']<50)].index,inplace=True)
data.drop(data[(data['region']=='RG00008')&(data['rentType']=='合租')&(data['area']>50)].index,inplace=True)
data.drop(data[(data['region']=='RG00015') ].index,inplace=True)
data.reset_index(drop=True, inplace=True)
return data
data_train = cleanData(data_train)