Telecom Customer Churn Analysis and Prediction
- Background
- Questions
- Understanding the Data
- Data Cleaning
- Visual Analysis
- Churn Prediction
- Conclusions and Recommendations
I. Background
There is a well-known observation about retention: cutting customer churn by 5% can raise profits by 25%-85%. With acquisition costs stubbornly high, mobile carriers are hitting a "ceiling" and even struggling to win new users at all. As the market saturates, carriers urgently need to increase user stickiness and extend the customer lifecycle, which makes churn analysis and prediction essential. The dataset is the "Mobile Carrier Customer Dataset" from kesci.
II. Questions
- Analyze the relationship between user characteristics and churn.
- Overall, what characteristics do churned users tend to share?
- Try to find a suitable model for predicting which users will churn.
- Give targeted recommendations for increasing stickiness and preventing churn.
III. Understanding the Data
The dataset has 21 fields and 7,043 records; each record describes one unique customer.
Our goal is to uncover the relationship between the first 20 feature columns and the final column, which indicates whether the customer churned.
IV. Data Cleaning
The four data-cleaning rules (completeness, comprehensiveness, validity, uniqueness); quick checks for each are sketched after the download link below:
- Completeness: does any record contain nulls, and are all expected fields present?
- Comprehensiveness: look at all the values in a column and use common sense to judge whether the column is problematic, e.g. its definition, its units, or the values themselves.
- Validity: are the type, content, and range of the data valid? For example, non-ASCII characters, an "unknown" gender, or an age above 150.
- Uniqueness: are there duplicate records? Since data is usually aggregated from several sources, duplicates are common; both rows and columns should be unique.
- Dataset download:
Link: https://pan.baidu.com/s/1NIg-4X_ajfeaMr7hB1rScQ?pwd=49kk
Extraction code: 49kk
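Before the formal steps below, here is a minimal sketch of quick pandas checks for these four rules (the specific columns checked are illustrative choices, not part of the original workflow):
# Quick checks for the four cleaning rules (illustrative sketch)
import pandas as pd
df = pd.read_csv('./data/WA_Fn-UseC_-Telco-Customer-Churn.csv')
# Completeness: nulls per column
print(df.isnull().sum())
# Comprehensiveness / validity: value ranges and category levels
print(df['tenure'].describe())    # tenure should be a small non-negative integer
print(df['gender'].unique())      # expect only 'Male' / 'Female'
# Uniqueness: duplicate rows and duplicate customer IDs
print(df.duplicated().sum())
print(df['customerID'].duplicated().sum())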
# 1. Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
# 2. Load the dataset
customerDF = pd.read_csv('./data/WA_Fn-UseC_-Telco-Customer-Churn.csv')
# 3. Check the dataset size
customerDF.shape
[Output]:
(7043, 21)
# 4. Show all columns without truncation
pd.set_option('display.max_columns',None)
# 5. Preview the first 10 records
customerDF.head(10)
[Output: first 10 rows displayed]
# 6. Count the null values in each column, if any
pd.isnull(customerDF).sum()
[Output]:
customerID 0
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 0
Churn 0
dtype: int64
# 7.1 Inspect the data types (7.1 and 7.2 show essentially the same information)
customerDF.info()
[Output]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 gender 7043 non-null object
2 SeniorCitizen 7043 non-null int64
3 Partner 7043 non-null object
4 Dependents 7043 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7043 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 7043 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7043 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7043 non-null object
18 MonthlyCharges 7043 non-null float64
19 TotalCharges 7043 non-null object
20 Churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
# 7.2 Inspect the data types
customerDF.dtypes
[Output]:
customerID object
gender object
SeniorCitizen int64
Partner object
Dependents object
tenure int64
PhoneService object
MultipleLines object
InternetService object
OnlineSecurity object
OnlineBackup object
DeviceProtection object
TechSupport object
StreamingTV object
StreamingMovies object
Contract object
PaperlessBilling object
PaymentMethod object
MonthlyCharges float64
TotalCharges object
Churn object
dtype: object
# 8. Convert 'TotalCharges' (total charges) to float
# × 8.1 This fails because some entries are blank strings: ValueError: could not convert string to float:
customerDF[['TotalCharges']].astype(float)
[Erroneous output]:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-7c93c9019d13> in <module>
1 # 8 将‘TotalCharges’总消费额的数据类型转换为浮点型
2 # × 8.1 发现错误:字符串无法转换为数字,ValueError: could not convert string to float:
----> 3 customerDF[['TotalCharges']].astype(float)
D:\mysoft\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors)
5875 else:
5876 # else, only a single dtype is given
-> 5877 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
5878 return self._constructor(new_data).__finalize__(self, method="astype")
5879
D:\mysoft\anaconda3\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, copy, errors)
629 self, dtype, copy: bool = False, errors: str = "raise"
630 ) -> "BlockManager":
--> 631 return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
632
633 def convert(
D:\mysoft\anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
425 applied = b.apply(f, **kwargs)
426 else:
--> 427 applied = getattr(b, f)(**kwargs)
428 except (TypeError, NotImplementedError):
429 if not ignore_failures:
D:\mysoft\anaconda3\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors)
671 vals1d = values.ravel()
672 try:
--> 673 values = astype_nansafe(vals1d, dtype, copy=True)
674 except (ValueError, TypeError):
675 # e.g. astype_nansafe can fail on object-dtype of strings
D:\mysoft\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
1095 if copy or is_object_dtype(arr) or is_object_dtype(dtype):
1096 # Explicit copy, or required since NumPy can't view from / to object.
-> 1097 return arr.astype(dtype, copy=True)
1098
1099 return arr.view(dtype)
ValueError: could not convert string to float: ''
# 8.2 Check each column's dtype, contents, and counts; it turns out that 'TotalCharges' (total charges) has 11 missing values
# Inspect the values of every column
for x in customerDF.columns:
    test=customerDF.loc[:,x].value_counts()
    print('Row count of {0}: {1}'.format(x,test.sum()))
    print('Dtype of {0}: {1}'.format(x,customerDF[x].dtypes))
    print('Values of {0}:\n{1}\n'.format(x,test))
[Output]:
Row count of customerID: 7043
Dtype of customerID: object
Values of customerID:
0463-TXOAK 1
1025-FALIX 1
7176-WIONM 1
5180-UCIIQ 1
2260-USTRB 1
..
6017-PPLPX 1
9588-YRFHY 1
0112-QAWRZ 1
9985-MWVIX 1
5095-AESKG 1
Name: customerID, Length: 7043, dtype: int64
Row count of gender: 7043
Dtype of gender: object
Values of gender:
Male 3555
Female 3488
Name: gender, dtype: int64
Row count of SeniorCitizen: 7043
Dtype of SeniorCitizen: int64
Values of SeniorCitizen:
0 5901
1 1142
Name: SeniorCitizen, dtype: int64
Row count of Partner: 7043
Dtype of Partner: object
Values of Partner:
No 3641
Yes 3402
Name: Partner, dtype: int64
Row count of Dependents: 7043
Dtype of Dependents: object
Values of Dependents:
No 4933
Yes 2110
Name: Dependents, dtype: int64
Row count of tenure: 7043
Dtype of tenure: int64
Values of tenure:
1 613
72 362
2 238
3 200
4 176
...
28 57
39 56
44 51
36 50
0 11
Name: tenure, Length: 73, dtype: int64
Row count of PhoneService: 7043
Dtype of PhoneService: object
Values of PhoneService:
Yes 6361
No 682
Name: PhoneService, dtype: int64
Row count of MultipleLines: 7043
Dtype of MultipleLines: object
Values of MultipleLines:
No 3390
Yes 2971
No phone service 682
Name: MultipleLines, dtype: int64
Row count of InternetService: 7043
Dtype of InternetService: object
Values of InternetService:
Fiber optic 3096
DSL 2421
No 1526
Name: InternetService, dtype: int64
Row count of OnlineSecurity: 7043
Dtype of OnlineSecurity: object
Values of OnlineSecurity:
No 3498
Yes 2019
No internet service 1526
Name: OnlineSecurity, dtype: int64
Row count of OnlineBackup: 7043
Dtype of OnlineBackup: object
Values of OnlineBackup:
No 3088
Yes 2429
No internet service 1526
Name: OnlineBackup, dtype: int64
Row count of DeviceProtection: 7043
Dtype of DeviceProtection: object
Values of DeviceProtection:
No 3095
Yes 2422
No internet service 1526
Name: DeviceProtection, dtype: int64
Row count of TechSupport: 7043
Dtype of TechSupport: object
Values of TechSupport:
No 3473
Yes 2044
No internet service 1526
Name: TechSupport, dtype: int64
Row count of StreamingTV: 7043
Dtype of StreamingTV: object
Values of StreamingTV:
No 2810
Yes 2707
No internet service 1526
Name: StreamingTV, dtype: int64
Row count of StreamingMovies: 7043
Dtype of StreamingMovies: object
Values of StreamingMovies:
No 2785
Yes 2732
No internet service 1526
Name: StreamingMovies, dtype: int64
Row count of Contract: 7043
Dtype of Contract: object
Values of Contract:
Month-to-month 3875
Two year 1695
One year 1473
Name: Contract, dtype: int64
Row count of PaperlessBilling: 7043
Dtype of PaperlessBilling: object
Values of PaperlessBilling:
Yes 4171
No 2872
Name: PaperlessBilling, dtype: int64
Row count of PaymentMethod: 7043
Dtype of PaymentMethod: object
Values of PaymentMethod:
Electronic check 2365
Mailed check 1612
Bank transfer (automatic) 1544
Credit card (automatic) 1522
Name: PaymentMethod, dtype: int64
Row count of MonthlyCharges: 7043
Dtype of MonthlyCharges: float64
Values of MonthlyCharges:
20.05 61
19.85 45
19.90 44
19.95 44
19.65 43
..
87.65 1
35.30 1
114.85 1
56.50 1
97.25 1
Name: MonthlyCharges, Length: 1585, dtype: int64
Row count of TotalCharges: 7043
Dtype of TotalCharges: object
Values of TotalCharges:
11
20.2 11
19.75 9
20.05 8
19.65 8
..
5166.2 1
1133.65 1
934.8 1
385.55 1
5832 1
Name: TotalCharges, Length: 6531, dtype: int64
Row count of Churn: 7043
Dtype of Churn: object
Values of Churn:
No 5174
Yes 1869
Name: Churn, dtype: int64
# 8.3 Force-convert 'TotalCharges' (total charges) to float
# × Error: AttributeError: 'Series' object has no attribute 'convert_objects'
# × convert_objects has been deprecated and removed in current pandas versions
customerDF['TotalCharges']=customerDF['TotalCharges'].convert_objects(convert_numeric=True)
[Erroneous output]:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-10-bdd74b9e37bd> in <module>
2 # ×报错:AttributeError: 'Series' object has no attribute 'convert_objects'
3 # ×convert_objects的方法已经被弃用,
----> 4 customerDF['TotalCharges']=customerDF['TotalCharges'].convert_objects(convert_numeric=True)
D:\mysoft\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5464 return self[name]
-> 5465 return object.__getattribute__(self, name)
5466
5467 def __setattr__(self, name: str, value) -> None:
AttributeError: 'Series' object has no attribute 'convert_objects'
# 8.3.1 √ The working conversion for this pandas version: pd.to_numeric with errors='coerce'
customerDF['TotalCharges'] = pd.to_numeric(customerDF['TotalCharges'], errors='coerce')
# 9. After the conversion, 'TotalCharges' has 11 missing values (NaN)
test=customerDF.loc[:,'TotalCharges'].value_counts().sort_index()
print(test.sum())
print(customerDF.tenure[customerDF['TotalCharges'].isnull().values==True])
[Output]:
7032
488 0
753 0
936 0
1082 0
1340 0
3331 0
3826 0
4380 0
5218 0
6670 0
6754 0
Name: tenure, dtype: int64
On inspection, these 11 users all have 'tenure' (months with the carrier) equal to 0, so they are presumably customers who joined in the current month.
In general, even a customer who churns in the month they sign up still pays that month's fee. We therefore set these 11 users' tenure to 1 and fill their total charges with their monthly charges, which matches reality.
# 9.1 Check for nulls and print the affected rows
print(customerDF.isnull().any())
print(customerDF[customerDF['TotalCharges'].isnull().values==True][['tenure','MonthlyCharges','TotalCharges']])
[Output]:
customerID False
gender False
SeniorCitizen False
Partner False
Dependents False
tenure False
PhoneService False
MultipleLines False
InternetService False
OnlineSecurity False
OnlineBackup False
DeviceProtection False
TechSupport False
StreamingTV False
StreamingMovies False
Contract False
PaperlessBilling False
PaymentMethod False
MonthlyCharges False
TotalCharges True
Churn False
dtype: bool
tenure MonthlyCharges TotalCharges
488 0 52.55 NaN
753 0 20.25 NaN
936 0 80.85 NaN
1082 0 25.75 NaN
1340 0 56.05 NaN
3331 0 19.85 NaN
3826 0 25.35 NaN
4380 0 20.00 NaN
5218 0 19.70 NaN
6670 0 73.35 NaN
6754 0 61.90 NaN
# 9.2 × Attempt to fill the missing total charges with the monthly charges; it raises ValueError: Series.replace cannot use dict-value and non-None to_replace
customerDF.loc[:,'TotalCharges'].replace(to_replace=np.nan,value=customerDF.loc[:,'MonthlyCharges'],inplace=True)
[Erroneous output]:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-6230086838ba> in <module>
1 # ×将总消费额填充为月消费额,以下报错:ValueError: Series.replace cannot use dict-value and non-None to_replace
----> 2 customerDF.loc[:,'TotalCharges'].replace(to_replace=np.nan,value=customerDF.loc[:,'MonthlyCharges'],inplace=True)
D:\mysoft\anaconda3\lib\site-packages\pandas\core\series.py in replace(self, to_replace, value, inplace, limit, regex, method)
4507 method="pad",
4508 ):
-> 4509 return super().replace(
4510 to_replace=to_replace,
4511 value=value,
D:\mysoft\anaconda3\lib\site-packages\pandas\core\generic.py in replace(self, to_replace, value, inplace, limit, regex, method)
6921 # Operate column-wise
6922 if self.ndim == 1:
-> 6923 raise ValueError(
6924 "Series.replace cannot use dict-value and "
6925 "non-None to_replace"
ValueError: Series.replace cannot use dict-value and non-None to_replace
# 9.3 √ Fill the missing total charges with the monthly charges
# Option 1: use fillna
customerDF['TotalCharges'] = customerDF['TotalCharges'].fillna(customerDF['MonthlyCharges'])
# Option 2: run the two lines below instead
# Collect the indices of the rows that need replacing as a list
# pan1 = customerDF[customerDF['TotalCharges'].isnull()].index.to_list()
# # Replace them one-to-one
# customerDF.loc[pan1,'TotalCharges'] = customerDF.loc[pan1,'MonthlyCharges']
# 9.4 Verify the replacement
customerDF[customerDF['tenure']==0][['tenure','MonthlyCharges','TotalCharges']]
[Output: the 11 rows with tenure == 0, TotalCharges now filled]
# 10. Change 'tenure' from 0 to 1 (assign the result back rather than calling replace(inplace=True) on a .loc slice)
customerDF['tenure'] = customerDF['tenure'].replace(0, 1)
print(pd.isnull(customerDF['TotalCharges']).sum())
print(customerDF['TotalCharges'].dtypes)
[Output]:
0
float64
# 11. Descriptive statistics of the numeric columns
customerDF.describe()
[Output: summary statistics table]
V. Visual Analysis
Following common practice, the user features are grouped into user attributes, service attributes, and contract attributes, and each of these three dimensions is visualized in turn.
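For reference, one possible grouping of the columns into these three dimensions, inferred from the plots that follow (the exact assignment is my reading of the analysis rather than something stated in the dataset):
# Feature groups used in the visual analysis (illustrative)
user_attrs = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure']
service_attrs = ['PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
                 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
contract_attrs = ['Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges']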
# 12. Churned-user count and proportion
plt.rcParams['figure.figsize']=6,6
plt.pie(customerDF['Churn'].value_counts(),labels=customerDF['Churn'].value_counts().index,autopct='%1.2f%%',explode=(0.1,0))
plt.title('Churn(Yes/No) Ratio')
plt.show()
[Output: pie chart of the churn ratio]
# 13. Churned vs. retained counts as a bar chart
churnDf=customerDF['Churn'].value_counts().to_frame()
x=churnDf.index
y=churnDf['Churn']
plt.bar(x,y,width = 0.5,color = 'c')
# (to display Chinese labels in plots, a CJK font would need to be installed; the labels here are English)
plt.title('Churn(Yes/No) Num')
plt.show()
[Output: bar chart of churn counts]
This is an imbalanced dataset: churned users account for 26.54% of the total.
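The exact proportion behind that figure can be checked directly (a minimal sketch; the values in the comment are rounded):
# Class balance: share of churned vs. retained customers
print(customerDF['Churn'].value_counts(normalize=True))
# No ~ 0.7346, Yes ~ 0.2654 -> about 26.54% of customers churned, so the classes are imbalanced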
(1) User-attribute analysis
import matplotlib.ticker as ticker
def barplot_percentages(feature,orient='v',axis_name="percentage of customers"):
ratios = pd.DataFrame()
g = (customerDF.groupby(feature)["Churn"].value_counts()/len(customerDF)).to_frame()
g.rename(columns={"Churn":axis_name},inplace=True)
g.reset_index(inplace=True)
#print(g)
if orient == 'v':
ax = sns.barplot(x=feature, y= axis_name, hue='Churn', data=g, orient=orient)
ax.set_yticklabels(['{:,.0%}'.format(y) for y in ax.get_yticks()])
plt.rcParams.update({'font.size': 13})
#plt.legend(fontsize=10)
else:
ax = sns.barplot(x= axis_name, y=feature, hue='Churn', data=g, orient=orient)
ax.set_xticklabels(['{:,.0%}'.format(x) for x in ax.get_xticks()])
plt.legend(fontsize=10)
plt.title('Churn(Yes/No) Ratio as {0}'.format(feature))
plt.show()
barplot_percentages("SeniorCitizen")
barplot_percentages("gender")
[Output: churn ratio by SeniorCitizen and by gender]
customerDF['churn_rate'] = customerDF['Churn'].replace("No", 0).replace("Yes", 1)
g = sns.FacetGrid(customerDF, col="SeniorCitizen", height=4, aspect=.9)
ax = g.map(sns.barplot, "gender", "churn_rate", palette = "Blues_d", order= ['Female', 'Male'])
plt.rcParams.update({'font.size': 13})
plt.show()
[Output: churn rate by gender, faceted by SeniorCitizen]
Summary:
- Churn is essentially unrelated to gender;
- Senior users churn at a significantly higher rate than younger users.
fig, axis = plt.subplots(1, 2, figsize=(12,4))
axis[0].set_title("Has Partner")
axis[1].set_title("Has Dependents")
axis_y = "percentage of customers"
# Plot Partner column
gp_partner = (customerDF.groupby('Partner')["Churn"].value_counts()/len(customerDF)).to_frame()
gp_partner.rename(columns={"Churn": axis_y}, inplace=True)
gp_partner.reset_index(inplace=True)
ax1 = sns.barplot(x='Partner', y= axis_y, hue='Churn', data=gp_partner, ax=axis[0])
ax1.legend(fontsize=10)
#ax1.set_xlabel('Partner')
# Plot Dependents column
gp_dep = (customerDF.groupby('Dependents')["Churn"].value_counts()/len(customerDF)).to_frame()
#print(gp_dep)
gp_dep.rename(columns={"Churn": axis_y} , inplace=True)
#print(gp_dep)
gp_dep.reset_index(inplace=True)
#print(gp_dep)
ax2 = sns.barplot(x='Dependents', y= axis_y, hue='Churn', data=gp_dep, ax=axis[1])
#ax2.set_xlabel('Dependents')
# Set font size
plt.rcParams.update({'font.size': 20})
ax2.legend(fontsize=10)
# Display the figure
plt.show()
[Output: churn by Partner and by Dependents]
# Kernel density estimation (KDE)
def kdeplot(feature,xlabel):
    plt.figure(figsize=(9, 4))
    plt.title("KDE for {0}".format(feature))
    ax0 = sns.kdeplot(customerDF[customerDF['Churn'] == 'No'][feature].dropna(), color= 'navy', label= 'Churn: No', shade=True)
    ax1 = sns.kdeplot(customerDF[customerDF['Churn'] == 'Yes'][feature].dropna(), color= 'orange', label= 'Churn: Yes', shade=True)
    plt.xlabel(xlabel)
    # Set font size
    plt.rcParams.update({'font.size': 20})
    plt.legend(fontsize=10)
kdeplot('tenure','tenure')
plt.show()
[Output: KDE of tenure by churn]
Summary:
- Users with a partner churn at a lower rate than those without;
- Users with dependents are relatively few;
- Users with dependents churn at a lower rate than those without;
- The longer the tenure, the lower the churn rate, in line with common experience;
- Once tenure reaches about three months, the churn rate drops below the retention rate, suggesting that the first three months are the critical period for new users.
(2) Service-attribute analysis
plt.figure(figsize=(9, 4.5))
barplot_percentages("MultipleLines", orient='h')
[Output: churn ratio by MultipleLines]
plt.figure(figsize=(9, 4.5))
barplot_percentages("InternetService", orient="h")
[Output: churn ratio by InternetService]
cols = ["PhoneService","MultipleLines","OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies"]
df1 = pd.melt(customerDF[customerDF["InternetService"] != "No"][cols])
df1.rename(columns={'value': 'Has service'},inplace=True)
plt.figure(figsize=(20, 8))
ax = sns.countplot(data=df1, x='variable', hue='Has service')
ax.set(xlabel='Internet Additional service', ylabel='Num of customers')
plt.rcParams.update({'font.size':20})
plt.legend( labels = ['No Service', 'Has Service'],fontsize=15)
plt.title('Num of Customers as Internet Additional Service')
plt.show()
[Output: customer counts by internet add-on service]
plt.figure(figsize=(20, 8))
df1 = customerDF[(customerDF.InternetService != "No") & (customerDF.Churn == "Yes")]
df1 = pd.melt(df1[cols])
df1.rename(columns={'value': 'Has service'}, inplace=True)
ax = sns.countplot(data=df1, x='variable', hue='Has service', hue_order=['No', 'Yes'])
ax.set(xlabel='Internet Additional service', ylabel='Churn Num')
plt.rcParams.update({'font.size':20})
plt.legend( labels = ['No Service', 'Has Service'],fontsize=15)
plt.title('Num of Churn Customers as Internet Additional Service')
plt.show()
[Output: churned-customer counts by internet add-on service]
Summary:
- Phone service by itself has little effect on churn.
- Fiber-optic users churn at a noticeably higher rate;
- Fiber users who also have online security, backup, device protection, or tech support churn less;
- Fiber users with streaming TV or streaming movie add-ons churn more.
(3) Contract-attribute analysis
plt.figure(figsize=(9, 4.5))
barplot_percentages("PaymentMethod",orient='h')
g = sns.FacetGrid(customerDF, col="PaperlessBilling", height=6, aspect=.9)
ax = g.map(sns.barplot, "Contract", "churn_rate", palette = "Blues_d", order= ['Month-to-month', 'One year', 'Two year'])
plt.rcParams.update({'font.size':18})
plt.show()
[Output: churn ratio by PaymentMethod; churn rate by Contract, faceted by PaperlessBilling]
kdeplot('MonthlyCharges','MonthlyCharges')
kdeplot('TotalCharges','TotalCharges')
plt.show()
[Output: KDE of MonthlyCharges and TotalCharges by churn]
Summary:
- Users paying by electronic check churn at the highest rate, presumably because that payment experience is relatively poor;
- Churn by contract type runs month-to-month > one-year > two-year, showing that long-term contracts retain customers best;
- Users with monthly charges roughly between 70 and 110 churn at a higher rate;
- Over the long run, the higher a user's total charges, the lower the churn rate, in line with common experience.
VI. Churn Prediction
We further clean the data and extract features, reduce the dimensionality through feature selection, train machine-learning models and apply them to the test set, and then evaluate the accuracy of the resulting classifiers.
(1) Data cleaning
customerID=customerDF['customerID']
customerDF.drop(['customerID'],axis=1, inplace=True)
Looking at the data types, apart from 'tenure', 'MonthlyCharges' and 'TotalCharges', which are continuous features, all the others are categorical. Continuous features are handled by standardization (a sketch of that step follows the encoding code below); categorical features with no inherent ordering are one-hot encoded, while ordered categories can be mapped to numbers.
# Extract the categorical features
cateCols = [c for c in customerDF.columns if customerDF[c].dtype == 'object' or c == 'SeniorCitizen']
dfCate = customerDF[cateCols].copy()
dfCate.head(3)
[Output: first 3 rows of the categorical features]
# Encode the features
for col in cateCols:
if dfCate[col].nunique() == 2:
dfCate[col] = pd.factorize(dfCate[col])[0]
else:
dfCate = pd.get_dummies(dfCate, columns=[col])
dfCate['tenure']=customerDF[['tenure']]
dfCate['MonthlyCharges']=customerDF[['MonthlyCharges']]
dfCate['TotalCharges']=customerDF[['TotalCharges']]
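The three continuous columns are carried over unscaled here. A minimal sketch of the standardization step mentioned above (assuming sklearn's StandardScaler, which the modelling pipeline later leaves commented out; the dfScaled name is illustrative) could look like this:
# Optional: standardize the continuous features to zero mean and unit variance (illustrative sketch, not applied below)
from sklearn.preprocessing import StandardScaler
contCols = ['tenure', 'MonthlyCharges', 'TotalCharges']
dfScaled = dfCate.copy()
dfScaled[contCols] = StandardScaler().fit_transform(dfScaled[contCols])
# An ordered category could instead be mapped to integers rather than one-hot encoded, e.g.
# customerDF['Contract'].map({'Month-to-month': 0, 'One year': 1, 'Two year': 2})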
# Examine each feature's correlation with Churn
plt.figure(figsize=(16,8))
dfCate.corr()['Churn'].sort_values(ascending=False).plot(kind='bar')
plt.show()
[Output: bar chart of feature correlations with Churn]
(2) Feature selection
# Feature selection
dropFea = ['gender','PhoneService',
'OnlineSecurity_No internet service', 'OnlineBackup_No internet service',
'DeviceProtection_No internet service', 'TechSupport_No internet service',
'StreamingTV_No internet service', 'StreamingMovies_No internet service',
#'OnlineSecurity_No', 'OnlineBackup_No',
#'DeviceProtection_No','TechSupport_No',
#'StreamingTV_No', 'StreamingMovies_No',
]
dfCate.drop(dropFea, inplace=True, axis =1)
# 'Churn' is the label
target = dfCate['Churn'].values
# Column list: all features plus the label
columns = dfCate.columns.tolist()
Build the training and test sets
# Feature columns only
columns.remove('Churn')
# Feature values as a NumPy array
features = dfCate[columns].values
# 30% for the test set, the rest for training
# random_state = 1 makes the split reproducible across runs
# stratify = target keeps the class proportions the same in the training and test sets
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(features, target, test_size=0.30, stratify = target, random_state = 1)
(3) Model building
Construct several classifiers
# Import the classification algorithms below; if scikit-learn is missing, run pip install scikit-learn
from sklearn.svm import SVC # C-support vector classifier
from sklearn.tree import DecisionTreeClassifier # Decision tree classifier
from sklearn.ensemble import RandomForestClassifier # Random forest classifier
from sklearn.neighbors import KNeighborsClassifier # K-nearest neighbors (KNN) classifier
from sklearn.ensemble import AdaBoostClassifier # AdaBoost classifier
# Instantiate the classifiers
classifiers = [
SVC(random_state = 1, kernel = 'rbf'),
DecisionTreeClassifier(random_state = 1, criterion = 'gini'),
RandomForestClassifier(random_state = 1, criterion = 'gini'),
KNeighborsClassifier(metric = 'minkowski'),
AdaBoostClassifier(random_state = 1),
]
# Classifier names (used as the Pipeline step names)
classifier_names = [
'svc',
'decisiontreeclassifier',
'randomforestclassifier',
'kneighborsclassifier',
'adaboostclassifier',
]
# Hyperparameter grids for each classifier
# Note the key format: GridSearchCV expects "<classifier name>" + "__" + "<parameter name>"
classifier_param_grid = [
{'svc__C':[0.1], 'svc__gamma':[0.01]},
{'decisiontreeclassifier__max_depth':[6,9,11]},
{'randomforestclassifier__n_estimators':range(1,11)} ,
{'kneighborsclassifier__n_neighbors':[4,6,8]},
{'adaboostclassifier__n_estimators':[70,80,90]}
]
(4) Hyperparameter tuning and evaluation
After tuning and evaluating each classifier, AdaBoostClassifier with n_estimators=70 performs best.
# A Pipeline chains the preprocessing steps and an estimator, so that a single call preprocesses the data and fits the model
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV # GridSearchCV performs hyperparameter tuning
from sklearn.metrics import accuracy_score # accuracy_score measures model accuracy
# Run GridSearchCV tuning for a given classifier
def GridSearchCV_work(pipeline, train_x, train_y, test_x, test_y, param_grid, score = 'accuracy'):
    response = {}
    gridsearch = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=3, scoring = score)
    # Find the best parameters and the best cross-validated accuracy
    search = gridsearch.fit(train_x, train_y)
    print("GridSearch best parameters:", search.best_params_)
    print("GridSearch best score: %0.4lf" %search.best_score_)
    # Predict the labels of the test set with the best parameters found above
    predict_y = gridsearch.predict(test_x)
    print("Test accuracy %0.4lf" %accuracy_score(test_y, predict_y))
    response['predict_y'] = predict_y
    response['accuracy_score'] = accuracy_score(test_y,predict_y)
    return response
for model, model_name, model_param_grid in zip(classifiers, classifier_names, classifier_param_grid):
    # StandardScaler (commented out below) would normalize the data to zero mean and unit variance
    pipeline = Pipeline([
        #('scaler', StandardScaler()),
        #('pca',PCA),
        (model_name, model)
    ])
    result = GridSearchCV_work(pipeline, train_x, train_y, test_x, test_y, model_param_grid , score = 'accuracy')
[Output]:
GridSearch best parameters: {'svc__C': 0.1, 'svc__gamma': 0.01}
GridSearch best score: 0.7560
Test accuracy 0.7591
GridSearch best parameters: {'decisiontreeclassifier__max_depth': 6}
GridSearch best score: 0.7777
Test accuracy 0.7927
GridSearch best parameters: {'randomforestclassifier__n_estimators': 10}
GridSearch best score: 0.7702
Test accuracy 0.7842
GridSearch best parameters: {'kneighborsclassifier__n_neighbors': 8}
GridSearch best score: 0.7690
Test accuracy 0.7870
GridSearch best parameters: {'adaboostclassifier__n_estimators': 70}
GridSearch best score: 0.7998
Test accuracy 0.8050
VII. Conclusions and Recommendations
From the analysis above, high-churn users share the following characteristics:
- User attributes: senior users, users without a partner, and users without dependents are more likely to churn;
- Service attributes: tenure under six months; phone service; fiber-optic service, especially with streaming TV/movie add-ons; no internet value-added services;
- Contract attributes: short contract terms, payment by electronic check, paperless billing, and monthly charges of roughly 70-110. Other attributes have little effect on churn, and the characteristics above are treated as independent of one another.
Based on these conclusions, the corresponding business recommendations are:
Use the prediction model to build a list of high-churn-risk users (a minimal sketch follows the recommendations below). Then, guided by user research, launch a minimum-viable product feature and invite seed users to try it.
- Users: for senior users and users without dependents or a partner, launch tailored offerings such as family plans and "warm care" plans, both to strengthen their ties with other users and to provide personalized service to these segments.
- Services: for newly registered users, push first-half-year promotions such as coupons to carry them through the high-churn period. For fiber users and users with streaming TV/movie add-ons, focus on the network and value-added-service experience: push the technical team to improve network metrics, and offer free network upgrades plus complimentary monthly TV/movie packages to increase stickiness. For value-added services such as online security, online backup, device protection, and tech support, focus on promotion and education, e.g. a free first month or half year.
- Contracts: for month-to-month users, offer discounts for switching to annual contracts, converting them into longer-term customers and improving retention. For users paying by electronic check, push targeted coupons for other payment methods to encourage a switch.
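A minimal sketch of that high-risk list, assuming the variables from the prediction section (train_x, train_y, dfCate, columns, and the customerID Series saved before the ID column was dropped); the AdaBoost settings come from the grid search above, while the 0.5 threshold and the riskDF / high_risk names are illustrative assumptions:
# Score every customer with the tuned model and list the likely churners (illustrative sketch)
best_model = AdaBoostClassifier(random_state=1, n_estimators=70)  # best parameters found above
best_model.fit(train_x, train_y)
# Column 1 assumes the 'Yes' class was factorized to 1; verify via pd.factorize on the original Churn column
churn_prob = best_model.predict_proba(dfCate[columns].values)[:, 1]
riskDF = pd.DataFrame({'customerID': customerID, 'churn_prob': churn_prob})
high_risk = riskDF[riskDF['churn_prob'] > 0.5].sort_values('churn_prob', ascending=False)
print(high_risk.head(10))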