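We use Alipay every day. Ant Financial's risk identification can respond within 100 milliseconds, four times faster than the blink of an eye. In the global digital economy there is one financial advantage that stands out: pure credit based on consumer big data. We may call it data credit. It is more reliable than collateral, safer than a guarantee, and smarter than supervision. It is a forward-looking property right and the core collateral asset behind digital currency, and it determines the direction, speed, and scale of credit creation in the digital currency era. In short, whoever masters data credit controls the right to issue digital currency. Judging data credit relies on financial risk control models.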


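To minimize losses from the bank's point of view, the bank needs decision rules that determine whose loan is approved and whose is not. Before deciding on a loan application, the loan manager considers the applicant's demographic and socio-economic profile.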

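The German credit data contains 20 variables for 1000 loan applicants, together with a classification of each applicant as a good or bad credit risk. A predictive model developed on this data is expected to guide the bank manager in deciding whether to approve a prospective applicant's loan based on his or her profile.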


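The world's earliest banks appeared in Italy. The earliest is often said to be a bank established in Venice in 1407, although bank-like institutions may have existed even earlier. Wherever there are banks, there is risk control and management. Early risk control consisted of reviewing the borrower's qualifications and verifying accounts.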

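As the financial industry developed, the lending process gradually matured into the process shown in the figure below.

After 2000-2008 the world gradually entered the big data era. As user data was consolidated, reference data sources emerged, such as central bank credit reporting (央行征信), public security facial data, the Zhima Credit score (芝麻信用分), the Tongdun score (同盾分), the Juxinli Miguan score (聚信立蜜罐分), and the Baidu black-intermediary score (百度黑中介分). Banks, consumer finance companies, and micro-loan companies can build models on this data and let machine decisions replace most manual review, shortening the credit process, reducing lending risk, and maximizing profit. In the big data era, the risk control department is divided into three parts: pre-loan, in-loan, and post-loan management.

**Credit score cards in an era of credit crisis**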






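The number of credit cards issued keeps rising, which suggests that banks are not very strict at the initial review stage, so an increase in bad debt is a real problem. Banks used to think like pawnshops and lent money to people with the ability to repay. Those customers are the prime segment. Worse, with quantitative easing, fiscal and monetary stimulus, and a surge in M2, banks, consumer finance companies, and micro-loan companies have expanded their target market to subprime customers, that is, people with insufficient repayment ability or without a job. Lending to them carries a high repayment risk, so the interest rates are also high.

If you are a pre-loan reviewer at a micro-loan, loan facilitation, or consumer finance company, have you experienced the scene in the figure below? Loan fraud, black-market intermediaries, and the gray industry chain are everywhere. They leave you so confused that you cannot judge the applicant, and your manager has to decide on a hunch whether to lend.

In China, black-market and gray-market operations have formed a huge industry chain. According to earlier statistics from Tongdun, there are at least a thousand black-market teams. Most are small teams of about three people, and there are also dozens to hundreds of large teams with more than 100 people. These teams probe the major cash loan platforms for loopholes every day and could pass for professional product managers. The figure below shows SIM cards used to produce fake phone numbers. They come from Southeast Asia, work in China, evade domestic security monitoring as far as possible, and are prepared specifically for fraudsters targeting online cash loan platforms. Without risk control capability, stay out of the cash loan business: the money lent out is like a meat bun thrown at a dog, gone for good.

To take a familiar example, the author found traces of black-market and gray-market sellers in an earlier keyword search on a well-known e-commerce platform (某宝).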




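A credit score card can be an excellent tool for both lenders and borrowers to gauge a borrower's ability to repay. For lenders, a score card helps assess borrower risk, identify fraudulent applicants and applicants with insufficient repayment ability, and keep the company's portfolio healthy, which ultimately affects the whole economy.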


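A risk control model is therefore like a guardian of credit, protecting the company's assets from being devoured by the black market. A score card model scores automatically and decides within one second whether a customer passes, which makes the pre-loan reviewer's job much easier. This is how risk control models were born in the big data era.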



Some key variables:

account balance (账户余额)

duration of credit (持卡时长): the duration of the credit in months

Data Set Information:

Two datasets are provided. The original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file "german.data".

For algorithms that need numerical attributes, Strathclyde University produced the file "german.data-numeric". This file has been edited and several indicator variables added to make it suitable for algorithms which cannot cope with categorical variables. Several attributes that are ordered categorical (such as attribute 17) have been coded as integer. This was the form used by StatLog.

This dataset requires use of a cost matrix (see below)



| Actual \ Predicted | 1 | 2 |
|---|---|---|
| 1 | 0 | 1 |
| 2 | 5 | 0 |

(1 = Good, 2 = Bad)

The rows represent the actual classification and the columns the predicted classification.

It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).
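The cost matrix can be applied directly when comparing models. The sketch below is a minimal illustration and not part of the original dataset description: the function names `total_cost` and `cost_optimal_threshold` and the toy labels are assumptions. It sums the cost of each (actual, predicted) pair and derives the probability of "bad" above which rejecting an applicant has a lower expected cost than approving.

```python
# Cost matrix from the dataset description: rows = actual, columns = predicted.
# Classes: 1 = Good, 2 = Bad.
COST = {
    (1, 1): 0,  # good customer classed as good
    (1, 2): 1,  # good customer classed as bad
    (2, 1): 5,  # bad customer classed as good
    (2, 2): 0,  # bad customer classed as bad
}


def total_cost(actual, predicted, cost=COST):
    """Sum the misclassification cost over paired actual/predicted labels."""
    return sum(cost[(a, p)] for a, p in zip(actual, predicted))


def cost_optimal_threshold(cost=COST):
    """Probability of 'bad' above which rejecting is the cheaper decision.

    Approving has expected cost p_bad * cost[(2, 1)].
    Rejecting has expected cost (1 - p_bad) * cost[(1, 2)].
    The two are equal at p_bad = cost[(1, 2)] / (cost[(1, 2)] + cost[(2, 1)]).
    """
    reject_good = cost[(1, 2)]
    approve_bad = cost[(2, 1)]
    return reject_good / (reject_good + approve_bad)


# Toy labels (hypothetical, for illustration only)
actual = [1, 1, 2, 2, 1]
predicted = [1, 2, 1, 2, 1]
print(total_cost(actual, predicted))       # 1 + 5 = 6
print(round(cost_optimal_threshold(), 4))  # 1 / (1 + 5) = 0.1667
```

With these costs, an applicant would be rejected once the predicted probability of being a bad risk exceeds about 0.17, well below the usual 0.5 cut-off.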

Attribute Information:

Attribute 1: (qualitative) Status of existing checking account. A11: ... < 0 DM; A12: 0 <= ... < 200 DM; A13: ... >= 200 DM / salary assignments for at least 1 year; A14: no checking account

Attribute 2: (numerical) Duration in month

Attribute 3: (qualitative) Credit history. A30: no credits taken / all credits paid back duly; A31: all credits at this bank paid back duly; A32: existing credits paid back duly till now; A33: delay in paying off in the past; A34: critical account / other credits existing (not at this bank)

Attribute 4: (qualitative) Purpose. A40: car (new); A41: car (used); A42: furniture/equipment; A43: radio/television; A44: domestic appliances; A45: repairs; A46: education; A47: (vacation - does not exist?); A48: retraining; A49: business; A410: others

Attribute 5: (numerical) Credit amount

Attribute 6: (qualitative) Savings account/bonds. A61: ... < 100 DM; A62: 100 <= ... < 500 DM; A63: 500 <= ... < 1000 DM; A64: ... >= 1000 DM; A65: unknown / no savings account

Attribute 7: (qualitative) Present employment since. A71: unemployed; A72: ... < 1 year; A73: 1 <= ... < 4 years; A74: 4 <= ... < 7 years; A75: ... >= 7 years

Attribute 8: (numerical) Installment rate in percentage of disposable income

Attribute 9: (qualitative) Personal status and sex. A91: male: divorced/separated; A92: female: divorced/separated/married; A93: male: single; A94: male: married/widowed; A95: female: single

Attribute 10: (qualitative) Other debtors / guarantors. A101: none; A102: co-applicant; A103: guarantor

Attribute 11: (numerical) Present residence since

Attribute 12: (qualitative) Property. A121: real estate; A122: if not A121: building society savings agreement / life insurance; A123: if not A121/A122: car or other, not in attribute 6; A124: unknown / no property

Attribute 13: (numerical) Age in years

Attribute 14: (qualitative) Other installment plans. A141: bank; A142: stores; A143: none

Attribute 15: (qualitative) Housing. A151: rent; A152: own; A153: for free

Attribute 16: (numerical) Number of existing credits at this bank

Attribute 17: (qualitative) Job. A171: unemployed / unskilled - non-resident; A172: unskilled - resident; A173: skilled employee / official; A174: management / self-employed / highly qualified employee / officer

Attribute 18: (numerical) Number of people being liable to provide maintenance for

Attribute 19: (qualitative) Telephone. A191: none; A192: yes, registered under the customer's name

Attribute 20: (qualitative) Foreign worker. A201: yes; A202: no




The course walks through how a score card model is built, step by step.

**Mathematical principles**

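The course also covers the algorithmic principles and mathematical formulas behind building a credit score card with logistic regression.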
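One common way to turn a logistic regression output into a score is to scale the log-odds linearly, so that a fixed number of points doubles the odds of being a good customer. The sketch below assumes that convention. The base score of 600 at odds of 50:1 and the 20 points to double the odds (PDO) are illustrative values, and the function names are hypothetical; none of them are taken from the course.

```python
import math


def scaling_constants(base_score=600.0, base_odds=50.0, pdo=20.0):
    """Return (factor, offset) for score = offset + factor * ln(odds of good)."""
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    return factor, offset


def probability_to_score(p_bad, base_score=600.0, base_odds=50.0, pdo=20.0):
    """Convert a predicted probability of default into a score."""
    factor, offset = scaling_constants(base_score, base_odds, pdo)
    odds_good = (1.0 - p_bad) / p_bad
    return offset + factor * math.log(odds_good)


# Odds of 50:1 map to the base score; doubling the odds adds PDO points.
print(round(probability_to_score(1 / 51), 1))   # odds 50:1  -> 600.0
print(round(probability_to_score(1 / 101), 1))  # odds 100:1 -> 620.0
```

The linear scaling keeps the ranking produced by the model unchanged and only moves the output onto a range that is easier for business users to read.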

**Data extraction**


**Variable selection**



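Variables can be ranked by importance: in the figure below, the longer a variable's bar, the more important it is. For example, credit amount has the longest bar, so it is the most important variable. Here credit amount is taken to mean the credit limit, the maximum amount the bank commits to lend to a borrower. In general, the better a borrower's reputation and the fewer defaults on record, the higher the credit limit; a borrower who often fails to repay has a poor reputation and a lower limit. Banks like to lend to customers with high credit limits and earn interest from them. Conversely, banks do not like lending to customers with low credit limits, who may fail to repay even the principal, so the bank not only earns no interest but also takes a loss.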



After the model runs, it outputs detailed information, including a statistical analysis of the variables.

**Score card generation**

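The Python script can generate a score card that records in detail which bins each variable has and how many points each bin scores. This gives the business side and managers a reference for decisions.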
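As a rough illustration of what such a score card contains, the sketch below computes the weight of evidence (WOE) of each bin and converts it into points. Everything here is an assumption made for illustration: the bin boundaries and counts, the coefficient `beta`, and the scaling `factor` are invented, and the course's own script may work differently.

```python
import math


def woe(good, bad, total_good, total_bad):
    """Weight of evidence of one bin: ln(share of goods / share of bads)."""
    return math.log((good / total_good) / (bad / total_bad))


def bin_points(bins, beta, factor):
    """Points per bin for one variable: factor * beta * WOE, rounded.

    bins maps a bin label to (good_count, bad_count). beta is the logistic
    regression coefficient of the WOE-encoded variable, for a model that
    predicts the log-odds of being a good customer.
    """
    total_good = sum(g for g, _ in bins.values())
    total_bad = sum(b for _, b in bins.values())
    return {
        label: round(factor * beta * woe(g, b, total_good, total_bad))
        for label, (g, b) in bins.items()
    }


# Hypothetical bins for "Duration of Credit (month)": (good, bad) counts
duration_bins = {
    "<= 12": (300, 60),
    "13-24": (250, 120),
    "> 24": (150, 120),
}
factor = 20 / math.log(2)  # assumed scaling: 20 points double the odds
card = bin_points(duration_bins, beta=1.0, factor=factor)
for label, points in card.items():
    print(label, points)
```

In this made-up example the shortest durations earn positive points and the longest earn negative points, which is the kind of per-bin table a reviewer can read without running the model.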

**Reject inference**

The course covers the important concept of reject inference.

**Model validation**




**Model deployment and monitoring**






The random forest is more accurate than the decision tree:

```
random forest with 1000 trees:
accuracy on the training subset:1.000
accuracy on the test subset:0.772
```

```python
# -*- coding: utf-8 -*-
"""
The author's micro-course on Python credit score card models and data analysis
for financial risk control: https://edu.51cto.com/sd/f2e9b

A random forest does not require the data to be preprocessed.
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

trees = 1000
# Read the Excel file
readFileName = "German_credit.xlsx"
df = pd.read_excel(readFileName)
# The last column is the label; all other columns are features.
# df.ix has been removed from pandas, so select by position with iloc.
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
names = X.columns
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
# n_estimators is the number of trees; in testing, 100 trees were enough
forest = RandomForestClassifier(n_estimators=trees, random_state=0)
forest.fit(x_train, y_train)
print("random forest with %d trees:" % trees)
print("accuracy on the training subset:{:.3f}".format(forest.score(x_train, y_train)))
print("accuracy on the test subset:{:.3f}".format(forest.score(x_test, y_test)))
print('Feature importances:{}'.format(forest.feature_importances_))

# Horizontal bar chart of the feature importances
n_features = X.shape[1]
plt.barh(range(n_features), forest.feature_importances_, align='center')
plt.yticks(np.arange(n_features), names)
plt.title("random forest with %d trees:" % trees)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()
```

Decision tree visualization. Accuracy is not high, and the model overfits severely.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split

# Read the Excel file
readFileName = "German_credit.xlsx"
df = pd.read_excel(readFileName)
# df.ix has been removed from pandas, so select by position with iloc
x = df.iloc[:, :-1]
y = df.iloc[:, -1]
names = x.columns
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# Tune max_depth. Limiting the depth of the tree reduces model complexity.
# Note: as in the original script, each depth is ranked by the average of the
# training and test accuracy, so the test set also takes part in the tuning.
list_average_accuracy = []
depth = range(1, 30)
for i in depth:
    tree = DecisionTreeClassifier(max_depth=i, random_state=0)
    tree.fit(x_train, y_train)
    accuracy_training = tree.score(x_train, y_train)
    accuracy_test = tree.score(x_test, y_test)
    average_accuracy = (accuracy_training + accuracy_test) / 2.0
    list_average_accuracy.append(average_accuracy)

max_value = max(list_average_accuracy)
# List indices start at 0 while depths start at 1, so add 1
best_depth = list_average_accuracy.index(max_value) + 1
print("best_depth:", best_depth)
best_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
best_tree.fit(x_train, y_train)
print("decision tree:")
print("accuracy on the training subset:{:.3f}".format(best_tree.score(x_train, y_train)))
print("accuracy on the test subset:{:.3f}".format(best_tree.score(x_test, y_test)))

# Horizontal bar chart of the feature importances
n_features = x.shape[1]
plt.barh(range(n_features), best_tree.feature_importances_, align='center')
plt.yticks(np.arange(n_features), names)
plt.title("Decision Tree:")
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()

# Write a .dot file that can be rendered to an image later from the command line
export_graphviz(best_tree, out_file="creditTree.dot", class_names=['bad', 'good'],
                feature_names=list(names), impurity=False, filled=True)
'''
best_depth: 12
decision tree:
accuracy on the training subset:0.991
accuracy on the test subset:0.680
'''
```

Support vector machine: the highest prediction accuracy.

```python
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd

# Read the Excel file
readFileName = "German_credit.xlsx"
df = pd.read_excel(readFileName)
# df.ix has been removed from pandas, so select by position with iloc
x = df.iloc[:, :-1]
y = df.iloc[:, -1]
names = x.columns
# random_state acts as the random seed
X_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, random_state=42)
svm = SVC()
svm.fit(X_train, y_train)
print("accuracy on the training subset:{:.3f}".format(svm.score(X_train, y_train)))
print("accuracy on the test subset:{:.3f}".format(svm.score(x_test, y_test)))
'''
accuracy on the training subset:1.000
accuracy on the test subset:0.700
'''

# Check whether the features are on comparable scales
plt.plot(X_train.min(axis=0), 'o', label='Min')
plt.plot(X_train.max(axis=0), 'v', label='Max')
plt.xlabel('Feature Index')
plt.ylabel('Feature magnitude in log scale')
plt.yscale('log')
plt.legend(loc='upper right')
plt.show()

# Standardize the data.
# Note: the training and test sets are scaled independently here, as in the
# original script, and the accuracies below were reported with this code.
X_train_scaled = preprocessing.scale(X_train)
x_test_scaled = preprocessing.scale(x_test)
svm1 = SVC()
svm1.fit(X_train_scaled, y_train)
print("accuracy on the scaled training subset:{:.3f}".format(svm1.score(X_train_scaled, y_train)))
print("accuracy on the scaled test subset:{:.3f}".format(svm1.score(x_test_scaled, y_test)))
'''
accuracy on the scaled training subset:0.867
accuracy on the scaled test subset:0.800
'''

# Tune by changing C. kernel is the kernel function used to map the data;
# probability controls whether class probabilities are computed.
svm2 = SVC(C=10, gamma="auto", kernel='rbf', probability=True)
svm2.fit(X_train_scaled, y_train)
print("after c parameter=10,accuracy on the scaled training subset:{:.3f}".format(svm2.score(X_train_scaled, y_train)))
print("after c parameter=10,accuracy on the scaled test subset:{:.3f}".format(svm2.score(x_test_scaled, y_test)))
'''
after c parameter=10,accuracy on the scaled training subset:0.972
after c parameter=10,accuracy on the scaled test subset:0.716
'''

# Signed distance of each sample to the separating hyperplane
# print(svm2.decision_function(X_train_scaled))
# print(svm2.decision_function(X_train_scaled)[:20] > 0)
# Class labels known to the classifier
# print(svm2.classes_)
# Predicted probability of each class
# print(svm2.predict_proba(x_test_scaled))
# Predicted class of each sample, as 0 or 1
# print(svm2.predict(x_test_scaled))
```
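The scaled SVM above standardizes the training set and the test set separately. A more common practice is to compute the mean and standard deviation on the training set only and reuse them for the test set, so that both sets go through the same transformation. The sketch below shows the idea in plain Python on made-up numbers; `fit_scaler` and `transform` are hypothetical helper names, not scikit-learn functions.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return mean(column), pstdev(column)


def transform(column, mu, sigma):
    """Standardize values with previously learned statistics."""
    return [(v - mu) / sigma for v in column]


# Made-up credit amounts, for illustration only
train_amounts = [1000, 2000, 3000, 4000, 5000]
test_amounts = [3000, 6000]

mu, sigma = fit_scaler(train_amounts)
train_scaled = transform(train_amounts, mu, sigma)
test_scaled = transform(test_amounts, mu, sigma)  # reuse the training statistics
print(mu, round(sigma, 2))
print([round(v, 2) for v in test_scaled])
```

In scikit-learn the same pattern is usually written with `StandardScaler`: call `fit` on the training data, then `transform` on both sets. Results obtained this way may differ from the accuracies reported above.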

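Neural network. Its best accuracy is below that of the support vector machine and the random forest.

```python
# Multi-layer perceptron (MLP)
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd

# The loading, splitting, scaling, and model-definition lines are missing from
# the original post. The lines below are assumptions that follow the earlier
# examples; max_iter and random_state are guesses, so the accuracies reported
# below may not be reproduced exactly.
df = pd.read_excel("German_credit.xlsx")
x = df.iloc[:, :-1]
y = df.iloc[:, -1]
# random_state acts as the random seed
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, random_state=42)
mlp = MLPClassifier(random_state=42).fit(x_train, y_train)

scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
mlp_scaled = MLPClassifier(max_iter=1000, random_state=42).fit(x_train_scaled, y_train)
mlp_scaled2 = MLPClassifier(max_iter=1000, alpha=1, random_state=42).fit(x_train_scaled, y_train)

print("neural network:")
print("accuracy on the training subset:{:.3f}".format(mlp.score(x_train, y_train)))
print("accuracy on the test subset:{:.3f}".format(mlp.score(x_test, y_test)))
print("neural network after scaled:")
print("accuracy on the training subset:{:.3f}".format(mlp_scaled.score(x_train_scaled, y_train)))
print("accuracy on the test subset:{:.3f}".format(mlp_scaled.score(x_test_scaled, y_test)))
print("neural network after scaled and alpha change to 1:")
print("accuracy on the training subset:{:.3f}".format(mlp_scaled2.score(x_train_scaled, y_train)))
print("accuracy on the test subset:{:.3f}".format(mlp_scaled2.score(x_test_scaled, y_test)))

# Heat map of the first-layer weights (the plotting call itself is assumed)
plt.imshow(mlp_scaled2.coefs_[0], interpolation='none', cmap='viridis')
plt.yticks(range(x.shape[1]), x.columns)
plt.xlabel("columns in weight matrix")
plt.ylabel("input feature")
plt.colorbar()
plt.show()
```

```
neural network:
accuracy on the training subset:0.700
accuracy on the test subset:0.700
neural network after scaled:
accuracy on the training subset:1.000
accuracy on the test subset:0.704
neural network after scaled and alpha change to 1:
accuracy on the training subset:0.916
accuracy on the test subset:0.720
```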



```
AUC: 0.8134
ACC: 0.7720
Recall: 0.9521
F1-score: 0.8480
Precision: 0.7644
```

```python
import xgboost as xgb
# sklearn.cross_validation has been removed; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd

# The data-loading lines are missing from the original post; they are assumed
# here to match the earlier examples.
df = pd.read_excel("German_credit.xlsx")
x = df.iloc[:, :-1]
y = df.iloc[:, -1]
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state=0)

# DMatrix construction is also missing from the original and is assumed
dtrain = xgb.DMatrix(train_x, label=train_y)
dtest = xgb.DMatrix(test_x)

# Only these parameters survive in the original post
params = {
    # 'objective': 'reg:linear',
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'eta': 0.025,
}
watchlist = [(dtrain, 'train')]
# num_boost_round is an assumption; the original value is not shown
bst = xgb.train(params, dtrain, num_boost_round=100, evals=watchlist)
ypred = bst.predict(dtest)

# Apply a threshold and print some evaluation metrics
y_pred = (ypred >= 0.5) * 1
print('AUC: %.4f' % metrics.roc_auc_score(test_y, ypred))
print('ACC: %.4f' % metrics.accuracy_score(test_y, y_pred))
print('Recall: %.4f' % metrics.recall_score(test_y, y_pred))
print('F1-score: %.4f' % metrics.f1_score(test_y, y_pred))
print('Precision: %.4f' % metrics.precision_score(test_y, y_pred))
# print("accuracy on the training subset:{:.3f}".format(bst.get_score(train_x,train_y)))
# print("accuracy on the test subset:{:.3f}".format(bst.get_score(test_x,test_y)))
print('Feature importances:{}'.format(bst.get_fscore()))
```

```
AUC: 0.8135
ACC: 0.7640
Recall: 0.9641
F1-score: 0.8451
Precision: 0.7523
Feature importances:{'Account Balance': 80, 'Duration of Credit (month)': 119,
 'Most valuable available asset': 54, 'Payment Status of Previous Credit': 84,
 'Value Savings/Stocks': 66, 'Age (years)': 94, 'Credit Amount': 149,
 'Type of apartment': 20, 'Instalment per cent': 37,
 'Length of current employment': 70, 'Sex & Marital Status': 29,
 'Purpose': 67, 'Occupation': 13, 'Duration in Current address': 25,
 'Telephone': 15, 'Concurrent Credits': 23, 'No of Credits at this Bank': 7,
 'Guarantors': 28, 'No of dependents': 6}
```

xgboost's feature importance analysis is sometimes even more accurate than the random forest's, which shows how powerful it is.
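To read the ranking off the `get_fscore()` output more easily, the dictionary can be sorted by weight. The sketch below copies the values printed above; `rank_features` is a hypothetical helper name introduced only for this illustration.

```python
# Feature weights copied from the get_fscore() output above
fscore = {
    'Account Balance': 80, 'Duration of Credit (month)': 119,
    'Most valuable available asset': 54, 'Payment Status of Previous Credit': 84,
    'Value Savings/Stocks': 66, 'Age (years)': 94, 'Credit Amount': 149,
    'Type of apartment': 20, 'Instalment per cent': 37,
    'Length of current employment': 70, 'Sex & Marital Status': 29,
    'Purpose': 67, 'Occupation': 13, 'Duration in Current address': 25,
    'Telephone': 15, 'Concurrent Credits': 23, 'No of Credits at this Bank': 7,
    'Guarantors': 28, 'No of dependents': 6,
}


def rank_features(scores, top=5):
    """Return the `top` (feature, weight) pairs with the largest weights."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]


for name, weight in rank_features(fscore):
    print(f"{name}: {weight}")
```

Sorted this way, credit amount, duration of credit, and age come first, followed by payment status of previous credit and account balance.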

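| Top factors in the random forest ranking | xgboost weight |
|---|---|
| Credit amount | 149 |
| Age | 94 |
| Account balance | 80 |
| Duration of credit | 119 |

(The overdue period allowed on a credit card varies by bank. China Merchants Bank, for example, suspends the card after two months.)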

