Risk Control Model and Algorithm Based on AP-Entropy Selection Ensemble

Computer Science, 2021, Vol. 48, Issue (11A): 71-76. doi: 10.11896/jsjkx.210200110

• Intelligent Computing

WANG Mao-guang, YANG Hang

  1. School of Information, Central University of Finance and Economics, Beijing 100081, China
  • Online: 2021-11-10 Published: 2021-11-12
  • Corresponding author: YANG Hang (yanghangv@163.com)
  • About author: WANG Mao-guang (wangmg@cufe.edu.cn), born in 1974, Ph.D., professor. His main research interests include intelligent risk control models and algorithms, big data, and intelligent software engineering.
    YANG Hang, born in 1997, postgraduate. Her main research interests include intelligent risk control models and algorithms.

Abstract: In recent years, many risk control problems have emerged in Internet finance, particularly in online lending. To address them, we apply several feature selection methods to preprocess risk control data indicators and construct a comprehensive risk control indicator system for corporate credit. We then use a stacking ensemble strategy to study a credit risk model based on AP-Entropy. The model has two layers of learners and introduces the idea of selective ensemble to screen base learners by both type and number. First, among classical machine learning algorithms such as logistic regression, back-propagation neural networks, and AdaBoost, the affinity propagation (AP) clustering algorithm selects heterogeneous learners suited to corporate credit risk as base learners. Second, in each learner iteration, entropy is used to pick the best learners, automatically selecting the base learner with the highest F1 score. The entropy-based learner selection algorithm is improved so that base learner selection is more efficient and the model's computational cost is lower. XGBoost is chosen as the second-level learner. Experimental results show that the proposed model achieves better learning performance and stronger generalization ability than the other models compared.

Key words: Affinity propagation (AP) clustering algorithm, AP-Entropy credit risk model, Entropy-based learner selection algorithm, Risk control indicator system, Selective ensemble, Stacking ensemble strategy, XGBoost

CLC Number:

  • TP311
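
The abstract describes choosing base learners by F1 score and entropy but does not spell out the procedure. The following is a minimal sketch of one plausible reading, not the authors' improved algorithm. It rests on several assumptions: entropy is taken to be the Kuncheva-Whitaker entropy diversity measure, selection is a greedy forward search seeded with the highest-F1 learner, and the ensemble is scored by a simple majority vote rather than by the XGBoost second-level learner. The learner names and toy predictions are hypothetical.

```python
from math import ceil


def f1_score(y_true, y_pred):
    """Binary F1 for labels in {0, 1}; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def entropy_diversity(y_true, preds):
    """Kuncheva-Whitaker entropy measure in [0, 1] (assumed diversity metric).

    0 means all learners agree on every sample; 1 means maximal disagreement.
    """
    n_learners = len(preds)
    if n_learners < 2:
        return 0.0
    denom = n_learners - ceil(n_learners / 2)
    total = 0.0
    for j, truth in enumerate(y_true):
        correct = sum(1 for p in preds if p[j] == truth)
        total += min(correct, n_learners - correct) / denom
    return total / len(y_true)


def majority_vote(preds):
    """Majority vote over binary predictions; ties go to the positive class."""
    n_learners = len(preds)
    n_samples = len(preds[0])
    return [1 if 2 * sum(p[j] for p in preds) >= n_learners else 0
            for j in range(n_samples)]


def select_learners(y_true, candidates):
    """Greedy selection: seed with the best-F1 learner, then repeatedly add the
    learner that maximizes entropy diversity, stopping once ensemble F1 drops."""
    names = sorted(candidates,
                   key=lambda n: f1_score(y_true, candidates[n]),
                   reverse=True)
    selected = [names[0]]
    best_f1 = f1_score(y_true, candidates[names[0]])
    remaining = names[1:]
    while remaining:
        pick = max(
            remaining,
            key=lambda n: entropy_diversity(
                y_true, [candidates[m] for m in selected + [n]]),
        )
        trial = selected + [pick]
        trial_f1 = f1_score(
            y_true, majority_vote([candidates[m] for m in trial]))
        if trial_f1 < best_f1:
            break
        selected, best_f1 = trial, trial_f1
        remaining.remove(pick)
    return selected, best_f1


# Hypothetical validation labels and per-learner predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
candidates = {
    "logistic": [1, 1, 0, 0, 0, 0, 1, 0],
    "bp_nn":    [1, 1, 1, 0, 0, 1, 0, 0],
    "adaboost": [1, 1, 0, 0, 0, 0, 1, 0],
}
selected, ensemble_f1 = select_learners(y_true, candidates)
print(selected, round(ensemble_f1, 3))
```

In this toy run the redundant learner is left out: it makes exactly the same predictions as the seed learner, so it adds no diversity and lowers the voted F1 when appended. In a full stacking pipeline the predictions of the selected learners would instead become the input features of the second-level learner.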