分位数分箱python python 分箱函数

转载

mob64ca13fc220d 2023-10-29 21:38:16

文章标签 分位数分箱python python 机器学习数据挖掘数据 文章分类 Python 后端开发

《Python金融大数据风控建模实战》第6章变量分箱方法

本章引言
Python代码实现及注释

本章引言

变量分箱是一种特征工程方法，意在增强变量的可解释性与预测能力。变量分箱方法主要用于连续变量，对于变量取值较稀疏的离散变量也应该进行分箱处理。
变量分箱对模型的好处：

降低异常值的影响，增强模型的稳定性
数据中存在异常值会使模型产生一定的偏差，从而影响预测效果。通过分箱模型可以降低异常值的噪声特性，使模型更稳健。树模型对异常值不敏感，但Logistic回归模型和神经网络对异常值敏感。
缺失值作为特殊变量参与分箱，减少缺失值填补的不确定性。
缺失值造成的原因不可追溯，插补方法也不尽相同，但如果能将缺失值作为一种特征，则会免去主观填充带来的不确定性问题，以增加模型的稳定性。而分箱方法可以将缺失值作为特殊值参与分箱处理。通常的做法是，离散特征将缺失值转为字符串作为特殊字符即可，而连续特征将缺失值作为特殊值即可，这样缺失值将作为一个特征参与分箱。
增加变量的可解释性
分箱的方法往往要配合变量编码使用，这就大大提高了变量的可解释性。通常采用的编码方式为WOE编码。本章将介绍的分箱方法有Chi-megerd方法、Best-KS方法、IV最优分箱方法和基于树的最优分箱方法。
增加变量的非线性
由于分箱后会采用编码操作，常用的编码方式有WOE编码、哑变量编码和One-hot编码。对于WOE编码，编码的计算结果与目标变量相关，会产生非线性的结果，而采用哑变量编码或One-hot编码，会使编码后的变量比原始变量获得更多的权重，并且这些权重是不同的，因此增加了模型的非线性。
增加模型的预测效果
从统计学角度考虑，机器学习模型在训练时会将数据划分为训练集和测试集，通常假设训练集和测试集是服从同分布的，分箱操作使连续变量离散化，使得训练集和测试集更容易满足这种假设。因此，分箱会增加模型预测效果的稳定性，即会减少模型在训练集上的表现与测试集上的偏差。

使用分箱的局限如下：

同一箱内的样本具有同质性
分箱的基本假设是分在一个箱内样本具有相同的风险等级。对于树模型就减少了模型选择最优切分点的可选择范围，会对模型的预测能力产生影响，损失了模型的分辨能力。
需要专家经验支持
一个变量怎样分箱对结果的影响是不同的，需要专家经验进行分箱指导，这往往非常耗时。本章介绍的均是自动分箱方法，它的好处是可以减少人工干预，但对专家的经验知识却没有过多体现。如果有成体系的变量分箱经验，可以在自动分箱时设置切分点，使其在候选集中即可在结合经验的基础上完成自动分箱。

变量分箱需要注意的问题：

分箱结果不宜过多
因为分箱后需要用编码的方式进行数值转化，转化的方式为WOE编码或One-hot编码。当采用WOE编码时，如果分箱过多会造成好样本或坏样本在每个箱内的分布不均，造成某个箱内几乎没有分布，使得样本较少的箱内其代表性不足。当采用One-hot编码时，由于分箱过多导致变量过于稀疏，编码后的变量维度快速增加，使变量更加稀疏，会降低模型的预测效果，后续章节会讨论稀疏特征下的变量组合，以增加模型的预测效果。
分箱结果不易过少
由于每个箱内的变量默认是同质的，即风险等级相同，如果分箱过少，则会造成模型的辨识度过低。
分箱后单调性的要求
分箱单调性是指分箱后的WOE值随着分箱索引的增加而呈现增加或减少的趋势。分箱单调是为了让Logistic回归模型可以得到更好的预测结果（线性特征更容易学习），但是往往有的变量自身就是U形结构，不太好通过分箱的方式让数据达到单调的效果（当然也可以通过分箱合并的方式达到近似单调的效果），这时候只是Logistic回归模型可能效果不佳，但是更复杂的算法是可以学习到这种规则的。

变量分箱主要是对连续变量进行离散化，然后通过编码转化为数值特征。此外，如果离散变量过于稀疏，可以先用坏样本比率转为数值，将其作为连续变量执行分箱操作。

Python代码实现及注释

# 第6章 变量分箱方法

'''
程序运行逻辑：数据读取->划分训练集与测试集->在训练集上得到分箱规则（连续变量与离散变量分开计算）->对训练集原始数据进行分箱映射
->测试集数据分箱映射
用到的函数：
    data_read:数据读取函数
    cont_var_bin：连续变量分箱
    cont_var_bin_map：连续变量分箱映射函数，将cont_var_bin函数分箱规则应用到原始连续数据上
    disc_var_bin：离散变量分箱
    disc_var_bin_map：离散变量分箱映射函数，将disc_var_bin函数分箱规则应用到原始离散数据上
1:Chi-merge(卡方分箱), 2:IV(最优IV值分箱), 3:信息熵(基于树的分箱)
'''

'''
os是Python环境下对文件，文件夹执行操作的一个模块
这里是采用的是scikit-learn的model_selection模块中的train_test_split()函数实现数据切分
'''
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore") ##忽略警告

def data_read(data_path,file_name):
    '''
    csv文件是一种用,和换行符区分数据记录和字段的一种文件结构，可以用excel表格编辑，也可以用记事本编辑，是一种类excel的数据存
    储文件，也可以看成是一种数据库。pandas提供了pd.read_csv()方法可以读取其中的数据并且转换成DataFrame数据帧。python的强大
    之处就在于他可以把不同的数据库类型，比如txt/csv/.xls/.sql转换成统一的DataFrame格式然后进行统一的处理。真是做到了标准化。
    pd.read_csv()函数参数：
        os.path.join()函数：连接两个或更多的路径名组件
        sep：如果不指定参数，则会尝试使用逗号分隔。
        delimiter ：定界符，备选分隔符（如果指定该参数，则sep参数失效）
        delim_whitespace ： 指定空格是否作为分隔符使用，等效于设定sep=’\s+’。如果这个参数设定为True那么delimiter 参数失效。
        header ：指定行数用来作为列名，数据开始行数。如果文件中没有列名，则默认为0【第一行数据】，否则设置为None。
    '''
    df = pd.read_csv( os.path.join(data_path, file_name), delim_whitespace = True, header = None )

    # 变量重命名
    columns = ['status_account','duration','credit_history','purpose', 'amount',
               'svaing_account', 'present_emp', 'income_rate', 'personal_status',
               'other_debtors', 'residence_info', 'property', 'age',
               'inst_plans', 'housing', 'num_credits',
               'job', 'dependents', 'telephone', 'foreign_worker', 'target']

    '''
    修改列名的两种方式为：
        直接使用df.columns的方式重新命名，不过这种方式需要列出所有列名。
        使用rename方法，注意如果需要原地修改需要带上inplace=True的参数，否则原dataframe列名不会发生改变。
    '''
    df.columns = columns

    # 将标签变量由状态1,2转为0,1;0表示好用户，1表示坏用户
    df.target = df.target - 1

    '''
    数据分为data_train和 data_test两部分，训练集用于得到编码函数，验证集用已知的编码规则对验证集编码。
    这里是采用的是scikit-learn的model_selection模块中的train_test_split()函数实现数据切分，函数原型为：
        sklearn.model_selection.train_test_split(*arrays, **options)
    主要参数说明：
        arrays：为需要切分的原始数据，可以是列表、Numpy arrays、稀疏矩阵、pandas的数据框。
        test_size：划分的测试数据的占比，为0-1的数，默认为0.25，即训练数据为原始数据的75%，测试数据为原始数据的25%。
        train_size：与test_size设置一个参数即可，并满足加和为1的关系。
        random_state:随机数设置，可以保证每次切分得到的数据是相同的，这样在比较不用算法的性能时更加严谨，保证了数据集的一致性。
                     如果不设置，每次将随机选择随机数，产生不同的切分结果。  
        shuffle:是否在切分前打乱数据原有的顺序，默认为进行随机洗牌。
        stratify：设置是否采用分层抽样，默认为none，不分层。分层抽样可以保证正负样本的比例与原始的数据集一致。如果设置为none，
                  则切分时采用随机采样方式。如果需要进行分层采样，则需要指定按哪个变量分层，一般按照标签进行采样。  
                  如在本程序中，使用target标签进行采样。
    '''
    data_train, data_test = train_test_split(df, test_size=0.2, random_state=0,stratify=df.target)

    return data_train, data_test

def cal_advantage(temp, piont, method,flag='sel'):
    '''
    计算当前切分点下的指标值
    参数:
        temp: 上一步的分箱结果，pandas dataframe
        piont: 切分点，以此来划分分箱
        method: 分箱方法选择，1:chi-merge , 2:IV值, 3:信息熵
    '''

#    temp = binDS
    if flag == 'sel':
        # 用于最优切分点选择，这里只是二叉树，即二分
        bin_num = 2

        '''
        numpy.empty(shape, dtype=float, order=‘C’)
        根据给定的维度和数值类型返回一个新的数组，其元素不进行初始化。
        参数：shape：整数或者整数组成的元组	
                     功能：空数组的维度，例如：(2, 3)或者2
              dtype：数值类型，可选参数	
                     功能：指定输出数组的数值类型，例如numpy.int8。默认为numpy.float64。
              order：{‘C’, ‘F’}，可选参数
              	     功能：是否在内存中以C或fortran(行或列)顺序存储多维数据
        下面这行代码返回行为bin_num，列为3的矩阵
        '''
        good_bad_matrix = np.empty((bin_num, 3))

        for ii in range(bin_num):
            if ii==0:

                '''
                temp: 上一步的分箱结果，pandas dataframe
                ii=0时，df_temp_1是temp中'bin_raw'<= point的结果
                ii=1时，df_temp_1是temp中'bin_raw'>point的结果
                '''
                df_temp_1 = temp[temp['bin_raw'] <= piont]
            else:
                df_temp_1 = temp[temp['bin_raw'] > piont]

            '''
            计算每个箱内的好坏样本书
            good_bad_matrix[0][0] = df_temp_1['good'].sum()
            good_bad_matrix[0][1] = df_temp_1['bad'].sum()
            good_bad_matrix[0][2] = df_temp_1['total'].sum()
            good_bad_matrix[1][0] = df_temp_1['good'].sum()
            good_bad_matrix[1][1] = df_temp_1['bad'].sum()
            good_bad_matrix[1][2] = df_temp_1['total'].sum()
            '''
            good_bad_matrix[ii][0] = df_temp_1['good'].sum()
            good_bad_matrix[ii][1] = df_temp_1['bad'].sum()
            good_bad_matrix[ii][2] = df_temp_1['total'].sum()

    elif flag == 'gain':
       '''
       用于计算本次分箱后的指标结果，即分箱数，每增加一个，就要算一下当前分箱下的指标结果
       bin_num的取值为temp['bin'].max()
       '''
       bin_num = temp['bin'].max()
       good_bad_matrix = np.empty((bin_num, 3))
       for ii in range(bin_num):

           '''
           df_temp_1 = temp[temp['bin'] == 1]
           df_temp_1 = temp[temp['bin'] == 2]
           ......
           df_temp_1 = temp[temp['bin'] == (ii +1)]
           '''
           df_temp_1 = temp[temp['bin'] == (ii + 1)]
           good_bad_matrix[ii][0] = df_temp_1['good'].sum()
           good_bad_matrix[ii][1] = df_temp_1['bad'].sum()
           good_bad_matrix[ii][2] = df_temp_1['total'].sum()
       
    # 计算总样本中的好坏样本
    total_matrix = np.empty(3)
    total_matrix[0] = temp.good.sum()
    total_matrix[1] = temp.bad.sum()
    total_matrix[2] = temp.total.sum()
    
    # method ==1,表示Chi-merger分箱
    if method == 1:
        X2 = 0

        # i=0,1
        for i in range(bin_num):
            # j=0,1
            for j in range(2):
                '''
                expect = (total_matrix[0]/ total_matrix[2])*good_bad_matrix[0][2]
                expect = (total_matrix[1]/ total_matrix[2])*good_bad_matrix[0][2]
                expect = (total_matrix[0]/ total_matrix[2])*good_bad_matrix[1][2]
                expect = (total_matrix[1]/ toral_matrix[2])*good_bad_matrix[1][2]
                '''
                expect = (total_matrix[j] / total_matrix[2])*good_bad_matrix[i][2]
                X2 = X2 + (good_bad_matrix[i][j] - expect )**2/expect
        M_value = X2

    # IV分箱
    elif method == 2:
        '''
        total_matrix[0]表示总的好样本的个数，total_matrix[1]表示总的坏样本的个数
        '''
        if pd.isnull(total_matrix[0]) or  pd.isnull(total_matrix[1]) or total_matrix[0] == 0 or total_matrix[1] == 0:
            M_value = np.NaN
        else:
            IV = 0
            for i in range(bin_num):
                ##坏好比
                weight = good_bad_matrix[i][1] / total_matrix[1] - good_bad_matrix[i][0] / total_matrix[0]
                IV = IV + weight * np.log( (good_bad_matrix[i][1] * total_matrix[0]) / (good_bad_matrix[i][0] * total_matrix[1]))
            M_value = IV

    # 信息熵分箱
    elif method == 3:
        # 总的信息熵
        entropy_total = 0
        for j in range(2):

            '''
            total_matrix[0]表示总的好样本的个数，total_matrix[1]表示总的坏样本的个数,total_matrix[2]表示总的样本个数
            '''
            weight = (total_matrix[j]/ total_matrix[2])
            entropy_total = entropy_total - weight * (np.log(weight))
                    
        # 计算条件熵
        entropy_cond = 0
        for i in range(bin_num):
            entropy_temp = 0
            for j in range(2):
                entropy_temp = entropy_temp - ((good_bad_matrix[i][j] / good_bad_matrix[i][2])
                                         * np.log(good_bad_matrix[i][j] / good_bad_matrix[i][2]) )
            entropy_cond = entropy_cond + good_bad_matrix[i][2]/total_matrix[2] * entropy_temp 
        
        # 计算归一化信息增益
        M_value = 1 - (entropy_cond / entropy_total)

    # Best-Ks分箱
    else:
        pass
    return M_value

def best_split(df_temp0, method, bin_num):
    '''
    在每个候选集中寻找切分点，完成一次分裂。
    select_split_point函数的中间过程函数
    参数:
        df_temp0: 上一次分箱后的结果，pandas dataframe
        method: 分箱方法选择，1:chi-merge , 2:IV值, 3:信息熵
        bin_num: 分箱编号，在不同编号的分箱结果中继续二分
    返回值:
        返回在本次分箱标号内的最优切分结果， pandas dataframe
    '''

#    df_temp0 = df_temp
#    bin_num = 1

    '''
    DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')  
    参数：
        by:str or list of str；
        axis:{0 or ‘index’, 1 or ‘columns’}, default 0，默认按照索引排序，即纵向排序，如果为1，则是横向排序    
        ascending:布尔型，True则升序，可以是[True,False]，即第一字段升序，第二个降序  
        inplace:布尔型，是否用排序后的数据框替换现有的数据框  
        kind:排序方法，{‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’。似乎不用太关心  
        na_position : {‘first’, ‘last’}, default ‘last’，默认缺失值排在最后面  
    '''
    df_temp0 = df_temp0.sort_values(by=['bin', 'bad_rate'])
    piont_len = len(df_temp0[df_temp0['bin'] == bin_num])  # 候选集的长度
    bestValue = 0
    bestI = 1

    # 以候选集的每个切分点做分隔，计算指标值
    for i in range(1, piont_len):

        # 计算指标值
        value = cal_advantage(df_temp0,i,method,flag='sel')
        if bestValue < value:
            bestValue = value
            bestI = i

    # create new var split
    '''
    1.np.where(condition, x, y)
        当where内有三个参数时，第一个参数表示条件，当条件成立时where方法返回x，当条件不成立时where返回y
    2.np.where(condition)
        当where内只有一个参数时，那个参数表示条件，当条件成立时，where返回的是每个符合condition条件元素的坐标, 返回的是以元组的形式
    3.多条件时condition， & 表示与， | 表示或。如a = np.where((0 < a) & (a < 5), x, y)，当0 < a与a < 5满足时，返回x的值，当0 < a与a < 5
    不满足时，返回y的值。注意x, y必须和a保持相同尺寸。
    '''
    df_temp0['split'] = np.where(df_temp0['bin_raw'] <= bestI, 1, 0)

    '''
    DataFrame.drop(labels=None,axis=0, index=None, columns=None, inplace=False)
    参数说明：
        labels:就是要删除的行列的名字，用列表给定
        axis:默认为0，指删除行，因此删除columns时要指定axis=1；
        index:直接指定要删除的行
        columns:直接指定要删除的列
        inplace=False:默认该删除操作不改变原数据，而是返回一个执行删除操作后的新dataframe；
        inplace=True:则会直接在原数据上进行删除操作，删除后无法返回。
    '''
    df_temp0 = df_temp0.drop('bin_raw', axis=1)

    newbinDS = df_temp0.sort_values(by=['split', 'bad_rate'])

    # rebuild var i
    '''
    df_temp0['split'] = np.where(df_temp0['bin_raw'] <= bestI, 1, 0)
    newbinDS_0 为>bestI
    newbinDS_0 为<=bestI
    '''
    newbinDS_0 = newbinDS[newbinDS['split'] == 0]
    newbinDS_1 = newbinDS[newbinDS['split'] == 1]

    '''
    copy()与deepcopy()之间的区分必须要涉及到python对于数据的存储方式
    我们寻常意义的复制就是深复制，即将被复制对象完全再复制一遍作为独立的新个体单独存在。所以改变原有被复制对象不会对已经复制出
    来的新对象产生影响。 
    而浅复制并不会产生一个独立的对象单独存在，他只是将原有的数据块打上一个新标签，所以当其中一个标签被改变的时候，数据块就会发
    生变化，另一个标签也会随之改变。这就和我们寻常意义上的复制有所不同了。
    '''
    newbinDS_0 = newbinDS_0.copy()
    newbinDS_1 = newbinDS_1.copy()

    newbinDS_0['bin_raw'] = range(1, len(newbinDS_0) + 1)
    newbinDS_1['bin_raw'] = range(1, len(newbinDS_1) + 1)

    '''
    pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,keys=None, levels=None, names=None,
              verify_integrity=False)
    参数：
        objs:用来保存需要用来进行连接的Series/DataFrame，可以是列表或者dict类型 
        axis：表示希望进行连接的轴向，默认为0，也就是纵向拼接 
        join：有多个选择，inner,outer,这里默认值是outer,下面会根据实例来比较下 
        join_axes：默认为空，可以设置值指定为其他轴上使用的索引 
        ignore_index：连接后原来两个DF的index值会被保存，如果该索引没有实际的意义可以设置为True来进行重分配index号
    '''
    newbinDS = pd.concat([newbinDS_0, newbinDS_1], axis=0)

    return newbinDS  

def select_split_point(temp_bin, method):
    '''
    二叉树分割方式，从候选者中挑选每次的最优切分点，与切分后的指标计算,cont_var_bin函数的中间过程函数
    参数:temp_bin:分箱后的结果 pandas dataframe
         method:分箱方法选择，1:chi-merge , 2:IV值, 3:信息熵
    返回值:新的分箱结果  pandas dataframe
    '''

#    temp_bin = df_temp_all
    '''
    DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')  
    参数：
        by:str or list of str；
        axis:{0 or ‘index’, 1 or ‘columns’}, default 0，默认按照索引排序，即纵向排序，如果为1，则是横向排序    
        ascending:布尔型，True则升序，可以是[True,False]，即第一字段升序，第二个降序  
        inplace:布尔型，是否用排序后的数据框替换现有的数据框  
        kind:排序方法，{‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’。似乎不用太关心  
        na_position : {‘first’, ‘last’}, default ‘last’，默认缺失值排在最后面  
    '''
    temp_bin = temp_bin.sort_values(by=['bin', 'bad_rate'])

    # 得到最大的分箱值
    max_num = max(temp_bin['bin'])

#    temp_binC = dict()
#    m = dict()
#    # 不同箱内的数据取出来
#    for i in range(1, max_num + 1):
#        temp_binC[i] = temp_bin[temp_bin['bin'] == i]
#        m[i] = len(temp_binC[i])

    '''
    dict() 函数用于创建一个字典。返回一个字典。
    '''
    temp_main = dict()

    bin_i_value = []
    for i in range(1, max_num + 1):

        df_temp = temp_bin[temp_bin['bin'] == i]
        if df_temp.shape[0]>1 :

            # bin=i的做分裂
            temp_split= best_split(df_temp, method, i)

            # 完成一次分箱，更新bin的枝
            temp_split['bin'] = np.where(temp_split['split'] == 1,
                                               max_num + 1,
                                               temp_split['bin'])

            # 取出bin!=i合并为新组
            temp_main[i] = temp_bin[temp_bin['bin'] != i]
            temp_main[i] = pd.concat([temp_main[i], temp_split ], axis=0, sort=False)

            # 计算新分组的指标值
            value = cal_advantage(temp_main[i],0, method,flag='gain')
            newdata = [i, value]
            bin_i_value.append(newdata)

    # find maxinum of value bintoSplit
    bin_i_value.sort(key=lambda x: x[1], reverse=True)

    # binNum = temp_all_Vals['BinToSplit']
    binNum = bin_i_value[0][0]
    newBins = temp_main[binNum].drop('split', axis=1)
    return newBins.sort_values(by=['bin', 'bad_rate']), round( bin_i_value[0][1] ,4)


def init_equal_bin(x,bin_rate):

    '''
    初始化等距分组，cont_var_bin函数的中间过程函数
    参数：
        x:要分组的变量值，pandas series
        bin_rate：比例值1/bin_rate
    返回值：
        返回初始化分箱结果，pandas dataframe
    '''

    # 异常值剔除，只考虑95%的最大值与最小值，边界与-inf或inf分为一组
    '''
    分位数是统计中使用的度量，表示小于这个值的观察值占总数q的百分比。 函数numpy.percentile()接受以下参数。
    numpy.percentile(a, q, axis)
    参数：
        a：输入数组
        q：要计算的百分位数，在 0 ~ 100 之间
        axis：沿着它计算百分位数的轴   ，二维取值0，1
    如果x>np.percentile(x,95)的x个数，并且x的个数>=30,则var_up取其中x>np,percentile(x,95)最小值,否则直接取最大值
    如过x<np.percentile(x,5)的x个数>0，则var_low取其中x<np.percentile(x,5)最大值，苟泽直接取最小值
    '''
    if len(x[x > np.percentile(x, 95)]) > 0 and len(np.unique(x)) >=30:
        var_up= min( x[x > np.percentile(x, 95)] )
    else:
        var_up = max(x)

    if len(x[x < np.percentile(x, 5)]) > 0:
        var_low= max( x[x < np.percentile(x, 5)] )
    else:
        var_low = min(x)

    # 初始化分组
    bin_num = int(1/ bin_rate)
    dist_bin = (var_up - var_low) / bin_num  # 分箱间隔
    bin_up = []
    bin_low = []

    '''
    第一组和最后一组分开处理
    '''
    for i in range(1, bin_num + 1):
        if i == 1:
            bin_up.append( var_low + i * dist_bin)

            '''
            np.Inf:正无穷大的浮点表示,常用于数值比较当中的初始值
            '''
            bin_low.append(-np.inf)
        elif i == bin_num:
            bin_up.append( np.inf)
            bin_low.append( var_low + (i - 1) * dist_bin )
        else:
            bin_up.append( var_low + i * dist_bin )
            bin_low.append( var_low + (i - 1) * dist_bin )
    result = pd.DataFrame({'bin_up':bin_up,'bin_low':bin_low})
    result.index.name = 'bin_num'
    return result

def limit_min_sample(temp_cont,  bin_min_num_0):
    '''
    分箱约束条件：每个箱内的样本数不能小于bin_min_num_0，cont_var_bin函数的中间过程函数
    参数:
        temp_cont: 初始化分箱后的结果 pandas dataframe
        bin_min_num_0:每组内的最小样本限制
    返回值：
        合并后的分箱结果，pandas dataframe
    '''
    for i in temp_cont.index:

        '''
        行数据=temp_cont.loc[i, :]
        '''
        rowdata = temp_cont.loc[i, :]
        if i == temp_cont.index.max():

            # 如果是最后一个箱就，取倒数第二个值
            ix = temp_cont[temp_cont.index < i].index.max()
        else:

            # 否则就取大于i的最小的分箱值
            ix = temp_cont[temp_cont.index > i].index.min()

        # 如果0, 1, total项中样本的数量小于20则进行合并
        if rowdata['total'] <= bin_min_num_0:

            # 与相邻的bin合并
            temp_cont.loc[ix, 'bad'] = temp_cont.loc[ix, 'bad'] + rowdata['bad']
            temp_cont.loc[ix, 'good'] = temp_cont.loc[ix, 'good'] + rowdata['good']
            temp_cont.loc[ix, 'total'] = temp_cont.loc[ix, 'total'] + rowdata['total']
            if i < temp_cont.index.max():
                temp_cont.loc[ix, 'bin_low'] = rowdata['bin_low']
            else:
                temp_cont.loc[ix, 'bin_up'] = rowdata['bin_up']
            temp_cont = temp_cont.drop(i, axis=0)  
    return temp_cont.sort_values(by='bad_rate')

def cont_var_bin_map(x, bin_init):
    '''
    按照初始化分箱结果，对原始值进行分箱映射,用于训练集与测试集的分箱映射
    '''

    temp = x.copy()
    for i in bin_init.index:
        bin_up = bin_init['bin_up'][i]
        bin_low = bin_init['bin_low'][i]

        # 寻找出 >lower and <= upper的位置
        if pd.isnull(bin_up) or pd.isnull(bin_up):
            temp[pd.isnull(temp)] = i
        else:
            index = (x > bin_low) & (x <= bin_up)
            temp[index] = i
    temp.name = temp.name + "_BIN"
    return temp

def merge_bin(sub, i):

    '''
    将相同箱内的样本书合并，区间合并
    参数:
        sub:分箱结果子集，pandas dataframe ，如bin=1的结果
        i: 分箱标号
    返回值:
        返回合并结果
    '''

    l = len(sub)
    total = sub['total'].sum()

    '''
    loc——通过行标签索引行数据 
    iloc——通过行号索引行数据 
    ix——通过行标签或者行号索引行数据（基于loc和iloc 的混合） 
    '''
    first = sub.iloc[0, :]
    last = sub.iloc[l - 1, :]

    lower = first['bin_low']
    upper = last['bin_up']
    df = pd.DataFrame()
    df = df.append([i, lower, upper, total], ignore_index=True).T
    df.columns = ['bin', 'bin_low', 'bin_up', 'total']
    return df


def cont_var_bin(x, y, method, mmin=5, mmax=10, bin_rate=0.01, stop_limit=0.1, bin_min_num=20):
    '''
    连续变量分箱函数原型如下：
        cont_var_bin(x,y,method,mmin=5,mmax=10,bin_rate=0.01,stop_limit=0.1,bin_min_num=20)
    参数:
        x:输入分箱数据，pandas series
          （待分箱变量）
        y:标签变量
          （目标向量）
        method:分箱方法选择，1:chi-merge , 2:IV值, 3:基尼系数分箱
              （指定分箱方法，1表示采用最优Chi-merge分箱；2表示采用最用IV分箱；3表示采用信息增益分箱。另外，指标可以自行扩展，
               那么反应变量区分能力的指标可以用于分箱，如基尼指数等，这里没有对Best-KS方法做具体实现，因为Best-KS方法只能处理
               连续变量分箱）
        mmin:最小分箱数，当分箱初始化后如果初始化箱数小于等于mmin，则mmin=2，即最少分2箱，
            如果分两箱也无法满足箱内最小样本数限制而分1箱，则变量删除
            （最小分箱数。分箱初始化合并后要满足最小样本限制，如果不满足则需要进行分箱初始化合并。如果初始化合并后箱数小于等于
            mmin，则mmin=2，即最少分2箱，如果分2箱也无法满足箱内最小样本数限制而分为一箱，则删除该变量）
        mmax:最大分箱数，当分箱初始化后如果初始化箱数小于等于mmax，则mmax等于初始化箱数-1
             （最大分箱数。当分箱初始化合并后，如果从初始化箱数小于等于mmax，则mmax等于初始化合并箱数-1.如果数据中有缺失值，缺失值
             单独作为一箱，最大分箱数为mmax+1，即mmax限制的是非缺失值情况下的最大分箱数）
        bin_rate：等距初始化分箱参数，分箱数为1/bin_rate,分箱间隔在数据中的最小值与最大值将等间隔取值
                （等距初始化分箱参数，分箱数为1/bin_rate，分箱间隔在数据中的最小值与最大值的范围内等间隔取值。注意变量异常值的限制
                ，否则等间隔分箱后将有大部分箱内没有样本，而在某个箱内样本较集中，会降低分箱的辨识能力）
        stop_limit:分箱earlystopping机制，如果已经没有明显增益即停止分箱
                  （分箱前后的最小增益限值，即early stopping策略的限制。当本次分箱前后得到的增益小于限值则分箱终止）
        bin_min_num:每组最小样本数
                  （最小样本数，分箱初始化后每个箱内的最小样本数不能少于该值，否则进行分箱合并）
    返回值
        分箱结果：pandas dataframe
    '''

    # 缺失值单独取出来
    df_na = pd.DataFrame({'x': x[pd.isnull(x)], 'y': y[pd.isnull(x)]})
    y = y[~pd.isnull(x)]
    x = x[~pd.isnull(x)]

    # 初始化分箱，等距的方式，后面加上约束条件,没有箱内样本数没有限制
    bin_init = init_equal_bin(x, bin_rate)

    # 分箱映射
    bin_map = cont_var_bin_map(x, bin_init)
    
    df_temp = pd.concat([x, y, bin_map], axis=1)

    # 计算每个bin中好坏样本的频数
    '''
    pd.crosstab():用于计算分组的频率，算是一种特殊的pivot_table(),是顶级类函数
    pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, 
                    margins_name='All', dropna=True, normalize=False)
    参数：
        index：行分组键
        columns：列分组键
        margins：False 默认值，增加一行/列 ‘总计’  
        margins_name：All 默认值
        normalize：False 默认值
                   'index' or 1，normalize 每行
                   'columns' or 0，normalize 每列
                   'All' or 'True'，normalize 全部
                   如果 margins = True，则 margins 也会被 normalize
                   值 = 每个数据 / 数据总和 浮点数格式
        values：可选项
                根据 index 和 columns 的分组后，计算 values 项的值，计算规则由 aggfunc 决定
                (values 和 aggfunc 成对出现)
        aggfunc:可选项
                np.sum，np.mean，len，... ...
                (values 和 aggfunc 成对出现)
        rownames:None 默认值
                 pd.crosstab()操作数组时，设定 row's name，而不使用默认名称
        colnames:None 默认值
                 pd.crosstab()操作数组时，设定 column's name，而不使用默认名称
        dropna:True 默认值
               如果某列的数据全是 NaN，则被删除
    '''
    df_temp_1 = pd.crosstab(index=df_temp[bin_map.name], columns=y)

    '''
    zip(*iterables)：创建一个聚合了来自每个可迭代对象中的元素的迭代器。
    '''
    df_temp_1.rename(columns= dict(zip([0,1], ['good', 'bad'])) , inplace=True)

    # 计算每个bin中一共有多少样本
    '''
    groupby()是pandas库中DataFrame结构的函数
    '''
    df_temp_2 = pd.DataFrame(df_temp.groupby(bin_map.name).count().iloc[:, 0])
    df_temp_2.columns = ['total']

    '''
    pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True, indicator=False,
         validate=None)
    参数：
        left: 拼接的左侧DataFrame对象
        right: 拼接的右侧DataFrame对象
        on: 要加入的列或索引级别名称。 必须在左侧和右侧DataFrame对象中找到。 如果未传递且left_index和right_index为False，
            则DataFrame中的列的交集将被推断为连接键。
        left_on:左侧DataFrame中的列或索引级别用作键。 可以是列名，索引级名称，也可以是长度等于DataFrame长度的数组。
        right_on: 左侧DataFrame中的列或索引级别用作键。 可以是列名，索引级名称，也可以是长度等于DataFrame长度的数组。
        left_index: 如果为True，则使用左侧DataFrame中的索引（行标签）作为其连接键。 对于具有MultiIndex（分层）的DataFrame，
                    级别数必须与右侧DataFrame中的连接键数相匹配。
        right_index: 与left_index功能相似。
        how: One of ‘left’, ‘right’, ‘outer’, ‘inner’. 默认inner。inner是取交集，outer取并集。比如left：[‘A’,‘B’,‘C’];
             right[’'A,‘C’,‘D’]；inner取交集的话，left中出现的A会和right中出现的买一个A进行匹配拼接，如果没有是B，在right中没有匹配到，则会丢失。'outer’取并集，出现的A会进行一一匹配，没有同时出现的会将缺失的部分添加缺失值。
        sort: 按字典顺序通过连接键对结果DataFrame进行排序。 默认为True，设置为False将在很多情况下显着提高性能。
        suffixes: 用于重叠列的字符串后缀元组。 默认为（‘x’，’ y’）。
        copy: 始终从传递的DataFrame对象复制数据（默认为True），即使不需要重建索引也是如此。
        indicator:将一列添加到名为_merge的输出DataFrame，其中包含有关每行源的信息。 _merge是分类类型，并且对于其合并键仅出现
                  在“左”DataFrame中的观察值，取得值为left_only，对于其合并键仅出现在“右”DataFrame中的观察值为right_only，
                  并且如果在两者中都找到观察点的合并键，则为left_only
    '''
    df_temp_all= pd.merge(pd.concat([df_temp_1, df_temp_2], axis=1), bin_init,
                         left_index=True, right_index=True,
                         how='left')
    
    # 做分箱上下限的整理，让候选点连续
    for j in range(df_temp_all.shape[0]-1):
        if df_temp_all.bin_low.loc[df_temp_all.index[j+1]] !=  df_temp_all.bin_up.loc[df_temp_all.index[j]]:
            df_temp_all.bin_low.loc[df_temp_all.index[j+1]] = df_temp_all.bin_up.loc[df_temp_all.index[j]]

    # 离散变量中这个值为bad_rate,连续变量时为索引，索引值是分箱初始化时，箱内有变量的箱的索引
    df_temp_all['bad_rate'] = df_temp_all.index

    # 最小样本数限制，进行分箱合并
    df_temp_all = limit_min_sample(df_temp_all, bin_min_num)

    # 将合并后的最大箱数与设定的箱数进行比较，这个应该是分箱数的最大值
    if mmax >= df_temp_all.shape[0]:
        mmax = df_temp_all.shape[0]-1
    if mmin >= df_temp_all.shape[0]:
        gain_value_save0=0
        gain_rate_save0=0

        '''
        np.linspace主要用来创建等差数列。
        numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0)
        参数：
            start:返回样本数据开始点
            stop:返回样本数据结束点
            num:生成的样本数据量，默认为50
            endpoint：True则包含stop；False则不包含stop
            retstep：If True, return (samples, step), where step is the spacing between samples.
                     (即如果为True则结果会给出数据间隔)
            dtype：输出数组类型
            axis：0(默认)或-1
        '''
        df_temp_all['bin'] = np.linspace(1,df_temp_all.shape[0],df_temp_all.shape[0],dtype=int)

        data = df_temp_all[['bin_low','bin_up','total','bin']]
        data.index = data['bin']
    else:
        df_temp_all['bin'] = 1
        df_temp_all['bin_raw'] = range(1, len(df_temp_all) + 1)
        df_temp_all['var'] = df_temp_all.index  # 初始化箱的编号
        gain_1 = 1e-10
        gain_rate_save0 = []
        gain_value_save0 = []

        # 分箱约束：最大分箱数限制
        for i in range(1,mmax):

    #       i = 1
            df_temp_all, gain_2 = select_split_point(df_temp_all, method=method)
            gain_rate = gain_2 / gain_1 - 1  #  ratio gain
            gain_value_save0.append(np.round(gain_2,4))
            if i == 1:
                gain_rate_save0.append(0.5)
            else:
                gain_rate_save0.append(np.round(gain_rate,4))
            gain_1 = gain_2
            if df_temp_all.bin.max() >= mmin and df_temp_all.bin.max() <= mmax:
                if gain_rate <= stop_limit or pd.isnull(gain_rate):
                    break
                
    
        df_temp_all = df_temp_all.rename(columns={'var': 'oldbin'})
        temp_Map1 = df_temp_all.drop(['good', 'bad', 'bad_rate', 'bin_raw'], axis=1)
        temp_Map1 = temp_Map1.sort_values(by=['bin', 'oldbin'])

        # get new lower, upper, bin, total for sub
        data = pd.DataFrame()
        for i in temp_Map1['bin'].unique():

            # 得到这个箱内的上下界
            sub_Map = temp_Map1[temp_Map1['bin'] == i]
            rowdata = merge_bin(sub_Map, i)
            data = data.append(rowdata, ignore_index=True)
    
        # resort data
        data = data.sort_values(by='bin_low')
        data = data.drop('bin', axis=1)
        mmax = df_temp_all.bin.max()
        data['bin'] = range(1, mmax + 1)
        data.index = data['bin']

    # 将缺失值的箱加过来
    if len(df_na) > 0:
        row_num = data.shape[0] + 1
        data.loc[row_num, 'bin_low'] = np.nan
        data.loc[row_num, 'bin_up'] = np.nan
        data.loc[row_num, 'total'] = df_na.shape[0]
        data.loc[row_num, 'bin'] = data.bin.max() + 1
    return data , gain_value_save0 ,gain_rate_save0


def cal_bin_value(x, y, bin_min_num_0=10):
    '''
    按变量类别进行分箱初始化，不满足最小样本数的箱进行合并
    参数:
        x: 待分箱的离散变量 pandas Series
        y: 标签变量
        target: 正样本标识
        bin_min_num_0：箱内的最小样本数限制
    返回值:
        计算结果
    '''

    # 按类别x计算yz中0,1两种状态的样本数
    df_temp = pd.crosstab(index=x, columns=y, margins=False)
    df_temp.rename(columns= dict(zip([0,1], ['good', 'bad'])) , inplace=True)

    '''
    assign的用途是增加新的一列
    '''
    df_temp = df_temp.assign(total=lambda x:x['good']+ x['bad'],bin=1,var_name=df_temp.index)\
              .assign(bad_rate=lambda x:x['bad']/ x['total'])

    # 按照baterate排序
    df_temp = df_temp.sort_values(by='bad_rate')
    df_temp = df_temp.reset_index(drop=True)

    # 样本数不满足最小值进行合并
    for i in df_temp.index:
        rowdata = df_temp.loc[i, :]
        if i == df_temp.index.max():

            # 如果是最后一个箱就，取倒数第二个值
            ix = df_temp[df_temp.index < i].index.max()
        else:

            # 否则就取大于i的最小的分箱值
            ix = df_temp[df_temp.index > i].index.min()

        # 如果0, 1, total项中样本的数量小于20则进行合并
        if any(rowdata[:3] <= bin_min_num_0):

            # 与相邻的bin合并
            df_temp.loc[ix, 'bad'] = df_temp.loc[ix, 'bad'] + rowdata['bad']
            df_temp.loc[ix, 'good'] = df_temp.loc[ix, 'good'] + rowdata['good']
            df_temp.loc[ix, 'total'] = df_temp.loc[ix, 'total'] + rowdata['total']
            df_temp.loc[ix, 'bad_rate'] = df_temp.loc[ix,'bad'] / df_temp.loc[ix, 'total']

            # 将区间也进行合并
            df_temp.loc[ix, 'var_name'] = str(rowdata['var_name']) +'%'+ str(df_temp.loc[ix, 'var_name'])
         
            df_temp = df_temp.drop(i, axis=0)  # 删除原来的bin

    # 如果离散变量小于等于5，每个变量为一个箱
    df_temp['bin_raw'] = range(1, df_temp.shape[0] + 1)
    df_temp = df_temp.reset_index(drop=True)
    return df_temp


def disc_var_bin(x, y, method=1, mmin=3, mmax=8, stop_limit=0.1, bin_min_num = 20  ):
    '''
    离散变量分箱方法，如果变量过于稀疏最好先编码在按连续变量分箱
    参数：
        x:输入分箱数据，pandas series
        y:标签变量
        method:分箱方法选择，1:chi-merge , 2:IV值, 3:信息熵
        mmin:最小分箱数，当分箱初始化后如果初始化箱数小于等mmin，则mmin=2，即最少分2箱，
             如果分两厢也无法满足箱内最小样本数限制而分1箱，则变量删除
        mmax:最大分箱数，当分箱初始化后如果初始化箱数小于等于mmax，则mmax等于初始化箱数-1
        stop_limit:分箱earlystopping机制，如果已经没有明显增益即停止分箱
        bin_min_num:每组最小样本数
    返回值:分箱结果：pandas dataframe
    '''

#    x = data_train.purpose
#    y = data_train.target
    del_key = []

    # 缺失值单独取出来
    df_na = pd.DataFrame({'x': x[pd.isnull(x)], 'y': y[pd.isnull(x)]})
    y = y[~pd.isnull(x)]
    x = x[~pd.isnull(x)]

    # 数据类型转化
    '''
    np.issubdtype
    可以判断某一个dtype是否是某一超类的子类，也可以用dtype的mro方法查看其所有的父类
    '''
    if np.issubdtype(x.dtype, np.int_):

        '''
        ndim返回的是数组的维度，返回的只有一个数，该数即表示数组的维度。
        shape：表示各位维度大小的元组。返回的是一个元组。
        dtype：一个用于说明数组数据类型的对象。返回的是该数组的数据类型。
        astype：转换数组的数据类型
        '''
        x = x.astype('float').astype('str')

    if np.issubdtype(x.dtype, np.float_):
        x = x.astype('str')
  
    # 按照类别分箱，得到每个箱下的统计值
    temp_cont = cal_bin_value(x, y,bin_min_num)
    
    # 如果去掉缺失值后离散变量的可能取值小于等于5不分箱
    if len(x.unique()) > 5:

        #将合并后的最大箱数与设定的箱数进行比较，这个应该是分箱数的最大值
        if mmax >= temp_cont.shape[0]:
            mmax = temp_cont.shape[0]-1
        if mmin >= temp_cont.shape[0]:
            mmin = 2
            mmax = temp_cont.shape[0]-1
        if mmax ==1:
            print('变量 {0}合并后分箱数为1，该变量删除'.format(x.name))
            del_key.append(x.name)
        
        gain_1 = 1e-10
        gain_value_save0 = []
        gain_rate_save0 = []
        for i in range(1,mmax):
            temp_cont, gain_2 = select_split_point(temp_cont, method=method)
            gain_rate = gain_2 / gain_1 - 1  #  ratio gain
            gain_value_save0.append(np.round(gain_2,4))
            if i == 1:
                gain_rate_save0.append(0.5)
            else:
                gain_rate_save0.append(np.round(gain_rate,4))
            gain_1 = gain_2
            if temp_cont.bin.max() >= mmin and temp_cont.bin.max() <= mmax:
                if gain_rate <= stop_limit:
                    break
    
        temp_cont = temp_cont.rename(columns={'var': x.name})
        temp_cont = temp_cont.drop(['good', 'bad', 'bin_raw', 'bad_rate'], axis=1)
    else:
        temp_cont.bin = temp_cont.bin_raw
        temp_cont = temp_cont[['total', 'bin', 'var_name']]
        gain_value_save0=[]
        gain_rate_save0=[]
        del_key=[]

    # 将缺失值的箱加过来
    if len(df_na) > 0:
        index_1 = temp_cont.shape[0] + 1
        temp_cont.loc[index_1, 'total'] = df_na.shape[0]
        temp_cont.loc[index_1, 'bin'] = temp_cont.bin.max() + 1
        temp_cont.loc[index_1, 'var_name'] = 'NA'
    temp_cont = temp_cont.reset_index(drop=True)  
    if temp_cont.shape[0]==1:
        del_key.append(x.name)
    return temp_cont.sort_values(by='bin') , gain_value_save0 , gain_rate_save0,del_key


def disc_var_bin_map(x, bin_map):
    '''
    用离散变量分箱后的结果，对原始值进行分箱映射
    参数:
        x: 待分箱映射的离散变量，pandas Series
        bin_map:分箱映射字典， pandas dataframe
    返回值:
        返回映射结果
    '''

    # 数据类型转化
    xx = x[~pd.isnull(x)]
    if np.issubdtype(xx.dtype, np.int_):
        x[~pd.isnull(x)] = xx.astype('float').astype('str')
    if np.issubdtype(xx.dtype, np.float_):
        x[~pd.isnull(x)] = xx.astype('str') 
    d = dict()
    for i in bin_map.index:
        for j in  bin_map.loc[i,'var_name'].split('%'):
            if j != 'NA':
                d[j] = bin_map.loc[i,'bin']

    new_x = x.map(d)

    # 有缺失值要做映射
    if sum(pd.isnull(new_x)) > 0:
        index_1 = bin_map.index[bin_map.var_name == 'NA']
        if len(index_1) > 0:

            '''
            tolist()作用：将矩阵（matrix）和数组（array）转化为列表。
            '''
            new_x[pd.isnull(new_x)] = bin_map.loc[index_1,'bin'].tolist()
    new_x.name = x.name + '_BIN'

    return new_x

if __name__ == '__main__':
    
    path = 'D:/code/chapter6/'
    data_path = os.path.join(path,'data')
    file_name = 'german.csv'

    # 读取数据
    data_train, data_test = data_read(data_path,file_name)

    # 连续变量分箱
    data_train.amount[1:30] = np.nan
    data_test1,gain_value_save1 ,gain_rate_save1  = cont_var_bin(data_train.amount, data_train.target, 
                             method=1, mmin=4 ,mmax=10,bin_rate=0.01,stop_limit=0.1 ,bin_min_num=20 )
    
    data_test2,gain_value_save2 ,gain_rate_save2  = cont_var_bin(data_train.amount, data_train.target,
                             method=2, mmin=4 ,mmax=10,bin_rate=0.01,stop_limit=0.1 ,bin_min_num=20 )

    data_test3,gain_value_save3 ,gain_rate_save3 = cont_var_bin(data_train.amount, data_train.target, 
                             method=3, mmin=4 ,mmax=10,bin_rate=0.01,stop_limit=0.1 ,bin_min_num=20 )
    
   
    # 区分离散变量和连续变量批量进行分箱，把每个变量分箱的结果保存在字典中
    dict_cont_bin = {}
    cont_name = ['duration', 'amount', 'income_rate',  'residence_info',  
               'age',  'num_credits','dependents']
    for i in cont_name:
        dict_cont_bin[i],gain_value_save , gain_rate_save = cont_var_bin(data_train[i], data_train.target, method=1, mmin=4, mmax=10,
                                     bin_rate=0.01, stop_limit=0.1, bin_min_num=20)

    # 离散变量分箱
    data_train.purpose[1:30] = np.nan
    data_disc_test1,gain_value_save1 ,gain_rate_save1,del_key  = disc_var_bin(data_train.purpose, data_train.target, 
                             method=1, mmin=4 ,mmax=10,stop_limit=0.1 ,bin_min_num=10 )
    
    data_disc_test2,gain_value_save2 ,gain_rate_save2 ,del_key = disc_var_bin(data_train.purpose, data_train.target,
                             method=2, mmin=4 ,mmax=10,stop_limit=0.1 ,bin_min_num=10 )

    data_disc_test3,gain_value_save3 ,gain_rate_save3,del_key = disc_var_bin(data_train.purpose, data_train.target, 
                             method=3, mmin=4 ,mmax=10,stop_limit=0.1 ,bin_min_num=10 )
    
    dict_disc_bin = {}
    del_key = []
    disc_name = [x for x in data_train.columns if x not in cont_name]
    disc_name.remove('target')
    for i in disc_name:
        dict_disc_bin[i],gain_value_save , gain_rate_save,del_key_1  = disc_var_bin(data_train[i], data_train.target, method=1, mmin=3,
                                     mmax=8, stop_limit=0.1, bin_min_num=5)
        if len(del_key_1)>0 :
            del_key.extend(del_key_1)

    # 删除分箱数只有1个的变量
    if len(del_key) > 0:
        for j in del_key:
            del dict_disc_bin[j]

    # 训练数据分箱
    # 连续变量分箱映射
#    ss = data_train[list( dict_cont_bin.keys())]
    df_cont_bin_train = pd.DataFrame()
    for i in dict_cont_bin.keys():
        df_cont_bin_train = pd.concat([ df_cont_bin_train , cont_var_bin_map(data_train[i], dict_cont_bin[i]) ], axis = 1)

    # 离散变量分箱映射
#    ss = data_train[list( dict_disc_bin.keys())]
    df_disc_bin_train = pd.DataFrame()
    for i in dict_disc_bin.keys():
        df_disc_bin_train = pd.concat([ df_disc_bin_train , disc_var_bin_map(data_train[i], dict_disc_bin[i]) ], axis = 1)

    # 测试数据分箱
    # 连续变量分箱映射
    ss = data_test[list( dict_cont_bin.keys())]
    df_cont_bin_test = pd.DataFrame()
    for i in dict_cont_bin.keys():
        df_cont_bin_test = pd.concat([ df_cont_bin_test , cont_var_bin_map(data_test[i], dict_cont_bin[i]) ], axis = 1)

    # 离散变量分箱映射
#    ss = data_test[list( dict_disc_bin.keys())]
    df_disc_bin_test = pd.DataFrame()
    for i in dict_disc_bin.keys():
        df_disc_bin_test = pd.concat([ df_disc_bin_test , disc_var_bin_map(data_test[i], dict_disc_bin[i]) ], axis = 1)

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。