Contents
- MAD
- 3σ method
- Percentile method
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Build the test data
mean = 0.6
sigma = 1
num = 3500
np.random.seed(0)
factor_data = np.random.normal(mean, sigma, num)
factor_data = pd.Series(data=factor_data)
factor_data.index = [str(x).zfill(6) for x in factor_data.index.values]
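def winsorize_plot(data, vertical_lines=None):
    # Histogram of the data, with a short marker bar at each reference value
    fig = plt.figure(figsize=(12, 6))
    ax = fig.add_subplot(111)
    ax.hist(data, 100, density=True)
    color_list = ['red', 'green', 'yellow', 'black', 'gold', 'gray']
    for count, value in enumerate(vertical_lines or []):
        # Cycle through the colors so more than six markers do not raise IndexError
        ax.bar(value, 0.1, width=0.05, color=color_list[count % len(color_list)], alpha=1)
    plt.show()
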
data_mean = factor_data.mean()
std = np.std(factor_data)
std_up = data_mean+3*std
std_low = data_mean-3*std
quantile_low = np.percentile(factor_data, 2.5)
quantile_high = np.percentile(factor_data, 97.5)
vertical_lines = [std_up, std_low, quantile_high, quantile_low]
winsorize_plot(factor_data,vertical_lines)
MAD
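MAD (median absolute deviation), also known as the median-of-absolute-deviations method, detects outliers by first measuring the distance between each factor value and the median of all factor values.
The procedure:
- Step 1: find the median of all factor values, $x_{median}$.
- Step 2: compute the absolute deviation of each factor value from the median, $|x_i - x_{median}|$.
- Step 3: take the median of those absolute deviations; this is the MAD.
- Step 4: choose the parameter n, which fixes the acceptable range $[x_{median} - n \cdot MAD,\ x_{median} + n \cdot MAD]$, and adjust the factor values outside it as follows: values above the upper bound are replaced by the upper bound, and values below the lower bound are replaced by the lower bound.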
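The four steps can be traced by hand on a tiny sample. The sketch below is only an illustration: the five values and the choice n = 3 are assumptions made for the example, not data from this article.

```python
import numpy as np

# Hypothetical toy sample with one obvious outlier (100)
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Step 1: median of the sample
median = np.median(values)                 # 3.0
# Step 2: absolute deviation of each value from the median
abs_dev = np.abs(values - median)          # [2, 1, 0, 1, 97]
# Step 3: median of the absolute deviations (the MAD)
mad = np.median(abs_dev)                   # 1.0
# Step 4: clip to [median - n*MAD, median + n*MAD], here with n = 3
n = 3
lower, upper = median - n * mad, median + n * mad   # 0.0 and 6.0
clipped = np.clip(values, lower, upper)

print(clipped)  # [1. 2. 3. 4. 6.]
```

The outlier 100 is pulled back to the upper bound 6, while the median and the MAD themselves are barely influenced by it.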
# MAD: median-based winsorization
def filter_extreme_MAD(series, n=5):
    median = series.quantile(0.5)
    # Median of the absolute deviations from the median (the MAD)
    new_median = ((series - median).abs()).quantile(0.50)
    max_range = median + n * new_median
    min_range = median - n * new_median
    return np.clip(series, min_range, max_range)

mad_winsorize = filter_extreme_MAD(factor_data, 3)
print(mad_winsorize.min(), mad_winsorize.max())
winsorize_plot(mad_winsorize,vertical_lines)
-1.417957718452538 2.5492403918321855
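3σ method
The 3σ method is also called the standard deviation method. The standard deviation reflects how dispersed the factor values are, and it is defined relative to the factor mean $\mu$. When handling outliers, $\mu \pm n\sigma$ can be used to measure how far a factor value lies from the mean.
The logic of the standard deviation method is similar to that of the MAD method:
- Step 1: compute the mean $\mu$ and the standard deviation $\sigma$ of the factor.
- Step 2: choose the parameter n (here n = 3).
- Step 3: take $[\mu - n\sigma,\ \mu + n\sigma]$ as the acceptable range of factor values and adjust the values as follows: values above the upper bound are replaced by the upper bound, and values below the lower bound are replaced by the lower bound.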
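Because the mean and standard deviation are themselves affected by extreme values, the bounds from a single pass can be quite wide. The sketch below illustrates this on an assumed sample (1000 standard normal draws plus one planted outlier of 50); the helper name `clip_n_sigma` is hypothetical and is used only for this example.

```python
import numpy as np
import pandas as pd

def clip_n_sigma(series, n=3):
    """One pass of n-sigma clipping; returns the clipped series and its bounds."""
    mean, std = series.mean(), series.std()
    lower, upper = mean - n * std, mean + n * std
    return series.clip(lower, upper), (lower, upper)

# Assumed sample: standard normal noise plus a single extreme value
rng = np.random.default_rng(0)
sample = pd.Series(np.append(rng.normal(0.0, 1.0, 1000), 50.0))

first_pass, bounds_1 = clip_n_sigma(sample)
second_pass, bounds_2 = clip_n_sigma(first_pass)

print("pass 1 bounds:", bounds_1)
print("pass 2 bounds:", bounds_2)
```

In the first pass the outlier inflates the standard deviation, so the upper bound sits well above 3. Once the outlier has been clipped, the second pass recomputes a smaller standard deviation and a tighter range, which is one plausible motivation for repeating the clipping several times.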
# 3 sigma
def filter_extreme_3sigma(data, n=3, times=3):
    # Apply the n-sigma clipping `times` times in succession
    series = data.copy()
    for _ in range(times):
        mean = series.mean()
        std = series.std()
        max_range = mean + n * std
        min_range = mean - n * std
        series = np.clip(series, min_range, max_range)
    return series

sigma_winsorize = filter_extreme_3sigma(factor_data, 3, 3)
print(sigma_winsorize.min(), sigma_winsorize.max())
winsorize_plot(sigma_winsorize,vertical_lines)
-2.357313193311948 3.498196906137111
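Percentile method
Sort the factor values in ascending order. Values whose percentile rank is above 97.5% or below 2.5% are then adjusted in a way similar to the MAD and 3σ methods, that is, they are clipped to the corresponding boundary value.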
# Percentile method
def filter_extreme_percentile(series, lower=0.025, upper=0.975):
    # lower/upper are quantile levels in [0, 1]
    # Sort ascending; the returned series is therefore in sorted order
    series = series.sort_values()
    q = series.quantile([lower, upper])
    return np.clip(series, q.iloc[0], q.iloc[1])

percentile_winsorize = filter_extreme_percentile(factor_data, 0.025, 0.975)
print(percentile_winsorize.min(), percentile_winsorize.max())
winsorize_plot(percentile_winsorize,vertical_lines)
-1.2758881404314582 2.509630292141858
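To compare how aggressive the three rules are, one can count the share of values that falls outside each pair of bounds. The following is a minimal sketch that regenerates the same synthetic data; the helper names (`mad_bounds`, `sigma_bounds`, `percentile_bounds`, `clipped_fraction`) are hypothetical and introduced only for this comparison, and the 3σ rule is applied as a single pass.

```python
import numpy as np
import pandas as pd

# Same synthetic data as above: 3500 draws from N(0.6, 1) with seed 0
np.random.seed(0)
data = pd.Series(np.random.normal(0.6, 1, 3500))

def mad_bounds(s, n=3):
    median = s.median()
    mad = (s - median).abs().median()
    return median - n * mad, median + n * mad

def sigma_bounds(s, n=3):
    return s.mean() - n * s.std(), s.mean() + n * s.std()

def percentile_bounds(s, lower=0.025, upper=0.975):
    q = s.quantile([lower, upper])
    return q.iloc[0], q.iloc[1]

def clipped_fraction(s, bounds):
    # Share of values strictly outside [low, high]
    low, high = bounds
    return float(((s < low) | (s > high)).mean())

fractions = {
    "MAD (n=3)": clipped_fraction(data, mad_bounds(data)),
    "3 sigma (one pass)": clipped_fraction(data, sigma_bounds(data)),
    "percentile (2.5%/97.5%)": clipped_fraction(data, percentile_bounds(data)),
}
for name, frac in fractions.items():
    print(f"{name}: {frac:.2%} of values fall outside the bounds")
```

By construction the percentile rule touches roughly 5% of the values regardless of their distribution, whereas on normally distributed data the 3σ rule touches far fewer, since about 99.7% of a normal distribution lies within three standard deviations of the mean.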