Ordered categorical variables

Reposted

云端小悟空 2023-12-23 21:04:54


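Categorical variables represent categories or labels. Unlike numeric variables, the values of a categorical variable cannot be ordered, which is why they are also called nonordinal variables.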

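One-hot encoding

One-hot encoding is typically used for features whose categories have no ordinal relationship. It represents the categories with a group of bits, one bit per possible category, so a categorical variable with k possible categories is encoded as a feature vector of length k. If the variable cannot belong to more than one category at once, exactly one bit in the group is "on".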
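As a minimal sketch of the definition above, the following pure-Python helper turns one category into a length-k vector with a single bit set. The function name `one_hot` is hypothetical, and the category list simply reuses the three city names from the rent example in this article.

```python
def one_hot(value, categories):
    """Return a length-k list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# Illustrative category list (k = 3)
categories = ['NYC', 'SF', 'Seattle']

encoded_sf = one_hot('SF', categories)
encoded_seattle = one_hot('Seattle', categories)
print(encoded_sf)       # vector for SF
print(encoded_seattle)  # vector for Seattle
```

Each vector has length k and sums to 1, which is exactly the "only one bit is on" property.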

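Advantages and disadvantages of one-hot encoding:

One-hot encoding solves the problem that classifiers handle categorical attributes poorly, and to some extent it also expands the feature set. Its values are only 0 and 1, and each category is stored in its own orthogonal dimension.

When the number of categories is large, the feature space becomes very large. In that case PCA can generally be used to reduce the dimensionality, and the combination of one-hot encoding plus PCA is very useful in practice. Storing the encoding as sparse vectors saves space, and feature selection can further reduce the dimensionality.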
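To illustrate why sparse storage helps when there are many categories, here is a small sketch that keeps only the position of the single nonzero entry per row instead of the full length-k vector. The category names, the sample values, and the helper `densify` are all made up for illustration; real projects would more likely rely on a sparse matrix library.

```python
# Hypothetical vocabulary of 1000 categories
categories = [f'cat_{i}' for i in range(1000)]
index_of = {category: i for i, category in enumerate(categories)}

# Hypothetical column of observed values
values = ['cat_3', 'cat_998', 'cat_3', 'cat_42']

# Sparse form: one integer (the index of the "on" bit) per row
sparse_rows = [index_of[v] for v in values]


def densify(index, k):
    """Expand a stored index back into a full one-hot vector of length k."""
    row = [0] * k
    row[index] = 1
    return row


dense_cells = len(values) * len(categories)  # numbers stored by a dense encoding
sparse_cells = len(sparse_rows)              # numbers stored by the sparse form
print(dense_cells, sparse_cells)
```

The dense encoding grows with the number of categories, while the sparse form grows only with the number of rows.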

import pandas as pd
from sklearn import linear_model

df = pd.DataFrame({'city': ['SF', 'SF', 'SF', 'NYC', 'NYC', 'NYC', 'Seattle', 'Seattle', 'Seattle'],
                   'Rent': [3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501]})
df['Rent'].mean()

3333.3333333333335

# Convert the categorical variable to one-hot encoding and fit a linear regression model
one_hot_df = pd.get_dummies(df, prefix=['city'])
one_hot_df

   Rent  city_NYC  city_SF  city_Seattle
0  3999         0        1             0
1  4000         0        1             0
2  4001         0        1             0
3  3499         1        0             0
4  3500         1        0             0
5  3501         1        0             0
6  2499         0        0             1
7  2500         0        0             1
8  2501         0        0             1

model = linear_model.LinearRegression()
model.fit(one_hot_df[['city_NYC', 'city_SF', 'city_Seattle']],
          one_hot_df['Rent'])
model.coef_  # coefficients of the linear regression model

array([ 166.66666667, 666.66666667, -833.33333333])

model.intercept_  # intercept of the linear regression model

3333.3333333333335

model.score(one_hot_df[['city_NYC', 'city_SF', 'city_Seattle']],
            one_hot_df['Rent'])  # goodness of fit (R2) of the model

0.9999982857172245

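With one-hot encoding, the intercept represents the overall mean of the target variable Rent, and each linear coefficient shows how much the mean Rent of the corresponding city differs from the overall mean.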
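This interpretation can be checked by hand without fitting a model. The sketch below recomputes the overall mean and the per-city offsets from the same nine rent values using only the standard library; the variable names are illustrative. Note that a one-hot design plus an intercept is redundant, so these coefficients are one of several equivalent solutions, and they happen to be the one reported above.

```python
from statistics import mean

# Same data as the rent example, grouped by city
rents = {
    'SF': [3999, 4000, 4001],
    'NYC': [3499, 3500, 3501],
    'Seattle': [2499, 2500, 2501],
}

all_rents = [rent for values in rents.values() for rent in values]
overall_mean = mean(all_rents)  # compare with the intercept
city_means = {city: mean(values) for city, values in rents.items()}

# Offset of each city mean from the overall mean (compare with the coefficients)
offsets = {city: city_means[city] - overall_mean for city in rents}

print(overall_mean)
for city, offset in offsets.items():
    print(city, round(offset, 8))
```

The offsets for NYC, SF and Seattle come out to roughly 166.67, 666.67 and -833.33, in line with the fitted coefficients.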

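Dummy coding

Dummy coding uses only k-1 features in its representation, which removes the extra degree of freedom. The category whose feature is left out is represented by an all-zeros vector and is called the reference category. Both dummy coding and one-hot encoding can be produced with pandas.get_dummies.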

# Train a linear regression model on dummy coding; the drop_first flag generates the dummy coding
dummy_df = pd.get_dummies(df, prefix=['city'], drop_first=True)
dummy_df

   Rent  city_SF  city_Seattle
0  3999        1             0
1  4000        1             0
2  4001        1             0
3  3499        0             0
4  3500        0             0
5  3501        0             0
6  2499        0             1
7  2500        0             1
8  2501        0             1

model.fit(dummy_df[['city_SF', 'city_Seattle']], dummy_df['Rent'])
model.coef_

array([ 500., -1000.])

model.intercept_

3500.0

model.score(dummy_df[['city_SF', 'city_Seattle']], dummy_df['Rent'])

0.9999982857172245

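With dummy coding, the intercept (bias term) represents the mean of the response variable y for the reference category, which in this example is city_NYC. The coefficient of the i-th feature equals the difference between the mean of the i-th category and the mean of the reference category.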
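As a rough check of this reading, the sketch below builds the dummy-coded rows by hand, sets the intercept to the reference-category mean and each coefficient to the difference from it, and confirms that the resulting predictions reproduce the per-city means. The helper names `dummy_code` and `predict` are hypothetical, and no regression library is used.

```python
from statistics import mean

cities = ['SF', 'SF', 'SF', 'NYC', 'NYC', 'NYC', 'Seattle', 'Seattle', 'Seattle']
rents = [3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501]

reference = 'NYC'            # category represented by the all-zeros vector
levels = ['SF', 'Seattle']   # the k-1 categories that get their own feature


def dummy_code(city):
    """k-1 indicator features; the reference category maps to all zeros."""
    return [1 if city == level else 0 for level in levels]


group_mean = {
    city: mean([r for c, r in zip(cities, rents) if c == city])
    for city in set(cities)
}

intercept = group_mean[reference]
coefs = [group_mean[level] - intercept for level in levels]


def predict(city):
    """Intercept plus the coefficient of whichever indicator is switched on."""
    return intercept + sum(w * x for w, x in zip(coefs, dummy_code(city)))


print(intercept, coefs)
print({city: predict(city) for city in ['NYC', 'SF', 'Seattle']})
```

The intercept and coefficients computed this way agree with the fitted values 3500.0 and [500., -1000.].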

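Effect coding

Effect coding is very similar to dummy coding. The difference is that the reference category is represented by a vector consisting entirely of -1s.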

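# astype returns a new DataFrame; float columns are needed to hold -1.0
effect_df = dummy_df.astype({'city_SF': float, 'city_Seattle': float})
# Rows 3 to 5 are the reference category (NYC); .loc slicing includes both endpoints
effect_df.loc[3:5, ['city_SF', 'city_Seattle']] = -1.0
effect_df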

   Rent  city_SF  city_Seattle
0  3999      1.0           0.0
1  4000      1.0           0.0
2  4001      1.0           0.0
3  3499     -1.0          -1.0
4  3500     -1.0          -1.0
5  3501     -1.0          -1.0
6  2499      0.0           1.0
7  2500      0.0           1.0
8  2501      0.0           1.0

model.fit(effect_df[['city_SF', 'city_Seattle']], effect_df['Rent'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

model.coef_

array([ 666.66666667, -833.33333333])

model.intercept_

3333.3333333333335

model.score(effect_df[['city_SF', 'city_Seattle']], effect_df['Rent'])

0.9999982857172245
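The effect-coding results above can be read the same way as the earlier ones: the intercept is the mean of the per-category means, and each coefficient is that category's mean minus the intercept. The sketch below checks this by hand for the rent data; `effect_code` is a hypothetical helper, and the effect of the reference category is recovered as the negative sum of the other effects.

```python
from statistics import mean

group_mean = {'SF': 4000, 'NYC': 3500, 'Seattle': 2500}  # per-city mean rent from the example
reference = 'NYC'
levels = ['SF', 'Seattle']


def effect_code(city):
    """Like dummy coding, except the reference category maps to all -1s."""
    if city == reference:
        return [-1] * len(levels)
    return [1 if city == level else 0 for level in levels]


grand_mean = mean(group_mean.values())  # compare with the intercept
effects = {level: group_mean[level] - grand_mean for level in levels}

# The reference category has no coefficient of its own;
# its effect is minus the sum of the others.
reference_effect = -sum(effects.values())

print(grand_mean)
print({level: round(effect, 8) for level, effect in effects.items()})
print(round(reference_effect, 8))
```

The recovered reference effect, about 166.67, matches the NYC coefficient obtained with one-hot encoding.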

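Handling large categorical variables

Feature hashing

A hash function is a deterministic function that maps a potentially unbounded integer to a finite integer range [1, m].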
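A minimal sketch of the idea, assuming MD5 from the standard library as a stand-in hash (scikit-learn's FeatureHasher uses a different hash function, so the bucket numbers here will not match its output). The helpers `hash_bucket` and `hashed_one_hot` are hypothetical names; the sample IDs are business IDs taken from the Yelp example in this article.

```python
import hashlib


def hash_bucket(value, m):
    """Deterministically map a string to an integer bucket in [1, m]."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % m + 1


def hashed_one_hot(value, m):
    """Length-m indicator vector with a 1 in the bucket chosen by the hash."""
    vector = [0] * m
    vector[hash_bucket(value, m) - 1] = 1
    return vector


m = 8  # illustrative number of buckets, far smaller than the number of possible IDs
ids = ['9yKzy9PApeiPPOUJEtnvkg', 'ZRJwVLyzEJq1VAihDhYiow', '6oRAC4uyJCsJl1X0WZpVSA']

buckets = [hash_bucket(business_id, m) for business_id in ids]
print(buckets)
```

Because m is fixed in advance, the feature vector keeps the same length no matter how many distinct IDs appear; the price is that different IDs can collide in the same bucket.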

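import pandas as pd
import json

# Load the first 10,000 reviews from the Yelp dataset
js = []
with open('yelp_academic_dataset_review.json') as f:
    for i in range(10000):
        js.append(json.loads(f.readline()))
# No explicit f.close() is needed: the with block closes the file
review_df = pd.DataFrame(js)

# Define m as the number of unique business_id values
m = len(review_df.business_id.unique())
m

4174

from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=m, input_type='string')
# With input_type='string' each sample must be an iterable of strings,
# so wrap every business_id in a one-element list
f = h.transform([[b] for b in review_df['business_id']])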

review_df['business_id'].unique().tolist()[0:5]

['9yKzy9PApeiPPOUJEtnvkg',
 'ZRJwVLyzEJq1VAihDhYiow',
 '6oRAC4uyJCsJl1X0WZpVSA',
 '_1QQZuf4zZOyFCvXc0o6Vg',
 '6ozycU1RpktNG2-1BroVtw']

f.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

from sys import getsizeof

# Note: f is a SciPy sparse matrix, and sys.getsizeof reports only the small
# Python wrapper object, not the underlying data buffers, so the second figure
# understates the real memory use
print('Our pandas Series, in bytes: ', getsizeof(review_df['business_id']))
print('Our hashed numpy array, in bytes: ', getsizeof(f))

Our pandas Series, in bytes: 790152
Our hashed numpy array, in bytes: 56

Bin counting

import pandas as pd

df = pd.read_csv('train_subset.csv')
len(df['device_id'].unique())  # number of unique device_id values in the training set

1075

df.head()

(Output: 5 rows × 24 columns. Visible columns: click, hour, banner_pos, site_id, site_domain, site_category, app_id, app_domain, ..., device_type, device_conn_type, C14, C15, C16, C17, C18, C19, C20, C21.)

def click_counting(x, bin_column):
    # Count clicked and non-clicked rows for each value of bin_column
    clicks = pd.Series(
        x[x['click'] > 0][bin_column].value_counts(), name='clicks')
    no_clicks = pd.Series(
        x[x['click'] < 1][bin_column].value_counts(), name='no_clicks')

    # A value missing from one of the two series means a count of zero
    counts = pd.DataFrame([clicks, no_clicks]).T.fillna(0)
    counts['total'] = counts['clicks'].astype(
        'int64') + counts['no_clicks'].astype('int64')
    return counts

def bin_counting(counts):
    # Proportion of clicks and of non-clicks within each bin
    counts['N+'] = counts['clicks'].astype('int64').divide(
        counts['total'].astype('int64'))
    counts['N-'] = counts['no_clicks'].astype('int64').divide(
        counts['total'].astype('int64'))
    # Despite its name, this column holds the odds ratio N+ / N-; no logarithm is applied
    counts['log_N+'] = counts['N+'].divide(counts['N-'])

    # If we wanted to only return bin-counting properties, we would filter here
    bin_counts = counts.filter(items=['N+', 'N-', 'log_N+'])
    return counts, bin_counts

bin_column = 'device_id'
device_clicks = click_counting(df.filter(items=[bin_column, 'click']), bin_column)
device_all, device_bin_counts = bin_counting(device_clicks)
len(device_bin_counts)

1075

device_all.sort_values(by='total', ascending=False).head(4)

          clicks  no_clicks  total        N+        N-    log_N+
a99f214a    1561       7163   8724  0.178932  0.821068  0.217925
c357dbff     ...        ...    ...  0.117647  0.882353  0.133333
a167aa83     ...        ...    ...  0.000000  1.000000  0.000000
3c0208dc     ...        ...    ...  0.000000  1.000000  0.000000

from sys import getsizeof

print('Our pandas Series, in bytes: ', getsizeof(df.filter(items=['device_id', 'click'])))
print('Our bin-counting feature, in bytes: ', getsizeof(device_bin_counts))

Our pandas Series, in bytes: 730152
Our bin-counting feature, in bytes: 95699
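The bin-counting code above replaces each device_id with statistics of the click target computed over that device. To make the arithmetic concrete, here is a toy sketch on a handful of invented (device, click) pairs, using only the standard library. The device names, the data, and the helper `bin_stats` are all hypothetical; the third value it returns corresponds to the N+ / N- ratio that the code above stores in the 'log_N+' column.

```python
from collections import Counter

# Hypothetical (device_id, click) events
events = [
    ('dev_a', 1), ('dev_a', 0), ('dev_a', 0), ('dev_a', 0),
    ('dev_b', 0), ('dev_b', 0),
    ('dev_c', 1),
]

clicks = Counter(device for device, click in events if click > 0)
no_clicks = Counter(device for device, click in events if click < 1)


def bin_stats(device):
    """Return (N+, N-, odds) for one device: click share, non-click share, N+ / N-."""
    positives = clicks[device]
    negatives = no_clicks[device]
    total = positives + negatives
    n_pos = positives / total
    n_neg = negatives / total
    # A device with no non-click rows has an unbounded odds ratio
    odds = n_pos / n_neg if n_neg > 0 else float('inf')
    return n_pos, n_neg, odds


for device in ['dev_a', 'dev_b', 'dev_c']:
    print(device, bin_stats(device))
```

Each device is summarized by a few numbers regardless of how many distinct devices exist, which is what keeps the feature compact.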

Reference:

Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning (Chinese edition: 精通特征工程)
