Renaming a Series index

Reposted

字节小舞神 2024-10-18 07:27:28


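pandas: reindexing and altering labels

Contents

pandas: reindexing and altering labels

Reindexing to align with another object
Aligning objects with align
Filling while reindexing
Limits on filling while reindexing
Dropping labels from an axis
Renaming and mapping labels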

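reindex() is the fundamental data alignment method in pandas. It is used to implement nearly every other feature that relies on label alignment. To reindex means to make the data conform to a given set of labels along a particular axis.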

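This accomplishes several things:

Reorders the existing data to match a new set of labels
Inserts missing value (NA) markers at label positions where no data for that label existed
If specified, fills data for the missing labels using a fill logic (highly relevant when working with time series data)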

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(5),index=['a','b','c','d','e'])
s

a   -0.530117
b   -1.376770
c    0.128544
d    1.710105
e   -0.303147
dtype: float64

s.reindex(['e','b','f','d'])

e   -0.303147
b   -1.376770
f         NaN
d    1.710105
dtype: float64

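In the example above, the result no longer contains a and c because they were not requested, and it gains the label f, which did not exist in s and therefore has the value NaN.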

With a DataFrame, you can reindex the index and the columns at the same time:

df = pd.DataFrame(np.random.randn(4,3),index=['a','b','c','d'],columns=['one','two','three'])

df

	one	two	three
a	0.023919	-0.168714	-0.130197
b	-2.191789	0.770867	0.449531
c	-0.595582	-0.146834	0.504420
d	1.771169	-1.342882	0.026612

df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])

	three	two	one
c	0.504420	-0.146834	-0.595582
f	NaN	NaN	NaN
b	0.449531	0.770867	-2.191789

DataFrame.reindex() also supports an "axis-style" calling convention, in which you specify a single labels argument and the axis it applies to.

df.reindex(['c','f','b'], axis='index')

	one	two	three
c	-0.595582	-0.146834	0.504420
f	NaN	NaN	NaN
b	-2.191789	0.770867	0.449531

df.reindex(['three', 'two', 'one'], axis='columns')

	three	two	one
a	-0.130197	-0.168714	0.023919
b	0.449531	0.770867	-2.191789
c	0.504420	-0.146834	-0.595582
d	0.026612	-1.342882	1.771169
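
The two calling conventions describe the same operation. A minimal sketch that checks this, using a small hand-written frame (the values are illustrative, not the random data shown above):

```python
import pandas as pd

# Small deterministic frame (illustrative values only).
df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

# Keyword style and axis style on the row labels.
by_keyword = df.reindex(index=["c", "f", "b"])
by_axis = df.reindex(["c", "f", "b"], axis="index")
print(by_keyword.equals(by_axis))

# Keyword style and axis style on the column labels.
cols_keyword = df.reindex(columns=["two", "one"])
cols_axis = df.reindex(["two", "one"], axis="columns")
print(cols_keyword.equals(cols_axis))
```

DataFrame.equals() treats NaN values in the same position as equal, so the all-NaN row f does not break the comparison.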

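Many operations run faster on pre-aligned data, so explicit alignment is worth considering when writing performance-sensitive code. Reindexing itself also carries some overhead, of course.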
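
One way to apply this is to align objects once, up front, and then run several operations on the aligned copies. The sketch below only shows the pattern; the names a, b and common are illustrative, and whether it saves measurable time depends on the size of the data and how many operations follow.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["w", "y", "z"])

# Align both objects once on the union of their labels.
common = a.index.union(b.index)
a_aligned = a.reindex(common)
b_aligned = b.reindex(common)

# Later binary operations now see identical indexes on both sides.
total = a_aligned + b_aligned
product = a_aligned * b_aligned

print(a_aligned.index.equals(b_aligned.index))
print(total.equals(a + b))
```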

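Reindexing to align with another object

Using the approach from the previous section, it is easy to reindex an object so that it carries the same labels as another object. Because this is such a common operation, pandas provides a more concise way to do it, the reindex_like() method: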

df2 = pd.DataFrame(np.random.randn(3,2),index=['a','b','c'],columns=['one','two'])

df3 = pd.DataFrame(np.random.randn(3,2),index=['a','b','c'],columns=['one','two'])

df

	one	two	three
a	0.023919	-0.168714	-0.130197
b	-2.191789	0.770867	0.449531
c	-0.595582	-0.146834	0.504420
d	1.771169	-1.342882	0.026612

df.reindex_like(df2)

	one	two
a	0.023919	-0.168714
b	-2.191789	0.770867
c	-0.595582	-0.146834
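
Put differently, reindex_like() takes its labels from the other object and its values from the caller. The sketch below checks that against an explicit reindex() call; template is a hypothetical stand-in for df2, and the numbers are made up for readability.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "one": [1.0, 2.0, 3.0, 4.0],
        "two": [5.0, 6.0, 7.0, 8.0],
        "three": [9.0, 10.0, 11.0, 12.0],
    },
    index=["a", "b", "c", "d"],
)
# Only the labels of `template` matter; its values are never read.
template = pd.DataFrame(0.0, index=["a", "b", "c"], columns=["one", "two"])

liked = df.reindex_like(template)
explicit = df.reindex(index=template.index, columns=template.columns)

print(liked.equals(explicit))
print(liked)
```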

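Aligning objects with align

The align() method is the fastest way to align two objects simultaneously. It supports a join argument (related to joining and merging):

join='outer': take the union of the indexes (default)
join='left': use the calling object's index
join='right': use the passed object's index
join='inner': take the intersection of the indexes

It returns a tuple containing both reindexed Series: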

s = pd.Series(np.random.randn(5),index=['a','b','c','d','e'])

s1 = s[:4]

s2 = s[1:]

s1.align(s2)

(a    0.673400
 b   -0.164149
 c    1.021917
 d    0.015151
 e         NaN
 dtype: float64, a         NaN
 b   -0.164149
 c    1.021917
 d    0.015151
 e    0.532343
 dtype: float64)

s1.align(s2,join='inner')

(b   -0.164149
 c    1.021917
 d    0.015151
 dtype: float64, b   -0.164149
 c    1.021917
 d    0.015151
 dtype: float64)

s1.align(s2, join='left')

(a    0.673400
 b   -0.164149
 c    1.021917
 d    0.015151
 dtype: float64, a         NaN
 b   -0.164149
 c    1.021917
 d    0.015151
 dtype: float64)
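
To compare the four join modes side by side, including join='right', here is a small sketch with fixed values instead of random ones. It prints only the resulting labels:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=["a", "b", "c", "d", "e"])
s1 = s.iloc[:4]  # labels a, b, c, d
s2 = s.iloc[1:]  # labels b, c, d, e

for how in ["outer", "left", "right", "inner"]:
    left, right = s1.align(s2, join=how)
    # Whatever the join mode, both returned objects share one index.
    assert left.index.equals(right.index)
    print(how, list(left.index))
```

With join='right', the first element of the tuple gains the label e with a NaN value, mirroring what join='left' does to the second element.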

For DataFrames, the join method is applied to both the index and the columns by default:

df.align(df2,join='inner')

(        one       two
 a  0.023919 -0.168714
 b -2.191789  0.770867
 c -0.595582 -0.146834,         one       two
 a -0.484455  0.301743
 b -0.963329  0.013888
 c -0.955406  1.144505)

You can also pass an axis argument to align on the specified axis only:

df.align(df2,join='inner',axis=0)

(        one       two     three
 a  0.023919 -0.168714 -0.130197
 b -2.191789  0.770867  0.449531
 c -0.595582 -0.146834  0.504420,         one       two
 a -0.484455  0.301743
 b -0.963329  0.013888
 c -0.955406  1.144505)

If you pass a Series to DataFrame.align(), you can choose to align both objects on either the DataFrame's index or its columns using the axis argument:

df.align(df2.iloc[0], axis=1)

(        one     three       two
 a  0.023919 -0.130197 -0.168714
 b -2.191789  0.449531  0.770867
 c -0.595582  0.504420 -0.146834
 d  1.771169  0.026612 -1.342882, one     -0.484455
 three         NaN
 two      0.301743
 Name: a, dtype: float64)
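
The example above matches the Series against the column labels. For comparison, here is a sketch of the other choice, axis=0, where the Series labels are matched against the row index. row_weights is a hypothetical Series invented for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)
row_weights = pd.Series([0.5, 2.0], index=["b", "d"])  # hypothetical per-row values

# axis=0 aligns the Series labels with the DataFrame's row labels.
# The default join is 'outer', so both results get the union of the labels.
df_aligned, weights_aligned = df.align(row_weights, axis=0)

print(list(df_aligned.index))
print(list(weights_aligned.index))
print(list(df_aligned.columns))  # columns are left untouched
```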

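Filling while reindexing

reindex() takes an optional method parameter, which selects a filling method from the following table:

Method	Action
pad / ffill	Fill values forward (use the previous valid value)
bfill / backfill	Fill values backward (use the next valid value)
nearest	Fill from the nearest index value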

rng = pd.date_range('1/3/2000',periods=8)

ts = pd.Series(np.random.randn(8),index=rng)

ts2 = ts.iloc[[0, 3, 6]]

ts

2000-01-03   -1.459239
2000-01-04   -0.778597
2000-01-05    0.193446
2000-01-06   -1.135547
2000-01-07    0.664596
2000-01-08    0.394573
2000-01-09    0.498136
2000-01-10    0.265899
Freq: D, dtype: float64

ts2

2000-01-03   -1.459239
2000-01-06   -1.135547
2000-01-09    0.498136
dtype: float64

ts2.reindex(ts.index)

2000-01-03   -1.459239
2000-01-04         NaN
2000-01-05         NaN
2000-01-06   -1.135547
2000-01-07         NaN
2000-01-08         NaN
2000-01-09    0.498136
2000-01-10         NaN
Freq: D, dtype: float64

ts2.reindex(ts.index,method='ffill')

2000-01-03   -1.459239
2000-01-04   -1.459239
2000-01-05   -1.459239
2000-01-06   -1.135547
2000-01-07   -1.135547
2000-01-08   -1.135547
2000-01-09    0.498136
2000-01-10    0.498136
Freq: D, dtype: float64

ts2.reindex(ts.index, method='nearest')

2000-01-03   -1.459239
2000-01-04   -1.459239
2000-01-05   -1.135547
2000-01-06   -1.135547
2000-01-07   -1.135547
2000-01-08    0.498136
2000-01-09    0.498136
2000-01-10    0.498136
Freq: D, dtype: float64

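When a fill method is used, reindex() raises a ValueError if the index is not monotonically increasing or decreasing. fillna() and interpolate() do not perform any checks on the order of the index.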
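
The monotonicity requirement is easy to trip over, so here is a short sketch of it. The Series unordered is made up for the illustration, and the exact wording of the error message is not shown because it may vary between pandas versions.

```python
import pandas as pd

unordered = pd.Series([1.0, 2.0, 3.0], index=[3, 1, 2])  # index is not monotonic

try:
    unordered.reindex([1, 2, 3, 4], method="ffill")
    raised = False
except ValueError as exc:
    raised = True
    print("reindex refused:", exc)

# Sorting the index first makes the forward fill well defined.
filled = unordered.sort_index().reindex([1, 2, 3, 4], method="ffill")
print(filled)

# Reindexing without a method and filling afterwards performs no order check.
unchecked = unordered.reindex([1, 2, 3, 4]).ffill()
print(unchecked)
```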

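Limits on filling while reindexing

The limit and tolerance arguments provide additional control over filling while reindexing.

limit specifies the maximum count of consecutive matches: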

ts2.reindex(ts.index,method='ffill',limit=1)

2000-01-03   -1.459239
2000-01-04   -1.459239
2000-01-05         NaN
2000-01-06   -1.135547
2000-01-07   -1.135547
2000-01-08         NaN
2000-01-09    0.498136
2000-01-10    0.498136
Freq: D, dtype: float64

In contrast, tolerance specifies the maximum distance between the index and indexer values:

ts2.reindex(ts.index, method='ffill', tolerance='1 day')

2000-01-03   -1.459239
2000-01-04   -1.459239
2000-01-05         NaN
2000-01-06   -1.135547
2000-01-07   -1.135547
2000-01-08         NaN
2000-01-09    0.498136
2000-01-10    0.498136
Freq: D, dtype: float64

Note that when used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance is coerced into a Timedelta if possible. This allows you to specify the tolerance with an appropriate string.
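
With evenly spaced daily data, limit=1 and a tolerance of one day happen to give the same result, which can hide the difference between them. The sketch below uses an integer index with uneven gaps so that the two arguments diverge. The names src and target are illustrative.

```python
import pandas as pd

src = pd.Series([10.0, 20.0], index=[0, 10])
target = [0, 5, 6, 10, 11]

# limit caps how many consecutive target labels may be filled
# from the same source value.
by_limit = src.reindex(target, method="ffill", limit=1)

# tolerance caps the distance between a target label and the
# source label it takes its value from.
by_tolerance = src.reindex(target, method="ffill", tolerance=1)

print(pd.DataFrame({"limit=1": by_limit, "tolerance=1": by_tolerance}))
```

Label 5 is the first gap after label 0, so limit=1 fills it, but it lies five units away from label 0, so tolerance=1 leaves it missing. Label 6 is missing under both settings.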

Dropping labels from an axis

A method closely related to reindex is drop(). It removes a set of labels from an axis:

df

	one	two	three
a	0.023919	-0.168714	-0.130197
b	-2.191789	0.770867	0.449531
c	-0.595582	-0.146834	0.504420
d	1.771169	-1.342882	0.026612

df.drop(['a','b'],axis=0)

	one	two	three
c	-0.595582	-0.146834	0.504420
d	1.771169	-1.342882	0.026612

df.drop(['one'], axis=1)

	two	three
a	-0.168714	-0.130197
b	0.770867	0.449531
c	-0.146834	0.504420
d	-1.342882	0.026612
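
The relationship to reindex can be made concrete: dropping labels gives the same result as reindexing onto the labels that remain. The sketch below checks this and also shows the index= and columns= keywords and the errors argument of drop(). The frame uses made-up integer values.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1, 2, 3, 4], "two": [5, 6, 7, 8], "three": [9, 10, 11, 12]},
    index=["a", "b", "c", "d"],
)

dropped = df.drop(["a", "b"], axis=0)
# The same result expressed as a reindex onto the remaining labels.
via_reindex = df.reindex(df.index.difference(["a", "b"]))
print(dropped.equals(via_reindex))

# Keyword forms name the axis directly instead of using an axis number.
trimmed = df.drop(index=["a"], columns=["one"])
print(trimmed)

# Unknown labels raise KeyError unless errors="ignore" is passed.
unchanged = df.drop(["zzz"], axis=0, errors="ignore")
print(unchanged.equals(df))
```

The drop() form is usually easier to read than the reindex form, since it states the intent directly.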

Renaming and mapping labels

The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

s

a    0.673400
b   -0.164149
c    1.021917
d    0.015151
e    0.532343
dtype: float64

s.rename(str.upper)

A    0.673400
B   -0.164149
C    1.021917
D    0.015151
E    0.532343
dtype: float64

If you pass a function, it must return a value when called with any of the labels (and must produce a set of unique values).

A dict or Series can also be used:

df

	one	two	three
a	0.023919	-0.168714	-0.130197
b	-2.191789	0.770867	0.449531
c	-0.595582	-0.146834	0.504420
d	1.771169	-1.342882	0.026612

df.rename(columns={'one':'foo','two':'bar'},
         index={'a':'apple','b':'banana','d':'durian'})

	foo	bar	three
apple	0.023919	-0.168714	-0.130197
banana	-2.191789	0.770867	0.449531
c	-0.595582	-0.146834	0.504420
durian	1.771169	-1.342882	0.026612

DataFrame.rename() also supports an "axis-style" calling convention, in which you specify a single mapper and the axis to apply that mapping to.

df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')

	foo	bar	three
a	0.023919	-0.168714	-0.130197
b	-2.191789	0.770867	0.449531
c	-0.595582	-0.146834	0.504420
d	1.771169	-1.342882	0.026612
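
Two details of rename() are worth checking by hand: labels missing from the mapping are left alone, and the original object is not modified. A small sketch with made-up values, mixing a dict for the columns with a function for the index (the key "missing" is deliberately absent from the frame):

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1, 2], "two": [3, 4], "three": [5, 6]},
    index=["a", "b"],
)

# "missing" is not a column of df; by default such extra keys are ignored.
renamed = df.rename(columns={"one": "foo", "missing": "ignored"}, index=str.upper)

print(list(renamed.columns))  # columns not named in the mapping keep their labels
print(list(renamed.index))    # str.upper was applied to every row label
print(list(df.columns))       # the original frame is unchanged
```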

Finally, rename() also accepts a scalar or a hashable sequence (such as a tuple) to alter the Series.name attribute:

s

a    0.673400
b   -0.164149
c    1.021917
d    0.015151
e    0.532343
dtype: float64

s.rename('scalar-name')

a    0.673400
b   -0.164149
c    1.021917
d    0.015151
e    0.532343
Name: scalar-name, dtype: float64

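The DataFrame.rename_axis() and Series.rename_axis() methods allow specific names of a MultiIndex to be changed (as opposed to the labels). Added in version 0.24.0.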

df = pd.DataFrame({'x':[1,2,3,4,5,6],
                  'y':[10,20,30,40,50,60]},
                  index=pd.MultiIndex.from_product([['a','b','c'],[1,2]],names=['let','num']))
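
The source breaks off after constructing this frame. As a sketch of how rename_axis() could be applied to it, the block below rebuilds the same frame and renames the level names with a dict and with a function. The new name "abc" is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1, 2, 3, 4, 5, 6], "y": [10, 20, 30, 40, 50, 60]},
    index=pd.MultiIndex.from_product(
        [["a", "b", "c"], [1, 2]], names=["let", "num"]
    ),
)

# A dict renames selected level names; the labels themselves are untouched.
renamed = df.rename_axis(index={"let": "abc"})
print(list(renamed.index.names))

# A function is applied to every level name.
upper = df.rename_axis(index=str.upper)
print(list(upper.index.names))
```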