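Following up on the previous article, "Pandas Data Processing: A Roundup of Commonly Used Functions (Part 1)", this article covers the remaining commonly used Pandas methods. They are a little harder than those in the first part, but still fairly easy to understand. Without further ado, let's get to it.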

The data used for the demonstrations is as follows:

In [11]: data
Out[11]:
   company  gender  salary   age
0        B  female      30  40.0
1        A  female      36  31.0
2        B  female      35  28.0
3        B  female       9  18.0
4        B  female      16  43.0
5        A    male      46  22.0
6        B  female      15  28.0
7        B  female      33  40.0
8        C    male      19  32.0

.astype()

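Applies to: Series and DataFrame

Main use: changes the data type of a field. On large datasets it can also be used to reduce the memory the data occupies. It is mostly used on a Series.

Usage:

# Convert the age field to int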
In [12]: data["age"] = data["age"].astype(int)

In [13]: data
Out[13]:
   company  gender  salary  age
0        B  female      30   40
1        A  female      36   31
2        B  female      35   28
3        B  female       9   18
4        B  female      16   43
5        A    male      46   22
6        B  female      15   28
7        B  female      33   40
8        C    male      19   32
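
As a rough sketch of the memory point above, the snippet below rebuilds the age column on its own (the literal values are copied from the demo table so the block runs standalone) and compares its footprint as float64 and as int8. The choice of int8 is an assumption that only holds because every age in this sample fits between -128 and 127; other data may need a wider type.

```python
import pandas as pd

# Stand-in for the demo data's age column (values copied from the demo table).
age = pd.Series([40.0, 31.0, 28.0, 18.0, 43.0, 22.0, 28.0, 40.0, 32.0], name="age")

# Bytes used by the values themselves, ignoring the index.
bytes_float64 = age.memory_usage(index=False)

# Downcast to a small integer type; safe here because all ages fit in int8.
age_int8 = age.astype("int8")
bytes_int8 = age_int8.memory_usage(index=False)

print(bytes_float64, bytes_int8)
```

float64 stores 8 bytes per value and int8 stores 1, so the saving grows with the number of rows. Note that casting to an integer type fails if the column contains missing values.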

.rename()

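Applies to: Series, DataFrame (in most cases)

Main use: mostly used to change the column names of a DataFrame

Main parameters:

  • columns (dict-like or function)
    The column names to change and their new names, usually passed as a dict
  • inplace (boolean, default False)
    Whether to modify the original object

Usage:

# Rename 'age' to 'number' (employee number) and modify the original object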
In [15]: data.rename(columns={'age':'number'},inplace=True)

In [16]: data
Out[16]:
   company  gender  salary  number
0        B  female      30      40
1        A  female      36      31
2        B  female      35      28
3        B  female       9      18
4        B  female      16      43
5        A    male      46      22
6        B  female      15      28
7        B  female      33      40
8        C    male      19      32
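
The columns parameter also accepts a function, as the parameter list notes. Here is a minimal sketch of that form, using a two-row stand-in frame (an assumption, so the block is self-contained) and str.upper as the mapping function. It also shows that without inplace=True the original object is left unchanged.

```python
import pandas as pd

# Small stand-in frame; column names mirror the demo data after the rename.
df = pd.DataFrame({
    "company": ["B", "A"],
    "gender": ["female", "male"],
    "salary": [30, 46],
    "number": [40, 22],
})

# Passing a function applies it to every column label.
upper = df.rename(columns=str.upper)

# rename returns a new DataFrame; df keeps its original labels.
print(list(upper.columns))
print(list(df.columns))
```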

.set_index()

Applies to: DataFrame

Main use: sets one or more fields of a DataFrame as the index

Usage:

In [19]: data.set_index('number',inplace=True)

In [20]: data
Out[20]:
        company  gender  salary
number
40            B  female      30
31            A  female      36
28            B  female      35
18            B  female       9
43            B  female      16
22            A    male      46
28            B  female      15
40            B  female      33
32            C    male      19
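
The example above uses a single field. For the multi-field case, one possible sketch is to pass a list of column names, which produces a MultiIndex. The five-row frame below is a cut-down stand-in for the demo data, and the xs lookup is just one way to read the result back.

```python
import pandas as pd

# Cut-down stand-in for the demo data.
df = pd.DataFrame({
    "company": ["B", "A", "B", "A", "C"],
    "gender": ["female", "female", "female", "male", "male"],
    "salary": [30, 36, 35, 46, 19],
})

# A list of column names builds a two-level MultiIndex.
indexed = df.set_index(["company", "gender"])

# Select every row whose 'company' level equals 'A'.
company_a = indexed.xs("A", level="company")
print(company_a)
```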

.reset_index()

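Applies to: Series, DataFrame

Main use: resets the index. By default the new index runs from 0 to len(df)-1

Main parameters:

  • drop (boolean, default False)
    Whether to discard the original index; see the demonstration below
  • inplace (boolean, default False)
    Whether to modify the original object

Usage:

# drop=True: reset the index and discard the original index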
In [22]: data.reset_index(drop=True)
Out[22]:
   company  gender  salary
0        B  female      30
1        A  female      36
2        B  female      35
3        B  female       9
4        B  female      16
5        A    male      46
6        B  female      15
7        B  female      33
8        C    male      19

# drop=False: reset the index
# The original index column 'number' becomes a new field in the DataFrame
In [23]: data.reset_index(drop=False,inplace=True)

In [24]: data
Out[24]:
   number  company  gender  salary
0      40        B  female      30
1      31        A  female      36
2      28        B  female      35
3      18        B  female       9
4      43        B  female      16
5      22        A    male      46
6      28        B  female      15
7      40        B  female      33
8      32        C    male      19

.drop_duplicates()

Applies to: Series, DataFrame

Main use: removes duplicate values, similar to distinct in SQL

Usage:

In [26]: data['company'].drop_duplicates()
Out[26]:
0    B
1    A
8    C
Name: company, dtype: object
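
The example above works on a Series. On a DataFrame, the subset and keep parameters control which columns define a duplicate and which occurrence survives. Here is a minimal sketch of both, with the demo data rebuilt as a literal so the block runs on its own.

```python
import pandas as pd

# Stand-in for the demo data (values copied from the demo table).
data = pd.DataFrame({
    "company": ["B", "A", "B", "B", "B", "A", "B", "B", "C"],
    "gender": ["female", "female", "female", "female", "female",
               "male", "female", "female", "male"],
    "salary": [30, 36, 35, 9, 16, 46, 15, 33, 19],
})

# Keep the first row for each distinct (company, gender) pair.
first_per_pair = data.drop_duplicates(subset=["company", "gender"])

# Keep the last row seen for each company instead of the first.
last_per_company = data.drop_duplicates(subset=["company"], keep="last")

print(first_per_pair)
print(last_per_company)
```

The surviving rows keep their original index labels, which is why the Series output above shows 0, 1 and 8 rather than 0, 1 and 2.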

.drop()

Applies to: Series, DataFrame

Main use: commonly used to remove certain fields from a DataFrame

Main parameters:

  • columns (single label or list-like)
    The fields to remove

Usage:

# Remove the 'gender' column
In [27]: data.drop(columns = ['gender'])
Out[27]:
   number  company  salary
0      40        B      30
1      31        A      36
2      28        B      35
3      18        B       9
4      43        B      16
5      22        A      46
6      28        B      15
7      40        B      33
8      32        C      19
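
Besides columns, drop can also remove rows by index label through its index parameter. A short sketch on a three-row stand-in frame (an assumption for illustration) shows both forms, and shows that drop returns a new object unless told otherwise.

```python
import pandas as pd

# Three-row stand-in for the demo data.
df = pd.DataFrame({
    "number": [40, 31, 28],
    "company": ["B", "A", "B"],
    "gender": ["female", "female", "female"],
    "salary": [30, 36, 35],
})

# Remove rows by index label.
without_rows = df.drop(index=[0, 2])

# Remove several columns at once.
without_cols = df.drop(columns=["gender", "salary"])

# df itself is unchanged because inplace defaults to False.
print(without_rows)
print(without_cols)
print(df.shape)
```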

.isin()

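Applies to: Series, DataFrame

Main use: commonly used to build a boolean index for filtering the rows of a DataFrame by a condition

Usage:

# Select the employee records for companies A and C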
In [29]: data.loc[data['company'].isin(['A','C'])]
Out[29]:
   number  company  gender  salary
1      31        A  female      36
5      22        A    male      46
8      32        C    male      19
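
Because isin returns a boolean Series, it can be inverted with the ~ operator to select everything that is not in the list. A minimal sketch, with the demo data rebuilt as a literal so the block is self-contained:

```python
import pandas as pd

# Stand-in for the demo data (values copied from the demo table).
data = pd.DataFrame({
    "number": [40, 31, 28, 18, 43, 22, 28, 40, 32],
    "company": ["B", "A", "B", "B", "B", "A", "B", "B", "C"],
    "salary": [30, 36, 35, 9, 16, 46, 15, 33, 19],
})

# True where the company is A or C.
mask = data["company"].isin(["A", "C"])

# ~ flips the mask, leaving only the rows from other companies.
others = data.loc[~mask]
print(others)
```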

pd.cut()

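Main use: discretizes a continuous variable, for example by dividing people's ages into ranges

Main parameters:

  • x (array-like)
    The one-dimensional data to discretize
  • bins (int, sequence of scalars, or IntervalIndex)
    The intervals to split into; either the number of intervals or the break points can be given
  • labels (array or bool, optional)
    The labels for the intervals

Usage:

# Split the salaries into 5 intervals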
In [33]: pd.cut(data.salary,bins = 5)
Out[33]:
0     (23.8, 31.2]
1     (31.2, 38.6]
2     (31.2, 38.6]
3    (8.963, 16.4]
4    (8.963, 16.4]
5     (38.6, 46.0]
6    (8.963, 16.4]
7     (31.2, 38.6]
8     (16.4, 23.8]
Name: salary, dtype: category
Categories (5, interval[float64]): [(8.963, 16.4] < (16.4, 23.8] < (23.8, 31.2] < (31.2, 38.6] < (38.6, 46.0]]

# Specify the break points yourself
In [32]: pd.cut(data.salary,bins = [0,10,20,30,40,50])
Out[32]:
0    (20, 30]
1    (30, 40]
2    (30, 40]
3     (0, 10]
4    (10, 20]
5    (40, 50]
6    (10, 20]
7    (30, 40]
8    (10, 20]
Name: salary, dtype: category
Categories (5, interval[int64]): [(0, 10] < (10, 20] < (20, 30] < (30, 40] < (40, 50]]
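
The labels parameter listed above is not exercised in the examples, so here is a hedged sketch of it using the same break points. The label strings are hypothetical names chosen for illustration. Passing labels=False instead returns the integer position of each bin.

```python
import pandas as pd

# Salary column from the demo data.
salary = pd.Series([30, 36, 35, 9, 16, 46, 15, 33, 19], name="salary")

bins = [0, 10, 20, 30, 40, 50]
# Hypothetical label text: one label per interval, so len(bins) - 1 entries.
names = ["0-10", "10-20", "20-30", "30-40", "40-50"]

# Replace the interval notation with readable labels.
labeled = pd.cut(salary, bins=bins, labels=names)

# labels=False returns the zero-based bin number instead of a category.
codes = pd.cut(salary, bins=bins, labels=False)

print(labeled)
print(codes)
```

By default each interval is closed on the right, so a salary of exactly 30 lands in the "20-30" bin rather than "30-40".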