一.Pandas
1.Python Data Analysis Library 或 pandas 是基于NumPy 的一种工具,主要用于数据处理(数据整理,操作,存储,读取等)
2.http://pandas.pydata.org/
3.pandas中的数据结构:
Series一维数组,只允许存储相同的数据类型;
Time- Series:以时间为索引的Series;
DataFrame:二维的表格型数据结构;
Panel :三维的数组,可以理解为DataFrame的容器
二.创建表格
1.Series创建一维标记数据表格,相似于ndarry,但有索引(从0开始)(表格的列的列表)
__init__(self, data=None, index=None, dtype=None, name=None,copy=False, fastpath=False)
一般形式:s = pd.Series(data, index=index) #data可以是一个字典,一个ndarray,或标量值(5)
(1)ndarray
1 #ndarry,若设置索引,则索引的长度必须与数据的长度相同,
2 s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
3 s2=pd.Series(np.random.randn(5))#如果没有传递索引,将创建一个具有值的索引。[0, ..., len(data) - 1]
4 print(s)
5 print(s2)
6 -------------------------------------------
7 a -0.019921
8 b -2.324644
9 c -0.429393
10 d 1.436731
11 e 2.564406
12 dtype: float64
13 0 -0.925714
14 1 0.319075
15 2 0.528071
16 3 -0.385841
17 4 0.963207
18 dtype: float64
ndarray
(2)dict
1 #dict,
2 #1.当未传递Series索引时,键表示索引,值表示值
3 d = {'b' : 1, 'a' : 0, 'c' : 2}
4 s=pd.Series(d)
5 print(s)
6 #2.如果传递索引,则将拉出与索引中的标签对应的数据中的值,NaN(不是数字)是pandas中使用的标准缺失数据标记
7 s2=pd.Series(d, index=['b', 'c', 'd', 'a'])
8 print(s2)
9 ----------------------------------------------------------
10 a 0
11 b 1
12 c 2
13 dtype: int64
14 b 1.0
15 c 2.0
16 d NaN
17 a 0.0
18 dtype: float64
字典
(3)标量
1 #标量,如果data是标量值,则必须提供索引。将重复该值以匹配索引的长度
2 s=pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
3 print(s)
4
5 -------------------------------------------------------
6 a 5.0
7 b 5.0
8 c 5.0
9 d 5.0
10 e 5.0
11 dtype: float64
标量
2.DataFrame:二维的表格型数据结构,具有可能不同类型的列,可以视为电子表格或SQL表,或Series对象的字典
__init__(self, data=None, index=None, columns=None, dtype=None,copy=False)
一般形式:s = pd.DataFrame(data, index=index) #data可以是ndarray,列表,字典,series,dataframe
(1)字典{‘columns’:series},推荐
1 #字典,#字典的键为columns,值为每一个series,#通过字典创建会产生列的顺序会是随机的
2 d = { 'one': pd.Series([1., 2., 3.],index=['a', 'b', 'c']),
3 'two': pd.Series([1., 2., 3., 4.],index=['a', 'b', 'c', 'd'])}
4 df = pd.DataFrame(d)
5 print(df)
6 print(pd.DataFrame(d, index=['d', 'b', 'a']))#如果没有传递列,则列将是dict键的有序列表。
7 print(pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three']))
8 -----------------------------------------------------------
9 one two
10 a 1.0 1.0
11 b 2.0 2.0
12 c 3.0 3.0
13 d NaN 4.0
14 one two
15 d NaN 4.0
16 b 2.0 2.0
17 a 1.0 1.0
18 two three
19 d 4.0 NaN
20 b 2.0 NaN
21 a 1.0 NaN
字典
(2)字典{‘columns’:ndarrays / lists}
1 #字典,ndarrays必须都是相同的长度,
2 d = {'one' : [1., 2., 3., 4.],
3 'two' : [4., 3., 2., 1.]}
4 df = pd.DataFrame(d)
5 print(df)#如果没有传递索引,结果将是range(n),
6 print(pd.DataFrame(d, index=['a', 'b', 'c', 'd']))#如果传递索引,则它必须明显与数组的长度相同。
7 ----------------------------------------
8 one two
9 0 1.0 4.0
10 1 2.0 3.0
11 2 3.0 2.0
12 3 4.0 1.0
13 one two
14 a 1.0 4.0
15 b 2.0 3.0
16 c 3.0 2.0
17 d 4.0 1.0
字典2
(3)列表【{‘columns’:},{}】
1 #列表【字典】,每个字典的值代表的是每条记录(一行),而且顺序确定,字典的键表示columns
2 data2 = [{'a': 1, 'b': 2},
3 {'a': 5, 'b': 10, 'c': 20}]
4 df=pd.DataFrame(data2)
5 print(df)
6 print(pd.DataFrame(data2, index=['first', 'second']))
7 print(pd.DataFrame(data2, columns=['a', 'b']))
8 -----------------------------------------------------
9 a b c
10 0 1 2 NaN
11 1 5 10 20.0
12 a b c
13 first 1 2 NaN
14 second 5 10 20.0
15 a b
16 0 1 2
17 1 5 10
列表
(4)其他:结构化,字典(元组),记录等
3.总结:
(1)series是具有索引的一维标记表格,而dataframe类似索引的Series对象的dict(columns)
(2)index索引(行标签)(表格第一列项)(描述每一行),columns列(列标签)(表格第一行项)(描述每一列)
三.查看数据
1 import numpy as np
2 import pandas as pd
3 dates = pd.date_range('20180507', periods=6)#利用pandas产生日期数据,(开始日期,长度)
4 df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
5 #生成二维表格,索引是日期,纵列是abcd的列表,数据为随机产生的数
6 #看dataframe:从左到右,从上到下,最左边一列是索引列表,每一条索引表示一条记录
7 print(df)
8 print(df.dtypes)#显示每条纵列的数据类型
9 print(df.head(1))#从表格顶部开始显示表格,默认全部显示
10 print(df.tail(4))#从表格底部开始显示表格,默认全部显示
11 ------------------------------------------------------
12 A B C D
13 2018-05-07 -1.224660 0.642997 0.513311 0.867978
14 2018-05-08 -0.190702 -0.186858 -1.187651 -0.243407
15 2018-05-09 2.036292 1.720265 0.633207 -0.480980
16 2018-05-10 -2.141022 1.062058 1.118255 0.677325
17 2018-05-11 0.084533 -1.357477 1.135133 0.163912
18 2018-05-12 -1.111821 -1.859636 -1.018877 1.500960
19 A float64
20 B float64
21 C float64
22 D float64
23 dtype: object
24 A B C D
25 2018-05-07 -1.22466 0.642997 0.513311 0.867978
26 A B C D
27 2018-05-09 2.036292 1.720265 0.633207 -0.480980
28 2018-05-10 -2.141022 1.062058 1.118255 0.677325
29 2018-05-11 0.084533 -1.357477 1.135133 0.163912
30 2018-05-12 -1.111821 -1.859636 -1.018877 1.500960
查看数据
1 import numpy as np
2 import pandas as pd
3 dates = pd.date_range('20180507', periods=6)#利用pandas产生日期数据,(开始日期,长度)
4 df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
5 print(df.index)#显示索引相关信息
6 print(df.columns)#显示纵列相关信息
7 print(df.values)#显示表格数据
8 -----------------------------------------------------------
9 DatetimeIndex(['2018-05-07', '2018-05-08', '2018-05-09', '2018-05-10',
10 '2018-05-11', '2018-05-12'],
11 dtype='datetime64[ns]', freq='D')
12 Index(['A', 'B', 'C', 'D'], dtype='object')
13 [[ 0.7783973 -0.85917061 -0.80297609 0.19146651]
14 [ 0.17786108 0.05542154 0.26696242 0.85977364]
15 [-0.90406197 0.86919366 -0.19532951 -1.51333716]
16 [ 1.17939517 -0.71215508 0.6994515 0.14105961]
17 [-1.40057553 -0.19872321 1.04690513 -0.02428399]
18 [-1.33587412 1.27565206 -1.22414705 -0.31345343]]
查看数据2
1 import numpy as np
2 import pandas as pd
3 dates = pd.date_range('20180507', periods=6)#利用pandas产生日期数据,(开始日期,长度)
4 df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
5 print(df.describe())#显示图标数据的一些统计摘要,比如平均值,方差等
6 print(df.T)#转置整个表格
7 print(df.sort_index(axis=1,ascending=False))#按index排序,以列名称进行排序,以倒的顺序排序
8 print(df.sort_values(by='C'))#指定某一属性,按值从小到大把整个列表排序
9 --------------------------------------------------------------------
10 A B C D
11 count 6.000000 6.000000 6.000000 6.000000
12 mean 0.322268 0.380395 -0.007915 -0.074197
13 std 0.747649 1.309477 1.065533 1.122326
14 min -0.581089 -1.056308 -1.277190 -1.683641
15 25% -0.140576 -0.432361 -0.891893 -0.807084
16 50% 0.292569 -0.052893 0.099397 0.057339
17 75% 0.573371 1.229704 0.639316 0.812405
18 max 1.547544 2.346076 1.433943 1.154910
19 2018-05-07 2018-05-08 2018-05-09 2018-05-10 2018-05-11 2018-05-12
20 A 0.558207 -0.196412 0.578425 0.026930 1.547544 -0.581089
21 B 0.241584 -1.056308 1.559078 -0.347370 -0.460691 2.346076
22 C -0.283660 0.691603 -1.277190 0.482454 -1.094637 1.433943
23 D -0.933690 -1.683641 0.902558 -0.427266 0.541944 1.154910
24 D C B A
25 2018-05-07 -0.933690 -0.283660 0.241584 0.558207
26 2018-05-08 -1.683641 0.691603 -1.056308 -0.196412
27 2018-05-09 0.902558 -1.277190 1.559078 0.578425
28 2018-05-10 -0.427266 0.482454 -0.347370 0.026930
29 2018-05-11 0.541944 -1.094637 -0.460691 1.547544
30 2018-05-12 1.154910 1.433943 2.346076 -0.581089
31 A B C D
32 2018-05-09 0.578425 1.559078 -1.277190 0.902558
33 2018-05-11 1.547544 -0.460691 -1.094637 0.541944
34 2018-05-07 0.558207 0.241584 -0.283660 -0.933690
35 2018-05-10 0.026930 -0.347370 0.482454 -0.427266
36 2018-05-08 -0.196412 -1.056308 0.691603 -1.683641
37 2018-05-12 -0.581089 2.346076 1.433943 1.154910
查看数据3
四.选择数据
1 import numpy as np
2 import pandas as pd
3
4 dates = pd.date_range('20180507', periods=6)
5 df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
6 print(df['A']) # 获取指定纵行数据,按键索引,等同于print(df.A)
7 print(df[0:3]) # 获取指定索引数据,按序索引,等同于print(df['20180507':'20180509'])
8 #按标签选择数据loc
9 print(df.loc[dates[0]])#选择index列表中第一个元素对应的一组数据,相当于print(df.loc['20180507'])
10 print(df.loc[:, ['A', 'B']])#选择(所有index,a和b两组)的数据
11 print(df.loc['20180509':'20180512', ['A', 'B']])
12 #选择(index:'20180509':'20180512',columns:['A', 'B'])的数据
13 --------------------------------------------------------
14 2018-05-07 1.199961
15 2018-05-08 -0.336282
16 2018-05-09 0.033081
17 2018-05-10 -0.949975
18 2018-05-11 0.155855
19 2018-05-12 -0.901445
20 Freq: D, Name: A, dtype: float64
21 A B C D
22 2018-05-07 1.199961 0.190652 1.412162 -1.005569
23 2018-05-08 -0.336282 1.445801 -0.584581 -0.101581
24 2018-05-09 0.033081 -0.069119 -0.545210 -0.339678
25 A 1.199961
26 B 0.190652
27 C 1.412162
28 D -1.005569
29 Name: 2018-05-07 00:00:00, dtype: float64
30 A B
31 2018-05-07 1.199961 0.190652
32 2018-05-08 -0.336282 1.445801
33 2018-05-09 0.033081 -0.069119
34 2018-05-10 -0.949975 0.557865
35 2018-05-11 0.155855 -1.079446
36 2018-05-12 -0.901445 1.649649
37 A B
38 2018-05-09 0.033081 -0.069119
39 2018-05-10 -0.949975 0.557865
40 2018-05-11 0.155855 -1.079446
41 2018-05-12 -0.901445 1.649649
获取数据
1 import numpy as np
2 import pandas as pd
3 dates = pd.date_range('20180507', periods=6)
4 df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
5 print(df)
6 #按位置选择iloc(索引序号进行选择)
7 print(df.iloc[3])#选择index列表中索引为3的数据
8 print(df.iloc[3:5, 0:2])#选择(index:3:5;columns:0:2)的数据
9 print(df.iloc[[1, 2, 4], [0, 2]])#(index:1,2,3)(columns:0,2)
10 print(df.iloc[1:3, :])#index:1:3;columns:所有
11 print(df.iloc[1, 1])#index:1;columns:1
12 ------------------------------------------------
13 A B C D
14 2018-05-07 0.201876 -1.442301 1.465751 0.640606
15 2018-05-08 -0.775345 -0.534636 -0.091328 1.228146
16 2018-05-09 0.975878 1.260147 -0.006231 0.335106
17 2018-05-10 -0.520035 -1.354576 -1.364484 0.276557
18 2018-05-11 0.726874 0.223242 -0.037863 -1.681729
19 2018-05-12 0.550026 2.378874 -0.973442 1.530179
20 A -0.520035
21 B -1.354576
22 C -1.364484
23 D 0.276557
24 Name: 2018-05-10 00:00:00, dtype: float64
25 A B
26 2018-05-10 -0.520035 -1.354576
27 2018-05-11 0.726874 0.223242
28 A C
29 2018-05-08 -0.775345 -0.091328
30 2018-05-09 0.975878 -0.006231
31 2018-05-11 0.726874 -0.037863
32 A B C D
33 2018-05-08 -0.775345 -0.534636 -0.091328 1.228146
34 2018-05-09 0.975878 1.260147 -0.006231 0.335106
35 -0.534635999017828
获取数据2
1 import numpy as np
2 import pandas as pd
3 dates = pd.date_range('20180507', periods=6)
4 df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
5 print(df)
6 #综合索引
7 print(df.ix[:3, ['A', 'C']])#index:0:3;columns:'A','C'
8 #布尔索引
9 print(df[df.A > 0])#条件筛选
10 ----------------------------------------------------
11 A B C D
12 2018-05-07 0.383763 0.177597 0.445263 0.873108
13 2018-05-08 0.171476 0.549246 0.531994 -1.414907
14 2018-05-09 -1.008178 0.927677 0.774119 -0.058670
15 2018-05-10 -0.500451 -0.881271 -0.576227 0.876132
16 2018-05-11 -1.289566 -0.351046 0.765377 0.168464
17 2018-05-12 0.676490 -0.806772 0.194579 -0.205643
18 A C
19 2018-05-07 0.383763 0.445263
20 2018-05-08 0.171476 0.531994
21 2018-05-09 -1.008178 0.774119
22 A B C D
23 2018-05-07 0.383763 0.177597 0.445263 0.873108
24 2018-05-08 0.171476 0.549246 0.531994 -1.414907
25 2018-05-12 0.676490 -0.806772 0.194579 -0.205643
获取数据3
五.修改,添加数据
1 import numpy as np
2 import pandas as pd
3 dates = pd.date_range('20180507', periods=6)
4 df = pd.DataFrame(np.arange(24).reshape(6,4), index=dates, columns=list('ABCD'))
5 print(df)
6 #通过索引找到数据,再直接赋值修改数据
7 df.iloc[2,2]=1111
8 df.loc['20180509','B']=1111
9 df[df.A>4]=0
10 df['E']=np.nan#添加空列E
11 #添加新的一列数据,要对齐
12 df['E']=pd.Series([1,2,3,4,5,6],index=pd.date_range('20180507',periods=6))
13 print(df)
14 -------------------------------------------------------
15 A B C D
16 2018-05-07 0 1 2 3
17 2018-05-08 4 5 6 7
18 2018-05-09 8 9 10 11
19 2018-05-10 12 13 14 15
20 2018-05-11 16 17 18 19
21 2018-05-12 20 21 22 23
22 A B C D E
23 2018-05-07 0 1 2 3 1
24 2018-05-08 4 5 6 7 2
25 2018-05-09 0 0 0 0 3
26 2018-05-10 0 0 0 0 4
27 2018-05-11 0 0 0 0 5
28 2018-05-12 0 0 0 0 6
修改添加数据
六.处理丢失数据
1 import numpy as np
2 import pandas as pd
3 dates = pd.date_range('20180507', periods=3)
4 df = pd.DataFrame(np.arange(12).reshape(3,4), index=dates, columns=list('ABCD'))
5 #np.nan表示丢失的数据,默认不包含计算中
6 df.ix[1,'C']=np.nan
7 print(df)
8 #删除对应数据
9 print(df.dropna(axis=0,how='any'))#删除行:行中只要有一个丢失数据就删除
10 print(df.dropna(axis=1,how='all'))#删除列:列中所有数据都是丢失数据就删除
11 #填充对应数据
12 print(df.fillna(value=0))#在丢失数据上把nan变为0
13 #检查是否确实数据
14 print(df.isnull())#print(df.isna())
15 -----------------------------------------------------------
16 A B C D
17 2018-05-07 0 1 2.0 3
18 2018-05-08 4 5 NaN 7
19 2018-05-09 8 9 10.0 11
20 A B C D
21 2018-05-07 0 1 2.0 3
22 2018-05-09 8 9 10.0 11
23 A B C D
24 2018-05-07 0 1 2.0 3
25 2018-05-08 4 5 NaN 7
26 2018-05-09 8 9 10.0 11
27 A B C D
28 2018-05-07 0 1 2.0 3
29 2018-05-08 4 5 0.0 7
30 2018-05-09 8 9 10.0 11
31 A B C D
32 2018-05-07 False False False False
33 2018-05-08 False False True False
34 2018-05-09 False False False False
处理丢失数据
七.文件读取和保存
选择格式时,选择csv utf-8(逗号分隔)
CSV即Comma Separate Values,这种文件格式经常用来作为不同程序之间的数据交互的格式;CSV是以逗号间隔的文本文件
1 import pandas as pd
2 data=pd.read_csv('foo.csv')
3 print(data)
4 data.to_excel('foo.xlsx')
5 -----------------------------------------------
6 jpdfpa
7 0 dafhpah
8 1 adfalln
9 2 '活动'
数据导入导出
八.合并
1 import numpy as np
2 import pandas as pd
3 #concatenating
4 df1=pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d'])
5 df2=pd.DataFrame(np.ones((3,4))*2,columns=['a','b','c','d'])
6 print('df1:',df1)
7 print('df2:',df2)
8 res=pd.concat([df1,df2],axis=0,ignore_index=True)#合并,index列表添加(上下合并),忽略以前的index
9 print('res:',res)
10
11 #join['inner','outer']
12 df3=pd.DataFrame(np.ones((3,4))*3,columns=['b','c','d','e'])
13 print('df3:',df3)
14 res2=pd.concat([df1,df3],join='outer')#合并,index列表添加,columns列表并集
15 res3=pd.concat([df1,df3],join='inner')#合并,index列表添加,columns列表交集
16 print('res2:',res2)
17 print('res3:',res3)
18
19 #join_axes
20 res4=pd.concat([df1,df3],axis=1,join_axes=[df1.index])
21 #合并,columns列表添加(左右合并),以df1的index进行处理
22 print('res4:',res4)
23
24 #append
25 res5=df1.append(df2,ignore_index=True)#把一个表添加到另一个表中,向下添加,
26 res6=df1.append([df2,df3])#把两个表添加到另一个表中,向下添加,
27 print('res5:',res5)
28 print('res6:',res6)
29 s1=pd.Series([1,2,3,4],index=['a','b','c','d'])
30 res7=df1.append(s1,ignore_index=True)#添加一个index
31 print('res7:',res7)
32 --------------------------------------------------------------
33 df1: a b c d
34 0 1.0 1.0 1.0 1.0
35 1 1.0 1.0 1.0 1.0
36 2 1.0 1.0 1.0 1.0
37 df2: a b c d
38 0 2.0 2.0 2.0 2.0
39 1 2.0 2.0 2.0 2.0
40 2 2.0 2.0 2.0 2.0
41 res: a b c d
42 0 1.0 1.0 1.0 1.0
43 1 1.0 1.0 1.0 1.0
44 2 1.0 1.0 1.0 1.0
45 3 2.0 2.0 2.0 2.0
46 4 2.0 2.0 2.0 2.0
47 5 2.0 2.0 2.0 2.0
48 df3: b c d e
49 0 3.0 3.0 3.0 3.0
50 1 3.0 3.0 3.0 3.0
51 2 3.0 3.0 3.0 3.0
52 res2: a b c d e
53 0 1.0 1.0 1.0 1.0 NaN
54 1 1.0 1.0 1.0 1.0 NaN
55 2 1.0 1.0 1.0 1.0 NaN
56 0 NaN 3.0 3.0 3.0 3.0
57 1 NaN 3.0 3.0 3.0 3.0
58 2 NaN 3.0 3.0 3.0 3.0
59 res3: b c d
60 0 1.0 1.0 1.0
61 1 1.0 1.0 1.0
62 2 1.0 1.0 1.0
63 0 3.0 3.0 3.0
64 1 3.0 3.0 3.0
65 2 3.0 3.0 3.0
66 res4: a b c d b c d e
67 0 1.0 1.0 1.0 1.0 3.0 3.0 3.0 3.0
68 1 1.0 1.0 1.0 1.0 3.0 3.0 3.0 3.0
69 2 1.0 1.0 1.0 1.0 3.0 3.0 3.0 3.0
70 res5: a b c d
71 0 1.0 1.0 1.0 1.0
72 1 1.0 1.0 1.0 1.0
73 2 1.0 1.0 1.0 1.0
74 3 2.0 2.0 2.0 2.0
75 4 2.0 2.0 2.0 2.0
76 5 2.0 2.0 2.0 2.0
77 res6: a b c d e
78 0 1.0 1.0 1.0 1.0 NaN
79 1 1.0 1.0 1.0 1.0 NaN
80 2 1.0 1.0 1.0 1.0 NaN
81 0 2.0 2.0 2.0 2.0 NaN
82 1 2.0 2.0 2.0 2.0 NaN
83 2 2.0 2.0 2.0 2.0 NaN
84 0 NaN 3.0 3.0 3.0 3.0
85 1 NaN 3.0 3.0 3.0 3.0
86 2 NaN 3.0 3.0 3.0 3.0
87 res7: a b c d
88 0 1.0 1.0 1.0 1.0
89 1 1.0 1.0 1.0 1.0
90 2 1.0 1.0 1.0 1.0
91 3 1.0 2.0 3.0 4.0
添加
1 #merge
2 import numpy as np
3 import pandas as pd
4 left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
5 right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
6 print(left)
7 print(right)
8 res=pd.merge(left,right,on='key')#基于相同columns=‘key’进行合并
9 print(res)
10 # res2=pd.merge(left,right,on=['key1','key2'])合并多个key
11 #how=['left','right','outer','inner']合并的方式:基于左边的表进行填充,右边的表进行填充,并集,交集
12 df1=pd.DataFrame({'coll':[0,1],'col_left':['a','b']})
13 df2=pd.DataFrame({'coll':[1,2,2],'col_left':[2,2,2]})
14 print(df1)
15 print(df2)
16 res2=pd.merge(df1,df2,on='coll',how='outer',indicator=True)#indicator显示merge方式
17 print(res2)
18 #left_index和right_index:是否考虑左边的index和右边的index,值有True或False
19 #suffixes:合并时,给一样的columns,不一样的数据,添加标记进行区分
20 boy=pd.DataFrame({'k':['k1','k2'],'age':[1,2]})
21 girl=pd.DataFrame({'k':['k1','k2'],'age':[3,4]})
22 res3=pd.merge(boy,girl,on='k',suffixes=['_boy','_girl'],how='inner')
23 print(res3)
24 ----------------------------------------------------------------
25 key lval
26 0 foo 1
27 1 bar 2
28 key rval
29 0 foo 4
30 1 bar 5
31 key lval rval
32 0 foo 1 4
33 1 bar 2 5
34 col_left coll
35 0 a 0
36 1 b 1
37 col_left coll
38 0 2 1
39 1 2 2
40 2 2 2
41 col_left_x coll col_left_y _merge
42 0 a 0 NaN left_only
43 1 b 1 2.0 both
44 2 NaN 2 2.0 right_only
45 3 NaN 2 2.0 right_only
46 age_boy k age_girl
47 0 1 k1 3
48 1 2 k2 4
merge
九.运算
1 import pandas as pd
2 import numpy as np
3 data={'one':{'a':1,'b':2,'c':3,'d':np.nan},
4 'two': {'a': 2.0, 'b': 3.0, 'c': 4.0, 'd': 5.0}}
5 df=pd.DataFrame(data,index=['a','b','c','d'],columns=['one','two'])
6 print('df:',df)
7 #根据原有的列添加新列
8 df['three']=df['one']*10+df['two']
9 print(df)
10 #求均值
11 print('mean:\n',df.mean(1))#参数1,计算行均值;默认0,按列求均值,
12 print('sum:\n',df.sum(1))#参数1,计算行和;默认0,按列求和
13 print('函数:\n',df.apply(lambda x:x.max()-x.min()))#将一个函数应用到dataframe的每一列
14 #设置索引
15 df.set_index('one')
16 #重命名列
17 df.rename(columns={'one':'1'},inplace=True)
18 print(df)
19 ----------------------------------------------------
20 df: one two
21 a 1.0 2.0
22 b 2.0 3.0
23 c 3.0 4.0
24 d NaN 5.0
25 one two three
26 a 1.0 2.0 12.0
27 b 2.0 3.0 23.0
28 c 3.0 4.0 34.0
29 d NaN 5.0 NaN
30 mean:
31 a 5.000000
32 b 9.333333
33 c 13.666667
34 d 5.000000
35 dtype: float64
36 sum:
37 a 15.0
38 b 28.0
39 c 41.0
40 d 5.0
41 dtype: float64
42 函数:
43 one 2.0
44 two 3.0
45 three 22.0
46 dtype: float64
47 1 two three
48 a 1.0 2.0 12.0
49 b 2.0 3.0 23.0
50 c 3.0 4.0 34.0
51 d NaN 5.0 NaN
运算
1 import pandas as pd
2 import numpy as np
3 data={'one':{'a':1,'b':2,'c':3,'d':np.nan},
4 'two': {'a': 2.0, 'b': 3.0, 'c': 4.0, 'd': 5.0}}
5 df=pd.DataFrame(data,index=['a','b','c','d'],columns=['one','two'])
6 print('df:',df)
7 #查看最大最小值即其对应index
8 print(df['two'].max())#min取最小
9 print(df['two'].idxmax())#idmin取最小
10 #重设索引
11 df.reset_index(inplace=True)
12 print(df)
13 #改变数据类型
14 print(df.dtypes)
15 df1=df[['two',]].astype('int32')
16 print(df1.dtypes)
17 #计算频率
18 print(df['index'].value_counts())
19 #字符方法
20 print(df['index'].str.upper())#大写
21 print(df['index'].str.len())#长度
22 print(df['index'].str.contains('s'))#包含
23 #。。。。
24 -------------------------------------------------------------
25 df: one two
26 a 1.0 2.0
27 b 2.0 3.0
28 c 3.0 4.0
29 d NaN 5.0
30 5.0
31 d
32 index one two
33 0 a 1.0 2.0
34 1 b 2.0 3.0
35 2 c 3.0 4.0
36 3 d NaN 5.0
37 index object
38 one float64
39 two float64
40 dtype: object
41 two int32
42 dtype: object
43 c 1
44 d 1
45 b 1
46 a 1
47 Name: index, dtype: int64
48 0 A
49 1 B
50 2 C
51 3 D
52 Name: index, dtype: object
53 0 1
54 1 1
55 2 1
56 3 1
57 Name: index, dtype: int64
58 0 False
59 1 False
60 2 False
61 3 False
62 Name: index, dtype: bool
运算2
十.可视化数据
1 import pandas as pd
2 import numpy as np
3 import matplotlib.pyplot as plt
4 # data=pd.Series(np.random.randn(1000),index=np.arange(1000))#随机生成数据
5 # data=data.cumsum()#把生成的数据进行累加
6 # data.plot()#画图
7 # plt.show()#显示图
8
9 data=pd.DataFrame(np.random.randn(1000,4),
10 index=np.arange(1000),
11 columns=list('ABCD'))
12 data=data.cumsum()#把生成的数据进行累加
13 data.plot()#画图
14 plt.show()#显示图
15 -------------------------------------
可视化
1 data=pd.DataFrame(np.random.randn(1000,4),
2 index=np.arange(1000),
3 columns=list('ABCD'))
4 data=data.cumsum()
5 ax=data.plot.scatter(x='A',y='B',color='DarkBlue',label='Class 1')
6 data.plot.scatter(x='A',y='C',color='DarkGreen',label='Class 2',ax=ax)
7 plt.show()
8 -----------------------------------------
可视化
十.总结
1.pandas中index表示行(列表),对应axis=0行操作
columns表示列(列表),对应axis=1列操作
通过字典创建dataframe,创建的图表:最左边是index列表,从上到下;最上面是columns列表,从左到右;中间是字典数据,每一个数据对应相应的index和columns
ix(index,columns):既可以传入索引,也可以传入切片
file:///C:/Users/11373/Desktop/Pandas_Cheat_Sheet.pdf