This is a short introduction to pandas, aimed mainly at new users. See the Cookbook for more complex recipes.
Customarily, we import as follows:
Python
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: import matplotlib.pyplot as plt
Object Creation
Create a Series by passing a list of values, letting pandas create a default integer index:
Python
In [4]: s = pd.Series([1, 3, 5, np.nan, 6, 8])
In [5]: s
Out[5]:
0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64
Create a DataFrame by passing a numpy array, with a datetime index and labeled columns:
Python
In [6]: dates = pd.date_range('20130101', periods=6)
In [7]: dates
Out[7]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None
In [8]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
In [9]: df
Out[9]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
Create a DataFrame by passing a dict of objects that can be converted to Series-like columns:
Python
In [10]: df2 = pd.DataFrame({'A': 1.,
   ....:                     'B': pd.Timestamp('20130102'),
   ....:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   ....:                     'D': np.array([3] * 4, dtype='int32'),
   ....:                     'E': pd.Categorical(["test", "train", "test", "train"]),
   ....:                     'F': 'foo'})
   ....:
In [11]: df2
Out[11]:
   A          B  C  D      E    F
0  1 2013-01-02  1  3   test  foo
1  1 2013-01-02  1  3  train  foo
2  1 2013-01-02  1  3   test  foo
3  1 2013-01-02  1  3  train  foo
Each column of the result has its own specific dtype:
Python
In [12]: df2.dtypes
Out[12]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
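The dict-based construction above mixes scalars with array-like values. The following is a minimal sketch (the name `frame` and the reduced set of columns are chosen here for illustration only) showing that scalars are broadcast to the length of the array-like columns and that each column keeps the dtype it was given.

```python
import numpy as np
import pandas as pd

# Scalars ('A' and 'F') are repeated to match the length of the array-like columns.
frame = pd.DataFrame({
    'A': 1.0,
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'F': 'foo',
})

print(frame.shape)   # four rows, four columns
print(frame.dtypes)  # one dtype per column
```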
If you are using IPython, tab completion for column names (as well as public attributes) is enabled automatically. Here is a subset of the attributes that will be completed:
Python
In [13]: df2.<TAB>
df2.A                  df2.boxplot
df2.abs                df2.C
df2.add                df2.clip
df2.add_prefix         df2.clip_lower
df2.add_suffix         df2.clip_upper
df2.align              df2.columns
df2.all                df2.combine
df2.any                df2.combineAdd
df2.append             df2.combine_first
df2.apply              df2.combineMult
df2.applymap           df2.compound
df2.as_blocks          df2.consolidate
df2.asfreq             df2.convert_objects
df2.as_matrix          df2.copy
df2.astype             df2.corr
df2.at                 df2.corrwith
df2.at_time            df2.count
df2.axes               df2.cov
df2.B                  df2.cummax
df2.between_time       df2.cummin
df2.bfill              df2.cumprod
df2.blocks             df2.cumsum
df2.bool               df2.D
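As you can see, the columns A, B, C, and D are tab-completed automatically. E is there as well; the rest of the attributes have been truncated for brevity.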
Viewing Data
See the Basics section.
View the top and bottom rows of the frame:
Python
In [14]: df.head()
Out[14]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
In [15]: df.tail(3)
Out[15]:
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
Display the index, the columns, and the underlying numpy data:
Python
In [16]: df.index
Out[16]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None
In [17]: df.columns
Out[17]: Index([u'A', u'B', u'C', u'D'], dtype='object')
In [18]: df.values
Out[18]:
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])
describe() shows a quick statistical summary of the data:
Python
In [19]: df.describe()
Out[19]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804
Transpose the data:
Python
In [20]: df.T
Out[20]:
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988
Sort by an axis:
Python
In [21]: df.sort_index(axis=1, ascending=False)
Out[21]:
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690
Sort by values:
Python
In [22]: df.sort_values(by='B')
Out[22]:
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
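The two kinds of sorting can be tried on a small hand-made frame. This is only a sketch: the data and the names `frame`, `by_b`, and `cols_desc` are made up for illustration, and it uses `sort_values`, which is how current pandas releases spell sorting by column values.

```python
import pandas as pd

# A tiny illustrative frame; the values are arbitrary.
frame = pd.DataFrame({'A': [3, 1, 2], 'B': [0.5, -2.0, 1.5]},
                     index=['x', 'y', 'z'])

by_b = frame.sort_values(by='B')                       # reorder rows by the values in column B
cols_desc = frame.sort_index(axis=1, ascending=False)  # reorder columns by their labels, descending

print(by_b)
print(cols_desc)
```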
Selection
Note:
See the indexing documentation: Indexing and Selecting Data, and MultiIndex / Advanced Indexing.
Getting
Select a single column, which yields a Series; this is equivalent to df.A:
Python
In [23]: df['A']
Out[23]:
2013-01-01    0.469112
2013-01-02    1.212112
2013-01-03   -0.861849
2013-01-04    0.721555
2013-01-05   -0.424972
2013-01-06   -0.673690
Freq: D, Name: A, dtype: float64
Select with [], which slices the rows:
Python
In [24]: df[0:3]
Out[24]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
In [25]: df['20130102':'20130104']
Out[25]:
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
Selection by Label
See more in Selection by Label.
Get a cross section using a label:
Python
In [26]: df.loc[dates[0]]
Out[26]:
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64
Select on multiple axes by label:
Python
In [27]: df.loc[:, ['A', 'B']]
Out[27]:
                   A         B
2013-01-01  0.469112 -0.282863
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
2013-01-06 -0.673690  0.113648
When slicing by label, both endpoints are included:
Python
In [28]: df.loc['20130102':'20130104', ['A', 'B']]
Out[28]:
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
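The endpoint behaviour is easy to get wrong, so here is a small sketch contrasting label slices with positional slices. The frame is filled with placeholder integers, and the names `frame`, `by_label`, and `by_position` are chosen for this illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label-based slice: the stop label '20130104' is part of the result.
by_label = frame.loc['20130102':'20130104', ['A', 'B']]

# Position-based slice: the stop position 3 is excluded, as in plain Python.
by_position = frame.iloc[1:3, 0:2]

print(len(by_label), len(by_position))
```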
Selecting a single label reduces the dimensions of the returned object:
Python
In [29]: df.loc['20130102', ['A', 'B']]
Out[29]:
A    1.212112
B   -0.173215
Name: 2013-01-02 00:00:00, dtype: float64
Get a scalar value:
Python
In [30]: df.loc[dates[0], 'A']
Out[30]: 0.46911229990718628
For fast access to a scalar (equivalent to the previous method):
Python
In [31]: df.at[dates[0], 'A']
Out[31]: 0.46911229990718628
Selection by Position
See more in Selection by Position.
Select by the position of the passed integer:
Python
In [32]: df.iloc[3]
Out[32]:
A    0.721555
B   -0.706771
C   -1.039575
D    0.271860
Name: 2013-01-04 00:00:00, dtype: float64
Select by integer slices, which act like numpy/python slices:
Python
In [33]: df.iloc[3:5, 0:2]
Out[33]:
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
Select by lists of integer positions, similar to the numpy/python style:
Python
In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232
Slice rows explicitly:
Python
In [35]: df.iloc[1:3, :]
Out[35]:
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
Slice columns explicitly:
Python
In [36]: df.iloc[:, 1:3]
Out[36]:
                   B         C
2013-01-01 -0.282863 -1.509059
2013-01-02 -0.173215  0.119209
2013-01-03 -2.104569 -0.494929
2013-01-04 -0.706771 -1.039575
2013-01-05  0.567020  0.276232
2013-01-06  0.113648 -1.478427
Get a single value explicitly:
Python
In [37]: df.iloc[1, 1]
Out[37]: -0.17321464905330861
For fast access to a scalar (equivalent to the previous method):
Python
In [38]: df.iat[1, 1]
Out[38]: -0.17321464905330861
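As a quick check of the equivalences described above, the sketch below reads the same cell through the general accessors (`loc`, `iloc`) and the scalar accessors (`at`, `iat`). The frame contents and variable names are placeholders for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=3)
# Values are 0.0 .. 5.0 laid out row by row.
frame = pd.DataFrame(np.arange(6.0).reshape(3, 2), index=dates, columns=['A', 'B'])

by_label = frame.loc[dates[0], 'A']       # label-based, general accessor
by_label_fast = frame.at[dates[0], 'A']   # label-based, scalar-only accessor
by_pos = frame.iloc[1, 1]                 # position-based, general accessor
by_pos_fast = frame.iat[1, 1]             # position-based, scalar-only accessor

print(by_label == by_label_fast, by_pos == by_pos_fast)
```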
Boolean Indexing
Use the values of a single column to select data:
Python
In [39]: df[df.A > 0]
Out[39]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
A where operation for getting: values that do not meet the condition become NaN.
Python
In [40]: df[df > 0]
Out[40]:
                   A         B         C         D
2013-01-01  0.469112       NaN       NaN       NaN
2013-01-02  1.212112       NaN  0.119209       NaN
2013-01-03       NaN       NaN       NaN  1.071804
2013-01-04  0.721555       NaN       NaN  0.271860
2013-01-05       NaN  0.567020  0.276232       NaN
2013-01-06       NaN  0.113648       NaN  0.524988
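Indexing a frame with a boolean frame of the same shape behaves like the `DataFrame.where` method. A minimal sketch on made-up data (the names `frame`, `masked`, and `same` are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, -2.0], 'B': [-3.0, 4.0]})

masked = frame[frame > 0]      # cells failing the condition become NaN
same = frame.where(frame > 0)  # the method form of the same operation

print(masked)
print(masked.equals(same))     # equals() treats NaNs in matching positions as equal
```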
Use the isin() method for filtering:
Python
In [41]: df2 = df.copy()
In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
In [43]: df2
Out[43]:
                   A         B         C         D      E
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three
In [44]: df2[df2['E'].isin(['two', 'four'])]
Out[44]:
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four
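Because `isin()` returns a boolean Series, the mask can be stored and negated with `~` to select the complement. A short sketch, with an illustrative frame and variable names:

```python
import pandas as pd

frame = pd.DataFrame({'E': ['one', 'one', 'two', 'three', 'four', 'three'],
                      'v': range(6)})

mask = frame['E'].isin(['two', 'four'])  # True where E is one of the listed values
selected = frame[mask]                   # rows that match
rest = frame[~mask]                      # rows that do not match

print(selected)
print(rest)
```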
Setting
Setting a new column automatically aligns the data by index:
Python
In [45]: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
In [46]: s1
Out[46]:
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
In [47]: df['F'] = s1
Set values by label:
Python
In [48]: df.at[dates[0], 'A'] = 0
Set values by position:
Python
In [49]: df.iat[0, 1] = 0
Set a whole column by assigning a numpy array:
Python
In [50]: df.loc[:, 'D'] = np.array([5] * len(df))
The result of the preceding setting operations:
Python
In [51]: df
Out[51]:
                   A         B         C  D   F
2013-01-01  0.000000  0.000000 -1.509059  5 NaN
2013-01-02  1.212112 -0.173215  0.119209  5   1
2013-01-03 -0.861849 -2.104569 -0.494929  5   2
2013-01-04  0.721555 -0.706771 -1.039575  5   3
2013-01-05 -0.424972  0.567020  0.276232  5   4
2013-01-06 -0.673690  0.113648 -1.478427  5   5
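The NaN in column F on the first date comes from index alignment: the assigned Series has no entry for that date, and its entry for a date outside the frame is dropped. A scaled-down sketch of the same effect (three rows instead of six, with illustrative names and values):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=3)
frame = pd.DataFrame({'A': [10, 20, 30]}, index=dates)

# The Series starts one day later than the frame, so the indexes only partly overlap.
s1 = pd.Series([1, 2, 3], index=pd.date_range('20130102', periods=3))

frame['F'] = s1  # values are matched by date, not by position
print(frame)
```

The first row gets NaN because `s1` has no value for 2013-01-01, and the value for 2013-01-04 is discarded because the frame has no such row.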
A where operation with setting:
Python
In [52]: df2 = df.copy()
In [53]: df2[df2 > 0] = -df2
In [54]: df2
Out[54]:
                   A         B         C  D   F
2013-01-01  0.000000  0.000000 -1.509059 -5 NaN
2013-01-02 -1.212112 -0.173215 -0.119209 -5  -1
2013-01-03 -0.861849 -2.104569 -0.494929 -5  -2
2013-01-04 -0.721555 -0.706771 -1.039575 -5  -3
2013-01-05 -0.424972 -0.567020 -0.276232 -5  -4
2013-01-06 -0.673690 -0.113648 -1.478427 -5  -5
Missing Data
pandas uses np.nan to represent missing data. By default it is not included in computations. See the Missing Data section.
Reindexing lets you change, add, or delete the index on a specified axis. It returns a copy of the data.
Python
In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
In [56]: df1.loc[dates[0]:dates[1], 'E'] = 1
In [57]: df1
Out[57]:
                   A         B         C  D   F   E
2013-01-01  0.000000  0.000000 -1.509059  5 NaN   1
2013-01-02  1.212112 -0.173215  0.119209  5   1   1
2013-01-03 -0.861849 -2.104569 -0.494929  5   2 NaN
2013-01-04  0.721555 -0.706771 -1.039575  5   3 NaN
Drop any rows that have missing data:
Python
In [58]: df1.dropna(how='any')
Out[58]:
                   A         B         C  D  F  E
2013-01-02  1.212112 -0.173215  0.119209  5  1  1
Fill in missing data:
Python
In [59]: df1.fillna(value=5)
Out[59]:
                   A         B         C  D  F  E
2013-01-01  0.000000  0.000000 -1.509059  5  5  1
2013-01-02  1.212112 -0.173215  0.119209  5  1  1
2013-01-03 -0.861849 -2.104569 -0.494929  5  2  5
2013-01-04  0.721555 -0.706771 -1.039575  5  3  5
Get a boolean mask marking where values are NaN:
Python
In [60]: pd.isnull(df1)
Out[60]:
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True
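The three missing-data tools above can be compared side by side on a small frame. This is a sketch with made-up values; the names `frame`, `dropped`, `filled`, and `flags` are illustrative. Note that `dropna` and `fillna` return new objects and leave the original untouched.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                      'B': [np.nan, 5.0, 6.0]})

dropped = frame.dropna(how='any')  # keep only rows without any NaN
filled = frame.fillna(value=0)     # replace every NaN with 0
flags = pd.isnull(frame)           # boolean frame, True where a value is NaN

print(dropped)
print(filled)
print(flags)
```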
Operations
See the Basics section on Binary Operations.
Stats
Operations in general exclude missing data.
Perform a descriptive statistic:
Python
In [61]: df.mean()
Out[61]:
A   -0.004474
B   -0.383981
C   -0.687758
D    5.000000
F    3.000000
dtype: float64
Perform the same operation on the other axis:
Python
In [62]: df.mean(1)
Out[62]:
2013-01-01    0.872735
2013-01-02    1.431621
2013-01-03    0.707731
2013-01-04    1.395042
2013-01-05    1.883656
2013-01-06    1.592306
Freq: D, dtype: float64
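To see what "excluding missing data" means in practice, the sketch below computes means over a small frame that contains one NaN, first with the default behaviour and then with `skipna=False`. The data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                      'F': [np.nan, 1.0, 2.0]})

col_means = frame.mean()            # per column; the NaN in F is skipped
row_means = frame.mean(axis=1)      # per row; the first row averages only A
strict = frame.mean(skipna=False)   # a NaN anywhere in a column makes its mean NaN

print(col_means)
print(row_means)
print(strict)
```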
Operate on objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.
Python
In [63]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
In [64]: s
Out[64]:
2013-01-01   NaN
2013-01-02   NaN
2013-01-03     1
2013-01-04     3
2013-01-05     5
2013-01-06   NaN
Freq: D, dtype: float64
In [65]: df.sub(s, axis='index')
Out[65]:
                   A         B         C   D   F
2013-01-01       NaN       NaN       NaN NaN NaN
2013-01-02       NaN       NaN       NaN NaN NaN
2013-01-03 -1.861849 -3.104569 -1.494929   4   1
2013-01-04 -2.278445 -3.706771 -4.039575   2   0
2013-01-05 -5.424972 -4.432980 -4.723768   0  -1
2013-01-06       NaN       NaN       NaN NaN NaN
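The subtraction above matches the Series to the frame's rows by index label and then applies the same Series value across every column of that row. A smaller sketch of the same pattern, with illustrative values and names:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=4)
frame = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                      'B': [10.0, 20.0, 30.0, 40.0]}, index=dates)

# shift(2) moves values down two rows, leaving NaN in the first two positions.
s = pd.Series([1.0, 3.0, 5.0, 7.0], index=dates).shift(2)

# axis='index' aligns s with the rows; each row's value is subtracted from all columns.
result = frame.sub(s, axis='index')
print(result)
```

Rows where `s` is NaN come out as NaN in every column, which is why the first two rows of the result are empty.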
Apply
Apply functions to the data:
Python
In [66]: df.apply(np.cumsum)
Out[66]:
                   A         B         C   D   F
2013-01-01  0.000000  0.000000 -1.509059   5 NaN
2013-01-02  1.212112 -0.173215 -1.389850  10   1
2013-01-03  0.350263 -2.277784 -1.884779  15   3
2013-01-04  1.071818 -2.984555 -2.924354  20   6
2013-01-05  0.646846 -2.417535 -2.648122  25  10
2013-01-06 -0.026844 -2.303886 -4.126549  30  15
In [67]: df.apply(lambda x: x.max() - x.min())
Out[67]:
A    2.073961
B    2.671590
C    1.785291
D    0.000000
F    4.000000
dtype: float64
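By default `apply` hands each column to the function; passing `axis=1` hands it each row instead. The following sketch shows both on a tiny frame (data and names are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 4.0, 2.0],
                      'B': [10.0, 10.0, 10.0]})

spread = frame.apply(lambda x: x.max() - x.min())              # one value per column
row_spread = frame.apply(lambda x: x.max() - x.min(), axis=1)  # one value per row
running = frame.apply(np.cumsum)                               # cumulative sum down each column

print(spread)
print(row_spread)
print(running)
```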
Histogramming
See more at Histogramming and Discretization.
Python
In [68]: s = pd.Series(np.random.randint(0, 7, size=10))
In [69]: s
Out[69]:
0    4
1    2
2    1
3    2
4    6
5    4
6    4
7    6
8    4
9    4
dtype: int32
In [70]: s.value_counts()
Out[70]:
4    5
6    2
2    2
1    1
dtype: int64
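Since the session above uses random integers, its output changes from run to run. The sketch below fixes the values to the ones shown in the session so the counts can be reproduced; the variable names are illustrative.

```python
import pandas as pd

# The same ten values that appear in the session output above.
fixed = pd.Series([4, 2, 1, 2, 6, 4, 4, 6, 4, 4])

counts = fixed.value_counts()  # index: distinct values, data: occurrences, most frequent first
print(counts)
```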
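String Methods
Series comes with a set of string processing methods, available through the str attribute, that make it easy to operate on each element of the array, as in the snippet below. Note that pattern matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.
Python
In [71]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [72]: s.str.lower()
Out[72]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object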