Pandas Official Documentation in Chinese: Essential Basic Functionality, Part 3

Reposted

lsxxx2011 2022-11-29 19:08:01

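Function application

To apply your own or another library's functions to pandas objects, you need the methods below. Which one to use depends on whether the function operates on an entire DataFrame or Series, on rows or columns, or on individual elements.

Tablewise function application: `pipe()`
Row or column-wise function application: `apply()`
Aggregation API: `agg()` and `transform()`
Elementwise function application: `applymap()`

Tablewise function application

DataFrames and Series can be passed directly into functions. However, when functions need to be called in a chain, it is better to use the pipe() method. Compare the following two approaches: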

# f, g, and h are functions taking and returning ``DataFrames``
>>> f(g(h(df), arg1=1), arg2=2, arg3=3)

The following code is equivalent to the above:

>>> (df.pipe(h)
...    .pipe(g, arg1=1)
...    .pipe(f, arg2=2, arg3=3))

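pandas encourages the second style, known as method chaining. pipe makes it easy to use your own or another library's functions in a method chain, alongside pandas' own methods.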
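As a small illustration of the equivalence described above, the sketch below runs the same three steps once as nested calls and once as a pipe chain. The helpers drop_missing, scale, and add_offset are hypothetical stand-ins for h, g, and f; they are not part of pandas.

```python
import pandas as pd

# Hypothetical helpers: each takes a DataFrame as its first argument
# and returns a DataFrame, like h, g, and f in the text.
def drop_missing(df):
    return df.dropna()

def scale(df, factor=1.0):
    return df * factor

def add_offset(df, offset=0.0, label="shifted"):
    out = df + offset
    out.columns = [f"{name}_{label}" for name in out.columns]
    return out

df = pd.DataFrame({"x": [1.0, None, 3.0], "y": [4.0, 5.0, 6.0]})

# Nested style: reads inside-out.
nested = add_offset(scale(drop_missing(df), factor=2), offset=1, label="adj")

# Chained style: reads top to bottom in execution order.
chained = (df.pipe(drop_missing)
             .pipe(scale, factor=2)
             .pipe(add_offset, offset=1, label="adj"))
```

Both variables hold the same result; the chained form lists the steps in the order they run.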

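In the example above, the functions f, g, and h each expect the DataFrame as the first positional argument. What if the function you want to apply takes its data as, say, the second argument? In that case, pass pipe a tuple of (callable, data_keyword). .pipe then routes the DataFrame to the argument named in the tuple.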
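A minimal sketch of the tuple form, assuming a hypothetical function column_mean whose data argument is the keyword data rather than the first positional argument:

```python
import pandas as pd

# Hypothetical function: the DataFrame arrives through the `data` keyword,
# while the first positional argument is a column name.
def column_mean(column, data):
    return data[column].mean()

df = pd.DataFrame({"h": [1, 2, 3], "hr": [10, 20, 30]})

# pipe passes df as the keyword named in the tuple ("data");
# "hr" is forwarded as the first positional argument.
result = df.pipe((column_mean, "data"), "hr")
```

The call is equivalent to column_mean("hr", data=df).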

The next example uses statsmodels to fit a regression. Its API expects a formula first and a DataFrame as the second argument, data. We therefore pass the function and keyword pair (sm.ols, 'data') to pipe:

In [138]: import statsmodels.formula.api as sm

In [139]: bb = pd.read_csv('data/baseball.csv', index_col='id')

In [140]: (bb.query('h > 0')
   .....:    .assign(ln_h=lambda df: np.log(df.h))
   .....:    .pipe((sm.ols, 'data'), 'hr ~ ln_h + year + g + C(lg)')
   .....:    .fit()
   .....:    .summary()
   .....:  )
   .....:
Out[140]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                     hr   R-squared:                        0.685
Model:                            OLS   Adj. R-squared:                   0.665
Method:                 Least Squares   F-statistic:                      34.28
Date:                Thu, 22 Aug 2019   Prob (F-statistic):            3.48e-15
Time:                        15:48:59   Log-Likelihood:                -205.92
No. Observations:                   68   AIC:                              421.8
Df Residuals:                       63   BIC:                              432.9
Df Model:                            4
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept    -8484.7720    4664.146      -1.819       0.074    -1.78e+04      835.780
C(lg)[T.NL]     -2.2736       1.325      -1.716       0.091       -4.922        0.375
ln_h            -1.3542       0.875      -1.547       0.127       -3.103        0.395
year             4.2277       2.324       1.819       0.074       -0.417        8.872
g                0.1841       0.029       6.258       0.000        0.125        0.243
==============================================================================
Omnibus:                        10.875   Durbin-Watson:                    1.999
Prob(Omnibus):                   0.004   Jarque-Bera (JB):                17.298
Skew:                            0.537   Prob(JB):                      0.000175
Kurtosis:                        5.225   Cond. No.                      1.49e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

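The pipe method is inspired by Unix pipes and, more recently, by dplyr and magrittr, which introduced the popular (%>%) operator (read "pipe") for R. The implementation of pipe here is quite clean and feels right at home in Python. We encourage you to read the source code of pipe().

Row or column-wise function application

The apply() method applies an arbitrary function along an axis of a DataFrame. Like the descriptive statistics methods, it takes an optional axis argument:

In [141]: df.apply(np.mean)
Out[141]: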
one       0.811094
two       1.360588
three     0.187958
dtype: float64

In [142]: df.apply(np.mean, axis=1)
Out[142]:
a     1.583749
b     0.734929
c     1.133683
d    -0.166914
dtype: float64

In [143]: df.apply(lambda x: x.max() - x.min())
Out[143]:
one       1.051928
two       1.632779
three     1.840607
dtype: float64

In [144]: df.apply(np.cumsum)
Out[144]:
        one       two     three
a   1.394981   1.772517       NaN
b   1.738035   3.684640 -0.050390
c   2.433281   5.163008   1.177045
d       NaN   5.442353   0.563873

In [145]: df.apply(np.exp)
Out[145]:
        one       two     three
a   4.034899   5.885648       NaN
b   1.409244   6.767440   0.950858
c   2.004201   4.385785   3.412466
d       NaN   1.322262   0.541630

The apply() method also dispatches on a string method name:

In [146]: df.apply('mean')
Out[146]:
one       0.811094
two       1.360588
three     0.187958
dtype: float64

In [147]: df.apply('mean', axis=1)
Out[147]:
a     1.583749
b     0.734929
c     1.133683
d    -0.166914
dtype: float64
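For a reproducible check of the string dispatch shown above, the following sketch uses a small made-up DataFrame (the values are illustrative only) and compares the string form with the direct method call:

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 6.0, 8.0]})

by_name = df.apply("mean")            # dispatches to df.mean()
by_method = df.mean()                 # direct method call
row_means = df.apply("mean", axis=1)  # axis is forwarded to the method
```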

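By default, the return type of the function passed to apply() determines the type of the final output of DataFrame.apply:

If the applied function returns a Series, the final output is a DataFrame. Its columns match the index of the Series returned by the applied function.
If the applied function returns any other type, the final output is a Series.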
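The two rules above can be seen on a small made-up DataFrame. This is only a sketch; the labels "lo" and "hi" are arbitrary names chosen for the illustration, and the second call uses axis=1 so that the returned Series index becomes the output columns.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]})

# The function returns a scalar per column -> the result is a Series.
ranges = df.apply(lambda col: col.max() - col.min())

# The function returns a Series per row -> the result is a DataFrame
# whose columns are the index labels of the returned Series.
stats = df.apply(lambda row: pd.Series({"lo": row.min(), "hi": row.max()}), axis=1)
```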

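This default behavior can be overridden with result_type, which accepts three options: reduce, broadcast, and expand. These determine whether list-like return values are expanded into a DataFrame.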
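A short sketch of how result_type changes the outcome for a function that returns a list per row. The DataFrame and the returned values are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Default: each returned list is kept as one object, giving a Series of lists.
as_series = df.apply(lambda row: [row["a"], row["b"] * 10], axis=1)

# "expand": the list elements are spread across new columns (0, 1, ...).
expanded = df.apply(lambda row: [row["a"], row["b"] * 10], axis=1,
                    result_type="expand")

# "broadcast": the result keeps the original shape, index, and column labels.
broadcast = df.apply(lambda row: [0, 1], axis=1, result_type="broadcast")
```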

apply() combined with some cleverness can answer many questions about a data set. For example, suppose we want the date on which each column reaches its maximum value:

In [148]: tsdf = pd.DataFrame(np.random.randn(1000, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=1000))
   .....:

In [149]: tsdf.apply(lambda x: x.idxmax())
Out[149]:
A   2000-08-06
B   2001-01-18
C   2001-07-18
dtype: datetime64[ns]

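You can also pass additional positional and keyword arguments to apply(). For instance, consider the following function:

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

You can then apply this function as follows:

df.apply(subtract_and_divide, args=(5,), divide=3)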
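A worked version with concrete numbers, assuming a small made-up DataFrame: args supplies the positional argument sub, and divide is forwarded as a keyword argument.

```python
import pandas as pd

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

df = pd.DataFrame({"one": [5.0, 8.0], "two": [11.0, 14.0]})

# Each column is passed as x; sub=5 comes from args, divide=3 from the keyword.
result = df.apply(subtract_and_divide, args=(5,), divide=3)
```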

Another useful feature is the ability to pass Series methods to carry out a Series operation on each column or row:

In [150]: tsdf
Out[150]:
                   A         B         C
2000-01-01 -0.158131 -0.232466  0.321604
2000-01-02 -1.810340 -3.105758  0.433834
2000-01-03 -1.209847 -1.156793 -0.136794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08 -0.653602  0.178875  1.008298
2000-01-09  1.007996  0.462824  0.254472
2000-01-10  0.307473  0.600337  1.643950

In [151]: tsdf.apply(pd.Series.interpolate)
Out[151]:
                   A         B         C
2000-01-01 -0.158131 -0.232466  0.321604
2000-01-02 -1.810340 -3.105758  0.433834
2000-01-03 -1.209847 -1.156793 -0.136794
2000-01-04 -1.098598 -0.889659  0.092225
2000-01-05 -0.987349 -0.622526  0.321243
2000-01-06 -0.876100 -0.355392  0.550262
2000-01-07 -0.764851 -0.088259  0.779280
2000-01-08 -0.653602  0.178875  1.008298
2000-01-09  1.007996  0.462824  0.254472
2000-01-10  0.307473  0.600337  1.643950

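apply() takes an argument raw, which is False by default; in that case each row or column is converted into a Series before the function is applied. When raw is set to True, the passed function receives an ndarray object instead, which can improve performance noticeably if you do not need the indexing functionality.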
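The difference can be observed by recording the type of the object the function receives. This is only a sketch on a tiny made-up DataFrame; it shows the types involved and does not measure performance.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

seen_default, seen_raw = [], []

def total_default(values):
    seen_default.append(type(values))  # record what apply() handed over
    return values.sum()

def total_raw(values):
    seen_raw.append(type(values))
    return values.sum()

by_series = df.apply(total_default)       # raw=False: each column is a Series
by_array = df.apply(total_raw, raw=True)  # raw=True: each column is an ndarray
```

Both calls return the same column sums; only the object passed to the function differs.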

Aggregation API

New in version 0.20.0.

The aggregation API lets you express possibly multiple aggregation operations in a single concise way. This API is similar across pandas objects; see the groupby API, the window functions API, and the resample API. The entry point for aggregation is DataFrame.aggregate(), or its alias DataFrame.agg().

We will use a DataFrame similar to the one in the previous examples:

In [152]: tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=10))
   .....:

In [153]: tsdf.iloc[3:7] = np.nan

In [154]: tsdf
Out[154]:
                   A         B         C
2000-01-01  1.257606  1.004194  0.167574
2000-01-02 -0.749892  0.288112 -0.757304
2000-01-03 -0.207550 -0.298599  0.116018
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.814347 -0.257623  0.869226
2000-01-09 -0.250663 -1.206601  0.896839
2000-01-10  2.169758 -1.333363  0.283157

Using a single function is equivalent to apply(). You can also pass the aggregation function name as a string. These return a Series of the aggregated output:

In [155]: tsdf.agg(np.sum)
Out[155]:
A     3.033606
B    -1.803879
C     1.575510
dtype: float64

In [156]: tsdf.agg('sum')
Out[156]:
A     3.033606
B    -1.803879
C     1.575510
dtype: float64

# Because a single function is applied, this is equivalent to ``.sum()``
In [157]: tsdf.sum()
Out[157]:
A     3.033606
B    -1.803879
C     1.575510
dtype: float64

A single aggregation on a Series returns a scalar value:

In [158]: tsdf.A.agg('sum')
Out[158]: 3.033606102414146

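Aggregating with multiple functions

You can pass multiple aggregation functions as a list. The result of each function is a row in the resulting DataFrame, and the rows are named after the aggregation functions.

In [159]: tsdf.agg(['sum'])
Out[159]:
            A         B        C
sum  3.033606 -1.803879  1.57551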

Multiple functions yield multiple rows:

In [160]: tsdf.agg(['sum', 'mean'])
Out[160]:
             A         B         C
sum    3.033606 -1.803879   1.575510
mean   0.505601 -0.300647   0.262585

On a Series, multiple functions return a Series indexed by the function names:

In [161]: tsdf.A.agg(['sum', 'mean'])
Out[161]:
sum     3.033606
mean    0.505601
Name: A, dtype: float64
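The scalar and list forms of Series.agg can be compared on fixed numbers. The Series below is made up for the illustration:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], name="A")

total = s.agg("sum")               # single function -> scalar
summary = s.agg(["sum", "mean"])   # list of functions -> Series indexed by name
```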

Passing a lambda function yields a row named <lambda>:

In [162]: tsdf.A.agg(['sum', lambda x: x.mean()])
Out[162]:
sum         3.033606
<lambda>    0.505601
Name: A, dtype: float64

Passing a named function yields a row with that function's name:

In [163]: def mymean(x):
   .....:     return x.mean()
   .....:

In [164]: tsdf.A.agg(['sum', mymean])
Out[164]:
sum        3.033606
mymean     0.505601
Name: A, dtype: float64

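Aggregating with a dict

Passing a dictionary that maps column names to a scalar or a list of scalars to DataFrame.agg lets you choose which functions are applied to which columns.

Note: the results are not returned in any particular order. To guarantee that the output order matches the input order, use an OrderedDict.

In [165]: tsdf.agg({'A': 'mean', 'B': 'sum'})
Out[165]:
A    0.505601
B   -1.803879
dtype: float64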

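Passing a list-like generates a DataFrame output. You get a matrix-like result of all the aggregators: the output consists of all unique functions, and any function not requested for a particular column is shown as NaN:

In [166]: tsdf.agg({'A': ['mean', 'min'], 'B': 'sum'})
Out[166]: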
             A         B
mean   0.505601       NaN
min   -0.749892       NaN
sum        NaN -1.803879
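The dict forms can be reproduced with fixed numbers. The sketch below uses a made-up DataFrame and checks values by label rather than by position, since the row order of the result is not something the text guarantees.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

# One function per column -> Series indexed by column name.
per_column = df.agg({"A": "mean", "B": "sum"})

# A list for at least one column -> DataFrame; combinations that were
# not requested (for example "sum" of A) are filled with NaN.
mixed = df.agg({"A": ["mean", "min"], "B": "sum"})
```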

Mixed dtypes

When a DataFrame contains mixed dtypes, some of which cannot be aggregated, .agg only takes the aggregations that are valid. This is similar to how groupby .agg works:

In [167]: mdf = pd.DataFrame({'A': [1, 2, 3],
   .....:                     'B': [1., 2., 3.],
   .....:                     'C': ['foo', 'bar', 'baz'],
   .....:                     'D': pd.date_range('20130101', periods=3)})
   .....:

In [168]: mdf.dtypes
Out[168]:
A             int64
B           float64
C            object
D    datetime64[ns]
dtype: object

In [169]: mdf.agg(['min', 'sum'])
Out[169]:
     A    B          C          D
min  1  1.0        bar 2013-01-01
sum  6  6.0  foobarbaz        NaT

Custom describe

With .agg() it is easy to create a custom describe function, similar to the built-in describe function.

In [170]: from functools import partial

In [171]: q_25 = partial(pd.Series.quantile, q=0.25)

In [172]: q_25.__name__ = '25%'

In [173]: q_75 = partial(pd.Series.quantile, q=0.75)

In [174]: q_75.__name__ = '75%'

In [175]: tsdf.agg(['count', 'mean', 'std', 'min', q_25, 'median', q_75, 'max'])
Out[175]:
               A         B         C
count    6.000000   6.000000   6.000000
mean     0.505601 -0.300647   0.262585
std      1.103362   0.887508   0.606860
min     -0.749892 -1.333363 -0.757304
25%     -0.239885 -0.979600   0.128907
median   0.303398 -0.278111   0.225365
75%      1.146791   0.151678   0.722709
max      2.169758   1.004194   0.896839

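Transform API

New in version 0.20.0.

The transform() method returns an object that is indexed the same as the original and has the same size. This API lets you provide multiple operations at once rather than one by one, and it closely mirrors the .agg API.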
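A minimal sketch of the "same index, same size" contract on a made-up DataFrame. The second part assumes that a reducing function such as "sum" is rejected with a ValueError, which is how current pandas releases behave.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=["x", "y", "z"])

# A transforming function keeps the index and length of its input.
centered = df.transform(lambda col: col - col.mean())

# A reducing function collapses each column to one value,
# so transform() refuses it instead of returning a shorter result.
try:
    df.transform("sum")
    rejected = False
except ValueError:
    rejected = True
```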

First, create a DataFrame:

In [176]: tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=10))
   .....:

In [177]: tsdf.iloc[3:7] = np.nan

In [178]: tsdf
Out[178]:
                   A         B         C
2000-01-01 -0.428759 -0.864890 -0.675341
2000-01-02 -0.168731  1.338144 -1.279321
2000-01-03 -1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374 -1.240447 -0.201052
2000-01-09 -0.157795  0.791197 -1.144209
2000-01-10 -0.030876  0.371900  0.061932

Here the entire DataFrame is transformed. .transform() accepts NumPy functions, string function names, and user-defined functions.

In [179]: tsdf.transform(np.abs)
Out[179]:
                   A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

In [180]: tsdf.transform('abs')
Out[180]:
                   A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

In [181]: tsdf.transform(lambda x: x.abs())
Out[181]:
                   A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

Here transform() received a single function; this is equivalent to applying the ufunc directly:

In [182]: np.abs(tsdf)
Out[182]:
                   A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

Passing a single function to .transform() on a Series yields a single Series in return.

In [183]: tsdf.A.transform(np.abs)
Out[183]:
2000-01-01    0.428759
2000-01-02    0.168731
2000-01-03    1.621034
2000-01-04         NaN
2000-01-05         NaN
2000-01-06         NaN
2000-01-07         NaN
2000-01-08    0.254374
2000-01-09    0.157795
2000-01-10    0.030876
Freq: D, Name: A, dtype: float64

Transform with multiple functions

Passing multiple functions to transform() yields a DataFrame with MultiIndex columns. The first level holds the original column names; the second level holds the names of the transforming functions.

In [184]: tsdf.transform([np.abs, lambda x: x + 1])
Out[184]:
                   A                   B                   C
            absolute  <lambda>  absolute  <lambda>  absolute  <lambda>
2000-01-01  0.428759  0.571241  0.864890  0.135110  0.675341  0.324659
2000-01-02  0.168731  0.831269  1.338144  2.338144  1.279321 -0.279321
2000-01-03  1.621034 -0.621034  0.438107  1.438107  0.903794  1.903794
2000-01-04       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-08  0.254374  1.254374  1.240447 -0.240447  0.201052  0.798948
2000-01-09  0.157795  0.842205  0.791197  1.791197  1.144209 -0.144209
2000-01-10  0.030876  0.969124  0.371900  1.371900  0.061932  1.061932
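To inspect the two column levels directly, the sketch below uses two hypothetical named functions, magnitude and plus_one, on a made-up DataFrame. Named functions are used here so that the second level carries readable labels.

```python
import pandas as pd

# Hypothetical transforming functions; both work on a whole column
# as well as on a single value.
def magnitude(values):
    return abs(values)

def plus_one(values):
    return values + 1

df = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})

out = df.transform([magnitude, plus_one])

# Level 0: original column names. Level 1: function names.
column_pairs = list(out.columns)
```

A single transformed column can then be selected with a tuple, for example out[("A", "plus_one")].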

Passing multiple functions to a Series yields a DataFrame whose column names are the names of the transforming functions.

In [185]: tsdf.A.transform([np.abs, lambda x: x + 1])
Out[185]:
            absolute  <lambda>
2000-01-01  0.428759  0.571241
2000-01-02  0.168731  0.831269
2000-01-03  1.621034 -0.621034
2000-01-04       NaN       NaN
2000-01-05       NaN       NaN
2000-01-06       NaN       NaN
2000-01-07       NaN       NaN
2000-01-08  0.254374  1.254374
2000-01-09  0.157795  0.842205
2000-01-10  0.030876  0.969124

Transforming with a dict

Passing a dict of functions lets you choose a transformation for each column.

In [186]: tsdf.transform({'A': np.abs, 'B': lambda x: x + 1})
Out[186]:
                   A         B
2000-01-01  0.428759  0.135110
2000-01-02  0.168731  2.338144
2000-01-03  1.621034  1.438107
2000-01-04       NaN       NaN
2000-01-05       NaN       NaN
2000-01-06       NaN       NaN
2000-01-07       NaN       NaN
2000-01-08  0.254374 -0.240447
2000-01-09  0.157795  1.791197
2000-01-10  0.030876  1.371900

Passing a dict of lists generates a DataFrame with MultiIndex columns named after the transforming functions.

In [187]: tsdf.transform({'A': np.abs, 'B': [lambda x: x + 1, 'sqrt']})
Out[187]:
                   A         B
            absolute  <lambda>      sqrt
2000-01-01  0.428759  0.135110       NaN
2000-01-02  0.168731  2.338144  1.156782
2000-01-03  1.621034  1.438107  0.661897
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374 -0.240447       NaN
2000-01-09  0.157795  1.791197  0.889493
2000-01-10  0.030876  1.371900  0.609836

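Applying elementwise functions

Not all functions can be vectorized, that is, accept NumPy arrays and return another array or value. The applymap() method on DataFrame and, analogously, map() on Series accept any Python function that takes a single value and returns a single value.

For example:

In [188]: df4
Out[188]: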
        one       two     three
a   1.394981   1.772517       NaN
b   0.343054   1.912123 -0.050390
c   0.695246   1.478369   1.227435
d       NaN   0.279344 -0.613172

In [189]: def f(x):
   .....:     return len(str(x))
   .....:

In [190]: df4['one'].map(f)
Out[190]:
a     18
b     19
c     18
d      3
Name: one, dtype: int64

In [191]: df4.applymap(f)
Out[191]:
   one  two  three
a   18   17      3
b   19   18     20
c   18   18     16
d    3   19     19
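As a further sketch, the hypothetical function grade below branches on a single scalar, so it cannot be called on a whole column directly. Newer pandas releases expose the DataFrame elementwise method under the name DataFrame.map and deprecate applymap, so the example picks whichever one the installed version provides.

```python
import pandas as pd

# Hypothetical scalar function: plain Python branching on one value.
def grade(score):
    return "pass" if score >= 60 else "fail"

scores = pd.DataFrame({"math": [72, 55], "art": [40, 91]})

# Series.map applies the function to each element of one column.
one_column = scores["math"].map(grade)

# DataFrame.applymap is the method described in this document; use
# DataFrame.map instead when the installed pandas provides it.
elementwise = scores.map if hasattr(scores, "map") else scores.applymap
all_cells = elementwise(grade)
```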

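Series.map() has an additional feature: it can be used to "link" or "map" values defined by a secondary Series. This is closely related to the merging/joining functionality:

In [192]: s = pd.Series(['six', 'seven', 'six', 'seven', 'six'],
   .....:               index=['a', 'b', 'c', 'd', 'e'])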
   .....:

In [193]: t = pd.Series({'six': 6., 'seven': 7.})

In [194]: s
Out[194]:
a      six
b    seven
c      six
d    seven
e      six
dtype: object

In [195]: s.map(t)
Out[195]:
a     6.0
b     7.0
c     6.0
d     7.0
e     6.0
dtype: float64
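One detail the example above does not show is what happens to a value that has no entry in the lookup. The sketch below adds a made-up value "eight" that is missing from the lookup; such values map to NaN. A plain dict behaves the same way as a lookup Series.

```python
import pandas as pd

s = pd.Series(["six", "seven", "eight"], index=["a", "b", "c"])
lookup = pd.Series({"six": 6.0, "seven": 7.0})

# Each value in s is looked up in the index of `lookup`;
# "eight" has no match and becomes NaN.
mapped = s.map(lookup)

# The same mapping expressed with a plain dict.
mapped_from_dict = s.map({"six": 6.0, "seven": 7.0})
```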