Table of Contents
- Pandas
- Data Structure
- Series
- DataFrame
- Index Objects
- Essential Functionality
- Reindex
- Dropping Entries from an Axis
- Indexing, Selection, and Filtering
- Selection with loc and iloc
- Integer Indexing
- Arithmetic and Data Alignment
Pandas
Data Structure
Series
A Series is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy types) and an associated array of data labels, called its index.
import pandas as pd
obj1 = pd.Series([4,7,-5,3])
obj2 = pd.Series([4,7,-5,3], index=['d','b','a','c'])
# access a single value or a set of values in a Series by index label
obj2['d'] # 4
obj2[['c','a','b']]
A Series can also be thought of as an ordered dict.
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
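Because a Series maps index labels to values, it supports several dict-like operations. The following is a minimal sketch (the variable names has_ohio, has_value, and round_trip are illustrative) that tests label membership with in and converts the Series back to a plain dict.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)

# Membership tests look at the index labels, not at the values.
has_ohio = 'Ohio' in obj3      # 'Ohio' is a label
has_value = 35000 in obj3      # 35000 is a value, not a label

# Round trip back to a plain Python dict.
round_trip = obj3.to_dict()
```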
The index can also be set explicitly by passing a list of labels. Setting the index in the constructor:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index = states)
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
Changing the index of an existing Series by assigning to its index attribute:
In [41]: obj
Out[41]:
0 4
1 7
2 -5
3 3
dtype: int64
In [42]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
In [43]: obj
Out[43]:
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
The isnull method checks whether each element of a Series is missing (NaN, "Not a Number").
pd.isnull(obj4)
# equivalent to obj4.isnull()
# Output
California True
Ohio False
Oregon False
Texas False
dtype: bool
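A common use of the boolean result is to count or filter out the missing entries. The sketch below assumes the same sdata and states as above and uses notnull, the complement of isnull; the names missing, present, and n_missing are illustrative.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

missing = obj4.isnull()          # boolean Series, True where the value is NaN
present = obj4[obj4.notnull()]   # keep only the labels that have a value
n_missing = int(missing.sum())   # each True counts as 1
```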
A useful feature: when two Series are combined in an arithmetic operation, they are automatically aligned by index label.
In [35]: obj3
Out[35]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
In [36]: obj4
Out[36]:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
In [37]: obj3 + obj4
Out[37]:
California NaN
Ohio 70000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64
Both the Series itself and its index have a name attribute.
In [38]: obj4.name = 'population'
In [39]: obj4.index.name = 'state'
In [40]: obj4
Out[40]:
state
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index.
The simplest way to build a DataFrame is from a dict of equal-length lists:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
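The resulting DataFrame is given a default integer index automatically, and the columns follow the order of the dict keys (older pandas versions sorted them alphabetically).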
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2
If you specify the column order at construction time, the columns are arranged in that order.
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop'])
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
5 2003 Nevada 3.2
Using dict-like access syntax (or attribute access), a column can be extracted from a DataFrame as a Series.
frame2['pop']
0 1.5
1 1.7
2 3.6
3 2.4
4 2.9
5 3.2
Name: pop, dtype: float64
frame2.year
0 2000
1 2001
2 2002
3 2001
4 2002
5 2003
Name: year, dtype: int64
Columns can be added or modified by assignment. Assigning to a column that doesn't exist creates a new column. When a list or array is assigned to a column, its length must match the length of the DataFrame.
frame2['debt'] = 6.5
year state pop debt
0 2000 Ohio 1.5 6.5
1 2001 Ohio 1.7 6.5
2 2002 Ohio 3.6 6.5
3 2001 Nevada 2.4 6.5
4 2002 Nevada 2.9 6.5
5 2003 Nevada 3.2 6.5
import numpy as np
frame2['debt'] = np.arange(6.)
year state pop debt
0 2000 Ohio 1.5 0.0
1 2001 Ohio 1.7 1.0
2 2002 Ohio 3.6 2.0
3 2001 Nevada 2.4 3.0
4 2002 Nevada 2.9 4.0
5 2003 Nevada 3.2 5.0
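The length requirement can be checked directly. The sketch below uses a small hypothetical frame (not the frame2 above) and shows that assigning a sequence of the wrong length raises a ValueError and leaves the DataFrame unchanged.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002],
                      'pop': [1.5, 1.7, 3.6]})

frame['debt'] = np.arange(3.)    # length 3 matches the 3 rows

try:
    frame['bad'] = [1, 2]        # length 2 does not match the 3 rows
    error = None
except ValueError as exc:
    error = exc                  # pandas refuses the assignment
```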
When a Series is assigned, its labels are realigned to the DataFrame's index; rows with no matching label get NaN.
frame2.index = ['zero', 'one', 'two', 'three', 'four', 'five']
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
year state pop debt
zero 2000 Ohio 1.5 NaN
one 2001 Ohio 1.7 NaN
two 2002 Ohio 3.6 -1.2
three 2001 Nevada 2.4 NaN
four 2002 Nevada 2.9 -1.5
five 2003 Nevada 3.2 -1.7
The del keyword can be used to delete a column.
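As a minimal sketch, the example below adds a helper column to a small hypothetical frame and then removes it with del; the column name eastern is purely illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'pop': [1.5, 2.4]})

# Add a boolean column, using bracket syntax to create it.
frame['eastern'] = frame['state'] == 'Ohio'

# Remove the column again; del modifies the DataFrame in place.
del frame['eastern']
remaining = list(frame.columns)
```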
When a nested dict is passed to DataFrame, pandas uses the outer keys as the columns and the inner keys as the row index.
pop = {'Nevada': {2001: 2.4, 2002: 2.9},'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
df3 = pd.DataFrame(pop)
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
A DataFrame can be transposed with the same syntax as a NumPy array.
df3.T
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6
The values attribute of a DataFrame returns its data as a two-dimensional NumPy array.
df3.values
array([[nan, 1.5],
[2.4, 1.7],
[2.9, 3.6]])
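Since a NumPy array has a single dtype, the dtype of the returned array depends on the columns. The sketch below, using two small hypothetical frames, shows that an all-float frame yields a float64 array while a frame mixing strings and floats yields an object array.

```python
import pandas as pd

numeric = pd.DataFrame({'Nevada': [2.4, 2.9], 'Ohio': [1.7, 3.6]})
mixed = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

numeric_dtype = numeric.values.dtype   # every column is float
mixed_dtype = mixed.values.dtype       # strings and floats must share one array
shape = numeric.values.shape           # (number of rows, number of columns)
```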
Possible data inputs to DataFrame constructor
Type | Notes |
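2D ndarray | Row and column labels are optional arguments |
dict of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame; all sequences must be the same length |
NumPy structured/record array | Treated the same as the "dict of arrays" case |
dict of Series | Each Series becomes a column; if no index is passed explicitly, the indexes of the Series are unioned to form the row index |
dict of dicts | Each inner dict becomes a column; the inner keys are unioned to form the row index, and the outer keys form the column index |
List of dicts or Series | Each item becomes a row in the DataFrame; the union of the dict keys or Series indexes becomes the column labels |
List of lists or tuples | Treated the same as the "2D ndarray" case |
Another DataFrame | The source DataFrame's indexes are used unless different ones are passed |
A few examples: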
# dict of Series
s1 = pd.Series([1,2,3], index=['one','two','three'])
s2 = pd.Series([4,5,6])
s3 = pd.Series([7,8,9])
dic = {'s1':s1,'s2':s2,'s3':s3}
df = pd.DataFrame(dic)
df
s1 s2 s3
one 1.0 NaN NaN
two 2.0 NaN NaN
three 3.0 NaN NaN
0 NaN 4.0 7.0
1 NaN 5.0 8.0
2 NaN 6.0 9.0
# dict of dicts
d1 = {'one':1,'two':2}
d2 = {'three':3,'four':4}
dic = {'A':d1, 'B':d2}
df = pd.DataFrame(dic)
df
A B
four NaN 4.0
one 1.0 NaN
three NaN 3.0
two 2.0 NaN
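The "list of dicts" row of the table can be illustrated the same way. In this sketch (the records list is made up for illustration), each dict becomes one row and the union of the keys becomes the column labels, with NaN where a dict lacks a key.

```python
import pandas as pd

records = [{'a': 1, 'b': 2},
           {'b': 3, 'c': 4}]
df = pd.DataFrame(records)

columns = sorted(df.columns)            # union of the dict keys
n_rows = len(df)                        # one row per dict
missing_a = pd.isnull(df.loc[1, 'a'])   # the second dict had no 'a' key
```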
Index Objects
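pandas's Index objects are responsible for holding the axis labels and other metadata.
Index objects are immutable, so an Index built with the pd.Index constructor can be shared safely among different Series or DataFrame objects.
Besides behaving like an array, an Index also behaves like a fixed-size set. Unlike a Python set, it can contain duplicate labels.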
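These properties can be sketched in a few lines. The example below (with illustrative names labels, dup, s1, s2) shows that item assignment on an Index raises a TypeError, that one Index can back several Series, and that duplicate labels are allowed.

```python
import pandas as pd

labels = pd.Index(['a', 'b', 'c'])

try:
    labels[1] = 'd'              # Index objects do not support item assignment
    error = None
except TypeError as exc:
    error = exc

# The same Index object can safely back several Series.
s1 = pd.Series([1, 2, 3], index=labels)
s2 = pd.Series([4, 5, 6], index=labels)

has_b = 'b' in labels                     # set-like membership test
dup = pd.Index(['foo', 'foo', 'bar'])     # duplicate labels are allowed
dup_is_unique = dup.is_unique
```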
Essential Functionality
Reindex
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a','b','c','d','e'])
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
The optional method argument of reindex controls how values are filled in for newly introduced labels; ffill means forward fill.
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj4 = obj3.reindex(range(6), method = 'ffill')
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
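To see what method='ffill' changes, it helps to compare it with a plain reindex. In this sketch (plain and filled are illustrative names), the plain call leaves the new labels as missing, while the forward fill copies the previous valid value.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

plain = obj3.reindex(range(6))                    # new labels 1, 3, 5 are missing
filled = obj3.reindex(range(6), method='ffill')   # new labels copy the previous value

n_missing_plain = int(plain.isnull().sum())
filled_values = list(filled)
```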
When applied to a DataFrame, reindex rearranges the row index by default if only a sequence is passed.
frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
To reindex the columns, pass the labels explicitly via the columns keyword.
states = ['Texas', 'Utah', 'California']
frame2.reindex(columns=states)
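The effect of reindexing the columns can be checked with a short sketch. It rebuilds the 3x3 frame from above and assumes only the columns keyword of reindex: columns not in the list are dropped, and a new label such as 'Utah' becomes an all-NaN column.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

states = ['Texas', 'Utah', 'California']
result = frame.reindex(columns=states)

texas = list(result['Texas'])                     # existing column is kept
utah_all_missing = bool(result['Utah'].isnull().all())   # new column has no data
```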
Dropping Entries from an Axis
The drop method works on both Series and DataFrame; the Series case is straightforward.
With DataFrame, index values can be deleted from either axis.
Calling drop with a sequence of labels will drop values from the row labels (axis 0).
data = pd.DataFrame(np.arange(16).reshape((4, 4)),\
index=['Ohio', 'Colorado', 'Utah', 'New York'],\
columns=['one', 'two', 'three', 'four'])
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data.drop(['Colorado', 'Ohio'])
one two three four
Utah 8 9 10 11
New York 12 13 14 15
Pass the axis argument explicitly to choose which axis to drop values from.
data.drop('two', axis=1)
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
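Many methods, including drop, return a new object by default; passing inplace=True makes them modify the original object instead of returning a new one.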
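The difference between the two modes can be sketched on a small Series (the variable names are illustrative). Note that with inplace=True the dropped data is gone from the original object.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

new_obj = obj.drop('c')                # default: returns a new Series
len_after_plain_drop = len(obj)        # the original is untouched

result = obj.drop('c', inplace=True)   # modifies obj itself and returns None
len_after_inplace = len(obj)
```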
Indexing, Selection, and Filtering
Series
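A Series can be thought of as an ordered dict. Series indexing works like NumPy array indexing, except that you can index with the Series's index labels and not only with integers.
Slicing with labels differs from ordinary Python slicing in that the endpoint is included.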
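The endpoint rule is easy to verify. The sketch below compares a label slice with a position-based slice on the same Series; iloc is used for the positional case to keep the intent explicit.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj['b':'c']        # label slice: both 'b' and 'c' are included
by_position = obj.iloc[1:2]    # positional slice: position 2 is excluded

label_values = list(by_label)
position_values = list(by_position)
```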
DataFrame
When selecting from a DataFrame with a slice or a boolean array, rows are selected; when indexing with a single label (column labels only) or a list of labels, columns are selected.
Selecting with a single row label raises a KeyError.
data = pd.DataFrame(np.arange(16).reshape((4, 4)),index=['Ohio', 'Colorado', 'Utah', 'New York'],columns=['one', 'two', 'three', 'four'])
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
#integer slicing
data[:2]
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
#label slicing
data['Ohio':'Utah']
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
# select with boolean array
data['three']>5
Ohio False
Colorado True
Utah True
New York True
Name: three, dtype: bool
data[data['three']>5]
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
# select with single label
data['two']
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
# select with list of labels
data[['three', 'one']]
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
As with a two-dimensional array, a DataFrame can also be indexed with a boolean DataFrame.
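A minimal sketch of boolean-DataFrame indexing, using the same data frame as above: the comparison produces a boolean DataFrame of the same shape, which is then used to overwrite the matching elements.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

mask = data < 5          # boolean DataFrame with the same shape as data
data[mask] = 0           # set every element below 5 to zero

first_row = list(data.loc['Ohio'])
second_row = list(data.loc['Colorado'])
```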
Selection with loc and iloc
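As noted above, plain label indexing on a DataFrame only accepts column labels. To select rows by label use loc, and to select by integer position use iloc.
loc and iloc let you select a subset of the rows and columns of a DataFrame using NumPy-like notation.
First, recall how indexing works in NumPy: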
arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
arr[[1,2]]
array([[3, 4, 5],
[6, 7, 8]])
arr[ :2, [1,2]]
array([[1, 2],
[4, 5]])
The following uses loc with similar notation to index into a DataFrame; values passed to loc are interpreted as labels.
data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data.loc['Colorado']
one 4
two 5
three 6
four 7
Name: Colorado, dtype: int32
data.loc['Colorado',['one','two']]
one 4
two 5
Name: Colorado, dtype: int32
To index with integers, use iloc; here the integers denote positions, not labels.
data.iloc[1]
one 4
two 5
three 6
four 7
Name: Colorado, dtype: int32
data.iloc[1,[0,1]]
one 4
two 5
Name: Colorado, dtype: int32
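Both indexers also accept slices, and they can be combined with boolean selection. The sketch below (variable names are illustrative) selects the same cells once by label and once by position, then filters rows on a condition.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.loc[:'Utah', 'two']    # rows up to and including 'Utah'
by_position = data.iloc[:3, 1]         # the same cells, addressed by position

# First three columns, restricted to rows where column 'three' exceeds 5.
filtered = data.iloc[:, :3][data['three'] > 5]
```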
Integer Indexing
A bug that is easy to introduce:
ser = pd.Series(np.arange(3.))
ser
0 0.0
1 1.0
2 2.0
dtype: float64
# BUG: raises KeyError because -1 is looked up as a label
ser[-1]
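Following the indexing logic of Python's built-in sequences, this should return the last element of the Series, but it actually raises an error.
This is because the index values of this Series are integers: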
ser.index.dtype
dtype('int64')
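When the integer -1 is used as the key, it is interpreted as a label, so pandas looks for the integer -1 in the index object and fails.
If the index values are not integers, this ambiguity does not arise and an integer key falls back to positional indexing (recent pandas releases deprecate this fallback):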
ser = pd.Series(np.arange(3.), index = ['a','b','c'])
ser[-1]
2.0
For consistency, it is best to use loc or iloc to make explicit how you are indexing:
ser = pd.Series(np.arange(3.))
ser.iloc[-1]
2.0
# ser.loc[-1] raises a KeyError
All values passed to loc are interpreted as labels.
Integers passed to iloc are interpreted as positions, not labels.
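The two rules can be contrasted on an integer-indexed Series, where labels and positions look alike. This sketch (with illustrative variable names) shows that iloc accepts -1 as a position, loc rejects it as a missing label, and that the same slice bounds include the endpoint under loc but exclude it under iloc.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))        # default integer index 0, 1, 2

last = ser.iloc[-1]                   # position -1: the last element

try:
    ser.loc[-1]                       # label -1 does not exist in the index
    error = None
except KeyError as exc:
    error = exc

label_slice = list(ser.loc[:1])       # labels 0 and 1: endpoint included
position_slice = list(ser.iloc[:1])   # position 0 only: endpoint excluded
```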
Arithmetic and Data Alignment
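In arithmetic between pandas objects, values with matching index labels are combined normally; where the indexes differ, the result's index is the union of the two, and labels present in only one operand yield NaN.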
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],index=['a', 'c', 'e', 'f', 'g'])
s1 + s2
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), \
index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),\
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1 + df2
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
To supply a custom fill_value for labels that are missing from one operand:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),\
columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),\
columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
df2
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 NaN 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
df1.add(df2, fill_value=0)
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 5.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
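The same fill_value argument is available on Series. As a sketch, the example below reuses the s1 and s2 Series from earlier in this section and compares the + operator with the add method; the names plain and filled are illustrative.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

plain = s1 + s2                      # labels found in only one Series give NaN
filled = s1.add(s2, fill_value=0)    # a missing operand is treated as 0

n_nan_plain = int(plain.isnull().sum())
n_nan_filled = int(filled.isnull().sum())
```

With fill_value=0, a label that exists in only one Series keeps that Series's value, which is why 'd' ends up as 3.4 and 'f' as 4.0 instead of NaN.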