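DataFrame, the "two-dimensional array" of pandas, is a tabular data structure made up of an ordered collection of columns. Each column can hold its own value type (numeric, string, boolean, and so on).

Internally, a DataFrame stores its data in one or more two-dimensional blocks rather than as a list, a dict, or a collection of one-dimensional arrays.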

# Import pandas
import pandas as pd

pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
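  • Parameters:
  • index: row labels. If omitted, pandas creates a default integer index from 0 to N-1, where N is the number of rows.
  • When index is passed as a list to relabel the rows, its length must match the number of rows in the data.
  • columns: column labels. If omitted and the data supplies no labels of its own (dict keys, for example), pandas creates default integer labels from 0 to N-1, where N is the number of columns.
  • For dict data, columns can also be passed as a list to reorder the columns. A name that does not exist in the data (such as 'd') produces a column filled with NaN.
  • The list passed to columns may also name fewer columns than the original data, in which case only the listed columns are kept.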
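
As a rough illustration of these defaults, the sketch below builds a frame from a small nested list and checks which labels pandas assigns when index and columns are omitted. The sample values and the names rows, df_default, df_labeled, and df_float are made up for this example.

```python
import pandas as pd

# Hypothetical sample data: 3 rows, 2 columns.
rows = [[1, 2], [3, 4], [5, 6]]

# No index/columns passed: both default to integer labels 0..N-1.
df_default = pd.DataFrame(rows)
print(df_default.index.tolist())    # [0, 1, 2]
print(df_default.columns.tolist())  # [0, 1]

# Explicit labels must match the shape of the data (3 rows, 2 columns).
df_labeled = pd.DataFrame(rows, index=["r1", "r2", "r3"], columns=["x", "y"])
print(df_labeled)

# dtype forces one type onto every column.
df_float = pd.DataFrame(rows, dtype="float64")
print(df_float.dtypes)
```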

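A DataFrame is an object similar to a two-dimensional array or a table (such as an Excel sheet), with both a row index and a column index:

  • The row index identifies the rows, one label per horizontal row. It is called index and corresponds to axis 0 (axis=0).
  • The column index identifies the columns, one label per vertical column. It is called columns and corresponds to axis 1 (axis=1).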
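
To make the two axes concrete, here is a minimal sketch using sum(). The frame and the names col_totals and row_totals are invented for illustration. Passing axis=0 collapses the rows and leaves one value per column, while axis=1 collapses the columns and leaves one value per row.

```python
import pandas as pd

# Hypothetical frame with labeled rows and columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["x", "y", "z"])

# axis=0 works down the rows: one result per column.
col_totals = df.sum(axis=0)
print(col_totals)  # a -> 6, b -> 60

# axis=1 works across the columns: one result per row.
row_totals = df.sum(axis=1)
print(row_totals)  # x -> 11, y -> 22, z -> 33
```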

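1. Dict of arrays/lists

Construction method: pandas.DataFrame()

  • When a DataFrame is built from a dict of arrays/lists, the dict keys become the columns and the index defaults to integer labels.
  • All values in the dict must have the same length!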
import numpy as np
import pandas as pd

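# DataFrame creation method 1: a dict of arrays/lists
# Construction method: pandas.DataFrame()


# Built from a dict of arrays/lists: the dict keys become the columns and the index defaults to integer labels
# All values in the dict must have the same length!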
data1 = {'a': [1, 2, 3],
         'b': [3, 4, 5],
         'c': [5, 6, 7]}

print("data1 = {0}----type(data1) = {1}".format(data1, type(data1)))
print("-" * 50)
df1 = pd.DataFrame(data1)
print("df1 = \n{0} \ntype(df1) = {1}".format(df1, type(df1)))
print("-" * 50)

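# columns parameter: a list that can reorder the columns; a name missing from the data (such as 'd') yields a NaN column
# The list passed to columns may name fewer columns than the original data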
df1 = pd.DataFrame(data1, columns=['b', 'c', 'a', 'd'])
print("df1 = \n{0} \ntype(df1) = {1}".format(df1, type(df1)))
print("-" * 50)
df1 = pd.DataFrame(data1, columns=['b', 'c'])
print("df1 = \n{0} \ntype(df1) = {1}".format(df1, type(df1)))
print("-" * 200)

data2 = {'one': np.random.rand(3),
         'two': np.random.rand(3)}  # With 'two': np.random.rand(4) this raises a ValueError, e.g. "arrays must all be same length" (wording varies by pandas version)

print("data2 = {0}----type(data2) = {1}".format(data2, type(data2)))
print("-" * 50)
df2 = pd.DataFrame(data2)
print("df2 = \n{0} \ntype(df2) = {1}".format(df2, type(df2)))
print("-" * 50)

# index parameter: relabels the rows; pass a list whose length matches the number of rows
df2 = pd.DataFrame(data2, index=['f1', 'f2', 'f3'])  # With index=['f1', 'f2', 'f3', 'f4'] this raises a ValueError, e.g. "Shape of passed values is (3, 2), indices imply (4, 2)" (wording varies by pandas version)
print("df2 = \n{0} \ntype(df2) = {1}".format(df2, type(df2)))

Output:

data1 = {'a': [1, 2, 3], 'b': [3, 4, 5], 'c': [5, 6, 7]}----type(data1) = <class 'dict'>
--------------------------------------------------
df1 = 
   a  b  c
0  1  3  5
1  2  4  6
2  3  5  7 
type(df1) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
df1 = 
   b  c  a    d
0  3  5  1  NaN
1  4  6  2  NaN
2  5  7  3  NaN 
type(df1) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
df1 = 
   b  c
0  3  5
1  4  6
2  5  7 
type(df1) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
data2 = {'one': array([0.74944895, 0.2239778 , 0.06525004]), 'two': array([0.80358123, 0.04858883, 0.05254801])}----type(data2) = <class 'dict'>
--------------------------------------------------
df2 = 
        one       two
0  0.749449  0.803581
1  0.223978  0.048589
2  0.065250  0.052548 
type(df2) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
df2 = 
         one       two
f1  0.749449  0.803581
f2  0.223978  0.048589
f3  0.065250  0.052548 
type(df2) = <class 'pandas.core.frame.DataFrame'>
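
The two failure cases mentioned in the code comments above can be checked without crashing the script. The sketch below is one way to do it; try_build is a hypothetical helper written for this example, not a pandas function, and the exact error text is printed rather than assumed because it differs between pandas versions.

```python
import numpy as np
import pandas as pd


def try_build(data, **kwargs):
    """Hypothetical helper: return the ValueError message, or None on success."""
    try:
        pd.DataFrame(data, **kwargs)
    except ValueError as exc:
        return str(exc)
    return None


# Arrays of different lengths in the same dict cannot form a table.
length_error = try_build({"one": np.arange(3), "two": np.arange(4)})
print("unequal lengths:", length_error)

# Four index labels for three rows of data are rejected as well.
index_error = try_build({"one": np.arange(3), "two": np.arange(3)},
                        index=["f1", "f2", "f3", "f4"])
print("index too long:", index_error)

# A matching index is accepted, so the helper returns None.
ok = try_build({"one": np.arange(3), "two": np.arange(3)},
               index=["f1", "f2", "f3"])
print("matching index:", ok)
```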

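2. Dict of Series

DataFrame creation method 2: a dict of Series

  • When a DataFrame is built from a dict of Series, the dict keys become the columns and the index is the union of the Series labels (default integer labels if the Series were created without an index).
  • The Series may differ in length; positions with no matching label are filled with NaN in the resulting DataFrame.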
import numpy as np
import pandas as pd

# DataFrame creation method 2: a dict of Series

data1 = {'one': pd.Series(np.random.rand(2)),
         'two': pd.Series(np.random.rand(3))}  # Series without an explicit index

print("data1 = \n{0}\ntype(data1) = {1}".format(data1, type(data1)))
print("-" * 50)
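# Built from a dict of Series: the dict keys become the columns and the index comes from the Series labels (default integer labels if none were given)
# The Series may differ in length; the resulting DataFrame then contains NaN values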
df1 = pd.DataFrame(data1)
print("df1 = \n{0} \ntype(df1) = {1}".format(df1, type(df1)))
print("-" * 200)

data2 = {'one': pd.Series(np.random.rand(2), index=['a', 'b']),
         'two': pd.Series(np.random.rand(3), index=['a', 'b', 'c'])}  # Series with an explicit index

print("data2 = \n{0}\ntype(data2) = {1}".format(data2, type(data2)))
print("-" * 50)
df2 = pd.DataFrame(data2)
print("df2 = \n{0} \ntype(df2) = {1}".format(df2, type(df2)))

Output:

data1 = 
{'one': 0    0.769247
1    0.982215
dtype: float64, 'two': 0    0.191847
1    0.473236
2    0.172925
dtype: float64}
type(data1) = <class 'dict'>
--------------------------------------------------
df1 = 
        one       two
0  0.769247  0.191847
1  0.982215  0.473236
2       NaN  0.172925 
type(df1) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
data2 = 
{'one': a    0.835023
b    0.285224
dtype: float64, 'two': a    0.436375
b    0.149480
c    0.364252
dtype: float64}
type(data2) = <class 'dict'>
--------------------------------------------------
df2 = 
        one       two
a  0.835023  0.436375
b  0.285224  0.149480
c       NaN  0.364252 
type(df2) = <class 'pandas.core.frame.DataFrame'>
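
One detail the output above does not make obvious is that Series are aligned by label, not by position. The sketch below uses fixed values instead of random ones so the alignment can be followed by eye; the names s1, s2, aligned, and selected are made up for this example. It also shows that, for a dict of Series, the index parameter selects labels rather than renaming rows.

```python
import pandas as pd

# Hypothetical Series whose labels are deliberately in different orders.
s1 = pd.Series([1.0, 2.0], index=["b", "a"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])

# Values are matched by label: row 'a' gets 2.0 from s1 and 10.0 from s2.
aligned = pd.DataFrame({"one": s1, "two": s2})
print(aligned)

# With Series values, index= picks labels; 'c' has no entry in s1, so it is NaN.
selected = pd.DataFrame({"one": s1, "two": s2}, index=["a", "c"])
print(selected)
```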

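3. Directly from a two-dimensional array

DataFrame creation method 3: directly from a two-dimensional array

  • Building a DataFrame directly from a 2D array gives a result with the same shape as the array. If index and columns are not specified, both default to integer labels.
  • The lengths of index and columns must match the corresponding dimensions of the array.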
import numpy as np
import pandas as pd

# DataFrame creation method 3: directly from a two-dimensional array

ar = np.random.rand(9).reshape(3,3)
print("ar = \n{0}\ntype(ar) = {1}".format(ar, type(ar)))
print("-" * 200)
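# Built directly from a 2D array: the result has the same shape; without index and columns, both default to integer labels
# The lengths of index and columns must match the dimensions of the array
df1 = pd.DataFrame(ar)
df2 = pd.DataFrame(ar, index=['a', 'b', 'c'], columns=['one', 'two', 'three'])  # Try an index or columns whose length differs from the array to see the error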
print("df1 = \n{0} \ntype(df1) = {1}".format(df1, type(df1)))
print("-" * 50)
print("df2 = \n{0} \ntype(df2) = {1}".format(df2, type(df2)))
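
No output is shown for this method, so here is a small deterministic variant that is easier to verify by hand. It assumes np.arange in place of random values, and the names df and mismatch_message are invented for this sketch.

```python
import numpy as np
import pandas as pd

# 3x3 array holding 0..8, so every cell is predictable.
ar = np.arange(9).reshape(3, 3)

df = pd.DataFrame(ar, index=["a", "b", "c"], columns=["one", "two", "three"])
print(df.shape)              # (3, 3)
print(df.loc["b", "three"])  # 5

# Two column labels for three columns of data raise a ValueError.
try:
    pd.DataFrame(ar, columns=["one", "two"])
    mismatch_message = None
except ValueError as exc:
    mismatch_message = str(exc)
print(mismatch_message)
```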

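4. List of dicts

DataFrame creation method 4: a list of dicts

  • When a DataFrame is built from a list of dicts, each dict becomes one row, the dict keys become the columns, and the index defaults to integer labels if not specified.
  • The columns and index parameters set the column labels and row labels respectively.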
import numpy as np
import pandas as pd

# DataFrame creation method 4: a list of dicts

data = [{'one': 1, 'two': 2}, {'one': 5, 'two': 10, 'three': 20}]
print("data = \n{0}\ntype(data) = {1}".format(data, type(data)))
print("-" * 200)

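# Built from a list of dicts: the dict keys become the columns; without index, the rows get default integer labels
# The columns and index parameters set the column labels and row labels respectively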
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data, index = ['a','b'])
df3 = pd.DataFrame(data, columns = ['one','two'])
print("df1 = \n{0} \ntype(df1) = {1}".format(df1, type(df1)))
print("-" * 50)
print("df2 = \n{0} \ntype(df2) = {1}".format(df2, type(df2)))
print("-" * 50)
print("df3 = \n{0} \ntype(df3) = {1}".format(df3, type(df3)))

Output:

data = 
[{'one': 1, 'two': 2}, {'one': 5, 'two': 10, 'three': 20}]
type(data) = <class 'list'>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
df1 = 
   one  two  three
0    1    2    NaN
1    5   10   20.0 
type(df1) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
df2 = 
   one  two  three
a    1    2    NaN
b    5   10   20.0 
type(df2) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
df3 = 
   one  two
0    1    2
1    5   10 
type(df3) = <class 'pandas.core.frame.DataFrame'>
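
The 20.0 in the 'three' column above appears because a missing key becomes NaN, and NaN forces the column to float64. The sketch below shows two possible ways to get integers back, assuming that filling with 0 or using the nullable Int64 dtype suits your data; the names filled and nullable are made up for this example.

```python
import pandas as pd

data = [{'one': 1, 'two': 2}, {'one': 5, 'two': 10, 'three': 20}]
df = pd.DataFrame(data)

# 'three' holds a NaN, so pandas stores the whole column as float64.
print(df.dtypes)

# Option 1: replace the missing value, then cast back to a plain integer type.
filled = df.fillna({'three': 0}).astype({'three': 'int64'})
print(filled)

# Option 2: keep the gap as <NA> by using the nullable integer dtype.
nullable = df.astype({'three': 'Int64'})
print(nullable)
```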

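5. Dict of dicts

DataFrame creation method 5: a dict of dicts

  • When a DataFrame is built from a dict of dicts, the outer keys become the columns and the inner keys become the index.
  • The columns parameter can add or drop columns; a new column is filled with NaN.
  • Unlike with arrays and lists, index here cannot relabel the existing rows. It selects rows by label, and any label not found among the inner keys yields a row of NaN (very important!).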
import numpy as np
import pandas as pd

# DataFrame creation method 5: a dict of dicts

data = {'Jack':{'math':90,'english':89,'art':78},
       'Marry':{'math':82,'english':95,'art':92},
       'Tom':{'math':78,'english':67}}

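# Built from a dict of dicts: the outer keys become the columns and the inner keys become the index
# The columns parameter can add or drop columns; a new column is filled with NaN
# Unlike before, index cannot relabel the existing rows here; labels that do not exist yield NaN (very important!)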
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data, columns = ['Jack','Tom','Bob'])
df3 = pd.DataFrame(data, index = ['a','b','c'])
print("df1 = \n{0} \ntype(df1) = {1}".format(df1, type(df1)))
print("-" * 100)
print("df2 = \n{0} \ntype(df2) = {1}".format(df2, type(df2)))
print("-" * 100)
print("df3 = \n{0} \ntype(df3) = {1}".format(df3, type(df3)))

Output:

df1 = 
         Jack  Marry   Tom
math       90     82  78.0
english    89     95  67.0
art        78     92   NaN 
type(df1) = <class 'pandas.core.frame.DataFrame'>
----------------------------------------------------------------------------------------------------
df2 = 
         Jack   Tom  Bob
math       90  78.0  NaN
english    89  67.0  NaN
art        78   NaN  NaN 
type(df2) = <class 'pandas.core.frame.DataFrame'>
----------------------------------------------------------------------------------------------------
df3 = 
   Jack  Marry  Tom
a   NaN    NaN  NaN
b   NaN    NaN  NaN
c   NaN    NaN  NaN 
type(df3) = <class 'pandas.core.frame.DataFrame'>
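
Since index cannot rename rows in this case, the relabeling has to happen after construction. The sketch below shows one possible approach with rename(), plus two related calls: index used as a selector, and DataFrame.from_dict with orient='index' to get one row per outer key. The names subset, relabeled, and by_student are invented for this example.

```python
import pandas as pd

data = {'Jack': {'math': 90, 'english': 89, 'art': 78},
        'Marry': {'math': 82, 'english': 95, 'art': 92},
        'Tom': {'math': 78, 'english': 67}}

# index= selects among the inner keys that already exist.
subset = pd.DataFrame(data, index=['math', 'art'])
print(subset)

# To rename the rows, build the frame first and relabel it afterwards.
relabeled = pd.DataFrame(data).rename(index={'math': 'a', 'english': 'b', 'art': 'c'})
print(relabeled)

# orient='index' turns the outer keys into rows instead of columns.
by_student = pd.DataFrame.from_dict(data, orient='index')
print(by_student)
```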
