文章目录

  • 🥇1.总览数据
  • 🥈2.筛选数据
  • 🥉3.替换数据
  • 🏅4.更改列名
  • 🥇5.查找唯一值
  • 🥈6.查找缺失值
  • 🥉7.删除列或行
  • 🏅8. groupby分组
  • 🥇9.按照时间段来进行分组
  • 🥈10.遍历一个列的数据
  • 🥉11.对一列的所有元素应用某个函数
  • 🏅12. pandas高级函数
  • 🥇13. 连接多个Dataframe


在上一篇文章中,介绍了如何使用python导入数据,导入数据后的第二步往往就是数据清洗,下面我们来看看如何使用pandas进行数据清洗工作

导入相关库

import pandas as pd

dataframe = pd.read_csv(r'C:/Users/DELL/data-science-learning/python数据分析笔记/探索性数据分析/train.csv')
dataframe.head(5)



PassengerId

Survived

Pclass

Name

Sex

Age

SibSp

Parch

Ticket

Fare

Cabin

Embarked

0

1

0

3

Braund, Mr. Owen Harris

male

22.0

1

0

A/5 21171

7.2500

NaN

S

1

2

1

1

Cumings, Mrs. John Bradley (Florence Briggs Th...

female

38.0

1

0

PC 17599

71.2833

C85

C

2

3

1

3

Heikkinen, Miss. Laina

female

26.0

0

0

STON/O2. 3101282

7.9250

NaN

S

3

4

1

1

Futrelle, Mrs. Jacques Heath (Lily May Peel)

female

35.0

1

0

113803

53.1000

C123

S

4

5

0

3

Allen, Mr. William Henry

male

35.0

0

0

373450

8.0500

NaN

S

🥇1.总览数据

  • 查看数据维度
dataframe.shape
(891, 12)
  • 描述性统计分析
dataframe.describe()



PassengerId

Survived

Pclass

Age

SibSp

Parch

Fare

count

891.000000

891.000000

891.000000

714.000000

891.000000

891.000000

891.000000

mean

446.000000

0.383838

2.308642

29.699118

0.523008

0.381594

32.204208

std

257.353842

0.486592

0.836071

14.526497

1.102743

0.806057

49.693429

min

1.000000

0.000000

1.000000

0.420000

0.000000

0.000000

0.000000

25%

223.500000

0.000000

2.000000

20.125000

0.000000

0.000000

7.910400

50%

446.000000

0.000000

3.000000

28.000000

0.000000

0.000000

14.454200

75%

668.500000

1.000000

3.000000

38.000000

1.000000

0.000000

31.000000

max

891.000000

1.000000

3.000000

80.000000

8.000000

6.000000

512.329200

🥈2.筛选数据

  • 过滤所有女性和年龄大于60岁的乘客
dataframe[(dataframe['Sex'] == 'female') & (dataframe['Age']>=60)]



PassengerId

Survived

Pclass

Name

Sex

Age

SibSp

Parch

Ticket

Fare

Cabin

Embarked

275

276

1

1

Andrews, Miss. Kornelia Theodosia

female

63.0

1

0

13502

77.9583

D7

S

366

367

1

1

Warren, Mrs. Frank Manley (Anna Sophia Atkinson)

female

60.0

1

0

110813

75.2500

D37

C

483

484

1

3

Turkula, Mrs. (Hedwig)

female

63.0

0

0

4134

9.5875

NaN

S

829

830

1

1

Stone, Mrs. George Nelson (Martha Evelyn)

female

62.0

0

0

113572

80.0000

B28

NaN

可以看出,一共有四名年龄大于60岁的女性乘客

🥉3.替换数据

  • female换成woman,将male换成man
dataframe['Sex'].replace(['female','male'],['woman','man']).head(5)
0      man
1    woman
2    woman
3    woman
4      man
Name: Sex, dtype: object

🏅4.更改列名

  • 查看所有列名
dataframe.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
  • 重命名列

PassengerId

Survived

Passenger Class

Name

Sex

Age

SibSp

Parch

Ticket

Fare

Cabin

Embarked

0

1

0

3

Braund, Mr. Owen Harris

male

22.0

1

0

A/5 21171

7.2500

NaN

S

1

2

1

1

Cumings, Mrs. John Bradley (Florence Briggs Th...

female

38.0

1

0

PC 17599

71.2833

C85

C

2

3

1

3

Heikkinen, Miss. Laina

female

26.0

0

0

STON/O2. 3101282

7.9250

NaN

S

3

4

1

1

Futrelle, Mrs. Jacques Heath (Lily May Peel)

female

35.0

1

0

113803

53.1000

C123

S

4

5

0

3

Allen, Mr. William Henry

male

35.0

0

0

373450

8.0500

NaN

S

dataframe.rename(columns={'Pclass':'Passenger Class','Sex':'Gender'}).head()



PassengerId

Survived

Passenger Class

Name

Gender

Age

SibSp

Parch

Ticket

Fare

Cabin

Embarked

0

1

0

3

Braund, Mr. Owen Harris

male

22.0

1

0

A/5 21171

7.2500

NaN

S

1

2

1

1

Cumings, Mrs. John Bradley (Florence Briggs Th...

female

38.0

1

0

PC 17599

71.2833

C85

C

2

3

1

3

Heikkinen, Miss. Laina

female

26.0

0

0

STON/O2. 3101282

7.9250

NaN

S

3

4

1

1

Futrelle, Mrs. Jacques Heath (Lily May Peel)

female

35.0

1

0

113803

53.1000

C123

S

4

5

0

3

Allen, Mr. William Henry

male

35.0

0

0

373450

8.0500

NaN

S

🥇5.查找唯一值

pandas中,我们可以使用unique()查找唯一值

# 查找唯一值
dataframe['Sex'].unique()
array(['male', 'female'], dtype=object)
# 显示唯一值出现的个数
dataframe['Sex'].value_counts()
male      577
female    314
Name: Sex, dtype: int64
# 查找类型票的数量
dataframe['Pclass'].value_counts()
3    491
1    216
2    184
Name: Pclass, dtype: int64
# 查找唯一值的种类
dataframe['Pclass'].nunique()
3

🥈6.查找缺失值

# 查找空数据
dataframe[dataframe['Age'].isnull()].head()



PassengerId

Survived

Pclass

Name

Sex

Age

SibSp

Parch

Ticket

Fare

Cabin

Embarked

5

6

0

3

Moran, Mr. James

male

NaN

0

0

330877

8.4583

NaN

Q

17

18

1

2

Williams, Mr. Charles Eugene

male

NaN

0

0

244373

13.0000

NaN

S

19

20

1

3

Masselmani, Mrs. Fatima

female

NaN

0

0

2649

7.2250

NaN

C

26

27

0

3

Emir, Mr. Farred Chehab

male

NaN

0

0

2631

7.2250

NaN

C

28

29

1

3

O'Dwyer, Miss. Ellen "Nellie"

female

NaN

0

0

330959

7.8792

NaN

Q

pandas没有NaN 如果想要处理的话必须导入numpy

import numpy as np
dataframe['Sex'].replace('male',np.nan).head()
0       NaN
1    female
2    female
3    female
4       NaN
Name: Sex, dtype: object

🥉7.删除列或行

# 删除一列,采用drop方法,并传入参数axis
dataframe.drop('Age',axis=1).head()



PassengerId

Survived

Pclass

Name

Sex

SibSp

Parch

Ticket

Fare

Cabin

Embarked

0

1

0

3

Braund, Mr. Owen Harris

male

1

0

A/5 21171

7.2500

NaN

S

1

2

1

1

Cumings, Mrs. John Bradley (Florence Briggs Th...

female

1

0

PC 17599

71.2833

C85

C

2

3

1

3

Heikkinen, Miss. Laina

female

0

0

STON/O2. 3101282

7.9250

NaN

S

3

4

1

1

Futrelle, Mrs. Jacques Heath (Lily May Peel)

female

1

0

113803

53.1000

C123

S

4

5

0

3

Allen, Mr. William Henry

male

0

0

373450

8.0500

NaN

S

#删除行
dataframe.drop(1)
# 删除重复行 使用subset参数指明要删除的列
dataframe.drop_duplicates(subset='Sex').head()



Name

PClass

Age

Sex

Survived

SexCode

0

Allen, Miss Elisabeth Walton

1st

29.0

female

1

1

2

Allison, Mr Hudson Joshua Creighton

1st

30.0

male

0

0

🏅8. groupby分组

  • 计算男性和女性的平均值

思路一,将所有男性和女性的条件进行选取分别计算

man = dataframe[dataframe['Sex']=='male']
woman = dataframe[dataframe['Sex']=='female']
print(man.mean())
print(woman.mean())
Age         31.014338
Survived     0.166863
SexCode      0.000000
dtype: float64
Age         29.396424
Survived     0.666667
SexCode      1.000000
dtype: float64

思路二,用groupby方法简化

dataframe.groupby('Sex').mean()



Age

Survived

SexCode

Sex

female

29.396424

0.666667

1.0

male

31.014338

0.166863

0.0

# 按行分组,计算行数
dataframe.groupby('Sex')['Name'].count()
Sex
female    462
male      851
Name: Name, dtype: int64
dataframe.groupby(['Sex','Survived']).mean()



PassengerId

Pclass

Age

SibSp

Parch

Fare

Sex

Survived

female

0

434.851852

2.851852

25.046875

1.209877

1.037037

23.024385

1

429.699571

1.918455

28.847716

0.515021

0.515021

51.938573

male

0

449.121795

2.476496

31.618056

0.440171

0.207265

21.960993

1

475.724771

2.018349

27.276022

0.385321

0.357798

40.821484

🥇9.按照时间段来进行分组

  • 使用resample参数来进行取样本
# 创建时期范围
time_index = pd.date_range('06/06/2017', periods=100000, freq='30S') # periods表示有多少数据,freq表示步长
dataframe = pd.DataFrame(index=time_index)
# 创建一个随机变量
dataframe['Sale_Amout'] = np.random.randint(1, 10, 100000)
# resample 参数,按周对行分组,计算每一周的总和
dataframe.resample('W').sum()



Sale_Amout

2017-06-11

86292

2017-06-18

100359

2017-06-25

100907

2017-07-02

100868

2017-07-09

100522

2017-07-16

10478

# 使用resample可以按一组时间间隔来进行分组,然后计算每一个时间组的某个统计量
dataframe.resample('2W').mean()



Sale_Amout

2017-06-11

4.993750

2017-06-25

4.991716

2017-07-09

4.994792

2017-07-23

5.037500

dataframe.resample('M').count()



Sale_Amout

2017-06-30

72000

2017-07-31

28000

# resample默认是以最后一个数据作 使用label参数可以进行调整
dataframe.resample('M', label='left').count()



Sale_Amout

2017-05-31

72000

2017-06-30

28000

🥈10.遍历一个列的数据

dataframe = pd.read_csv(url)
# 以大写的形势打印前两行的名字
for name in dataframe['Name'][0:2]:
    print(name.upper())
ALLEN, MISS ELISABETH WALTON
ALLISON, MISS HELEN LORAINE

🥉11.对一列的所有元素应用某个函数

def uppercase(x):
    return x.upper()
dataframe['Name'].apply(uppercase)[0:2]
0    ALLEN, MISS ELISABETH WALTON
1     ALLISON, MISS HELEN LORAINE
Name: Name, dtype: object

🏅12. pandas高级函数

dataframe.groupby('Sex').apply(lambda x:x.count())



Name

PClass

Age

Sex

Survived

SexCode

Sex

female

462

462

288

462

462

462

male

851

851

468

851

851

851

通过联合使用groupbyapply,我们就能计算自定义的统计量
例如上面我们发现agecabin具有大量的缺失值

🥇13. 连接多个Dataframe

data_a = {'id':['1', '2', '3'],
          'first': ['Alex', 'Amy', 'Allen'],
          'last': ['Anderson', 'Ackerman', 'Ali']}
dataframe_a = pd.DataFrame(data_a, columns=['id','first', 'last'])
data_b = {'id':['4', '5', '6'],
          'first': ['Billy', 'Brian', 'Bran'],
          'last': ['Bonder', 'Black', 'Balwner']}
dataframe_b = pd.DataFrame(data_b, columns=['id','first', 'last'])
pd.concat([dataframe_a, dataframe_b], axis=0)#在行的方向进行



id

first

last

0

1

Alex

Anderson

1

2

Amy

Ackerman

2

3

Allen

Ali

0

4

Billy

Bonder

1

5

Brian

Black

2

6

Bran

Balwner

pd.concat([dataframe_a, dataframe_b], axis=1)#在列的方向进行



id

first

last

id

first

last

0

1

Alex

Anderson

4

Billy

Bonder

1

2

Amy

Ackerman

5

Brian

Black

2

3

Allen

Ali

6

Bran

Balwner

# 也可以用append方法进行添加
c = pd.Series([10, 'Chris', 'Chillon'], index=['id','first','last'])
dataframe.append(c, ignore_index=True)#如果c原来有名字忽略



id

first

last

0

1

Alex

Anderson

1

2

Amy

Ackerman

2

3

Allen

Ali

3

10

Chris

Chillon