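Analysis
With the data in hand, the first step is a quick look at its overall shape. Printing the first few rows shows what it contains: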
index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
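The columns are: passenger ID, survived (1) or not (0), passenger class, name, sex, age, number of siblings/spouses aboard, number of parents/children aboard, ticket number, fare, cabin number, and port of embarkation.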
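A first look like this might be produced as follows. This is only a sketch: the three rows are copied from the table above and read from an in-memory string so the snippet runs on its own, and the variable name `train_df` and the file name `train.csv` mentioned in the comment are my assumptions, not something the original post states.

```python
import io

import pandas as pd

# Three rows copied from the table above. With the real data set this would
# be something like pd.read_csv("train.csv") (file name assumed).
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S\n'
    '3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S\n'
    '4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S\n'
)
train_df = pd.read_csv(sample_csv)

# First rows of the frame, similar to the table above.
print(train_df.head())

# Number of missing values per column; useful for spotting gaps early.
print(train_df.isnull().sum())
```

The per-column count of missing values is worth printing at this stage, because it shows which columns will need filling in later.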
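The next step is to analyze the data.
My own guess is that if the people on board behaved like gentlemen, the factors most likely to be tied to survival are sex and age.
Next comes social standing at the time, for which passenger class and fare are the likely indicators.
Strength in numbers may also have mattered, so whether a passenger had relatives on board is another factor worth considering.
Now to check these guesses against the data: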
index | Sex | Survived |
0 | female | 0.742038 |
1 | male | 0.188908 |
index | Pclass | Survived |
0 | 1 | 0.629630 |
1 | 2 | 0.472826 |
2 | 3 | 0.242363 |
index | SibSp | Survived |
1 | 1 | 0.535885 |
2 | 2 | 0.464286 |
0 | 0 | 0.345395 |
3 | 3 | 0.250000 |
4 | 4 | 0.166667 |
5 | 5 | 0.000000 |
6 | 8 | 0.000000 |
index | Parch | Survived |
3 | 3 | 0.600000 |
1 | 1 | 0.550847 |
2 | 2 | 0.500000 |
0 | 0 | 0.343658 |
5 | 5 | 0.200000 |
4 | 4 | 0.000000 |
6 | 6 | 0.000000 |
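The columns that can be analyzed directly already tell a story. Sex and passenger class do have a strong effect on survival: about 74% of women survived against 19% of men, and 63% of first-class passengers against 24% of third-class passengers. The number of relatives on board has a less clear-cut effect.
The remaining columns can be analyzed in the same way. Once the rates are computed, they show whether a given column is a key factor in whether a passenger survived.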
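The survival-rate tables above are what a group-by followed by a mean would give. Here is a minimal sketch of that idea. It uses only the five passengers shown earlier, so the printed rates will not match the full-data tables, and the helper name `survival_rate` is my own invention for illustration.

```python
import pandas as pd

# Toy data: Sex, Pclass and Survived for the five passengers shown earlier.
train_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1, 3],
    "Survived": [0, 1, 1, 1, 0],
})


def survival_rate(df, column):
    """Mean of Survived for each value of `column`, highest rate first."""
    return (
        df[[column, "Survived"]]
        .groupby(column, as_index=False)
        .mean()
        .sort_values(by="Survived", ascending=False)
    )


print(survival_rate(train_df, "Sex"))
print(survival_rate(train_df, "Pclass"))
```

Because `Survived` is 0 or 1, its mean within a group is simply the fraction of that group that survived. The same helper could be pointed at `SibSp` or `Parch`.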
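Filling in missing data
Several columns have missing values that need to be handled first. Start by filling in the missing ages. There are many ways to do this, for example drawing a random age or using the mean age: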
index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NaN | S |
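Both ways of filling in ages could look roughly like this. It is a sketch on a made-up `Age` series with two gaps (the five rows shown above have no missing ages). Drawing the random ages from within one standard deviation of the mean is my assumption, since the post does not say how they were drawn. The cast to `int` matches the whole-number ages in the table above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values, for illustration only.
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan], name="Age")
missing = ages.isnull()

# Option 1: replace every missing age with the mean of the known ages.
mean_filled = ages.fillna(ages.mean()).astype(int)

# Option 2: replace each missing age with a random integer drawn from
# [mean - std, mean + std] (this range is an assumption, not from the post).
rng = np.random.default_rng(0)
mean, std = ages.mean(), ages.std()
low, high = int(mean - std), int(mean + std)
random_filled = ages.copy()
random_filled.loc[missing] = rng.integers(low, high + 1, size=missing.sum())
random_filled = random_filled.astype(int)

print(mean_filled.tolist())
print(random_filled.tolist())
```

Filling with the mean piles every missing passenger onto one age, while random filling keeps some spread but makes the result depend on the seed.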
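With the ages filled in, age can now be analyzed.
For a column like age, which spans a range, it makes sense to analyze by interval: use pandas' cut method to generate an interval column, then compute the statistics grouped by that column: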
index | AgeBand | Survived |
0 | (-0.08, 16.0] | 0.550000 |
1 | (16.0, 32.0] | 0.344762 |
2 | (32.0, 48.0] | 0.403226 |
3 | (48.0, 64.0] | 0.434783 |
4 | (64.0, 80.0] | 0.090909 |
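Age band does not have a clear-cut effect on survival, but the rates do differ: the youngest band (16 and under) survived at 55%, the middle bands at roughly 34% to 43%, and the oldest band (over 64) at only about 9%.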
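The interval column could be built along these lines. This is a sketch on toy data: the first five ages and outcomes come from the passengers shown earlier, while the passengers aged 0, 50 and 80 are made up so that the ages span 0 to 80 and every band has at least one member.

```python
import pandas as pd

# First five rows: the passengers shown earlier. Last three rows: made-up
# passengers so that the age range is 0..80 and no band is empty.
df = pd.DataFrame({
    "Age": [22, 38, 26, 35, 35, 0, 50, 80],
    "Survived": [0, 1, 1, 1, 0, 1, 0, 0],
})

# Split the age range into 5 equal-width intervals.
df["AgeBand"] = pd.cut(df["Age"], 5)

# Survival rate per interval, in interval order.
age_rates = (
    df[["AgeBand", "Survived"]]
    .groupby("AgeBand", as_index=False, observed=False)
    .mean()
    .sort_values(by="AgeBand")
)
print(age_rates)
```

When `pd.cut` is given a number of bins, it widens the lowest edge slightly so that the minimum value falls inside the first interval. That is why the first band in the table starts just below zero.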
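Some of the other columns also have missing values. They can be filled in with the approaches mentioned above, or with something more clever, and statistics computed after filling will be more accurate.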
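Normalization
At this point the data has been analyzed, or at least prepared, but it still cannot be fed directly into a model.
Models cannot work with text values such as male and female. In addition, the numbers in the Fare column are far larger than those in the other columns, so Fare could dominate and make the model less accurate. The data therefore has to be normalized before any model is trained.
Taking age as an example, each age is replaced by the number of the age band it falls into: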
index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBand |
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 1 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | (16.0, 32.0] |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 2 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | (32.0, 48.0] |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 1 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | (16.0, 32.0] |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 2 | 1 | 0 | 113803 | 53.1000 | C123 | S | (32.0, 48.0] |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 2 | 0 | 0 | 373450 | 8.0500 | NaN | S | (32.0, 48.0] |
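One way to get from raw ages to the band numbers in the table above is to reuse `pd.cut` with explicit edges and `labels=False`. This is a sketch, not necessarily how the original post did it. The edges are taken from the AgeBand table, and the ages are those of the five passengers shown.

```python
import pandas as pd

# Ages of the five passengers shown in the tables.
df = pd.DataFrame({"Age": [22, 38, 26, 35, 35]})

# Band edges copied from the AgeBand table: (-0.08, 16], (16, 32], ...
edges = [-0.08, 16, 32, 48, 64, 80]

# labels=False returns the position of the band (0..4) instead of the interval.
df["Age"] = pd.cut(df["Age"], bins=edges, labels=False)

print(df["Age"].tolist())
```

With explicit edges, any age outside the outermost edges would come back as NaN, so the edges need to cover the full range of the data.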
Sex and port of embarkation are normalized as well; the map function makes this simple:
index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBand |
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 1 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 0 | (16.0, 32.0] |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | 1 | 2 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1 | (32.0, 48.0] |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 1 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 0 | (16.0, 32.0] |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 2 | 1 | 0 | 113803 | 53.1000 | C123 | 0 | (32.0, 48.0] |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 2 | 0 | 0 | 373450 | 8.0500 | NaN | 0 | (32.0, 48.0] |
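The mapping step could be sketched like this. The codes for male, female, S and C are read off the table above. The code 2 for Q is my assumption, since no Q row appears in the rows shown.

```python
import pandas as pd

# Sex and Embarked for the five passengers shown in the tables.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Codes taken from the table above; the value for "Q" is an assumption.
sex_codes = {"male": 0, "female": 1}
port_codes = {"S": 0, "C": 1, "Q": 2}

df["Sex"] = df["Sex"].map(sex_codes).astype(int)
df["Embarked"] = df["Embarked"].map(port_codes).astype(int)

print(df)
```

Any value not present in the dictionary, including a missing value, comes out of `map` as NaN, and the `astype(int)` call would then fail. Missing ports would therefore have to be filled in before this step.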
Drop the columns we don't need (admittedly, only because I'm lazy and don't want to analyze the other factors):
index | Survived | Pclass | Sex | Age | SibSp | Parch | Embarked |
0 | 0 | 3 | 0 | 1 | 1 | 0 | 0 |
1 | 1 | 1 | 1 | 2 | 1 | 0 | 1 |
2 | 1 | 3 | 1 | 1 | 0 | 0 | 0 |
3 | 1 | 1 | 1 | 2 | 1 | 0 | 0 |
4 | 0 | 3 | 0 | 2 | 0 | 0 | 0 |
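Dropping the unused columns might look like the sketch below. The two rows are the first and last passengers from the previous table, with `AgeBand` stored as plain text to keep the example short. The list of dropped columns is simply the difference between the headers of the two tables.

```python
import pandas as pd

# Two rows from the previous table (first and last passenger shown).
df = pd.DataFrame({
    "PassengerId": [1, 5],
    "Survived": [0, 0],
    "Pclass": [3, 3],
    "Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry"],
    "Sex": [0, 0],
    "Age": [1, 2],
    "SibSp": [1, 0],
    "Parch": [0, 0],
    "Ticket": ["A/5 21171", "373450"],
    "Fare": [7.25, 8.05],
    "Cabin": [None, None],
    "Embarked": [0, 0],
    "AgeBand": ["(16.0, 32.0]", "(32.0, 48.0]"],
})

# Columns present in the earlier table but absent from the final one.
unused = ["PassengerId", "Name", "Ticket", "Fare", "Cabin", "AgeBand"]
model_df = df.drop(columns=unused)

print(model_df)
```

What is left is all numeric: `Survived` is the value to predict and the other six columns are the inputs.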