Titanic Meta¶

Passengerid : 승객번호
Survived : 생존여부(1:생존, 0:사망)
Pclass : 승선 등급(1: 1st, 2:2nd, 3:3rd)
Name : 승객 이름
Sex : 성별
Age : 나이
SibSp : 동행자 수(형제)
Patch : 동행자 수(부모)
Ticket : 티켓번호
Fare : 티켓요금
Cabin : 객실번호
Embarked : 승선 항구명(C: Cherbourg, Q: Queenstown, S: Southampton)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

데이터로드¶

titanic = spark.read.csv("data/kaggle-titanic-train.csv", inferSchema=True, header=True)

titanic.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)

titanic.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    0|          330877| 8.4583| null|       Q|
|          7|       0|     1|McCarthy, Mr. Tim...|  male|54.0|    0|    0|           17463|51.8625|  E46|       S|
|          8|       0|     3|Palsson, Master. ...|  male| 2.0|    3|    1|          349909| 21.075| null|       S|
|          9|       1|     3|Johnson, Mrs. Osc...|female|27.0|    0|    2|          347742|11.1333| null|       S|
|         10|       1|     2|Nasser, Mrs. Nich...|female|14.0|    1|    0|          237736|30.0708| null|       C|
|         11|       1|     3|Sandstrom, Miss. ...|female| 4.0|    1|    1|         PP 9549|   16.7|   G6|       S|
|         12|       1|     1|Bonnell, Miss. El...|female|58.0|    0|    0|          113783|  26.55| C103|       S|
|         13|       0|     3|Saundercock, Mr. ...|  male|20.0|    0|    0|       A/5. 2151|   8.05| null|       S|
|         14|       0|     3|Andersson, Mr. An...|  male|39.0|    1|    5|          347082| 31.275| null|       S|
|         15|       0|     3|Vestrom, Miss. Hu...|female|14.0|    0|    0|          350406| 7.8542| null|       S|
|         16|       1|     2|Hewlett, Mrs. (Ma...|female|55.0|    0|    0|          248706|   16.0| null|       S|
|         17|       0|     3|Rice, Master. Eugene|  male| 2.0|    4|    1|          382652| 29.125| null|       Q|
|         18|       1|     2|Williams, Mr. Cha...|  male|null|    0|    0|          244373|   13.0| null|       S|
|         19|       0|     3|Vander Planke, Mr...|female|31.0|    1|    0|          345763|   18.0| null|       S|
|         20|       1|     3|Masselmani, Mrs. ...|female|null|    0|    0|            2649|  7.225| null|       C|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 20 rows

titanic.count()

891

titanic.summary().show()

+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+
|summary|      PassengerId|           Survived|            Pclass|                Name|   Sex|               Age|             SibSp|              Parch|            Ticket|             Fare|Cabin|Embarked|
+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+
|  count|              891|                891|               891|                 891|   891|               714|               891|                891|               891|              891|  204|     889|
|   mean|            446.0| 0.3838383838383838| 2.308641975308642|                null|  null| 29.69911764705882|0.5230078563411896|0.38159371492704824|260318.54916792738| 32.2042079685746| null|    null|
| stddev|257.3538420152301|0.48659245426485753|0.8360712409770491|                null|  null|14.526497332334035|1.1027434322934315| 0.8060572211299488|471609.26868834975|49.69342859718089| null|    null|
|    min|                1|                  0|                 1|"Andersson, Mr. A...|female|              0.42|                 0|                  0|            110152|              0.0|  A10|       C|
|    25%|              223|                  0|                 2|                null|  null|              20.0|                 0|                  0|           19996.0|           7.8958| null|    null|
|    50%|              446|                  0|                 3|                null|  null|              28.0|                 0|                  0|          236171.0|          14.4542| null|    null|
|    75%|              669|                  1|                 3|                null|  null|              38.0|                 1|                  0|          347743.0|             31.0| null|    null|
|    max|              891|                  1|                 3|van Melkebeke, Mr...|  male|              80.0|                 8|                  6|         WE/P 5735|         512.3292|    T|       S|
+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+

titanic_pandas = titanic.toPandas()

titanic_pandas.describe()

sns.countplot(x='Survived', data = titanic_pandas)

<matplotlib.axes._subplots.AxesSubplot at 0x7fc0d535ca90>

print(titanic_pandas.Survived.sum()/titanic_pandas.Survived.count())

0.3838383838383838

titanic_pandas.groupby(['Survived', 'Sex'])['Survived'].count()

Survived  Sex   
0         female     81
          male      468
1         female    233
          male      109
Name: Survived, dtype: int64

sns.catplot(x='Sex', col='Survived', kind='count', data=titanic_pandas)

<seaborn.axisgrid.FacetGrid at 0x7fc0d3dd7ed0>

titanic_pandas[titanic_pandas.Sex == 'female'].Survived.sum()/titanic_pandas[titanic_pandas.Sex == 'female'].Survived.count()

0.7420382165605095

titanic_pandas[titanic_pandas.Sex == 'male'].Survived.sum()/titanic_pandas[titanic_pandas.Sex == 'male'].Survived.count()

0.18890814558058924

#성별에 따른 생존율
f, ax = plt.subplots(1, 2, figsize=(16,7))
titanic_pandas['Survived'][titanic_pandas['Sex'] == 'male'].value_counts().plot.pie(explode=[0,0.2], autopct='%1.1f%%', ax=ax[0], shadow=True)
titanic_pandas['Survived'][titanic_pandas['Sex'] == 'female'].value_counts().plot.pie(explode=[0,0.2], autopct='%1.1f%%', ax=ax[1], shadow=True)

ax[0].set_title('Survived Male')
ax[1].set_title('Survived Female')

plt.show()

#등급별 생존율
pd.crosstab(titanic_pandas.Pclass, titanic_pandas.Survived, margins=True)

titanic_pandas.Survived[titanic_pandas.Pclass == 1].sum()/titanic_pandas.Survived[titanic_pandas.Pclass == 1].count()

0.6296296296296297

titanic_pandas.Survived[titanic_pandas.Pclass == 2].sum()/titanic_pandas.Survived[titanic_pandas.Pclass == 2].count()

0.47282608695652173

titanic_pandas.Survived[titanic_pandas.Pclass == 3].sum()/titanic_pandas.Survived[titanic_pandas.Pclass == 3].count()

0.24236252545824846

sns.catplot('Pclass', 'Survived', kind='point', data=titanic_pandas)

<seaborn.axisgrid.FacetGrid at 0x7fc0d1c40490>

pd.crosstab([titanic_pandas.Sex, titanic_pandas.Survived], [titanic_pandas.Pclass])

sns.catplot('Pclass', 'Survived', hue='Sex', kind='point', data=titanic_pandas)

<seaborn.axisgrid.FacetGrid at 0x7fc0d1ccf110>

sns.catplot(x='Survived', col='Embarked', kind='count', data=titanic_pandas)

<seaborn.axisgrid.FacetGrid at 0x7fc0d1c39f10>

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

Spark을 이용한 Deeplearning (0)	2020.06.11
Spark SQL(Pyspark) (0)	2020.05.26
R을 이용한 Bioinformatics (Bioconductor) (1)	2020.04.20
Apache Spark 기능 (0)	2020.02.13
Elastic Search란? (0)	2020.01.20

Spark DataFrame (PySpark)

Titanic Meta¶

데이터로드¶

'Data > Bigdata' 카테고리의 다른 글

티스토리툴바

Survived	0	1	All
Pclass
1	80	136	216
2	97	87	184
3	372	119	491
All	549	342	891