Data/Bigdata

Spark DataFrame (PySpark)

안녕하세요. (주)씨앤텍시스템즈입니다.

 

이번에 Spark 2.x의 주요 데이터처리 타입인 DataFrame과 널리 알려진 타이타닉 데이터를 Spark Dataframe으로 처리하는 예제를 수행해보겠습니다.


본 포스팅 앞의 글에서 2020/01/09 - [Bigdata] - Apache Spark란? 을 통해 Spark란 무엇인지,

 

Apache Spark란?

안녕하세요 씨앤텍시스템즈입니다. 이번 포스팅은 빅데이터 소프트웨어이자 가장 화두인 Apache Spark에 대해서 살펴보겠습니다. 1. Apache Spark이란? Apache Spark는 인-메모리 기반 통합 컴퓨팅 엔진이며, 빅데..

cntechsystems.tistory.com

2020/02/13 - [Bigdata] - Apache Spark 기능에  대해서 알아 보았습니다.

 

Apache Spark 기능

안녕하세요 씨앤텍시스템즈입니다. 이번 포스팅은 이전 포스팅인 Apache Spark란?에 이어서 Spark 기능에 대해서 살펴보겠습니다. Apache Spark의 다양한 기능 중 메인 기능인 아래 3가지를 주로 살펴보겠습니다...

cntechsystems.tistory.com

이번엔 Spark 기능 중 DataFrame을 이용하여 타이타닉을 분석해보는 시간을 가지고 데이터는 kaggle에서 참고하였습니다.

 

 

 

 

 

Titanic Meta

 
  • Passengerid : 승객번호
  • Survived : 생존여부(1:생존, 0:사망)
  • Pclass : 승선 등급(1: 1st, 2:2nd, 3:3rd)
  • Name : 승객 이름
  • Sex : 성별
  • Age : 나이
  • SibSp : 동행자 수(형제)
  • Patch : 동행자 수(부모)
  • Ticket : 티켓번호
  • Fare : 티켓요금
  • Cabin : 객실번호
  • Embarked : 승선 항구명(C: Cherbourg, Q: Queenstown, S: Southampton)
In [29]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
 

데이터로드

In [12]:
titanic = spark.read.csv("data/kaggle-titanic-train.csv", inferSchema=True, header=True)
In [13]:
titanic.printSchema()
 
root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)

In [14]:
titanic.show()
 
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    0|          330877| 8.4583| null|       Q|
|          7|       0|     1|McCarthy, Mr. Tim...|  male|54.0|    0|    0|           17463|51.8625|  E46|       S|
|          8|       0|     3|Palsson, Master. ...|  male| 2.0|    3|    1|          349909| 21.075| null|       S|
|          9|       1|     3|Johnson, Mrs. Osc...|female|27.0|    0|    2|          347742|11.1333| null|       S|
|         10|       1|     2|Nasser, Mrs. Nich...|female|14.0|    1|    0|          237736|30.0708| null|       C|
|         11|       1|     3|Sandstrom, Miss. ...|female| 4.0|    1|    1|         PP 9549|   16.7|   G6|       S|
|         12|       1|     1|Bonnell, Miss. El...|female|58.0|    0|    0|          113783|  26.55| C103|       S|
|         13|       0|     3|Saundercock, Mr. ...|  male|20.0|    0|    0|       A/5. 2151|   8.05| null|       S|
|         14|       0|     3|Andersson, Mr. An...|  male|39.0|    1|    5|          347082| 31.275| null|       S|
|         15|       0|     3|Vestrom, Miss. Hu...|female|14.0|    0|    0|          350406| 7.8542| null|       S|
|         16|       1|     2|Hewlett, Mrs. (Ma...|female|55.0|    0|    0|          248706|   16.0| null|       S|
|         17|       0|     3|Rice, Master. Eugene|  male| 2.0|    4|    1|          382652| 29.125| null|       Q|
|         18|       1|     2|Williams, Mr. Cha...|  male|null|    0|    0|          244373|   13.0| null|       S|
|         19|       0|     3|Vander Planke, Mr...|female|31.0|    1|    0|          345763|   18.0| null|       S|
|         20|       1|     3|Masselmani, Mrs. ...|female|null|    0|    0|            2649|  7.225| null|       C|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 20 rows

In [15]:
titanic.count()
Out[15]:
891
In [16]:
titanic.summary().show()
 
+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+
|summary|      PassengerId|           Survived|            Pclass|                Name|   Sex|               Age|             SibSp|              Parch|            Ticket|             Fare|Cabin|Embarked|
+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+
|  count|              891|                891|               891|                 891|   891|               714|               891|                891|               891|              891|  204|     889|
|   mean|            446.0| 0.3838383838383838| 2.308641975308642|                null|  null| 29.69911764705882|0.5230078563411896|0.38159371492704824|260318.54916792738| 32.2042079685746| null|    null|
| stddev|257.3538420152301|0.48659245426485753|0.8360712409770491|                null|  null|14.526497332334035|1.1027434322934315| 0.8060572211299488|471609.26868834975|49.69342859718089| null|    null|
|    min|                1|                  0|                 1|"Andersson, Mr. A...|female|              0.42|                 0|                  0|            110152|              0.0|  A10|       C|
|    25%|              223|                  0|                 2|                null|  null|              20.0|                 0|                  0|           19996.0|           7.8958| null|    null|
|    50%|              446|                  0|                 3|                null|  null|              28.0|                 0|                  0|          236171.0|          14.4542| null|    null|
|    75%|              669|                  1|                 3|                null|  null|              38.0|                 1|                  0|          347743.0|             31.0| null|    null|
|    max|              891|                  1|                 3|van Melkebeke, Mr...|  male|              80.0|                 8|                  6|         WE/P 5735|         512.3292|    T|       S|
+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+

In [17]:
titanic_pandas = titanic.toPandas()
In [19]:
titanic_pandas.describe()
Out[19]:
  PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
In [20]:
sns.countplot(x='Survived', data = titanic_pandas)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc0d535ca90>
 
In [24]:
print(titanic_pandas.Survived.sum()/titanic_pandas.Survived.count())
 
0.3838383838383838
In [21]:
titanic_pandas.groupby(['Survived', 'Sex'])['Survived'].count()
Out[21]:
Survived  Sex   
0         female     81
          male      468
1         female    233
          male      109
Name: Survived, dtype: int64
In [25]:
sns.catplot(x='Sex', col='Survived', kind='count', data=titanic_pandas)
Out[25]:
<seaborn.axisgrid.FacetGrid at 0x7fc0d3dd7ed0>
 
In [26]:
titanic_pandas[titanic_pandas.Sex == 'female'].Survived.sum()/titanic_pandas[titanic_pandas.Sex == 'female'].Survived.count()
Out[26]:
0.7420382165605095
In [27]:
titanic_pandas[titanic_pandas.Sex == 'male'].Survived.sum()/titanic_pandas[titanic_pandas.Sex == 'male'].Survived.count()
Out[27]:
0.18890814558058924
In [32]:
#성별에 따른 생존율
f, ax = plt.subplots(1, 2, figsize=(16,7))
titanic_pandas['Survived'][titanic_pandas['Sex'] == 'male'].value_counts().plot.pie(explode=[0,0.2], autopct='%1.1f%%', ax=ax[0], shadow=True)
titanic_pandas['Survived'][titanic_pandas['Sex'] == 'female'].value_counts().plot.pie(explode=[0,0.2], autopct='%1.1f%%', ax=ax[1], shadow=True)

ax[0].set_title('Survived Male')
ax[1].set_title('Survived Female')

plt.show()
 
In [33]:
#등급별 생존율
pd.crosstab(titanic_pandas.Pclass, titanic_pandas.Survived, margins=True)
Out[33]:
Survived 0 1 All
Pclass      
1 80 136 216
2 97 87 184
3 372 119 491
All 549 342 891
In [34]:
titanic_pandas.Survived[titanic_pandas.Pclass == 1].sum()/titanic_pandas.Survived[titanic_pandas.Pclass == 1].count()
Out[34]:
0.6296296296296297
In [35]:
titanic_pandas.Survived[titanic_pandas.Pclass == 2].sum()/titanic_pandas.Survived[titanic_pandas.Pclass == 2].count()
Out[35]:
0.47282608695652173
In [36]:
titanic_pandas.Survived[titanic_pandas.Pclass == 3].sum()/titanic_pandas.Survived[titanic_pandas.Pclass == 3].count()
Out[36]:
0.24236252545824846
In [37]:
sns.catplot('Pclass', 'Survived', kind='point', data=titanic_pandas)
Out[37]:
<seaborn.axisgrid.FacetGrid at 0x7fc0d1c40490>
 
In [39]:
pd.crosstab([titanic_pandas.Sex, titanic_pandas.Survived], [titanic_pandas.Pclass])
Out[39]:
  Pclass 1 2 3
Sex Survived      
female 0 3 6 72
1 91 70 72
male 0 77 91 300
1 45 17 47
In [40]:
sns.catplot('Pclass', 'Survived', hue='Sex', kind='point', data=titanic_pandas)
Out[40]:
<seaborn.axisgrid.FacetGrid at 0x7fc0d1ccf110>
 
In [41]:
sns.catplot(x='Survived', col='Embarked', kind='count', data=titanic_pandas)
Out[41]:
<seaborn.axisgrid.FacetGrid at 0x7fc0d1c39f10>
 
In [ ]:
 
 
728x90

'Data > Bigdata' 카테고리의 다른 글

Spark을 이용한 Deeplearning  (0) 2020.06.11
Spark SQL(Pyspark)  (0) 2020.05.26
R을 이용한 Bioinformatics (Bioconductor)  (1) 2020.04.20
Apache Spark 기능  (0) 2020.02.13
Elastic Search란?  (0) 2020.01.20