from pysparkling import *
import h2o

hc = H2OContext.getOrCreate(spark)

Connecting to H2O server at http://192.168.56.101:54321 ... successful.

Sparkling Water Context:
 * Sparkling Water Version: 3.28.0.1-1-2.4
 * H2O name: sparkling-water-carbig_local-1580951039227
 * cluster size: 1
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (driver,192.168.56.101,54321)
  ------------------------

  Open H2O Flow in browser: http://192.168.56.101:54321 (CMD + click in Mac OSX)

housingDF = spark.read.csv("data/housing.data", inferSchema=True, header=True)

housingDF.printSchema()

root
 |--  0.00632: double (nullable = true)
 |--   18.00: double (nullable = true)
 |--    2.310: double (nullable = true)
 |--   0: double (nullable = true)
 |--   0.5380: double (nullable = true)
 |--   6.5750: double (nullable = true)
 |--   65.20: double (nullable = true)
 |--   4.0900: double (nullable = true)
 |--    1: double (nullable = true)
 |--   296.0: double (nullable = true)
 |--   15.30: double (nullable = true)
 |--  396.90: double (nullable = true)
 |--    4.98: double (nullable = true)
 |--   24.00: double (nullable = true)

housingDF.show(5)

+--------+-------+--------+---+--------+--------+-------+--------+----+-------+-------+-------+-------+-------+
| 0.00632|  18.00|   2.310|  0|  0.5380|  6.5750|  65.20|  4.0900|   1|  296.0|  15.30| 396.90|   4.98|  24.00|
+--------+-------+--------+---+--------+--------+-------+--------+----+-------+-------+-------+-------+-------+
| 0.02731|    0.0|    7.07|0.0|   0.469|   6.421|   78.9|  4.9671| 2.0|  242.0|   17.8|  396.9|   9.14|   21.6|
| 0.02729|    0.0|    7.07|0.0|   0.469|   7.185|   61.1|  4.9671| 2.0|  242.0|   17.8| 392.83|   4.03|   34.7|
| 0.03237|    0.0|    2.18|0.0|   0.458|   6.998|   45.8|  6.0622| 3.0|  222.0|   18.7| 394.63|   2.94|   33.4|
| 0.06905|    0.0|    2.18|0.0|   0.458|   7.147|   54.2|  6.0622| 3.0|  222.0|   18.7|  396.9|   5.33|   36.2|
| 0.02985|    0.0|    2.18|0.0|   0.458|    6.43|   58.7|  6.0622| 3.0|  222.0|   18.7| 394.12|   5.21|   28.7|
+--------+-------+--------+---+--------+--------+-------+--------+----+-------+-------+-------+-------+-------+
only showing top 5 rows
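
The odd column names above are really the first data record: housing.data has no header line, so header=True uses the first row as names and drops it from the data. The sketch below reproduces the effect with Python's standard csv module rather than Spark. It assumes a comma-separated file and uses only the first three columns of two rows copied from the output above; the names list is chosen for illustration.

```python
import csv
import io

# Two records copied from the output above (first three columns only).
# Assumption: the file is comma-separated and has no header line.
sample = "0.00632,18.00,2.310\n0.02731,0.0,7.07\n"

# Treating the first line as a header loses one record.
with_header = list(csv.DictReader(io.StringIO(sample)))
print(len(with_header))        # 1
print(list(with_header[0]))    # ['0.00632', '18.00', '2.310']

# Supplying the column names explicitly keeps every line as data.
names = ["crim", "zn", "indus"]
no_header = list(csv.DictReader(io.StringIO(sample), fieldnames=names))
print(len(no_header))          # 2
print(no_header[0]["crim"])    # 0.00632
```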

from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, DoubleType

housingSchema = StructType([
    StructField('crim', DoubleType(), True),
    StructField('zn', DoubleType(), True),
    StructField('indus', DoubleType(), True),
    StructField('chas', DoubleType(), True),
    StructField('nox', DoubleType(), True),
    StructField('rm', DoubleType(), True),
    StructField('age', DoubleType(), True),
    StructField('dis', DoubleType(), True),
    StructField('rad', DoubleType(), True),
    StructField('tax', DoubleType(), True),
    StructField('ptratio', DoubleType(), True),
    StructField('b', DoubleType(), True),
    StructField('lstat', DoubleType(), True),
    StructField('medv', DoubleType(), True),
])

# With an explicit schema, inferSchema is unnecessary; header defaults to False, so the first row is read as data
housingDF = spark.read.csv("data/housing.data", schema=housingSchema)

housingDF.show(5)

+-------+----+-----+----+-----+-----+----+------+---+-----+-------+------+-----+----+
|   crim|  zn|indus|chas|  nox|   rm| age|   dis|rad|  tax|ptratio|     b|lstat|medv|
+-------+----+-----+----+-----+-----+----+------+---+-----+-------+------+-----+----+
|0.00632|18.0| 2.31| 0.0|0.538|6.575|65.2|  4.09|1.0|296.0|   15.3| 396.9| 4.98|24.0|
|0.02731| 0.0| 7.07| 0.0|0.469|6.421|78.9|4.9671|2.0|242.0|   17.8| 396.9| 9.14|21.6|
|0.02729| 0.0| 7.07| 0.0|0.469|7.185|61.1|4.9671|2.0|242.0|   17.8|392.83| 4.03|34.7|
|0.03237| 0.0| 2.18| 0.0|0.458|6.998|45.8|6.0622|3.0|222.0|   18.7|394.63| 2.94|33.4|
|0.06905| 0.0| 2.18| 0.0|0.458|7.147|54.2|6.0622|3.0|222.0|   18.7| 396.9| 5.33|36.2|
+-------+----+-----+----+-----+-----+----+------+---+-----+-------+------+-----+----+
only showing top 5 rows

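# Convert the Spark DataFrame into an H2OFrame so that H2O can use it
housingDFh20 = hc.as_h2o_frame(housingDF, 'housing')

# The ratios must sum to less than 1; the remaining 1% of the rows forms a third, unused frame
splitDF = housingDFh20.split_frame([0.75, 0.24])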

train = splitDF[0]

train.frame_id = 'housing_train'

test = splitDF[1]
test.frame_id = 'housing_test'
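
The split can be pictured with a small pure-Python sketch. As far as the H2O documentation describes it, split_frame returns one more frame than the number of ratios given and assigns rows probabilistically, so the split sizes are approximate. The function below is a hypothetical illustration of that idea, not H2O's implementation; the 506 rows (the usual size of this data set) and the seed are assumptions.

```python
import random

def probabilistic_split(n_rows, ratios, seed=42):
    """Assign each row index to a split by drawing a uniform random number."""
    if sum(ratios) >= 1.0:
        raise ValueError("ratios must sum to less than 1")
    # Cumulative thresholds, e.g. [0.75, 0.99]; draws above the last
    # threshold fall into the implied final split.
    thresholds = []
    total = 0.0
    for r in ratios:
        total += r
        thresholds.append(total)
    rng = random.Random(seed)
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        u = rng.random()
        for i, t in enumerate(thresholds):
            if u < t:
                splits[i].append(row)
                break
        else:
            splits[-1].append(row)
    return splits

parts = probabilistic_split(506, [0.75, 0.24])
# Three splits; the sizes are close to 75% / 24% / 1% but not exact.
print([len(p) for p in parts])
```

Only the first two frames are used in this post as the training and test sets; the small third frame is ignored.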

from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# Three hidden layers of 400 units each, matching the layer table in the model output
m = H2ODeepLearningEstimator(hidden=[400, 400, 400], epochs=800, activation='rectifier_with_dropout',
                             hidden_dropout_ratios=[0.4, 0.2, 0.4])

# Predict medv (the last column) from the 13 remaining columns
m.train(x=train.names[:-1], y=train.names[-1], training_frame=train, validation_frame=test)

deeplearning Model Build progress: |██████████████████████████████████████| 100%
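
The model output reports 326,801 weights/biases for layers of 13, 400, 400, 400 and 1 units. For a fully connected network this figure can be checked by hand: each layer contributes inputs × outputs weights plus one bias per output. The helper below is a minimal sketch with a hypothetical name and is not part of the H2O API.

```python
def count_parameters(layer_sizes):
    """Number of weights plus biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 13 inputs, three hidden layers of 400 units, 1 output (medv)
print(count_parameters([13, 400, 400, 400, 1]))  # 326801
```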

m.show()

Model Details
=============
H2ODeepLearningEstimator :  Deep Learning
Model Key:  DeepLearning_model_python_1580951124328_4


Status of Neuron Layers: predicting medv, regression, gaussian distribution, Quadratic loss, 326,801 weights/biases, 3.8 MB, 228,600 training samples, mini-batch size 1

	layer	units	type	dropout	l1	l2	mean_rate	rate_rms	momentum	mean_weight	weight_rms	mean_bias	bias_rms
0	1	13	Input	0
1	2	400	RectifierDropout	40	0	0	0.0298622	0.0174399	0	-0.0014201	0.13315	0.341532	0.0858229
2	3	400	RectifierDropout	20	0	0	0.0760471	0.0467215	0	-0.0191901	0.0649896	0.934475	0.0971478
3	4	400	RectifierDropout	40	0	0	0.0570879	0.0641115	0	-0.00851255	0.0545392	0.940617	0.0345877
4	5	1	Linear		0	0	0.00146979	0.000868104	0	0.0024622	0.0399774	-0.117518	1.09713e-154


ModelMetricsRegression: deeplearning
** Reported on train data. **

MSE: 4.491494519981133
RMSE: 2.1193146344941645
MAE: 1.5742916415136954
RMSLE: 0.12387445579540285
Mean Residual Deviance: 4.491494519981133

ModelMetricsRegression: deeplearning
** Reported on validation data. **

MSE: 8.45884198732803
RMSE: 2.9084088411583457
MAE: 2.234092485453941
RMSLE: 0.15526005315606428
Mean Residual Deviance: 8.45884198732803

Scoring History:

	timestamp	duration	training_speed	epochs	iterations	samples	training_rmse	training_deviance	training_mae	training_r2	validation_rmse	validation_deviance	validation_mae	validation_r2
0	2020-02-06 10:48:15	0.000 sec	None	0.0	0	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	2020-02-06 10:48:19	4.595 sec	865 obs/sec	10.0	1	3810.0	4.306655	18.547280	2.803999	0.775459	4.799666	23.036791	2.904183	0.742674
2	2020-02-06 10:48:26	10.973 sec	1071 obs/sec	30.0	3	11430.0	3.785238	14.328024	2.507152	0.826539	4.053157	16.428082	2.620617	0.816495
3	2020-02-06 10:48:31	16.456 sec	1187 obs/sec	50.0	5	19050.0	3.643007	13.271497	2.679611	0.839330	3.850721	14.828051	2.911060	0.834368
4	2020-02-06 10:48:36	21.586 sec	1267 obs/sec	70.0	7	26670.0	3.194291	10.203496	2.265252	0.876472	3.363074	11.310266	2.527003	0.873662
5	2020-02-06 10:48:43	28.481 sec	1368 obs/sec	100.0	10	38100.0	3.270587	10.696737	2.363280	0.870501	3.257141	10.608968	2.478348	0.881496
6	2020-02-06 10:48:49	34.858 sec	1452 obs/sec	130.0	13	49530.0	3.226082	10.407603	2.287425	0.874001	3.468513	12.030583	2.647730	0.865616
7	2020-02-06 10:48:55	40.821 sec	1525 obs/sec	160.0	16	60960.0	3.046344	9.280211	2.163496	0.887650	3.427086	11.744918	2.555101	0.868807
8	2020-02-06 10:49:01	46.777 sec	1581 obs/sec	190.0	19	72390.0	2.877202	8.278294	2.066885	0.899780	3.116319	9.711442	2.432577	0.891521
9	2020-02-06 10:49:07	52.413 sec	1634 obs/sec	220.0	22	83820.0	3.049112	9.297085	2.234404	0.887446	3.379012	11.417720	2.578131	0.872462
10	2020-02-06 10:49:13	58.579 sec	1661 obs/sec	250.0	25	95250.0	2.830604	8.012319	2.068955	0.903000	3.150313	9.924469	2.421001	0.889142
11	2020-02-06 10:49:19	1 min 4.491 sec	1690 obs/sec	280.0	28	106680.0	2.454265	6.023416	1.793329	0.927078	2.935127	8.614973	2.173127	0.903769
12	2020-02-06 10:49:24	1 min 9.860 sec	1727 obs/sec	310.0	31	118110.0	2.589645	6.706263	1.883426	0.918811	3.270033	10.693118	2.472490	0.880556
13	2020-02-06 10:49:30	1 min 14.980 sec	1765 obs/sec	340.0	34	129540.0	2.429125	5.900646	1.783728	0.928564	3.089552	9.545335	2.389783	0.893377
14	2020-02-06 10:49:35	1 min 20.108 sec	1798 obs/sec	370.0	37	140970.0	2.393984	5.731161	1.776052	0.930616	3.104331	9.636872	2.396955	0.892354
15	2020-02-06 10:49:41	1 min 26.759 sec	1839 obs/sec	410.0	41	156210.0	2.119315	4.491495	1.574292	0.945624	2.908409	8.458842	2.234092	0.905513
16	2020-02-06 10:49:46	1 min 31.788 sec	1866 obs/sec	440.0	44	167640.0	2.072595	4.295649	1.529558	0.947995	3.003650	9.021914	2.272905	0.899223
17	2020-02-06 10:49:53	1 min 38.342 sec	1899 obs/sec	480.0	48	182880.0	1.955997	3.825923	1.421920	0.953682	3.003762	9.022586	2.243523	0.899216
18	2020-02-06 10:49:59	1 min 44.825 sec	1931 obs/sec	520.0	52	198120.0	1.890754	3.574952	1.358395	0.956720	2.986179	8.917266	2.229141	0.900392
19	2020-02-06 10:50:06	1 min 51.213 sec	1959 obs/sec	560.0	56	213360.0	1.830215	3.349685	1.327494	0.959447	2.927398	8.569658	2.192749	0.904275

See the whole table with table.as_data_frame()

Variable Importances:

	variable	relative_importance	scaled_importance	percentage
0	lstat	1.000000	1.000000	0.119270
1	rm	0.863339	0.863339	0.102971
2	dis	0.812267	0.812267	0.096879
3	nox	0.794105	0.794105	0.094713
4	age	0.738908	0.738908	0.088130
5	tax	0.619250	0.619250	0.073858
6	crim	0.602778	0.602778	0.071893
7	ptratio	0.567626	0.567626	0.067701
8	rad	0.562254	0.562254	0.067060
9	b	0.560138	0.560138	0.066808
10	indus	0.523724	0.523724	0.062465
11	zn	0.392638	0.392638	0.046830
12	chas	0.347298	0.347298	0.041422

H2O cluster information (reported when connecting to the H2O server):

H2O cluster uptime:	14 secs
H2O cluster timezone:	Asia/Seoul
H2O data parsing timezone:	UTC
H2O cluster version:	3.28.0.1
H2O cluster version age:	1 month and 20 days
H2O cluster name:	sparkling-water-carbig_local-1580951039227
H2O cluster total nodes:	1
H2O cluster free memory:	808 Mb
H2O cluster total cores:	2
H2O cluster allowed cores:	2
H2O cluster status:	accepting new members, healthy
H2O connection url:	http://192.168.56.101:54321
H2O connection proxy:	None
H2O internal security:	False
H2O API Extensions:	XGBoost, Algos, Amazon S3, AutoML, Core V3, TargetEncoder, Core V4
Python version:	3.7.4 final
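
The percentage column of the variable importance table is each relative importance divided by the sum over all variables. The sketch below recomputes it from the values printed in the table; the dictionary is copied from that output, and because those values are rounded to six decimals the results agree only to about that precision.

```python
# Relative importances copied from the variable importance table.
relative_importance = {
    "lstat": 1.000000, "rm": 0.863339, "dis": 0.812267, "nox": 0.794105,
    "age": 0.738908, "tax": 0.619250, "crim": 0.602778, "ptratio": 0.567626,
    "rad": 0.562254, "b": 0.560138, "indus": 0.523724, "zn": 0.392638,
    "chas": 0.347298,
}

# Normalise so that the shares add up to 1.
total = sum(relative_importance.values())
percentage = {name: value / total for name, value in relative_importance.items()}

print(round(percentage["lstat"], 5))  # 0.11927
print(round(percentage["chas"], 5))   # 0.04142
```

Read this way, lstat accounts for roughly 12% of the total importance and chas for roughly 4%, so no single variable dominates the model.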


