
I'm experiencing an issue where the h2o.H2OFrame([1,2,3]) command creates a frame in h2o on the internal backend, but not on an external backend. With the external backend the call never returns: the frame does get created, but the client process hangs.

It appears that the POST to /3/ParseSetup never returns (this is where urllib3 seems to get stuck). More specifically, the h2o logs for a connection to the external backend show the following (date and IP shortened):

    10.*.*.15:56565 8120 #7003-141 INFO: Reading byte InputStream into Frame:
    10.*.*.15:56565 8120 #7003-141 INFO: frameKey: upload_8a440dcf457c1e5deacf76a7ac1a4955
    10.*.*.15:56565 8120 #7003-141 DEBUG: write-lock upload_8a440dcf457c1e5deacf76a7ac1a4955 by job null
    10.*.*.15:56565 8120 #7003-141 INFO: totalChunks: 1
    10.*.*.15:56565 8120 #7003-141 INFO: totalBytes:  21
    10.*.*.15:56565 8120 #7003-141 DEBUG: unlock upload_8a440dcf457c1e5deacf76a7ac1a4955 by job null
    10.*.*.15:56565 8120 #7003-141 INFO: Success.
    10.*.*.15:56565 8120 #7003-135 INFO: POST /3/ParseSetup, parms: {source_frames=["upload_8a440dcf457c1e5deacf76a7ac1a4955"], check_header=1, separator=44}

By comparison, the internal backend completes that call and the log files contain:

    10.*.*.15:54444 2421 #0581-148 INFO: totalBytes:  21
    10.*.*.15:54444 2421 #0581-148 INFO: Success.
    10.*.*.15:54444 2421 #0581-149 INFO: POST /3/ParseSetup, parms: {source_frames=["upload_b985730020211f576ef75143ce0e43f2"], check_header=1, separator=44}
    10.*.*.15:54444 2421 #0581-150 INFO: POST /3/Parse, parms: {number_columns=1, source_frames=["upload_b985730020211f576ef75143ce0e43f2"], column_types=["Numeric"], single_quotes=False, parse_type=CSV, destination_frame=Key_Frame__upload_b985730020211f576ef75143ce0e43f2.hex, column_names=["C1"], delete_on_done=True, check_header=1, separator=44, blocking=False, chunk_size=4194304}
...

There is a difference in the write-lock "by job null" that occurs, but it is released, so I suspect it is not the critical issue. I've also curled the /3/ParseSetup endpoint directly on both backends, without success, and I'm reviewing the source code to work out why.
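Roughly, the request I'm trying to reproduce by hand looks like this (sketch only: the host, port and frame key are the shortened values from the log above, and the form-parameter encoding is my guess based on the parms line the server logs):

    # host/port and frame key taken from the (shortened) external-backend log above
    curl -X POST "http://10.*.*.15:56565/3/ParseSetup" \
         --data-urlencode 'source_frames=["upload_8a440dcf457c1e5deacf76a7ac1a4955"]' \
         --data-urlencode 'check_header=1' \
         --data-urlencode 'separator=44'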

Despite the hanging process, I am able to see the uploaded frame with h2o.ls() and to retrieve it with h2o.get_frame(frame_id="myframe_id") on the external backend.
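That is, something along these lines still works while the H2OFrame call is blocked (the frame id is a placeholder, as above):

    # in the client session (or a second one attached to the same external backend)
    print(h2o.ls())                           # the upload_... key is listed
    f = h2o.get_frame(frame_id="myframe_id")  # placeholder id
    print(f)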

I've tried/confirmed the following things:

  • Confirmed that the Sparkling Water version matches the Spark version (i.e. h2o_pysparkling_2.3 for Spark 2.3.x, as stated on docs.h2o.ai; in my case Sparkling Water 2.3.12 with Spark 2.3.0.cloudera2);
  • Downloaded the stable Sparkling Water release to the cluster and ran ./get-extended-h2o.sh cdh5.14, which gave me the h2odriver-sw2.3.0-cdh5.14-extended.jar driver jar;
  • Tried various permutations of parameters for the MapReduce job. Interestingly, our cluster is quite busy and the base-port setting was essential for stability; also, our subnets span switches, which interfered with the multicasting. Ultimately the following invocation brought up the backend without fail:
    hadoop jar h2odriver-sw2.3.0-cdh5.14-extended.jar -Dmapreduce.job.queuename=root.users.myuser -jobname extback -baseport 56565 -nodes 10 -mapperXmx 10g -network 10.*.*.0/24
  • Confirmed that I could query the backend since h2o.ls() works;
  • Uploaded a Spark DataFrame instead of a plain list (same issue):

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

    sdf = session.createDataFrame(
        [('a', 1, 1.0), ('b', 2, 2.0)],
        schema=StructType([StructField("string", StringType()),
                           StructField("int", IntegerType()),
                           StructField("float", FloatType())]))
    hc.as_h2o_frame(sdf)

From a YARN point of view, I attempted client and cluster mode submissions of the simple test app:

spark2-submit --master yarn --deploy-mode cluster --queue root.users.myuser --conf 'spark.ext.h2o.client.port.base=65656' extreboot.py

and without --master yarn and --deploy-mode cluster for the default client mode.
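That is, the client-mode run is simply the same submit without those two flags (a sketch):

    spark2-submit --queue root.users.myuser --conf 'spark.ext.h2o.client.port.base=65656' extreboot.py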

Lastly, the extreboot.py code is:

    from pyspark.conf import SparkConf
    from pyspark.sql import SparkSession
    from pysparkling import *
    import h2o

    conf = SparkConf().setAll([
        ('spark.ext.h2o.client.verbose', True),
        ('spark.ext.h2o.client.log.level', 'DEBUG'),
        ('spark.ext.h2o.node.log.level', 'DEBUG'),
        ('spark.ext.h2o.client.port.base', '56565'),
        ('spark.driver.memory', '8g'),
        ('spark.ext.h2o.backend.cluster.mode', 'external')])

    session = SparkSession.builder.config(conf=conf).getOrCreate() 

    ip_addr='10.10.10.10'  
    port=56565

    conf = H2OConf(session) \
        .set_external_cluster_mode() \
        .use_manual_cluster_start() \
        .set_h2o_cluster(ip_addr, port) \
        .set_cloud_name("extback")
    hc = H2OContext.getOrCreate(session, conf)

    print(h2o.ls())
    h2o.H2OFrame([1,2,3])
    print('DONE')

Does anyone know why it may be hanging (in comparison to the internal backend), what I'm doing wrong, or which steps I can take to better debug this? Thanks!

0_0

1 Answer


I would recommend upgrading to the latest version of Sparkling Water (currently 2.3.26), since you are using 2.3.12 and there have been several fixes for hanging issues since then. Hopefully a quick upgrade fixes your issue.
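For example, if you installed the PyPI package, pinning the newer release is a one-liner (adjust accordingly if you use the zip/Maven distribution instead; the version string is the current release mentioned above):

    pip install --upgrade 'h2o_pysparkling_2.3==2.3.26'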

Lauren