
I'm having the following issue with Sparkling Water version 2.2.9. My Hadoop cluster is running CDH 5.13. Per the H2O documentation, the H2O/Sparkling Water cluster should have roughly 4x as much memory as the data it holds.

I can import a 750 GB CSV file onto a Sparkling Water cluster with 4 TB of memory (40 executors, 100 GB each). However, I'm having problems loading a larger file: a CSV roughly 2.2 TB in size (I also have it in Parquet/Snappy format, at 550 GB). I created a Sparkling Water cluster with 100 executors at 100 GB per executor. The "parsing" step runs to about 60-70% and then the containers start failing with exit codes 143 and 255. I have bumped the memory up to about 12 TB, still with no success.

The Python code is:

import h2o

# Connect to the already-running H2O/Sparkling Water cluster, then parse the CSV directory from HDFS
h2o.init(ip='hdchdp01v03', port=9500, strict_version_check=False)
ls_hdfs = "hdfs://HDCHDP01ns/h2o_test/csv_20171004"
print("Reading files from ", ls_hdfs)
sum_df = h2o.import_file(path=ls_hdfs, destination_frame="sum_df")

Has anyone run into similar issues? My Hadoop cluster has only 20 TB of memory in total, so hogging 12 TB for this would be a stretch most of the time.

With my first file, once the data was imported into the cluster it seemed to take roughly double the file size in memory, but I'm not sure how to recover the rest of the 4x memory I allocated until the Sparkling Water cluster comes down.

So, are there any other workarounds for loading this data into H2O for analysis while being sensible about the available cluster memory?

Shankar

  • Curious if you were able to solve the problem? I am also having issues loading even smaller files into Sparkling Water. – Alex Popov Jun 14 '18 at 16:34
  • I didn't have a problem loading up to 100 GB worth of files into a Sparkling Water DataFrame. One thing I noticed is that H2O handles CSV much better than Parquet; for Parquet files, the memory requirement of the H2O cluster goes up dramatically. – VShankar Jun 18 '18 at 14:11
  • For loading large Parquet files, I finally settled on reading them into a Spark DataFrame and then converting that into an H2O frame, as in the snippet below. The more memory I gave it, the less time it took to load. – VShankar Jun 18 '18 at 15:06
  • import h2oContext.implicits._
    val ls_file1 = "hdfs://"
    val spark_DF = spark.read.parquet(ls_file1)
    val hf: H2OFrame = spark_DF
    – VShankar Jun 18 '18 at 15:06
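
For readers using the Python API, a rough PySparkling equivalent of the Scala snippet above might look like the following. This is only a sketch based on the PySparkling 2.2.x API (the as_h2o_frame call and the existing SparkSession named spark are assumptions, and the truncated hdfs:// path is kept as-is from the comment):

from pysparkling import H2OContext

# Attach H2O to the already-running Spark cluster (assumes a SparkSession named `spark`).
hc = H2OContext.getOrCreate(spark)

# Let Spark parse the Parquet files, then hand the DataFrame over to H2O.
ls_file1 = "hdfs://"  # truncated path kept from the comment above
spark_DF = spark.read.parquet(ls_file1)
hf = hc.as_h2o_frame(spark_DF)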
