I am having the following issue with Sparkling Water version 2.2.9. My Hadoop cluster runs CDH 5.13. Per the H2O documentation, the H2O/Sparkling Water cluster should have roughly 4x as much memory as the data size.
I can import a 750 GB CSV file onto a Sparkling Water cluster with 4 TB of memory (40 executors, 100 GB each). However, I am having problems loading a larger file: a CSV of roughly 2.2 TB (I also have it in Parquet/Snappy format, at 550 GB). For this one I created a Sparkling Water cluster with 100 executors at 100 GB per executor. The parsing step runs to about 60-70% and then the containers start failing with error codes 143 and 255. I have bumped the memory up to about 12 TB, still with no success.
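For reference, this is roughly how the cluster is sized; I normally pass the resource settings as spark-submit flags, so the inline config keys below are just a sketch of the same setup:

from pyspark.sql import SparkSession
from pysparkling import H2OContext

# Executor sizing as described above: 100 executors x 100 GB each
# (normally supplied as spark-submit flags; shown inline for clarity)
spark = SparkSession.builder \
    .appName("sparkling-water-import") \
    .config("spark.executor.instances", "100") \
    .config("spark.executor.memory", "100g") \
    .getOrCreate()

# Start the H2O cluster inside the Spark executors
hc = H2OContext.getOrCreate(spark)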
The Python code is:
import h2o

# Connect to the running Sparkling Water / H2O cluster
h2o.init(ip='hdchdp01v03', port=9500, strict_version_check=False)

ls_hdfs = "hdfs://HDCHDP01ns/h2o_test/csv_20171004"
print("Reading files from", ls_hdfs)

# Import the CSV from HDFS and parse it into an H2O frame
sum_df = h2o.import_file(path=ls_hdfs, destination_frame="sum_df")
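Before kicking off the big import I also print what the cluster actually reports per node, just as a sanity check:

# Quick sanity check: per-node memory and health before the large parse
h2o.cluster_status()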
Has anyone run into similar issues? My Hadoop cluster has only 20 TB of memory in total, so hogging 12 TB of it is already a stretch most of the time.
With my first file, I saw that once the data was imported into the cluster, it took roughly double the file size in memory, but I am not sure how to recover the 4x memory I allocated short of bringing the Sparkling Water cluster down.
So, are there any workarounds that would let me load this data into H2O for analysis while being reasonable about the available cluster memory?
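If I understand the API correctly, deleting a frame should at least free its share of memory for other work inside the running cluster, even though the JVM heap itself stays allocated to the YARN containers:

# Deleting a frame frees its memory for other H2O jobs, but the
# JVM heap remains allocated to the containers until shutdown
h2o.remove(sum_df)    # or h2o.remove("sum_df") by frame id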
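One thing I am considering is pointing import_file at the Parquet/Snappy copy mentioned above instead of the CSV (the path below is made up, and I realize the data may still expand to the same size once parsed into memory):

# Hypothetical HDFS path for the Parquet/Snappy copy of the same data
ls_parquet = "hdfs://HDCHDP01ns/h2o_test/parquet_20171004"
sum_df = h2o.import_file(path=ls_parquet, destination_frame="sum_df")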
Shankar