
I am trying to import my .csv file of 3,000 observations and 77 features as an H2O frame (while in a Spark session):

(1st way)

# Import the .csv file directly as an H2O frame
import h2o
h2o.init()
data_train = h2o.import_file('/u/users/vn505f6/data.csv')

However, I am getting the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 102, in __init__
    column_names, column_types, na_strings)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 143, in _upload_python_object
    self._upload_parse(tmp_path, destination_frame, 1, separator, column_names, column_types, na_strings)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 319, in _upload_parse
    self._parse(rawkey, destination_frame, header, sep, column_names, column_types, na_strings)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 326, in _parse
    return self._parse_raw(setup)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 355, in _parse_raw
    self._ex._cache.fill()
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/expr.py", line 346, in fill
    res = h2o.api("GET " + endpoint % self._id, data=req_params)["frames"][0]
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 103, in api
    return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/backend/connection.py", line 402, in request
    return self._process_response(resp, save_to)
  File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/backend/connection.py", line 725, in _process_response
    raise H2OResponseError(data)
h2o.exceptions.H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException:
  Error: Unknown parameter: full_column_count
  Request: GET /3/Frames/Key_Frame__upload_84df978b98892632a7ce19303c4440f3.hex
    params: {u'row_offset': '0', u'row_count': '10', u'full_column_count': '-1', u'column_count': '-1', u'column_offset': '0'}

Note that when I do this on my local machine, I get no error. I get the error above only when I do the same thing on a Spark/Hadoop cluster.

Alternatively, I tried the following on the Spark cluster:

(2nd way)

from pysparkling import H2OContext
from ssat_utils.spark import SparkUtilities
import h2o

h2o_context = H2OContext.getOrCreate(SparkUtilities.spark)
data_train = h2o.import_file('/u/users/vn505f6/data.csv')

and then I got the following error:

Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 414, in import_file
   return H2OFrame()._import_parse(path, pattern, destination_frame, header, sep, col_names, col_types, na_strings)
 File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 311, in _import_parse
   rawkey = h2o.lazy_import(path, pattern)
 File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 282, in lazy_import
   return _import(path, pattern)
 File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 291, in _import
   if j["fails"]: raise ValueError("ImportFiles of " + path + " failed on " + str(j["fails"]))
ValueError: ImportFiles of /u/users/vn505f6/data.csv failed on [u'/u/users/vn505f6/data.csv']

The column names are plain strings such as u_cnt_days_with_sale_14day.

What is this error about and how can I fix this?

P.S.

These are the command-line commands that create the Spark cluster/context:

SPARK_HOME=/u/users/******/spark-2.3.0 \
Q_CORE_LOC=/u/users/******/q-core \
ENV=local \
HIVE_HOME=/usr/hdp/current/hive-client \
SPARK2_HOME=/u/users/******/spark-2.3.0 \
HADOOP_CONF_DIR=/etc/hadoop/conf \
HIVE_CONF_DIR=/etc/hive/conf \
HDFS_PREFIX=hdfs:// \
PYTHONPATH=/u/users/******/q-core/python-lib:/u/users/******/three-queues/python-lib:/u/users/******/pyenv/prod_python_libs/lib/python2.7/site-packages/:$PYTHON_PATH \
YARN_HOME=/usr/hdp/current/hadoop-yarn-client \
SPARK_DIST_CLASSPATH=$(hadoop classpath):$(yarn classpath):/etc/hive/conf/hive-site.xml \
PYSPARK_PYTHON=/usr/bin/python2.7 \
QQQ_LOC=/u/users/******/three-queues \
/u/users/******/spark-2.3.0/bin/pyspark \
--master yarn \
--executor-memory 10g \
--num-executors 128 \
--executor-cores 10 \
--conf spark.port.maxRetries=80 \
--conf spark.dynamicAllocation.enabled=False \
--conf spark.default.parallelism=6000 \
--conf spark.sql.shuffle.partitions=6000 \
--principal ************************ \
--queue default \
--name interactive_H2O_MT \
--keytab /u/users/******/.******.keytab \
--driver-memory 10g
  • Are you sure `data_train` is a pandas dataframe? – IMCoins Sep 25 '18 at 11:18
  • @IMCoins I think so since I get `` for `type(data_train)` – Outcast Sep 25 '18 at 11:21
  • Can you post the full error stack trace, and `data_train.head()`? – IMCoins Sep 25 '18 at 11:37
  • @IMCoins I edited my post for what you said. – Outcast Sep 25 '18 at 11:52
  • @PoeteMaudit can you provide a fully reproducible code snippet that we can test? This would include how you started your h2o cluster, and what the original data_train looks like. – Lauren Sep 25 '18 at 14:35
  • @Lauren, yes the source code is above; that's all. Moreover, what I said about the data is sufficient - there is nothing extravagant about the data, and this is why it loads with no problem on my local machine with either scikit-learn or h2o. – Outcast Sep 25 '18 at 14:53
  • @Lauren, any ideas? I have posted an answer below but I would like to hear some explanations. – Outcast Sep 25 '18 at 15:57
  • Several notes: it would really help to see a full reproducible example. The code above is not one, as we don't see how you're starting H2O & Spark. Can you please share the command-line commands? The second issue is probably because you are using import_file. The import_file method expects the file to already be distributed on all H2O nodes in the cluster, or stored on some distributed storage such as HDFS. You should use h2o.upload_file instead if you have the data locally (a minimal sketch follows these comments). The first issue is probably because you are using inconsistent versions of H2O & Sparkling Water. The CLI commands would help. – Jakub Háva Sep 26 '18 at 08:21
  • @JakubHáva As for your first comment, I think there is too much sensitive information in the command-line commands that create the Spark cluster/context. The H2O context is created in the way I showed above. As for your second comment, that may actually be it, since the .csv files were not stored on HDFS but simply on an edge node of the cluster (copied with an `scp` command from my virtual machine to the edge node). – Outcast Sep 26 '18 at 09:47
  • In order to debug the first issue, can you please make sure your cli does not contain sensitive information and share it? We need that to fully reproduce the issue, otherwise there's not much we can do. Thank you for your understanding – Jakub Háva Sep 26 '18 at 09:56
  • @JakubHáva Ok, I did that above. If any sensitive information is still shown, then please feel free to edit my post and replace it with asterisks. – Outcast Sep 26 '18 at 10:02
  • I don't see PySparkling in the Python path. But I do see that you are putting installed Python packages on the Python path as `/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/`. What PySparkling version do you have installed in the Python 2.7 env? – Jakub Háva Sep 26 '18 at 10:28
  • @JakubHáva It's PySparkling version 2.3.0. – Outcast Sep 26 '18 at 11:41
  • I would suggest properly specifying the PySparkling dependency, e.g. downloading the official distribution from the download page and adding the PySparkling zip to the Python path. – Jakub Háva Sep 27 '18 at 08:56
  • @JakubHáva Ok cool I may do it in the future but for now I want to focus on other things since I managed to import it. Thank you for your help so far :). – Outcast Sep 28 '18 at 09:32
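
Following up on Jakub Háva's comments above, here is a minimal sketch of the upload route, assuming the same edge-node path as in the question. Unlike h2o.import_file, which asks the H2O nodes themselves to read the path (and therefore needs HDFS or a path visible to every node), h2o.upload_file streams the file from the Python client, so the file only needs to exist where the client runs:

from pysparkling import H2OContext
from ssat_utils.spark import SparkUtilities
import h2o

h2o_context = H2OContext.getOrCreate(SparkUtilities.spark)

# Push the local .csv from the client to the H2O backend; the H2O
# nodes never need direct access to this edge-node path
data_train = h2o.upload_file('/u/users/vn505f6/data.csv')

As for the first error (Unknown parameter: full_column_count), a quick way to check for the client/server version mismatch mentioned above (h2o.__version__ and h2o.cluster().version are standard h2o-py accessors):

import h2o
h2o.init()

# The h2o-py client version and the backend (Sparkling Water) version
# should match; a mismatch can surface as unknown-parameter errors like
# the one in the first traceback
print(h2o.__version__)
print(h2o.cluster().version)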

1 Answer


What I finally did was to import the .csv file as a pandas dataframe first and then convert it to an H2O frame:

from pysparkling import H2OContext
from ssat_utils.spark import SparkUtilities
import h2o
import pandas as pd

h2o_context = H2OContext.getOrCreate(SparkUtilities.spark)

# Read the .csv locally with pandas, then push it to the H2O cluster
data_train = pd.read_csv('/u/users/vn505f6/data.csv')
data_train = h2o.H2OFrame(data_train)

I do not really know why this worked while directly importing the .csv file as an H2O frame in the two ways shown above in my post did not.
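
One plausible explanation, inferred from the first traceback in the question (note the _upload_python_object and _upload_parse frames): h2o.H2OFrame(pandas_df) uploads the data from the Python client to the H2O backend, much like h2o.upload_file, whereas h2o.import_file expects the H2O nodes to read the path themselves, which fails when the file lives only on an edge node. As a quick sanity check on the resulting frame (dim and names are standard H2OFrame accessors; 3000 x 77 are the figures from the question):

# Expect [3000, 77], the shape described in the question
print(data_train.dim)

# Column names should match the .csv header,
# e.g. u_cnt_days_with_sale_14day
print(data_train.names[:5])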

  • That's strange, it was already supposed to be a pandas dataframe in your question. :/ – IMCoins Sep 25 '18 at 16:08
  • @IMCoins Hm yes, I modified my question quite a bit. But I was doing this pandas conversion only when using `h2o.init()`, not with the second way shown above. – Outcast Sep 25 '18 at 16:13