I am reading an H2OFrame from a CSV file:

val h2oFrame = new H2OFrame(new File(inputCsvFilePath))

How can I perform the equivalent of a .filter() operation (as available for a Spark DataFrame or RDD)? For example, how do I get a new H2OFrame containing only the rows where "label" (which is a column name) is > 1?

I have tried converting to an org.apache.spark.sql.DataFrame, as below (simplified example):

val df = asDataFrame(h2oFrame)
val dff = df.filter(s"label > 1")
print(dff.toString(0,15))

But this seems to throw an OutOfMemoryError like the one below:

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Executor task launch worker-2"

S.P.
  • Okay, looks like the `OutOfMemoryError` can be solved by increasing `-XX:MaxPermSize=92m` to something higher. Would still like an answer to the original question on how to do it directly on `H2OFrame`. – S.P. May 17 '16 at 22:07

1 Answer

I would recommend doing this the way you already are: via Spark. From the FAQ:

How do I filter an H2OFrame using Sparkling Water?

Filtering columns is easy: just remove the unnecessary columns or create a new H2OFrame from the columns you want to include (Frame(String[] names, Vec[] vec)), then make the H2OFrame wrapper around it (new H2OFrame(frame)).

Filtering rows is a little bit harder. There are two ways:

Create an additional binary vector holding 1/0 for the in/out sample (make sure to take this additional vector into account in your computations). This solution is quite cheap, since you do not duplicate data - just create a simple vector in a data walk.

or

Create a new frame with the filtered rows. This is a harder task, since you have to copy data. For reference, look at the #deepSlice call on Frame (H2OFrame).
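
For the Spark route recommended at the top of this answer, a minimal sketch of the full round trip (H2OFrame to DataFrame, filter, back to H2OFrame) could look like the following. It assumes a running Sparkling Water session in which an H2OContext and a SQLContext are already available as implicit values. The helper name filterRows is hypothetical and not part of Sparkling Water, and the exact signatures of asDataFrame and asH2OFrame may differ between Sparkling Water versions.

```scala
import org.apache.spark.h2o._
import org.apache.spark.sql.SQLContext

// Hypothetical helper (not part of Sparkling Water): filters the rows of an
// H2OFrame by going through a Spark DataFrame and converting the result back.
def filterRows(frame: H2OFrame, condition: String)
              (implicit h2oContext: H2OContext, sqlContext: SQLContext): H2OFrame = {
  // Assumption: asDataFrame picks up the implicit SQLContext.
  val df = h2oContext.asDataFrame(frame)
  // Same kind of SQL-style predicate as in the question, e.g. "label > 1".
  val filtered = df.filter(condition)
  // Convert the filtered DataFrame back into an H2OFrame.
  h2oContext.asH2OFrame(filtered)
}

// Usage, with h2oFrame as defined in the question and both contexts
// in implicit scope:
//   val labelAboveOne = filterRows(h2oFrame, "label > 1")
```

Note that this copies the data twice (once into Spark, once back into H2O), so it is closer to the second FAQ option than to the cheaper binary-vector approach.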

Mateusz Dymczyk