If this is creating a bottleneck, use a MOJO (or POJO) model for row-wise scoring instead of a model loaded into memory in the H2O cluster. Fast scoring is exactly what the MOJO/POJO model format is designed for: it avoids converting between an R data.frame and an H2OFrame, and it does not require a running H2O cluster. You can skip R altogether here and score directly from Java.
Alternatively, if your pipeline requires R, you can still use the MOJO/POJO model from R via the h2o.predict_json() function; it just requires converting your 1-row data.frame to a JSON string. That should alleviate the bottleneck somewhat, though scoring the MOJO/POJO directly from Java (as mentioned above) will still be the fastest.
Here's an example of what this looks like using a GBM MOJO file:
library(h2o)
# path to the MOJO zip file exported from a trained GBM model
model_path <- "~/GBM_model_python_1473313897851_6.zip"
# a single input row encoded as a JSON string
json <- '{"V1":1, "V2":3.0, "V3":0}'
# score the row against the MOJO file directly (no H2OFrame conversion needed)
pred <- h2o.predict_json(model = model_path, json = json)
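If you don't already have a MOJO file on disk, you can export one from a trained model in your current H2O session using h2o.download_mojo(). A quick sketch, where my_gbm stands in for whatever trained model object you have:
# `my_gbm` is a placeholder for a trained H2O model in the current session
h2o.download_mojo(my_gbm, path = "~")  # writes the MOJO zip (named after the model id) into ~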
Here's how to construct the JSON string from a 1-row data.frame:
df <- data.frame(V1 = 1, V2 = 3.0, V3 = 0)
# build a "name":value pair for each column of the first row
# (this simple version assumes numeric columns; character/factor values would need quoting)
dfstr <- sapply(seq_along(df), function(i) paste0('"', names(df)[i], '":', df[1, i]))
json <- paste0('{', paste0(dfstr, collapse = ','), '}')
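To make this reusable, you could wrap the JSON conversion and the scoring call into a small helper. Here's a minimal sketch under the same assumptions as above (numeric columns only; predict_row is just an illustrative name, not part of the h2o package):
# illustrative helper: convert a 1-row data.frame of numeric columns to a JSON
# string and score it against a MOJO/POJO file via h2o.predict_json()
predict_row <- function(model_path, row) {
  pairs <- sapply(seq_along(row), function(i) paste0('"', names(row)[i], '":', row[1, i]))
  json <- paste0('{', paste0(pairs, collapse = ','), '}')
  h2o.predict_json(model = model_path, json = json)
}

pred <- predict_row("~/GBM_model_python_1473313897851_6.zip",
                    data.frame(V1 = 1, V2 = 3.0, V3 = 0))
As noted above, this still won't be as fast as scoring the MOJO/POJO directly from Java, but it keeps everything inside your R pipeline.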