How is it possible that storing data into an H2O frame is slower than storing it into a data.table?

# Packages used: "h2o" and "data.table"
library(h2o)
library(data.table)

# Start (or connect to) the H2O cluster before creating any H2O frame
h2o.init(nthreads=-1)

# Create the matrices
matrix1<-data.table(matrix(rnorm(1000*1000),ncol=1000,nrow=1000))
matrix2<-h2o.createFrame(rows=1000,cols=1000)

# data.table store
for(i in 1:1000){
  matrix1[i,1]<-3
}
# H2O frame store
for(i in 1:1000){
  matrix2[i,1]<-3
}

Thanks!

Dave2e
Jesus

2 Answers

H2O is a client/server architecture. (See http://docs.h2o.ai/h2o/latest-stable/h2o-docs/architecture.html)

So what you've shown is a very inefficient way to populate an H2O frame in H2O memory. Every write turns into a network call. You almost certainly don't want this.

For your example, since the data isn't large, a reasonable thing to do would be to make the initial assignments to a local data frame (or data.table) and then use the push method, as.h2o().

h2o_frame = as.h2o(matrix1)
head(h2o_frame)

This pushes an R data frame from the R client into an H2O frame in H2O server memory. (You can use as.data.table() to go in the opposite direction.)


data.table Tips:

For data.table, prefer the in-place := syntax. This avoids copies. So, for example:

matrix1[i, 3 := 42]
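As a minimal sketch of the in-place update (using a small made-up table called dt rather than the 1000 x 1000 one from the question), the loop below writes one cell per iteration with :=. The address() helper from data.table is used here only to check that the table object is not reallocated along the way:

```r
library(data.table)

# Small illustrative table; data.table() names the matrix columns V1, V2, V3
dt <- data.table(matrix(0, nrow = 5, ncol = 3))

addr_before <- address(dt)

# Update column V1 one row at a time, by reference
for (i in seq_len(nrow(dt))) {
  dt[i, V1 := 3]
}

addr_after <- address(dt)

print(dt$V1)
print(identical(addr_before, addr_after))
```

The loop is kept only to mirror the question; the point is that := modifies the existing table rather than building a new one.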

H2O Tips:

The fastest way to read data into H2O is by ingesting it using the pull method in h2o.importFile(). This is parallel and distributed.
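A rough sketch of the pull method, assuming a local single-node cluster started by h2o.init() so that the H2O server can read a file written by the R session. The table local_dt, the temporary CSV path, and the small sizes are made up for illustration:

```r
library(h2o)
library(data.table)

# Assumption: this starts (or connects to) an H2O server on the same machine
h2o.init(nthreads = -1)

# Do all the cell-level work locally, where it is cheap
local_dt <- data.table(matrix(rnorm(100 * 10), nrow = 100, ncol = 10))
local_dt[, V1 := 3]

# Hypothetical location for this demo; any path the server can read would do
csv_path <- tempfile(fileext = ".csv")
fwrite(local_dt, csv_path)

# The H2O server reads and parses the file itself, in a single request
h2o_frame <- h2o.importFile(path = csv_path, header = TRUE)
print(dim(h2o_frame))
```

On a multi-node cluster the path would need to be visible to every node, which a temporary local file is not.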

The as.h2o() trick shown above works well for small datasets that easily fit in the memory of one host.

If you want to watch the network messages between R and H2O, call h2o.startLogging().

TomKraljevic

I can't fully answer your question because I don't know H2O. However, I can make a guess.

Your code to fill the data.table is slow because of R's "copy-on-modify" semantics. If you update your table by reference, you will speed up your code dramatically.

# Your loop: each assignment copies the table
for(i in 1:1000){
  matrix1[i,1]<-3
}

# Same update by reference, using set()
for(i in 1:1000){
  set(matrix1, i, 1L, 3)
}

With set(), my loop takes 3 milliseconds, while your loop takes 18 seconds (6000 times longer).

I suppose H2O works the same way, with some extra work on top because this is a special object. Maybe some message passing to the H2O cluster?
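To reproduce this kind of comparison yourself, here is a minimal timing sketch. It assumes a smaller 200 x 200 table (so the slow loop finishes quickly) and uses the made-up names dt_copy and dt_ref; the absolute timings will differ from the figures quoted here and depend on the machine:

```r
library(data.table)

n <- 200  # smaller than the question's 1000, to keep the slow loop short

dt_copy <- data.table(matrix(rnorm(n * n), nrow = n, ncol = n))
dt_ref  <- copy(dt_copy)  # independent copy, so both loops start from the same data

# Assignment with <- on a data.table
time_copy <- system.time(
  for (i in seq_len(n)) dt_copy[i, 1] <- 3
)

# Update by reference with set()
time_ref <- system.time(
  for (i in seq_len(n)) set(dt_ref, i, 1L, 3)
)

print(time_copy[["elapsed"]])
print(time_ref[["elapsed"]])

# Both loops should leave the same values in the first column
print(identical(dt_copy$V1, dt_ref$V1))
```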

JRR
  • And how can I use set(matrix1,i,1L,3) if I want to fill the whole column instead of a single row [i]? – Jesus Aug 21 '17 at 07:14
  • Double loop? Well, actually, if you want to fill your table, you are better off filling it at creation rather than in a loop afterwards. That is not the R way. – JRR Aug 21 '17 at 11:00
  • Imagine I want to assign a value from another df to a column of a matrix. What is the best way to do it? – Jesus Aug 21 '17 at 12:23
  • `matrix[,i] = df$mycol` to copy the entire column `mycol` into the matrix column `i`, or `matrix[,i] = 3` for a single number. – JRR Aug 21 '17 at 15:37
  • And is this not an inefficient way? I thought it was a bad way to do it... Is it not possible to do it with something like :=, or using h2o? Thanks! – Jesus Aug 21 '17 at 20:00
  • OK, my mistake, I was inaccurate. It is inefficient, but only once. Here I replace 1000 values in a single command. The 1000x1000 matrix is copied once for 1000 numbers updated. In your code the 1000x1000 matrix is copied 1000 times for 1000 updates. So yes, it is inefficient because of the copy, but 1000 times less inefficient... This is the so-called "copy-on-modify" semantics. I can't explain much more in a comment. I refer you to Hadley Wickham's book: http://adv-r.had.co.nz/memory.html, http://adv-r.had.co.nz/Performance.html – JRR Aug 21 '17 at 20:18
  • I have created another question about it: https://stackoverflow.com/questions/45814469/best-way-to-store-distances-with-h2o. Thanks for the link, very interesting – Jesus Aug 22 '17 at 09:59
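For the whole-column case raised in the comments, a minimal data.table sketch. The table dt and the data frame src (with its column mycol) are made up for illustration; both assignments update the table by reference instead of going through `<-`:

```r
library(data.table)

dt  <- data.table(matrix(0, nrow = 4, ncol = 3))   # columns V1, V2, V3
src <- data.frame(mycol = c(10, 20, 30, 40))       # hypothetical source data frame

# Fill a whole column with a single value, by reference
dt[, V1 := 3]

# Copy a column from another data frame into column 2, by reference
set(dt, j = 2L, value = src$mycol)

print(dt)
```

Either form handles the column in one call, so no loop over rows is needed.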