I have a 10025x1417 TFIDF dfm matrix created with quanteda. (The actual class is dfmSparse which is a subclass of dfm-matrix). When I convert to h2o with as.data.frame and then as.h2o, I incorrectly get 10026x1417, with an unwanted extra first row of NaNs. For performance reasons I don't want to create a temporary df with the full dense matrix.

The code is as follows (I was unable to reproduce on small toy data):

library(quanteda)
library(h2o)
h2o.init()  # as.h2o() needs a running H2O cluster
mat <- quanteda::weight(theDfm, type="tfidf")

# Convert to df then h2o, correctly gives 10025x1417 matrix
mat_df  <- as.data.frame(mat) # this will dispatch quanteda::as.data.frame for dfmSparse
mat_h2o <- as.h2o(mat_df)

# Convert in one go, get 10026x1417, get unwanted extra first row of NaNs
bad_h2o <- as.h2o(as.data.frame(mat))
dim(bad_h2o)
[1] 10026  1417

# Which as.data.frame method this uses
> showMethods(quanteda::as.data.frame)
Function: as.data.frame (package base)
x="ANY"
x="dfm"
x="dfmSparse"
    (inherited from: x="dfm")
x="matrix"
    (inherited from: x="ANY")

#########################################
# Ken Benoit requested sessionInfo()

R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] h2o_3.8.3.3         statmod_1.4.22      quanteda_0.9.8      RevoUtilsMath_3.2.3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.2      lattice_0.20-33  SnowballC_0.5.1  bitops_1.0-6     chron_2.3-47     grid_3.2.3       R6_2.1.1        
 [8] jsonlite_0.9.19  magrittr_1.5     httr_1.0.0       stringi_1.0-1    data.table_1.9.6 ca_0.58          Matrix_1.2-3    
[15] tools_3.2.3      stringr_1.0.0    RCurl_1.95-4.7   parallel_3.2.3 
smci
  • `sessionInfo()`? And does the conversion to data.frame work before you apply the `weight()` function? Also, if you want to file an issue with a link to the data so I can reproduce it, I should be able to fix it pretty quickly. – Ken Benoit Aug 16 '16 at 07:31
  • @KenBenoit sessionInfo added. It also gives the unwanted extra row of NaNs before I apply `weight()`, i.e. on just the raw dfm. I haven't been able to create reproducible data, but you should see it too if you try any nontrivial data – smci Aug 16 '16 at 07:55
  • I tried but could not reproduce it. The mat_df looks ok to me, so possibly an error in `as.h2o()`? – Ken Benoit Aug 16 '16 at 08:36

1 Answer

> For performance reasons I don't want to create a temporary df with the full dense matrix.

In fact, quanteda converts your sparse matrix to a dense one before converting it to a data.frame: https://github.com/kbenoit/quanteda/blob/master/R/dfm-classes.R#L513-L516

If you need to import a sparse matrix into h2o, convert it to SVMLight format and use `h2o.importFile()`. See this topic: How to use H2o on feature hashed matrix in R
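
As a rough sketch of the SVMLight route (not taken from the linked topic): `write_svmlight()` below is a hypothetical helper, not a quanteda or h2o function, and it assumes the dfm can be coerced to a numeric `CsparseMatrix` from the Matrix package. It writes one line per row in the form `<label> <col>:<value> ...`, touching only the non-zero entries, so no dense copy is created.

```r
library(Matrix)

# Hypothetical helper (an assumption for illustration, not a library function):
# write a numeric sparse matrix to a file in SVMLight format.
write_svmlight <- function(x, y = rep(0, nrow(x)), file) {
  stopifnot(length(y) == nrow(x))
  # Triplet view of the non-zero entries: 1-based row i, column j, value x
  s <- summary(as(x, "CsparseMatrix"))
  # Sort by row, then column, so feature indices increase within each line
  ord <- order(s$i, s$j)
  tokens <- paste0(s$j[ord], ":", s$x[ord])
  # Group the "col:value" tokens by row; rows with no entries stay empty
  by_row <- split(tokens, factor(s$i[ord], levels = seq_len(nrow(x))))
  feats <- vapply(by_row, paste, character(1), collapse = " ")
  writeLines(trimws(paste(y, feats)), file)
  invisible(file)
}

# Usage sketch (not run here; needs a running H2O cluster, and assumes
# H2O recognises the file as SVMLight when parsing it):
# tmp <- tempfile(fileext = ".svm")
# write_svmlight(mat, file = tmp)
# mat_h2o <- h2o.importFile(tmp)
```

The label column `y` defaults to zeros because SVMLight requires a label on every line; pass the real response vector there if you have one.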

Dmitriy Selivanov
  • That's correct, although it's not specific to **quanteda**: any data.frame is dense. This is a nice solution to avoid the coercion to a dense object. – Ken Benoit Aug 16 '16 at 15:53
  • I agree with you, it makes no sense to convert a sparse matrix to a dense data.frame. I was just pointing it out. We can convert a sparse matrix in triplet form to a data.frame, but that is another story. – Dmitriy Selivanov Aug 16 '16 at 16:23