
At present we are using text2vec to process a large dataset on a single AWS EC2 instance. The text data will keep growing, so we may try an RHadoop (MapReduce) architecture, but we don't know whether text2vec is compatible with RHadoop (MapReduce).

Zheng Lu
  • The question is not at all clear. What kind of tasks do you perform with text2vec? What do you want to achieve with RHadoop? – Dmitriy Selivanov Aug 13 '17 at 11:32
  • I am using text2vec + xgboost for text classification. The model works very well with up to 10 million lines of text data; beyond that, memory overflows on a single EC2 instance (32 GB RAM). So I wonder whether this can be solved by combining with RHadoop. If you have better advice, please let me know. Thank you very much! – Zheng Lu Aug 13 '17 at 14:07
  • memory overflow at which stage? `create_dtm`? – Dmitriy Selivanov Aug 13 '17 at 18:12
  • Yes, the memory overflow happens in `create_dtm`: `dtm_t1 <- create_dtm(it_train, vectorizer)` gives `Error in asMethod(object) : Cholmod error 'out of memory' at file ../Core/cholmod_memory.c, line 147` and `Error in coerce_matrix(dtm, type) : cannot coerce input to dgCMatrix` – Zheng Lu Aug 14 '17 at 03:16

1 Answer


The short answer is yes - if you really want to, you can make almost anything work with RHadoop. But I'm pretty sure the effort will be substantial, and you probably won't be satisfied with the results.

Coming back to the real problem: it is worth trying text2vec version 0.5 (released last week) - it consumes even less RAM than before. You can also easily process the data in chunks and in parallel; see this vignette for an example.
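For illustration, here is a minimal sketch of chunked DTM creation, assuming the text2vec 0.5 API; the input column, chunk count, and pruning threshold are placeholders:

```r
library(text2vec)

# train$text is a placeholder character vector of raw documents.
# n_chunks splits the stream so create_dtm assembles the sparse DTM
# chunk by chunk instead of materialising everything at once.
# (itoken_parallel() offers a similar interface for parallel processing.)
it_train <- itoken(train$text,
                   preprocessor = tolower,
                   tokenizer = word_tokenizer,
                   n_chunks = 10)

vocab <- create_vocabulary(it_train)
# Pruning rare terms keeps the DTM (and RAM usage) much smaller.
vocab <- prune_vocabulary(vocab, term_count_min = 5)

vectorizer <- vocab_vectorizer(vocab)
dtm_train <- create_dtm(it_train, vectorizer)
```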

Another thing: for basic tasks like classification you usually don't need all the data in RAM. You can check, for example, another of my packages - FTRL - which fits logistic regression (with L1/L2/elastic-net penalty) incrementally with SGD.
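A minimal out-of-core sketch, assuming the R6-style interface of the FTRL package (parameter names and the chunk objects are illustrative; check the package README for the actual API):

```r
library(FTRL)

# Hypothetical setup: dtm_chunks is a list of sparse dgCMatrix chunks
# with matching 0/1 label vectors in y_chunks.
model <- FTRL$new(learning_rate = 0.1, lambda = 1e-4, l1_ratio = 1)

# partial_fit updates the model one chunk at a time, so the full
# dataset never has to be in RAM simultaneously.
for (i in seq_along(dtm_chunks)) {
  model$partial_fit(dtm_chunks[[i]], y_chunks[[i]])
}

preds <- model$predict(dtm_new)  # dtm_new: a DTM built with the same vectorizer
```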

It would be great if you could file a report on GitHub about the memory problem (which actually comes from the Matrix package).

PS: tree methods and ensembles are usually not a good fit for sparse high-dimensional data.

Dmitriy Selivanov
  • Thank you very much. This advice is invaluable to me; I may not try RHadoop after all and will instead try the other approaches you suggested. – Zheng Lu Aug 15 '17 at 09:47