I have this scenario:

I have a script, test.R, for testing rmr2. In this script, the map function uses an R library that is installed only on the client node, not on the cluster nodes.

Obviously, the job fails in each map task because it can't load the library.

My questions are:

  • How can I use this library without installing it on every node with administrator privileges?

  • How can I attach, send, or share it? I don't want to install every library on every node each time I use a new one.

  • Is it possible?

I can't find any parameter similar to --jars in Hadoop, or --py-files in Spark with Python.

Here is a trivial wordcount example that uses the "tm" library (its "stopwords" function), which is installed on the client but not on the cluster nodes.

Sys.setenv(HADOOP_CMD="/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/hadoop/bin/hadoop")
Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop/conf")
Sys.setenv(HADOOP_STREAMING="/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.4.2.jar")

library(rmr2)
library(tm)

map <- function(k,lines) {
  x = stopwords("en")
  y = paste(x, collapse=" ")
  words.list <- strsplit(paste(lines,y,sep=" "), '\\s')
  words <- unlist(words.list)
  sessionInfo()   # evaluated but its result is discarded; presumably a debugging leftover
  return( keyval(words, 1) )
}

reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

wordcount <- function (input, output=NULL) {
  mapreduce(input=input, output=output, input.format="text", map=map, reduce=reduce)
}

## read text files from folder wordcount/data
## save result in folder wordcount/out

## Submit job
hdfs.root <- '/user/ec2-user'
hdfs.data <- file.path(hdfs.root, 'input')
hdfs.out <- file.path(hdfs.root, 'output_rmr2_external_test')
out <- wordcount(hdfs.data, hdfs.out)

To execute it from the client node:

Rscript test.R

Edit:

Comparing this question with Temporarily installing R packages on Hadoop nodes for streaming jobs

1) This question asks the same thing as the other one, but I think this one is more complete because it includes a concrete example and scenario, so it is clearer.

2) The first answer is yours, piccolbo, but it is from 2012 and we are in 2015, so I suspect it is outdated. The rest of the answers are helpful, but not completely. If you pass the external R library compressed, you must unzip it on each node and add its path to R's .libPaths(). Is that right? I don't know if there is a parameter for that (see the sketch after this list for what I mean).

3) I would like to know whether it is possible, and whether there is an easy way to do it.
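
To be concrete about 2), this is the kind of thing I imagine doing inside the map function. It is only a sketch: the tarball name, how it reaches the task's working directory, and the library path are placeholders, and the tarball would have to contain the already-installed tm directory (plus its dependencies such as NLP and slam) built for the same R version and architecture as the workers.

map <- function(k, lines) {
  ## Sketch only: assumes tm.tar.gz (a tarball of the installed library
  ## folders, made on the client) is somehow present in the task's working directory.
  lib.dir <- file.path(tempdir(), "Rlibs")
  dir.create(lib.dir, showWarnings = FALSE, recursive = TRUE)
  untar("tm.tar.gz", exdir = lib.dir)   # unpack the shipped library
  .libPaths(c(lib.dir, .libPaths()))    # make it visible to this R session
  library(tm)                           # now loads on the worker

  x <- stopwords("en")
  y <- paste(x, collapse = " ")
  words <- unlist(strsplit(paste(lines, y, sep = " "), "\\s"))
  keyval(words, 1)
}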

Thank you

  • Possible duplicate of [Temporarily installing R packages on Hadoop nodes for streaming jobs](http://stackoverflow.com/questions/11143406/temporarily-installing-r-packages-on-hadoop-nodes-for-streaming-jobs) – piccolbo Nov 10 '15 at 18:43
  • As you said, this question is a duplicate of an existing one. You need to focus on 1) How your question is different from that one 2) How the answers provided there are not satisfactory. – piccolbo Nov 10 '15 at 18:45
  • I have done this: put the library tm.tar.gz on HDFS (outside the map function), fetch it on every node (in the map function), install it in a custom path (in the map function), and load it (in the map function). But I am getting this error: "error ignoring SIGPIPE signal". I think it's an inefficient solution... I can't believe there is no easy way to pass an R library. – user2558672 Nov 11 '15 at 16:26
  • I'm with user2558672. I don't really understand the answer you (piccolbo) gave in your post from 2012. At the time you mentioned a script in development, with some parts hardcoded, and I can't see any further change. Does that mean the idea is obsolete? I can't believe the only (easy) way to make this work on a 1K-node cluster is deploying every library on every node, each time a developer wants to use a new library... What am I misunderstanding? – Cheloute Nov 11 '15 at 20:15
  • @Cheloute yes the idea was abandoned as too brittle -- too dependent on assumptions about the target systems, sysadm policies etc. As far as your disbelief wrt the real world, I am sorry but it's a java world. Does your package fit in a jar -- no problem. R packages don't. There are some distro-specific answers such as Cloudera parcels. People use puppet, globus and a variety of other approaches. They have to install java on all nodes one way or another. Another approach would be an rmr2 container. – piccolbo Nov 12 '15 at 00:33
  • @user2558672 to distribute the tar efficiently don't read in map, you need to use distributed cache feature of streaming. You just install in map. My approach was to install to /tmp/R, so that it doesn't disappear with each job. But even this way one can hit timeout limits, privilege issues, lack of compiler toolchain, you name it. Brittle. – piccolbo Nov 12 '15 at 00:37
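
To make the suggestion in the last two comments concrete, this is how I read it. The sketch assumes that rmr2's backend.parameters forwards entries such as files to the hadoop streaming command line (which is my understanding, but I have not verified it on this cluster), that tm.tar.gz is a source tarball already uploaded to HDFS, and that /tmp/R is writable on every node; all names and paths are placeholders.

## Ship the tarball through the streaming distributed cache and install it
## once per node into a persistent node-local library, as suggested above.
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text",
            map = map, reduce = reduce,
            ## should translate to "-files hdfs:///user/ec2-user/tm.tar.gz",
            ## which drops tm.tar.gz into each task's working directory
            backend.parameters = list(
              hadoop = list(files = "hdfs:///user/ec2-user/tm.tar.gz")))
}

map <- function(k, lines) {
  lib.dir <- "/tmp/R"                          # survives across jobs
  dir.create(lib.dir, showWarnings = FALSE)
  .libPaths(c(lib.dir, .libPaths()))
  if (!requireNamespace("tm", quietly = TRUE)) {  # install only the first time on each node;
    install.packages("tm.tar.gz", repos = NULL,   # needs NLP, slam and a compiler toolchain
                     type = "source", lib = lib.dir)
  }
  library(tm)
  words <- unlist(strsplit(paste(lines, paste(stopwords("en"), collapse = " ")), "\\s"))
  keyval(words, 1)
}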
