I have this scenario:
- Hadoop Client node (R and rmr2 installed)
- Hadoop cluster (R and rmr2 installed on all nodes)
- No administrator privileges on the cluster for installing external libraries
- This question is similar to Temporarily installing R packages on Hadoop nodes for streaming jobs, but I can't add a comment there because I'm new here.
I have a script test.R for testing rmr2. In this script, the map function uses an R library that is only installed on the client node, not on the cluster.
Obviously, the job fails in each map task because it can't load the library.
My questions are:
How can I use this library without installing it on every node with administrator privileges?
How can I attach, send, or share it? I don't want to install every library on every node each time I use a new one.
Is it possible?
I can't find any parameter similar to --jars in Hadoop, or the --py-files parameter in Spark with Python.
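The closest workaround I can imagine is shipping a copy of the installed package directory itself. Building the archive on the client might look like this sketch (the package list and the /tmp/tmlib.zip path are placeholders of mine; check tm's DESCRIPTION for its actual dependencies):

# Bundle the client's installed copies of "tm" and its dependencies
# (NLP and slam here) into one zip, rooted at the package directories.
lib.dir <- dirname(find.package("tm"))
owd <- setwd(lib.dir)
utils::zip("/tmp/tmlib.zip", files = c("tm", "NLP", "slam"))
setwd(owd)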
Here is a simple wordcount example that uses the "tm" library (its stopwords() function), which is installed on the client but not on the cluster nodes.
Sys.setenv(HADOOP_CMD="/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/hadoop/bin/hadoop")
Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop/conf")
Sys.setenv(HADOOP_STREAMING="/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.4.2.jar")
library(rmr2)
library(tm)   # installed on the client node only

map <- function(k, lines) {
  x <- stopwords("en")   # this call needs "tm" on the cluster node and fails there
  y <- paste(x, collapse = " ")
  words.list <- strsplit(paste(lines, y, sep = " "), '\\s')
  words <- unlist(words.list)
  sessionInfo()   # for debugging: shows which packages the task has loaded
  return(keyval(words, 1))
}

reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text",
            map = map, reduce = reduce)
}

## Read text files from the HDFS 'input' folder and
## save the result in 'output_rmr2_external_test'.
## Submit the job.
hdfs.root <- '/user/ec2-user'   # no trailing space, so file.path() builds clean paths
hdfs.data <- file.path(hdfs.root, 'input')
hdfs.out <- file.path(hdfs.root, 'output_rmr2_external_test')
out <- wordcount(hdfs.data, hdfs.out)
To execute it from the client node:
Rscript test.R
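After reading the rmr2 documentation, my best guess is its backend.parameters argument, which forwards extra options to the underlying streaming command, combined with Hadoop's generic -archives option (an archive passed this way is distributed to every task and unpacked in the task's working directory). I don't know whether rmr2 emits generic options in the position streaming requires, so the following is only a sketch of what I imagine, reusing the /tmp/tmlib.zip built above ("tmlib" is just the link name I picked):

# Map function that bootstraps "tm" from the shipped archive: streaming
# should unpack /tmp/tmlib.zip in the task's working directory under the
# link name "tmlib", so that directory can act as an extra R library.
map <- function(k, lines) {
  if (file.exists("tmlib")) .libPaths(c("tmlib", .libPaths()))
  library(tm)
  x <- stopwords("en")
  y <- paste(x, collapse = " ")
  words <- unlist(strsplit(paste(lines, y, sep = " "), '\\s'))
  keyval(words, 1)
}

wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text",
            map = map, reduce = reduce,
            # ask streaming to distribute the archive to every task;
            # the part after '#' is the local link name
            backend.parameters = list(
              hadoop = list(archives = "/tmp/tmlib.zip#tmlib")))
}

Is that how backend.parameters is meant to be used, or is there a cleaner mechanism?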
Edit:
Comparing this question with Temporarily installing R packages on Hadoop nodes for streaming jobs:
1) This question asks the same thing as the other one, but I think this one is more complete because it includes a concrete example and scenario, so it is clearer.
2) The first answer is yours, piccolbo, but it is from 2012 and we are now in 2015, so I suspect it is outdated. The other answers are helpful, but only partially. If you pass the external R library compressed, you must unzip it on each node and add its path to R's .libPaths(), right? But I don't know whether there is a parameter for that.
3) I would like to know whether this is possible, and whether there is an easy way to do it.
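As an aside, the lack of root is not by itself the blocker: R can install into a per-user library without administrator privileges, along the lines of the sketch below (~/R/library is a placeholder path of mine). But that would still mean touching every node, which is exactly what I want to avoid.

# One-time, per node, no administrator privileges required:
dir.create("~/R/library", recursive = TRUE, showWarnings = FALSE)
install.packages("tm", lib = "~/R/library", repos = "http://cran.r-project.org")
# ...and in any script that needs the package:
.libPaths(c("~/R/library", .libPaths()))
library(tm)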
Thank you