
I'm building a MapReduce program in R that uses a genetic algorithm to extract the relevant features from the feature set of a dataset. I need to pass many files as input to my MapReduce job. The code below is my MapReduce program, but it works only for one input file (data.csv).

library(caret)
library(dplyr)
library(rmr2)
# point rmr2 at the local Hadoop installation and streaming jar
Sys.setenv(HADOOP_CMD = "/home/rania/hadoop-2.7.3/bin/hadoop")
Sys.getenv("HADOOP_CMD")
Sys.setenv(HADOOP_STREAMING = "/home/rania/hadoop-streaming-2.7.3.jar")
library(rhdfs)
hdfs.init()
rmr.options(backend = "hadoop")
# create the working directories on HDFS
hdfs.mkdir("/user/rania/genetic")
hdfs.mkdir("/user/rania/genetic/data")

I put my files in one folder in HDFS:

hadoop fs -copyFromLocal /home/rania/Downloads/matrices/*.csv /user/rania/genetic/data/
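
To check that the files actually landed there, rhdfs can list the folder:

hdfs.ls("/user/rania/genetic/data")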

This is the map function:

mon.map <- function(., data) {
  data <- read.csv("/home/rania/Downloads/dataset.csv", header = TRUE, sep = ";")
  y <- c(1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1)

  ga_ctrl <- gafsControl(functions = rfGA, # assess fitness with random forests
                         method = "cv")    # 10-fold cross-validation
  set.seed(10)
  lev <- c("1", "0")
  rf_ga3 <- gafs(x = data, y = y,
                 iters = 10,   # 10 generations of the algorithm
                 popSize = 4,  # population size for each generation
                 levels = lev,
                 gafsControl = ga_ctrl)
  keyval(rf_ga3$ga$final, data[names(data) %in% rf_ga3$ga$final])
}
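
A note on how rmr2 feeds the mapper: with input.format = "csv", the second argument of the map function already holds a data frame with the parsed rows of the current input split, so a mapper that only needs the streamed data does not have to call read.csv at all. A minimal sketch (the constant key 1 is arbitrary, chosen just to group everything under one reducer):

# sketch: consume the chunk Hadoop streams in instead of re-reading a local file
mon.map.streamed <- function(., data) {
  # 'data' is a data frame holding the rows of the current input split
  keyval(1, data)
}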

This is the reduce function:

mon.reduce <- function(k, v) {
  keyval(k, v)
}

Now I run the MapReduce job:

hdfs.root <- 'genetic'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')
csv.format <- make.output.format("csv")
genetic <- function(input, output) {
  mapreduce(input = input, output = output,
            input.format = "csv", output.format = csv.format,
            map = mon.map, reduce = mon.reduce)
}
out <- genetic(hdfs.data, hdfs.out)
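
If I understand the rmr2 documentation correctly, the input argument of mapreduce accepts either an HDFS folder (every file inside it becomes part of the job's input, which is what hdfs.data does here) or a character vector of paths. A sketch of the explicit-vector variant (the file names are hypothetical):

# assumption: mapreduce also accepts a vector of input paths
inputs <- c("genetic/data/m1.csv", "genetic/data/m2.csv") # hypothetical names
out <- genetic(inputs, hdfs.out)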

Then we print the result from HDFS:

results <- from.dfs(out, format="csv")
print(results) 

OR

hdfs.cat("/genetic/out/part-00000")
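
As a side note, from.dfs returns a key/value object; rmr2's keys() and values() helpers pull it apart:

results <- from.dfs(out, format = "csv")
keys(results)   # may be NULL here, since the csv format carries no separate key
values(results) # the data frame of emitted values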

I tried to change the map function to make it work for many files, but it failed:

mon.map <- function(., data) {
  data <- list.files(path = "/home/rania/Downloads/matrices/",
                     full.names = TRUE, pattern = "\\.csv") %>%
    lapply(read.csv, header = TRUE, sep = ",")
  y <- c(1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1)
  for (i in 1:4) {
    ga_ctrl <- gafsControl(functions = rfGA, # assess fitness with random forests
                           method = "cv")    # 10-fold cross-validation
    set.seed(10)
    lev <- c("1", "0")
    rf_ga3 <- gafs(x = data[[i]], y = y,
                   iters = 10,   # 10 generations of the algorithm
                   popSize = 4,  # population size for each generation
                   levels = lev,
                   gafsControl = ga_ctrl)
  }
  # note: rf_ga3 is overwritten on each pass, so only the last file's
  # result is left after the loop
  keyval(rf_ga3$ga$final, do.call(cbind, Map(`[`, data, c(rf_ga3$ga$final))))
}

What can I change in the previous map function to make it work for many input files? Thanks.
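
For comparison, a sketch of one restructuring, under the assumption that the intent is to run gafs once per file and emit one key/value pair per file instead of keeping only the last loop result (c.keyval is rmr2's helper for concatenating key/value pairs; y is still assumed to match the row count of every file):

# sketch only: one gafs run and one keyval per input file
mon.map.multi <- function(., ignored) {
  files <- list.files(path = "/home/rania/Downloads/matrices/",
                      full.names = TRUE, pattern = "\\.csv")
  data <- lapply(files, read.csv, header = TRUE, sep = ",")
  y <- c(1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1)
  lev <- c("1", "0")
  results <- lapply(data, function(d) {
    ga_ctrl <- gafsControl(functions = rfGA, method = "cv")
    set.seed(10)
    rf_ga <- gafs(x = d, y = y, iters = 10, popSize = 4,
                  levels = lev, gafsControl = ga_ctrl)
    keyval(rf_ga$ga$final, d[names(d) %in% rf_ga$ga$final])
  })
  # merge the per-file key/value pairs into a single return value
  do.call(c.keyval, results)
}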

  • I think "*.csv" should be used in the pattern argument. Hadoop handles file path patterns easily. – user238607 Jun 24 '17 at 14:53
  • This is the error: `Error in if (nrow(x) != length(y)) stop("there should be the same number of samples in x and y") : argument is of length zero`, but if I check the number of rows of each data[[i]] and the length of y, I find them the same. What could be wrong? @user238607 – Rania Jun 25 '17 at 01:58
