
I am currently using R to carry out analysis.

I have a large number of CSV files, all with the same headers, that I would like to process using R. Originally I read each file sequentially into R and row-bound them together before carrying out the analysis on the combined data.
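Roughly, the current approach looks like this (the directory name is just a placeholder):

```r
# Read every CSV into memory and stack them into one big data frame.
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
all_data <- do.call(rbind, lapply(files, read.csv))
# ...analysis on all_data...
```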

The number of files that need to be read in is growing, so keeping them all in memory while manipulating the data is becoming infeasible.

I can combine all of the CSV files together without using R, and thus without keeping them in memory. This leaves one huge CSV file; would converting it to HDFS make sense in order to carry out the relevant analysis? Or would it make more sense to carry out the analysis on each CSV file separately and then combine the results at the end?
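To illustrate that second option, the per-file route would look something like this, where `summarise_one()` stands in for whatever per-file analysis step applies (a hypothetical function, not something I have written yet):

```r
# Per-file alternative: analyse each CSV on its own, keep only the summaries.
summaries <- lapply(files, function(f) {
  d <- read.csv(f)
  summarise_one(d)  # hypothetical per-file analysis step
})
final <- do.call(rbind, summaries)
```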

I am thinking that perhaps a distributed file system and a cluster of machines on Amazon would let me carry out the analysis efficiently.

Looking at rmr here, it converts data to HDFS, but apparently it's not amazing for really big data... how would one convert the CSV files in a way that would allow efficient analysis?

h.l.m
  • What do you mean by "converting it to HDFS"? Are your `.csv`s sitting on the HDFS or your local filesystem? Or are you merely trying to make use of MapR processing? If you can process the `.csv`s independently, then do so. – mlegge Feb 13 '15 at 17:23
  • You can't convert a format to a file system, that's nonsense. – piccolbo Feb 13 '15 at 21:18

2 Answers


You can build a composite CSV file in HDFS. First, create an empty HDFS folder. Then pull each CSV file separately into that folder. In the end, you will be able to treat the folder as a single HDFS file.

To pull the files into HDFS, you can use a for loop in the terminal, the rhdfs package, or load the files into memory and use to.dfs (although I don't recommend that last option). Remember to strip the header from each file.
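A minimal sketch of the rhdfs route, assuming the headerless CSVs sit in a local `data/` directory and the target HDFS folder is `/user/me/csv_parts` (both paths are placeholders):

```r
# Sketch only: copy each local CSV into a single HDFS folder with rhdfs.
library(rhdfs)
hdfs.init()

# Create the HDFS folder that will hold all the pieces.
hdfs.mkdir("/user/me/csv_parts")

# Push each local file (headers already stripped) into that folder.
local_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
for (f in local_files) {
  hdfs.put(f, "/user/me/csv_parts")
}
```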

Using rmr2, I advise you to first convert the CSV data into rmr2's native format on HDFS and then perform your analysis on it. You should be able to deal with big data volumes.
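Something along these lines for the conversion step, reusing the hypothetical HDFS paths above and made-up column names:

```r
# Sketch only: rewrite the CSV folder in rmr2's native format.
library(rmr2)

# Describe the CSV layout (comma-separated, headers already removed).
csv_format <- make.input.format("csv", sep = ",",
                                col.names = c("id", "value"))  # made-up columns

# With no map or reduce supplied, mapreduce() acts as an identity job,
# so this just re-encodes the data in the default native output format.
mapreduce(input        = "/user/me/csv_parts",
          input.format = csv_format,
          output       = "/user/me/data_native")

# Subsequent rmr2 jobs can read "/user/me/data_native" directly,
# or a small result can be pulled back with from.dfs().
```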

Michele Usuelli

HDFS is a file system, not a file format. HDFS also doesn't handle small files well: with its usual default block size of 64 MB, every file from 1 B up to the block size still occupies a whole block as far as the NameNode is concerned.

Hadoop works best on HUGE files! So it would be best for you to concatenate all your small files into one giant file on HDFS, which your Hadoop tooling will have a much easier time handling:

hdfs dfs -cat myfiles/*.csv | hdfs dfs -put - myfiles_together.csv
MrE