I am currently using R to carry out analysis.
I have a large number of CSV files, all with the same headers, that I would like to process using R. I had originally read each file sequentially into R and row-bound them together before carrying out the analysis on the combined data.
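For reference, my current approach looks roughly like this (the directory path is illustrative):

    library(data.table)  # fread is fast for large CSVs; base read.csv works too

    # list every CSV in the data directory (path is illustrative)
    files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

    # read each file and row-bind them into one big table
    all_data <- rbindlist(lapply(files, fread))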
The number of files that need to be read in is growing, so keeping them all in memory while manipulating the data is becoming infeasible.
I can combine all of the CSV files outside of R, so nothing has to be held in memory, but that leaves me with one huge CSV file. Would converting it to HDFS make sense in order to carry out the relevant analysis? Or would it make more sense to carry out the analysis on each CSV file separately and then combine the results at the end, as in the sketch below?
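For example, the per-file route could look roughly like this; "group" and "value" are stand-ins for my actual columns, and the summary step is just a placeholder for the real analysis:

    library(data.table)

    files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

    # process one file at a time so only a single file is ever in memory
    per_file <- lapply(files, function(f) {
      dt <- fread(f)
      # placeholder for the real manipulations: a per-group summary
      dt[, .(n = .N, mean_value = mean(value)), by = group]
    })

    # the per-file summaries are small, so combining them is cheap
    combined <- rbindlist(per_file)
    final <- combined[, .(total_n = sum(n),
                          overall_mean = weighted.mean(mean_value, n)),
                      by = group]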
I am also thinking that a distributed file system and a cluster of machines on Amazon might be the way to carry out the analysis efficiently.
Looking at rmr here, it converts data to HDFS, but apparently it's not amazing for really big data. How would one convert the CSVs in a way that allows efficient analysis?
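If I went the rmr route, my (untested) understanding is that the CSVs could be read straight off HDFS via an input format rather than being converted first, roughly like this, where the HDFS path and columns are made up:

    library(rmr2)

    # describe how the CSV files already copied onto HDFS should be parsed
    csv_format <- make.input.format("csv", sep = ",")

    # a toy mapreduce job: count rows per value of the first column
    # ("/user/me/data" is an illustrative HDFS directory of CSVs)
    result <- mapreduce(
      input        = "/user/me/data",
      input.format = csv_format,
      map    = function(k, v) keyval(v[[1]], 1),
      reduce = function(k, counts) keyval(k, sum(unlist(counts)))
    )

    # pull the (small) aggregated result back into the local R session
    counts <- from.dfs(result)

Is that the right way to think about it, or is there a better way to structure the data for this kind of workload?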