
The process and tools for ingesting CSV data from an external source into HDFS and storing it in a particular format are well known; but how do you convert the format of data that ALREADY EXISTS in HDFS?

I am working with an existing data set (multi-TB) on HDFS, stored as uncompressed JSON. How can I convert that data into, say, Parquet, on the same cluster, while minimizing the use of cluster resources?

Options:

  • Temporarily stand up a second cluster of the same size, move all the data over while converting it, then move it back?
  • Temporarily add extra nodes to the existing cluster? How do I ensure they are only used for this migration? (One possibility is sketched after this list.)
  • ??
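
For the second option, the kind of mechanism I have in mind is submitting the conversion job to a dedicated, capacity-capped YARN queue. A minimal sketch follows; the queue name "migration" and the class name MigrationJobSetup are hypothetical, and the queue itself would have to be defined in capacity-scheduler.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MigrationJobSetup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // "migration" is a hypothetical YARN queue, defined in
            // capacity-scheduler.xml with a capped maximum-capacity, so the
            // conversion job cannot starve the cluster's other workloads.
            conf.set("mapreduce.job.queuename", "migration");
            Job job = Job.getInstance(conf, "json-to-parquet-migration");
            // ... mapper, input/output formats, and paths would go here ...
        }
    }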

Thanks,

Matt


1 Answer


You could write Java code to convert the existing CSV files to Parquet using the ParquetOutputFormat class; see the Parquet project for the implementation.

The code will look something like this:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    import parquet.hadoop.ParquetOutputFormat;

    public class CsvToParquet {

        public static void main(String[] args) throws IOException,
                InterruptedException, ClassNotFoundException {

            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf);
            job.setJobName("CSV to Parquet");
            job.setJarByClass(CsvToParquet.class);

            // Identity mapper and reducer: records pass through unchanged.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);

            // A single reducer funnels everything through one task; for a
            // large input you would raise this or drop the reduce phase.
            job.setNumReduceTasks(1);

            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(ParquetOutputFormat.class);

            TextInputFormat.addInputPath(job, new Path("/csv"));
            ParquetOutputFormat.setOutputPath(job, new Path("/parquet"));

            job.waitForCompletion(true);
        }
    }

/csv is the HDFS path to the CSV input and /parquet is the HDFS path where the new Parquet files will be written.
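
Note that, as written, the job above never tells ParquetOutputFormat how to serialize its records: parquet-hadoop also needs a WriteSupport class and a schema. Below is a minimal map-only sketch using the example Group API that ships with parquet-mr; the class name CsvToParquetGroups and the single-column schema are illustrative assumptions, not part of the original answer:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    import parquet.example.data.Group;
    import parquet.example.data.simple.SimpleGroupFactory;
    import parquet.hadoop.ParquetOutputFormat;
    import parquet.hadoop.example.GroupWriteSupport;
    import parquet.schema.MessageTypeParser;

    public class CsvToParquetGroups {

        // Single-column schema holding the raw line; a real job would
        // declare one field per CSV column and parse the line in map().
        static final String SCHEMA =
                "message line { required binary raw (UTF8); }";

        public static class LineMapper
                extends Mapper<LongWritable, Text, Void, Group> {
            private SimpleGroupFactory factory;

            @Override
            protected void setup(Context context) {
                factory = new SimpleGroupFactory(
                        MessageTypeParser.parseMessageType(SCHEMA));
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                Group group = factory.newGroup().append("raw", value.toString());
                context.write(null, group);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            GroupWriteSupport.setSchema(
                    MessageTypeParser.parseMessageType(SCHEMA), conf);

            Job job = Job.getInstance(conf);
            job.setJobName("CSV to Parquet");
            job.setJarByClass(CsvToParquetGroups.class);
            job.setMapperClass(LineMapper.class);
            // Map-only: no shuffle or reduce phase, which keeps the load
            // on the cluster lower for a bulk format conversion.
            job.setNumReduceTasks(0);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(ParquetOutputFormat.class);
            ParquetOutputFormat.setWriteSupportClass(job, GroupWriteSupport.class);

            TextInputFormat.addInputPath(job, new Path("/csv"));
            ParquetOutputFormat.setOutputPath(job, new Path("/parquet"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same skeleton would apply to the JSON data in the question: the mapper would parse each record and emit a Group matching the target schema.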

Source

Rajesh N
  • thanks, but that's not the question... (I will edit). I know how to code it - the problem is to perform this task on *existing* data already in the cluster (~4 TB). It's more of a DevOps issue... – matthieu lieber May 14 '15 at 19:35