
The process and tools for ingesting CSV data from an external source into HDFS and storing it in a particular format are well known; but how do you convert the format of data that ALREADY EXISTS in HDFS?

I am working with an existing data set (multi-TB) on HDFS, stored as uncompressed JSON. How can I convert that data into, say, Parquet, on the same cluster, while minimizing the use of cluster resources?

Options:

  • Temporarily stand up a second cluster of the same size, move all the data over while converting it, then move it back?
  • Temporarily add extra nodes to the existing cluster? How do I ensure they are only used for this migration? (One possibility is sketched after this list.)
  • ??
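
For the second option, the kind of mechanism I have in mind is submitting the conversion job to a dedicated, capacity-capped YARN queue. A minimal sketch follows; the queue name "migration" and the class name MigrationJobSetup are hypothetical, and the queue itself would have to be defined in capacity-scheduler.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MigrationJobSetup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // "migration" is a hypothetical YARN queue, defined in
            // capacity-scheduler.xml with a capped maximum-capacity, so the
            // conversion job cannot starve the cluster's other workloads.
            conf.set("mapreduce.job.queuename", "migration");
            Job job = Job.getInstance(conf, "json-to-parquet-migration");
            // ... mapper, input/output formats, and paths would go here ...
        }
    }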

Thanks,

Matt


1 Answer


You could write Java code to convert the existing CSV files to Parquet using the ParquetOutputFormat class; see the Parquet project for the implementation.

The code will look something like this:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    import parquet.hadoop.ParquetOutputFormat;

    public class CsvToParquet {

        public static void main(String[] args) throws IOException,
                InterruptedException, ClassNotFoundException {

            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf);
            job.setJobName("CSV to Parquet");
            job.setJarByClass(CsvToParquet.class);

            // Identity mapper and reducer: records pass through unchanged.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);

            // A single reducer funnels everything through one task; for a
            // large input you would raise this or drop the reduce phase.
            job.setNumReduceTasks(1);

            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(ParquetOutputFormat.class);

            TextInputFormat.addInputPath(job, new Path("/csv"));
            ParquetOutputFormat.setOutputPath(job, new Path("/parquet"));

            job.waitForCompletion(true);
        }
    }

/csv is the HDFS path to the CSV input and /parquet is the HDFS path where the new Parquet files will be written.
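
Note that, as written, the job above never tells ParquetOutputFormat how to serialize its records: parquet-hadoop also needs a WriteSupport class and a schema. Below is a minimal map-only sketch using the example Group API that ships with parquet-mr; the class name CsvToParquetGroups and the single-column schema are illustrative assumptions, not part of the original answer:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    import parquet.example.data.Group;
    import parquet.example.data.simple.SimpleGroupFactory;
    import parquet.hadoop.ParquetOutputFormat;
    import parquet.hadoop.example.GroupWriteSupport;
    import parquet.schema.MessageTypeParser;

    public class CsvToParquetGroups {

        // Single-column schema holding the raw line; a real job would
        // declare one field per CSV column and parse the line in map().
        static final String SCHEMA =
                "message line { required binary raw (UTF8); }";

        public static class LineMapper
                extends Mapper<LongWritable, Text, Void, Group> {
            private SimpleGroupFactory factory;

            @Override
            protected void setup(Context context) {
                factory = new SimpleGroupFactory(
                        MessageTypeParser.parseMessageType(SCHEMA));
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                Group group = factory.newGroup().append("raw", value.toString());
                context.write(null, group);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            GroupWriteSupport.setSchema(
                    MessageTypeParser.parseMessageType(SCHEMA), conf);

            Job job = Job.getInstance(conf);
            job.setJobName("CSV to Parquet");
            job.setJarByClass(CsvToParquetGroups.class);
            job.setMapperClass(LineMapper.class);
            // Map-only: no shuffle or reduce phase, which keeps the load
            // on the cluster lower for a bulk format conversion.
            job.setNumReduceTasks(0);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(ParquetOutputFormat.class);
            ParquetOutputFormat.setWriteSupportClass(job, GroupWriteSupport.class);

            TextInputFormat.addInputPath(job, new Path("/csv"));
            ParquetOutputFormat.setOutputPath(job, new Path("/parquet"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same skeleton would apply to the JSON data in the question: the mapper would parse each record and emit a Group matching the target schema.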

Source

Rajesh N
  • thanks, but that's not the question... (I will edit). I know how to code it - the problem is to perform this task on *existing* data already in the cluster (~4 TB). It's more of a DevOps issue... – matthieu lieber May 14 '15 at 19:35