
I have several files with data in them.
For example: file01.csv with x lines in it, file02.csv with y lines in it.

I would like to process and merge them with MapReduce in order to get a file with the x lines beginning with file01 followed by the line content, then the y lines beginning with file02 followed by the line content.

I have two issues here:

  • I know how to get lines from a file with MapReduce by setting FileInputFormat.setInputPath(job, new Path(inputFile));, but I don't understand how I can get the lines of each file in a folder.
  • Once I have those lines in my mapper, how can I access the corresponding filename, so that I can create the data I want?

Thank you for your consideration.

Ambre

  • Check if this helps you - http://stackoverflow.com/questions/17875277/reading-file-as-single-record-in-hadoop – Amit Jan 12 '17 at 16:02

1 Answer


You do not need MapReduce in your situation, because you want to preserve the order of lines in the result file. In this case single-threaded processing will be faster.

Just run a Java client with code like this:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration()); // FileSystem.get() requires a Configuration
OutputStream os = fs.create(outputPath); // stream for the result file
PrintWriter pw = new PrintWriter(new OutputStreamWriter(os));

for (String inputFile : inputs) { // inputs: your list of input file paths
    InputStream is = fs.open(new Path(inputFile));
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    String line;
    while ((line = br.readLine()) != null) {
        // prefix each line with the name of the file it came from (comma as separator)
        pw.println(new Path(inputFile).getName() + "," + line);
    }
    br.close();
}

pw.close();
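
If your input files all sit in one HDFS folder, you can build the inputs list by listing that folder with the same FileSystem handle. A minimal sketch, assuming the folder path is held in a variable named inputDir (a name used here just for illustration):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;

List<String> inputs = new ArrayList<>();
for (FileStatus status : fs.listStatus(new Path(inputDir))) { // one FileStatus per entry in the folder
    if (status.isFile()) { // skip sub-directories
        inputs.add(status.getPath().toString());
    }
}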
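
For completeness, since your second bullet asked about it: if you do end up using MapReduce after all, FileInputFormat.setInputPath also accepts a directory, in which case every file inside it becomes input, and with the standard file-based input formats a mapper can usually recover the current file name from its input split. A rough sketch of that pattern (it does not apply to formats like CombineFileInputFormat, whose splits are not FileSplits):

import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// inside your Mapper's map() method, where 'context' is the Context argument
String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();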
AdamSkywalker