
I have to process data in very large text files (around 5 TB in size). The processing logic uses superCSV to parse through the data and run some checks on it. Since the size is quite large, we planned on using Hadoop to take advantage of parallel computation. I have installed Hadoop on my machine and started writing the mapper and reducer classes, but I am stuck: since map requires a key-value pair, I am not sure what the key and value should be when reading this text file. Can someone help me out with that?

My thought process is something like this (let me know if I am correct):

1) Read the file using superCSV and have Hadoop generate the superCSV beans for each chunk of the file in HDFS (I am assuming that Hadoop takes care of splitting the file).

2) For each of these superCSV beans, run my check logic.

Nikhil Das Nomula

1 Answer


Is the data newline-separated? That is, if you just split the data on each newline character, will each chunk always be a single, complete record? This depends on how superCSV encodes the text, and on whether any of your field values themselves contain newline characters.

If yes:

Just use TextInputFormat. It provides you with (I think) the byte offset as the map key, and the whole line as the value. You can ignore the key, and parse the line using superCSV.
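As a rough sketch of that approach (the output types, the field check, and the class name here are placeholders, not anything from the question), a mapper fed by TextInputFormat could ignore the offset key and hand each line to superCSV's CsvListReader:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import org.supercsv.io.CsvListReader;
import org.supercsv.prefs.CsvPreference;

// Works with TextInputFormat: key = byte offset of the line, value = the line itself.
public class CsvCheckMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Parse the single line with superCSV; the offset key is ignored.
        CsvListReader reader = new CsvListReader(
                new StringReader(line.toString()), CsvPreference.STANDARD_PREFERENCE);
        try {
            List<String> fields = reader.read();
            if (fields == null) {
                return; // blank line, nothing to check
            }
            // Placeholder check: emit the records that fail validation.
            if (!passesChecks(fields)) {
                context.write(line, NullWritable.get());
            }
        } finally {
            reader.close();
        }
    }

    // Hypothetical validation hook; replace with the real check logic.
    private boolean passesChecks(List<String> fields) {
        return !fields.isEmpty();
    }
}
```

If the checks don't need any aggregation across records, this can even run as a map-only job with zero reducers.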

If no:

You'll have to write your own custom InputFormat - here's a good tutorial: http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat. The specifics of exactly what the key is and what the value is don't matter too much to the mapper input; just make sure one of the two contains the actual data that you want. You can even use NullWritable as the type for one of them.
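For a sense of the wiring involved (the actual record-boundary logic is the part you would have to write yourself, guided by the tutorial above), a custom InputFormat with NullWritable keys could be shaped like the sketch below; it simply delegates to Hadoop's LineRecordReader, and nextKeyValue() is where the multi-line record assembly would go:

```java
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Custom InputFormat with NullWritable keys; the value carries the record text.
public class CsvRecordInputFormat extends FileInputFormat<NullWritable, Text> {

    @Override
    public RecordReader<NullWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new RecordReader<NullWritable, Text>() {
            // Delegates to the built-in line reader; replace the logic in
            // nextKeyValue() to stitch multi-line records together.
            private final LineRecordReader lineReader = new LineRecordReader();

            @Override
            public void initialize(InputSplit s, TaskAttemptContext c)
                    throws IOException, InterruptedException {
                lineReader.initialize(s, c);
            }

            @Override
            public boolean nextKeyValue() throws IOException, InterruptedException {
                return lineReader.nextKeyValue();
            }

            @Override
            public NullWritable getCurrentKey() {
                return NullWritable.get();
            }

            @Override
            public Text getCurrentValue() {
                return lineReader.getCurrentValue();
            }

            @Override
            public float getProgress() throws IOException {
                return lineReader.getProgress();
            }

            @Override
            public void close() throws IOException {
                lineReader.close();
            }
        };
    }
}
```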

Joe K
  • To be clear, for text input the key is `LongWritable` and the value is `Text`. – Thomas Jungblut Oct 26 '12 at 11:52
  • Yes, the actual data is newline-separated. As per what you suggest, I need to use the byte offset as the key and the line (the object parsed by superCSV) as the value. So does that mean that each line will be processed on a node in the cluster? What I was thinking was that Hadoop would split the files and I would send the corresponding superCSV objects as values. – Nikhil Das Nomula Oct 26 '12 at 14:31
  • Hadoop will split the files and feed many lines to each mapper, and yes, this will be distributed through the cluster. You can do the parsing of the lines into supercsv objects as the immediate first step in the mapper, and have essentially the same result as if you had used a custom input format, but without the hassle of actually writing/debugging that. – Joe K Oct 26 '12 at 23:31
  • As per what you say, I tried splitting using NLineInputFormat. Let's say I have 10 lines per split. In that case only the first split (the first 10 lines) contains the CSV header through which superCSV populates its respective beans; the remaining splits just have data without a header. Do I need to write a header at the beginning of each split? I am not sure if we can do that. – Nikhil Das Nomula Oct 31 '12 at 21:47
  • Can you simply hard-code the header into the mapper class and remove it from the file? There's also no need to use NLineInputFormat; you should just use TextInputFormat. – Joe K Nov 01 '12 at 21:58
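To illustrate the hard-coded header idea from the last comment: with plain TextInputFormat the mapper can keep the column names in a constant and map each line onto a bean itself, so the data file no longer needs a header row. A minimal sketch, assuming a hypothetical RecordBean with id/name/amount columns (none of these names come from the original question):

```java
import java.io.IOException;
import java.io.StringReader;

import org.supercsv.io.CsvBeanReader;
import org.supercsv.prefs.CsvPreference;

public class LineToBeanParser {

    // Hard-coded column names, in the order they appear in the data;
    // these replace the header row that was removed from the file.
    private static final String[] HEADER = {"id", "name", "amount"};

    // Parses one CSV line into a bean via superCSV's CsvBeanReader.
    public static RecordBean parse(String line) throws IOException {
        CsvBeanReader reader = new CsvBeanReader(
                new StringReader(line), CsvPreference.STANDARD_PREFERENCE);
        try {
            return reader.read(RecordBean.class, HEADER);
        } finally {
            reader.close();
        }
    }

    // Hypothetical bean the hard-coded columns map onto.
    public static class RecordBean {
        private String id;
        private String name;
        private String amount;

        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public String getAmount() { return amount; }
        public void setAmount(String amount) { this.amount = amount; }
    }
}
```

Inside the mapper's map() method you would then call something like LineToBeanParser.parse(value.toString()) and run the checks on the returned bean.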