I have to process some very large text files (around 5 TB in size). The processing logic uses SuperCSV to parse through the data and run some checks on it. Since the files are so large, we plan to use Hadoop to take advantage of parallel computation. I have installed Hadoop on my machine and started writing the mapper and reducer classes, but I am stuck: map() requires a key-value pair, and I am not sure what the key and the value should be when reading this text file. Can someone help me out with that?
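From what I have read so far, with the default TextInputFormat the framework hands the mapper the byte offset of each line as the key and the line itself as the value, so I think the skeleton would look roughly like this (the output key/value types are just guesses on my part, please correct me if I have this wrong):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Skeleton only -- the output key/value types are placeholders,
// I have not decided yet what the mapper should emit.
public class CsvLineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = byte offset of this line within the input split
        // value = one full line of the text file
        // the SuperCSV parsing and the checks would go here
    }
}
```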
My thought process is something like this (let me know if I am correct):

1) Hadoop splits the file into chunks in HDFS (I am assuming Hadoop takes care of the splitting), and each mapper reads its part of the file with SuperCSV to generate SuperCSV beans.
2) For each of these SuperCSV beans, run my check logic.
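For step 1, my rough idea is that each map() call parses the single line it receives into a SuperCSV bean and then runs the checks on it (step 2). RecordBean, the column names, and the check below are made-up placeholders, not my real schema:

```java
import java.io.IOException;
import java.io.StringReader;

import org.supercsv.cellprocessor.ParseDouble;
import org.supercsv.cellprocessor.constraint.NotNull;
import org.supercsv.cellprocessor.ift.CellProcessor;
import org.supercsv.io.CsvBeanReader;
import org.supercsv.io.ICsvBeanReader;
import org.supercsv.prefs.CsvPreference;

public class LineParser {

    // Made-up column layout -- my real file has different columns.
    private static final String[] COLUMNS = { "id", "amount" };
    private static final CellProcessor[] PROCESSORS = { new NotNull(), new ParseDouble() };

    // Parses a single CSV line (as handed to the mapper) into a bean.
    public static RecordBean parseLine(String line) throws IOException {
        try (ICsvBeanReader reader = new CsvBeanReader(
                new StringReader(line), CsvPreference.STANDARD_PREFERENCE)) {
            return reader.read(RecordBean.class, COLUMNS, PROCESSORS);
        }
    }

    // Placeholder for my real check logic.
    public static boolean passesChecks(RecordBean bean) {
        return bean != null && bean.getAmount() >= 0;
    }

    // Minimal placeholder bean (SuperCSV needs a no-arg constructor and setters).
    public static class RecordBean {
        private String id;
        private Double amount;

        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public Double getAmount() { return amount; }
        public void setAmount(Double amount) { this.amount = amount; }
    }
}
```

One thing I am not sure about is how the CSV header line (and any quoted fields containing embedded newlines) would behave once Hadoop splits the file line by line.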