2

I have a fairly large text file that I would like to convert into a SequenceFile. Unfortunately, the file consists of Python code with logical lines running over several physical lines. For example,
print "Blah Blah\
... blah blah"
Each logical line is terminated by a NEWLINE. Could someone clarify how I could possibly generate Key, Value pairs in Map-Reduce where each Value is the entire logical line?

dvk
  • 111
  • 2
  • 5

3 Answers3

4

I don't find the question asked earlier, but you just have to iterate over your lines via a simple mapreduce job and save them into a StringBuilder. Flush the StringBuilder to the context if you want to begin with a new record. The trick is to setup the StringBuilder in your mappers class as a field and not as a local variable.

here it is: Processing paraphragraphs in text files as single records with Hadoop

Community
  • 1
  • 1
Thomas Jungblut
  • 20,854
  • 6
  • 68
  • 91
1

You should create your own variation on TextInputFormat. In there you make a new RecordReader that skips lines until it sees the start of a logical line.

Niels Basjes
  • 10,424
  • 9
  • 50
  • 66
  • This is more elegant than what I currently did. I built a local iterable that gave me a logical line and used a RecordReader to transmit the entire document as a ByteWritable. Thanks for the tip! – dvk Jun 17 '11 at 04:51
0

Preprocess the input file to remove the newlines. What is your goal in creating the SequenceFile?

David Medinets
  • 5,160
  • 3
  • 29
  • 42