I have a fairly large text file that I would like to convert into a SequenceFile. Unfortunately, the file consists of Python code with logical lines running over several physical lines. For example,
print "Blah Blah\
... blah blah"
Each logical line is terminated by a NEWLINE. Could someone clarify how I could possibly generate Key, Value pairs in Map-Reduce where each Value is the entire logical line?
Asked
Active
Viewed 3,069 times
2

dvk
- 111
- 2
- 5
3 Answers
4
I don't find the question asked earlier, but you just have to iterate over your lines via a simple mapreduce job and save them into a StringBuilder. Flush the StringBuilder to the context if you want to begin with a new record. The trick is to setup the StringBuilder in your mappers class as a field and not as a local variable.
here it is: Processing paraphragraphs in text files as single records with Hadoop

Community
- 1
- 1

Thomas Jungblut
- 20,854
- 6
- 68
- 91
1
You should create your own variation on TextInputFormat. In there you make a new RecordReader that skips lines until it sees the start of a logical line.

Niels Basjes
- 10,424
- 9
- 50
- 66
-
This is more elegant than what I currently did. I built a local iterable that gave me a logical line and used a RecordReader to transmit the entire document as a ByteWritable. Thanks for the tip! – dvk Jun 17 '11 at 04:51
0
Preprocess the input file to remove the newlines. What is your goal in creating the SequenceFile?

David Medinets
- 5,160
- 3
- 29
- 42