Is it possible to process multi-line records using Hadoop Streaming?

Question

I have records like this:

Name: Alan Kay
Email: Alan.Kay@url.com
Date: 09-09-2013

Name: Marvin Minsky
Email: Marvin.Minsky@url.com
City: Boston, MA
Date: 09-10-2013

Name: Alan Turing
City: New York City, NY
Date: 09-10-2013

The records are multi-line, but they don't always have the same number of lines, and they're usually separated by a blank line. How would I convert them to the output below?

Alan Kay|Alan.Kay@url.com||09-09-2013
Marvin Minsky|Marvin.Minsky@url.com|Boston, MA|09-10-2013
Alan Turing||New York City, NY|09-10-2013

Apache Pig treats each line as a record, so it's not suited for this task. I'm aware of this blog post on processing multi-line records, but I'd prefer not to delve into Java if there's a simpler solution. Is there a way to solve this using Hadoop Streaming (or a framework like mrjob)?

score 0 · Answer 1 · answered Apr 25 '14 at 09:09


There is no shortcut for this. You have to write your own InputFormat and RecordReader classes, which you can then specify in the Hadoop Streaming command. Follow this:

http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/
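To illustrate what the streaming script would do once whole records reach it, here is a minimal Python sketch of the conversion from the question. It is not the custom InputFormat/RecordReader itself: it assumes each task receives complete records (for example, via such a RecordReader or an input file that is not split mid-record), since the default line-based input can cut a record at a split boundary. The names `FIELDS`, `parse_records`, `format_record`, and `main` are hypothetical, and the column order is taken from the desired output in the question.

```python
import sys

# Output column order, taken from the desired output in the question.
FIELDS = ("Name", "Email", "City", "Date")


def parse_records(lines):
    """Yield one dict per record; records are separated by blank lines."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if record:
                yield record
                record = {}
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    if record:
        # The last record may not be followed by a blank line.
        yield record


def format_record(record):
    """Join the known fields with '|', leaving missing ones empty."""
    return "|".join(record.get(field, "") for field in FIELDS)


def main(stream, out):
    """Convert every record read from `stream` and write it to `out`."""
    for record in parse_records(stream):
        out.write(format_record(record) + "\n")
```

To try it locally, end the script with `main(sys.stdin, sys.stdout)` and pipe a sample file into it. Whether it can be used unchanged as a streaming mapper depends on how the job delivers records to the script.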


Ashish


Thank you for letting us know; the link has been replaced with a working one! – nitinr708 Nov 11 '18 at 10:44
