I'm working with Hadoop to process some files distributed across a cluster of JVM instances.
I'm using the Cascading library to interface to Hadoop.
I want to parse a text file in which records cross newlines and are terminated by a period (.).
(I'm aware the data is so small that the benefits of Hadoop won't be realised - I'm working on a demo.)
From what I can see - I'd need to write a custom InputFormat to handle this.
My question is - is it better# to:
(a) add a pre-processing step on my input data that strips out the newlines and then inserts a newline after the end of each record?
(b) write a custom InputFormat?
# By 'better' - I mean less work and more idiomatic.
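To illustrate what I mean by option (a): something like the small standalone normaliser below, run once over the input before handing it to Hadoop. This is only a sketch of the idea (the class name `RecordNormalizer` and the "period followed by whitespace ends a record" heuristic are my own assumptions about the data, not tested against the real files):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RecordNormalizer {

    // Flattens newline-crossing records onto single lines, then starts a
    // new line after each record's terminating period, so the result can
    // be consumed by the stock line-oriented TextInputFormat.
    public static String normalize(String input) {
        // Collapse all line breaks into spaces so every record is on one line.
        String flat = input.replace("\r\n", " ").replace('\n', ' ');
        // Assumption: a period followed by a space marks the end of a record.
        return flat.replace(". ", ".\n").trim();
    }

    public static void main(String[] args) throws IOException {
        String text = new String(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(normalize(text));
    }
}
```

Obviously this heuristic breaks on abbreviations or decimal points inside records, which is part of why I'm asking whether a custom InputFormat is the more idiomatic route.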