I am trying to do some preprocessing on data that will be fed to LucidWorks Big Data for indexing. LWBD accepts SolrXML in the form of Sequencefile files. I want to create a Pig script which will take all the SolrXML files in a directory and output them in the format
filename_1 => <here goes some XML>
...
filename_N => <here goes some more XML>
Pig's native PigStorage()
load function can automatically create a column that includes the name of the file from which the data was extracted, which ideally would look like this:
{"filename_1", "<here goes some XML>"}
...
{"filename_N", "<here goes some more XML>"}
However, PigStorage() also automatically uses '\n' as a line delimiter, so what I actually end up with is a bag that looks like this:
{"filename_1", "<some partial XML from file 1>"}
{"filename_1", "<some more partial XML from file 1>"}
{"filename_1", "<the end of file 1>"}
...
I'm sure you get the picture. My question is, if I were to write this bag to a SequenceFile, how would it be read by other applications? Could it be combined as
"filename_1" => "<some partial XML from file 1>
<some more partial XML from file 1>
<the end of file 1>"
, by the default handling of the application I feed it to? Or is there some post-processing that I can do to get it into this format? Thank you for your help.