Sequencefiles which map a single key to multiple values

Question

I am trying to do some preprocessing on data that will be fed to LucidWorks Big Data for indexing. LWBD accepts SolrXML in the form of SequenceFiles. I want to create a Pig script which will take all the SolrXML files in a directory and output them in the format

filename_1 => <here goes some XML>
...
filename_N => <here goes some more XML>

Pig's native PigStorage() load function can automatically create a column that includes the name of the file from which the data was extracted, which ideally would look like this:

{"filename_1", "<here goes some XML>"}
...
{"filename_N", "<here goes some more XML>"}

However, PigStorage() also automatically uses '\n' as a line delimiter, so what I actually end up with is a bag that looks like this:

{"filename_1", "<some partial XML from file 1>"}
{"filename_1", "<some more partial XML from file 1>"}
{"filename_1", "<the end of file 1>"}
...

I'm sure you get the picture. My question is, if I were to write this bag to a SequenceFile, how would it be read by other applications? Could it be combined as

"filename_1" => "<some partial XML from file 1>
                 <some more partial XML from file 1>
                 <the end of file 1>"

, by the default handling of the application I feed it to? Or is there some post-processing that I can do to get it into this format? Thank you for your help.

How are you creating the SequenceFile? E.g., what Pig Latin/UDFs are you using? — mr2ert, Sep 09 '13 at 21:19

score 0 · Answer 1 · answered Sep 09 '13 at 21:19

Since I can't find anything about a built-in SequenceFile writer, I'm assuming you are using a UDF (and if you aren't, then you will need one).

You'll have to group the files (by filename) ahead of time, and then send that to the writer UDF.

DESCRIBE xml ;
-- xml: {filename: chararray, xml_data: chararray}

B = FOREACH (GROUP xml BY filename)
    GENERATE group AS filename, xml.xml_data AS all_xml_data ;
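
To make the effect of the GROUP step concrete, here is a rough plain-Python sketch (not Pig) of what it does to the rows. The names `rows` and `group_by_filename` are made up for illustration. Note that this sketch happens to keep the input order, whereas a Pig bag gives no such guarantee.

```python
from collections import OrderedDict


def group_by_filename(rows):
    """Collect (filename, xml_line) pairs into filename -> list of lines."""
    grouped = OrderedDict()
    for filename, xml_line in rows:
        # One entry per filename; every partial line is appended to it.
        grouped.setdefault(filename, []).append(xml_line)
    return grouped


# Hypothetical sample rows, shaped like the output of the load step.
rows = [
    ("filename_1", "<doc>"),
    ("filename_2", "<doc>"),
    ("filename_1", "</doc>"),
]
grouped = group_by_filename(rows)
```

After this step there is exactly one record per filename, which is the shape the writer UDF needs in order to emit one key/value pair per file.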

Depending on how you have written the SequenceFile writer, it may be easier to convert the all_xml_data bag ahead of time to a chararray using a Python UDF like:

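@outputSchema('xml_complete: chararray')
def stringify(bag):
    # Pig hands a bag to a Jython UDF as a list of tuples,
    # so take the first field of each tuple before joining.
    delim = ''
    return delim.join(t[0] for t in bag)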

NOTE: Be aware that the order of the XML data may become jumbled this way, because a Pig bag has no guaranteed order. If your data allows it, stringify could be extended to reorganize it.
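
As one possible way to restore the order, here is a minimal sketch that assumes each record also carries a line number. That field is an assumption: it is not part of the schema shown above, and how you would produce it depends on your loader. `stringify_ordered` is a hypothetical name, and the Pig `@outputSchema` decorator is left out so the snippet runs as plain Python.

```python
def stringify_ordered(bag, delim="\n"):
    """Join (line_no, xml_data) tuples in ascending line_no order."""
    # Assumption: the first field of each tuple is the original line number.
    ordered = sorted(bag, key=lambda t: t[0])
    return delim.join(t[1] for t in ordered)


# Hypothetical bag contents, deliberately out of order.
bag = [(3, "</add>"), (1, "<add>"), (2, "<doc/>")]
result = stringify_ordered(bag)
```

Joining with a newline puts back the line breaks that were stripped when the file was split into lines; use an empty delimiter if you want the lines joined directly.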
