
I have a list of XML files that need to be parsed using MapReduce (MR) code.

A sample XML file is given below:

<tns:envelope xmlns:tns="http://abcd.com/schemas/envelope/v3_0" xmlns:xsi="http://www.abcd.org/2001/XMLSchema-instance" version="3.0">
    <tns:header>
        <tns:type>response</tns:type>
        <tns:service>
            <tns:name>Value1</tns:name>
            <tns:version>3.0</tns:version>
        </tns:service>
        <tns:originator>Value2</tns:originator>
        <tns:businessProcessName>Value3</tns:businessProcessName>
        <tns:sequenceNumber>value3</tns:sequenceNumber>
        <tns:transactionReference>abcdef12345</tns:transactionReference>
        <tns:expirationSeconds>1200</tns:expirationSeconds>
        <tns:additionalParameters>
            <tns:param>
                <tns:name>notificationURL</tns:name>
                <tns:value>https://url1</tns:value>
            </tns:param>
            <tns:param>
                <tns:name>ConsumingCallbackURL</tns:name>
                <tns:value>https://url2</tns:value>
            </tns:param>
        </tns:additionalParameters>
        <tns:result>
            <tns:status>success</tns:status>
            <tns:provider>ABC</tns:provider>
        </tns:result>
        <tns:requestDateTime>2016-02-16T08:12:17.827Z</tns:requestDateTime>
    </tns:header>
    <tns:body></tns:body>
</tns:envelope>         

Now I have a configuration file that lists the tags of interest to be parsed. Sample tag names are given below:

/envelope/version
/envelope/header/type
/envelope/header/service/name
/envelope/header/additionalParameters/param/name
/envelope/header/additionalParameters/param/value

The expected output is like below:

/envelope/version /envelope/header/type /envelope/header/service/name /envelope/header/additionalParameters/param/name /envelope/header/additionalParameters/param/value
       3.0               response                   Value1                             notificationURL                                   https://url1
       3.0               response                   Value1                           ConsumingCallbackURL                                https://url2

Can I get sample code to parse the XML and produce the desired output?

  • Your file is not large enough to require mapreduce, nor is there an explicit reduce stage. You are mapping an XML parser across files. – OneCricketeer Apr 25 '16 at 16:10
  • This is just an example file. There are files of 300KB size each and we have to parse around 500K such files per day, so we thought MR should be the best option. Can you suggest what else can be done. – Koushik Chandra Apr 25 '16 at 16:49
  • Have you created a proof of concept (without mapreduce) on a single file first because that is really all you need. – OneCricketeer Apr 25 '16 at 16:51
  • We are trying to use a configuration file where interested tags are kept because all the XML files are not consistent. Means in some xml file few or more tag are missing or not there. In such cases after parsing, the expected value should be NULL or blank. – Koushik Chandra Apr 25 '16 at 16:53
  • Yes, without MR a code is there, written by someone else. – Koushik Chandra Apr 25 '16 at 16:54
  • You haven't specified a language tag, but I assume you are using Java Hadoop MapReduce code? – OneCricketeer Apr 25 '16 at 16:55
  • Okay, so 1) Outputting headers into your output won't be [possible or easy](http://stackoverflow.com/a/16331777/2308683) 2) What have you tried to do to wrap MR code around the existing parser? – OneCricketeer Apr 25 '16 at 17:05
  • No actually I don't need the header. – Koushik Chandra Apr 25 '16 at 17:18
  • The existing code doesn't have class/packages related to hadoop MR. So I am thinking we have to come up with a brand new code for hadoop MR. – Koushik Chandra Apr 25 '16 at 17:19
  • You don't need to integrate the code into MR libraries. You should write MR code "on top of" the existing code. You read each file into a single string. Parse that XML string into a Custom Writable class. Output that custom writable class to HDFS. There is no reduce stage. – OneCricketeer Apr 25 '16 at 17:29
  • what will be input and output key type and value type. If we read full file into a single string then input key seems to me LongWritable and input value Text. But what will be the output of the mapper exactly. – Koushik Chandra Apr 25 '16 at 17:36
  • (LongWritable, Text) is for reading line-by-line. You'll need a [`WholeFileInputFormat`](http://stackoverflow.com/questions/17875277/reading-file-as-single-record-in-hadoop). Which means the input and output can be both `NullWritable, Text`. Because there is no key, and you are writing tab-delimited strings as values. You are welcome to define your own custom writable for the mapper output, though. The Hive solution below looks like a better approach, anyway – OneCricketeer Apr 25 '16 at 18:23
  • One question: Are all your xpath guaranteed to have one and only one possible result field? – vtd-xml-author May 05 '16 at 22:54

1 Answer


The format in which your data is stored is very important in the case of semi-structured data like XML. Looking at the sample XML data, I can only assume it's some sort of web-service log. I can give you examples of two different scenarios for working with XML files in Hadoop.

  1. If you have control over how the XML files are stored, you can store them in the format below (each envelope on its own line). You can then use the default Hadoop TextInputFormat to read each line.

    <tns:envelope .... </tns:envelope>
    <tns:envelope .... </tns:envelope>
    <tns:envelope .... </tns:envelope>

sample code:

// requires org.apache.hadoop.io.{LongWritable, Text}
// and org.apache.hadoop.mapreduce.Mapper on the classpath
public static class XMLDataMap extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // read each line of XML data (one envelope per line)
        String xmlDataLine = value.toString();
        String tagName = "";
        String tagValue = "";

        // implement XML parsing logic below
        // I recommend using a StAX parser; you can use DOM as well,
        // or plug in already-implemented parsing logic here

        // tagName = parse logic
        // tagValue = parse logic

        context.write(new Text(tagName), new Text(tagValue));
    }
}
Note: If you don't have control over how the data is stored and the XML comes pretty-printed (the same format as the provided sample), you can remove the newline characters to make it look like the format above. That way you can ensure the XML data is valid (no missing tags) and use available libraries to parse it.
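To fill in that parse logic, a minimal StAX sketch could look like the following. The `EnvelopeParser` class name is hypothetical; it just collects leaf-element text keyed by local name, so repeated tags such as `param/name` overwrite one another. Real code would track the full element path and match it against the XPath list from your configuration file.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: pulls leaf-element text out of one envelope line.
public class EnvelopeParser {

    public static Map<String, String> parse(String xmlLine) throws Exception {
        Map<String, String> tags = new LinkedHashMap<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xmlLine));
        String currentElement = null;
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    currentElement = reader.getLocalName();
                    break;
                case XMLStreamConstants.CHARACTERS:
                    String text = reader.getText().trim();
                    if (currentElement != null && !text.isEmpty()) {
                        tags.put(currentElement, text);
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    currentElement = null; // only leaf text is captured
                    break;
            }
        }
        reader.close();
        return tags;
    }
}
```

Inside the mapper you would call `EnvelopeParser.parse(xmlDataLine)` and emit one record per map of interest.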

  2. If the XML is cascaded as in the format below, then it becomes more interesting. You have to implement a custom InputFormat to split the cascaded XML into multiple <tns:envelope> ... </tns:envelope> records. No worries, there is an XmlInputFormat that works with this format of XML; it was originally created for the Apache Mahout project, but today there are multiple versions out there.

<cascadedXML>
<tns:envelope .... </tns:envelope>
<tns:envelope .... </tns:envelope>
<tns:envelope .... </tns:envelope>
.....
</cascadedXML>

OR

<cascadedXML><tns:envelope .... </tns:envelope><tns:envelope ....</tns:envelope><tns:envelope .... </tns:envelope> ..........</cascadedXML>
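With a Mahout-style XmlInputFormat on the classpath, the job driver mainly needs the start and end tags configured. This is a configuration sketch, not runnable as-is: it assumes the XMLDataMap mapper above and an XmlInputFormat class using the conventional "xmlinput.start"/"xmlinput.end" keys (the exact package name varies by which version you pick up).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class XmlParseDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // tell the input format where each record starts and ends
        conf.set("xmlinput.start", "<tns:envelope");
        conf.set("xmlinput.end", "</tns:envelope>");

        Job job = Job.getInstance(conf, "xml parse");
        job.setJarByClass(XmlParseDriver.class);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setMapperClass(XMLDataMap.class);
        job.setNumReduceTasks(0); // map-only job, as discussed in the comments
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```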

Note: I recommend looking at the Stack Overflow question (Not executing my hadoop mapper class while parsing xml in hadoop using XMLInputFormat), where I answered a similar question a few months back.

Also, refer to Alex Holmes' book Hadoop in Practice and the sample code from the book (Hadoop in Practice GitHub) for more insight.
