XOM canonicalization takes too long

Question

I have an XML file that can be as large as 1 GB. I am using XOM to avoid OutOfMemory exceptions.

I need to canonicalize the entire document, but the canonicalization takes a long time, even for a 1.5 MB file.

Here is what I have done:

I have this sample XML file and I increase the size of the document by replicating the Item node.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Packet id="some" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Head>
<PacketId>a34567890</PacketId>
<PacketHeadItem1>12345</PacketHeadItem1>
<PacketHeadItem2>1</PacketHeadItem2>
<PacketHeadItem3>18</PacketHeadItem3>
<PacketHeadItem4/>
<PacketHeadItem5>12082011111408</PacketHeadItem5>
<PacketHeadItem6>1</PacketHeadItem6>
</Head>
<List id="list">
    <Item>
        <Item1>item1</Item1>
        <Item2>item2</Item2>
        <Item3>item3</Item3>
        <Item4>item4</Item4>
        <Item5>item5</Item5>
        <Item6>item6</Item6>
        <Item7>item7</Item7>
    </Item>
</List>
</Packet>

The code I am using for canonicalization is as follows:

private static void canonXOM() throws Exception {
    String file = "D:\\PACKET.xml";
    FileInputStream xmlFile = new FileInputStream(file);

    Builder builder = new Builder(false);
    Document doc = builder.build(xmlFile);

    FileOutputStream fos = new FileOutputStream("D:\\canon.xml");
    Canonicalizer outputter = new Canonicalizer(fos);

    System.out.println("Query");
    Nodes nodes = doc.getRootElement().query("./descendant-or-self::node()|./@*");

    System.out.println("Canon");
    outputter.write(nodes);

    fos.close();
}

Although this code works well for small files, the canonicalization step takes about 7 minutes for a 1.5 MB file in my development environment (4 GB RAM, 64-bit Windows, Eclipse).

Any pointers to the cause of this delay are highly appreciated.

PS. I need to canonicalize segments from a whole XML document, as well as the whole document itself. So, using the document itself as the argument does not work for me.

Best

Your original XML file looks canonical to me. What is not canonical in it? — gibertoni, Dec 04 '12 at 16:38

whunmr · Answer 1 · 2012-12-05T13:34:16.727

1

Memory is not the bottleneck.

The main thread is green (running) and is never blocked. It uses as much CPU as it can.
Because my machine has multiple cores, total CPU usage is not at 100%,
but the single core that the main thread runs on is fully used.

Nodes.contains is the busiest method.

Internally, the nodes are kept in a List and compared linearly. The more items the List holds, the slower 'contains' becomes.

private final List nodes;
public boolean contains(Node node) {
    return nodes.contains(node);
}

So:

Try modifying the library's code to use a HashMap to hold the nodes.
Or use multiple threads to utilize more CPUs, if your XML can be split into smaller XML documents.
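
The HashMap idea can be illustrated without touching XOM itself. The sketch below uses only the JDK. FakeNode is a hypothetical stand-in for an XML node, and the code assumes nodes are compared by identity rather than by value. It contrasts one List.contains scan per node (quadratic overall) with a single copy into an identity-based hash set followed by hashed lookups.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

final class NodeMembership {

    /** Hypothetical stand-in for an XML node; not part of XOM. */
    static final class FakeNode {
        final int id;
        FakeNode(int id) { this.id = id; }
    }

    /** Builds n distinct placeholder nodes. */
    static List<FakeNode> makeNodes(int n) {
        List<FakeNode> nodes = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            nodes.add(new FakeNode(i));
        }
        return nodes;
    }

    /** One List.contains call per node: each call scans the list, so O(n^2) overall. */
    static int countViaList(List<FakeNode> selected, List<FakeNode> toCheck) {
        int hits = 0;
        for (FakeNode node : toCheck) {
            if (selected.contains(node)) {
                hits++;
            }
        }
        return hits;
    }

    /** Copies the selection into an identity-based set once, then uses hashed lookups. */
    static int countViaIdentitySet(List<FakeNode> selected, List<FakeNode> toCheck) {
        Set<FakeNode> lookup =
                Collections.newSetFromMap(new IdentityHashMap<FakeNode, Boolean>());
        lookup.addAll(selected);
        int hits = 0;
        for (FakeNode node : toCheck) {
            if (lookup.contains(node)) {
                hits++;
            }
        }
        return hits;
    }
}
```

Both methods return the same count; only the cost of each membership test differs, which is what matters once the node list holds hundreds of thousands of entries.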

tool: JVisualVM. http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/index.html

edited Dec 05 '12 at 13:34

answered Dec 05 '12 at 10:37

whunmr


Thanks for your answer. Very insightful. Do you think there is a more efficient way of canonicalizing the nodes? – artsince Dec 05 '12 at 12:59
Not sure. If you still want to use this library, maybe you can try modifying the library's code to use a HashMap to hold the nodes, or try some other libraries or methods. :) – whunmr Dec 05 '12 at 13:21
And if your XML can be split into smaller XML documents, you can use multiple threads to utilize more CPUs. Your app is already using nearly 100% of one CPU, so spreading the work over more CPUs should reduce its running time. – whunmr Dec 05 '12 at 13:29
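
The multi-threading suggestion could look roughly like the sketch below. It uses only the JDK, and the worker function is a placeholder for whatever canonicalizes one chunk; it is not an XOM call. Whether chunks can be canonicalized independently depends on the document (for example, on namespace declarations inherited from ancestors), so this only shows the threading part: submit each chunk to a fixed pool and join the results in the original order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelChunks {

    /**
     * Applies worker to every chunk on a fixed-size thread pool and
     * concatenates the results in the order the chunks were given.
     */
    static String processInOrder(List<String> chunks,
                                 Function<String, String> worker,
                                 int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (final String chunk : chunks) {
                Callable<String> job = () -> worker.apply(chunk);
                pending.add(pool.submit(job));
            }
            StringBuilder out = new StringBuilder();
            for (Future<String> result : pending) {
                out.append(result.get()); // blocks until this chunk is done
            }
            return out.toString();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a chunk", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a chunk failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```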

score 0 · Answer 2 · answered Dec 05 '12 at 13:58

Since you want the whole document serialized, can you just replace

Nodes nodes = doc.getRootElement().query("./descendant-or-self::node()|./@*");
outputter.write(nodes);

with

outputter.write(doc);

?

It looks like Canonicalizer does extra work (such as the nodes.contains() calls mentioned by whunmr) when given a node list instead of just a root node to canonicalize.

If that doesn't work or is not enough, I would fork Canonicalizer and make optimizations there as suggested by profiling.

xan


I actually need to canonicalize segments from an XML document. I realize that this example is a little misleading; I will have to canonicalize the List node. So I am not necessarily looking to canonicalize the entire document. – artsince Dec 06 '12 at 11:00

mayconbordin · Answer 3 · 2012-12-06T14:17:59.563

I may have a solution to your problem, if you're willing to give up on XOM. My solution consists of using the XPath API and Apache Santuario.

The difference in performance is impressive, so I thought it would be good to provide a comparison.

For the tests, I used the XML file you provided in your question, at a size of 1.5 MB.

The XOM Test

FileInputStream xmlFile = new FileInputStream("input.xml");

Builder builder = new Builder(false);
Document doc = builder.build(xmlFile);

FileOutputStream fos = new FileOutputStream("output.xml");
nu.xom.canonical.Canonicalizer outputter = new nu.xom.canonical.Canonicalizer(fos);

Nodes nodes = doc.getRootElement().query("./descendant-or-self::node()|./@*");
outputter.write(nodes);

fos.close();

The XPath/Santuario Test

org.apache.xml.security.Init.init();

DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
org.w3c.dom.Document doc = builder.parse("input.xml");

XPathFactory xpathFactory = XPathFactory.newInstance();
XPath xpath = xpathFactory.newXPath();

org.w3c.dom.NodeList result = (org.w3c.dom.NodeList) xpath.evaluate("./descendant-or-self::node()|./@*", doc, XPathConstants.NODESET);

Canonicalizer canon = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_OMIT_COMMENTS);
byte[] canonXmlBytes = canon.canonicalizeXPathNodeSet(result);

IOUtils.write(canonXmlBytes, new FileOutputStream(new File("output.xml")));
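
Since the question also asks about canonicalizing a segment such as the List element, the same XPath expression can be evaluated relative to that element instead of the document. The sketch below uses only the JDK's DOM and XPath APIs; the class name, the method name and the elementPath parameter are illustrative assumptions. The resulting NodeList could then be handed to canonicalizeXPathNodeSet as in the test above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SegmentNodeSet {

    /**
     * Parses the XML text, locates one element with elementPath, and returns
     * the node-set selected by the question's XPath relative to that element.
     */
    static NodeList selectSegment(String xml, String elementPath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            Node segment = (Node) xpath.evaluate(elementPath, doc, XPathConstants.NODE);
            if (segment == null) {
                throw new IllegalArgumentException("no node matches " + elementPath);
            }
            // Same expression as in the question, but with the segment as context node.
            return (NodeList) xpath.evaluate(
                    "./descendant-or-self::node()|./@*", segment, XPathConstants.NODESET);
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or query the XML", e);
        }
    }
}
```

Note that this expression selects the attributes of the context element only: the descendant axis never contains attribute nodes, so attributes on nested elements are not part of the node-set.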

The Results

Below is a table with the results in seconds. Each test was run 16 times.

╔═════════════════╦═════════╦═══════════╗
║      Test       ║ Average ║ Std. Dev. ║
╠═════════════════╬═════════╬═══════════╣
║ XOM             ║ 140.433 ║   4.851   ║
╠═════════════════╬═════════╬═══════════╣
║ XPath/Santuario ║ 2.4585  ║  0.11187  ║
╚═════════════════╩═════════╩═══════════╝
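
The harness used for these numbers is not shown. A minimal sketch of how an average and standard deviation over repeated runs could be gathered is given below; it assumes wall-clock timing with System.nanoTime and the sample (n - 1) standard deviation, neither of which is confirmed by the table.

```java
final class TimingStats {

    /** Runs the task the given number of times and returns each duration in seconds. */
    static double[] timeSeconds(Runnable task, int runs) {
        double[] seconds = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    /** Arithmetic mean of the samples. */
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Sample standard deviation (divides by n - 1); needs at least two samples. */
    static double sampleStdDev(double[] xs) {
        if (xs.length < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / (xs.length - 1));
    }
}
```

For 16 runs this would be called as TimingStats.timeSeconds(task, 16), where task wraps one of the two test snippets above.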

The difference in performance is huge, and it is related to the implementation of the XML Path Language (XPath). The downside of XPath/Santuario is that they are not as simple to use as XOM.

Test Details

Machine: Intel Core i5, 4 GB RAM
OS: Debian 6.0 64-bit
Java: OpenJDK 1.6.0_18 64-bit
XOM: 1.2.8
Apache Santuario: 1.5.3
