1

Here is the problem that I am trying to solve.

  1. I have two folders which contain XML files.
  2. One folder - let's say the "source" folder - contains around 350,000 XML files.
  3. Another folder - let's say the "compare" folder - contains the same 350,000 XML files and a few more.
  4. The 350,000 files that are present in both folders have exactly the same names.
  5. However, the files in "source" are slightly different from the files in "compare": the files in "compare" may (or may not) have some extra nodes.
  6. I need to compare the identically named files from "source" and "compare". If, for each file in "source", all the nodes present in the "source" file are also present in the corresponding "compare" file, I need to produce an OK report.
  7. If not, i.e. if either
     - there is some file in "source" that is not present in "compare", or
     - some file in "source" has a node that is not present in the corresponding file in "compare",
     then I need to create an error report with the details of what is missing.
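The missing-file part of the check could be sketched like this (a rough illustration only; the class and method names are made up for this example):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class MissingFiles {
    // Report every file name that exists in sourceDir but has no
    // identically named counterpart in compareDir.
    static List<String> missingIn(Path sourceDir, Path compareDir) throws IOException {
        List<String> missing = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(sourceDir)) {
            for (Path file : files) {
                if (!Files.exists(compareDir.resolve(file.getFileName()))) {
                    missing.add(file.getFileName().toString());
                }
            }
        }
        return missing;
    }
}
```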

I am currently pursuing Java + XMLUnit for this problem and am not sure whether it can solve it. Even if it can, I am definitely not sure it is the optimal choice of tool.

Any help / suggestion will be much appreciated.

partha
  • 2,286
  • 5
  • 27
  • 37

4 Answers

2

Personally I would just do a file compare on the whole folder, and then, once I had located the files that have the same name but a different size or checksum, check the nodes. There is no point checking a file's contents if it has the same name, same size and same checksum.
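A minimal sketch of the checksum idea using the JDK's built-in `MessageDigest` (the class and method names here are illustrative, not from any library):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Checksums {
    // Compute an MD5 digest of a file's bytes, returned as a lowercase hex string.
    // Pairs of files with equal names, sizes and checksums can be skipped
    // entirely during the node-by-node comparison.
    static String checksumOf(Path file) throws IOException, NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

This reads each file fully into memory, which is fine for typical XML files; for very large files a streaming `DigestInputStream` would be preferable.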

Woody
  • 5,052
  • 2
  • 22
  • 28
  • I like the checksum idea. I had not thought of it. What is a quick way for me to create checksum of the files? – partha Jun 01 '12 at 09:32
  • Last time I did it, my files were quite small, so I just went through them, opened them and made a checksum manually, but I was dealing with closer to 4,000 than 350,000 - I would imagine there is a fast md5 library – Woody Jun 01 '12 at 12:30
1

You need to proceed by steps.

  1. List your 350,000 files. The extra files in your "compare" folder are not relevant to your problem.
  2. Narrow down the number of files to compare by treating as identical those which are exactly the same byte for byte. You can simply load them and compare the resulting Strings.
  3. Compare the instances of your XML files in both of your folders. I think the best way to do that is to use XMLUnit. It should look something like this:

    Diff diff = new Diff(sourceXml, compareXml);
    if (diff.identical()) {
        // whatever you want to do
    }

Of course, this works best if your files are not too big.

Alexis Dufrenoy
  • 11,784
  • 12
  • 82
  • 124
  • 1
    Yes, to points 1 and 2. Some issue with point 3. The 1.xml is really a subset of 2.xml. There are a few nodes in 2.xml that do not exist in 1.xml. We just need to ensure that all the nodes in 1.xml exist in 2.xml. And I am struggling to make XMLUnit do that. Any suggestions how. – partha Jun 01 '12 at 15:15
1

Take a look at the DeltaXML product; it's probably cheaper than writing the code yourself.

Michael Kay
  • 156,231
  • 11
  • 92
  • 164
0

First things first. Let me go on record and say that XMLUnit is a gem. I loved it. If you are looking at some unit testing of XML values / attributes / structure etc., chances are that you will find a ready-made solution with XMLUnit. This is a good place to start from.

It is quite extensible. It already comes with an identity check (as in, the XMLs have the same elements and attributes in the same order) and a similarity check (as in, the XMLs have the same elements and attributes regardless of order).

However, in my case I was looking for a slightly different usage. I had a big-ish XML (a few hundred nodes) and a bunch of XML files (around 350,000 of them). I needed to exclude certain particular nodes from the comparison, nodes that I could identify with XPath. They were not necessarily always in the same position in the XML, but there was some generic way of identifying them with XPath. Sometimes, nodes were to be ignored based on the values of other nodes. Just to give some idea:

  1. The logic here is on the node that I want to ignore, i.e. price: /bookstore/book[price>35]/price

  2. The logic here is on a node at a relative position. I want to ignore author based on the value of price, and the two are related by position: /bookstore/book[price=30]/./author
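As an illustration (not the code I actually used), expressions like these can be evaluated with the JDK's built-in javax.xml.xpath; the class and method names below are made up for the example:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathDemo {
    // Evaluate an XPath expression against an XML string and
    // return the set of matching nodes.
    static NodeList select(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes()));
        return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
    }
}
```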

After much tinkering around, I settled for a low-tech solution. Before using XMLUnit to compare the files, I used XPath to mask the values of the nodes that were to be ignored.

    import java.io.File;
    import java.io.IOException;
    import java.util.List;
    import java.util.Set;

    import org.jdom2.Document;
    import org.jdom2.Element;
    import org.jdom2.JDOMException;
    import org.jdom2.filter.Filters;
    import org.jdom2.input.SAXBuilder;
    import org.jdom2.xpath.XPathExpression;
    import org.jdom2.xpath.XPathFactory;

    // Replaces the text of every node matched by the given XPath expressions
    // with the mask value, and returns the number of nodes masked.
    // Assumes a logger field is defined in the enclosing class. Note that the
    // modified document is only held in memory here; write it back out (e.g.
    // with XMLOutputter) if the masked file should be compared on disk.
    public static int massageData(File xmlFile, Set<String> xpaths, String mask)
            throws JDOMException, IOException {
        logger.debug("Data massaging started for " + xmlFile.getAbsolutePath());
        int counter = 0;

        Document doc = new SAXBuilder().build(xmlFile.getAbsolutePath());

        for (String xpath : xpaths) {
            logger.debug(xpath);
            XPathExpression<Element> xpathInstance = XPathFactory.instance()
                    .compile(xpath, Filters.element());
            List<Element> elements = xpathInstance.evaluate(doc);
            if (elements != null) {
                if (elements.size() > 1) {
                    logger.warn("Multiple matches were found for " + xpath
                            + " in " + xmlFile.getAbsolutePath()
                            + ". This could be a *potential* error.");
                }
                for (Element element : elements) {
                    logger.debug(element.getText());
                    element.setText(mask);
                    counter++;
                }
            }
        }
        return counter;
    }

Hope this helps.

partha
  • 2,286
  • 5
  • 27
  • 37