How to parse XMP in java when not valid XML?

Question

I am extracting metadata from PNG images using javax.imageio. This works fine. But the getAsTree method to get to the actual metadata returns XML that is invalid. So I don't know how to parse this XML in order to get certain metadata:

run:
Format name: javax_imageio_png_1.0
<javax_imageio_png_1.0>
    <IHDR width="256" height="256" bitDepth="8" colorType="RGBAlpha" compressionMethod="deflate" filterMethod="adaptive" interlaceMethod="none"/>
    <cHRM whitePointX="31269" whitePointY="32899" redX="63999" redY="33001" greenX="30000" greenY="60000" blueX="15000" blueY="5999"/>
    <gAMA value="45454"/>
    <iTXt>
        <iTXtEntry keyword="XML:com.adobe.xmp" compressionFlag="FALSE" compressionMethod="0" languageTag="" translatedKeyword="" text="<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.0-c061 64.140949, 2010/12/07-10:57:01        ">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:xmp="http://ns.adobe.com/xap/1.0/"
    xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
    xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
    xmlns:lr="http://ns.adobe.com/lightroom/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmp:MetadataDate="2012-12-05T21:36:19+01:00"
   xmpMM:InstanceID="xmp.iid:EF7F11740720681192B08F682498C71D"
   xmpMM:DocumentID="xmp.did:FC7F11740720681192B0AE5890E66CAE"
   xmpMM:OriginalDocumentID="xmp.did:FC7F11740720681192B0AE5890E66CAE">
   <xmpMM:History>
    <rdf:Seq>
     <rdf:li
      stEvt:action="saved"
      stEvt:instanceID="xmp.iid:FC7F11740720681192B0AE5890E66CAE"
      stEvt:when="2012-12-04T00:23:34+01:00"
      stEvt:changed="/metadata"/>
     <rdf:li
      stEvt:action="saved"
      stEvt:instanceID="xmp.iid:EF7F11740720681192B08F682498C71D"
      stEvt:when="2012-12-05T21:36:19+01:00"
      stEvt:changed="/metadata"/>
    </rdf:Seq>
   </xmpMM:History>
   <lr:hierarchicalSubject>
    <rdf:Bag>
     <rdf:li>Component|Software</rdf:li>
     <rdf:li>Places|Paris</rdf:li>
     <rdf:li>Product|Christensen</rdf:li>
     <rdf:li>Product|Simba</rdf:li>
    </rdf:Bag>
   </lr:hierarchicalSubject>
   <dc:subject>
    <rdf:Bag>
     <rdf:li>Christensen</rdf:li>
     <rdf:li>Paris</rdf:li>
     <rdf:li>Simba</rdf:li>
     <rdf:li>Software</rdf:li>
    </rdf:Bag>
   </dc:subject>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end="r"?>"/>
    </iTXt>
    <pHYs pixelsPerUnitXAxis="2835" pixelsPerUnitYAxis="2835" unitSpecifier="meter"/>
</javax_imageio_png_1.0>
Format name: javax_imageio_1.0
<javax_imageio_1.0>
    <Chroma>
        <ColorSpaceType name="RGB"/>
        <NumChannels value="4"/>
        <Gamma value="0.45453998"/>
        <BlackIsZero value="TRUE"/>
    </Chroma>
    <Compression>
        <CompressionTypeName value="deflate"/>
        <Lossless value="TRUE"/>
        <NumProgressiveScans value="1"/>
    </Compression>
    <Data>
        <PlanarConfiguration value="PixelInterleaved"/>
        <SampleFormat value="UnsignedIntegral"/>
        <BitsPerSample value="8 8 8 8"/>
    </Data>
    <Dimension>
        <PixelAspectRatio value="1.0"/>
        <ImageOrientation value="Normal"/>
        <HorizontalPixelSize value="0.35273367"/>
        <VerticalPixelSize value="0.35273367"/>
    </Dimension>
    <Text>
        <TextEntry keyword="XML:com.adobe.xmp" value="<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.0-c061 64.140949, 2010/12/07-10:57:01        ">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:xmp="http://ns.adobe.com/xap/1.0/"
    xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
    xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
    xmlns:lr="http://ns.adobe.com/lightroom/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmp:MetadataDate="2012-12-05T21:36:19+01:00"
   xmpMM:InstanceID="xmp.iid:EF7F11740720681192B08F682498C71D"
   xmpMM:DocumentID="xmp.did:FC7F11740720681192B0AE5890E66CAE"
   xmpMM:OriginalDocumentID="xmp.did:FC7F11740720681192B0AE5890E66CAE">
   <xmpMM:History>
    <rdf:Seq>
     <rdf:li
      stEvt:action="saved"
      stEvt:instanceID="xmp.iid:FC7F11740720681192B0AE5890E66CAE"
      stEvt:when="2012-12-04T00:23:34+01:00"
      stEvt:changed="/metadata"/>
     <rdf:li
      stEvt:action="saved"
      stEvt:instanceID="xmp.iid:EF7F11740720681192B08F682498C71D"
      stEvt:when="2012-12-05T21:36:19+01:00"
      stEvt:changed="/metadata"/>
    </rdf:Seq>
   </xmpMM:History>
   <lr:hierarchicalSubject>
    <rdf:Bag>
     <rdf:li>Component|Software</rdf:li>
     <rdf:li>Places|Paris</rdf:li>
     <rdf:li>Product|Christensen</rdf:li>
     <rdf:li>Product|Simba</rdf:li>
    </rdf:Bag>
   </lr:hierarchicalSubject>
   <dc:subject>
    <rdf:Bag>
     <rdf:li>Christensen</rdf:li>
     <rdf:li>Paris</rdf:li>
     <rdf:li>Simba</rdf:li>
     <rdf:li>Software</rdf:li>
    </rdf:Bag>
   </dc:subject>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end="r"?>" language="" compression="none"/>
    </Text>
    <Transparency>
        <Alpha value="nonpremultipled"/>
    </Transparency>
</javax_imageio_1.0>
BUILD SUCCESSFUL (total time: 3 seconds)

The invalid XML starts at the iTXtEntry element, which has the xpacket bit and encloses child elements although it has a self-closing tag format, instead of an end tag. So when I try to parse this using a DOM document and xpath, I get an error saying that this element cannot contain ">" in the content of the element.

I have disabled DTD validation on the DocumentBuilderFactory. This doesn't help. I feel like I'm down to using regex, but that doesn't seem right. Why do I get invalid XML in the first place from the getAsTree method in imageio, and what can I do about this?

Wouldn't you have to make it valid first before parsing? How you do that will all depend on what is making it invalid. — Hovercraft Full Of Eels, Dec 05 '12 at 22:58
According to [the getAsTree interface docs](http://docs.oracle.com/javase/6/docs/api/javax/imageio/metadata/IIOMetadata.html#getAsTree(java.lang.String)), this method returns a DOM node, not an xml string, so I'm not sure what you mean by it being "invalid"? — Francis Avila, Dec 05 '12 at 23:52
In other words, the metadata from `javax.imageio` is not an xml string that is parsed, it is an in-memory DOM tree, so it can't be invalid because invalidity pertains to parsing. If you mean that "this string I have posted to this page is invalid", then the problem is with whatever you are doing to serialize the DOM node you get. You should show that code. — Francis Avila, Dec 06 '12 at 00:09
@FrancisAvila: Yes, but as you already know, it's the getAsTree method on the IIOMetadata object that's creating the node, that's all. So there isn't much I can do there. Then I wanted to parse it, but discovered it was invalid. The errors are quite small, easily fixed if I paste it into an XML editor and do a couple of edits, but of course I cannot do that... — Anders, Dec 06 '12 at 00:13
@HovercraftFullOfEels: Yes, and as I mentioned in my previous comment, it's the getAsTree method that creates the node. And I would definitely like to make it valid first (preferably have it valid directly from the method of course), but I don't know how to make an invalid node valid in code... — Anders, Dec 06 '12 at 00:14
A node cannot be invalid. Only *serializations* are invalid. How are you turning the DOM into the text you have posted here? **That code** is the code you need to fix. There is nothing wrong with the `Node` tree from `getAsTree`. Show your serialization code, or use [this code](http://stackoverflow.com/a/2223105/1002469) as an example. — Francis Avila, Dec 06 '12 at 04:01
@FrancisAvila: Ah, ok, I see what you're saying now. I checked the code printing the xml, and in fact it does it correctly, but the problem is that the metadata I'm after is for some reason xml tags, and processing instructions called xpacket INSIDE an attribute... Which makes no sense. That's why the parser is complaining, because xml tags are inside an attribute (the "text" attribute of the element iTXtEntry). — Anders, Dec 06 '12 at 13:48
If the code printing the xml produces invalid xml, then it is *not* printing it correctly. A correct way to print it would be like `text="&lt;?xpacket begin=&quot;&quot; id=&quot;W5M0MpCehiHzreSzNTczkc9d&quot;?&gt;`, but it is clearly not doing that. Yes, it is odd to put a complete xml document as text into the value of an attribute, but that is *not* a barrier to validity. — Francis Avila, Dec 06 '12 at 14:35

Francis Avila · Accepted Answer · 2012-12-07T03:23:52.087

Your question is nonsensical because IIOMetadata.getAsTree() returns a DOM Node object which is the root of a Node tree. This is an in-memory representation of XML. It is not parsed from anywhere, so it can't be invalid. An XML document string can be invalid, but no string is being parsed here: the getAsTree method builds the tree directly in memory.

The problem is with your output producing invalid XML. Whatever is serializing your Node from getAsTree() is doing so incorrectly. Namely, it is not properly escaping the value of the text attribute, which is itself an XML document string.
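To make the escaping point concrete, here is a minimal sketch that does not touch imageio at all. It builds a one-element DOM by hand, stores markup in an attribute, and serializes it with a Transformer. The class name EscapeDemo and its serialize helper are hypothetical names chosen for this illustration, and the element merely imitates the iTXtEntry from the question.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of the original answer.
public class EscapeDemo {

    // Builds <iTXtEntry text="..."/> in memory and serializes it to a string.
    public static String serialize(String attributeValue) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element entry = doc.createElement("iTXtEntry");
        // The DOM stores the raw string; no escaping is needed at this point.
        entry.setAttribute("text", attributeValue);
        doc.appendChild(entry);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        // Escaping of markup characters happens here, during serialization.
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The attribute value is itself a small piece of XML markup.
        System.out.println(serialize("<a b=\"c\"/>"));
    }
}
```

The printed attribute value should contain character references such as &lt; and &quot; rather than raw markup, so the result can be parsed again. A hand-written printer that writes attribute values verbatim would produce output like the one shown in the question.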

Below is a complete example that demonstrates how to get image metadata and serialize to a (valid) XML string.

import java.io.*;
import java.util.*;

// for imageio metadata
import javax.imageio.*;
import javax.imageio.stream.*;
import javax.imageio.metadata.*;

// for xml handling
import org.w3c.dom.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;

public class imgmeta {
    // Very lazy exception handling
    // This is just a quick example
    public static void main(String[] args) throws Exception {
        String filename = args[0];

        File file = new File(filename);
        ImageInputStream imagestream = ImageIO.createImageInputStream(file);

        // get a reader which is able to read this file
        Iterator<ImageReader> readers = ImageIO.getImageReaders(imagestream);
        ImageReader reader = readers.next();

        // feed image to reader
        reader.setInput(imagestream, true);

        // get metadata of first image
        IIOMetadata metadata = reader.getImageMetadata(0);

        // get any metadata format name
        // (you should prefer the native one, but not all images have one)
        // String mdataname = metadata.getNativeMetadataFormatName(); // might be null
        String[] mdatanames = metadata.getMetadataFormatNames();

        String mdataname = mdatanames[0];

        Node metadatadom = metadata.getAsTree(mdataname);

        // metadatadom is now a DOM Node root of a DOM tree
        // representing metadata in the image
        // Since it's in-memory, it can't be "invalid":
        // there is no XML text here that had to be parsed


        // now let's serialize to an XML string
        // javax.xml.transform.Transformer takes xml sources
        // in one representation and transforms them to xml
        // in another representation
        // Representations include: DOM, JAXB, SAX, stream, etc
        DOMSource source = new DOMSource(metadatadom);

        StringWriter writer = new StringWriter();
        StreamResult result = new StreamResult(writer);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(source, result);

        // THIS is what you want:
        String metadata_in_xml = writer.toString();

        // now print it:
        System.out.print(metadata_in_xml);
    }
}

This is test output run using an image I had around:

$ java imgmeta testimage.png | xmllint --format -
<?xml version="1.0" encoding="UTF-8"?>
<javax_imageio_png_1.0>
  <IHDR width="149" height="237" bitDepth="8" colorType="RGBAlpha" compressionMethod="deflate" filterMethod="adaptive" interlaceMethod="none"/>
  <iTXt>
    <iTXtEntry keyword="XML:com.adobe.xmp" compressionFlag="0" compressionMethod="0" languageTag="" translatedKeyword="" text="&lt;?xpacket begin=&quot;?&quot; id=&quot;W5M0MpCehiHzreSzNTczkc9d&quot;?&gt; &lt;x:xmpmeta xmlns:x=&quot;adobe:ns:meta/&quot; x:xmptk=&quot;Adobe XMP Core 5.0-c061 64.140949, 2010/12/07-10:57:01        &quot;&gt; &lt;rdf:RDF xmlns:rdf=&quot;http://www.w3.org/1999/02/22-rdf-syntax-ns#&quot;&gt; &lt;rdf:Description rdf:about=&quot;&quot; xmlns:xmp=&quot;http://ns.adobe.com/xap/1.0/&quot; xmlns:xmpMM=&quot;http://ns.adobe.com/xap/1.0/mm/&quot; xmlns:stRef=&quot;http://ns.adobe.com/xap/1.0/sType/ResourceRef#&quot; xmp:CreatorTool=&quot;Adobe Photoshop CS5.1 Macintosh&quot; xmpMM:InstanceID=&quot;xmp.iid:D281E43D34DC11E2BFE69DA1E5D17E5F&quot; xmpMM:DocumentID=&quot;xmp.did:D281E43E34DC11E2BFE69DA1E5D17E5F&quot;&gt; &lt;xmpMM:DerivedFrom stRef:instanceID=&quot;xmp.iid:D281E43B34DC11E2BFE69DA1E5D17E5F&quot; stRef:documentID=&quot;xmp.did:D281E43C34DC11E2BFE69DA1E5D17E5F&quot;/&gt; &lt;/rdf:Description&gt; &lt;/rdf:RDF&gt; &lt;/x:xmpmeta&gt; &lt;?xpacket end=&quot;r&quot;?&gt;"/>
  </iTXt>
  <tEXt>
    <tEXtEntry keyword="Software" value="Adobe ImageReady"/>
  </tEXt>
</javax_imageio_png_1.0>

The XML produced is valid:

$ java imgmeta testimage.png | xmllint --noout -
$

(No output means valid.)

Notice how the value of the iTXtEntry's text attribute is escaped. If you want to retrieve the data inside this attribute, you need to retrieve the string and then parse that as its own XML document to get another DOM (or whatever) tree. This attribute: keyword="XML:com.adobe.xmp" is a signal that the value of the text attribute is an XML document with XMP data in it.
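The two trees printed in the question carry the same XMP string in different places: the text attribute of iTXtEntry in javax_imageio_png_1.0, and the value attribute of TextEntry in javax_imageio_1.0. A small lookup that tries both might look like the sketch below. XmpLocator and findXmp are hypothetical names, and the DOM in main is built by hand only to stand in for the tree that getAsTree would return.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the original answer.
public class XmpLocator {
    static final String XMP_KEYWORD = "XML:com.adobe.xmp";

    // Returns the XMP string from either metadata tree, or null if none is found.
    public static String findXmp(Element root) {
        String xmp = find(root, "iTXtEntry", "text");                // javax_imageio_png_1.0
        return xmp != null ? xmp : find(root, "TextEntry", "value"); // javax_imageio_1.0
    }

    // Scans all elements named tag for the XMP keyword and returns the given attribute.
    private static String find(Element root, String tag, String attribute) {
        NodeList entries = root.getElementsByTagName(tag);
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            if (XMP_KEYWORD.equals(entry.getAttribute("keyword"))) {
                return entry.getAttribute(attribute);
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for metadata.getAsTree("javax_imageio_1.0")
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("javax_imageio_1.0");
        Element text = doc.createElement("Text");
        Element entry = doc.createElement("TextEntry");
        entry.setAttribute("keyword", XMP_KEYWORD);
        entry.setAttribute("value", "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\"/>");
        text.appendChild(entry);
        root.appendChild(text);
        doc.appendChild(root);

        System.out.println(findXmp(root));
    }
}
```

The string returned is the unescaped attribute value, that is, the XMP document itself.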

UPDATE: parsing XMP data

Here is some sample code demonstrating extracting the attribute value and parsing it to and from XML and a DOM tree.

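import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XMPExample {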
public static String transformXML(Node xml) throws Exception {
    StringWriter writer = new StringWriter();

    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.transform(new DOMSource(xml), new StreamResult(writer));

    return writer.toString();
}

public static Document transformXML(String xml) throws Exception {
    StringReader reader = new StringReader(xml);
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    Transformer transformer = TransformerFactory.newInstance().newTransformer();

    transformer.transform(new StreamSource(reader), new DOMResult(doc));
    return doc;
}

public static String getXMP(Element metadata_dom) throws Exception {
    // The parameter is an Element (not a Node) because getElementsByTagName() is required

    // There are many more robust ways of selecting nodes
    // (e.g. javax.xml.xpath), but this is for a simple example
    // that only uses the native DOM methods

    // This is very brittle because we're making assumptions about
    // the metadata_dom structure. There are two sources of brittleness:

    // 1. The metadata format from `metadata.getMetadataFormatNames()`.
    //    You should probably settle on a standard one you know will
    //    exist, like 'javax_imageio_1.0'
    // 2. How the image stores the metadata. Usually XMP data will
    //    be in a text field with keyword 'XML:com.adobe.xmp', but
    //    I don't know that this is *always* the case.

    // the code below assumes "javax_imageio_png_1.0" format
    NodeList iTXtEntries = metadata_dom.getElementsByTagName("iTXtEntry");
    Element iTXtEntry = null;
    Element entry = null;
    for (int i = 0; i < iTXtEntries.getLength(); i++) {
        entry = (Element) iTXtEntries.item(i);
        if (entry.getAttribute("keyword").equals("XML:com.adobe.xmp")) {
            iTXtEntry = entry;
            break;
        }
    }
    if (iTXtEntry == null) {
        return null;
    }

    String xmp_xml_doc = iTXtEntry.getAttribute("text");

    return xmp_xml_doc;

}
}

// Use like so:
Node metadatanode = metadata.getAsTree(metadataname);

String xmp_xml = XMPExample.getXMP((Element) metadatanode);

// xmp_xml is now an xml document STRING
System.out.print(xmp_xml);

// If you want to parse it as an XML document, use an XML parser.
Document xmp_dom = XMPExample.transformXML(xmp_xml);

// ...and you can serialize it again when you are done.
String xmp_xml_roundtripped = XMPExample.transformXML(xmp_dom);
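
Once the XMP string is in hand, no regex is needed to strip the xpacket wrappers: they are ordinary processing instructions, which XML allows before and after the root element. The sketch below parses a shortened copy of the XMP from the question with a namespace-aware parser and reads the dc:subject keywords. XmpSubjects and subjects are hypothetical names, and the disallow-doctype-decl feature is an assumption about the JDK's built-in parser that is added only as a precaution for untrusted images.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original answer.
public class XmpSubjects {
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Parses an XMP packet string and returns the rdf:li values under dc:subject.
    public static List<String> subjects(String xmpXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for the *NS lookups below
        // Assumption: the parser supports this feature (the JDK's built-in one does).
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xmpXml)));

        List<String> result = new ArrayList<>();
        NodeList subjectNodes = doc.getElementsByTagNameNS(DC, "subject");
        for (int i = 0; i < subjectNodes.getLength(); i++) {
            Element subject = (Element) subjectNodes.item(i);
            NodeList items = subject.getElementsByTagNameNS(RDF, "li");
            for (int j = 0; j < items.getLength(); j++) {
                result.add(items.item(j).getTextContent().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Shortened version of the XMP packet shown in the question.
        String xmp =
              "<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n"
            + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n"
            + " <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
            + "  <rdf:Description rdf:about=\"\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\">\n"
            + "   <dc:subject><rdf:Bag>\n"
            + "    <rdf:li>Christensen</rdf:li>\n"
            + "    <rdf:li>Paris</rdf:li>\n"
            + "   </rdf:Bag></dc:subject>\n"
            + "  </rdf:Description>\n"
            + " </rdf:RDF>\n"
            + "</x:xmpmeta>\n"
            + "<?xpacket end=\"r\"?>";
        System.out.println(subjects(xmp)); // prints [Christensen, Paris]
    }
}
```

Matching on namespace URIs rather than on prefixes such as dc: keeps the lookup working if another tool writes the same data with different prefixes. This only covers the element form of dc:subject shown in the question; a dedicated XMP or RDF library would handle the other serializations.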

Ok, thanks. Yes this makes sense, and it works as you show it. So this explains my problem...but I'm not sure how to solve it still, because just as you guess, the stuff I want is inside that text attribute. But how can I parse that as an XML document without reverting to regex or something? It starts and ends with the xpacket processing instruction, do I need to remove those by regex to parse this as an xml document? And how about unescaping the escape characters back into tags? — Anders, Dec 06 '12 at 18:40
With the DOM representation (or other parsed XML representation of your choice), grab the value of the attribute (which will be a string). Then feed that string into an XML parser. I will expand this answer with an example of that if you want. (Because XMP is an rdf-based metadata system, you should probably look for a library that models RDF or XMP directly as it will be much easier to work with than mucking with that XML.) — Francis Avila, Dec 06 '12 at 19:28
