I am extracting metadata from PNG images using javax.imageio, and that part works fine. However, the XML I get from the getAsTree method, which should give me the actual metadata, is invalid, so I don't know how to parse it to get at specific metadata values:
run:
Format name: javax_imageio_png_1.0
<javax_imageio_png_1.0>
<IHDR width="256" height="256" bitDepth="8" colorType="RGBAlpha" compressionMethod="deflate" filterMethod="adaptive" interlaceMethod="none"/>
<cHRM whitePointX="31269" whitePointY="32899" redX="63999" redY="33001" greenX="30000" greenY="60000" blueX="15000" blueY="5999"/>
<gAMA value="45454"/>
<iTXt>
<iTXtEntry keyword="XML:com.adobe.xmp" compressionFlag="FALSE" compressionMethod="0" languageTag="" translatedKeyword="" text="<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.0-c061 64.140949, 2010/12/07-10:57:01 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
xmlns:lr="http://ns.adobe.com/lightroom/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmp:MetadataDate="2012-12-05T21:36:19+01:00"
xmpMM:InstanceID="xmp.iid:EF7F11740720681192B08F682498C71D"
xmpMM:DocumentID="xmp.did:FC7F11740720681192B0AE5890E66CAE"
xmpMM:OriginalDocumentID="xmp.did:FC7F11740720681192B0AE5890E66CAE">
<xmpMM:History>
<rdf:Seq>
<rdf:li
stEvt:action="saved"
stEvt:instanceID="xmp.iid:FC7F11740720681192B0AE5890E66CAE"
stEvt:when="2012-12-04T00:23:34+01:00"
stEvt:changed="/metadata"/>
<rdf:li
stEvt:action="saved"
stEvt:instanceID="xmp.iid:EF7F11740720681192B08F682498C71D"
stEvt:when="2012-12-05T21:36:19+01:00"
stEvt:changed="/metadata"/>
</rdf:Seq>
</xmpMM:History>
<lr:hierarchicalSubject>
<rdf:Bag>
<rdf:li>Component|Software</rdf:li>
<rdf:li>Places|Paris</rdf:li>
<rdf:li>Product|Christensen</rdf:li>
<rdf:li>Product|Simba</rdf:li>
</rdf:Bag>
</lr:hierarchicalSubject>
<dc:subject>
<rdf:Bag>
<rdf:li>Christensen</rdf:li>
<rdf:li>Paris</rdf:li>
<rdf:li>Simba</rdf:li>
<rdf:li>Software</rdf:li>
</rdf:Bag>
</dc:subject>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="r"?>"/>
</iTXt>
<pHYs pixelsPerUnitXAxis="2835" pixelsPerUnitYAxis="2835" unitSpecifier="meter"/>
</javax_imageio_png_1.0>
Format name: javax_imageio_1.0
<javax_imageio_1.0>
<Chroma>
<ColorSpaceType name="RGB"/>
<NumChannels value="4"/>
<Gamma value="0.45453998"/>
<BlackIsZero value="TRUE"/>
</Chroma>
<Compression>
<CompressionTypeName value="deflate"/>
<Lossless value="TRUE"/>
<NumProgressiveScans value="1"/>
</Compression>
<Data>
<PlanarConfiguration value="PixelInterleaved"/>
<SampleFormat value="UnsignedIntegral"/>
<BitsPerSample value="8 8 8 8"/>
</Data>
<Dimension>
<PixelAspectRatio value="1.0"/>
<ImageOrientation value="Normal"/>
<HorizontalPixelSize value="0.35273367"/>
<VerticalPixelSize value="0.35273367"/>
</Dimension>
<Text>
<TextEntry keyword="XML:com.adobe.xmp" value="<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.0-c061 64.140949, 2010/12/07-10:57:01 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
xmlns:lr="http://ns.adobe.com/lightroom/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmp:MetadataDate="2012-12-05T21:36:19+01:00"
xmpMM:InstanceID="xmp.iid:EF7F11740720681192B08F682498C71D"
xmpMM:DocumentID="xmp.did:FC7F11740720681192B0AE5890E66CAE"
xmpMM:OriginalDocumentID="xmp.did:FC7F11740720681192B0AE5890E66CAE">
<xmpMM:History>
<rdf:Seq>
<rdf:li
stEvt:action="saved"
stEvt:instanceID="xmp.iid:FC7F11740720681192B0AE5890E66CAE"
stEvt:when="2012-12-04T00:23:34+01:00"
stEvt:changed="/metadata"/>
<rdf:li
stEvt:action="saved"
stEvt:instanceID="xmp.iid:EF7F11740720681192B08F682498C71D"
stEvt:when="2012-12-05T21:36:19+01:00"
stEvt:changed="/metadata"/>
</rdf:Seq>
</xmpMM:History>
<lr:hierarchicalSubject>
<rdf:Bag>
<rdf:li>Component|Software</rdf:li>
<rdf:li>Places|Paris</rdf:li>
<rdf:li>Product|Christensen</rdf:li>
<rdf:li>Product|Simba</rdf:li>
</rdf:Bag>
</lr:hierarchicalSubject>
<dc:subject>
<rdf:Bag>
<rdf:li>Christensen</rdf:li>
<rdf:li>Paris</rdf:li>
<rdf:li>Simba</rdf:li>
<rdf:li>Software</rdf:li>
</rdf:Bag>
</dc:subject>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="r"?>" language="" compression="none"/>
</Text>
<Transparency>
<Alpha value="nonpremultipled"/>
</Transparency>
</javax_imageio_1.0>
BUILD SUCCESSFUL (total time: 3 seconds)
The invalid XML starts at the iTXtEntry element. Its text attribute holds the xpacket content, so the element appears to enclose child elements even though it is written as a self-closing tag rather than with a separate end tag. When I try to parse this output into a DOM document and query it with XPath, I get an error saying that this element cannot contain ">" in its content.
I have disabled DTD validation on the DocumentBuilderFactory, but that doesn't help. I feel like I'm down to using regular expressions, which doesn't seem right. Why do I get invalid XML from the getAsTree method in imageio in the first place, and what can I do about it?
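For reference, here is a minimal sketch of an alternative to re-parsing the printed text. It assumes the output above comes from a hand-written printer that writes attribute values without escaping them, which would mean the tree itself is fine and the XMP packet is simply the value of the text attribute. The class name XmpFromTree and the helpers findITXtText and subjects are hypothetical names made up for this illustration, and main builds a small stand-in tree rather than reading a real PNG.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.metadata.IIOMetadataNode;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of javax.imageio.
public class XmpFromTree {
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Depth-first search for an iTXtEntry with the given keyword.
    // Returns the value of its "text" attribute, or null if no entry matches.
    public static String findITXtText(Node node, String keyword) {
        if (node instanceof Element && "iTXtEntry".equals(node.getNodeName())) {
            Element entry = (Element) node;
            if (keyword.equals(entry.getAttribute("keyword"))) {
                return entry.getAttribute("text");
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            String found = findITXtText(c, keyword);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Parses the XMP packet as its own XML document and collects
    // the rdf:li values found under dc:subject.
    public static List<String> subjects(String xmp) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for the *NS lookups below
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xmp)));
        List<String> result = new ArrayList<>();
        NodeList subjectNodes = doc.getElementsByTagNameNS(DC, "subject");
        for (int i = 0; i < subjectNodes.getLength(); i++) {
            NodeList items = ((Element) subjectNodes.item(i))
                    .getElementsByTagNameNS(RDF, "li");
            for (int j = 0; j < items.getLength(); j++) {
                result.add(items.item(j).getTextContent().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the node returned by
        // reader.getImageMetadata(0).getAsTree("javax_imageio_png_1.0").
        IIOMetadataNode root = new IIOMetadataNode("javax_imageio_png_1.0");
        IIOMetadataNode iTXt = new IIOMetadataNode("iTXt");
        IIOMetadataNode entry = new IIOMetadataNode("iTXtEntry");
        entry.setAttribute("keyword", "XML:com.adobe.xmp");
        entry.setAttribute("text",
                "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
                + "<rdf:RDF xmlns:rdf=\"" + RDF + "\">"
                + "<rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC + "\">"
                + "<dc:subject><rdf:Bag>"
                + "<rdf:li>Paris</rdf:li><rdf:li>Simba</rdf:li>"
                + "</rdf:Bag></dc:subject>"
                + "</rdf:Description></rdf:RDF></x:xmpmeta>");
        iTXt.appendChild(entry);
        root.appendChild(iTXt);

        String xmp = findITXtText(root, "XML:com.adobe.xmp");
        System.out.println(subjects(xmp));
    }
}
```

The idea in this sketch is that the Node from getAsTree is already a DOM tree, so it can be walked directly, and only the XMP string needs a real XML parser. Whether that holds for the actual program depends on how the output above was generated.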