I am trying to process a product data feed that I download from the web. The download is done like so:

// Download the gzip-compressed feed and save it to disk unchanged.
URL website = new URL("http://some.products.com/format/xml/compression/gzip/");
try (ReadableByteChannel rbc = Channels.newChannel(website.openStream());
     FileOutputStream fos = new FileOutputStream("/opt/some/file.xml.gz")) {
    fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
}

Once the file is saved to the file system, its encoding appears to be ANSI.

I then read the file with a streaming processor like so:

GZIPInputStream gzis = new GZIPInputStream(new FileInputStream("/opt/some/file.xml.gz"));
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
XMLEventReader eventReader = inputFactory.createXMLEventReader(gzis);
while (eventReader.hasNext()) {
    XMLEvent event = eventReader.nextEvent();
    ...
}

Somewhere along the way, part of the text gets decoded, because it ends up containing a still-escaped apostrophe (for example `&#39;`) instead of a plain '.

That is, the escaped ampersand gets unescaped, but the second level of escaping does not seem to be dealt with, and I can't work out how or where I'm supposed to deal with it. Should I be trying to decode it when I'm reading the file, or should I do it after the XML has been parsed?
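
To make the symptom concrete, here is a minimal sketch of what I believe is happening. The feed fragment and the names `DoubleEscapeDemo` and `parseText` are made up for illustration; the only assumption is that the source data was escaped twice upstream, so the file contains `&amp;#39;` where an apostrophe was intended. The parser correctly removes one level of escaping and hands back the rest as literal text.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class DoubleEscapeDemo {

    // Concatenates all character data in the document, as the parser reports it.
    static String parseText(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to merge adjacent text chunks into one Characters event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isCharacters()) {
                text.append(event.asCharacters().getData());
            }
        }
        reader.close();
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        // Hypothetical feed fragment: the apostrophe was escaped twice upstream.
        String xml = "<name>Men&amp;#39;s shoes</name>";
        System.out.println(parseText(xml)); // prints: Men&#39;s shoes
    }
}
```

So the parser itself seems to behave as specified: `&amp;` becomes `&`, and the remaining `#39;` is ordinary character data as far as XML is concerned.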

Edit: I should note that these characters appear in text fields, not URLs.

user779420
  • I appear to have corrected it by passing the text to `StringEscapeUtils.unescapeXml(text)`, from Apache Commons, after I have parsed it through StAX. This appears to have solved the issue, but I'm not sure if it is the correct way. Should I use `unescapeHtml()`? – user779420 May 09 '14 at 01:04
  • You'd be better off posting this as an answer to your own question; then it disappears from the unanswered questions. – Bernd Ebertz May 09 '14 at 16:25
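
For anyone who wants to try the second unescaping pass described in the comment above without pulling in a library, the following is a rough standard-library sketch. It is not the Apache Commons implementation; `SecondPassUnescape` and `unescapeOnce` are hypothetical names, and it only handles numeric character references plus the five predefined XML entities. It is meant to be applied to text that has already come out of the parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SecondPassUnescape {

    // Matches decimal (&#39;), hex (&#x27;) and the five predefined XML entities.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|amp|lt|gt|quot|apos);");

    // Removes exactly one level of XML escaping from already-parsed text.
    static String unescapeOnce(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            out.append(decode(m.group(1), m.group()));
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    private static String decode(String name, String original) {
        switch (name) {
            case "amp": return "&";
            case "lt": return "<";
            case "gt": return ">";
            case "quot": return "\"";
            case "apos": return "'";
            default:
                try {
                    int codePoint = (name.charAt(1) == 'x' || name.charAt(1) == 'X')
                            ? Integer.parseInt(name.substring(2), 16)
                            : Integer.parseInt(name.substring(1), 10);
                    return Character.isValidCodePoint(codePoint)
                            ? new String(Character.toChars(codePoint))
                            : original;
                } catch (NumberFormatException e) {
                    return original; // number too large to parse; leave untouched
                }
        }
    }

    public static void main(String[] args) {
        // prints: Men's shoes & boots
        System.out.println(unescapeOnce("Men&#39;s shoes &amp; boots"));
    }
}
```

Doing this after parsing rather than on the raw file seems safer, since unescaping the raw bytes first could turn `&amp;lt;` into markup and break the document. If the feed also contains HTML-only entities such as `&nbsp;`, an XML-only pass like this one will leave them alone.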

0 Answers