The problem is that some XML files have no header (XML declaration) in them. When parsing such a file, vtd-xml assumes UTF-8 by default and throws an exception saying it cannot parse the document. (The files are encoded in ISO-8859-2, but there is no header declaring that.)

I tried setting -Dfile.encoding=iso-8859-2, but it does not help.

Question: How can I set the (default) encoding for the XML file?

Gaurav Dave
Ferenc Turi

1 Answer

For a single-byte encoding other than UTF-8, the XML spec mandates the encoding declaration. Otherwise, it is not a valid XML document.
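
For documents that fit in memory, one way to supply the missing declaration without rewriting the file on disk is to prepend it to the bytes just before parsing. The following is only a minimal sketch using the Java standard library; the class name `DeclarationPrepender` and the method `withDeclaration` are hypothetical names chosen for illustration, and the code assumes the input does not already begin with an XML declaration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of vtd-xml): builds a new byte array that
// starts with an XML declaration naming the encoding of the original bytes.
final class DeclarationPrepender {

    // Assumes rawXml does NOT already start with an XML declaration.
    static byte[] withDeclaration(byte[] rawXml, String encodingName) {
        // Throws an unchecked exception early if the JVM does not know the name.
        Charset.forName(encodingName);

        // The declaration is pure ASCII, so its bytes are identical in any
        // ASCII-compatible encoding such as ISO-8859-2.
        byte[] header = ("<?xml version=\"1.0\" encoding=\"" + encodingName + "\"?>\n")
                .getBytes(StandardCharsets.US_ASCII);

        // Concatenate header + original document bytes.
        byte[] out = new byte[header.length + rawXml.length];
        System.arraycopy(header, 0, out, 0, header.length);
        System.arraycopy(rawXml, 0, out, header.length, rawXml.length);
        return out;
    }
}
```

The resulting array could then be handed to a parser that accepts a byte array (for vtd-xml, presumably `VTDGen.setDoc(byte[])` followed by `parse(...)`), provided the parser supports the declared encoding. This copies the document once in memory, so it is not a fit for multi-gigabyte inputs.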

vtd-xml-author
  • OK, it's not valid, but it can be parsed. The only thing missing for that is the encoding. I don't want to resort to ugly workarounds like adding the missing header to the XML. That could be a very slow, time-consuming operation if you have to rewrite a 4 GB XML file. Vtd-xml should provide an API method for setting the missing encoding. – Ferenc Turi Mar 20 '15 at 12:33
  • I believe that adding the missing header is the right and easiest way to fix this. Adding options to set the encoding is not a rigorous approach: without an encoding declaration, you are implicitly setting the encoding to UTF-8. What if a coworker has to take over your work and spends extra hours figuring this out because he doesn't know the encoding? Isn't it simpler for you to just fix the issue permanently? – vtd-xml-author Mar 21 '15 at 03:11
  • When you have to process data that is XML but doesn't have the header, it isn't helpful to know that it isn't 100% valid. It still has to be processed, and even if I know the encoding, I can't set it explicitly through the API. – Ahto Luuri Aug 03 '15 at 11:01
  • If you have to do it for each file, then you are not generating those XML files correctly, and you have a bigger issue than I initially thought... – vtd-xml-author Aug 12 '15 at 04:16
  • Although an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them. In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration: – kimbert Apr 05 '17 at 21:22
  • The encoding declaration is never mandatory. You can use 'external character encoding information' to supply the encoding. In the text above, 'MIME headers' is just one example. A constant value in a program would be another valid source of external encoding information. – kimbert Apr 05 '17 at 21:25
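
To illustrate the "external character encoding information" idea from the last comment: with the JDK's built-in JAXP parser (not vtd-xml), the caller can decode the bytes with a known charset and hand the parser a character stream, so no encoding declaration is needed in the file. This is only a sketch under that assumption; `ExternalEncodingDemo` and its `parse` method are hypothetical names used for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class: the charset is supplied by the program,
// not by an XML declaration inside the document.
final class ExternalEncodingDemo {

    static Document parse(InputStream in, Charset charset) {
        // Decoding happens here, before the parser sees any data, so the
        // parser works on characters and does not need to guess an encoding.
        Reader reader = new InputStreamReader(in, charset);
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(reader));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

A call such as `ExternalEncodingDemo.parse(stream, Charset.forName("ISO-8859-2"))` would then read a header-less ISO-8859-2 document. Whether vtd-xml offers a comparable entry point is not established in this thread.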