
I am wondering whether it is possible to load a UTF-16 XML file with REXML 3.1.7.3 and Ruby 1.9.3.

Here is the XML data, saved without a BOM in a file named u16.xml:

<?xml version="1.0" encoding="utf-16"?>
<ArrayOfCatalogItem>
    <CatalogItem>
        <ID>bbe9b897-5d3b-4340-914b-fce8d6022bd9</ID>
        <Name>EmployeeReport</Name>
    </CatalogItem>
</ArrayOfCatalogItem>

I use the following code to load it:

require "rexml/document"
file = File.new( "u16.xml" )
begin
  doc = REXML::Document.new(file)
  puts "doc = #{doc.to_s}"
rescue => err
  puts "err = #{err.message}"
end

The output of the test run is:

err = #<REXML::ParseException: malformed XML: missing tag start
Line: 8
Position: 420
Last 80 unconsumed characters:
<?xml version="1.0" encoding="utf-16"?> <>
/Users/lucy/.rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/rexml/parsers/baseparser.rb:367:in `pull_event'
/Users/lucy/.rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/rexml/parsers/baseparser.rb:183:in `pull'
/Users/lucy/.rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/rexml/parsers/treeparser.rb:22:in `parse'
/Users/lucy/.rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/rexml/document.rb:245:in `build'
/Users/lucy/.rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/rexml/document.rb:43:in `initialize'
xml-test.rb:6:in `new'
xml-test.rb:6:in `<main>'
...
malformed XML: missing tag start
Line: 8
Position: 420
Last 80 unconsumed characters:
<?xml version="1.0" encoding="utf-16"?> <
Line: 8
Position: 420
Last 80 unconsumed characters:
<?xml version="1.0" encoding="utf-16"?> <

If I change the XML file to UTF-8 encoding, it loads successfully with the same code:

doc = <?xml version='1.0' encoding='UTF-8'?>
<ArrayOfCatalogItem>
    <CatalogItem>
        <ID>bbe9b897-5d3b-4340-914b-fce8d6022bd9</ID>
        <Name>EmployeeReport</Name>
    </CatalogItem>
</ArrayOfCatalogItem>

So is it possible to load a UTF-16 XML file with REXML? I have to use REXML as the parser in this case. Any suggestions would be appreciated.
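
One workaround that might be worth trying is to transcode the file contents to UTF-8 before handing them to REXML, so the parser never has to decode UTF-16 itself. The sketch below is only an illustration under stated assumptions: load_utf16_xml is a hypothetical helper name, and because the file has no BOM the byte order is assumed to be little-endian (UTF-16LE), which may not match the actual file.

```ruby
require "rexml/document"

# Hypothetical helper (not part of REXML): read the raw bytes, transcode
# them to UTF-8, and parse the resulting string.
# ASSUMPTION: the file is little-endian UTF-16; without a BOM the byte
# order cannot be detected, so pass "UTF-16BE" if the file is big-endian.
def load_utf16_xml(path, source_encoding = "UTF-16LE")
  # Read in binary mode so Ruby does not apply any default transcoding.
  raw = File.open(path, "rb") { |f| f.read }

  # Label the bytes with their real encoding, then convert to UTF-8.
  utf8 = raw.force_encoding(source_encoding).encode("UTF-8")

  # Make the XML declaration agree with the transcoded text.
  utf8 = utf8.sub(/encoding=(["'])utf-16\1/i, 'encoding="UTF-8"')

  REXML::Document.new(utf8)
end
```

With this helper, the loading code would become doc = load_utf16_xml("u16.xml"). The resulting document reports UTF-8 as its encoding, so if it has to be written back out as UTF-16 that conversion would need to be handled separately.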

Rock Hyrax
  • Is the file actually in UTF-16? If all you did was edit the DOCTYPE, then it probably has the wrong encoding in the doctype. – mmmmmm Nov 25 '13 at 16:06
  • The file is in UTF-16. When changing to UTF-8, I need to modify the encoding to UTF-8 in the first line of the file, then save the file with encoding UTF-8 in the text editor. – Rock Hyrax Nov 25 '13 at 17:54
