Can anyone please explain this result for me?

#!/usr/bin/env ruby
# encoding: utf-8

require 'rexml/document'

doc = REXML::Document.new(DATA)
puts "doc: #{doc.encoding}"
REXML::XPath.each(doc, '//item') do |item|
  puts "  #{item}: #{item.to_s.encoding}"
end

__END__
<doc>
  <item>Test</item>
  <item>Über</item>
  <item>8</item>
</doc>

Output:

doc: UTF-8
  <item>Test</item>: US-ASCII
  <item>Über</item>: UTF-8
  <item>8</item>: US-ASCII

It seems as if REXML doesn't care what the document encoding is, and starts autodetecting encoding for each item... Am I doomed to encode('UTF-8') each string I pull out of REXML, even though UTF-8 is the original encoding? What is happening here?

Amadan

1 Answer

You're calling Node.to_s() on your Element, which serializes the whole element as XML. To get the text content instead, add Element.get_text() to your chain (and call Text.to_s() on that):

puts "  #{item}: #{item.get_text.to_s.encoding}"

Output:

doc: UTF-8
  <item>Test</item>: UTF-8
  <item>Über</item>: UTF-8
  <item>8</item>: UTF-8
Darshan Rivka Whittle
  • Erm, that does not do what I want. I am trying to simulate `inner_html` (which AFAIK is missing in REXML), so I don't want the text node; I want the XML representation of the `item` element, which is what `to_s` gives me. The encoding you print is no longer the encoding of the string before the colon (which *is* an implicit `to_s`). (Also, AFAIK, if I did want the text, `.text` should be equivalent to `.get_text.to_s`...) – Amadan Apr 10 '13 at 07:18
  • Correct, REXML doesn't have Nokogiri's `inner_html`. `Element.text()` is equivalent to `Element.get_text().value()` which would indeed be better if you did want the text node. When `Node.to_s()` generates the string, it's doing it from scratch without regard to the encoding of the original file. Poking around the source, I see no way around that. (It's essentially `"" + "<" + node.name + ">" + ...`) – Darshan Rivka Whittle Apr 10 '13 at 08:11
  • Depending on what you need, you don't necessarily have a problem, by the way... the bits are the same, you just don't have the String metadata showing UTF-8. – Darshan Rivka Whittle Apr 10 '13 at 08:15
  • Yeah, I ended up going the `encode` route. (The problem was that in some cases a wonky encoding got detected, and then when `join`ing the pieces together I'd get an incompatible encoding error.) Still, I would have expected `to_s` to respect `Document.encoding`. :( Thanks. – Amadan Apr 10 '13 at 08:58
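
For anyone landing here later, the workaround discussed in the comments can be sketched roughly as below. This is only a sketch, not anything REXML provides: `as_utf8` and `inner_xml` are hypothetical helper names, and the sample document is made up. It assumes the document really is UTF-8, so that a string whose bytes are valid UTF-8 only needs relabeling, and it falls back to `encode` otherwise.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of REXML): return +str+ tagged as UTF-8.
def as_utf8(str)
  return str if str.encoding == Encoding::UTF_8

  # Work on a copy so the caller's string is left untouched.
  relabeled = str.dup.force_encoding(Encoding::UTF_8)
  # If the bytes are already valid UTF-8, only the encoding label was off.
  return relabeled if relabeled.valid_encoding?

  # Otherwise transcode from whatever encoding the string claims to have.
  # This can raise if the bytes cannot be converted.
  str.encode(Encoding::UTF_8)
end

# Hypothetical stand-in for Nokogiri's inner_html: serialize every child
# node of +element+ and normalize each piece before joining.
def inner_xml(element)
  pieces = element.children.map { |child| as_utf8(child.to_s) }
  # An empty join is not guaranteed to be UTF-8, so normalize once more.
  as_utf8(pieces.join)
end

# Made-up sample document, assumed to be UTF-8.
doc = REXML::Document.new('<doc><item>Test</item><item>Über <b>x</b></item></doc>')

REXML::XPath.each(doc, '//item') do |item|
  xml = inner_xml(item)
  puts "  #{xml}: #{xml.encoding}"
end
```

Normalizing each piece before the `join` is what sidesteps the incompatible encoding error mentioned above; whether relabeling or `encode` is the right call depends on where the odd encoding came from.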