Should REXML ignore indentation or whitespace?

I am debugging an issue with a simple HTML to Markdown converter. For some reason it fails on:

<blockquote><p>foo</p></blockquote>

But not on:

<blockquote>
  <p>foo</p>
</blockquote>

The reason is that in the first case type.children.first.value is not set, whereas in the second case it is. The original code can be found at the link above, but a condensed snippet that shows the problem is below:

require 'rexml/document'
include REXML

def parse_string(string)
  doc = Document.new("<root>\n"+string+"\n</root>")
  root = doc.root
  root.elements.each do |element|
    parse_element(element, :root)
  end
end

def parse_element(element, parent)
  @output = ''
  # ...
  @output << opening(element, parent)
  #...
end

def opening(type, parent)
  case type.name.to_sym
    #...
    when :blockquote
      # remove leading newline
      type.children.first.value = ""
      "> "
  end
end

# Parses just fine
puts parse_string("<blockquote>\n<p>foo</p>\n</blockquote>")

# Fails with undefined method `value=' for <p> ... </>:REXML::Element (NoMethodError)
puts parse_string("<blockquote><p>foo</p></blockquote>")

I am quite certain this is due to some setting that makes REXML sensitive to whitespace and indentation: why else would it parse the first XML differently from the second?

Can I force REXML to parse both the same? Or am I looking at a whole different kind of bug?

berkes
  • Show a code sample demonstrating the problem. Also, you probably should use [Nokogiri](http://nokogiri.org). It's a great XML/HTML parser that is rapidly becoming the de facto choice. – the Tin Man Mar 16 '11 at 19:15
  • I have added a condensed example. And about Nokogiri: I prefer that one too. But this is a script not by me, and I would like to simply fix it, instead of rewriting it to use a different XML library :) – berkes Mar 16 '11 at 19:38

1 Answer

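Try passing the option :ignore_whitespace_nodes => :all as the second argument to Document.new(). By default REXML keeps the whitespace between tags as text nodes, which is why the two inputs produce different trees.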
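
To make the difference concrete, here is a minimal sketch that parses both inputs with and without the option and inspects the first child of the blockquote. One caveat: once whitespace nodes are dropped, the first child is the <p> element for both inputs, so the value = "" assignment in the question would likely need a type check either way. The helper name clear_leading_text is hypothetical and not part of the original script.

```ruby
require 'rexml/document'

compact  = "<blockquote><p>foo</p></blockquote>"
indented = "<blockquote>\n  <p>foo</p>\n</blockquote>"

# Default parsing: whitespace between tags is kept as a REXML::Text node,
# so the first child of <blockquote> differs between the two inputs.
default_compact  = REXML::Document.new(compact).root.children.first
default_indented = REXML::Document.new(indented).root.children.first
puts default_compact.class
puts default_indented.class

# With :ignore_whitespace_nodes => :all, whitespace-only text nodes are
# dropped, so both inputs yield the same list of children.
opts = { :ignore_whitespace_nodes => :all }
strict_compact  = REXML::Document.new(compact, opts).root
strict_indented = REXML::Document.new(indented, opts).root
puts strict_compact.children.map(&:class).inspect
puts strict_indented.children.map(&:class).inspect

# Hypothetical helper (an assumption, not from the original script):
# only blank the first child when it really is a text node.
def clear_leading_text(element)
  first = element.children.first
  first.value = "" if first.is_a?(REXML::Text)
  element
end

# Neither call raises, regardless of how the input was indented.
guarded_compact  = clear_leading_text(REXML::Document.new(compact).root)
guarded_indented = clear_leading_text(REXML::Document.new(indented).root)
```

Which route fits better depends on the rest of the converter: the option changes how every element in the document is parsed, while the guard only touches the one place that assumes a leading text node.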