
I have an HTML file that I parse with Nokogiri, modify, and then write back out as HTML, like this:

require 'nokogiri'

htmltext = File.read('input.html')
h_doc = Nokogiri::HTML(htmltext)
# ... modify h_doc ...

File.open('output.html', 'w') do |file|
  file.write(h_doc.to_html)
end

The question is how to prevent Nokogiri from printing HTML character entities in the generated HTML file.

Instead of the entities (&lt; &gt; &amp; &nbsp;) I want it to print the actual characters (<, >, etc.).

For example, it currently prints the HTML like this:
 <title>&lt;%= ("/emailclient=sometext") %&gt;</title>
and I want the output to look like this:
<title><%= ("/emailclient=sometext")%></title>
user1788294

2 Answers


So... you want Nokogiri to output incorrect or invalid XML/HTML?

My best suggestion: replace those sequences with something else beforehand, process the document with Nokogiri, then replace them back. Your input is not XML/HTML, so there is no point expecting Nokogiri to know how to handle it correctly. Because look:

<div>To write "&amp;", you need to write "&amp;amp;".</div>

This renders:

To write "&", you need to write "&amp;".

If you had your way, you'd get this HTML:

<div>To write "&", you need to write "&amp;".</div>

which would render as:

To write "&", you need to write "&".

Even worse in this scenario, say, in XHTML:

<div>Use the &lt;script&gt; tag for JavaScript</div>

If you replace the entities, you get an undisplayable file because of the unclosed <script> tag:

<div>Use the <script> tag for JavaScript</div>
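
The replace-beforehand, replace-back suggestion at the start of this answer could be sketched roughly as follows. This is only an illustration in plain Ruby: `protect_erb`, `restore_erb`, and the `ERBTOKEN…X` token format are hypothetical names made up for this sketch, and it assumes that no such token already occurs in the document. The Nokogiri step would happen between the two calls.

```ruby
# Hypothetical helpers; the names and the token format are illustrative only.
ERB_TAG = /<%.*?%>/m

# Swap each ERB tag for an alphanumeric token that an HTML serializer
# has no reason to escape, and remember what each token stood for.
def protect_erb(html)
  table = {}
  protected_html = html.gsub(ERB_TAG) do |tag|
    token = "ERBTOKEN#{table.size}X"
    table[token] = tag
    token
  end
  [protected_html, table]
end

# Put the original ERB tags back after the document has been serialized.
def restore_erb(html, table)
  html.gsub(/ERBTOKEN\d+X/) { |token| table.fetch(token, token) }
end

source = '<title><%= ("/emailclient=sometext") %></title>'
protected_html, table = protect_erb(source)
# protected_html would be parsed, modified, and serialized here.
restored = restore_erb(protected_html, table)
```

Because the tokens contain only letters and digits, they should pass through parsing and serialization unchanged, so the final substitution can put the original tags back verbatim.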

EDIT: I still think you're trying to get Nokogiri to do something it is not designed to do: handle template HTML. I'd rather assume that your documents normally don't contain those sequences, and post-correct them:

doc.traverse do |node|
  if node.text?
    node.content = node.content.gsub(/^(\s*)(\S.+?)(\s*)$/,
                                     "\\1<%= \\2 %>\\3")
  end
end
puts doc.to_html.gsub('&lt;%=', '<%=').gsub('%&gt;', '%>')
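
A slightly stricter variant of the final substitution, as a sketch only: it un-escapes `&lt;%` and `%&gt;` only when they appear as a matching pair, so other escaped text is left alone. `unescape_erb_tags` is a hypothetical helper name, and the input string stands in for whatever `doc.to_html` returns.

```ruby
# Hypothetical helper: un-escape only complete &lt;% ... %&gt; pairs,
# leaving every other entity in the serialized HTML untouched.
def unescape_erb_tags(html)
  html.gsub(/&lt;%(.*?)%&gt;/m) { "<%#{Regexp.last_match(1)}%>" }
end

# Stand-in for the serialized document.
escaped = '<title>&lt;%= ("/emailclient=sometext") %&gt;</title><p>1 &lt; 2</p>'
result = unescape_erb_tags(escaped)
```

Note that this does not touch entities inside the tag body (an `&amp;` there stays escaped), and it would also convert a pair that was meant as literal text, so it only fits documents where that cannot happen.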
Amadan
  • I think there must be some way to do this. The original HTML is in the form sometext and I want it to be replaced like this: <%sometext%>. But I am getting this: &lt%;sometext%>. I seriously feel there must be some way. – user1788294 Sep 02 '14 at 06:34
  • http://stackoverflow.com/questions/4476047/how-to-make-nokogiri-not-to-convert-nbsp-to-space . This link talks about doing the reverse of what I want to do. – user1788294 Sep 02 '14 at 06:35
  • Just to add more info, I am changing the HTML text like this: h_doc.traverse do |x| if x.text? x.content ="<%" + x.content + "%>" end end end – user1788294 Sep 02 '14 at 06:40

You absolutely can prevent Nokogiri from transforming your entities. It's even a built-in function; no voodoo or hacking needed. Be warned: I'm not a Nokogiri guru, and I've only got this to work when acting directly on a node inside a document, but I'm sure a little digging can show you how to do it with a standalone node too.

When you create or load your document, you need to include the NOENT option. That's it. You're done; you can now add entities to your heart's content.

Note that there are about half a dozen ways to load a document with options; below is my personal favorite.

   require 'nokogiri'
   noko_doc = File.open('<my/doc/path>') { |f| Nokogiri.<XML_or_HTML>(f, &:noent)}
   xpath = '<selector_for_element>'
   noko_doc.at_<css_or_xpath>(xpath).set_attribute('I_can_now_safely_add_preformatted_entities!', '&amp;&amp;&amp;&amp;&amp;')
   puts noko_doc.at_xpath(xpath).attributes['I_can_now_safely_add_preformatted_entities!']
>>> &amp;&amp;&amp;&amp;&amp;

As for the usefulness of this feature: I find it incredibly useful. There are plenty of cases where you are dealing with preformatted data that you do not control, and it would be a serious pain to have to manage incoming entities just so Nokogiri could put them back the way they were.

JackChance