20

I fetch an HTML fragment like

"<li>市&nbsp;场&nbsp;价"

which contains "&nbsp;", but after calling to_s on the Nokogiri NodeSet, it becomes

"<li>市 场 价"

I want to keep the original HTML fragment. I tried setting the :save_with option on the to_s method, but that did not work.

Has anyone encountered the same problem, and can you help? Thank you in advance.

animuson
ywenbo

2 Answers

33

I encountered a similar situation, and what I came up with is a bit of a hack, but it seems to work well.

nbsp = Nokogiri::HTML("&nbsp;").text
text.gsub(nbsp, " ")

In my case, I wanted the nbsp to be a regular space. I think in your case, you want them to be returned to a "&nbsp;", so you could do something like:

nbsp = Nokogiri::HTML("&nbsp;").text
html.gsub(nbsp, "&nbsp;")
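A minimal sketch of the same idea without the extra parse, assuming the decoded character is U+00A0 (code point 160, as a later comment points out). The helper name `restore_nbsp` is hypothetical and not part of Nokogiri; the sample string only stands in for whatever Nokogiri returns:

```ruby
# The non-breaking space that "&nbsp;" decodes to (code point 160, U+00A0).
NBSP = "\u00A0"

# Hypothetical helper: turn literal non-breaking spaces back into the
# "&nbsp;" entity. Ordinary spaces (U+0020) are left untouched.
def restore_nbsp(html)
  html.gsub(NBSP, "&nbsp;")
end

serialized = "<li>市#{NBSP}场#{NBSP}价</li>" # stands in for Nokogiri's output
puts restore_nbsp(serialized)
# prints: <li>市&nbsp;场&nbsp;价</li>
```

Note that this only restores `&nbsp;`; any other entity that was decoded during parsing stays decoded.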
CEGRD
Mike Dotterer
  • 1
    That worked perfectly, and I've been busting my head against this for a few hours. Thank you! – Mike A Mar 21 '11 at 22:24
  • `def strip_html(str); nbsp = Nokogiri::HTML("&nbsp;").text; str.gsub(nbsp, ''); end` – leosok Apr 20 '13 at 23:06
  • 2
    This is not a generic solution. What if you want to preserve all HTML entities, like &mdash;, &quot;, etc.? – Zack Xu May 02 '13 at 16:05
  • 2
    If it bothers you to run the superfluous `nbsp = Nokogiri::HTML("&nbsp;").text`, you can get the pattern for your gsub with `nbsp = 160.chr(Encoding::UTF_8)`. 160 (U+00A0) is the code point of the non-breaking space; it's what Nokogiri returns when it parses '&nbsp;'. – JellicleCat Nov 20 '13 at 20:00
11

I think the problem is how you're looking at the string. It will look like a space, but it's not quite the same:

require 'nokogiri'

doc = Nokogiri::HTML('"<li>市&nbsp;场&nbsp;价"')
(doc % 'li').content.chars.to_a[1].ord # => 160
(doc % 'li').to_html # => "<li>市 场 价\"</li>"

A regular space is 32, 0x20 or ' '. 160 is the decimal value for a non-breaking space, which is what &nbsp; converts to after you use Nokogiri's various inner_text, content, text or to_s methods. It's no longer an HTML entity encoding, but it's still a non-breaking space. I think Nokogiri's conversion from the entity encoding is the appropriate behavior when asking for a stringification.
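To make the difference visible without Nokogiri, here is a small plain-Ruby sketch. The constant and helper names are made up for illustration, and the sample string is typed in by hand rather than produced by a parser:

```ruby
NBSP_CHAR   = "\u00A0" # what "&nbsp;" decodes to
PLAIN_SPACE = " "      # ordinary space

# Hypothetical helper: count the non-breaking spaces hiding in a string.
def nbsp_count(text)
  text.count(NBSP_CHAR)
end

sample = "市#{NBSP_CHAR}场#{PLAIN_SPACE}价"

p NBSP_CHAR.ord                   # => 160
p PLAIN_SPACE.ord                 # => 32
p NBSP_CHAR == PLAIN_SPACE        # => false
p nbsp_count(sample)              # => 1
p NBSP_CHAR.match?(/\s/)          # => false, \s only covers ASCII whitespace
p NBSP_CHAR.match?(/[[:space:]]/) # => true, the POSIX class is Unicode-aware
```

The last two lines are worth remembering: code that cleans up whitespace with `\s` or `strip` will silently leave these characters in place.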

There might be a flag to tell Nokogiri NOT to decode the value, but I'm not aware of one offhand. You can ask on Nokogiri's mailing list, which I mentioned in the comment above, to see if there is such a flag. I can also see an advantage in Nokogiri not decoding, so if there isn't a flag, it would occasionally be nice to have one.

Now, all that said, I think the to_html method SHOULD return the value to its entity-encoded form, since a non-breaking space is a nasty thing to encounter in an HTML stream. I think you should raise that on the mailing list, or maybe even report it as a bug. I think it's an inappropriate result.


http://groups.google.com/group/nokogiri-talk/msg/0b81ef0dc180dc74

Okay, I can explain the behavior now. Basically, the problem boils down to encoding.

In Ruby 1.9, we examine the encoding of the string you're feeding to Nokogiri. If the input string is "utf-8", the document is assumed to be a UTF-8 document. When you output the document, since "&nbsp;" can be represented as a UTF-8 character, it is output as that UTF-8 character.

In 1.8, since we cannot detect the encoding of the document, we assume binary encoding and allow libxml2 to detect the encoding. If you set the encoding of the input document to binary, it will give you back the entities you want. Here is some code to demo:

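 require 'nokogiri'

 html = '<body>hello &nbsp; world</body>'

 # Parse the string with the encoding Ruby reports for it.
 f    = Nokogiri.HTML(html)
 node = f.css('body')
 p node.inner_html

 # Parse the same markup after transcoding it to binary.
 f    = Nokogiri.HTML(html.encode('ASCII-8BIT'))
 node = f.css('body')
 p node.inner_html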

I posted a youtube video too! :-)

http://www.youtube.com/watch?v=X2SzhXAt7V4

Aaron Patterson

Your sample text isn't ASCII-8BIT so try changing that encoding string to the Unicode character set name and see if inner_html will return an entity-encoded value.
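As a rough plain-Ruby illustration of why the demo's `encode('ASCII-8BIT')` call cannot be reused unchanged on the Chinese sample: transcoding to binary only succeeds for ASCII-only text. The helper name below is hypothetical, and whether handing Nokogiri a binary-tagged string brings the entities back depends on the Nokogiri and libxml2 versions, which this sketch does not test:

```ruby
# Hypothetical helper: true when the string can be transcoded to ASCII-8BIT.
def transcodable_to_binary?(str)
  str.encode(Encoding::ASCII_8BIT)
  true
rescue EncodingError
  false
end

ascii_only = '<body>hello &nbsp; world</body>'
with_cjk   = '<li>市&nbsp;场&nbsp;价'

p transcodable_to_binary?(ascii_only) # => true
p transcodable_to_binary?(with_cjk)   # => false

# String#b returns a copy tagged as binary without changing any bytes.
binary_copy = with_cjk.b
p binary_copy.encoding == Encoding::ASCII_8BIT # => true
p binary_copy.bytes == with_cjk.bytes          # => true
```

`String#b` (or `force_encoding` on a copy) only relabels the bytes, so it works for any input, whereas `encode` attempts a real conversion and raises on characters that have no binary equivalent.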

the Tin Man
  • Thank you, I will try `to_html` as you suggest. – ywenbo Dec 18 '10 at 03:16
  • I just debugged and found that &nbsp; is converted by the code below: `Nokogiri::HTML(fragment, nil, @encoding, NOKOGIRI_OPTIONS).search(rule_body)`. I need to dig in and find out why, unless you can help me. – ywenbo Dec 18 '10 at 03:35
  • I recommend asking at nokogiri-talk. The developers monitor that list and can give you the definitive answer. – the Tin Man Dec 18 '10 at 03:38
  • OK, first I will try changing the Nokogiri parse options and see the results. If that doesn't help, I will ask at nokogiri-talk. :-) Thank you very much. – ywenbo Dec 18 '10 at 03:44
  • That failed again, so I need to ask nokogiri-talk. But in China I cannot access Google Groups (http://groups.google.com/group/nokogiri-talk); it is blocked by the government. Could you tell me if there is another way I can ask? Thank you. – ywenbo Dec 18 '10 at 03:47
  • The options control how liberal or strict the engine is when parsing the document. See [`Nokogiri::HTML::Document.parse`](http://nokogiri.org/Nokogiri/HTML/Document.html#method-c-parse) and [`Nokogiri::XML::ParseOptions`](http://nokogiri.org/Nokogiri/XML/ParseOptions.html). – the Tin Man Dec 18 '10 at 03:51
  • OK, I think I already have an answer, found in a response to a similar question by Aaron Patterson. See my edited answer. – the Tin Man Dec 18 '10 at 03:56
  • Great, there really is a STRICT parse mode. I added it to the parse options, but it had no effect, and I cannot find a literal parse mode. – ywenbo Dec 18 '10 at 04:02
  • By default HTML parsing is liberal and XML parsing is strict, mostly because HTML can be such a mess that using strict parsing will fail. The parsing mode won't affect the character set or whether entities are decoded though. – the Tin Man Dec 18 '10 at 04:07
  • I use `Nokogiri::HTML(fragment, nil, @encoding, NOKOGIRI_OPTIONS).search(rule_body)`, where `NOKOGIRI_OPTIONS = Nokogiri::XML::ParseOptions::NOERROR | Nokogiri::XML::ParseOptions::NOWARNING` and fragment is an HTML string. Is there anywhere that I switched to XML parse mode? I'm confused. – ywenbo Dec 18 '10 at 04:09
  • You do NOT want XML parsing because you are parsing HTML. Read the links in the sixth comment to my answer for more info. – the Tin Man Dec 18 '10 at 04:11
  • Yes, I just read Aaron Patterson's answer and understand that it's really caused by encoding. But here is my problem: if I don't set the encoding of the HTML, Nokogiri cannot parse it properly, so I must set the encoding of the HTML string before passing it to Nokogiri. Setting the encoding the way Patterson describes brings &nbsp; back as an entity, but what I want is the original &nbsp; even when I set the encoding of the HTML string myself. – ywenbo Dec 18 '10 at 04:38
  • "what i want is original nbsp even though i set the encoding for html string." Then I'd suggest using [Nokogiri issues](https://github.com/tenderlove/nokogiri/issues) to raise the question as I don't think it's possible but the developers are the final authority. – the Tin Man Dec 18 '10 at 05:37
  • OK, thank you very much for replying to my problem so many times. I will try. – ywenbo Dec 21 '10 at 01:37