227

I am trying to decode some HTML entities, such as '&lt;' becoming '<'.

I have an old gem (html_helpers) but it seems to have been abandoned twice.

Any recommendations? I will need to use it in a model.

wehal3001
Kostas
  • 6
    Just found 'htmlentities' (http://htmlentities.rubyforge.org/) – Kostas Oct 21 '09 at 12:45
  • I should specify that I get the html from a bunch of different sites and need to save it as plain text in the database – Kostas Oct 26 '09 at 13:18
  • 1
    While the most votes went to using CGI, don't. That's like pulling in all of Active Support to get a single method. Instead, use HTMLEntities, as mentioned in the selected answer. – the Tin Man May 15 '17 at 22:27

8 Answers

329

To encode the characters, you can use CGI.escapeHTML:

string = CGI.escapeHTML('test "escaping" <characters>')

To decode them, there is CGI.unescapeHTML:

CGI.unescapeHTML("test &quot;unescaping&quot; &lt;characters&gt;")

Of course, before that you need to include the CGI library:

require 'cgi'

And if you're in Rails, you don't need to use CGI to encode the string. There's the h method.

<%= h 'escaping <html>' %>
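A minimal round trip of the two CGI calls above, as a sketch with made-up sample strings. It also shows the limit that several comments point out: `CGI.unescapeHTML` decodes the basic markup entities and numeric references, but leaves named entities such as `&nbsp;` untouched.

```ruby
require 'cgi'

# Encode: characters that are special in HTML become entities.
escaped = CGI.escapeHTML('test "escaping" <characters>')

# Decode: the inverse operation restores the original string.
unescaped = CGI.unescapeHTML(escaped)

# Numeric character references are decoded as well.
apostrophe = CGI.unescapeHTML('I&#39;m')

# Named entities outside the basic set pass through unchanged.
untouched = CGI.unescapeHTML('foo&nbsp;bar')

puts escaped     # test &quot;escaping&quot; &lt;characters&gt;
puts unescaped   # test "escaping" <characters>
puts apostrophe  # I'm
puts untouched   # foo&nbsp;bar
```

If the input comes from arbitrary sites and may contain named entities, this alone is probably not enough.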
the Tin Man
Damien MATHIEU
  • 9
    I tried this approach first but it does not turn entities like "&nbsp;" into " ". I guess I should specify that I get the html from a bunch of different sites and need to save it as plain text in the database. – Kostas Oct 26 '09 at 12:59
  • 2
    If you are decoding HTML entities for storage as plain text in a database, then expect your database to do a lot of complaining about bad characters. Encoded entities are encoded to allow them to transfer as plain text. Decoding them can, and most likely will, revert them to upper-bit-set characters, AKA binary. Almost as likely, you could end up with multibyte characters which will really irritate a DB that is expecting plain text. You're better off decoding until nothing changes, then encode once so everything is normalized, then store them. – the Tin Man Dec 01 '10 at 21:13
  • 1
    I've encountered a lot of HTML with entities that have been encoded multiple times, really making a mess of things. Check out [loofah](https://github.com/flavorjones/loofah); Its scrubbers were designed for this if I remember right. – the Tin Man Dec 01 '10 at 21:16
  • 3
    We have set our database to save Unicode so I doubt it will complain at all. And loofah is not what I am looking for, I don't want to get rid of the html tags - not at this point anyway. – Kostas Jan 11 '11 at 00:46
  • 2
    it's 2015, unescapeHTML still omits some of the entities such as A acute – nurettin Jan 06 '15 at 10:13
  • How does the j method compare to the h method? The j method might be useful if you're going to put the string into a view using Javascript. Example: <%=j raw(@blog.title) %> – user1515295 Oct 28 '15 at 05:25
  • 1
    It's 2017, unescapeHTML still omits a bunch of textual entities such as `&aacute;`, `&nbsp;`, etc. It's suited only for pure HTML-related entities, or hex-based ones. – igorsantos07 Mar 09 '17 at 19:17
  • Note: `CGI::escapeHTML` **doesn't escape German** characters like äöüß, and maybe more ... – Beauty Apr 06 '17 at 20:29
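The "decode until nothing changes, then encode once" idea from the comments above could be sketched roughly as follows. `decode_until_stable` and its `max_passes` cap are hypothetical names introduced here for illustration, and the sketch uses `CGI.unescapeHTML`, so it only handles the entities CGI knows about.

```ruby
require 'cgi'

# Hypothetical helper: unescape repeatedly until a pass changes nothing.
# max_passes is an illustrative safety cap, not part of any library API.
def decode_until_stable(text, max_passes: 10)
  current = text
  max_passes.times do
    decoded = CGI.unescapeHTML(current)
    return decoded if decoded == current
    current = decoded
  end
  current
end

# Sample input that has been entity-encoded more than once.
double_encoded = '&amp;amp;lt;b&amp;amp;gt;bold&amp;amp;lt;/b&amp;amp;gt;'

plain = decode_until_stable(double_encoded) # fully decoded text
normalized = CGI.escapeHTML(plain)          # encoded exactly once

puts plain       # <b>bold</b>
puts normalized  # &lt;b&gt;bold&lt;/b&gt;
```

One caveat: this also collapses text that was deliberately written as a literal entity, so it fits best when the input is known to be over-encoded.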
167

HTMLEntities can do it:

: jmglov@laurana; sudo gem install htmlentities
Successfully installed htmlentities-4.2.4
: jmglov@laurana;  irb
irb(main):001:0> require 'htmlentities'
=> []
irb(main):002:0> HTMLEntities.new.decode "&iexcl;I&#39;m highly&nbsp;annoyed with character references!"
=> "¡I'm highly annoyed with character references!"
dstarh
Ivailo Bardarov
  • Zdrasti Ivailo. Thanks for your comment; it solved my problem over on [How can I render XML character entity references in Ruby?](http://stackoverflow.com/questions/5262265/how-can-i-render-xml-character-entity-references-in-ruby) as well! – Josh Glover Mar 11 '11 at 09:41
  • 5
    Yup, the `HTMLEntities` gem deals with cases such as `&aring;` and `&mdash;` which `CGI.unescapeHTML` does not. – thomax Dec 01 '14 at 08:14
56

I think the Nokogiri gem is also a good choice. It is very stable and has a huge contributing community.

Samples:

require 'nokogiri'

a = Nokogiri::HTML.parse "foo&nbsp;b&auml;r"
a.text
=> "foo bär"

or

a = Nokogiri::HTML.parse "&iexcl;I&#39;m highly&nbsp;annoyed with character references!"
a.text
=> "¡I'm highly annoyed with character references!"
Masa Sakano
Hoang Le
  • 3
    @theTinMan, yeah I think it depends on the demand. As you can see through the discussions in this topic, `CGI.escapeHTML` may be unable to solve some cases. On the other hand, if you need a full set of support, I'm sure `Nokogiri` is a good choice. – Hoang Le Oct 09 '15 at 04:30
  • 6
    Plus if you're already using Nokogiri for some HTML parsing, it's unreasonable to install yet another gem solely for that purpose. For instance, I'm using the Sanitize gem for cleaning up HTML. Turns out this gem is using Nokogiri under the hood and so it'd be a shame not to take advantage of that. Thanks @HoangLe for the tip! – Tomalla Sep 07 '16 at 10:40
  • 1
    Note: `CGI::escapeHTML` doesn't escape German characters like äöüß, and maybe more ... I haven't checked with Nokogiri yet, but this would be a plus point. – Beauty Apr 06 '17 at 20:34
  • 1
    HTMLEntities would be a lightweight, and capable choice. I use Nokogiri a lot, and, unless I already have it loaded, I'd go with HTMLEntities. CGI is out of date. – the Tin Man May 15 '17 at 22:28
40

To decode characters in Rails use:

<%= raw '<html>' %>

So,

<%= raw '&lt;br&gt;' %>

would output

<br>
Sidhannowe
memonk
  • 5
    This only works in the view though. I need something that works in ActiveRecord too. – Kostas Jan 11 '11 at 00:45
  • 4
    Just tested in debugger - raw '&lt br &gt' ==> '&lt br &gt'. – Will Tomlins Dec 14 '11 at 12:36
  • 16
    `#raw` doesn't decode anything. It tells the view *not* to encode the string. It does this by wrapping the string in a `ActiveSupport::SafeBuffer`, which in turn has a flag (`html_safe?`), set to true. The view uses this flag to determine that the string can be injected directly into the HTML without being escaped. I like to think of `html_safe` as an indication by the programmer that the string in question has already been properly escaped. – Moxley Stratton Nov 27 '13 at 00:20
9

If you don't want to add a new dependency just to do this (like HTMLEntities) and you're already using Hpricot, it can both escape and unescape for you. It handles much more than CGI:

Hpricot.uxs "foo&nbsp;b&auml;r"
=> "foo bär"
Jason L Perry
0

You can use the htmlascii gem:

Htmlascii.convert string
Nakilon
kartouch
0

In Rails we can use ERB::Util.html_escape and ERB::Util.url_encode.
In views, these are aliased as h and u.

http://ruby-doc.org/stdlib-1.9.3/libdoc/erb/rdoc/ERB/Util.html
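A short sketch of both helpers outside a view, using plain Ruby with only the `erb` standard library and made-up sample strings. Note that both calls encode. As far as I know, `ERB::Util` has no matching decode method, so this covers only the escaping half of the problem.

```ruby
require 'erb'

# html_escape (aliased as h in views) encodes HTML-special characters.
html = ERB::Util.html_escape('<a href="x">Tom & Jerry</a>')

# url_encode (aliased as u in views) percent-encodes a URL component.
query = ERB::Util.url_encode('a b&c')

puts html   # &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
puts query  # a%20b%26c
```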

Timothy Alexis Vass
-4
<% str="<h1> Test </h1>" %>

result: &lt; h1 &gt; Test &lt; /h1 &gt;

<%= CGI.unescapeHTML(str).html_safe %>
Usman
  • I think that by adding html_safe on any user-entered text, you are telling the view that it is safe when it's possible that it's not safe. This would put your users at risk when they load that view. – user1515295 Oct 28 '15 at 05:22
  • I don't know why so negative. I tried all solutions in this question. Only this works fine. About HTML safe, the user WANTS to render the HTML, then HTML_SAFE is correct. – Diego Somar Sep 16 '17 at 16:09