How do I encode/decode HTML entities in Ruby?

Question

I am trying to decode some HTML entities, such as '&lt;' becoming '<'.

I have an old gem (html_helpers) but it seems to have been abandoned twice.

Any recommendations? I will need to use it in a model.

Just found 'htmlentities' (http://htmlentities.rubyforge.org/) — Kostas, Oct 21 '09 at 12:45
I should specify that I get the html from a bunch of different sites and need to save it as plain text in the database — Kostas, Oct 26 '09 at 13:18
While the most votes went to using CGI, don't. That's like pulling in all of Active Support to get a single method. Instead, use HTMLEntities, as mentioned in the selected answer. — the Tin Man, May 15 '17 at 22:27

score 329 · Answer 1 · edited Nov 29 '14 at 22:55

To encode the characters, you can use CGI.escapeHTML:

string = CGI.escapeHTML('test "escaping" <characters>')

To decode them, there is CGI.unescapeHTML:

CGI.unescapeHTML("test &quot;unescaping&quot; &lt;characters&gt;")

Of course, before that you need to include the CGI library:

require 'cgi'
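
As a rough illustration of the two calls above, here is a minimal round-trip sketch using only the standard library. The sample strings are made up for the example. It also shows the limitation raised in the comments: as far as I know, `CGI.unescapeHTML` decodes only the basic markup entities and numeric references, and leaves named entities such as `&nbsp;` untouched.

```ruby
require 'cgi'

# Encode: the characters that are special in HTML become entities.
escaped = CGI.escapeHTML('test "escaping" <characters>')
puts escaped
# test &quot;escaping&quot; &lt;characters&gt;

# Decode: basic entities and numeric (decimal/hex) references are converted,
# but named entities such as &nbsp; and &eacute; pass through unchanged.
decoded = CGI.unescapeHTML('&lt;b&gt; &amp; &#233; &#xe9; &nbsp; &eacute;')
puts decoded
# <b> & é é &nbsp; &eacute;
```

If the input may contain named entities beyond the basic set, this is the point where a dedicated library becomes worthwhile.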

And if you're in Rails, you don't need to use CGI to encode the string. There's the h method.

<%= h 'escaping <html>' %>

edited Nov 29 '14 at 22:55

the Tin Man

answered Oct 21 '09 at 12:46

Damien MATHIEU

9

I tried this approach first but it does not turn entities like "&nbsp;" into " ". I guess I should specify that I get the html from a bunch of different sites and need to save it as plain text in the database. – Kostas Oct 26 '09 at 12:59
2

If you are decoding HTML entities for storage as plain text in a database, then expect your database to do a lot of complaining about bad characters. Encoded entities are encoded to allow them to transfer as plain text. Decoding them can, and most likely will, revert them to upper-bit-set characters, AKA binary. Almost as likely, you could end up with multibyte characters which will really irritate a DB that is expecting plain text. You're better off decoding until nothing changes, then encode once so everything is normalized, then store them. – the Tin Man Dec 01 '10 at 21:13
1

I've encountered a lot of HTML with entities that have been encoded multiple times, really making a mess of things. Check out [loofah](https://github.com/flavorjones/loofah); its scrubbers were designed for this, if I remember right. – the Tin Man Dec 01 '10 at 21:16
3

We have set our database to save Unicode so I doubt it will complain at all. And loofah is not what I am looking for, I don't want to get rid of the html tags - not at this point anyway. – Kostas Jan 11 '11 at 00:46
2

It's 2015, and unescapeHTML still omits some of the entities, such as `&Aacute;` – nurettin Jan 06 '15 at 10:13
How does the j method compare to the h method? The j method might be useful if you're going to put the string into a view using Javascript. Example: <%=j raw(@blog.title) %> – user1515295 Oct 28 '15 at 05:25
1

It's 2017, and unescapeHTML still omits a bunch of textual entities such as `&aacute;`, `&nbsp;`, etc. It's suited only for pure HTML-related entities, or hex-based ones. – igorsantos07 Mar 09 '17 at 19:17
Note: `CGI::escapeHTML` **doesn't escape German** characters like äöüß, and maybe more ... – Beauty Apr 06 '17 at 20:29
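
The "decode until nothing changes, then encode once" suggestion from the comments above could be sketched roughly as follows. `normalize_entities` and its `max_passes` parameter are hypothetical names invented for this illustration, not part of any library, and the sketch uses `CGI`, so it only normalizes the entities that `CGI.unescapeHTML` understands.

```ruby
require 'cgi'

# Hypothetical helper: repeatedly decode until the text stops changing,
# then encode exactly once so the result is escaped a single time.
# max_passes is an assumed safety limit for this sketch.
def normalize_entities(text, max_passes: 10)
  current = text
  max_passes.times do
    decoded = CGI.unescapeHTML(current)
    break if decoded == current # fixed point reached
    current = decoded
  end
  CGI.escapeHTML(current)
end

# "<b>" that was escaped three times collapses to a single level of escaping.
puts normalize_entities('&amp;amp;lt;b&amp;amp;gt;')
# &lt;b&gt;
```

One caveat of this approach: text that was deliberately meant to display a literal entity (for example a tutorial showing `&amp;lt;`) gets flattened too, so it fits best when the input is known to be accidentally over-encoded.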

score 167 · Accepted Answer · edited Jul 16 '14 at 14:31

HTMLEntities can do it:

: jmglov@laurana; sudo gem install htmlentities
Successfully installed htmlentities-4.2.4
: jmglov@laurana;  irb
irb(main):001:0> require 'htmlentities'
=> []
irb(main):002:0> HTMLEntities.new.decode "&iexcl;I&#39;m highly&nbsp;annoyed with character references!"
=> "¡I'm highly annoyed with character references!"

edited Jul 16 '14 at 14:31

dstarh

answered Mar 06 '11 at 14:19

Ivailo Bardarov

Zdrasti Ivailo. Thanks for your comment; it solved my problem over on [How can I render XML character entity references in Ruby?](http://stackoverflow.com/questions/5262265/how-can-i-render-xml-character-entity-references-in-ruby) as well! – Josh Glover Mar 11 '11 at 09:41
5

Yup, the `HTMLEntities` gem deals with cases such as `&aring;` and `&mdash;` which `CGI.unescapeHTML` does not. – thomax Dec 01 '14 at 08:14

score 56 · Answer 3 · edited Jul 18 '15 at 01:35

I think the Nokogiri gem is also a good choice. It is very stable and has a large community of contributors.

Samples:

require 'nokogiri'

a = Nokogiri::HTML.parse "foo&nbsp;b&auml;r"
a.text
=> "foo bär"

or

a = Nokogiri::HTML.parse "&iexcl;I&#39;m highly&nbsp;annoyed with character references!"
a.text
=> "¡I'm highly annoyed with character references!"

edited Jul 18 '15 at 01:35

Masa Sakano

answered Dec 18 '14 at 08:27

Hoang Le

3

@theTinMan, yeah, I think it depends on the need. As you can see from the discussions in this topic, `CGI.escapeHTML` may be unable to handle some cases. On the other hand, if you need full support, I'm sure `Nokogiri` is a good choice. – Hoang Le Oct 09 '15 at 04:30
6

Plus, if you're already using Nokogiri for some HTML parsing, it's unreasonable to install yet another gem solely for that purpose. For instance, I'm using the Sanitize gem for cleaning up HTML. It turns out this gem uses Nokogiri under the hood, so it'd be a shame not to take advantage of that. Thanks @HoangLe for the tip! – Tomalla Sep 07 '16 at 10:40
1

Note: `CGI::escapeHTML` doesn't escape German characters like äöüß, and maybe more ... I haven't checked this with Nokogiri yet, but it would be a plus point. – Beauty Apr 06 '17 at 20:34
1

HTMLEntities would be a lightweight, and capable choice. I use Nokogiri a lot, and, unless I already have it loaded, I'd go with HTMLEntities. CGI is out of date. – the Tin Man May 15 '17 at 22:28

score 40 · Answer 4 · edited Aug 02 '16 at 13:27

To decode characters in Rails use:

<%= raw '<html>' %>

So,

<%= raw '&lt;br&gt;' %>

would output

<br>

edited Aug 02 '16 at 13:27

Sidhannowe

answered Nov 20 '10 at 21:59

memonk

5

This only works in the view though. I need something that works in ActiveRecord too. – Kostas Jan 11 '11 at 00:45
4

Just tested in debugger - raw '&lt br &gt' ==> '&lt br &gt'. – Will Tomlins Dec 14 '11 at 12:36
16

`#raw` doesn't decode anything. It tells the view *not* to encode the string. It does this by wrapping the string in an `ActiveSupport::SafeBuffer`, which in turn has a flag (`html_safe?`) set to true. The view uses this flag to determine that the string can be injected directly into the HTML without being escaped. I like to think of `html_safe` as an indication by the programmer that the string in question has already been properly escaped. – Moxley Stratton Nov 27 '13 at 00:20

score 9 · Answer 5 · answered Dec 06 '11 at 18:13

If you don't want to add a new dependency just to do this (like HTMLEntities) and you're already using Hpricot, it can both escape and unescape for you. It handles much more than CGI:

Hpricot.uxs "foo&nbsp;b&auml;r"
=> "foo bär"

answered Dec 06 '11 at 18:13

Jason L Perry

5

Note for people looking at this now - Hpricot is no longer maintained. – SamStephens Jun 02 '13 at 02:20
2

Use [Nokogiri](http://nokogiri.org), which is the de facto standard for XML/HTML parsing, instead of Hpricot. – the Tin Man Sep 02 '14 at 22:32

score 0 · Answer 6 · edited Feb 10 '14 at 15:16

You can use htmlascii gem:

Htmlascii.convert string

edited Feb 10 '14 at 15:16

Nakilon

answered Dec 03 '13 at 09:04

kartouch

score 0 · Answer 7 · answered Jan 02 '23 at 12:44

In Rails we can use: ERB::Util.html_escape and ERB::Util.url_encode.
In views, these are aliased as h and u

http://ruby-doc.org/stdlib-1.9.3/libdoc/erb/rdoc/ERB/Util.html
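
For reference, a minimal sketch of the two helpers named above. `ERB::Util` ships with Ruby's standard library, so this should run outside Rails as well. The input strings are invented for the example. Note that both methods only encode; as far as I know `ERB::Util` has no decoding counterpart.

```ruby
require 'erb'

# html_escape (aliased as h) escapes the characters that are special in HTML.
html = ERB::Util.html_escape('escaping <html> & "quotes"')
puts html
# escaping &lt;html&gt; &amp; &quot;quotes&quot;

# url_encode (aliased as u) percent-encodes characters for use in a URL.
url = ERB::Util.url_encode('a b&c/d')
puts url
# a%20b%26c%2Fd
```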

answered Jan 02 '23 at 12:44

Timothy Alexis Vass

score -4 · Answer 8 · answered Jan 01 '15 at 11:47

<% str="<h1> Test </h1>" %>

result: &lt; h1 &gt; Test &lt; /h1 &gt;

<%= CGI.unescapeHTML(str).html_safe %>

answered Jan 01 '15 at 11:47

Usman

I think that by adding html_safe on any user-entered text, you are telling the view that it is safe when it's possible that it's not safe. This would put your users at risk when they load that view. – user1515295 Oct 28 '15 at 05:22
I don't know why so negative. I tried all solutions in this question. Only this works fine. About HTML safe, the user WANTS to render the HTML, then HTML_SAFE is correct. – Diego Somar Sep 16 '17 at 16:09
