Is there a way in ruby 1.9 to remove invalid byte sequences from strings?

Question

Suppose you have a string like "€foo\xA0", encoded as UTF-8. Is there a way to remove invalid byte sequences from this string (so you get "€foo")?

In Ruby 1.8 you could use Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "€foo\xA0"), but Iconv is now deprecated. "€foo\xA0".encode('UTF-8') doesn't do anything, since the string is already UTF-8. I tried:

"€foo\xA0".force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')

which yields

"foo"

But that also loses the valid multibyte character €.
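To see why the BINARY round trip drops the euro sign, it may help to look at the bytes. The following is only a sketch, assuming a UTF-8 source file on Ruby 1.9 or later; the variable names are illustrative and not part of the original attempt.

```ruby
# "€" is three bytes in UTF-8. Once the string is tagged as BINARY,
# each byte >= 0x80 is an undefined character for the conversion.
str = "€foo\xA0".dup

euro_bytes = "€".bytes.to_a        # the three bytes that make up "€"
valid      = str.valid_encoding?   # false, because of the trailing \xA0

# Every byte >= 0x80 is replaced with '', including the bytes of the valid "€".
stripped = str.force_encoding('BINARY')
              .encode('UTF-8', :undef => :replace, :replace => '')
```

So the conversion cannot tell the bytes of "€" apart from the stray \xA0: after force_encoding, both are just high bytes with no UTF-8 mapping.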

See https://stackoverflow.com/questions/12147449/delete-non-utf-characters-from-a-string-in-ruby/12149403#comment133367536_12149403 for some newer options available in Ruby 2.1+. — rogerdpack, Feb 28 '23 at 16:28

score 35 · Accepted Answer · edited Mar 23 '15 at 13:17

"€foo\xA0".chars.select(&:valid_encoding?).join
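As a rough sketch of how this one-liner might be wrapped for reuse (the method name `strip_invalid_chars` is hypothetical, not something Ruby provides):

```ruby
# Hypothetical helper: keep only the characters that are valid
# in the string's own encoding and join them back together.
def strip_invalid_chars(str)
  str.chars.select(&:valid_encoding?).join
end

clean_euro  = strip_invalid_chars("€foo\xA0")       # => "€foo"
clean_spain = strip_invalid_chars("eEspa\xF1a;FB")  # => "eEspaa;FB"
```

Each invalid byte shows up as its own one-character string whose valid_encoding? is false, so select drops it while leaving multibyte characters such as € intact.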

edited Mar 23 '15 at 13:17

Dorian

answered Jan 03 '12 at 10:50

Evgenii

1

It doesn't remove the `\xF1` in this string `"eEspa\xF1a;FB"` – Dorian Sep 24 '14 at 15:12
2

@Dorian, on a 1.9.3 IRB console, `"eEspa\xF1a;FB".chars.select{|i| i.valid_encoding?}.join` returns `"eEspaa;FB"`. Do you not get that behavior, or have I misunderstood? – acobster Mar 20 '15 at 17:40

score 35 · Answer 2 · edited Feb 28 '23 at 15:58

"€foo\xA0".encode('UTF-16le', invalid: :replace, replace: '').encode('UTF-8')
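A minimal sketch of the same round trip as a method, for illustration only (`strip_invalid_via_utf16` is a made-up name):

```ruby
# Hypothetical helper: transcode to UTF-16LE, dropping invalid input
# bytes on the way, then transcode back to UTF-8.
def strip_invalid_via_utf16(str)
  str.encode('UTF-16LE', invalid: :replace, replace: '').encode('UTF-8')
end

cleaned = strip_invalid_via_utf16("€foo\xA0")  # => "€foo"
```

The detour through a different encoding forces Ruby to actually transcode the string, which is when the invalid: and replace: options are applied.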

edited Feb 28 '23 at 15:58

rogerdpack

answered Jan 03 '12 at 10:50

Van der Hoorn

2

I was under the impression it has a larger character set than UTF-8, meaning you don't lose any valid data. Unfortunately the following doesn't work: `"€foo\xA0".encode('UTF-8', :invalid => :replace, :replace => '')` because the string is already UTF-8, so it will not be encoded again. – Van der Hoorn Apr 29 '12 at 18:09
FWIW, running a test on a large file I found this method to be an order of magnitude faster than the `valid_encoding` approach. – jwadsack Oct 04 '12 at 20:37
2

UTF-8 and UTF-16 can both represent all Unicode characters. The only difference is the way the characters are encoded. – Zr40 Nov 10 '12 at 11:10
`UTF-32` is also an option, but `UTF-16` seems to work well enough. The new [emoji characters](http://www.grumdrig.com/emoji-list/) might need the extra space. – tadman Dec 12 '12 at 21:06
1

All UTF encodings are equally capable of encoding all possible Unicode characters; there's no difference in that regard between UTF-8, UTF-16 and UTF-32. The only practical difference is the [output size](http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings#Eight-bit_environments). – Zr40 Jun 02 '13 at 07:09
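A small sketch that illustrates this point with a character outside the Basic Multilingual Plane; the choice of U+1F600 is just an example:

```ruby
# U+1F600 needs four bytes in UTF-8 and a surrogate pair in UTF-16.
emoji = "\u{1F600}"

utf16 = emoji.encode('UTF-16LE')
utf32 = emoji.encode('UTF-32LE')

# The character survives the round trip through UTF-16 unchanged.
round_trip = utf16.encode('UTF-8')

# Only the byte counts differ between encodings, not what can be represented.
sizes = [emoji.bytesize, utf16.bytesize, utf32.bytesize]
```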
1

Throws an error with this string: `"eEspa\xF1a;FB"` – Dorian Sep 24 '14 at 15:12
@Dorian: what Ruby version? – Van der Hoorn Mar 05 '15 at 23:50
1

@VanderHoorn: it was ruby < 2.1 because it works with ruby 2.1+ – Dorian Mar 10 '15 at 13:05
@Dorian: I see. Could it be a Ruby 2.0.x issue? Because I think I used Ruby 1.9.3 when I answered the original question. – Van der Hoorn Mar 11 '15 at 13:56
With Ruby 2.1, encoding from "the same encoding to the same encoding" is no longer a no-op, FWIW, so the double encoding trick is hopefully no longer necessary? – rogerdpack Feb 28 '23 at 15:58

score 4 · Answer 3 · edited Feb 28 '23 at 15:53

Ruby 2.0 and 1.9.3

"€foo\xA0".encode(Encoding::UTF_8, Encoding::UTF_8, :invalid => :replace)

Ruby 2.1+

"€foo\xA0".scrub

By default these replace the \xA0 with a � symbol; you can specify a different replacement string.
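For instance, a sketch of the replacement options for `String#scrub` on Ruby 2.1+ (the variable names are illustrative):

```ruby
dirty = "€foo\xA0"

# Default: each invalid sequence becomes U+FFFD.
default_mark = dirty.scrub                # => "€foo\uFFFD"

# An empty replacement removes the invalid bytes entirely.
removed = dirty.scrub('')                 # => "€foo"

# A block receives the invalid bytes and returns their replacement.
tagged = dirty.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }  # => "€foo<a0>"
```

Passing an empty string gives the result the question asks for, while the block form can be handy for logging which bytes were dropped.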

edited Feb 28 '23 at 15:53

rogerdpack

answered Apr 26 '17 at 16:22

Ethan J. Brown

score -2 · Answer 4 · answered Oct 11 '14 at 07:37

    data = '' unless data.force_encoding("UTF-8").valid_encoding?
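The line above discards the whole value as soon as any byte is invalid. A sketch of a narrower variant, assuming Ruby 2.1+ for `String#scrub`, that leaves valid input untouched and drops only the offending bytes (`sanitize_utf8` is a hypothetical name, not the answerer's code):

```ruby
# Hypothetical helper: treat the input as UTF-8 and remove only
# the invalid byte sequences, instead of blanking the whole string.
def sanitize_utf8(data)
  utf8 = data.dup.force_encoding('UTF-8')  # dup so the caller's string is not retagged
  utf8.valid_encoding? ? utf8 : utf8.scrub('')
end

kept    = sanitize_utf8("€foo")                  # already valid, returned as-is
cleaned = sanitize_utf8("€foo\xA0")              # only the \xA0 is dropped
from_io = sanitize_utf8("\xE2\x82\xACfoo\xA0".b) # binary-tagged input, e.g. from a socket
```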

answered Oct 11 '14 at 07:37

RedDeath

This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post - you can always comment on your own posts, and once you have sufficient [reputation](http://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](http://stackoverflow.com/help/privileges/comment). – Severin Oct 11 '14 at 12:08
@Severin how come not? It looks like an (incorrect) answer to the question. It removes all invalid byte sequences from a string. It just removes all valid ones as well. – John Dvorak Oct 11 '14 at 15:46
