41

How do I delete non-UTF-8 characters from a Ruby string? I have a string that contains, for example, "\xC2". I want to remove that byte from the string so that it becomes valid UTF-8.

This:

text = x = "foo\xC2bar"
text.gsub!(/\xC2/, '')

returns an error:

incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)

I also looked at text.unpack('U*') and Array#pack, but did not get anywhere.

rogerdpack
Wojtek B.

7 Answers

118

You can use encode for that. text.encode('UTF-8', :invalid => :replace, :undef => :replace)

Or text.scrub

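By default, both replace each invalid byte with the Unicode replacement character U+FFFD (usually rendered as a question mark box). To delete the bytes instead, pass an empty replacement: text.scrub('') or :replace => '' in the encode options. See the Ruby docs for String#encode and String#scrub for more information.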
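As a minimal sketch of both calls (assuming Ruby 2.1 or later, where String#scrub exists and encode also handles a UTF-8 to UTF-8 conversion; the variable names are illustrative):

```ruby
# Assumes Ruby >= 2.1 and a UTF-8 source file.
# "\xC2" is a UTF-8 lead byte with no continuation byte after it.
text = "foo\xC2bar"

replaced = text.scrub       # invalid byte becomes U+FFFD
deleted  = text.scrub('')   # invalid byte is dropped

# Same deletion via encode; :replace sets the substitute string.
encoded = text.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')

puts deleted                  # foobar
puts deleted.valid_encoding?  # true
```

Neither call modifies text itself; scrub! and encode! are the in-place variants.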

rogerdpack
Iuri G.
11

You could do it like this:

# encoding: utf-8

class String
  def validate_encoding
    chars.select(&:valid_encoding?).join 
  end
end

puts "testing\xC2 a non UTF-8 string".validate_encoding
# => testing a non UTF-8 string
peter
7

Your text has ASCII-8BIT encoding. Instead, you should use this:

text.delete!("^\u{0000}-\u{007F}")

It serves the same purpose, but note that it deletes every character outside the ASCII range, not just the invalid bytes.

XtraSimplicity
CharlesC
5

You can use /n, as in

text.gsub!(/\xC2/n, '')

to force the Regexp to operate on bytes.

Are you sure this is what you want, though? Any Unicode character in the range [U+80, U+BF] will have a \xC2 in its UTF-8 encoded form.
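
To illustrate the caveat, here is a small sketch (assuming Ruby 2.0 or later for String#b; the variable names are illustrative). It strips \xC2 from a binary copy of a valid string, so that the byte regexp and the string share an encoding, and the result is no longer valid UTF-8:

```ruby
# U+00A9 (copyright sign) is encoded in UTF-8 as the two bytes C2 A9.
copyright = "\u00A9"
bytes = copyright.bytes

# Work on a binary copy so the /n regexp and the string share an encoding,
# then relabel the result as UTF-8 to check its validity.
stripped = copyright.b.gsub(/\xC2/n, '').force_encoding('UTF-8')

puts bytes.map { |b| format('%02X', b) }.join(' ')  # C2 A9
puts stripped.valid_encoding?                        # false
```

Removing the lead byte leaves a stray continuation byte (\xA9) behind, which is why a blanket byte-level delete can damage text that was valid to begin with.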

ephemient
4

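Try Iconv (deprecated in Ruby 1.9.3 and removed from the standard library in 2.0, where it is available as the iconv gem):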

1.9.3p194 :001 > require 'iconv'
# => true 
1.9.3p194 :002 > string = "testing\xC2 a non UTF-8 string"
# => "testing\xC2 a non UTF-8 string" 
1.9.3p194 :003 > ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
# => #<Iconv:0x000000026c9290> 
1.9.3p194 :004 > ic.iconv string
# => "testing a non UTF-8 string" 
Pritesh Jain
3

The best solution to this problem that I've found is this answer to the same question: https://stackoverflow.com/a/8711118/363293.

In short: "€foo\xA0".chars.select(&:valid_encoding?).join
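
A minimal sketch of that one-liner wrapped in a method, assuming a UTF-8 source file (drop_invalid_chars is a hypothetical name, not part of Ruby):

```ruby
# Hypothetical helper: keeps every character that is valid in the string's
# own encoding and drops the rest. Valid multibyte characters survive.
def drop_invalid_chars(str)
  str.chars.select(&:valid_encoding?).join
end

cleaned = drop_invalid_chars("€foo\xA0")

puts cleaned                  # €foo
puts cleaned.valid_encoding?  # true
```

Unlike a byte-level delete, this keeps the three-byte "€" intact and removes only the stray \xA0.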

Community
Ivaylo Novakov
-2
data = '' unless data.force_encoding('UTF-8').valid_encoding?
RedDeath