gsub :: ArgumentError (invalid byte sequence in UTF-8)

Question

This code uses the Hpricot gem to get HTML that contains UTF-8 characters.

# <div>This is a test<a href="">测试</a></div>
div[0].to_html.gsub(/test/, "")

When that is run, it spits out this error (pointing at gsub):

ArgumentError (invalid byte sequence in UTF-8)

How can we fix this issue?

Are you sure they're utf-8 in the source? What do the actual bytes look like? — Wooble, Feb 22 '12 at 19:29
Yes, or else it wouldn't be saying "UTF-8". Here is what it actually checks: `test...Café testing`. — Artem Kalinchuk, Feb 23 '12 at 14:25
`.to_html.gsub` works with the values you specified. Could you give us more details? — louiscoquio, Feb 23 '12 at 15:31
@ArtemKalinchuk: the error message suggests that what you're passing in isn't, in fact, valid UTF-8. This probably means the characters are in another encoding. — Wooble, Feb 23 '12 at 15:41
@wooble Yes, I know that. My question is how can I make it valid? — Artem Kalinchuk, Feb 23 '12 at 17:23
Find out what encoding it's *really* using, then convert it from that to something you can use. — Wooble, Feb 23 '12 at 17:44
@Wooble It's easy to say that but doing it is another story. `"test".encoding #=> UTF-8` and `"test".force_encoding("UTF-8")`. But that doesn't fix the problem. — Artem Kalinchuk, Feb 23 '12 at 19:03

score 3 · Accepted Answer · answered Feb 23 '12 at 17:23

Figured out the issue. Hpricot's to_html calls methods that trigger the error, so fixing that one string is not enough: the whole Hpricot document has to be valid UTF-8. We can do that by running the raw page through Iconv before parsing it:

require 'open-uri'
require 'iconv'
ic = Iconv.new("UTF-8//IGNORE", "UTF-8")
doc = open("http://example.com") {|f| Hpricot(ic.iconv(f.read)) }

The //IGNORE flag makes Iconv drop any bytes that are not valid UTF-8, so the whole document is now valid UTF-8 and the other Hpricot methods (including to_html followed by gsub) no longer raise the error.
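Iconv was removed from the standard library in Ruby 2.0, so on Ruby 2.1 and later the same cleanup can probably be done with the built-in String#scrub. The sketch below is an assumption about how that might look, not the method used in this answer: `clean_utf8` is a hypothetical helper, and the sample string stands in for the downloaded page.

```ruby
# Hypothetical helper (not part of Hpricot): drop bytes that are not valid UTF-8.
# Assumes Ruby 2.1 or newer, where String#scrub is available.
def clean_utf8(raw)
  # Work on a copy so the caller's string keeps its original encoding tag.
  text = raw.dup.force_encoding(Encoding::UTF_8)
  # Only scrub when needed; scrub("") deletes each invalid byte sequence.
  text.valid_encoding? ? text : text.scrub("")
end

# Simulated page body: valid UTF-8 text plus one stray Latin-1 byte (0xE9).
raw = "<div>This is a test<a href=\"\">测试</a> caf\xE9</div>"

cleaned = clean_utf8(raw)
puts cleaned.valid_encoding?     # true
puts cleaned.gsub(/test/, "")    # gsub no longer raises ArgumentError
```

The cleaned string would then be passed to Hpricot(...) in place of ic.iconv(f.read). Like //IGNORE, this silently discards the bad bytes, so a page that was really in another encoding still loses its non-ASCII characters.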

Artem Kalinchuk

This works fine, but Iconv has been deprecated since Ruby 1.9.3. Can you suggest a solution that uses String's built-in encoding methods? – Bogdan Gusiev Mar 13 '12 at 08:01
http://stackoverflow.com/questions/8710444/is-there-a-way-in-ruby-1-9-to-remove-invalid-byte-sequences-from-strings – lulalala Apr 23 '12 at 04:02
Maybe this will help: http://stackoverflow.com/questions/11016328/hpricot-utf-8-issues – Artem Kalinchuk Jun 15 '12 at 14:36

score 0 · Answer 2 · answered Feb 22 '12 at 19:58

In this case, to_html appears to return a string that is not valid UTF-8.

I had the same problem with a file containing some non-UTF-8 characters. The fix I found is not very pretty, but it could also work in your case:

the_utf8_string = the_non_utf8_string.unpack('C*').pack('U*')

Be careful: I'm not sure that no data is lost.
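To make that caveat concrete: unpack('C*') reads the string byte by byte, and pack('U*') writes each byte value back out as a Unicode code point, which amounts to treating the input as ISO-8859-1 (Latin-1). The sketch below illustrates this on made-up sample strings; `bytes_as_codepoints` is a hypothetical wrapper name used only for the demonstration.

```ruby
# Hypothetical wrapper around the one-liner above, for illustration only.
def bytes_as_codepoints(str)
  # 'C*' reads every byte as an integer 0..255; 'U*' writes each integer
  # back out as the UTF-8 encoding of that code point.
  str.unpack('C*').pack('U*')
end

# Latin-1 input: the single byte 0xE9 is "é" in ISO-8859-1.
latin1 = "caf\xE9".b
puts bytes_as_codepoints(latin1)               # café

# Same result as an explicit Latin-1 to UTF-8 conversion.
explicit = latin1.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
puts bytes_as_codepoints(latin1) == explicit   # true

# Input that is already UTF-8: each of the three bytes of "测" becomes
# its own character, so the original text is not preserved.
puts bytes_as_codepoints("测") == "测"          # false
```

So the result is always valid UTF-8, but any multi-byte UTF-8 characters already in the input (such as 测试 in the question) come out as unrelated Latin-1 characters.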

louiscoquio

The unpack+pack I've seen elsewhere was used to "latinize" (remove apostrophes etc) which means data is lost. If the solution above is the same, then it won't be useful here. – Simon B. Jun 23 '13 at 08:49