3

This code uses the Hpricot gem to get HTML that contains UTF-8 characters.

# <div>This is a test<a href="">测试</a></div>
div[0].to_html.gsub(/test/, "")

When that is run, it spits out this error (pointing at gsub):

ArgumentError (invalid byte sequence in UTF-8)

How can we fix this issue?

Artem Kalinchuk

2 Answers

3

Figured out the issue. Hpricot's to_html calls methods that trigger the error, so fixing that one string is not enough: the whole Hpricot document has to be valid UTF-8. We can do that by running the raw input through Iconv, dropping any invalid bytes, before Hpricot parses it:

ic = Iconv.new("UTF-8//IGNORE", "UTF-8")
doc = open("http://example.com") {|f| Hpricot(ic.iconv(f.read)) }

After that we can call the other Hpricot methods as usual. The whole document is now valid UTF-8, so they no longer raise this error.
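For what it's worth, Iconv was removed from Ruby's standard library in 2.0, so on a newer Ruby the same cleanup can probably be done with String#scrub (available since Ruby 2.1). A minimal sketch, assuming the fetched bytes are meant to be UTF-8; strip_invalid_utf8 is a hypothetical helper name, and the sample string below stands in for the result of f.read:

```ruby
# Hypothetical helper (not part of Hpricot or Ruby): returns a copy of +raw+
# tagged as UTF-8 with every invalid byte sequence removed.
def strip_invalid_utf8(raw)
  raw.dup.force_encoding(Encoding::UTF_8).scrub("")
end

# Stand-in for f.read: valid UTF-8 for "测试" followed by a stray 0xFF byte.
dirty = "This is a test \xE6\xB5\x8B\xE8\xAF\x95 \xFF".b

clean = strip_invalid_utf8(dirty)
puts clean.valid_encoding?   # the stray byte has been dropped
puts clean.gsub(/test/, "")  # regex methods can now run on the string
```

The cleaned string could then be passed to Hpricot(...) in place of ic.iconv(f.read).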

Artem Kalinchuk
0

It looks like to_html returns a non-UTF-8 string in this case.

I had the same problem with a file containing some non-UTF-8 characters. The fix I found is not pretty, but it could also work in your case:

the_utf8_string = the_non_utf8_string.unpack('C*').pack('U*')

Be careful: I'm not sure this conversion is free of data loss.
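To make the caveat concrete: unpack('C*') yields one integer per byte, and pack('U*') encodes each integer as its own code point, so the round trip behaves like a Latin-1 to UTF-8 conversion. A small sketch with made-up sample strings, showing that this works for Latin-1 bytes but mangles text that was already UTF-8:

```ruby
# Sample 1: bytes that really are Latin-1 ("café" with é as the single byte 0xE9).
latin1_bytes = "caf\xE9".b
from_latin1  = latin1_bytes.unpack('C*').pack('U*')
puts from_latin1 == "caf\u00E9"   # each byte became the matching code point

# Sample 2: text that is already UTF-8 ("测试", 2 characters, 6 bytes).
already_utf8 = "\u6D4B\u8BD5"
mangled      = already_utf8.unpack('C*').pack('U*')
puts mangled.valid_encoding?      # the result is valid UTF-8...
puts mangled.length               # ...but has one character per original byte
puts mangled == already_utf8      # so the original characters are gone
```

So the result always passes as valid UTF-8, but any multi-byte characters in the input, such as the 测试 in the question, would be replaced by unrelated Latin-1 characters.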

louiscoquio
  • The unpack+pack I've seen elsewhere was used to "latinize" (remove apostrophes etc) which means data is lost. If the solution above is the same, then it won't be useful here. – Simon B. Jun 23 '13 at 08:49