I use linkser (https://github.com/ging/linkser) to load a URL:

1.9.3p374 :001 > require 'linkser'
 => true 
1.9.3p374 :002 > l = Linkser.parse 'http://sports.163.com/nba/'
 => #<Linkser::Objects::HTML:0x007f92019e99c8 @url="http://sports.163.com/nba/", @last_url="http://sports.163.com/nba/", @head=#<Net::HTTPOK 200 OK readbody=true>, @options={}> 
1.9.3p374 :003 > l.title
encoding error : input conversion failed due to input error, bytes 0xC4 0x4E 0x42 0x41
 => "NBA,NBAֱҥ,\xD7钭ㄒ档" 

Is it possible to convert the byte sequence to a correct UTF-8 string?

why
  • This might help: http://stackoverflow.com/questions/12147449/delete-non-utf-characters-from-a-string-in-ruby/12149403#12149403 – Iuri G. Feb 14 '13 at 16:17

1 Answer

The page is actually encoded in GBK (a superset of GB2312). A quick glance at the linkser source shows no handling of encodings, so that is left to Net::HTTP, which has a long-standing bug in this area, targeted for a fix in Ruby 2.0.0.
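
If you can get at the raw bytes yourself, relabelling them as GBK and transcoding should give a proper UTF-8 string. Below is a minimal sketch using only Ruby's built-in String methods; it does not touch linkser. The sample string and the variable names are made up for illustration, and it assumes the bytes arrive without a correct encoding label (simulated here by tagging them as binary).

```ruby
# Hypothetical sample: simulate the raw body of a GBK page by encoding a
# known UTF-8 string to GBK and then discarding the encoding label.
original = "NBA直播"
raw = original.encode("GBK").force_encoding(Encoding::BINARY)

# Step 1: tell Ruby what the bytes really are (no bytes change here).
# Step 2: transcode from GBK to UTF-8.
utf8 = raw.dup.force_encoding("GBK").encode("UTF-8")

puts utf8
puts utf8.valid_encoding?

# If the input may contain stray bytes, replace them instead of raising.
safe = raw.dup.encode("UTF-8", "GBK", invalid: :replace, undef: :replace)
```

Note that force_encoding only changes the label on the string, while encode actually rewrites the bytes, so the order of the two calls matters.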

Martin Vidner