1

I am trying to parse a GB2312-encoded page (http://news.qq.com/a/20140824/015032.htm).

I have not reached the parsing step yet: just opening and reading the page raises an error.

This is my code:

require 'open-uri'
open("http://news.qq.com/a/20140824/015032.htm").read

And this is the error:

Encoding::InvalidByteSequenceError: "\x8B" on GB2312

I am using Ruby 2.0.0p247.

How can I fix this?

VHanded

3 Answers

1

I don't know exactly why this happens when calling .read, but you can work around it if you are using Nokogiri. Just pass the file object directly to Nokogiri without calling .read:

require 'nokogiri'
require 'open-uri'
file = open("http://news.qq.com/a/20140824/015032.htm")
document = Nokogiri(file)
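
One possible explanation, which is only an assumption and is not confirmed anywhere in this thread: 0x8B is the second byte of the gzip magic number (0x1F 0x8B), so the server may be sending a gzip-compressed body that is then treated as GB2312 text. A minimal sketch for checking this on the raw bytes follows; `gunzip_if_needed` is a hypothetical helper name, not part of open-uri.

```ruby
require 'zlib'
require 'stringio'

# First two bytes of every gzip stream.
GZIP_MAGIC = "\x1F\x8B".b

# Hypothetical helper: returns the decompressed body if `raw` looks like
# gzip data, otherwise returns the bytes unchanged (as a binary string).
def gunzip_if_needed(raw)
  bytes = raw.b
  return bytes unless bytes.start_with?(GZIP_MAGIC)

  Zlib::GzipReader.new(StringIO.new(bytes)).read
end
```

You could call it on the result of `.read` before doing any encoding work. If the body does not start with the gzip magic bytes, this guess does not apply.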
infused
0

I cannot reproduce the error on Ruby 2.0.0p247:

require 'open-uri'
open("http://news.qq.com/a/20140824/015032.htm").read

Works fine.

However, converting the result to UTF-8:

require 'open-uri'
open("http://news.qq.com/a/20140824/015032.htm").read.encode('utf-8')

raises the error:

Encoding::InvalidByteSequenceError: "\x8B" on GB2312

Are you trying to do some encoding conversion?
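
If you do need a UTF-8 string, one option is to label the raw bytes as GB18030 (a superset of GB2312) and replace anything that still cannot be converted instead of raising. This is a sketch, not a definitive fix: `gb_to_utf8` is a hypothetical helper name, and it assumes the body really is GB-encoded text.

```ruby
# Hypothetical helper: reinterpret raw bytes as GB18030 (or another source
# encoding) and transcode to UTF-8. Invalid or unmappable bytes become "?"
# instead of raising Encoding::InvalidByteSequenceError.
def gb_to_utf8(raw, source_encoding = "GB18030")
  raw.dup
     .force_encoding(source_encoding)
     .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end
```

For example, `gb_to_utf8(open(url).read)` would return a valid UTF-8 string, at the cost of silently dropping whatever bytes could not be decoded.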

Gordon Yuan Gao
0

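You can try this (with nokogiri and open-uri required), which tells Nokogiri to decode the page as GB18030, a superset of GB2312: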

document = Nokogiri::HTML(open("http://news.qq.com/a/20140824/015032.htm"), nil, "GB18030")
rrrrong