
I am using open-uri to read a webpage which claims to be encoded in iso-8859-1. When I read the contents of the page, open-uri returns a string encoded in ASCII-8BIT.

open("http://www.nigella.com/recipes/view/DEVILS-FOOD-CAKE-5310") {|f| p f.content_type, f.charset, f.read.encoding }
 => ["text/html", "iso-8859-1", #<Encoding:ASCII-8BIT>] 

I am guessing this is because the webpage contains the byte \x92, which is not a valid ISO-8859-1 character (see http://en.wikipedia.org/wiki/ISO/IEC_8859-1).

I need to store webpages as UTF-8 encoded files. Any ideas on how to deal with a webpage whose declared encoding is incorrect? I could catch the exception and try to guess the correct encoding, but that seems cumbersome and error-prone.

mkhettry
  • What version of Ruby are you using? – the Tin Man Apr 19 '11 at 05:45
  • I'm using 1.9.2. Yes, \x92 means CP1252. I was looking for a more general solution or ideas on how to parse html when the encoding is unknown or doesn't agree with the html header – mkhettry Apr 21 '11 at 15:14
  • Maybe http://stackoverflow.com/questions/7821853/trouble-opening-utf-8-uris-with-rubys-open-uri will help. – DavidGamba Mar 11 '13 at 16:50

1 Answer

  • ASCII-8BIT and BINARY are two names for the same Ruby encoding: raw bytes with no character set attached.
  • open-uri does a funny thing: if the response is smaller than about 10 KB it hands you a StringIO, and if it's bigger it hands you a Tempfile. That can be confusing if you're trying to deal with encoding issues.
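To make the first point concrete, here is a minimal sketch of what "binary" means in practice: the bytes are untouched, and only the label changes until you transcode. The sample string and the helper name `binary_to_utf8` are made up for illustration, and the default of Windows-1252 is an assumption based on the \x92 byte mentioned in the question.

```ruby
# Simulate bytes read off the wire: a binary string with no character set.
raw = "It\x92s cake".b

# Hypothetical helper: relabel the bytes (force_encoding changes no bytes),
# then transcode them to UTF-8 (encode rewrites the bytes).
# Assumption: the page is really Windows-1252, where 0x92 is U+2019.
def binary_to_utf8(bytes, actual_encoding = 'Windows-1252')
  bytes.dup.force_encoding(actual_encoding).encode('UTF-8')
end

utf8 = binary_to_utf8(raw)
puts utf8
```

If the guess about the real encoding is wrong, `encode` either raises or produces the wrong characters, which is why the label matters more than the conversion call.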

If the files aren't huge, I would recommend manually loading them into strings:

require 'uri'
require 'net/http'
require 'net/https'

uri = URI.parse url_to_file

http = Net::HTTP.new(uri.host, uri.port)
if uri.scheme == 'https'
  http.use_ssl = true
  # Last resort when debugging SSL errors: this turns off certificate
  # verification, so don't leave it enabled in production.
  # http.verify_mode = ::OpenSSL::SSL::VERIFY_NONE
end
body = http.start { |session| session.get uri.request_uri }.body
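The body fetched this way is a binary string, so the charset claimed in the Content-Type header still has to be applied by hand. The following is a stdlib-only sketch of one way to do that, not the approach any particular gem takes. The helper name `decode_body` and its fallback default are hypothetical. Treating a declared iso-8859-1 as Windows-1252 follows the WHATWG Encoding Standard, which is what browsers do with that label.

```ruby
# Hypothetical helper: turn raw response bytes into a UTF-8 string.
# declared_charset is whatever the Content-Type header claimed (may be nil).
def decode_body(body, declared_charset, fallback = 'Windows-1252')
  charset = declared_charset.to_s.strip
  # Assumption: pages labelled iso-8859-1 are usually Windows-1252 in practice.
  charset = fallback if charset.empty? || charset =~ /\Aiso-8859-1\z/i

  encoding = begin
    Encoding.find(charset)
  rescue ArgumentError
    Encoding.find(fallback) # unknown label: fall back rather than fail
  end

  text = body.dup.force_encoding(encoding)
  # Replace anything that cannot be mapped. scrub covers the case where the
  # source is already UTF-8 and encode may leave invalid bytes untouched.
  text.encode('UTF-8', invalid: :replace, undef: :replace).scrub
end

puts decode_body("caf\xE9 \x92".b, 'iso-8859-1')
```

With Net::HTTP, keeping the response object instead of calling `.body` right away gives access to the declared charset through `response.type_params['charset']`, which can be passed as the second argument.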

To convert the body to UTF-8, you can use the ensure-encoding gem (https://rubygems.org/gems/ensure-encoding):

require 'ensure/encoding'
utf8_body = body.ensure_encoding('UTF-8', :external_encoding => :sniff, :invalid_characters => :transcode)

I have been pretty happy with ensure-encoding... we use it in production at http://data.brighterplanet.com

Note that you can also say :invalid_characters => :ignore instead of :transcode.

Also, if you know the encoding somehow, you can pass :external_encoding => 'ISO-8859-1' instead of :sniff.
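For readers who want to see the idea behind sniffing without the gem, here is a rough stand-in. It is not how ensure-encoding detects encodings; it just tries a list of candidates in order and keeps the first one that decodes cleanly. The helper name `sniff_to_utf8` and the candidate list are assumptions for illustration, and the order matters: UTF-8 goes first because random Windows-1252 text rarely forms valid UTF-8, while the reverse almost always "succeeds".

```ruby
# Assumed candidate list, strictest encoding first.
CANDIDATES = %w[UTF-8 Windows-1252].freeze

# Hypothetical helper: return the bytes as UTF-8 using the first candidate
# encoding under which they decode without errors.
def sniff_to_utf8(bytes, candidates = CANDIDATES)
  candidates.each do |name|
    attempt = bytes.dup.force_encoding(name)
    next unless attempt.valid_encoding?
    begin
      return attempt.encode('UTF-8')
    rescue EncodingError
      next # some byte has no mapping in this candidate; try the next one
    end
  end
  # Nothing decoded cleanly: keep what can be mapped, replace the rest.
  bytes.dup.force_encoding('Windows-1252')
       .encode('UTF-8', invalid: :replace, undef: :replace)
end

puts sniff_to_utf8("It\x92s cake".b)
```

This heuristic cannot tell single-byte encodings apart from each other, so it is only a fallback for when the declared charset is missing or clearly wrong.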

Seamus Abshere