MIME body: guess charset (and convert to UTF-8)

Question

I'm trying to parse incoming e-mails and store the body as a UTF-8 encoded string in a database. However, I quickly noticed that not all e-mails send charset information in the Content-Type header. After trying some manual quick fixes with String#force_encoding and String#encode, I decided to ask the friendly people of SO.

To be honest, I was secretly hoping that String#encoding would automagically return the encoding used in the string, but it always reports ASCII-8BIT for the test e-mails I send. I ran into this problem while implementing quoted-printable decoding, which seemed to work as long as I also got some ;charset=blabla info.

input = input.gsub(/\r\n/, "\n").unpack("M*").first
if charset
  return input.force_encoding(charset).encode("utf-8")
end

# This is obviously wrong as the string is not always ISO-8859-1 encoded:
return input.force_encoding("ISO-8859-1").encode("utf-8")
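
One stdlib-only way around the hard-coded ISO-8859-1 guess, offered as a rough sketch rather than a known-good recipe: try a short list of candidate encodings and keep the first one for which the bytes are well-formed. The method name `to_utf8` and the fallback list are hypothetical. Note that `String#valid_encoding?` only shows the bytes are well-formed in that encoding, not that the guess is right, and single-byte encodings accept almost anything.

```ruby
# Hypothetical helper: convert raw body bytes to UTF-8.
# `declared` is the charset from the Content-Type header, if any.
# The fallback list is an assumption; adjust it to the mail you actually receive.
def to_utf8(raw, declared = nil, fallbacks = ["UTF-8", "Windows-1252"])
  ([declared] + fallbacks).compact.each do |name|
    begin
      candidate = raw.dup.force_encoding(name) # relabels the bytes, no conversion
      next unless candidate.valid_encoding?    # skip if the bytes are malformed
      return candidate.encode("UTF-8")
    rescue ArgumentError, EncodingError
      next # unknown charset name or unconvertible byte: try the next candidate
    end
  end
  # Last resort: ISO-8859-1 maps every byte, so this conversion cannot fail.
  raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# unpack("M*") yields a binary (ASCII-8BIT) string, hence the guessing.
utf8_body   = to_utf8("caf=C3=A9".unpack("M*").first)
latin1_body = to_utf8("caf=E9".unpack("M*").first)
```

Trying UTF-8 first is a deliberate choice in this sketch: multi-byte UTF-8 sequences rarely validate by accident, so a successful check there is a much stronger signal than a successful check against a single-byte encoding.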

I've been experimenting with several "solutions" I found on the internet, but most of them relate to file reading/writing. I also tried a few gems for detecting encoding, but none really did the trick, or they were incredibly outdated. It should be possible, and it feels as if the answer is staring me right in the face. Hopefully someone here can shed some light on my situation and tell me what I've been doing wrong.

Using Ruby 1.9.3.

score 0 · Accepted Answer · answered May 29 '12 at 09:54


You can use https://github.com/janx/chardet to detect the original encoding of your email text.

Example:

irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'UniversalDetector'
=> false
irb(main):003:0> p UniversalDetector::chardet('hello')
{"encoding"=>"ascii", "confidence"=>1.0}
=> nil
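
To go from a detection result to a UTF-8 string, the hash can be fed back into `force_encoding`. The sketch below assumes only the `{"encoding"=>..., "confidence"=>...}` shape shown in the session above. The method name `apply_guess` and the 0.5 confidence cut-off are made up for illustration, and the detector itself is not called here, so the snippet runs without the gem.

```ruby
# Hypothetical helper: apply a detector result to raw bytes.
# `guess` is expected to look like {"encoding"=>"utf-8", "confidence"=>0.9}.
# Returns a UTF-8 string, or nil when the guess is unusable.
def apply_guess(raw, guess, min_confidence = 0.5)
  name = guess && guess["encoding"]
  return nil if name.nil? || guess["confidence"].to_f < min_confidence

  candidate = raw.dup.force_encoding(name) # relabel only, bytes are untouched
  return nil unless candidate.valid_encoding?
  candidate.encode("UTF-8")
rescue ArgumentError, EncodingError
  nil # the guess named a charset Ruby does not know, or a byte had no mapping
end

raw  = "caf=E9".unpack("M*").first
body = apply_guess(raw, { "encoding" => "ISO-8859-1", "confidence" => 0.8 })
```

Returning nil instead of raising leaves the caller free to fall back to another strategy when the guess is missing, low-confidence, or names an encoding Ruby cannot handle.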

answered May 29 '12 at 09:54 · Hooopo

Hm, this seems to be a port of the actual `chardet` gem. The original one breaks as soon as you try to load it, but I'll give this one a try when I get back home and will post the results. – CharlesLeaf May 29 '12 at 18:03
This version of the gem also seems a bit outdated. Depending on how I insert my test string, it either just says ASCII with 1.0 confidence or raises `can't convert String into Integer`, which seems to originate in `CodingStateMachine.rb` on the line `byteCls = @_mModel['classTable'][c]`. I'll need to investigate further to see if I can resolve that. – CharlesLeaf May 29 '12 at 19:12
Initial testing seems promising. I had some trouble installing the ICU library on my local machine (Mac), but it worked out in the end, and it seems to be fairly smart. It's not perfect when the string is very small, but for real-world use it might prove useful enough. Thanks for the help! – CharlesLeaf May 30 '12 at 18:51

score 0 · Answer 2 · answered Mar 09 '13 at 19:12

Have you tried https://github.com/fac/cmess ?

== DESCRIPTION

CMess bundles several tools under its hood that aim at dealing with various problems occurring in the context of character sets and encodings. Currently, there are:

guess_encoding:: Simple helper to identify the encoding of a given string. Includes the ability to automatically detect the encoding of an input.

[...]
