
I have a problem with my app, which reads e-mails from an external server using the mailman gem (which in turn uses the mail gem).

ruby 1.9.2p0
mail (2.3.0)
mailman (0.4.0) 
actionmailer (= 3.1.3)

database.yml

production:
  adapter: mysql2
  encoding: utf8

Here is a simple method to receive mail. I build @message_body from the text_part of a multipart e-mail (e.g. one with attachments) or from the whole decoded body.

def self.receive_mail(message)
    # some code here
    @message_body = message.multipart? ? message.text_part.body.to_s : message.body.decoded
    # some code here, to save message in database
end

My problem is that if the message doesn't have attachments but has diacritics, like ą ś ł ń ż ź ó ..., the body is cut off just before the first diacritic. So if the body is "test żłóbek test", I get only "test " in @message_body.

My question is how to save such a message in an elegant way, so that the text part is stored in the database with all its diacritics.

EDIT: to make it clearer, I get e-mails that look like this one (it's just a part of an e-mail sent from Gmail):

--20cf307ac4372d830104c11c8cc6
Date: Mon, 28 May 2012 20:06:16 +0200
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-2
Content-Transfer-Encoding: base64
Content-ID: <4fc3be989b76e_794650c25f6625e3@vk1057.some_domain>

dGVzdCC/s7zm8bbzsSB0ZXN0Cg==

So we have this 'body' : dGVzdCC/s7zm8bbzsSB0ZXN0Cg==

After decoding we get : 'test \xbf\xb3\xbc\xe6\xf1\xb6\xf3\xb1 test\n'

And the problem is that everything from '\xbf' onwards is not saved in the database.

UPDATE

another example, I think this is the problem here:

irb(main):008:0* require 'base64'
=> true
irb(main):009:0> a = "test źćłżąńś"
=> "test źćłżąńś"
irb(main):010:0> b = Base64.encode64(a)
=> "dGVzdCDFusSHxYLFvMSFxYTFmw==\n"
irb(main):011:0> Base64.decode64(b)
=> "test \xC5\xBA\xC4\x87\xC5\x82\xC5\xBC\xC4\x85\xC5\x84\xC5\x9B"

See, after decode64 my diacritics are lost. What can I do to get them back?

januszm
  • Which version(s) of Ruby and the gems (relevant snippets of Gemfile.lock) are you using? –  May 28 '12 at 16:53
  • ruby 1.9.2p0 gems: mail (2.3.0) mailman (0.4.0) actionmailer (= 3.1.3) I think this is important: when diacritics are in SUBJECT of the email everything is saved fine. But the same characters in body cause Rails to not save them and cut string here. Maybe it is related to Rails-MySQL only, because what application should store at this moment in MySQL is something like: test \xbc\xe6\xbf\xf1\xb6" should I experiment with force_encoding or something like this? – januszm May 28 '12 at 18:02
  • Might be right, I usually enforce unicode/utf8 w/ all databases to prevent headaches. –  May 28 '12 at 18:22
  • Other links to checkout: http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/ http://stackoverflow.com/questions/5122765/mysql-changes-utf-8-to-ascii-8bit I also recommend the mysql2 gem. –  May 28 '12 at 18:30
  • I'm using mysql2 gem and encoding utf-8. what i've found is the fact that mail body comes in ASCII-8BIT – januszm May 28 '12 at 18:44
  • adding force_encoding doesn't help, message.text_part.body.to_s.force_encoding("utf-8") , still everything starting with first \xab character is lost in database – januszm May 28 '12 at 18:45

2 Answers

force_encoding('utf-8')

Doesn't work because the data isn't UTF-8: your mail headers clearly state that the message body is ISO-8859-2.

Mysql2 assumes everything is UTF-8, but the bytes can't be converted to UTF-8 (because Ruby doesn't know the original encoding), so your non-ASCII characters are thrown away by MySQL.
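To see why relabelling as UTF-8 cannot help, here is a small sketch using the Base64 body quoted in the question. It only assumes Ruby's standard `base64` library; the variable names are illustrative.

```ruby
# encoding: utf-8
require 'base64'

# Body copied from the question; its MIME part declares charset=ISO-8859-2.
raw = Base64.decode64("dGVzdCC/s7zm8bbzsSB0ZXN0Cg==")

# decode64 returns a binary (ASCII-8BIT) string: Ruby has no charset info yet.
is_binary = (raw.encoding == Encoding::BINARY)

# force_encoding only changes the label, it never touches the bytes.
mislabeled = raw.dup.force_encoding('UTF-8')
same_bytes = (mislabeled.bytes.to_a == raw.bytes.to_a)

# 0xBF cannot start a UTF-8 sequence, so the relabelled string is invalid.
valid_utf8 = mislabeled.valid_encoding?

puts "binary: #{is_binary}, same bytes: #{same_bytes}, valid UTF-8: #{valid_utf8}"
```

The bytes themselves are intact after decoding; what is missing is a correct statement of which charset they are in.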

For that one string you could try

body.force_encoding('ISO-8859-2').encode('utf-8')
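Applied to the Base64 body quoted in the question, that call can be checked in isolation. This is a sketch that hard-codes ISO-8859-2 because that is what this particular message declares; `raw` and `utf8_body` are illustrative names.

```ruby
# encoding: utf-8
require 'base64'

# Raw bytes of the sample body (binary string, no charset attached yet).
raw = Base64.decode64("dGVzdCC/s7zm8bbzsSB0ZXN0Cg==")

# Step 1: tell Ruby which charset the bytes are really in (label only).
# Step 2: transcode those characters to UTF-8 (bytes change here).
utf8_body = raw.force_encoding('ISO-8859-2').encode('UTF-8')

puts utf8_body  # prints "test żłźćńśóą test"
```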

But really you want to work out which encoding to use from the Content-Type header. I'm surprised the mail gem isn't doing that for you.

Frederick Cheung
  • wow, what a coincidence. I've found the same solution in the same moment and was writing answer when you were doing the same ;) Thanks anyway! – januszm May 28 '12 at 20:23
  • and you are right, mail gem should do this. maybe I should write about this issue to mikel/mail github. As you can see above, I'm checking charset manually. This whole problem came out of the fact, that gem was not doing it automatically. – januszm May 28 '12 at 20:25

I have the solution. Chaining the

.force_encoding("ORIGINAL_CHARSET").encode("UTF-8")

methods on the e-mail body is what fixes it.

I had to change my receive_mail() definition from the previous one-liner to:

if message.multipart?
    charset = message.text_part.content_type_parameters[:charset]
    @message_body = message.text_part.body.to_s.force_encoding(charset).encode("UTF-8")
else
    charset = message.content_type_parameters[:charset]
    @message_body = message.body.decoded.force_encoding(charset).encode("UTF-8")
end

With this construct I can detect the charset of the original e-mail, tag the Base64-decoded body with it, and then transcode the result to UTF-8, so the text reaches the database as valid UTF-8.

If anyone has a more elegant solution, please share.
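One possible refinement, sketched under assumptions: the charset parameter may be absent or may name something Ruby does not recognise, in which case `force_encoding` would raise. The helper below, `to_utf8`, is hypothetical (it is not part of the mail gem) and falls back to a default charset in those cases; the UTF-8 default is also an assumption.

```ruby
# encoding: utf-8
require 'base64'

# Hypothetical helper, not part of the mail gem: convert raw body bytes to
# UTF-8 using the charset declared in the Content-Type header.
def to_utf8(raw, charset, fallback = 'UTF-8')
  encoding = begin
    Encoding.find(charset.to_s)
  rescue ArgumentError
    nil # header missing or charset name unknown to Ruby
  end
  encoding ||= Encoding.find(fallback)

  # dup so the caller's string is not relabelled in place; the :replace
  # options substitute '?' instead of raising during transcoding.
  raw.dup.force_encoding(encoding)
     .encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
end

# Usage with the sample body from the question.
raw = Base64.decode64("dGVzdCC/s7zm8bbzsSB0ZXN0Cg==")
body = to_utf8(raw, 'ISO-8859-2')
no_charset = to_utf8("plain ascii", nil)
```

In receive_mail, the second argument would be the value read from content_type_parameters[:charset], so both branches could share the same call.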

januszm