
# The Problem #

When parsing millions of emails, the method Mail.read_from_string(mail_as_string) is too slow.

# The Question #

How can I speed up email parsing?

# The Context #

I have put in enough context for you to understand my use case.

## Fetching an email ##

I connect to an external IMAP server via Ruby's Net::IMAP.

@imap = Net::IMAP.new("imap.gmail.com", 993, true) # A few login steps are omitted here

I fetch an email:

email = @imap.uid_fetch("85113", ["BODY[HEADER.FIELDS (FROM TO DATE SUBJECT)]", "RFC822"]) # => #<struct Net::IMAP::FetchData seqno=55395, attr={"UID"=>85113, "RFC822"=>"Delivered-To: my@email.com\r\nReceived: by 10.223.148.78 with SMTP id o14csp218630fav;\r\n        Tue, 18 Dec 2012 16:55:50 -0800 (PST)\r\nX-Received: by 10.194.177.199 with SMTP id cs7mr8044338wjc.41.1355878548414;\r\n        Tue, 18 Dec 2012 16:55:48 -0800 (PST)\r\nReturn-Path: <noreply@128secure.net>\r\nReceived: from exproxy-1.exserver.dk (exproxy-1.exserver.dk. [195.69.129.162])\r\n        by mx.google.com with ESMTP id m13si17440569wie.32.2012.12.18.16.55.47;\r\n        Tue, 18 Dec 2012 16:55:48 -0800 (PST)\r\nReceived-SPF: pass (google.com: domain of noreply@128secure.net designates 195.69.129.162 as permitted sender) client-ip=195.69.129.162;\r\nAuthentication-Results: mx.google.com; spf=pass (google.com: domain of noreply@128secure.net designates 195.69.129.162 as permitted sender) smtp.mail=noreply@128secure.net\r\nReceived: by exproxy-1.exserver.dk (Postfix, from userid 65534)\r\n\tid 5330511CDCB; Wed, 19 Dec 2012 01:55:42 +0100 (CET)\r\nReceived: from EXHUB02.exchangeserver.dk (exhub02.exchangeserver.dk [193.239.98.62])\r\n\tby exproxy-1.exserver.dk (Postfix) with ESMTP id 4735211A58E\r\n\tfor <my_email.com@exfwd01.scannet.dk>; Wed, 19 Dec 2012 01:55:42 +0100 (CET)\r\nReceived: from front07.exserver.dk (195.69.129.92) by\r\n EXHUB02.exchangeserver.dk (193.239.98.60) with Microsoft SMTP Server id\r\n 8.2.176.0; Wed, 19 Dec 2012 01:58:49 +0100\r\nReceived: from localhost (front07.exserver.dk [127.0.0.1])\tby\r\n front07.exserver.dk (Postfix) with ESMTP id 0B8287B4015\tfor\r\n <my@email.com>; Wed, 19 Dec 2012 01:55:45 +0100 (CET)\r\nX-Virus-Scanned: amavisd-new at exserver.dk\r\nReceived: from front07.exserver.dk ([127.0.0.1])\tby localhost\r\n (front07.exserver.dk [127.0.0.1]) (amavisd-new, port 10024)\twith ESMTP id\r\n vrjzzlpsuXn6 for <my@email.com>;\tWed, 19 Dec 2012 01:55:42 +0100 
(CET)\r\nReceived: from shopmail.scannet.dk (shopmail.scannet.dk [195.69.129.120])\tby\r\n front07.exserver.dk (Postfix) with ESMTP id A6F797B4002\tfor\r\n <my@email.com>; Wed, 19 Dec 2012 01:55:42 +0100 (CET)\r\nReceived: from WebSrv100 (unknown [193.239.97.100])\tby shopmail.scannet.dk\r\n (Postfix) with ESMTP id 6DFEF7FE4E\tfor <my@email.com>; Wed, 19 Dec\r\n 2012 01:55:34 +0100 (CET)\r\nMIME-Version: 1.0\r\nFrom: me <noreply@128secure.net>\r\nTo: me <my@email.com>\r\nReply-To: <my@email.com>\r\nDate: Wed, 19 Dec 2012 01:55:44 +0100\r\nSubject: Ordre (Kopi)\r\nContent-Type: text/html; charset=\"utf-8\"\r\nContent-Transfer-Encoding: base64\r\nMessage-ID: <20121219005542.A6F797B4002@front07.exserver.dk>\r\nX-ScanNet-Forward: TTL=5\r\n\r\n\r\nT3JkcmUgZnJhIEdhbWVQSU1QOjxicj4NCi0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t\r\nLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS08YnI+DQpPcmRy\r\nZWRhdG86IDE5LTEyLTIwMTIgMDE6NTU6NDM8YnI+DQpPcmRyZW51bW1lcjogMTA4NjU0PGJy\r\nPg0KVHJhbnNha3Rpb25zSUQ6IDE2NzI4Ng0KPGJyPjxicj4NCkZha3R1cmVyaW5nc2FkcmVz\r\nc2U6PGJyPg0KLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t\r\nLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLTxicj48YnI+DQpMYXJzIFBldGVyc2VuPGJy\r\nIC8+QnlzdMOmdm5ldmVqIDY2LCBCw7hqZGVuPGJyIC8+NTYwMCBGYWFib3JnPGJyIC8+RGVu\r\nbWFyazxiciAvPlRMRjo6IDYwNjczNzY3PGJyIC8+PGEgaHJlZj0ibWFpbHRvOmZhc3RodWdv\r\nQGhvdG1haWwuY29tIj5mYXN0aHVnb0Bob3RtYWlsLmNvbTwvYT48YnIgLz4NCjxicj48YnI+\r\nDQpMZXZlcmluZ3NhZHJlc3NlOjxicj4NCi0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t\r\nLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS08YnI+PGJyPg0K\r\nTGFycyBQZXRlcnNlbjxiciAvPkJ5c3TDpnZuZXZlaiA2NiwgQsO4amRlbjxiciAvPjU2MDAg\r\nRmFhYm9yZzxiciAvPkRlbm1hcms8YnIgLz5UTEY6OiA2MDY3Mzc2NzxiciAvPjxhIGhyZWY9\r\nIm1haWx0bzpmYXN0aHVnb0Bob3RtYWlsLmNvbSI+ZmFzdGh1Z29AaG90bWFpbC5jb208L2E+\r\nPGJyIC8+DQo8YnI+PGJyPg0KT3JkcmVkYXRhOjxicj4NCi0tLS0tLS0tLS0tLS0tLS0tLS0t\r\nLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS08\r\nYnI+DQoNCiAgMSwwMCBzdGsuIFgzOiB
UZXJyYW4gQ29uZmxpY3QgUEMgKDMxNDMyKSDDoSBE\r\nS0sgMjk4LDM5IC0gSWFsdDogREtLIDM3Miw5OQ0KPGJyPg0KICAxLDAwIHN0ay4gQ3J5c2lz\r\nIE1heGltdW0gRWRpdGlvbiBQQyAoNDgwNDgpIMOhIERLSyAxNTcsNTkgLSBJYWx0OiBES0sg\r\nMTk2LDk5DQo8YnI+DQo8YnI+DQpCZXRhbGluZzogMjogRGFuc2tlIGtyZWRpdGtvcnQgW3Ry\r\nYW5zYWt0aW9uc2dlYnlyIDEsMjUlXSAoREtLIDcsMTMpDQo8YnI+DQpGb3JzZW5kZWxzZTog\r\nIChES0sgMCwwMCkNCjxicj48YnI+DQpTYW1sZXQgcHJpcyA6IERLSyA1NzcsMTENCjxicj4N\r\nCkhlcmFmIG1vbXM6IERLSyAxMTUsNDMNCg==\r\n\r\n", "BODY[HEADER.FIELDS (FROM TO DATE SUBJECT)]"=>"From: me <noreply@128secure.net>\r\nTo: me <my@email.com>\r\nDate: Wed, 19 Dec 2012 01:55:44 +0100\r\nSubject: Ordre (Kopi)\r\n\r\n"}>

## Getting the email information ##

header_attr = email.attr["BODY[HEADER.FIELDS (FROM TO DATE SUBJECT)]"]
header      = Mail.read_from_string(header_attr) # => #<Mail::Message:70179653144480, Multipart: false, Headers: <Date: Wed, 19 Dec 2012 01:55:44 +0100>, <From: me <noreply@128secure.net>>, <To: me <my@email.com>>, <Subject: Ordre (Kopi)>>

I can then access the following:

header.date.to_time # => 2012-12-18 16:55:44 -0800
header.from.first   # => noreply@128secure.net
header.to.first     # => my@email.com
header.subject      # => Ordre (Kopi)
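For comparison, the same four fields can be pulled out of the header string with the standard library alone. This is only a minimal sketch: `parse_simple_headers` is a hypothetical helper (not part of the Mail gem), and it assumes unfolded header lines without RFC 2047 encoded words, which real mail often violates.

```ruby
require "time"

# Hypothetical helper (not part of the Mail gem): splits a small, unfolded
# header block into a hash keyed by lowercase field name.
def parse_simple_headers(raw)
  raw.each_line("\r\n").each_with_object({}) do |line, fields|
    name, value = line.chomp("\r\n").split(":", 2)
    next if value.nil? # skips the blank line that terminates the header block
    fields[name.downcase] = value.strip
  end
end

# Same header string as in the fetch result shown earlier.
raw = "From: me <noreply@128secure.net>\r\n" \
      "To: me <my@email.com>\r\n" \
      "Date: Wed, 19 Dec 2012 01:55:44 +0100\r\n" \
      "Subject: Ordre (Kopi)\r\n\r\n"

fields  = parse_simple_headers(raw)
sent_at = Time.rfc2822(fields["date"])                    # Time with a +01:00 offset
sender  = fields["from"][/<([^>]+)>/, 1] || fields["from"] # address inside <...>, if any
```

Whether this is actually faster than `Mail.read_from_string` for a given mailbox would still need to be measured, and it gives up the robustness of a full parser.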

Delay 1: Parsing the header takes 0.010000 seconds of CPU time (0.004163 seconds wall-clock):

puts Benchmark.measure { Mail.read_from_string(header_attr) } # => 0.010000   0.000000   0.010000 (  0.004163)
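A single `Benchmark.measure` call at this scale is noisy, because the CPU-time columns are reported in coarse steps. A sketch of averaging over many iterations follows; `average_seconds` is a hypothetical helper, and the block uses a stand-in workload so the example runs without the Mail gem.

```ruby
require "benchmark"

# Hypothetical helper: runs the block `iterations` times and returns the
# average wall-clock seconds per call.
def average_seconds(iterations)
  total = Benchmark.realtime { iterations.times { yield } }
  total / iterations
end

# Stand-in workload. In the question's setting the block would instead call
# Mail.read_from_string(header_attr).
sample   = "Subject: Ordre (Kopi)\r\n" * 100
per_call = average_seconds(1_000) { sample.split("\r\n") }
puts format("%.6f s per call", per_call)
```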

## Getting the email message (body) ##

message_attr = email.attr["RFC822"]
message      = Mail.read_from_string(message_attr) # => #<Mail::Message:70179643743140, Multipart: false, Headers: <Return-Path: <noreply@128secure.net>>, <Received: by 10.223.148.78 with SMTP id o14csp218630fav; Tue, 18 Dec 2012 16:55:50 -0800 (PST)>, <Received: from exproxy-1.exserver.dk (exproxy-1.exserver.dk. [195.69.129.162]) by mx.google.com with ESMTP id m13si17440569wie.32.2012.12.18.16.55.47; Tue, 18 Dec 2012 16:55:48 -0800 (PST)>, <Received: by exproxy-1.exserver.dk (Postfix, from userid 65534) id 5330511CDCB; Wed, 19 Dec 2012 01:55:42 +0100 (CET)>, <Received: from EXHUB02.exchangeserver.dk (exhub02.exchangeserver.dk [193.239.98.62]) by exproxy-1.exserver.dk (Postfix) with ESMTP id 4735211A58E for <my_email.com@exfwd01.scannet.dk>; Wed, 19 Dec 2012 01:55:42 +0100 (CET)>, <Received: from front07.exserver.dk (195.69.129.92) by EXHUB02.exchangeserver.dk (193.239.98.60) with Microsoft SMTP Server id 8.2.176.0; Wed, 19 Dec 2012 01:58:49 +0100>, <Received: from localhost (front07.exserver.dk [127.0.0.1]) by front07.exserver.dk (Postfix) with ESMTP id 0B8287B4015 for <my@email.com>; Wed, 19 Dec 2012 01:55:45 +0100 (CET)>, <Received: from front07.exserver.dk ([127.0.0.1]) by localhost (front07.exserver.dk [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id vrjzzlpsuXn6 for <my@email.com>; Wed, 19 Dec 2012 01:55:42 +0100 (CET)>, <Received: from shopmail.scannet.dk (shopmail.scannet.dk [195.69.129.120]) by front07.exserver.dk (Postfix) with ESMTP id A6F797B4002 for <my@email.com>; Wed, 19 Dec 2012 01:55:42 +0100 (CET)>, <Received: from WebSrv100 (unknown [193.239.97.100]) by shopmail.scannet.dk (Postfix) with ESMTP id 6DFEF7FE4E for <my@email.com>; Wed, 19 Dec 2012 01:55:34 +0100 (CET)>, <Date: Wed, 19 Dec 2012 01:55:44 +0100>, <From: me <noreply@128secure.net>>, <Reply-To: <my@email.com>>, <To: me <my@email.com>>, <Message-ID: <20121219005542.A6F797B4002@front07.exserver.dk>>, <Subject: Ordre (Kopi)>, <Mime-Version: 1.0>, <Content-Type: text/html; 
charset="utf-8">, <Content-Transfer-Encoding: base64>, <Delivered-To: my@email.com>, <X-Received: by 10.194.177.199 with SMTP id cs7mr8044338wjc.41.1355878548414; Tue, 18 Dec 2012 16:55:48 -0800 (PST)>, <Received-SPF: pass (google.com: domain of noreply@128secure.net designates 195.69.129.162 as permitted sender) client-ip=195.69.129.162;>, <Authentication-Results: mx.google.com; spf=pass (google.com: domain of noreply@128secure.net designates 195.69.129.162 as permitted sender) smtp.mail=noreply@128secure.net>, <X-Virus-Scanned: amavisd-new at exserver.dk>, <X-ScanNet-Forward: TTL=5>>

To ensure UTF-8 encoding, I do the following:

  if message.multipart?
    body = message.text_part.decoded.force_encoding("UTF-8").encode("UTF-8")
  else
    body = message.body.decoded.force_encoding(message.charset).encode("UTF-8") # => "Ordre fra mig:<br>\r\n-------------------------------------------------------------------------<br>\r\nOrdredato: 19-12-2012 01:55:43<br>\r\nOrdrenummer: 108654<br>\r\nTransaktionsID: 167286\r\n<br><br>\r\nFaktureringsadresse:<br>\r\n-------------------------------------------------------------------------<br><br>\r\nLars Larsen<br />En vej 66, Bøjden<br />1900 Frederiksberg<br />Denmark<br />TLF:: 12345678<br /><a href=\"mailto:din@email.com\">din@email.com</a><br />\r\n<br><br>\r\nLeveringsadresse:<br>\r\n-------------------------------------------------------------------------<br><br>\r\nLars Larsen<br />En vej 66<br />1900 Frederiksberg<br />Denmark<br />TLF:: 12345678<br /><a href=\"mailto:en@mail.com\">en@mail.com</a><br />\r\n<br><br>\r\nOrdredata:<br>\r\n-------------------------------------------------------------------------<br>\r\n\r\n  1,00 stk. X3: Terran Conflict PC (31432) á DKK 298,39 - Ialt: DKK 372,99\r\n<br>\r\n  1,00 stk. Crysis Maximum Edition PC (48048) á DKK 157,59 - Ialt: DKK 196,99\r\n<br>\r\n<br>\r\nBetaling: 2: Danske kreditkort [transaktionsgebyr 1,25%] (DKK 7,13)\r\n<br>\r\nForsendelse:  (DKK 0,00)\r\n<br><br>\r\nSamlet pris : DKK 577,11\r\n<br>\r\nHeraf moms: DKK 115,43\r\n"
  end
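One caveat with the snippet above: as far as I know, calling `encode("UTF-8")` on a string already tagged as UTF-8 is a no-op, so invalid bytes pass through, and `message.charset` may be missing or unknown. A more defensive sketch follows; `to_utf8` is a hypothetical helper, and falling back to UTF-8 is an assumption rather than anything the Mail gem prescribes.

```ruby
# Hypothetical helper: converts decoded body bytes to valid UTF-8.
def to_utf8(bytes, charset)
  source = begin
    Encoding.find(charset.to_s)
  rescue ArgumentError
    Encoding::UTF_8 # assumption: treat a missing or unknown charset as UTF-8
  end

  bytes.dup
       .force_encoding(source)
       .encode("UTF-8", invalid: :replace, undef: :replace)
       .scrub # replaces any invalid sequences left when source was already UTF-8
end

latin1_bytes = "B\xF8jden".b               # "Bøjden" encoded as ISO-8859-1
puts to_utf8(latin1_bytes, "ISO-8859-1")
```

With this helper, both branches of the conditional could pass their decoded string and the relevant part's charset through the same call.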

Delay 2: Parsing the full message takes 0.050000 seconds of CPU time (0.054013 seconds wall-clock):

puts Benchmark.measure { Mail.read_from_string(message_attr) } # => 0.050000   0.000000   0.050000 (  0.054013)
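To put that figure in perspective, here is a rough extrapolation. It assumes one million messages parsed sequentially and that this single measurement is representative, neither of which is established above.

```ruby
seconds_per_message = 0.054013   # wall-clock time from the measurement above
message_count       = 1_000_000  # assumption: "millions" taken as one million

total_seconds = seconds_per_message * message_count
total_hours   = total_seconds / 3600.0
puts format("%.1f hours", total_hours) # prints "15.0 hours"
```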
  • What I would do: read the source. Find out what the Mail gem is doing. Optimize. Submit a pull request. – Mark Thomas Dec 21 '14 at 16:47
  • Thanks, but I suppose you could post that type of comment on any SO question ever asked. – Cjoerg Dec 21 '14 at 17:13
  • @ChristofferJoergensen, I suppose you could. What Mark proposes, it's also a strategy with higher chance of success than waiting when (if ever) someone on stackoverflow will have enough free time to do this. – Sergio Tulentsev Dec 21 '14 at 19:30
  • I'm sorry, but is there something wrong with the question? Isn't this what SO is for or am I missing the point of SO? – Cjoerg Dec 21 '14 at 19:32
  • @ChristofferJoergensen: this question is on-topic here, I guess. It's just that a person who already knows the answer to it (reimplemented email parsing or whatever) hasn't popped by yet. And for the rest of us, producing the answer requires doing the work. – Sergio Tulentsev Dec 21 '14 at 19:36
  • @SergioTulentsev, thanks for elaborating. But writing a coherent question like this takes several hours, so of course I have tried to find a solution on my own first. – Cjoerg Dec 21 '14 at 19:41
  • @ChristofferJoergensen: yes, the question is not badly written. Although I'm not sure if the encoding part is relevant. Have you tried profiling this (the whole email parsing thing), so that you know exactly how much time each line takes to execute? [rblineprof](https://github.com/tmm1/rblineprof) is good. – Sergio Tulentsev Dec 21 '14 at 19:47

1 Answer


If you've got the email already parsed into fields...

header.date.to_time # => 2012-12-18 16:55:44 -0800
header.from.first   # => noreply@128secure.net
header.to.first     # => my@email.com
header.subject      # => Ordre (Kopi)

...then why make Mail::new parse it all over again? Instead of calling Mail.read_from_string(message_attr), try something like this:

message = Mail.new({to: header.to.first,   # recipient address, not the date
                    from: header.from.first,
                    subject: header.subject,
                    body: body })
  • Hi @DavidHempty. Thanks for your answer. But I'm not sure I understand it. The reason why I call the `Mail.read_from_string(message_attr)` method twice is because I'm calling it with 2 different arguments. In your suggestion you use the `Mail.new` method, but where do you get the `body` variable from? – Cjoerg Jan 31 '15 at 00:25