I'm trying to achieve at-most-once processing of email messages retrieved over IMAP. (I asked a related question about it.)

Is it reliable to compute a cryptographic hash code of the MIME messages retrieved over IMAP to deduplicate them?

In other words, could the same email yield different content when retrieved over IMAP multiple times? Can an email change its contents, for example when it's moved across folders, marked as read, or for some other reason?

I'm using hMailServer on Windows with MailKit (.NET) as the client. Not sure this matters, though.

boot4life
  • If you're hashing just the contents, then no, it can't change, but actual duplicates can exist (messages can be copied in IMAP). I think you're over-engineering the problem, though; the UID should be sufficient. – Max Jun 12 '16 at 13:48
  • http://crypto.stackexchange.com/questions/2583/is-it-fair-to-assume-that-sha1-collisions-wont-occur-on-a-set-of-100k-strings/2584 – Hans Passant Jun 12 '16 at 14:57

1 Answer

Many mailing lists append a footer, so mail sent both to me and to a list arrives as two copies whose bodies differ, and a hash of the content would differ too.

Most people consider this to be one message.

I suggest using the message-id header field for at-most-once processing. AFAICT it's been reliably unique for the last ten years (the last collision I've seen was from around 2000).
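
To make that concrete, here is a minimal sketch in Python (standard library only, not MailKit) of keying on Message-ID. The names `dedup_key` and `Deduplicator` are made up for illustration, and falling back to a SHA-256 of the raw bytes when the header is missing is my own assumption, not something the mail standards prescribe. The in-memory set also stands in for whatever persistent store you would really use.

```python
import hashlib
from email import message_from_bytes


def dedup_key(raw: bytes) -> str:
    """Return a key identifying a message for at-most-once processing."""
    msg = message_from_bytes(raw)
    message_id = msg.get("Message-ID")
    if message_id is not None and str(message_id).strip():
        # Header names are matched case-insensitively by the email package.
        return "mid:" + str(message_id).strip()
    # Assumption: with no Message-ID, fall back to hashing the raw bytes.
    return "sha256:" + hashlib.sha256(raw).hexdigest()


class Deduplicator:
    """Hypothetical helper; a real one would persist the seen keys."""

    def __init__(self) -> None:
        self._seen = set()

    def should_process(self, raw: bytes) -> bool:
        key = dedup_key(raw)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


# The same mail received directly and via a list that adds a tag and footer.
direct = (
    b"Message-ID: <abc123@example.com>\r\n"
    b"Subject: Hello\r\n\r\n"
    b"Body text\r\n"
)
via_list = (
    b"Message-ID: <abc123@example.com>\r\n"
    b"Subject: [list] Hello\r\n\r\n"
    b"Body text\r\n-- \r\nList footer\r\n"
)

dedup = Deduplicator()
first = dedup.should_process(direct)     # first sighting of this Message-ID
second = dedup.should_process(via_list)  # same Message-ID, different bytes
```

The two copies hash differently but share a key, so only the first one would be processed. If you want the extra collision check, you could compare a few other headers before treating two messages as the same.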

arnt
  • It might be worth checking the subject or a few other headers as well to make sure that it's not a collision (keep in mind that sometimes mailing lists also prepend `"[list-name] "` to the Subject value). – jstedfast Jun 13 '16 at 12:28
  • I was worried clients might not send one at all. – boot4life Jun 15 '16 at 11:41
  • Clients learnt long ago that if they don't, their mail gets caught in spam filters. SpamAssassin has a half-dozen rules to match mail with bad or (partly) missing message-ids, for example. – arnt Jun 15 '16 at 12:21