
I'm currently deduplicating emails on a per-user (email account) basis. I create a SHA-512 hash of several headers (Message-ID, Subject, From, Date, To). After that I store the full email (MIME string) in a file and insert the metadata (Subject, From, To, CC, ...) combined with a "userID" field into Elasticsearch.
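
For reference, this is roughly how I build that per-user key (a simplified Python sketch; the whitespace/case normalisation here is just one way to do it, not necessarily what I have in production):

```python
import hashlib
from email import message_from_string

# Headers that go into the per-user dedup key, as described above.
DEDUP_HEADERS = ("message-id", "subject", "from", "date", "to")

def per_user_dedup_key(raw_mime: str, user_id: str) -> str:
    msg = message_from_string(raw_mime)
    parts = [user_id]  # scope the key to one email account
    for name in DEDUP_HEADERS:
        value = msg.get(name, "") or ""
        # Collapse whitespace and lowercase so trivial formatting differences don't break dedup.
        parts.append(" ".join(value.split()).lower())
    return hashlib.sha512("\n".join(parts).encode("utf-8")).hexdigest()
```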

This works fine on a per-user basis, but I could reduce storage costs greatly by deduplicating on a global basis. The problem is that when UserA and UserB both receive the same message, some headers can be different, and the headers on the sender's own copy differ as well.

Any tips on how to approach this are greatly appreciated.

P.S. One solution would be to save the MIME file without headers and save the headers separately per user. To get the full email for UserA, I would combine the MIME file with the headers linked to UserA. But this solution seems a bit inefficient to me? A rough sketch of what I mean is below.
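
(The storage layer here is just in-memory dicts, purely illustrative:)

```python
import hashlib

shared_bodies = {}      # body_hash -> MIME body, stored once globally
per_user_headers = {}   # (user_id, body_hash) -> that user's raw header block

def store(raw_mime: str, user_id: str) -> str:
    # Split the raw message into the header block and the body.
    header_block, _, body = raw_mime.partition("\r\n\r\n")
    if not body:  # fall back to bare-LF separators
        header_block, _, body = raw_mime.partition("\n\n")
    body_hash = hashlib.sha512(body.encode("utf-8")).hexdigest()
    shared_bodies.setdefault(body_hash, body)          # dedup the large part
    per_user_headers[(user_id, body_hash)] = header_block
    return body_hash

def load(user_id: str, body_hash: str) -> str:
    # Reassemble the full email for one user: their headers + the shared body.
    return per_user_headers[(user_id, body_hash)] + "\n\n" + shared_bodies[body_hash]
```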

Floris

1 Answer


I work in an industry (litigation discovery) that involves dealing with the exact same question you pose - see this blog post (you can skip down about halfway to the 'quick primer on deduplication' numbered list and comments) for basically the exact same dilemma that you mention, i.e., that some email header fields will invariably, um, vary, making it virtually impossible to globally dedupe based on all headers.

To deal with this, the software I use for this purpose hashes only the fields shown in the 'key generation' section here. The comments section of the blog post I mentioned offers a good example of hashing based on a subset of fields. Basically, it would be something like the following (a rough sketch of building such a key follows the list):

  • Attachments (list)
  • Body (plain text)
  • CC
  • From
  • Subject
  • To
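
A minimal Python sketch of what a global key over just those fields could look like (the normalisation rules here are my own assumptions, not what the blog or the software actually specifies):

```python
import hashlib
from email import message_from_string

def global_dedup_key(raw_mime: str) -> str:
    msg = message_from_string(raw_mime)

    def norm(value):
        # Collapse whitespace and lowercase so formatting differences don't matter.
        return " ".join((value or "").split()).lower()

    body_parts, attachment_names = [], []
    for part in msg.walk():
        if part.get_content_maintype() == "multipart":
            continue  # skip container parts
        filename = part.get_filename()
        if filename:
            attachment_names.append(norm(filename))
        elif part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            body_parts.append(norm(payload.decode("utf-8", errors="replace")))

    fields = [
        ",".join(sorted(attachment_names)),   # Attachments (list)
        "\n".join(body_parts),                # Body (plain text)
        norm(msg.get("cc", "")),              # CC
        norm(msg.get("from", "")),            # From
        norm(msg.get("subject", "")),         # Subject
        norm(msg.get("to", "")),              # To
    ]
    return hashlib.sha512("\x00".join(fields).encode("utf-8")).hexdigest()
```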

Also, the default hash setting (which I use) is 128-bit MD5, versus SHA-1. You may want to try generating both and comparing the results (how many are 'deduped' using each algorithm), based on the following article on hash collision probabilities (Stack Overflow won't let me post more than two links, sorry):

preshing.com/20110504/hash-collision-probabilities/
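
If you want to run that comparison, a trivial sketch (Python, standard hashlib; `field_strings` is assumed to be whatever canonical string you hash per message) would be:

```python
import hashlib

def dedupe_counts(field_strings):
    # Count how many unique keys each algorithm produces over the same inputs.
    md5_keys = {hashlib.md5(s.encode("utf-8")).hexdigest() for s in field_strings}
    sha1_keys = {hashlib.sha1(s.encode("utf-8")).hexdigest() for s in field_strings}
    return len(md5_keys), len(sha1_keys)
```

In practice the counts should only differ if one algorithm collides, which the linked article shows is astronomically unlikely at email-archive scale.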

bencassedy