I'm currently deduplicating emails on a per-user (email account) basis. I create a SHA-512 hash of several headers (Message-ID, Subject, From, Date, To). After that I store the full email (MIME string) in a file and insert the metadata (subject, from, to, cc ...) combined with a "userID" field into Elasticsearch.
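To make the current approach concrete, here is a minimal sketch of the per-user hashing step, assuming Python's standard `email` and `hashlib` modules. The function name `dedup_key` and the null-byte separator between fields are illustrative choices, not my exact production code.

```python
import hashlib
from email import message_from_string

# Headers that feed the deduplication hash (the ones listed above).
DEDUP_HEADERS = ("Message-ID", "Subject", "From", "Date", "To")


def dedup_key(raw_mime: str) -> str:
    """Return a SHA-512 hex digest over the selected headers of one message."""
    msg = message_from_string(raw_mime)
    digest = hashlib.sha512()
    for name in DEDUP_HEADERS:
        # A missing header is hashed as an empty string.
        value = str(msg.get(name, "")).strip()
        digest.update(name.lower().encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent fields cannot run together
        digest.update(value.encode("utf-8", errors="replace"))
        digest.update(b"\x00")
    return digest.hexdigest()
```

Headers outside this list (for example `Received`) do not affect the key, so two copies that differ only in those headers hash to the same value.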
This works fine on a per-user basis, but I could greatly reduce storage costs by deduplicating on a global basis. The problem is that when UserA and UserB both receive the same message, some of the headers can differ between the two copies. The headers of the sender's own copy are different as well.
Any tips on how to implement this are greatly appreciated.
P.S. One solution would be to save the MIME file without headers and save the headers separately per user. To get the full email of UserA, I would combine the MIME file with the headers for that file that are linked to UserA. But this solution seems a bit inefficient to me.
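A rough sketch of that idea, in Python, using an in-memory dictionary in place of the file store and Elasticsearch. The names `split_message`, `DedupStore`, and `msg_id` are hypothetical, and the sketch assumes the body bytes are identical for every recipient, which I have not verified for all mail paths.

```python
import hashlib
from typing import Dict, Tuple


def split_message(raw: bytes) -> Tuple[bytes, bytes]:
    """Split a raw message at the first blank line into (header block, body).

    The header block keeps the blank line, so header + body is byte-exact.
    """
    hits = []
    for sep in (b"\r\n\r\n", b"\n\n"):
        idx = raw.find(sep)
        if idx != -1:
            hits.append((idx, len(sep)))
    if not hits:
        return raw, b""  # no blank line: treat everything as headers
    idx, sep_len = min(hits)  # earliest separator wins
    cut = idx + sep_len
    return raw[:cut], raw[cut:]


class DedupStore:
    """Stores each distinct body once and the headers once per user message."""

    def __init__(self) -> None:
        self.bodies: Dict[str, bytes] = {}  # body hash -> body bytes
        self.headers: Dict[Tuple[str, str], Tuple[bytes, str]] = {}

    def put(self, user_id: str, msg_id: str, raw: bytes) -> str:
        header_block, body = split_message(raw)
        body_hash = hashlib.sha512(body).hexdigest()
        self.bodies.setdefault(body_hash, body)  # written only the first time
        self.headers[(user_id, msg_id)] = (header_block, body_hash)
        return body_hash

    def get(self, user_id: str, msg_id: str) -> bytes:
        header_block, body_hash = self.headers[(user_id, msg_id)]
        return header_block + self.bodies[body_hash]
```

Reassembly here is one extra lookup plus a concatenation per read; whether that counts as inefficient depends on how the shared bodies are stored and fetched in practice.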