Deduplication Suggestions for Email Storage

Question

The proposed storage model is to store attachments in separate files (or blobs), and to store the email itself as a MIME multipart message, with references to the attached file and how it was encoded. This allows the user to Show Original, but does not require me to actually store the less efficient base64 with the message. Most of the time I will be able to store just the base64 line length that was used.

This way, we can perform attachment-level deduplication.

But how can the deduplication go further? Here are my thoughts:

All attachments and emails could, of course, be compressed (byte-level deduplicated) individually.
I could compress sets of maybe 12 attachments together in a single file. Compressing multiple files of the same type (for example, PDFs), even those from the same sender, may be more effective.
The MIME messages can also be compressed in sets.
I am not concerned about search efficiency, because searching the emails would use a full-text index, which would not be compressed.
A decompressed cache entry would be created when the email first arrives, and would be deleted only after the email has not been viewed for some time.

Do you have any advice in this area? What is normal for an email storage system?

If your "Show source" is going to show anything other than exactly what the message looked like when you received it, all sorts of spam-reporting systems will be extremely unhappy with you. We already suggest that people switch from Outlook to *anything* else for this reason. — tripleee, Jan 25 '12 at 21:12
It is going to show the exact same thing. Guaranteed. The reference will automatically be replaced with the attachment, and the file will be encoded exactly the same. That is where I would say base64, x characters per line. If it was some unusual encoding, then the reference will not be used. This is how we can always ensure exact re-creation of the original MIME message. Let me know if this clarification is not clear. — 700 Software, Jan 25 '12 at 21:15
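The rule described in the comment above (use a reference only when the original encoding can be regenerated byte for byte) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `encode_base64` and `fold_params`, and the choice to record just the line length, the newline style, and a trailing-newline flag, are hypothetical and not taken from the thread. Anything that does not round-trip is reported as `None`, meaning the part should be kept inline as received.

```python
import base64


def encode_base64(raw, line_len, newline=b"\r\n", trailing=True):
    """Re-create a folded base64 body from decoded bytes and stored parameters."""
    flat = base64.b64encode(raw)
    lines = [flat[i:i + line_len] for i in range(0, len(flat), line_len)]
    body = newline.join(lines)
    return body + newline if trailing and lines else body


def fold_params(encoded):
    """Return (raw, params) if `encoded` can be regenerated exactly, else None.

    params is (line_len, newline, trailing); these names are assumptions
    made for this sketch.
    """
    newline = b"\r\n" if b"\r\n" in encoded else b"\n"
    trailing = encoded.endswith(newline)
    first_line = encoded.split(newline, 1)[0]
    if not first_line:
        return None
    try:
        raw = base64.b64decode(encoded)  # non-alphabet bytes such as CR/LF are skipped
    except ValueError:
        return None
    params = (len(first_line), newline, trailing)
    # Only trust the parameters if re-encoding gives back the identical bytes.
    if encode_base64(raw, *params) != encoded:
        return None
    return raw, params
```

With this check, an unusual encoder (a short line in the middle, stray trailing characters) simply fails the comparison, so the original text is stored unchanged and "Show Original" stays exact.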

score 0 · Answer 1 · answered Jan 29 '22 at 12:35

Decode all base64 MIME parts, not only attachments.
Calculate a secure hash of each part's decoded content.
Replace the part with a reference in the email body, or add a custom header listing the extracted MIME parts.
Store the content in blob storage under its secure hash (content-addressable storage).
Use a reference counter for deletions and garbage collection, or a smarter double counter (https://docs.wildduck.email/#/in-depth/attachment-deduplication, https://medium.com/@andrewsumin/efficient-storage-how-we-went-down-from-50-pb-to-32-pb-99f9c61bf6b4).
Alternatively, store each hash-to-email-ID reference relation in a database.
Carefully check and control base64 line folds: some emails have a shorter line in the middle, and some have additional characters (dot, whitespace) at the end.
Store the encoding parameters (folds, tail) in the reference in the email body for exact reconstruction.
Compress compressible attachments, but be careful with content-addressable storage, because compression changes the content hash.
JPEG images can be compressed significantly and losslessly using JPEG XL or https://github.com/dropbox/lepton.
WAV files can be compressed using FLAC, and so on.
The Content-Type is specified by the sender, so the same attachment can arrive with different content types.
Quoted-printable encoded parts are hard to decode and reconstruct exactly. There are many encoder parameters, because each encoder escapes different characters and folds lines differently.
Be careful with the reference format, so that a malicious sender cannot craft an email containing a reference and fetch an attachment they do not own. Alternatively, detect and escape references in received emails.
Small MIME parts may not be worth extracting until a certain number of duplicates are present in the system.
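
The storage side of the list above could look something like the following in-memory sketch. It follows the variant that records each hash-to-email-ID relation rather than a bare counter, and it checks that relation on reads so that a forged reference cannot fetch someone else's blob. The class name `BlobStore` and its methods are hypothetical, and a real system would keep both maps in a database and blob store rather than in dictionaries.

```python
import hashlib


class BlobStore:
    """Toy content-addressable store keyed by SHA-256 of the decoded content."""

    def __init__(self):
        self.blobs = {}  # digest -> decoded content
        self.refs = {}   # digest -> set of message ids referencing it

    def put(self, message_id, content):
        """Store content once and record that message_id references it."""
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)
        self.refs.setdefault(digest, set()).add(message_id)
        return digest

    def get(self, message_id, digest):
        """Return content only if this message really owns a reference to it."""
        if message_id not in self.refs.get(digest, ()):
            raise KeyError(digest)
        return self.blobs[digest]

    def release(self, message_id, digest):
        """Drop one reference; delete the blob when no message uses it."""
        owners = self.refs.get(digest)
        if owners is None:
            return
        owners.discard(message_id)
        if not owners:
            del self.refs[digest]
            del self.blobs[digest]
```

Hashing the decoded, uncompressed bytes keeps the key stable even if the blob is later recompressed on disk, which is one way to sidestep the hash-changing problem mentioned above.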
