
The proposed storage model is to store attachments in separate files (or blobs) and to store the email itself as a MIME multipart message in which each attachment is replaced by a reference to the stored file, plus a record of how it was encoded. This lets the user Show Original without requiring me to store the less efficient base64 text with the message. Most of the time I will only need to store the base64 line length that was used.

This way, we can perform attachment-level deduplication.

But how can the deduplication go further? Here are my thoughts:

  • All attachments and emails could of course be compressed individually (byte-level deduplication within each file).
  • I could compress sets of maybe 12 attachments together in a single file. Compressing multiple files of the same type (for example, PDFs) together, perhaps even those from the same sender, may be more effective than compressing each one alone.
  • The MIME messages can also be compressed in sets.
  • I am not concerned about search efficiency, because searching the emails will use a full-text index, which would not be compressed.
  • A decompressed cache copy would be created when the email first arrives, and would only be deleted after the email has not been viewed for some time.
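
To make the set-compression idea concrete, here is a minimal Python sketch that compares compressing each attachment on its own with compressing a whole set as one stream. The sample data, the function names, and the use of `zlib` are assumptions chosen for illustration; they are not part of the proposal above, and a real system would also need an index of offsets to find one attachment inside the combined stream.

```python
import random
import zlib


def compress_individually(blobs):
    """Total size when every blob gets its own zlib stream."""
    return sum(len(zlib.compress(b, 9)) for b in blobs)


def compress_as_set(blobs):
    """Size when all blobs are concatenated into one zlib stream."""
    return len(zlib.compress(b"".join(blobs), 9))


def make_similar_blobs(count=12, shared_size=2000, unique_size=500, seed=0):
    """Hypothetical test data: blobs that share a common prefix.

    The shared prefix stands in for boilerplate that files from the same
    tool or sender tend to have in common. The bytes are random, so a
    single blob on its own is essentially incompressible.
    """
    rng = random.Random(seed)
    shared = bytes(rng.randrange(256) for _ in range(shared_size))
    return [
        shared + bytes(rng.randrange(256) for _ in range(unique_size))
        for _ in range(count)
    ]


blobs = make_similar_blobs()
individual_total = compress_individually(blobs)
set_total = compress_as_set(blobs)
print("individually:", individual_total, "bytes")
print("as one set:  ", set_total, "bytes")
```

The gain only appears when the repeated content falls inside the compressor's window (32 KiB for deflate), so large attachments may need a compressor with a bigger window. The cost is that reading one attachment means decompressing the ones stored before it in the same set.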

Do you have any advice in this area? What is normal for an email storage system?

700 Software
  • If your "Show source" is going to show something other than exactly what the message looked like when you received it, all sorts of spam-reporting systems will be extremely unhappy with you. We already advise people to switch from Outlook to *anything* else for this reason. – tripleee Jan 25 '12 at 21:12
  • It is going to show the exact same thing. Guaranteed. The reference will automatically be replaced with the attachment, and the file will be encoded exactly the same. That is where I would say base64, x characters per line. If it was some unusual encoding, then the reference will not be used. This is how we can always ensure exact re-creation of the original MIME message. Let me know if this clarification is not clear. – 700 Software Jan 25 '12 at 21:15

1 Answer

  1. Decode all base64 MIME parts, not only attachments.
  2. Calculate a secure hash of each part's decoded content.
  3. Replace the part with a reference in the email body, or create a custom header listing the extracted MIME parts.
  4. Store the content in blob storage under its secure hash (content-addressable storage).
  5. Use a reference counter for deletions and garbage collection, or a smarter double counter (https://docs.wildduck.email/#/in-depth/attachment-deduplication, https://medium.com/@andrewsumin/efficient-storage-how-we-went-down-from-50-pb-to-32-pb-99f9c61bf6b4).
  6. Alternatively, store each hash-to-email-ID reference relation in a database.
  7. Carefully check and control base64 folds: some emails have a shorter line in the middle, and some have additional characters (dot, whitespace) at the end.
  8. Store the encoding parameters (folds, tail) in the reference in the email body for exact reconstruction.
  9. Compress compressible attachments, but be careful with content-addressable storage, because compression changes the stored bytes and therefore their hash.
  10. JPEG images can be compressed significantly and losslessly using JPEG XL or Lepton (https://github.com/dropbox/lepton).
  11. WAV files can be compressed using FLAC, etc.
  12. The Content-Type is specified by the sender, so the same attachment can arrive with different content types.
  13. Quoted-printable parts are hard to decode and reconstruct exactly. There are many encoder parameters, because each encoder escapes different characters and folds lines differently.
  14. Be careful with the reference format, so that a malicious sender cannot craft an email containing a reference and fetch an attachment they do not own. Alternatively, detect and escape references in received emails.
  15. Small MIME parts may not be worth extracting until a certain number of duplicates are present in the system.
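
Steps 1 to 5, 7 and 8 could be sketched roughly as follows in Python. This is only an illustration under simplifying assumptions: the store is an in-memory dict, the reference is a plain dict, and the names `reencode`, `extract_part` and `restore_part` are hypothetical. It handles only the regular case (every line the same length, one newline style, nothing after the last line) and refuses to extract anything it cannot rebuild byte for byte.

```python
import base64
import hashlib


def reencode(raw: bytes, line_len: int, newline: bytes) -> bytes:
    """Base64-encode raw and fold it into lines of line_len characters."""
    b64 = base64.b64encode(raw)
    lines = [b64[i:i + line_len] for i in range(0, len(b64), line_len)]
    return newline.join(lines) + newline


def extract_part(body: bytes, store: dict):
    """Move a base64 body into the store and return a reference to it.

    Returns None when the body cannot be reproduced exactly from the
    recorded parameters; the caller should then keep the part inline.
    """
    newline = b"\r\n" if b"\r\n" in body else b"\n"
    line_len = len(body.split(newline, 1)[0])
    if line_len == 0:
        return None
    try:
        raw = base64.b64decode(body)  # newlines are skipped while decoding
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None
    # Irregular folds or trailing characters show up as a mismatch here.
    if reencode(raw, line_len, newline) != body:
        return None
    digest = hashlib.sha256(raw).hexdigest()
    entry = store.setdefault(digest, {"data": raw, "refs": 0})
    entry["refs"] += 1  # simple reference counter for later garbage collection
    return {"sha256": digest, "line_len": line_len, "newline": newline}


def restore_part(ref: dict, store: dict) -> bytes:
    """Rebuild the original base64 body from a reference."""
    raw = store[ref["sha256"]]["data"]
    return reencode(raw, ref["line_len"], ref["newline"])


store = {}
payload = bytes(range(256)) * 3
body = reencode(payload, 76, b"\r\n")

ref_first = extract_part(body, store)   # first email carrying the attachment
ref_second = extract_part(body, store)  # second email with the same content
print(len(store), store[ref_first["sha256"]]["refs"])
print(restore_part(ref_first, store) == body)
```

Re-encoding immediately and comparing against the original is what makes the "exact reconstruction" guarantee checkable at extraction time. Hashing the decoded bytes rather than the base64 text means the same file deduplicates even when two senders folded it at different line lengths.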