1

I'm working on a mail gateway that will automatically provide (among other things) "view in browser" functionality for every email sent through it. This means all emails have to be stored somewhere they can be easily accessed. Even though the retention period is limited, and even with gzip applied to each message before saving, we're looking at ~500GB of storage just to keep recent messages.

Since the emails are mostly identical (except for a few personal variables), I was wondering whether there is a more efficient way to store them: something that deduplicates content across multiple records, for example. Any suggestions?

An alternative would be to save the template once and store only the variables for each email sent, but we don't want to do that, because the process should be transparent to the sender. This means the template and variables would not be available to us and would have to be deduced after the fact.

Sergey
  • Not an option to save the template and then merge the data in dynamically when viewed for a given user? – ryan1234 Sep 04 '13 at 17:29
  • Only if I can write an algorithm to identify the template myself, as this needs to be transparent to the sender of the message. So I'd like to use a solution that already achieves as much efficiency as possible, without having to get into this. – Sergey Sep 04 '13 at 17:33

2 Answers

1
  1. If images, attachments, or other MIME parts are duplicated across messages, you can deduplicate them by content hash: store each distinct part once and reference it from every message that contains it.

  2. You could pack multiple messages into a TAR or MBOX file and compress the whole file before storing it. The compression ratio will be better because the compressor sees more duplicate bytes in one file. Random access to a single email gets harder, though, the more emails are compressed into one file.

  3. Train a custom compression dictionary and compress each email independently. Zstd supports this, for example: https://facebook.github.io/zstd/#small-data
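To illustrate the first option, here is a minimal sketch of content-hash deduplication using only the Python standard library. `PartStore` and `build_message` are hypothetical names made up for this example, the store is just an in-memory dict, and headers are not kept, so this shows the idea rather than a complete archive format.

```python
import hashlib
from email import message_from_bytes, policy
from email.message import EmailMessage


class PartStore:
    """Hypothetical in-memory store that keeps each distinct part payload once."""

    def __init__(self):
        self.blobs = {}     # sha256 hex digest -> decoded payload bytes
        self.messages = {}  # message id -> list of (content type, digest)

    def add(self, msg_id, raw):
        """Split a raw message into leaf parts and store each payload by hash."""
        msg = message_from_bytes(raw, policy=policy.default)
        refs = []
        for part in msg.walk():
            if part.is_multipart():
                continue  # containers carry no payload of their own
            payload = part.get_payload(decode=True) or b""
            digest = hashlib.sha256(payload).hexdigest()
            # An identical payload (e.g. the same logo) is stored only once.
            self.blobs.setdefault(digest, payload)
            refs.append((part.get_content_type(), digest))
        self.messages[msg_id] = refs
        return refs


def build_message(name, logo):
    """Hypothetical helper: a personalised body plus a shared attachment."""
    m = EmailMessage()
    m["Subject"] = "Hello"
    m.set_content(f"Hi {name}, welcome!")
    m.add_attachment(logo, maintype="image", subtype="png", filename="logo.png")
    return m.as_bytes()
```

Adding two messages built with the same `logo` bytes leaves one image blob and two text blobs in `blobs`. Note that this only helps with parts that are byte-identical after decoding; the personalised bodies still need one of the compression approaches.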

EDIT: added third solution
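The third option (a shared dictionary, with each email still compressed on its own) can be sketched with the preset-dictionary support in Python's standard `zlib` module. This is a stand-in for Zstd, not Zstd itself: zlib has no dictionary trainer, so the sketch assumes you simply use one representative message as the dictionary, and DEFLATE can only look back 32 KB into it. The function names and the sample template are made up for the example.

```python
import zlib


def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    """Compress one message against a shared preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    """Inverse of compress_with_dict; needs the very same dictionary bytes."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


# Hypothetical sample data: one template, two recipients.
TEMPLATE = (
    "<html><body><h1>Monthly newsletter</h1>"
    "<p>Dear {name},</p>"
    "<p>Thank you for being a customer. Here are this month's offers, "
    "news about our products, and tips for getting the most out of "
    "your account.</p>"
    "<p>You can unsubscribe at any time from your settings page.</p>"
    "</body></html>"
)

# Assumption: the first message of a mailing serves as the dictionary.
dictionary = TEMPLATE.format(name="Alice").encode()
message = TEMPLATE.format(name="Bob").encode()

plain = zlib.compress(message, 9)
shared = compress_with_dict(message, dictionary)

if __name__ == "__main__":
    print(len(message), len(plain), len(shared))
```

Each email stays individually addressable, so random access is as cheap as with per-message gzip, but you have to keep the dictionary around (and versioned) for as long as anything compressed with it is retained.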

0

This should all be done dynamically. Store the email once, as it existed before the subscriber-specific content (merge tags/variables) was added. Each email would need a "view in browser" link that is unique to its subscriber. Based on that link, you would then fill in the subscriber's variables when serving the browser version.

If there is a lot of unique content, you might want to keep it in a database; if it is just the subscriber's name, for example, you could pass it as a URL parameter.
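Roughly, that flow could look like the sketch below. Everything in it is an assumption for illustration: the in-memory dicts stand in for a database, `mail.example.com` is a placeholder host, and the `$name` placeholder syntax is just what Python's `string.Template` happens to use.

```python
import html
import secrets
from string import Template

TEMPLATES = {}   # campaign id -> template stored once per mailing
RECIPIENTS = {}  # opaque token -> (campaign id, per-subscriber variables)


def register(campaign_id, variables):
    """Store one subscriber's variables and return their unique view link."""
    token = secrets.token_urlsafe(16)
    RECIPIENTS[token] = (campaign_id, dict(variables))
    return f"https://mail.example.com/view/{token}"


def render(token):
    """Merge the stored template with the variables behind this token."""
    campaign_id, variables = RECIPIENTS[token]
    # Escape values so subscriber data cannot inject markup into the page.
    safe = {key: html.escape(str(value)) for key, value in variables.items()}
    return Template(TEMPLATES[campaign_id]).safe_substitute(safe)
```

An opaque token keeps personal data out of the URL; passing the name directly as a query parameter is simpler but exposes it in logs and referrers.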

John
  • I mentioned in my original post and in the comment that this is not possible with the way we want it to be implemented. – Sergey Sep 04 '13 at 17:38