3

I've got an mbox mailbox containing duplicate copies of messages, which differ only in their "X-Evolution:" header.

I want to remove the duplicates in as quick and simple a way as possible. It seems like this would have been written already, but I haven't found it, although I've looked at the Python mailbox module, the various Perl mbox parsers, formail, and so forth.

Does anyone have any suggestions?

JesseW

3 Answers

7

This is a small script that I used for it:

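#!/bin/bash
# Temporary Message-ID cache that formail uses to detect duplicates
IDCACHE=$(mktemp -p /tmp)
# -D drops messages whose Message-ID is already cached; -s splits the mbox read from stdin
formail -D $((1024*1024*10)) "${IDCACHE}" -s
rm "${IDCACHE}"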

Pipe the mailbox into the script on stdin; the deduplicated mailbox is written to stdout.

-D $((1024*1024*10)) sets a 10 MiB cache, which is more than 10x the amount needed to deduplicate an entire year of my mail. YMMV, so adjust it accordingly. Setting it too high will cause some performance loss; setting it too low will let duplicates slip through.
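
As a quick check of the arithmetic, the expression handed to -D in the script above expands to a plain byte count. This is only a sketch; the variable name CACHE_BYTES is made up for illustration and is not used by formail.

```shell
#!/bin/sh
# CACHE_BYTES is a hypothetical name; the script above passes the expression inline.
# 1024 * 1024 bytes = 1 MiB, times 10 = 10 MiB.
CACHE_BYTES=$((1024*1024*10))
echo "$CACHE_BYTES"   # prints 10485760
```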

formail is part of the procmail utility bundle, mktemp is part of coreutils.
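
To see whether a mailbox contains duplicates at all, before or after running it through the script, one rough approach is to compare the number of Message-ID headers with the number of distinct ones. The sketch below is not part of formail; the function name count_message_ids and the sample messages are made up for illustration. It assumes each Message-ID header sits on a single unfolded line, and it will also count body lines that happen to start with "Message-ID:".

```shell
#!/bin/sh
# Hypothetical helper: print "<total> <unique>" Message-ID counts for an mbox file.
count_message_ids() {
    # Match the header name case-insensitively, keep the ID itself untouched.
    ids=$(awk 'tolower($1) == "message-id:" { print $2 }' "$1")
    total=$(printf '%s\n' "$ids" | grep -c .)
    unique=$(printf '%s\n' "$ids" | sort -u | grep -c .)
    echo "$total $unique"
}

# Tiny sample mailbox: the same message twice, differing only in X-Evolution.
sample=$(mktemp)
printf '%s\n' \
    'From a@example.com Mon Jan  1 00:00:00 2024' \
    'Message-ID: <1@example.com>' \
    'X-Evolution: 00000001-0000' \
    '' \
    'hello' \
    '' \
    'From a@example.com Mon Jan  1 00:00:00 2024' \
    'Message-ID: <1@example.com>' \
    'X-Evolution: 00000002-0000' \
    '' \
    'hello' \
    '' > "$sample"

result=$(count_message_ids "$sample")
echo "$result"   # the two numbers differ when duplicates are present
rm -f "$sample"
```

If the two numbers are equal after deduplication, no Message-ID appears more than once.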

asdmin
0

I didn't look at formail (part of procmail) in enough detail. It does have such an option, as mentioned in places like: http://hints.macworld.com/comment.php?mode=view&cid=115683 and http://us.generation-nt.com/answer/deleting-duplicate-mail-messages-help-172481881.html

JesseW
0

'formail -D' and 'reformail -D' can only process one email per execution, so each mail has to be split out of the mbox before it is checked. I use reformail from maildrop instead, since it's still in active development.

  1. Remove any old idcache, tmpmail and nmbox files.
  2. Run dedup.sh with the mbox file as its argument.
  3. nmbox is the output, with duplicate messages removed.

dedup.sh

#! /bin/sh
# $1 = mbox, thunderbird mailbox
# reformail -s splits the mbox and calls wmbox.sh once per mail.

reformail -s ./wmbox.sh < "$1"

wmbox.sh

#! /bin/sh
# stdin: a single email
# called by dedup.sh

TM=tmpmail
if [ -f "$TM" ] ; then
   echo "error: $TM already exists" >&2
   exit 1
fi
cat > "$TM"
# mbox format: each mail ends with a blank line
echo "" >> "$TM"

# reformail -D exits non-zero when the Message-ID is not yet in idcache,
# i.e. when this mail is not a duplicate
if ! reformail -D 99999999 idcache < "$TM"; then
   # only keep mails that actually have a Message-ID header
   if grep -q -i '^message-id:' "$TM"; then
      cat "$TM" >> nmbox
   fi
fi

rm "$TM"