
I have already decoded a lot of email attachment filenames in my code.

But this particular filename breaks my code.

Here is a minimal example:

from email.header import decode_header
encoded_filename='=?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?='
decoded_header=decode_header(encoded_filename) # --> [('SalesInvoiceQ1|\x04\xb5I\x95\xc1\xbd\xc9\xd0\xb9\xc1\x91\x98', 'utf-8')]
filename=str(decoded_header[0][0]).decode(decoded_header[0][1])

Exception:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 16: invalid start byte

Don't ask me how, but Thunderbird is able to decode this filename to: SalesInvoice-Report.pdf

How can I decode this with Python the way email clients apparently can?

Martijn Pieters
guettli

1 Answer


There are two Encoded-Word sections in that header. You'd have to detect where one ends and one begins:

>>> print decode_header(encoded_filename[:28])[0]
('SalesInvoice', 'utf-8')
>>> print decode_header(encoded_filename[28:])[0]
('-Report.pdf', 'utf-8')

Apparently that's what Thunderbird does in this case: it splits the string into =?charset?encoding?data?= chunks. Normally these should be separated by whitespace, typically a \r\n (CARRIAGE RETURN + LINE FEED) line fold, but in your case they are mashed together. If you re-introduce the \r\n separator the value decodes correctly:

>>> decode_header(encoded_filename[:28] + '\r\n' + encoded_filename[28:])[0]
('SalesInvoice-Report.pdf', 'utf-8')

You could use a regular expression to extract the parts and re-introduce the separator:

import re
from email.header import decode_header

quopri_entry = re.compile(r'=\?[\w-]+\?[QB]\?[^?]+?\?=')

def decode_multiple(encoded, _pattern=quopri_entry):
    fixed = '\r\n'.join(_pattern.findall(encoded))
    output = [b.decode(c) for b, c in decode_header(fixed)]
    return ''.join(output)

Demo:

>>> encoded_filename = '=?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?='
>>> decode_multiple(encoded_filename)
u'SalesInvoice-Report.pdf'
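
The snippets above are Python 2 (note the print statement and the u'' result). Here is a minimal Python 3 sketch of the same idea. It decodes each encoded word on its own rather than re-joining them with \r\n, so it does not depend on how a given Python version treats adjacent words. The names ENCODED_WORD and decode_adjacent_words are made up for this illustration:

```python
import re
from email.header import decode_header

# Matches one encoded word of the form =?charset?encoding?payload?=
# (hypothetical name; the encoding letter is matched case-insensitively).
ENCODED_WORD = re.compile(r'=\?[\w-]+\?[QB]\?[^?]+\?=', re.IGNORECASE)

def decode_adjacent_words(encoded):
    """Decode encoded words one at a time, even when nothing separates them."""
    parts = []
    for word in ENCODED_WORD.findall(encoded):
        for payload, charset in decode_header(word):
            # In Python 3, decode_header returns bytes for encoded words.
            if isinstance(payload, bytes):
                payload = payload.decode(charset or 'ascii')
            parts.append(payload)
    return ''.join(parts)

encoded_filename = '=?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?='
print(decode_adjacent_words(encoded_filename))  # SalesInvoice-Report.pdf
```

Like the regex approach above, this drops any plain text that sits between encoded words, so it only suits values made up entirely of encoded words.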

Of course, it could be that you have a bug in how you read the header in the first place. Make sure you don't accidentally destroy an existing \r\n separator when extracting the encoded_filename value.
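
To illustrate how such a bug could arise (a hypothetical scenario, not a diagnosis of your code), the Python 3 sketch below starts from a folded value as a well-behaved client might send it and applies a careless unfolding step that strips the line break together with the space after it. The result is exactly the mashed value from the question. The folded input and the to_text helper are assumptions made for the example:

```python
from email.header import decode_header

# Assumed original value: two encoded words separated by a line fold.
folded = '=?UTF-8?B?U2FsZXNJbnZvaWNl?=\r\n =?UTF-8?B?LVJlcG9ydC5wZGY=?='

def to_text(header_value):
    """Join the decoded parts of a header value into one string (hypothetical helper)."""
    return ''.join(
        part.decode(charset or 'ascii') if isinstance(part, bytes) else part
        for part, charset in decode_header(header_value)
    )

# Careless unfolding: removes the CRLF *and* the whitespace that follows it.
mashed = folded.replace('\r\n ', '')

print(to_text(folded))  # SalesInvoice-Report.pdf
print(mashed == '=?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?=')  # True
```

If your extraction code does something similar, keeping the whitespace between the words (or decoding before unfolding) would avoid the problem at the source.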

Martijn Pieters
  • BTW, we already parsed several thousand file names in mail headers. I guess the client creates the broken encoding. – guettli Mar 18 '15 at 10:54
  • @guettli: yes, if a client is producing those headers *without* a CRLF in between then that's a broken client. It could also be that it produces the value with a line separator (correct or somehow malformed) and that an MTA in between then 'repairs' the value, etc. Either way, you now have to deal with the fallout. – Martijn Pieters Mar 18 '15 at 10:59