How to get email.Header.decode_header to work with non-ASCII characters?

Question

I'm borrowing the following code to parse email headers, and additionally to add a header further down the line. Admittedly, I don't fully understand the reason for all the scaffolding around what should be straightforward usage of the email.Header module.

Noteworthy is that Header is not instantiated; rather, the module's decode_header function is called:

class DecodedHeader(object):
    def __init__(self, s, folder):
        self.msg=email.message_from_string(s[1])
        self.info=parseList(s[0])
        self.folder=folder

    def __getitem__(self,name):
        if name.lower()=='folder': return self.folder
        elif name.lower()=='uid': return self.info[1][3]
        elif name.lower()=='flags': return ','.join(self.info[1][1])
        elif name.lower()=='internal-date':
            ds= self.info[1][5]
            if Options.dateFormat:
                ds= time.strftime(Options.dateFormat,imaplib.Internaldate2tuple('INTERNALDATE "'+ds+'"'))
            return ds
        elif name.lower()=='size': return self.info[1][7]
        val= self.msg.__getitem__(name)
        if val==None: return None
        return self._convert(email.Header.decode_header(val),name)
    def get(self,key,default=None):
        return self.__getitem__(key)

    def _convert(self, list, name):
        l=[]
        for s, encoding in list:
            try:    
                if (encoding!=None):
                    s=unicode(s,encoding, 'replace').encode(Options.encoding,'replace')
            except Exception, e:
                print >>sys.stderr, "Encoding error", e
            l.append(s)

        res= "".join(l)
        if Options.addr and name.lower() in ('from','to', 'cc', 'return-path','reply-to' ): res=self._modifyAddr(res)
        if Options.dateFormat and name.lower() in ('date'): res = self._formatDate(res)
        return res

Here's the problem: When the header (val) contains non-ASCII characters such as Ä and ä, I get:

Traceback (most recent call last):
  File "v12.py", line 434, in <module>
    main()
  File "v12.py", line 396, in main
    writer.writerow(msg)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 152, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 149, in _dict_to_list
    return [rowdict.get(key, self.restval) for key in self.fieldnames]
  File "v12.py", line 198, in get
    return self.__getitem__(key)
  File "v12.py", line 196, in __getitem__
    return self._convert(email.Header.decode_header(val),name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/email/header.py", line 76, in decode_header
    header = str(header)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)

where u'\xe4' is ä.

I've tried a few things:

Adding # -*- coding: utf-8 -*- to the top of header.py
Calling unicode() on val before passing it to decode_header()
Calling .encode('utf-8') on val before passing it to decode_header()
Calling .encode('ISO-8859-1') on val before passing it to decode_header()

No joy with any of the above. What is the cause here? Given that I'm looking to maintain the usage of email.Header as above (with Header not instantiated directly), how do we ensure that non-ASCII characters get successfully decoded by decode_header?

It makes it a little offputting to try to help when the example you have provided is hard to run. You might want to see http://stackoverflow.com/help/mcve . I would approach it by leaping in and mucking around to experience the problem and try to fix it. But ... need a more MCVE than this to start from. — GreenAsJade, Jun 18 '15 at 06:22
@GreenAsJade Fair point. Yet to include the rest of the code as well as the other modules called, just to have it run, we're into hundreds of lines. I would have thought that a Python guru would be able to pinpoint the error on code review alone. Like I've said, I've tried various things (see above) and studied several other SO posts around email.Header and encoding/decoding, to no avail. — Pyderman, Jun 18 '15 at 06:39
@GreenAsJade Here is the link to the full code: http://old.zderadicka.eu/mailexp.py In order to reproduce the issue, you'd need to have (or create) an email that has at least one header containing a character like ä. — Pyderman, Jun 18 '15 at 07:01
The thing is that you're asking us to do work so you don't have to. You'll often get better help if you make it easier on the helpers by doing as much work as possible yourself. Often you will find that if you set yourself to coding up a simple example that causes the problem, you find the solution. And if you don't, then it at least makes it easy for us. Posting the link to the whole code is often no use, because it is not obvious how to run a big chunk of code. What you need is a small example of passing a unicode to decode_header, separate from the rest of your irrelevant code. — GreenAsJade, Jun 18 '15 at 09:07
I've poked around a bit, and I wonder if you are passing valid input to `decode_header()`. [This](https://docs.python.org/2/library/email.header.html#email.header.decode_header) shows decode_header being passed a valid header that contains the encoding information. You might want to print out what your `val`s are and check if they are even valid things to be passing to `decode_header()` — GreenAsJade, Jun 18 '15 at 09:31
This looks like an MCVE for this problem. It makes the assumption, as you are, that you can pass anything you like to `decode_header()`. `import email.Header; print email.Header.decode_header("a"); print email.Header.decode_header(u'\xe4')` — GreenAsJade, Jun 18 '15 at 09:33

score 1 · Accepted Answer · answered Jun 19 '15 at 03:15

The header has to be encoded correctly in order to be decoded. It looks like val comes from an already existing message, so maybe that message is bad. The error indicates that val is a Unicode string, but it should be a byte string at that point. The examples in the Python documentation for email.header are straightforward.
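A minimal sketch of that diagnosis, not the original poster's code: it checks whether a value is already-decoded text containing non-ASCII characters and, if so, skips decode_header, since such a value carries no RFC 2047 encoded words to decode. The names text_type, is_ascii_text, and decode_or_passthrough are hypothetical, and the block is written so that it should run on both Python 2.7 and Python 3.

```python
from email.header import decode_header

# `unicode` on Python 2, `str` on Python 3 (hypothetical helper name).
text_type = type(u'')


def is_ascii_text(value):
    """Return True if a text value contains only ASCII characters."""
    try:
        value.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True


def decode_or_passthrough(value):
    """Call decode_header only on values that can be valid wire headers.

    Text with raw non-ASCII characters is returned as a single
    (text, None) pair instead of being handed to decode_header.
    """
    if isinstance(value, text_type) and not is_ascii_text(value):
        return [(value, None)]
    return decode_header(value)


# A properly encoded header is decoded as usual.
pairs = decode_or_passthrough('=?iso-8859-1?q?M=E4rk?=')

# Raw non-ASCII text skips decode_header and comes back unchanged.
raw = decode_or_passthrough(u'\xe4')
```

On Python 2 the caller would presumably still need to encode the passed-through text (for instance with the script's Options.encoding) before handing it to csv, which is where the traceback in the question starts.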

The example below encodes two headers that don't even use the same encoding:

>>> import email.header
>>> h = email.header.Header(u'To: Märk'.encode('iso-8859-1'),'iso-8859-1')
>>> h.append(u'From: Jòhñ'.encode('utf8'),'utf8')
>>> h
<email.header.Header instance at 0x00559F58>
>>> s = h.encode()
>>> s
'=?iso-8859-1?q?To=3A_M=E4rk?= =?utf-8?b?RnJvbTogSsOyaMOx?='

Note that the correctly encoded header is a byte string with the encoding names embedded, and it uses no non-ASCII characters.

This decodes them:

>>> email.header.decode_header(s)
[('To: M\xe4rk', 'iso-8859-1'), ('From: J\xc3\xb2h\xc3\xb1', 'utf-8')]
>>> d = email.header.decode_header(s)
>>> for s,e in d:
...  print s.decode(e)
...
To: Märk
From: Jòhñ
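
If a single text value is wanted rather than one line per chunk, the decoded pairs can be joined. The helper below is a sketch under assumptions: header_to_text and its fallback parameter are hypothetical names, chunks without a charset are assumed to be ASCII, and an unknown charset name would raise LookupError. Whitespace separating two adjacent encoded words is dropped by decode_header, so the two chunks join without a space.

```python
from email.header import decode_header

# `unicode` on Python 2, `str` on Python 3 (hypothetical helper name).
text_type = type(u'')


def header_to_text(raw, fallback='ascii'):
    """Decode an RFC 2047 encoded header into one text string.

    `fallback` (an assumption of this sketch) is the codec used for
    chunks that decode_header reports without a charset.
    """
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, text_type):
            # Python 3 returns plain text when nothing was encoded.
            parts.append(chunk)
        else:
            # Byte chunks are decoded with their declared charset.
            parts.append(chunk.decode(charset or fallback, 'replace'))
    return u''.join(parts)


encoded = '=?iso-8859-1?q?To=3A_M=E4rk?= =?utf-8?b?RnJvbTogSsOyaMOx?='
text = header_to_text(encoded)
```

Passing 'replace' as the error handler mirrors the question's _convert method, which also substitutes undecodable bytes instead of raising.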
