0

A certain Python API returns u'J\xe4rvenp\xe4\xe4' for the Finnish word Järvenpää.

where \xe4 == ä

I am then calling email.header to add this field to a header to be printed.

email.header falls over when it tries to decode the umlaut:

  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/email/header.py", line 73, in decode_header
    header = str(header)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)

I've tried a couple of things:

  • Adding # -*- coding: utf-8 -*- to the top of header.py
  • Calling unicode() on the Finnish string before passing it to email.header
  • Calling .encode('utf-8') on the Finnish string before passing it to email.header

None of these has solved the problem. What am I doing wrong? I'd imagine that a solution won't involve modifying header.py (a core Python module).

Python version: 2.7.10

UPDATE:

Header() is not being instantiated directly. Rather, I'm calling the decode_header() function on the string:

email.Header.decode_header(theString)

It now seems that simply extending the call like this:

email.Header.decode_header(theString.encode('utf-8'))

solves the problem.

Pyderman
  • How are you using the `email` module to add the header? Please include your code in the question, ideally a [MCVE](http://stackoverflow.com/help/mcve). – Lukas Graf Jun 17 '15 at 13:30
  • @LukasGraf see my UPDATE. Would you trust this solution as a reliable one? Or would you suggest something different? – Pyderman Jun 17 '15 at 14:40
  • Wait - are you trying to *create* a header (sending email) or *parse* a header (reading email)? – Lukas Graf Jun 17 '15 at 14:55
  • `decode_header` is a helper function for turning an [RFC 2047](https://www.ietf.org/rfc/rfc2047.txt) *email header* into a Python string by decoding it into a list of `(decoded_string, charset)` tuples. Can you please update your question with a complete example for what you're trying to do (not just snippets)? – Lukas Graf Jun 17 '15 at 15:00
  • @LukasGraf a clearer and more thorough problem description here: http://stackoverflow.com/questions/30907708/how-to-get-email-header-decode-header-to-work-with-non-ascii-characters – Pyderman Jun 18 '15 at 06:15

2 Answers

2

In order to have the email.header module handle encoding for you and create a proper header, you have to create an instance of email.header.Header with your string and the charset it should be encoded in:

>>> h = Header(text, charset)

For example:

>>> t = u'J\xe4rvenp\xe4\xe4'
>>> print t
Järvenpää
>>> from email.header import Header
>>> h = Header(t, 'utf-8')
>>> h
<email.header.Header instance at 0x7fc2636e7950>
>>> print h
=?utf-8?b?SsOkcnZlbnDDpMOk?=
>>> h = Header(t, 'iso-8859-1')
>>> print h
=?iso-8859-1?q?J=E4rvenp=E4=E4?=
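As a quick sanity check, an encoded value like the ones above can be decoded again with decode_header, which returns (byte string, charset) pairs for encoded words. This is only a minimal sketch, written so that it should run on both Python 2.7 and Python 3; the variable names are illustrative and not part of the email API.

```python
from email.header import Header, decode_header

text = u'J\xe4rvenp\xe4\xe4'

# Header.encode() returns the RFC 2047 encoded form as a native string.
encoded = Header(text, 'utf-8').encode()
print(encoded)

# decode_header() yields (byte string, charset) pairs for encoded words;
# decoding the bytes with the reported charset recovers the original text.
raw, charset = decode_header(encoded)[0]
roundtrip = raw.decode(charset)
print(roundtrip == text)
```

Calling encode() explicitly rather than relying on print h keeps the sketch independent of the Python version, since str() on a Header behaves differently in Python 3.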

The string passed to Header can be either a unicode string or a byte string.

  • If you use a unicode string, the charset will only affect what encoding the header is encoded with.
  • If you use a byte string, the charset will both determine what encoding the byte string is assumed to be in, and what encoding will be used to encode the header. If the byte string you provide can't be decoded with that charset, an exception will be raised.
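The two cases can be sketched as follows. This is an illustrative example rather than a definitive recipe: it assumes the byte string really is ISO-8859-1, and names such as latin1_bytes and mismatch_failed are made up for the demonstration.

```python
from email.header import Header

text = u'J\xe4rvenp\xe4\xe4'

# Unicode input: the charset only selects how the header is encoded.
from_unicode = Header(text, 'iso-8859-1').encode()

# Byte-string input: the bytes are interpreted using the given charset,
# so they must actually be in that encoding.
latin1_bytes = text.encode('iso-8859-1')
from_bytes = Header(latin1_bytes, 'iso-8859-1').encode()
print(from_unicode == from_bytes)

# Bytes that are not valid in the declared charset raise an exception.
try:
    Header(latin1_bytes, 'utf-8')
    mismatch_failed = False
except UnicodeError:
    mismatch_failed = True
print(mismatch_failed)
```

Passing unicode strings where possible avoids the second failure mode, because no guess about the input encoding is needed.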
Lukas Graf
Klaus D.
-1

AFAIK, str() only deals with ASCII, which is why you get an error. If your string is already unicode, you should do header = unicode(header); if not, it should be decoded first.

#!/usr/bin/python
# -*- coding: utf-8 -*-

header = unicode("Järvenpää".decode('UTF-8'))
print header

Output

Järvenpää
Alex Ivanov
  • `"bytestring".decode(charset)` will already return a `unicode` instance - that additional `unicode()` call doesn't do anything at all. And if it were (when you do `unicode(bytestring)` for example), it will always try to implicitly decode `bytestring` with the system default encoding, which is `ascii` in Python 2.x - so it will fail for anything that isn't ASCII. So never use `unicode(bytestring)` please. – Lukas Graf Jun 17 '15 at 13:50
  • You wish. Normal encoding for Finnish is ISO-8859-15. That's why I gave an example. I tried that string in my terminal. – Alex Ivanov Jun 17 '15 at 14:00
  • There is no such thing as "normal encoding for Finnish" - if you're dealing with byte strings, you always have to know the encoding that was used. The problem with your answer is that you completely ignore the fact that the question is about *email headers*, and that you state things that are simply wrong. – Lukas Graf Jun 17 '15 at 14:03
  • Yeah, right. The problem with your answer is that you ignore the OP's code where "header = str(header)" – Alex Ivanov Jun 17 '15 at 14:08
  • That is not the OP's code, that's in the standard library's `email/header.py` module. – Lukas Graf Jun 17 '15 at 14:08
  • 1
    OK, OK. You are the smartest one. I put a minus to you anyway. – Alex Ivanov Jun 17 '15 at 14:10