Python's email header decoder (decode_header from email.header), in Python 2.7 and Python 3, seems to behave strangely when a header switches between encoded and unencoded text.

from email.header import decode_header
# print() with a single argument behaves the same on Python 2.7 and Python 3
print(decode_header("=?ISO-8859-1?B?QA==?=example.com"))
print(decode_header("=?ISO-8859-1?B?QA==?= example.com"))
print(decode_header("=?ISO-8859-1?Q?=40example?= .com"))
print(decode_header("=?ISO-8859-1?Q?=40example?=.com"))

Here is the result:

[('=?ISO-8859-1?B?QA==?=example.com', None)]
[('@', 'iso-8859-1'), ('example.com', None)]
[('@example', 'iso-8859-1'), ('.com', None)]
[('=?ISO-8859-1?Q?=40example?=.com', None)]

In every example input the encoded text contains an @ sign, and I expected it to be decoded in all four cases, but it is not. Python's interpretation of RFC 1342 looks incorrect to me: it expects a space or newline to mark the end of an encoded-word. I don't see that in the RFC. As I read it, the RFC only requires a space between multiple encoded-words, not between an encoded-word and unencoded text. So whenever "?=" appears, it should be treated as the end of the encoded text, which Python does not do. Is this a bug, or have I got this wrong?
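To make the behavior I expected concrete, here is a minimal sketch of a lenient decoder that treats every "?=" as the end of an encoded-word, with no whitespace requirement. The names ENCODED_WORD, _decode_match and lenient_decode are hypothetical, the pattern is my own assumption rather than anything taken from email.header, and the sketch targets Python 3 and does not drop the whitespace between adjacent encoded-words.

```python
import base64
import quopri
import re

# Assumed pattern: =?charset?encoding?payload?= with no requirement that
# whitespace follows the closing "?=".
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([bBqQ])\?([^?]*)\?=")


def _decode_match(match):
    """Decode one encoded-word match to text."""
    charset, encoding, payload = match.groups()
    if encoding.lower() == "b":
        raw = base64.b64decode(payload)
    else:
        # header=True makes quopri treat "_" as an encoded space,
        # which is how the "Q" encoding represents spaces.
        raw = quopri.decodestring(payload.encode("ascii"), header=True)
    return raw.decode(charset)


def lenient_decode(header):
    """Replace every encoded-word in `header` with its decoded text."""
    return ENCODED_WORD.sub(_decode_match, header)
```

With this sketch, all four inputs above have their @ sign decoded, whether or not whitespace follows the "?=".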

Vijay

  • Your problem is reproducible only in Python 2.7. I ran your code in 3.x and got correct results. Must be a bug in 2.7. As a side note, do not add semicolons at the end of lines, they are not necessary in either version of Python. – DYZ Sep 14 '18 at 17:28
  • Correction to my previous comment: Apparently, 2.7 produces correct results but 3.x does not. – DYZ Sep 14 '18 at 17:35

2 Answers

RFC 2047 defines 3 locations in which an 'encoded-word' may appear. It requires separating whitespace in almost all cases, even between an 'encoded-word' and unencoded text, and most of the cases where separating whitespace is not required appear to be errors. The text looks like this (without errata applied, and with formatting manually adjusted):

An 'encoded-word' may appear in a message header or body part header according to the following rules:

  1. An 'encoded-word' may replace a 'text' token (as defined by RFC 822) in any Subject or Comments header field, any extension message header field, or any MIME body part field for which the field body is defined as '*text'. An 'encoded-word' may also appear in any user-defined ("X-") message or body part header field.

    Ordinary ASCII text and 'encoded-word's may appear together in the same header field. However, an 'encoded-word' that appears in a header field defined as '*text' MUST be separated from any adjacent 'encoded-word' or 'text' by 'linear-white-space'.

  2. An 'encoded-word' may appear within a 'comment' delimited by "(" and ")", i.e., wherever a 'ctext' is allowed. More precisely, the RFC 822 ABNF definition for 'comment' is amended as follows:

     comment = "(" *(ctext / quoted-pair / comment / encoded-word) ")"
    

    A "Q"-encoded 'encoded-word' which appears in a 'comment' MUST NOT contain the characters "(", ")" or " 'encoded-word' that appears in a 'comment' MUST be separated from any adjacent 'encoded-word' or 'ctext' by 'linear-white-space'.

    It is important to note that 'comment's are only recognized inside "structured" field bodies. In fields whose bodies are defined as '*text', "(" and ")" are treated as ordinary characters rather than comment delimiters, and rule (1) of this section applies. (See RFC 822, sections 3.1.2 and 3.1.3)

  3. As a replacement for a 'word' entity within a 'phrase', for example, one that precedes an address in a From, To, or Cc header. The ABNF definition for 'phrase' from RFC 822 thus becomes:

     phrase = 1*( encoded-word / word )
    

    In this case the set of characters that may be used in a "Q"-encoded 'encoded-word' is restricted to: <upper and lower case ASCII letters, decimal digits, "!", "*", "+", "-", "/", "=", and "_" (underscore, ASCII 95.)>. An 'encoded-word' that appears within a 'phrase' MUST be separated from any adjacent 'word', 'text' or 'special' by 'linear-white-space'.
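As a rough illustration of the separation requirement quoted above, the following sketch finds only those encoded-words that have whitespace (or the start or end of the string) on both sides. The pattern and the names STRICT_ENCODED_WORD and strict_encoded_words are assumptions made for this example; this is not the regular expression that email.header uses.

```python
import re

# Assumed pattern: an encoded-word that is not directly adjacent to any
# non-whitespace character on either side.
STRICT_ENCODED_WORD = re.compile(
    r"(?<!\S)"            # preceded by whitespace or start of string
    r"=\?[^?\s]+"         # "=?" and the charset
    r"\?[bBqQ]"           # "?" and the encoding letter
    r"\?[^?\s]*"          # "?" and the encoded text
    r"\?="                # closing "?="
    r"(?!\S)"             # followed by whitespace or end of string
)


def strict_encoded_words(header):
    """Return the whitespace-delimited encoded-words found in `header`."""
    return STRICT_ENCODED_WORD.findall(header)
```

Under this reading, the question's first and fourth inputs contain no valid encoded-word, because "?=" is immediately followed by ordinary text.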

user2357112
  • And yet, Python 3.x "correctly" (though, admittedly, illegally) decodes headers without linear white space separators. – DYZ Sep 14 '18 at 17:33
  • Thanks for the detailed read and clarification. Some email user agents that do forwarding seem to encode just the "@" sign, and different webmail software shows different output. I think this is because, in these strange cases, Perl, PHP, and the different Python versions behave differently. It looks like the Python 2.7 behavior is close to normal! – Vijay Sep 14 '18 at 18:09

This is from page 6 of RFC 1342:

An encoded-word may be distinguished from an ordinary "word", "text", or "ctext", as follows: An encoded-word begins with "=?", ends with "?=", contains exactly four "?" characters including the delimiters, and is followed by a SPACE or newline. If the "word", "text", or "ctext" does not meet the above tests, it should be displayed as it appears in the message header.

So a space or newline is required after an encoded-word.

Examples of encoded headers from the same RFC:

   From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>
   To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>
   CC: =?ISO-8859-1?Q?Andr=E9_?= Pirard <PIRARD@vm1.ulg.ac.be>
   Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
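
The RFC 1342 test quoted above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and treating the end of the header like a trailing space or newline is my assumption, not something the quoted text states.

```python
def is_rfc1342_encoded_word(token):
    """Apply the RFC 1342 test to one whitespace-delimited token."""
    return (
        token.startswith("=?")
        and token.endswith("?=")
        and token.count("?") == 4  # exactly four "?" including delimiters
    )


def find_encoded_words(header):
    """Return the tokens in `header` that pass the RFC 1342 test."""
    # str.split() with no argument splits on runs of spaces, tabs and
    # newlines, so each token is followed by whitespace or by the end of
    # the header (the latter is accepted here as an assumption).
    return [tok for tok in header.split() if is_rfc1342_encoded_word(tok)]
```

For the CC example above this returns only the =?ISO-8859-1?Q?Andr=E9_?= token, while a header such as =?ISO-8859-1?B?QA==?=example.com yields nothing, because the token does not end with "?=".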
mx0
  • While that RFC is obsoleted, it does demonstrate the general intent for encoded-words to be followed by whitespace. The whitespace issue seems to have gotten a lot more complicated in later RFCs. – user2357112 Sep 14 '18 at 18:07