5

Here's an example email header:

header = """
From: Media Temple user (mt.kb.user@gmail.com)
Subject: article: A sample header
Date: January 25, 2011 3:30:58 PM PDT
To: user@example.com
Return-Path: <mt.kb.user@gmail.com>
Envelope-To: user@example.com
Delivery-Date: Tue, 25 Jan 2011 15:31:01 -0700
Received: from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user@gmail.com>) id 1KDoNH-0000f0-RL for user@example.com; Tue, 25 Jan 2011 15:31:01 -0700
Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=
Domainkey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=
Message-Id: <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"
X-Spam-Status: score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7
X-Spam-Level: ***
Message Body: **The email message body**
"""

The header is stored as a string. How do I parse it into a dictionary, with the header field names as the keys and the field contents as the values?

I want a dictionary like this:

header_dict = {
'From': 'Media Temple user (mt.kb.user@gmail.com)',
'Subject': 'article: A sample header',
'Date': 'January 25, 2011 3:30:58 PM PDT',
# ... and so on for the remaining fields
}

I made a list of the required fields:

header_reqd = ['From:','Subject:','Date:','To:','Return-Path:','Envelope-To:','Delivery-Date:','Received:','Dkim-Signature:','Domainkey-Signature:','Message-Id:','Mime-Version:','Content-Type:','X-Spam-Status:','X-Spam-Level:','Message Body:']

These list items can likely serve as the keys for the dictionary.

All Іѕ Vаиітy

4 Answers

6

Most of these answers seem to have overlooked the Python email parser, and their output is not quite correct because the values keep a leading space. The OP has also perhaps made a typo by including a leading newline in the header string, which needs to be stripped for the email parser to work.

from email.parser import HeaderParser
header = header.strip() # Fix incorrect formatting
email_message = HeaderParser().parsestr(header)
dict(email_message)

Output (truncated):

>>> from pprint import pprint
>>> pprint(dict(email_message))
{'Content-Type': 'multipart/alternative; '
                 'boundary="----=_Part_3927_12044027.1214951458678"',
 'Date': 'January 25, 2011 3:30:58 PM PDT',
 'Delivery-Date': 'Tue, 25 Jan 2011 15:31:01 -0700',
 ...
 'Subject': 'article: A sample header',
 'To': 'user@example.com',
 'X-Spam-Level': '***',
 'X-Spam-Status': 'score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, '
                  'HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}
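To make the leading-space difference concrete, here is a minimal sketch comparing a plain split with HeaderParser. The three-field `sample` string is a shortened stand-in for the header above, used only for illustration.

```python
from email.parser import HeaderParser

# Shortened sample header (an assumption for illustration, not the full header)
sample = """From: Media Temple user (mt.kb.user@gmail.com)
Subject: article: A sample header
To: user@example.com
"""

# Plain split: each line is split once on ":", so values keep their leading space
naive = dict(line.split(":", 1) for line in sample.splitlines())

# Parser: the whitespace after the colon is not part of the value
parsed = dict(HeaderParser().parsestr(sample))

print(repr(naive["Subject"]))   # ' article: A sample header'
print(repr(parsed["Subject"]))  # 'article: A sample header'
```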

Duplicate header keys

Be aware that email message headers can contain duplicate keys, as mentioned in the Python documentation for email.message:

Headers are stored and returned in case-preserving form, but field names are matched case-insensitively. Unlike a real dict, there is an ordering to the keys, and there can be duplicate keys. Additional methods are provided for working with headers that have duplicate keys.

For example, when the following email message is converted to a Python dict, only the first Received value is retained.

headers = HeaderParser().parsestr("""Received: by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)
Received: from mail-io0-f169.google.com (mail-io0-f169.google.com [209.85.223.169]) by mx0047p1mdw1.sendgrid.net (Postfix) with ESMTPS id AA9FFA817F2 for <example@example.comom>; Wed, 27 Jul 2016 20:53:06 +0000 (UTC)
Received: by mail-io0-f169.google.com with SMTP id b62so81593819iod.3 for <example@example.comom>; Wed, 27 Jul 2016 13:53:06 -0700 (PDT)""")

dict(headers)
{'Received': 'by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)'}

Use the get_all method to check for duplicates:

headers.get_all('Received')
['by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)', 'from mail-io0-f169.google.com (mail-io0-f169.google.com [209.85.223.169]) by mx0047p1mdw1.sendgrid.net (Postfix) with ESMTPS id AA9FFA817F2 for <example@example.comom>; Wed, 27 Jul 2016 20:53:06 +0000 (UTC)', 'by mail-io0-f169.google.com with SMTP id b62so81593819iod.3 for <example@example.comom>; Wed, 27 Jul 2016 13:53:06 -0700 (PDT)']
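If you still want a plain dict but do not want to lose repeated fields, one option is to map each field name to a list of values. This is only a sketch: the `raw` string below uses made-up placeholder values, and grouping with `defaultdict` is my own choice rather than anything the email package prescribes.

```python
from collections import defaultdict
from email.parser import HeaderParser

# Shortened sample with a repeated field (placeholder values for illustration)
raw = """Received: by mx.example.net with SMTP id AAA
Received: from mail.example.org by mx.example.net with ESMTPS id BBB
Subject: Hello
"""

message = HeaderParser().parsestr(raw)

# items() yields every (name, value) pair in order, including repeated names
all_headers = defaultdict(list)
for name, value in message.items():
    all_headers[name].append(value)

headers_dict = dict(all_headers)
print(headers_dict["Received"])  # a list holding both Received values
print(message["subject"])        # lookup on the message itself is case-insensitive
```

Note that the resulting dict is an ordinary dict, so its keys are case-sensitive, unlike lookups on the message object.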
Cas
1

You can split the string on newlines, then split each line once on ":"

>>> my_header = {}
>>> for x in header.strip().split("\n"):
...     x = x.split(":", 1)
...     my_header[x[0]] = x[1]
... 
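The loop above leaves a leading space on each value. A small variation, sketched here with a shortened two-field header as an assumed input, uses `str.partition` and strips both parts:

```python
# Shortened sample header (an assumption for illustration)
header = """
From: Media Temple user (mt.kb.user@gmail.com)
Date: January 25, 2011 3:30:58 PM PDT
"""

my_header = {}
for line in header.strip().split("\n"):
    # partition splits at the first ":" only, so colons in the value survive
    key, _, value = line.partition(":")
    my_header[key.strip()] = value.strip()

print(my_header["Date"])  # January 25, 2011 3:30:58 PM PDT
```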
Hackaholic
  • Will `'Date': 'January 25, 2011 3:30:58 PM PDT'` work with your code? After the split, `x[0]` is the key and `x[1]` is the value, so the result will be `'Date': ' January 25, 2011 3'` – Vivek Sable May 14 '15 at 14:50
  • @VivekSable I hadn't seen that date format; updated now :), thanks – Hackaholic May 14 '15 at 14:59
1

split will work for you:

Demo:

>>> result = {}
>>> for i in header.split("\n"):
...    i = i.strip()
...    if i :
...       k, v = i.split(":", 1)
...       result[k] = v

Output:

>>> import pprint
>>> pprint.pprint(result)
{'Content-Type': ' multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"',
 'Date': ' January 25, 2011 3:30:58 PM PDT',
 'Delivery-Date': ' Tue, 25 Jan 2011 15:31:01 -0700',
 'Dkim-Signature': ' v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=',
 'Domainkey-Signature': ' a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=',
 'Envelope-To': ' user@example.com',
 'From': ' Media Temple user (mt.kb.user@gmail.com)',
 'Message Body': ' **The email message body**',
 'Message-Id': ' <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752@mail.gmail.com>',
 'Mime-Version': ' 1.0',
 'Received': ' from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user@gmail.com>) id 1KDoNH-0000f0-RL for user@example.com; Tue, 25 Jan 2011 15:31:01 -0700',
 'Return-Path': ' <mt.kb.user@gmail.com>',
 'Subject': ' article: A sample header',
 'To': ' user@example.com',
 'X-Spam-Level': ' ***',
 'X-Spam-Status': ' score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}
Vivek Sable
  • You can use `header.splitlines()` and it will remove the newlines too. – Padraic Cunningham May 14 '15 at 14:59
  • @PadraicCunningham: Yes. It removes the last blank line but not the first, e.g. `>>> s = """\n1\n2\n3\n""" >>> s.splitlines() ['', '1', '2', '3'] >>> ` So it is best to strip before splitting. Correct? – Vivek Sable May 14 '15 at 15:04
  • The first newline is probably not actually there; it is just how the OP posted the input. `"""From: Media Temple user (mt.kb.user@gmail.com)` would be the actual start of the string. +1 anyway, you got the split correct – Padraic Cunningham May 14 '15 at 15:06
  • @PadraicCunningham: OK. Can you explain more about your code, or share a link? A generator object is created and then you create a dictionary from it. – Vivek Sable May 14 '15 at 15:10
  • Each line is split into a list, `Subject: article: A sample header -> ["Subject", " article: A sample header"]`. Try running `dict([["Subject", " article: A sample header"]])` from an interpreter and you will see what happens. In my code you simply have multiple sublists – Padraic Cunningham May 14 '15 at 15:12
  • @PadraicCunningham: yes. – Vivek Sable May 14 '15 at 15:16
1
header = """From: Media Temple user (mt.kb.user@gmail.com)
Subject: article: A sample header
Date: January 25, 2011 3:30:58 PM PDT
To: user@example.com
Return-Path: <mt.kb.user@gmail.com>
Envelope-To: user@example.com
Delivery-Date: Tue, 25 Jan 2011 15:31:01 -0700
Received: from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <mt.kb.user@gmail.com>) id 1KDoNH-0000f0-RL for user@example.com; Tue, 25 Jan 2011 15:31:01 -0700
Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=
Domainkey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=
Message-Id: <c8f49cec0807011530k11196ad4p7cb4b9420f2ae752@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"
X-Spam-Status: score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7
X-Spam-Level: ***
Message Body: **The email message body**
"""   

Split the header into individual lines, then split each line once on ":":

from pprint import pprint as pp
pp(dict(line.split(":",1) for line in header.splitlines()))

Output:

{'Content-Type': ' multipart/alternative; '
                 'boundary="----=_Part_3927_12044027.1214951458678"',
 'Date': ' January 25, 2011 3:30:58 PM PDT',
 'Delivery-Date': ' Tue, 25 Jan 2011 15:31:01 -0700',
 'Dkim-Signature': ' v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; '
                   's=gamma; '
                   'h=domainkey-signature:received:received:message-id:date:from:to '
                   ':subject:mime-version:content-type; '
                   'bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; '
                   'b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea '
                   'LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m '
                   'CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=',
 'Domainkey-Signature': ' a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; '
                        'h=message-id:date:from:to:subject:mime-version:content-type; '
                        'b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH '
                        '36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB '
                        '6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=',
 'Envelope-To': ' user@example.com',
 'From': ' Media Temple user (mt.kb.user@gmail.com)',
 'Message Body': ' **The email message body**',
 'Message-Id': ' '
               '<c8f49cec0807011530k11196ad4p7cb4b9420f2ae752@mail.gmail.com>',
 'Mime-Version': ' 1.0',
 'Received': ' from :po-out-1718.google.com ([72.14.252.155]:54907) by '
             'cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from '
             '<mt.kb.user@gmail.com>) id 1KDoNH-0000f0-RL for '
             'user@example.com; Tue, 25 Jan 2011 15:31:01 -0700',
 'Return-Path': ' <mt.kb.user@gmail.com>',
 'Subject': ' article: A sample header',
 'To': ' user@example.com',
 'X-Spam-Level': ' ***',
 'X-Spam-Status': ' score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, '
                  'HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}

line.split(":",1) makes sure we only split once on ":", so any ":" in the values is left alone. You end up with sublists that are key/value pairings, and calling dict on them creates the dictionary from each pairing.
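One caveat: dict() raises a ValueError if any line lacks a ":" (a blank line, for instance), because that sublist has only one element. A hedged sketch of a more forgiving variant follows; `headers_to_dict` is a hypothetical helper name, and the `sample` string with its blank lines is made up for illustration.

```python
def headers_to_dict(text):
    """Build a dict from 'Name: value' lines, ignoring lines without a colon."""
    pairs = (
        line.split(":", 1)
        for line in text.splitlines()
        if ":" in line  # skips blank lines and anything that is not a field
    )
    # Strip both parts so values do not keep the space after the colon
    return {key.strip(): value.strip() for key, value in pairs}


# Sample with a leading newline and a blank line in the middle (illustrative)
sample = """
From: Media Temple user (mt.kb.user@gmail.com)
Subject: article: A sample header

X-Spam-Level: ***
"""

result = headers_to_dict(sample)
print(result["Subject"])  # article: A sample header
```

Like the one-liner above, this keeps only the last value when a field name repeats.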

Padraic Cunningham