
Here is my code for extracting the header fields I want.
However, I don't think it is efficient, because it runs a separate search for each field, so the work grows with the number of fields.
That doesn't matter for small data, but I would like to know a better way.
Is there a way to extract all the fields at once, or otherwise more efficiently?
Sorry for my stupidity.

import re

data="""
Message-ID: <1608636066635.7f830.79689714@crcvmail15.nm>
Received: from 125.209.x.x (net58.219.x-x.host.lt-nn.net [91.219.x.x])
 by crcvmail15.google.com with ESMTP id +844Q-zuS122aEqk5CZDZg
 for <test@google.com>;
Received: from 125.209.x.x (net58.219.x-18.host.lt-nn.net [91.219.x.x])
 by crcvmail15.google.com with ESMTP id +844Q-zuS122aEqk5CZDZg
 for <test@google.com>;
 Tue, 22 Dec 2020 11:20:58 -0000
From: "test"<from@google.com>
To: test@google.com
Subject:example email
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
"""

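def searchHeader(field):
    # Both pieces are raw strings so that \W and \n reach the regex engine unchanged
    form = re.search(r'(' + field + r'\W+(.*?)\n)', data)
    if form:
        print(form.group())
    # Return the match so that `res` below is not always None
    return form

fields = ['From','To','Cc','Subject','Message-ID','Date','(Return-Path|Reply-To)']
for field in fields:
    res = searchHeader(field)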
python_user
user11230064
  • Why `'(Return-Path|Reply-To)'` and not `'Return-Path', 'Reply-To'`? It is not a good idea to define regex and plain text inside a single list. – Wiktor Stribiżew Dec 30 '20 at 15:13
  • Try declaring `fields = ['From','To','Cc','Subject','Message-ID','Date','Return-Path', 'Reply-To']` and then using `dict(re.findall(fr'\b({"|".join(map(re.escape, fields))})\W+(.*)',data))`, see https://ideone.com/8Aa2Zs. If the matches are always at the start of the line, use `^` instead of `\b`, and add `re.M` flag, see https://ideone.com/piVaBs – Wiktor Stribiżew Dec 30 '20 at 15:15
  • Thank you for your reply. I didn't know that... – user11230064 Dec 30 '20 at 15:17
  • Or, do you want to parse all, even unknown field and values? – Wiktor Stribiżew Dec 30 '20 at 15:20
  • Honestly, I don't mind either way, but I didn't think all fields would follow the same pattern. That is why I defined the fields. – user11230064 Dec 30 '20 at 15:26
  • There are Python modules that parse emails. Why not use one of them over regex? Here is a recent post on parsing emails: https://stackoverflow.com/questions/65164218/parse-emails-body-in-python – Life is complex Dec 30 '20 at 15:39
  • @Lifeiscomplex I am delightfully surprised at how well my regex works against the email in the post you linked :-) https://regex101.com/r/IkNlAE/1 – MonkeyZeus Dec 30 '20 at 16:00
  • @MonkeyZeus Yes, the regex worked on regex101 against the message in that other post. When I tested it locally using re.search, it didn't work correctly. `re.search(r"(?P<field>^[\w-]+): *(?P<value>[\s\S]+?)(?=^[\w-]+: *|\Z)", data, re.MULTILINE)` – Life is complex Dec 30 '20 at 16:26
  • @Lifeiscomplex Sorry, I've never done Python programming. The website has a "code generator" button which takes you to https://regex101.com/r/IkNlAE/1/codegen?language=python so you can figure out what went wrong. – MonkeyZeus Dec 30 '20 at 16:28
  • @MonkeyZeus thanks, I didn't know regex101 had a "code generator." I will save your regex in my own bag-of-tricks. – Life is complex Dec 30 '20 at 16:34
  • @Lifeiscomplex I appreciate the compliment :-) but my bag of tricks is child's play compared to the stuff I've seen [Wiktor Stribiżew](https://stackoverflow.com/users/3832970) put out. – MonkeyZeus Dec 30 '20 at 16:36
  • @MonkeyZeus thanks again for pointing me toward Wiktor's account. – Life is complex Dec 30 '20 at 16:49
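
The two suggestions from the comments (one alternation built from the field list, and a dedicated email parser) can be sketched roughly as follows. This is only an illustration under some assumptions: `extract_with_regex` and `extract_with_email` are hypothetical helper names, the sample is a shortened copy of the question's `data` without its leading newline, and the pattern uses `:[ \t]*` rather than the `\W+` from the comment so that the opening quote in the `From` value is kept.

```python
import re
from email import message_from_string

# Shortened copy of the question's sample, with no leading blank line
data = (
    'Message-ID: <1608636066635.7f830.79689714@crcvmail15.nm>\n'
    'Received: from 125.209.x.x (net58.219.x-x.host.lt-nn.net [91.219.x.x])\n'
    ' by crcvmail15.google.com with ESMTP id +844Q-zuS122aEqk5CZDZg\n'
    ' for <test@google.com>;\n'
    ' Tue, 22 Dec 2020 11:20:58 -0000\n'
    'From: "test"<from@google.com>\n'
    'To: test@google.com\n'
    'Subject:example email\n'
    'Content-Type: text/html; charset="utf-8"\n'
)

fields = ['From', 'To', 'Cc', 'Subject', 'Message-ID', 'Date',
          'Return-Path', 'Reply-To']


def extract_with_regex(text, names):
    """One pass over the text: a single alternation of escaped field names."""
    pattern = fr'^({"|".join(map(re.escape, names))}):[ \t]*(.*)'
    # re.M lets ^ match at the start of every line; findall returns
    # (name, value) pairs, which dict() turns into a mapping
    return dict(re.findall(pattern, text, flags=re.M))


def extract_with_email(text, names):
    """Let the standard-library email parser handle the header syntax."""
    msg = message_from_string(text)
    # msg[name] is None when the header is absent
    return {name: msg[name] for name in names if msg[name] is not None}


if __name__ == '__main__':
    print(extract_with_regex(data, fields))
    print(extract_with_email(data, fields))
```

Both helpers return only the fields that are actually present, so `Cc`, `Date`, `Return-Path` and `Reply-To` are missing from the result for this sample. Note that the question's `data` begins with a newline; the email parser would read that blank line as the end of the headers, so it is probably worth calling `data.lstrip()` before parsing.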

1 Answer


Depending on your definition of "effective", you can make use of named capture groups:

(?P<field>^[\w-]+): *(?P<value>[\s\S]+?)(?=^[\w-]+: *|\Z)
  • `(?P<field>^[\w-]+)` - name a capture group "field" and capture a run of `\w` characters or dashes starting at the beginning of a line. This needs the multiline flag so that `^` matches at every line start.
  • `: *` - match a colon followed by optional spaces (these are not captured).
  • `(?P<value>[\s\S]+?)` - name a capture group "value" and capture everything (including newlines), as few characters as possible. If you enable the dotall modifier then `.+?` could be used in place of `[\s\S]+?`. This ensures we capture the multiline values which can be found after `Received:`.
  • `(?=^[\w-]+: *|\Z)` - a lookahead that keeps extending the "value" until we hit a new "field" or the end of the string.
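
For use from Python, the pattern above could be wired up along these lines. This is a minimal sketch, not a full header parser: `parse_headers` is a hypothetical helper name, the sample is a trimmed version of the question's data, and collapsing whitespace in folded values is my own choice for readability.

```python
import re

# The pattern from the answer; re.MULTILINE makes ^ match at each line start
HEADER_RE = re.compile(
    r'(?P<field>^[\w-]+): *(?P<value>[\s\S]+?)(?=^[\w-]+: *|\Z)',
    re.MULTILINE,
)

# Trimmed version of the question's sample, with two Received headers
data = (
    'Received: from 125.209.x.x\n'
    ' by crcvmail15.google.com with ESMTP id +844Q-zuS122aEqk5CZDZg\n'
    ' for <test@google.com>;\n'
    'Received: from 125.209.x.x\n'
    ' by crcvmail15.google.com;\n'
    ' Tue, 22 Dec 2020 11:20:58 -0000\n'
    'From: "test"<from@google.com>\n'
    'To: test@google.com\n'
    'Subject:example email\n'
)


def parse_headers(text):
    """Collect every header in one pass; repeated fields go into a list."""
    headers = {}
    for match in HEADER_RE.finditer(text):
        # Folded (multi-line) values are joined into one line by
        # collapsing every run of whitespace into a single space
        value = ' '.join(match.group('value').split())
        headers.setdefault(match.group('field'), []).append(value)
    return headers


if __name__ == '__main__':
    for name, values in parse_headers(data).items():
        print(name, values)
```

Each value is a list because a field such as `Received` can appear more than once; a plain `dict` of strings would keep only the last occurrence.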

https://regex101.com/r/rBBRfM/1

You can see performance stats in the upper right at regex101.

MonkeyZeus
  • In Python, the [very end of the string is matched with `\Z`](https://stackoverflow.com/a/53283192/3832970). `$(?![\r\n])` is a workaround for text editors that have no `\z`/`\Z` support. – Wiktor Stribiżew Dec 30 '20 at 15:24
  • @WiktorStribiżew Thank you! That definitely cleans up my regex a bit. – MonkeyZeus Dec 30 '20 at 15:25
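
The difference between `$` and `\Z` mentioned in the comment above can be checked with a small experiment. This is just a sketch; the `text` value is a made-up one-line sample ending in a newline.

```python
import re

text = 'Subject: hello\n'

# Without re.MULTILINE, `$` matches at the very end of the string and
# also just before a trailing newline
dollar_positions = [m.start() for m in re.finditer(r'$', text)]

# `\Z` matches only at the very end of the string
z_positions = [m.start() for m in re.finditer(r'\Z', text)]

# So `hello$` finds a match despite the trailing newline, `hello\Z` does not
dollar_match = re.search(r'hello$', text)
z_match = re.search(r'hello\Z', text)

if __name__ == '__main__':
    print(dollar_positions, z_positions)
    print(dollar_match is not None, z_match is not None)
```

That is why `\Z` is the safer choice in the lookahead when "end of the string" has to mean the real end, and not the position before a final line break.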