How to parse data effectively with Python

Question

Here is my code for extracting the fields I want.
However, I don't think it is efficient, because it runs one search per field, so the work grows with the number of fields.
That hardly matters for small data, but I would like to know a better way.
Ideally, I want to extract all the fields at once, or at least more efficiently.
Sorry for my stupidity.

import re

data="""
Message-ID: <1608636066635.7f830.79689714@crcvmail15.nm>
Received: from 125.209.x.x (net58.219.x-x.host.lt-nn.net [91.219.x.x])
 by crcvmail15.google.com with ESMTP id +844Q-zuS122aEqk5CZDZg
 for <test@google.com>;
Received: from 125.209.x.x (net58.219.x-18.host.lt-nn.net [91.219.x.x])
 by crcvmail15.google.com with ESMTP id +844Q-zuS122aEqk5CZDZg
 for <test@google.com>;
 Tue, 22 Dec 2020 11:20:58 -0000
From: "test"<from@google.com>
To: test@google.com
Subject:example email
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
"""

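def searchHeader(field):
    # Raw strings on both sides keep \W a regex escape rather than a string escape.
    form = re.search(r'(' + field + r'\W+(.*?)\n)', data)
    if form:
        print(form.group())

# Each entry is used as a regex fragment, which is why alternation works in the last one.
fields = ['From','To','Cc','Subject','Message-ID','Date','(Return-Path|Reply-To)']
for field in fields:
    searchHeader(field)  # prints the first match, if any; nothing is returned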

Why `'(Return-Path|Reply-To)'` and not `'Return-Path', 'Reply-To'`? It is not a good idea to define regex and plain text inside a single list. — Wiktor Stribiżew, Dec 30 '20 at 15:13
Try declaring `fields = ['From','To','Cc','Subject','Message-ID','Date','Return-Path', 'Reply-To']` and then using `dict(re.findall(fr'\b({"|".join(map(re.escape, fields))})\W+(.*)',data))`, see https://ideone.com/8Aa2Zs. If the matches are always at the start of the line, use `^` instead of `\b`, and add `re.M` flag, see https://ideone.com/piVaBs — Wiktor Stribiżew, Dec 30 '20 at 15:15
Or, do you want to parse all, even unknown field and values? — Wiktor Stribiżew, Dec 30 '20 at 15:20
Honestly, I don't mind either way, but I didn't think all fields would follow the same pattern. That is why I defined the list of fields. – user11230064, Dec 30 '20 at 15:26
There are Python modules that parse emails. Why not use one of them over regex? Here is a recent post on parsing emails: https://stackoverflow.com/questions/65164218/parse-emails-body-in-python — Life is complex, Dec 30 '20 at 15:39
@Lifeiscomplex I am delightfully surprised at how well my regex works against the email in the post you linked :-) https://regex101.com/r/IkNlAE/1 — MonkeyZeus, Dec 30 '20 at 16:00
@MonkeyZeus Yes, the regex worked on regex101 against the message in that other post. When I tested it locally using re.search, it didn't work correctly: `re.search(r"(?P<field>^[\w-]+): *(?P<value>[\s\S]+?)(?=^[\w-]+: *|\Z)", data, re.MULTILINE)` – Life is complex, Dec 30 '20 at 16:26
@Lifeiscomplex Sorry, I've never done Python programming. The website has a "code generator" button which takes you to https://regex101.com/r/IkNlAE/1/codegen?language=python so you can figure out what went wrong. — MonkeyZeus, Dec 30 '20 at 16:28
@MonkeyZeus thanks, I didn't know regex101 had a "code generator." I will save your regex in my own bag-of-tricks. — Life is complex, Dec 30 '20 at 16:34
@Lifeiscomplex I appreciate the compliment :-) but my bag of tricks are child's play compared to the stuff I've seen [Wiktor Stribiżew](https://stackoverflow.com/users/3832970) put out. — MonkeyZeus, Dec 30 '20 at 16:36
@MonkeyZeus thanks again for pointing me toward Wiktor's account. — Life is complex, Dec 30 '20 at 16:49
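
As a rough sketch of the suggestion above to use an email-parsing module rather than a regex, the standard-library `email` package can read these headers directly. The names `raw`, `wanted`, `found` and `received` are illustrative and do not come from the thread. The sample text is a shortened copy of the question's data, and the leading newline is stripped on the assumption that the real input starts with a header line; otherwise the parser treats the leading blank line as the end of the headers.

```python
from email import message_from_string

# Shortened copy of the question's data. The leading newline is stripped
# because a blank first line would be read as "no headers, body follows".
raw = """
Message-ID: <1608636066635.7f830.79689714@crcvmail15.nm>
Received: from 125.209.x.x (net58.219.x-x.host.lt-nn.net [91.219.x.x])
 by crcvmail15.google.com with ESMTP id +844Q-zuS122aEqk5CZDZg
 for <test@google.com>;
Received: from 125.209.x.x (net58.219.x-18.host.lt-nn.net [91.219.x.x])
 by crcvmail15.google.com with ESMTP id +844Q-zuS122aEqk5CZDZg
 for <test@google.com>;
 Tue, 22 Dec 2020 11:20:58 -0000
From: "test"<from@google.com>
To: test@google.com
Subject:example email
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
""".lstrip()

msg = message_from_string(raw)

# Plain header names only; lookups are case-insensitive.
wanted = ['From', 'To', 'Cc', 'Subject', 'Message-ID', 'Date',
          'Return-Path', 'Reply-To']

# msg[name] returns None for a missing header, so absent fields are skipped.
found = {name: msg[name] for name in wanted if msg[name] is not None}

# Headers that may repeat, such as Received, are better read with get_all().
received = msg.get_all('Received', [])

for name, value in found.items():
    print(f"{name}: {value}")
print(f"{len(received)} Received header(s)")
```

This makes a single pass over the text regardless of how many fields are requested, and folded (multi-line) header values are handled by the parser rather than by the pattern.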

MonkeyZeus · Accepted Answer · 2020-12-30T15:30:02.007

2

Depending on your definition of "effective", you can make use of named capture groups:

(?P<field>^[\w-]+): *(?P<value>[\s\S]+?)(?=^[\w-]+: *|\Z)

(?P<field>^[\w-]+) - name a capture group "field" and capture a run of \w characters or - dashes starting at the beginning of a line. The multiline flag must be enabled so that ^ matches at the start of every line.
: * - match (but do not capture) a colon followed by optional spaces.
(?P<value>[\s\S]+?) - name a capture group "value" and lazily capture everything (including newlines). If you enable the dotall modifier then .+? could be used in place of [\s\S]+?. This ensures we capture the multiline values which can be found after Received:.
(?=^[\w-]+: *|\Z) - a lookahead that lets the "value" keep growing until we hit a new "field" or the end of the string.
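
A minimal sketch of how the pattern above might be used from Python, assuming the input looks like the headers in the question. The names `HEADER_RE` and `parse_headers` are hypothetical helpers, not part of the original answer. Values are collected into lists because a field such as Received can appear more than once.

```python
import re

# Shortened copy of the headers from the question.
data = """
Message-ID: <1608636066635.7f830.79689714@crcvmail15.nm>
Received: from 125.209.x.x (net58.219.x-x.host.lt-nn.net [91.219.x.x])
 by crcvmail15.google.com with ESMTP id +844Q-zuS122aEqk5CZDZg
 for <test@google.com>;
Received: from 125.209.x.x (net58.219.x-18.host.lt-nn.net [91.219.x.x])
 by crcvmail15.google.com with ESMTP id +844Q-zuS122aEqk5CZDZg
 for <test@google.com>;
 Tue, 22 Dec 2020 11:20:58 -0000
From: "test"<from@google.com>
To: test@google.com
Subject:example email
"""

# re.MULTILINE is required so that ^ matches at the start of each line.
HEADER_RE = re.compile(
    r"(?P<field>^[\w-]+): *(?P<value>[\s\S]+?)(?=^[\w-]+: *|\Z)",
    re.MULTILINE,
)


def parse_headers(text):
    """Return {field: [value, ...]} for every header found in one pass."""
    headers = {}
    for match in HEADER_RE.finditer(text):
        # strip() drops the trailing newline kept by the lazy [\s\S]+? group.
        value = match.group("value").strip()
        headers.setdefault(match.group("field"), []).append(value)
    return headers


headers = parse_headers(data)
for field in ["From", "To", "Subject", "Message-ID"]:
    print(field, "->", headers.get(field))
```

Unlike the loop in the question, this scans the text once and then looks fields up in a dictionary, so adding more fields does not add more searches.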

https://regex101.com/r/rBBRfM/1

You can see performance stats in the upper right at regex101.

edited Dec 30 '20 at 15:30

answered Dec 30 '20 at 15:16

In Python, the [very end of the string is matched with `\Z`](https://stackoverflow.com/a/53283192/3832970). `$(?![\r\n])` is a workaround for text editors that have no `\z`/`\Z` support. – Wiktor Stribiżew Dec 30 '20 at 15:24
@WiktorStribiżew Thank you! That definitely cleans up my regex a bit. – MonkeyZeus Dec 30 '20 at 15:25
