REGEX extracting specific part non greedy

Question

I'm new to Python 2.7. I'm trying to use regular expressions to extract just the email addresses from the lines of a text file. I am using a non-greedy quantifier because each email address appears twice on the same line. Here is my code:

import re
f_hand = open('mail.txt')
for line in f_hand:
    line.rstrip()
    if re.findall('\S+@\S+?',line): print re.findall('\S+@\S+?',line)

However, this is what I'm getting instead of just the email address:

['href="mailto:secretary@abc-mediaent.com">sercetary@a']

What should I use in re.findall to get just the email address?

Don't try to parse HTML with regular expressions. Use a HTML parser. — Daniel, Sep 23 '16 at 16:39
It would help to see an example of the text you're trying to parse, and what the expected output is. — Brendan Abel, Sep 23 '16 at 16:41

saurabh baid · Answer 1 · 2016-09-23T17:02:13.070

1

Try this: re.findall(r'mailto:(\S+@\S+?\.\S+)"', line)

It should give you something like ['secretary@abc-mediaent.com']

edited Sep 23 '16 at 17:02

answered Sep 23 '16 at 16:48


hi Saurabh! this is what i'm getting now: ['mailto: email@email.com" '] how can i remove the mailto and the " sign? – PIMg021 Sep 23 '16 at 16:55

score 1 · Answer 2 · answered Sep 23 '16 at 16:51


\S matches any character that is not whitespace. " and > are not whitespace, so \S+ matches them too.

You should use mailto:([^@]+@[^"]+) as the regex (quoted form: 'mailto:([^@]+@[^"]+)'). This will put the email address in the first capture group.
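
To make the difference concrete, here is a minimal sketch comparing the pattern from the question with the one suggested in this answer. The sample line is an assumption modelled on the output shown in the question, not the asker's actual file, and print is called with parentheses so the snippet also runs on Python 3.

```python
import re

# Sample line modelled on the question's output; the real file contents are an assumption.
line = '<a href="mailto:secretary@abc-mediaent.com">secretary@abc-mediaent.com</a>'

# Pattern from the question: the greedy \S+ runs up to the last "@" it can use,
# taking href=" and "> with it, while the lazy \S+? stops after one character.
original = re.findall(r'\S+@\S+?', line)
print(original)

# Pattern from this answer: the negated classes cannot cross "@" or a double quote,
# so the capture group stays inside the href attribute value.
emails = re.findall(r'mailto:([^@]+@[^"]+)', line)
print(emails)
```

Because the pattern contains one capture group, re.findall returns the group contents rather than the whole match, which is why the mailto: prefix is not part of the result.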


Laurel


Casimir et Hippolyte · Accepted Answer · 2016-09-23T16:58:53.117

If you parse a simple file with anchors for email addresses and always the same syntax (like double quotes to enclose attributes), you can use:

for line in f_hand: 
    print re.findall(r'href="mailto:([^"@]+@[^"]+)">\1</a>', line)

(re.findall returns only the capture group. \1 stands for the content of the first capture group.)
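
As a rough illustration of what the backreference implies, the sketch below runs the same pattern on two made-up lines (both are assumptions, not taken from the asker's file). The pattern only matches when the link text repeats the address from the href exactly.

```python
import re

pattern = re.compile(r'href="mailto:([^"@]+@[^"]+)">\1</a>')

# Hypothetical inputs: one where the link text equals the address, one where it does not.
same = '<a href="mailto:secretary@abc-mediaent.com">secretary@abc-mediaent.com</a>'
different = '<a href="mailto:secretary@abc-mediaent.com">Contact us</a>'

matched = pattern.findall(same)
unmatched = pattern.findall(different)

print(matched)    # the address from the href
print(unmatched)  # empty list, because \1 does not match "Contact us"
```

If the link text can differ from the address, dropping the `>\1</a>` tail and matching only the href attribute is the more tolerant choice.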

If the file is a more complicated HTML file, use a parser, extract the links and filter them.
Alternatively, use XPath, something like:
substring-after(//a/@href[starts-with(., "mailto:")], "mailto:")
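
The parser route could look something like the sketch below, which uses only the standard library. It assumes Python 3, where the module is html.parser (on Python 2.7 the equivalent module is named HTMLParser). The class name MailtoCollector and the sample markup are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Hypothetical helper: collects addresses from <a href="mailto:..."> links."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value and value.startswith('mailto:'):
                self.emails.append(value[len('mailto:'):])


collector = MailtoCollector()
collector.feed('<p><a href="mailto:secretary@abc-mediaent.com">Secretary</a></p>')
print(collector.emails)
```

This reads the address from the href attribute only, so it does not depend on the link text or on how the attributes are quoted.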

Glen Ragan · Answer 4 · 2016-10-07T15:29:49.893

\S accepts many characters that aren't valid in an e-mail address. Try a regular expression of

[a-zA-Z0-9_.-]+@[a-zA-Z0-9_.-]+\.[a-zA-Z0-9_.-]+

(presuming you are not trying to support Unicode -- it seems that you aren't since your input is a "text file").

This will require a "." in the server portion of the e-mail address, and your match will stop on the first character that is not valid within the e-mail address.
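
A minimal sketch of this approach follows. It assumes the pattern is written as a raw string with a single backslash before the dot and with the hyphen placed last in each character class so that it is clearly a literal. The sample line is made up to resemble the question's input.

```python
import re

# Raw string, so \. reaches the regex engine as an escaped literal dot.
email_re = re.compile(r'[a-zA-Z0-9_.-]+@[a-zA-Z0-9_.-]+\.[a-zA-Z0-9_.-]+')

# Hypothetical line in which the address appears twice, as described in the question.
line = '<a href="mailto:secretary@abc-mediaent.com">secretary@abc-mediaent.com</a>'

found = email_re.findall(line)

# Each address occurs once in the href and once in the link text, so deduplicate.
unique = sorted(set(found))
print(unique)
```

Since the pattern does not rely on the surrounding markup, both occurrences are matched, and the set removes the repeat.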

score 0 · Answer 5 · edited Oct 07 '21 at 13:21

This is the format of an email address - https://www.rfc-editor.org/rfc/rfc5322#section-3.4.1.

Keeping that in mind the regex that you need is - r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)". (This works without having to depend on the text surrounding an email address.)

The following lines of code -

html_str = r'<a href="mailto:sachin.gokhale@indiacast.com">sachin.gokhale@indiacast.com</a>'
email_regex = r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)"
print re.findall(email_regex, html_str)

yields -

['sachin.gokhale@indiacast.com', 'sachin.gokhale@indiacast.com']

P.S. - I got the regex for email addresses by googling for "email address regex" and clicking on the first site - http://emailregex.com/
