What should I use in a bash script to extract email addresses from noisy lines in a file?

Question

I have a file that has one email address per line. Some of them are noisy, i.e. contain junk characters before and/or after the address, e.g.

name.lastname@bar.com&amp;lt;mailto
&amp;lt;someone@foo.bar.baz.edu&amp;gt;
&amp;amp;lt;someone@foo.com&amp;amp;gt;Mobile
&amp;lt;nobody@nowere.com&amp;gt;
&amp;lt;ab@cd.com
no@noise.com

How can I extract the right address from each line of the file in a loop like this?

for l in `cat file_of_email_addresses`
do
     # do magic here to extract address from $l
done

It looks like garbage before the address always ends with lt;, and garbage after the address always starts with &amp.

First things first: [Don't read lines with `for`](http://mywiki.wooledge.org/DontReadLinesWithFor). Then those lines are URL encoded (twice) so you should probably un-encode them. That'll get you saner output which you may be able to deal with more easily. But ultimately you need to come up with a way to figure out what part of each line is the information you care about. — Etan Reisner, Sep 25 '15 at 17:53
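As a rough illustration of the comment's first point, the loop from the question could be rewritten with `while read` and bash's built-in regex matching. This is only a sketch: the helper name `extract_email` and the character class used for an "address" are assumptions, not something from the question, and no entity decoding is attempted.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print the first address-like token found in its argument.
# The pattern is a loose approximation of an email address, not a validator.
extract_email() {
    local re='[[:alnum:]_.-]+@[[:alnum:]_.-]+'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[0]}"
    fi
}

# Read the file line by line instead of word-splitting the output of `cat`.
if [[ -r file_of_email_addresses ]]; then
    while IFS= read -r line; do
        extract_email "$line"
    done < file_of_email_addresses
fi
```

Because `&` and `;` are not in the character class, the entity debris on either side of the address falls outside the match.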

Cyrus · Accepted Answer · 2015-09-25T18:12:41.880

1

Try this with GNU grep:

grep -Po '[\w.-]+@[\w.-]+' file

Output:

name.lastname@bar.com
someone@foo.bar.baz.edu
someone@foo.com
nobody@nowere.com
ab@cd.com
no@noise.com

It's not perfect but perhaps it is sufficient for your task.
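If `-P` is not available (it needs a grep built with PCRE support), a roughly equivalent pattern can be written with `-E` and POSIX character classes. This is a sketch, not part of the original answer: the wrapper name `extract_emails` is made up for illustration, and `-o` is assumed to be supported, as it is in GNU and BSD grep.

```bash
# Hypothetical wrapper: print every address-like token, one per line.
# [[:alnum:]_] stands in for PCRE's \w; file names (or stdin) are passed through.
extract_emails() {
    grep -Eo '[[:alnum:]_.-]+@[[:alnum:]_.-]+' "$@"
}

# Example: extract_emails file
```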

edited Sep 25 '15 at 18:12

answered Sep 25 '15 at 18:07


score 0 · Answer 2 · answered Sep 25 '15 at 18:10

It would be better to use a tool that's built for pattern matching, such as sed. It would help to first decode the data, as Etan suggested, but if you're willing to assume

that the leading segments you want to remove will always end with a ;,
that the trailing segments you want to remove will always begin with an &,
that the desired addresses will not contain either of those characters, and
that every line will contain exactly one @, and that it is in the address,

then you can do this:

sed 's/^\([^@]*;\)\?\([^&;]*@[^&;]*\).*/\2/' file_of_email_addresses
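Note that `\?` in a basic regular expression is a GNU extension. Under the same four assumptions, the command can be sketched with extended regular expressions (`-E`, accepted by both GNU and BSD sed). The function name `strip_noise` is hypothetical and only there to make the example easy to call.

```bash
# Hypothetical wrapper around the same substitution, written as an ERE.
# Group 1 (optional): leading junk up to the last ';' before the '@'.
# Group 2: the address, i.e. the run around '@' containing no '&' or ';'.
strip_noise() {
    sed -E 's/^([^@]*;)?([^&;]*@[^&;]*).*/\2/' "$@"
}

# Example: strip_noise file_of_email_addresses
```

Lines that contain no @ do not match the pattern and are printed unchanged, so they may need to be filtered out separately.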
