
I have a file that has one email address per line. Some of them are noisy, i.e. contain junk characters before and/or after the address, e.g.

name.lastname@bar.com<mailto
<someone@foo.bar.baz.edu>
<someone@foo.com>Mobile
<nobody@nowere.com>
<ab@cd.com
no@noise.com

How can I extract the right address from each line of the file in a loop like this?

for l in `cat file_of_email_addresses`
do
     # do magic here to extract address from $l
done

It looks like any garbage before the address always ends with `lt;`, and any garbage after it always starts with `&amp`

I Z
  • First things first: [Don't read lines with `for`](http://mywiki.wooledge.org/DontReadLinesWithFor). Then those lines are URL encoded (twice) so you should probably un-encode them. That'll get you saner output which you may be able to deal with more easily. But ultimately you need to come up with a way to figure out what part of each line is the information you care about. – Etan Reisner Sep 25 '15 at 17:53
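
A minimal sketch of what this comment suggests, assuming the noise really is HTML-entity encoding applied twice (for example `&amp;lt;` standing for `<`), which is a guess based on the question's description. The function names `decode_entities` and `print_decoded_lines` are made up for illustration, and only the three entities `&amp;`, `&lt;` and `&gt;` are handled.

```shell
#!/bin/sh
# Hypothetical helper: undo entity encoding that was applied twice,
# e.g. "&amp;lt;" -> "&lt;" -> "<". Only &amp;, &lt; and &gt; are handled.
decode_entities() {
    # sed applies the expressions in order to each line, so "&amp;" is
    # decoded first and the resulting "&lt;" / "&gt;" right afterwards.
    sed -e 's/&amp;/\&/g' -e 's/&lt;/</g' -e 's/&gt;/>/g'
}

# Hypothetical helper: read the file named in $1 line by line (no `for`).
print_decoded_lines() {
    decode_entities < "$1" |
    # IFS= keeps surrounding whitespace, -r keeps backslashes literal, and
    # the "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf 'line: %s\n' "$line"
    done
}
```

Calling `print_decoded_lines file_of_email_addresses` would then print each decoded line. The address still has to be picked out of it, which is what the answers deal with.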

2 Answers


Try this with GNU grep:

grep -Po '[\w.-]+@[\w.-]+' file

Output:

name.lastname@bar.com
someone@foo.bar.baz.edu
someone@foo.com
nobody@nowere.com
ab@cd.com
no@noise.com

It's not perfect but perhaps it is sufficient for your task.
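
If `-P` is not available (it is specific to GNU grep built with PCRE support), roughly the same match can be written with `-E` and a POSIX bracket expression. This is only a sketch: `extract_addresses` and `process_addresses` are hypothetical names, and `-o` itself is a widely supported extension rather than something POSIX guarantees.

```shell
#!/bin/sh
# Hypothetical helper: print every address-like token, one per line.
# [[:alnum:]_] stands in for \w; reads the named files, or stdin if none.
extract_addresses() {
    grep -Eo '[[:alnum:]_.-]+@[[:alnum:]_.-]+' "$@"
}

# Hypothetical helper: handle each extracted address in a loop,
# reading grep's output with `while read` instead of `for`.
process_addresses() {
    extract_addresses "$1" | while IFS= read -r addr; do
        printf 'address: %s\n' "$addr"
    done
}
```

Something like `process_addresses file_of_email_addresses` would then run the loop body once per extracted address.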

Cyrus

It would be better to use a tool that's built for pattern matching, such as sed. It would help to first decode the data, as Etan suggested, but if you're willing to assume

  • that the leading segments you want to remove will always end with a ;,
  • that the trailing segments you want to remove will always begin with an &,
  • that the desired addresses will not contain either of those characters, and
  • that every line will contain exactly one @, and that it is the one in the address,

then you can do this:

sed 's/^\([^@]*;\)\?\([^&;]*@[^&;]*\).*/\2/' file_of_email_addresses
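
Note that `\?` in a basic regular expression is a GNU sed extension. A sketch of the same substitution in extended syntax (`-E`, accepted by both GNU and BSD sed) is below. The wrapper name `strip_noise` is hypothetical, and the sample inputs in the comments are reconstructed from the question's description of the noise (leading junk ending in `;`, trailing junk starting with `&`), not copied from the real file.

```shell
#!/bin/sh
# Hypothetical wrapper around the substitution above, in ERE syntax.
#   ([^@]*;)?          optional leading junk that ends with ';'
#   ([^&;]*@[^&;]*)    the address: no '&' or ';' on either side of '@'
#   .*                 any trailing junk, dropped along with group 1
strip_noise() {
    sed -E 's/^([^@]*;)?([^&;]*@[^&;]*).*/\2/' "$@"
}

# Assumed inputs, per the bullet-point assumptions listed above:
#   &amp;lt;someone@foo.com&amp;gt;Mobile   becomes   someone@foo.com
#   no@noise.com                            is left unchanged
```

It reads from the named files, or from standard input when none are given, e.g. `strip_noise file_of_email_addresses`.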
John Bollinger