Jan 23 00:46:24 portal postfix/smtp[31481]: 1B1653FEA1: to=<wanted1918_ke@yahoo.com>, relay=mta5.am0.yahoodns.net[98.138.112.35]:25, delay=5.4, delays=0.02/3.2/0.97/1.1, dsn=5.0.0, status=bounced (host mta5.am0.yahoodns.net[98.138.112.35] said: 554 delivery error: dd This user doesn't have a yahoo.com account (wanted1918_ke@yahoo.com) [0] - mta1321.mail.ne1.yahoo.com (in reply to end of DATA command))
Jan 23 00:46:24 portal postfix/smtp[31539]: AF40C3FE99: to=<devi_joshi@yahoo.com>, relay=mta7.am0.yahoodns.net[98.136.217.202]:25, delay=5.9, delays=0.01/3.1/0.99/1.8, dsn=5.0.0, status=bounced (host mta7.am0.yahoodns.net[98.136.217.202] said: 554 delivery error: dd This user doesn't have a yahoo.com account (devi_joshi@yahoo.com) [0] - mta1397.mail.gq1.yahoo.com (in reply to end of DATA command))

From the above maillog I would like to extract the email addresses enclosed between angle brackets < ... >, e.g. to=<wanted1918_ke@yahoo.com> to wanted1918_ke@yahoo.com.

I am using cut -d' ' -f7 to extract the emails, but I am curious whether there is a more flexible way.
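For reference, a cut-only variant that splits on the angle brackets themselves, rather than counting space-separated fields, could look like this (a sketch; it assumes each log line contains exactly one < ... > pair, as in the sample above):

```shell
# Sample maillog line (shortened):
line='Jan 23 00:46:24 portal postfix/smtp[31481]: 1B1653FEA1: to=<wanted1918_ke@yahoo.com>, relay=mta5.am0.yahoodns.net[98.138.112.35]:25'

# Take everything after the first '<', then drop everything from '>' on.
# Assumes exactly one <...> pair per line.
printf '%s\n' "$line" | cut -d'<' -f2 | cut -d'>' -f1
```

This is less fragile than counting to field 7, but still breaks if a line ever contains a second < before the address.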

mklement0
sherpaurgen

4 Answers


With GNU grep, just use a regular expression containing a look-behind and a look-ahead:

$ grep -Po '(?<=to=<).*(?=>)' file
wanted1918_ke@yahoo.com
devi_joshi@yahoo.com

This says: hey, extract all the strings preceded by to=< and followed by >.

fedorqui

You can use awk like this:

awk -F'to=<|>,' '{print $2}' the.log

I'm splitting the line by to=< or >, and printing the second field.
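To make the split concrete, here is the same command run on a single shortened sample line (field numbers are 1-based; $2 is the text between to=< and >,):

```shell
line='Jan 23 00:46:24 portal postfix/smtp[31539]: AF40C3FE99: to=<devi_joshi@yahoo.com>, relay=mta7.am0.yahoodns.net[98.136.217.202]:25'

# -F'to=<|>,' splits the line into three fields:
#   $1 = everything before "to=<"
#   $2 = the email address
#   $3 = everything after ">,"
printf '%s\n' "$line" | awk -F'to=<|>,' '{print $2}'
```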

hek2mgl

Just to show a sed alternative (requires GNU or BSD/macOS sed due to -E):

sed -E 's/.* to=<(.*)>.*/\1/' file

Note how the regex must match the entire line so that the substitution of the capture-group match (the email address) yields only that match.

A slightly more efficient - but perhaps less readable - variation is
sed -E 's/.* to=<([^>]*).*/\1/' file


A POSIX-compliant formulation is a little more cumbersome due to the legacy syntax required by BREs (basic regular expressions):

sed 's/.* to=<\(.*\)>.*/\1/' file

A variation of fedorqui's helpful GNU grep answer:

grep -Po ' to=<\K[^>]*' file

\K, which drops everything matched up to that point, is not only syntactically simpler than a look-behind assertion ((?<=...)), but also more flexible - it supports variable-length expressions - and faster (though that may not matter in many real-world situations; if performance is paramount, see below).
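To illustrate the variable-length point (a sketch, assuming a GNU grep with -P support): a quantified prefix is fine before \K, whereas look-behind assertions in PCRE traditionally had to be fixed-length.

```shell
# A variable-length prefix (' +', one or more spaces) before \K works;
# the match proper starts only after \K, so just the address is printed.
printf 'x  to=<a@b.com>,\n' | grep -Po ' +to=<\K[^>]*'

# The look-behind equivalent, '(?<= +to=<)[^>]*', was historically
# rejected with "lookbehind assertion is not fixed length".
```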


Performance comparison

Here's how the various solutions on this page compare in terms of performance.

Note that this may not matter much in many use cases, but gives insight into:

  • the relative performance of the various standard utilities
  • for a given utility, how tweaking the regex can make a difference.

The absolute values are not important, but the relative performance hopefully provides some insight. See the bottom for the script that produced these numbers, which were obtained on a late-2012 27" iMac running macOS 10.12.3, using a 250,000-line input file created by replicating the sample input from the question, averaging the timings of 10 runs each.

Mawk                            0.364s
GNU grep, \K, non-backtracking  0.392s
GNU awk                         0.830s
GNU grep, \K                    0.937s
GNU grep, (?<=...)              1.639s
BSD grep + cut                  2.733s
GNU grep + cut                  3.697s
BSD awk                         3.785s
BSD sed, non-backtracking       7.825s
BSD sed                         8.414s
GNU sed                         16.738s
GNU sed, non-backtracking       17.387s

A few conclusions:

  • The specific implementation of a given utility matters.
  • grep is generally a good choice, even if it needs to be combined with cut.
  • Tweaking the regex to avoid backtracking and look-behind assertions can make a difference.
  • GNU sed is surprisingly slow, whereas GNU awk is faster than BSD awk. Strangely, the (partially) non-backtracking solution is slower with GNU sed.

Here's the script that produced the timings above; note that the g-prefixed commands are GNU utilities that were installed on macOS via Homebrew; similarly, mawk was installed via Homebrew.

Note that "non-backtracking" only applies partially to some of the commands.

#!/usr/bin/env bash

# Define the test commands.
test01=( 'BSD sed'                        sed -E 's/.*to=<(.*)>.*/\1/' )
test02=( 'BSD sed, non-backtracking'      sed -E 's/.*to=<([^>]*).*/\1/' )
# ---
test03=( 'GNU sed'                        gsed -E 's/.*to=<(.*)>.*/\1/' )
test04=( 'GNU sed, non-backtracking'      gsed -E 's/.*to=<([^>]*).*/\1/' )
# ---
test05=( 'BSD awk'                        awk  -F' to=<|>,' '{print $2}' )
test06=( 'GNU awk'                        gawk -F' to=<|>,' '{print $2}' )
test07=( 'Mawk'                           mawk -F' to=<|>,' '{print $2}' )
#--
test08=( 'GNU grep, (?<=...)'             ggrep -Po '(?<= to=<).*(?=>)' )
test09=( 'GNU grep, \K'                   ggrep -Po ' to=<\K.*(?=>)' )
test10=( 'GNU grep, \K, non-backtracking' ggrep -Po ' to=<\K[^>]*' )
# --
test11=( 'BSD grep + cut'                 "{ grep -o  ' to=<[^>]*' | cut  -d'<' -f2; }" )
test12=( 'GNU grep + cut'                 "{ ggrep -o ' to=<[^>]*' | gcut -d'<' -f2; }" )

# Determine input and output files.
inFile='file'
# NOTE: Do NOT use /dev/null, because GNU grep apparently takes a shortcut
#       when it detects stdout going nowhere, which distorts the timings.
#       Use /dev/tty if you want to see stdout in the terminal (will print
#       as a single block across all tests before the results are reported).
outFile="/tmp/out.$$"
# outFile='/dev/tty'

# Make `time` only report the overall elapsed time.
TIMEFORMAT='%6R'

# How many runs per test whose timings to average.
runs=10

# Read the input file up front to even the playing field, so that the first command
# doesn't take the hit of being the first to load the file from disk.
echo "Warming up the cache..."
cat "$inFile" >/dev/null

# Run the tests.
echo "Running $(awk '{print NF}' <<<"${!test*}") test(s), averaging the timings of $runs run(s) each; this may take a while..."
{
    for n in ${!test*}; do    
        arrRef="$n[@]"
        test=( "${!arrRef}" )
        # Print test description.
        printf '%s\t' "${test[0]}"
        # Execute test command.
        if (( ${#test[@]} == 2 )); then # single-token command? assume `eval` must be used.
          time for (( n = 0; n < runs; n++ )); do eval "${test[@]: 1}" < "$inFile" >"$outFile"; done
        else # multiple command tokens? assume that they form a simple command that can be invoked directly.
          time for (( n = 0; n < runs; n++ )); do "${test[@]: 1}" "$inFile" >"$outFile"; done
        fi
    done
} 2>&1 | 
  sort -t$'\t' -k2,2n | 
    awk -v runs="$runs" '
      BEGIN{FS=OFS="\t"} { avg = sprintf("%.3f", $2/runs); print $1, avg "s" }
    ' | column -s$'\t' -t
mklement0
awk -F'[<>]' '{print $2}' file

wanted1918_ke@yahoo.com
devi_joshi@yahoo.com
Claes Wikner