
I am trying to parse a log file to extract email addresses. I can match the emails and print them with regular expressions, but I noticed that there are a couple of duplicate emails in my log file. Can you help me figure out how to remove the duplicates and print only the unique email addresses based on the matched patterns?

Here is the code I have written so far:

import sys
import re

file = open('/Users/me/Desktop/test.txt', 'r')
temp =[]
for line in file.readlines():
    if '->' in line:
        temp = line.split('->')
    elif '=>' in line:
        temp = line.split('=>')

    if temp:
        #temp[1].strip()
        pattern = re.match('^\x20\w{1,}@\w{1,}\.\w{2,3}\x20?', str(temp[1]), re.M)
        if pattern is not None:
            print pattern.group()

        else:
            print "nono"

Here is my example log file that I am trying to parse:

Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23 1Wuniq-mail-idSo-Fg -> someuser@somedomain.com R=mail T=remote_smtp H=smtp.mail.net [000.00.34.17]

Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23 1Wuniq-mail-idSo-Fg -> someuser@somedomain.com R=mail T=remote_smtp H=smtp.mail.net [000.00.34.17]

Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => someuser@somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]

Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => me@somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]

Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => wo@somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]

Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => lol@somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]

Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h Completed

Also, I am curious whether I can improve my program or the regex. Any suggestions would be very helpful.

Thanks in advance.

abhinav singh

3 Answers


As danidee (he was first) said, a set would do the trick.
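The core idea: adding an element that is already present leaves a set unchanged, so duplicates collapse automatically. A minimal illustration:

emails = ['a@b.com', 'c@d.com', 'a@b.com']
print(set(emails))  # the duplicate 'a@b.com' is gone; order is arbitrary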

Try this:

from __future__ import print_function

import re

with open('test.txt') as f:
    data = f.read().splitlines()

emails = set(re.sub(r'^.*\s+(\w+\@[^\s]*?)\s+.*', r'\1', line) for line in data if '@' in line)

print('\n'.join(emails)) if len(emails) else print('nono')

Output:

lol@somedomain.com
me@somedomain.com
someuser@somedomain.com
wo@somedomain.com

PS: you may want to do a proper email regex check, because I used a very primitive one.
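For instance, something along these lines is still simplistic but less restrictive than \w-only parts (a sketch, not full RFC 5322 validation; the sample address is made up):

import re

# allows dots/plus/hyphens in the local part and multi-label domains
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

print(EMAIL_RE.findall('contact first.last+tag@sub.example.co.uk today'))
# ['first.last+tag@sub.example.co.uk']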

MaxU - stand with Ukraine

You can use a set container to preserve the unique results; each time you want to print a matched email, check whether it is already in the set and print it only if it is not:

import sys
import re

file = open('/Users/me/Desktop/test.txt', 'r')
temp = []
seen = set()
for line in file.readlines():
    if '->' in line:
        temp = line.split('->')
    elif '=>' in line:
        temp = line.split('=>')

    if temp:
        #temp[1].strip()
        pattern = re.match('^\x20\w{1,}@\w{1,}\.\w{2,3}\x20?', str(temp[1]), re.M)
        if pattern is not None:
            matched = pattern.group()
            if matched not in seen:
                print matched
                seen.add(matched)  # remember the address so later duplicates are skipped

        else:
            print "nono"
Mazdak

Some of the duplicates are due to a bug in your code: you never reset temp between lines. A line that contains neither -> nor =>, but is preceded by a line that does contain one of them, will still pass the if temp: test and print the email address left over from the previous line.
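A minimal sketch of that behaviour (hypothetical two-line input, not your actual log):

temp = []
for line in ['id1 -> someuser@somedomain.com R=mail', 'id2 Completed']:
    if '->' in line:
        temp = line.split('->')
    # the second line has no arrow, so temp still holds the previous split
    if temp:
        print(temp[1])  # the same address is printed twice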

That can be fixed by jumping back to the start of the loop with continue when the line contains neither -> nor =>.

For the other genuine duplicates that occur because the same email address appears in multiple lines, you can filter those out with a set.

import re

addresses = set()
pattern = re.compile('^\x20\w{1,}@\w{1,}\.\w{2,3}\x20?')

with open('/Users/me/Desktop/test.txt', 'r') as f:
    for line in f:
        if '->' in line:
            temp = line.split('->')
        elif '=>' in line:
            temp = line.split('=>')
        else:
            # neither '=>' nor '->' present in the line
            continue

        match = pattern.match(temp[1])
        if match is not None:
            addresses.add(match.group())
        else:
            print "nono"

for address in sorted(addresses):
    print(address)

The addresses are stored in a set to remove duplicates. Then they are sorted and printed. Note also the use of the with statement to open the file within a context manager. This guarantees that the file will always be closed.
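For reference, the with block is roughly equivalent to this try/finally sketch:

f = open('/Users/me/Desktop/test.txt', 'r')
try:
    for line in f:
        pass  # process the line
finally:
    f.close()  # runs even if the loop raises an exception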

Also, as you will be applying the same regex pattern many times, it is worth compiling it ahead of time for better efficiency.
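If you want to measure the difference yourself, timeit is one way (an illustrative sketch; note that re also keeps an internal cache of recently compiled patterns, so this mostly measures the per-call lookup and argument-handling overhead):

import timeit

setup = "import re; line = ' someuser@somedomain.com R=mail'"

# pattern recompiled (or fetched from re's cache) on every call
t1 = timeit.timeit(r"re.match(r' \w+@\w+\.\w{2,3}', line)",
                   setup=setup, number=100000)

# pattern object compiled once, then reused
t2 = timeit.timeit("pattern.match(line)",
                   setup=setup + r"; pattern = re.compile(r' \w+@\w+\.\w{2,3}')",
                   number=100000)

print('re.match each time: %.3fs, precompiled: %.3fs' % (t1, t2))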

With a properly written regex pattern your code can be greatly simplified:

import re

addresses = set()
pattern = re.compile(r'[-=]> +(\w{1,}@\w{1,}\.\w{2,3})')

with open('test.txt', 'r') as f:
    for line in f:
        match = pattern.search(line)
        if match:
            addresses.add(match.groups()[0])

for address in sorted(addresses):
    print(address)
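Run against the sample log, this prints the four unique addresses in sorted order:

lol@somedomain.com
me@somedomain.com
someuser@somedomain.com
wo@somedomain.com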
mhawke
  • No wonder why I was getting more duplicates than their actual count in the log file. Thanks for pointing out the bug. Testing the solution :) . Cheers! – abhinav singh Mar 02 '16 at 11:36
  • @abhinavsingh: I've updated with simplified code by using a more targeted regex pattern. – mhawke Mar 02 '16 at 11:55
  • Wow. the new simplified code definitely makes more sense and simplifies the script. Thanks for the added help :) – abhinav singh Mar 02 '16 at 12:04