I am trying to parse a log file to extract email addresses. I can match each address with a regular expression and print it, but the log file contains several duplicate addresses. How can I remove the duplicates and print only the unique email addresses that the pattern matches?
Here is the code I have written so far:
import sys
import re

file = open('/Users/me/Desktop/test.txt', 'r')
temp = []
for line in file.readlines():
    if '->' in line:
        temp = line.split('->')
    elif '=>' in line:
        temp = line.split('=>')
    if temp:
        #temp[1].strip()
        pattern = re.match('^\x20\w{1,}@\w{1,}\.\w{2,3}\x20?', str(temp[1]), re.M)
        if pattern is not None:
            print pattern.group()
        else:
            print "nono"
Here is my example log file that I am trying to parse:
Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23 1Wuniq-mail-idSo-Fg -> someuser@somedomain.com R=mail T=remote_smtp H=smtp.mail.net [000.00.34.17]
Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23 1Wuniq-mail-idSo-Fg -> someuser@somedomain.com R=mail T=remote_smtp H=smtp.mail.net [000.00.34.17]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => someuser@somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => me@somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => wo@somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => lol@somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h Completed
I would also like to know whether my program or the regex can be improved. Any suggestions would be very helpful.
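To make the goal concrete, here is a minimal sketch of the behaviour I am after, written under assumptions rather than as a definitive implementation. It assumes that the address is the first whitespace-delimited token after -> or =>, and that remembering already-seen addresses in a set is an acceptable way to drop duplicates. The names ADDRESS_RE and unique_addresses are hypothetical, and the sample lines are copied from the log above.

```python
import re

# Hypothetical pattern (an assumption about the log layout, not a full
# email validator): after '->' or '=>', capture the next run of
# non-space characters that contains an '@'.
ADDRESS_RE = re.compile(r'[-=]>\s+(\S+@\S+)')


def unique_addresses(lines):
    """Return the matched addresses in first-seen order, without duplicates."""
    seen = set()      # addresses already emitted
    ordered = []      # keeps the order in which addresses first appear
    for line in lines:
        match = ADDRESS_RE.search(line)
        if match is None:
            continue  # e.g. "Completed" lines carry no recipient
        address = match.group(1)
        if address not in seen:
            seen.add(address)
            ordered.append(address)
    return ordered


# Four lines taken from the example log in the question.
sample = [
    "Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23 1Wuniq-mail-idSo-Fg -> someuser@somedomain.com R=mail T=remote_smtp H=smtp.mail.net [000.00.34.17]",
    "Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => someuser@somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]",
    "Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => me@somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]",
    "Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h Completed",
]

for address in unique_addresses(sample):
    print(address)
```

On these four sample lines the sketch prints someuser@somedomain.com and me@somedomain.com once each. If the order of the output does not matter, collecting the matches directly into a set would be shorter.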
Thanks in advance.