
I was trying to write a regex to detect email addresses of the form 'abc@xyz.com' in Java. I came up with a simple pattern.

String line = // my line containing email address
Pattern myPattern = Pattern.compile("()(\\w+)( *)@( *)(\\w+)\\.com");
Matcher myMatcher = myPattern.matcher(line);

This will however also detect email addresses of the type 'abcd.efgh@xyz.com'. I went through http://www.regular-expressions.info/ and links on this site like

How to match only strings that do not contain a dot (using regular expressions)

Java RegEx meta character (.) and ordinary dot?

So I changed my pattern to the following to avoid detecting 'efgh@xyz.com'

Pattern myPattern = Pattern.compile("([^\\.])(\\w+)( *)@( *)(\\w+)\\.com");
Matcher myMatcher = myPattern.matcher(line);
String mailid = myMatcher.group(2) + "@" + myMatcher.group(5) + ".com";

If the String 'line' contains the address 'abcd.efgh@xyz.com', my String mailid comes back as 'fgh@xyz.com'. Why does this happen? How do I write the regex to detect only 'abc@xyz.com' and not 'abcd.efgh@xyz.com'?

Also, how do I write a single regex to detect email addresses like 'abc@xyz.com', 'efg at xyz.com' and 'abc (at) xyz (dot) com' in strings? Basically, how would I implement OR logic in a regex to check for @ OR at OR (at)?

After some comments below I tried the following expression to get the part before the @ squared away.

Pattern myPattern = Pattern.compile("((([\\w]+\\.)+[\\w]+)|([\\w]+))@(\\w+)\\.com");
Matcher myMatcher = myPattern.matcher(line);

What will the groups of myMatcher be? How are the groups numbered when the brackets are nested?

System.out.println(myMatcher.group(1));
System.out.println(myMatcher.group(2));
System.out.println(myMatcher.group(3));
System.out.println(myMatcher.group(4));
System.out.println(myMatcher.group(5));

For 'abcd.efgh@xyz.com' the output was

abcd.efgh
abcd.efgh
abcd.
null
xyz

and for 'abc@xyz.com' it was

abc
null
null
abc
xyz

Thanks.

Chinmay Nerurkar
  • Why are you allowing blanks before and after the `@`? That's not valid in email addresses. – Jim Garrison Mar 27 '12 at 18:29
  • Why do you want to detect email addresses written in a format "me (at) example (dot) com"? If someone writes that, they have gone out of their way to avoid it being parsed by a machine. – Andrew Morton Mar 27 '12 at 18:32
  • @Jim I left the blanks in to detect email addresses written like 'abc @ xyz.com', which are written that way so that they are not easy to mine. – Chinmay Nerurkar Mar 27 '12 at 18:33
  • @Andrew - I am trying to write something to read those addresses as a part of a course I have taken. Suspicious as it looks I am working with locally stored html files provided by the university and not mining data on the internet. – Chinmay Nerurkar Mar 27 '12 at 18:36

2 Answers


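You can use the | operator in your regex to match @ OR at OR (at): @|at|\(at\). The parentheses have to be escaped to be matched literally.
You can reject addresses with a dot before the @ by anchoring the pattern with ^ and requiring the whole line to match, so that \w+ has to cover everything before the @. Note that matches() only succeeds when the line contains nothing but the address.
Try this: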

    Pattern myPattern = Pattern.compile("^(\\w+)\\s*(@|at|\\(at\\))\\s*(\\w+)\\.(\\w+)");
    Matcher myMatcher = myPattern.matcher(line);
    if (myMatcher.matches())
    {
        String mail = myMatcher.group(1) + "@" + myMatcher.group(3) + "." +myMatcher.group(4);
        System.out.println(mail);
    }
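
If the address is embedded in a longer line, the same alternation idea can be used with find() instead of matches(). The following is only a rough sketch under a few assumptions of mine: the class name ObfuscatedEmailFinder and the method findEmails are made up for illustration, the bare words "at" and "dot" only count as separators when surrounded by whitespace (otherwise something like 'batman.com' would be read as 'b@man.com'), and no attempt is made to reject dotted local parts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ObfuscatedEmailFinder {
    // "@" or "(at)" with optional blanks, or the bare word "at" set off by whitespace.
    private static final String AT = "(?:\\s*@\\s*|\\s*\\(at\\)\\s*|\\s+at\\s+)";
    // "." or "(dot)" with optional blanks, or the bare word "dot" set off by whitespace.
    private static final String DOT = "(?:\\.|\\s*\\(dot\\)\\s*|\\s+dot\\s+)";
    // Groups: 1 = local part, 2 = domain, 3 = top-level domain.
    private static final Pattern EMAIL =
            Pattern.compile("(\\w+)" + AT + "(\\w+)" + DOT + "(\\w+)");

    // Hypothetical helper: returns every address found in the line, normalized to a@b.c form.
    static List<String> findEmails(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(line);
        while (m.find()) {
            found.add(m.group(1) + "@" + m.group(2) + "." + m.group(3));
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [abc@xyz.com, efg@xyz.com]
        System.out.println(findEmails("write to abc@xyz.com or efg at xyz.com"));
        // prints [abc@xyz.com]
        System.out.println(findEmails("abc (at) xyz (dot) com"));
        // prints []
        System.out.println(findEmails("see batman.com"));
    }
}
```

Be aware that ordinary prose such as 'look at xyz.com' would still be reported as an address, so this is probably only reasonable for text where obfuscated addresses are expected.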

dexametason

Your first pattern needs to express both constraints, word characters only and no dots, in a single character class; at the moment it has them as two separate groups. It should be:

[^\\.\W]+

This reads as 'not a dot' and 'not a non-word character'.

So you have:

Pattern myPattern = Pattern.compile("([^\\.\\W]+)( *)@( *)(\\w+)\\.com");

To answer your second question, you can express OR in a regex with the | character:

(@|at)
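
A note on the character class above: since a dot is not a word character, [^.\W] accepts exactly the same characters as \w, so on its own it still lets a search start in the middle of 'abcd.efgh'. One possible way around that is a negative lookbehind, which inspects the preceding character without consuming it. This is a sketch rather than a definitive fix; the class name LocalPartCheck and the method extract are hypothetical, and it assumes the address is searched for with find() inside a longer line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocalPartCheck {
    // (?<![\w.]) asserts that the character before the local part is neither a
    // word character nor a dot, so a match cannot begin inside "abcd.efgh".
    private static final Pattern PLAIN =
            Pattern.compile("(?<![\\w.])(\\w+) *@ *(\\w+)\\.com");

    // Hypothetical helper: returns the first plain address in the line, or null if there is none.
    static String extract(String line) {
        Matcher m = PLAIN.matcher(line);
        return m.find() ? m.group(1) + "@" + m.group(2) + ".com" : null;
    }

    public static void main(String[] args) {
        // The double negation accepts word characters and rejects the dot, just like \w.
        System.out.println("a".matches("[^.\\W]"));   // prints true
        System.out.println(".".matches("[^.\\W]"));   // prints false
        System.out.println(extract("mail abc@xyz.com today"));        // prints abc@xyz.com
        System.out.println(extract("mail abcd.efgh@xyz.com today"));  // prints null
    }
}
```

With the original ([^\\.])(\\w+) the class [^\\.] has to consume one character, which is why the 'e' of 'efgh' ended up in group 1 and only 'fgh' in group 2; the lookbehind avoids that because it matches a position, not a character.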
Ina
  • How does the 'not not word character' part work here? [^\\.\W]+ won't compile until I change it to [^\\.\\W]+, and that seems to negate the word 'abc' in abc@xyz.com and return only @xyz.com – Chinmay Nerurkar Mar 27 '12 at 19:33