
I'm currently using the following regex to match email addresses in Python:

EMAIL_REGEX = re.compile(r"[^@]+@[^@]+\.[^@]+")

This works, except that it also matches addresses with domains such as '.co.uk'. For my project, I am simply trying to get a count of international-looking TLDs. I understand that this doesn't guarantee that my users are US-based, but it does give me a count of my users without internationally based TLDs (or what we would consider internationally based TLDs: .co.uk, .jp, etc.).

Alpinestar22
  • 1
    How do you know that your users don't have any of those email addresses? (.gov.us, .aaa.bbb, etc.) – xxbbcc Jan 12 '15 at 16:04
  • 1
    What do you want to do with `@free.fr`, `@gmx.de` etc.? They are non-US and nevertheless have only 2 parts... – glglgl Jan 12 '15 at 16:05
  • http://stackoverflow.com/questions/8022530/python-check-for-valid-email-address – xxbbcc Jan 12 '15 at 16:06
  • In place of the final `[^@]+` you could do something like this: `(com|net|org|gov|mil|edu|us)`. – David Faber Jan 12 '15 at 16:07
  • And even that won't help much - anyone outside the U.S. can have a `.com`, `.net` or `.org`. – glglgl Jan 12 '15 at 16:08
  • So... I'm assuming @free.fr is an international domain, which I would then want to discard. I guess it's possible to have the .gov.us... but it is less likely in my set of emails (most tend to be generic hotmails or gmails, or hotmails/gmails + .co.uk, etc.) – Alpinestar22 Jan 12 '15 at 16:08
  • I live in the US and I have (and regularly use) several non-US email addresses. – xxbbcc Jan 12 '15 at 16:08
  • 1
    And, BTW, the TLD is the top level domain. Top level is at the end, so there is never a `.` after the TLD. In `.co.uk`, `uk` is the TLD. – glglgl Jan 12 '15 at 16:09
  • 4
    All these comments try to point out to you that what you're trying to do is pointless. You cannot determine from the email address whether it's US or non-US and even trying is pointless - why do you care what email address a user of yours is using? – xxbbcc Jan 12 '15 at 16:09
  • Well, instead of downvoting you probably need to realize that these are obviously my business requirements or I wouldn't be asking. Does it not make sense that if some one has a .jp TLD, or .co.uk, it stands to reason (with considerable accuracy) that they are internationally based? Other than a username, and some interests, this is all I have to determine the location of my users. – Alpinestar22 Jan 12 '15 at 16:13
  • 3
    As an aside, `com` is _not_ the TLD for the US (`us` is), but for everything "commercial", and can be used outside the US. Same for e.g. `net` etc. – tobias_k Jan 12 '15 at 16:14
  • 2
    @Alpinestar22 You have _no way_ to determine the location of your users. You _may_ be able to use the IP address for a _guess_ but that's as close as it gets: IP can be spoofed, routed through other computers, etc. You can require a US address for your users but even that doesn't guarantee that the user actually lives there. I have a US address but I could be living outside the US. Or I could have a US friend whose address I use, etc. – xxbbcc Jan 12 '15 at 16:14
  • Understandable @xxbbcc. However, when my user's username is in Japanese, and their TLD is '.jp', can I not strip these from my current working project? For some reason, it seems like this would be the best way to go about it? – Alpinestar22 Jan 12 '15 at 16:18
  • @Alpinestar22 Why do you care that your user's user name is in Japanese? And what does it matter that the email ends with `.jp`? What difference does it make? If your users pay for your service, why do you care where the money comes from? Checking the username / email address doesn't give you any information about the user's location, no matter what characters make up either of them. – xxbbcc Jan 12 '15 at 16:20
  • @Alpinestar22 You can require a credit card with a US address but even that only tells you that your user has a US-based credit card, nothing else. And the user may still want to use special characters in his/her user name but that's independent of location. Your database should be able to store Unicode characters, so blocking special characters makes no sense (a few special symbols aside). – xxbbcc Jan 12 '15 at 16:22
  • I understand where you are coming from @xxbbcc, but maybe I need to do this as a business requirement? If it can't be done, I understand... this isn't a debate about what I 'should' do, but rather how to alter the current regex in order to accomplish my business requirements. Let's say I wanted to run a test for a new part of my service to US only residents, and not my entire userbase - this is the quickest way for me to get a sample set. I understand that it won't be perfect, but it helps me test my new service with 'slightly' greater accuracy, albeit only a little. – Alpinestar22 Jan 12 '15 at 16:24
  • @Alpinestar22 Lol, I'm sorry if I gave you the impression that I was arguing with you. :) My arguments are points to your business people that their requirements are nonsense. – xxbbcc Jan 12 '15 at 16:27
  • @xxbbcc, I understand... but if the goal of changing the regex is simply to get a count of TLD's WITHOUT '.co.uk', or '.jp'... that is all I'm looking for, not a full on debate about whether or not it means the person is actually located internationally. Most people with the same requirement would jump to this initial conclusion, test it for validity, and move on. I'm simply testing for validity and getting a count of my users without the '.co.uk', etc. – Alpinestar22 Jan 12 '15 at 16:30
  • @Alpinestar22 You should probably update your question clearly explaining that. Honestly, I don't understand at this point what you're trying to do. You _can_ come up with a regex for your purpose but the number you're going to get at the end is likely meaningless. – xxbbcc Jan 12 '15 at 16:44

1 Answer


What you want is very difficult.

If I make a mail server called this.is.my.email.my-domain.com, and an account called martin, my perfectly valid US email would be martin@this.is.my.email.my-domain.com. Email addresses with more than one domain part are not uncommon (addresses under .gov are a common example).

Disallowing emails from the .uk TLD is also problematic, since many US-based people might have a .uk address, for example because they think it sounds nice, work for a UK-based company, have a UK spouse, or used to live in the UK and never changed emails.

If you only want US-based registrations, your options are:

  • Ask your users if they are US-based, and tell them your service is only for US-based users if they answer with a non-US country.

  • Ask for a US address or phone number. Although this can be faked, it's not easy to get a matching address & ZIP code, for example.

  • Use GeoIP, and allow only US IP addresses. This is not fool-proof, since people can use your service while on holiday abroad and such.

In the question's comments, you said:

Does it not make sense that if some one has a .jp TLD, or .co.uk, it stands to reason (with considerable accuracy) that they are internationally based?

Usually, yes. But far from always. My girlfriend has 4 .uk email addresses, and she doesn't live in the UK anymore :-) This is where you have to make a business choice. You can either:

  1. Turn away potential customers
  2. Take more effort in allowing customers with slightly "strange" email addresses

Your business, your choice ;-)

So, with that preamble, if you must do this, this is how you could do it:

import re

EMAIL_REGEX = re.compile(r'''
    ^             # Anchor to the start of the string
    [^@]+         # Username
    @             # Literal @
    ([^@.]+){1}   # One domain part
    \.            # Literal .
    ([^@.]+){1}   # One domain part (the TLD)
    $             # Anchor to the end of the string
''', re.VERBOSE)

print(EMAIL_REGEX.search('test@example.com'))
print(EMAIL_REGEX.search('test@example.co.uk'))

Of course, this still allows you to register with a .nl address, for example. If you want to allow only a certain set of TLDs, then use:

allowed_tlds = ['com', 'net']  # ... Probably more
result = EMAIL_REGEX.search('test@example.com')
# Reject if the address doesn't match, or if its TLD (group 2) isn't whitelisted
if result is None or result.group(2) not in allowed_tlds:
    print('Not allowed')

However, if you're going to create a whitelist, then you don't need the regexp anymore; dropping it allows US people with multi-domain addresses (such as @nlm.nih.gov) to sign up.
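As a rough sketch of that whitelist-without-a-regexp idea, and of the "count" the question asks for: the code below takes whatever follows the last dot of the domain and checks it against a set. The names `ALLOWED_TLDS`, `get_tld` and `has_allowed_tld`, the sample addresses, and the choice to include `gov` in the whitelist are all made up for illustration; adjust them to your own data.

```python
from collections import Counter

# Hypothetical whitelist (an assumption, not a complete list); extend as needed.
ALLOWED_TLDS = {'com', 'net', 'gov'}

def get_tld(email):
    """Return the lower-cased text after the last '.' of the domain, or None."""
    # rpartition splits on the last '@'; sep is '' when there is no '@' at all
    _, sep, domain = email.rpartition('@')
    if not sep or '.' not in domain:
        return None
    return domain.rsplit('.', 1)[1].lower()

def has_allowed_tld(email):
    """True if the address ends in one of the whitelisted TLDs."""
    return get_tld(email) in ALLOWED_TLDS

# Made-up sample data
emails = [
    'test@example.com',
    'someone@nlm.nih.gov',
    'test@example.co.uk',
    'not-an-email',
]

# Tally of every TLD seen (None for strings that don't look like an address)
tld_counts = Counter(get_tld(e) for e in emails)
# Number of addresses whose TLD is on the whitelist
allowed_count = sum(has_allowed_tld(e) for e in emails)

print(tld_counts)
print(allowed_count)
```

Because only the last domain label is inspected, multi-part domains such as @nlm.nih.gov pass, while `.co.uk` is rejected simply because `uk` is not in the set. As discussed above, this says nothing certain about where a user actually lives.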

Martin Tournoij
  • I agree with the solution not being 100% accurate, however in your specific example, your girlfriend DID used to live in the UK. While her current status is different, it does prove that her .uk email address originated when she was living in the UK - unless I am missing something. Again, I'm simply just trying to gather a sample set with 'slightly' higher probability of US users. If I were registering new users at the moment, I would most likely be filtering and geo locating my customers before they were able to sign up :) – Alpinestar22 Jan 12 '15 at 16:35
  • 1
    @Alpinestar22 Well, if you want to "guess" where someone is located, then that's an entirely different question as to what you asked... From what you *asked in the question*, it looked like you wanted to allow only US-based submits on a form or some such (although you didn't explicitly say this, it's just what I and I think everyone else assumed, since it's the most likely use case)... I've updated the question with some code. – Martin Tournoij Jan 12 '15 at 16:45