5

When I was first learning regular expressions, we were taught to parse things like phone numbers (obviously always 5 digits, an optional space and a further 6 digits) and email addresses (obviously always alphanumerics, then a single '@', then alphanumerics followed by a '.' and three letters), and told that we should always do this to validate the data the user enters.

Of course, as I've developed I've learned how silly the basic approach can be, but the more I look, the more I question the concept altogether. The most open, careful, correct validation of something like an email address through regexes ends up being hundreds if not thousands of characters long in order to both accept all the legal cases and reject only the illegal ones. Even worse, all that effort does absolutely nothing for actual validity: the user may have accidentally added an 'a', may not use that email address at all, may be using someone else's address, or may use a '+' symbol that gets flagged inappropriately.

Yet at the same time, seemingly every site I come across still does this kind of technical checking, preventing me from putting more obscure characters in an email address or name, or objecting to the idea that someone would have more or fewer than a single title, a single first name and a single last name, all made purely from Latin characters, yet without any form of check that it's my real name.

Is there a benefit to this? Once injection attacks are handled (which should be done through methods other than sanitizing the input), is there any other point to these checks?

Or, on the other hand, is there a surefire way to actually validate user details other than to 'use' them in whatever way makes sense contextually and see if it falls over?

Cactus

2 Answers

17

Overly validating things is indeed one of the banes of the internet. Especially if the person writing the validation code has no actual knowledge of the problem domain. No, you probably do not actually know what the valid syntax for email addresses is. Or real-world addresses, especially internationally. Or telephone numbers. Or people's names.

Looking at a few localised examples (my email address) and extrapolating to rules covering all possible values within the domain (all email addresses) is madness. Unless you have perfect domain knowledge, you should not come up with rules about the domain. In the case of email addresses this leads to only a very narrow subset of possible email addresses actually being usable in daily life. Gee, thanks, guys.

As for people's names, whatever a person tells you is their name is by definition their name. It's what you call them by. You cannot validate it automatically; they'd have to send in a copy of their birth certificate for actual official validation. And even then, is that really what you're interested in knowing? Or do you merely need a "handle" to greet and identify them on your forum page?

Facebook does (did?) strict name validation in order to force people to use their real names to register. Well, many people I know on Facebook still use some made-up nonsense name. The filter obviously doesn't work. Having said this, perhaps it works well enough for Facebook: most people use their actual name because they can't be bothered to figure out which particular pattern will pass the validation. In that sense, such a filter can serve some purpose.

In the end it's up to you to decide on reasons for validation and the specific limits you want to enforce. The issue is that people often do not think about the bigger picture before writing validation code and they have no good reason for their specific limits. Don't fall into that trap.

deceze
  • 7
    Regarding Facebook's name validation, it not only fails to make people use real names, but also in some cases *prevents* users from using real names. One of my friends has to use a fake last name on Facebook because the site says her real last name is made up. – Thunderforge Jan 13 '17 at 18:59
0

is there any other point to these checks?

Certainly. Knowing that your data is valid is very important. In the case of email addresses, for example, sending an email to an address you haven't validated will, at the very least, lead to bounces. Enough bounces and your mailhost might block you for spamming. Not validating a phone number could lead to unnecessary costs if your app tries to send SMS to them. The list goes on and on.

Or on the other hand, is there actually a sure fire way to actually validate user details other than to 'use' them in whatever way makes sense contextually and see if it falls over?

Yes, but regex is generally a bad way to validate data. If a phone number is supposed to be "5 digits, a space, then 6 digits", then your check is going to fail if I type "5 digits, two spaces, then 6 digits" or "5 digits, a dash, then 6 digits" or "11 digits". Use common sense, and expect any crazy format the user provides. Know what the absolute minimal requirement is. For example, if you need 11 digits total, then strip everything that's not a digit first. Then formatting doesn't matter.
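As a rough illustration of the "strip everything that's not a digit first" idea, here is a minimal Python sketch. The function name `normalize_phone` and the default of 11 digits are assumptions made for this example, taken from the hypothetical format in the question rather than from any real numbering plan.

```python
import re
from typing import Optional


def normalize_phone(raw: str, required_digits: int = 11) -> Optional[str]:
    """Return only the digits of `raw`, or None if the count is wrong.

    `required_digits` is a hypothetical minimal requirement; adjust it
    to whatever your application actually needs.
    """
    # Drop everything that is not an ASCII digit: spaces, dashes, brackets...
    digits = re.sub(r"[^0-9]", "", raw)
    # The only rule enforced is the digit count, so formatting is irrelevant.
    return digits if len(digits) == required_digits else None


# All of these inputs normalize to the same stored value.
for text in ("01234 567890", "01234  567890", "01234-567890", "01234567890"):
    print(repr(text), "->", normalize_phone(text))
```

The regex here only discards formatting; it does not try to describe what a valid number looks like, which is the part that tends to go wrong.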

Also, read the RFCs. I can't count the number of times my email address has been rejected because it has a plus sign in it. The number of those rejections that came from large tech-oriented companies, with programmers who should know better, was rather disappointing.
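If you don't want to implement the full RFC grammar, one option is a deliberately permissive check that can't wrongly reject things like a plus sign. The sketch below is only an assumption about what "minimal" might mean (something before the last '@' and something after it); the name `looks_like_email` is made up for this example, and it is not RFC validation.

```python
def looks_like_email(value: str) -> bool:
    """Very permissive syntactic check, NOT full RFC validation.

    Only requires non-empty text on both sides of the last '@'.
    Whether the mailbox exists can only be learned by sending to it.
    """
    # rpartition splits on the LAST '@'; `at` is empty if none was found.
    local, at, domain = value.strip().rpartition("@")
    return bool(at) and bool(local) and bool(domain)


print(looks_like_email("user+tag@example.com"))
print(looks_like_email("no-at-sign"))
```

A check like this catches obviously empty or mangled input while leaving the real verification to an actual delivery attempt and bounce handling.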

Alex Howansky
  • "Knowing that your data is valid" is what I'm asking about. Knowing that an email address is RFC-compliant isn't going to stop a bounce from the address if it doesn't actually exist or was a disposable throwaway. A phone number regex that does even basic handling of area codes is going to be huge and rendered inaccurate after a handful of months. What would you say is a good way to validate data? – Cactus Mar 11 '16 at 16:31
  • Certainly, you can't stop bounces for non-existing addresses. (Which is why you need a bounce processor to catch them and prevent them from reoccurring.) However, checking your input for addresses that are *syntactically invalid* is certainly not a waste as long as you know what the correct valid syntax is. E.g., "followed by a '.' and three letters" is wrong. The issue isn't whether you should validate -- you should. The issue is, to what extent? Simply checking that you have a non-empty value for phone number might be sufficient for your use. Start small. – Alex Howansky Mar 11 '16 at 16:53