Which one should be performed first: sanitizing or validation?

Question

I have a field in my registration form, for instance a name field, that will be stored in the database in a column called user_name varchar(20). It's clear that I should validate the user input. Suppose I validate this field first with the code below:

<?php
if (empty($_POST['name']) || strlen($_POST['name']) > 20) {
    // send a "not valid input" error
} else {
    $name = htmlspecialchars($_POST['name']);
    // check for SQL injection
    // insert name into database
}
?>

If a user enters a name like <i> some one </i>, the string length is 17, so the else branch runs and the name becomes &lt;i&gt; some one &lt;/i&gt;, whose length is 29, which produces an error when inserting into the database. If I then tell the user that the input is too long, they will be confused. What should I do? What is the best approach?

You should never encode data before storing it. Store it raw (using proper escaping like `mysqli_real_escape_string` or similar) and encode it before outputting it. This is because it needs different encoding if you're outputting it as HTML or JSON or anything else. — Niet the Dark Absol, Oct 18 '13 at 14:43
The best way to stop SQL injection is to use mysqli or PDO prepared statements to insert data into the database. @[Niet the Dark Absol](http://stackoverflow.com/users/507674/niet-the-dark-absol) is right, but the mysqli_real_escape_string() function is deprecated. — nurakantech, Oct 18 '13 at 14:48
I'll never use functions like `mysqli_real_escape_string()` for security reasons; I'm using PDO, it's more secure. — naazanin, Oct 18 '13 at 14:48
I've always followed the `sanitize first, then validate` approach. — asprin, Oct 18 '13 at 14:48
If a person enters 'some one', should I insert 'some one' into the database, or first strip whitespace and then store it in the db? — naazanin, Oct 18 '13 at 14:59

BrianH · Answer 1 · 2013-10-18T15:02:42.033

In general one should sanitize first - "for your protection, and theirs." This includes stripping out any invalid characters (character coding sensitive, of course). If a field should only contain characters and spaces, then strip out anything that isn't that first.

With that done, you then validate the results - is the name already used (for unique fields), is it the right size, is it not blank?

The reason you give is precisely the right one - to maximize the user experience. Don't confuse the user, if you can avoid it. This helps protect from dumb copy & paste behavior, but you have to be careful - if I want my name recorded as "Ke$h@", I may or may not be ok with changing it to "Keh".

Secondly, it is also to prevent bugs.

What happens when you want to create usernames that don't allow special characters? If I enter "Brian", and your system rejects it as the name is already in use, then I submit "Brian$"? First you validate it, and it is not in use, then you strip special characters and you are left with "Brian". Uh oh - now you either have to validate AGAIN, or you'll get a weird error that account creation failed (if your database is set to require unique usernames, for instance), or worse, it will succeed and overwrite or corrupt existing user accounts.

Another example is minimum field lengths: if you require a name to be at least 3 letters long and only accept letters, and I enter "no", you'd reject it; but if I enter "no@#$%" you might say it was valid (long enough), sanitize it, and now it isn't valid anymore, etc.

The easy way to avoid this is to sanitize first, and then you don't have to double-think about validation.
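A minimal sketch of that order, using the two examples above. The helper names `sanitize_name` and `validate_name` are hypothetical, and the "ASCII letters and spaces only" rule is just an assumption for illustration (a real, international app would likely need a wider character set). The length limits come from the examples in this thread: at least 3 letters, at most 20 for the varchar(20) column.

```php
<?php
// Hypothetical helper: drop everything except ASCII letters and spaces, then trim.
// The allowed character set is an assumption made for this sketch.
function sanitize_name(string $raw): string
{
    $clean = preg_replace('/[^A-Za-z ]+/', '', $raw);
    // preg_replace can return null on failure; treat that as empty input.
    return trim($clean ?? '');
}

// Hypothetical validator: returns an error message, or null when the name is acceptable.
function validate_name(string $name, array $takenNames): ?string
{
    $length = strlen($name);
    if ($length < 3) {
        return 'Name must be at least 3 letters long.';
    }
    if ($length > 20) {
        return 'Name must be at most 20 characters long.';
    }
    if (in_array($name, $takenNames, true)) {
        return 'Name is already in use.';
    }
    return null;
}

$taken = ['Brian']; // stands in for a lookup against the database
foreach (['no@#$%', 'Brian$', 'Alice'] as $input) {
    $name  = sanitize_name($input);          // sanitize first...
    $error = validate_name($name, $taken);   // ...then validate the sanitized value
    echo $input, ' => ', $name, ': ', $error ?? 'ok', PHP_EOL;
}
```

Because validation only ever sees the sanitized value, "no@#$%" is rejected as too short and "Brian$" is rejected as already taken, so there is no second validation pass to remember.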

However, Niet was right about not encoding data before storage; it is generally much easier to set up HTML output to be encoded when appropriate than it is to remember to decode it when you just want the plain text (to enter into text boxes, JSON strings, etc). Most test cases you'll use won't include data with HTML entities, so it's easy to introduce silly bugs that aren't easily caught.

The big problem is that when such a bug is introduced, it can quickly lead to data corruption that is not easily solved. Example: you have plain text, output it to a text field incorrectly as HTML entities, the form gets submitted back and you re-encode it... every time it gets opened/resubmitted it gets re-encoded. With a busy site/form you could end up with thousands of differently encoded entries, with no clear way to determine what was and what was not intended to be HTML encoded.
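To make the re-encoding problem concrete, here is a small sketch. The wrapper name `e` and the sample value are made up for illustration; `htmlspecialchars` is the same function used in the question, called here with explicit flags and charset.

```php
<?php
// Hypothetical wrapper: encode for HTML at output time only.
function e(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$stored = 'Tom & Jerry';   // raw value, as it would sit in the database

// Intended use: encode once, when writing the value into the page.
$once = e($stored);        // Tom &amp; Jerry

// The bug: the encoded value gets stored, then encoded again on the next render.
$twice = e($once);         // Tom &amp;amp; Jerry

echo $once, PHP_EOL, $twice, PHP_EOL;
```

Each extra round trip adds another layer of `&amp;`, which is why storing the raw value and encoding only on output keeps the data recoverable.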

Protecting from injection is good, but HTML encoding isn't designed (and must not be relied upon) to do that.

OK, assume that you have entered Brian$. First I sanitize that and the result is Brian, then I validate it and it's unique. OK, you're now registered and you want to log in. You enter Brian, and in the login form should I sanitize the input again? If so, I would output "hello Brian", and you would be confused because you entered Brian$. — naazanin, Oct 18 '13 at 15:17
You should let the user know that you had to sanitize the input - I would even go as far as to suggest that in a situation like this you give the user an error saying that the input is invalid. — Deniz Zoeteman, Oct 18 '13 at 15:30
@naazanin I'd agree with gdscei, though generally I save such niceties for before-posting client-side form validation. There I am more gentle about prompting the user about invalid output, whereas on the server side I'm more likely to choose one of two models: 1) make it work and don't bother the user if they don't have to know, or 2) reject invalid input and let the user figure out what to do. This will depend on your use case, and I can't offer a global suggestion. The more international your app will be, the more careful you'll need to be about forbidding potentially valid characters. — BrianH, Oct 18 '13 at 17:30
"if I want my name recorded as "Ke$h@", I may or may not be ok with changing it to "Keh"." That's why I like to sanitize first, validate and if everything checks out, I will also check if the original untouched version equals the sanitized version. If it's not the same then I return the sanitized input to the form with an appropriate error message. — Ilyes512, Apr 17 '14 at 12:34

score 3 · Answer 2 · answered Dec 25 '14 at 08:31

No, you should validate first. Sanitizing is performed to handle the data storage level, which is the last step. There's no point in approaching the data storage level if the business rules don't pass the validation phase. If you require a number and you're given a string, that's an error, so you send them back to the form. Sanitizing, with the exception of stripslashes if required (not necessary as of PHP 5.4), is not necessary if you use SQL with prepared statements, and would in fact corrupt the input instead.
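A rough sketch of this validate-then-store flow with a PDO prepared statement. It assumes the `pdo_sqlite` extension is available so that an in-memory database can stand in for the real one. The `users` table and the `register_name` helper are hypothetical; only the `user_name varchar(20)` column comes from the question.

```php
<?php
// Assumption: pdo_sqlite is installed; the in-memory database is only a stand-in.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE users (user_name VARCHAR(20) NOT NULL UNIQUE)');

// Hypothetical helper: check the business rules first, then store the raw value.
function register_name(PDO $pdo, string $name): bool
{
    if ($name === '' || strlen($name) > 20) {
        return false; // validation failed: nothing is sent to the database
    }
    // The placeholder keeps the value out of the SQL text,
    // so no escaping or HTML encoding is applied before storage.
    $stmt = $pdo->prepare('INSERT INTO users (user_name) VALUES (:name)');
    $stmt->execute([':name' => $name]);
    return true;
}

var_dump(register_name($pdo, '<i> some one </i>'));   // 17 bytes: accepted and stored unchanged
var_dump(register_name($pdo, str_repeat('x', 21)));   // 21 bytes: rejected by validation
echo $pdo->query('SELECT user_name FROM users')->fetchColumn(), PHP_EOL;
```

With this flow the name from the question passes the length check and is stored as typed; any HTML encoding would then happen only when the value is written into a page.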
