When laundering tainted data by checking it for bad characters, are there Unicode properties that will filter out the bad characters?
Viewed 161 times
1
-
What do you mean by "bad characters"? That's usually context sensitive, and the better solution is usually to escape them rather than filter them out. – ikegami Aug 31 '11 at 17:50
-
But then I have to find them before I can escape them. – sid_com Sep 01 '11 at 05:50
-
Not necessarily. Most escaping functions you need already exist. They handle converting what needs converting, you just pass them entire strings. You didn't answer the question. What do you mean by "bad character"? – ikegami Sep 01 '11 at 06:11
-
Or I should say "web-form" is the first context, because when I process the input-data, the context could change. – sid_com Sep 01 '11 at 10:02
-
Mark Jason Dominus likes to talk about the "Prussian Approach" where you start with what you know is good and add more to that as you find things you left out. The other approach is the "American Approach", where you disallow a few things and let everything else have a wild party. – brian d foy Sep 01 '11 at 16:42
-
@sid_com, Indeed. Whenever you insert text into something, you must convert it into something appropriate for what you are inserting it into. – ikegami Sep 01 '11 at 18:07
-
So I see some advantages of escaping: I don't have to know all the "bad characters" up front (at input), I don't have to forbid the user any characters, and for most contexts there are probably modules that can do the escaping for me. – sid_com Sep 02 '11 at 09:22
-
@brian d foy: maybe the MarkJasonDominus.PrussianApproachType would use in a case where the MarkJasonDominus.AmericanApproachType wouldn't check the MarkJasonDominus.AmericanApproach at all. – sid_com Sep 02 '11 at 09:32
3 Answers
4
User-Defined Character Properties in perlunicode
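package Characters::Sid_com;
# A user-defined property is a sub whose name starts with "In" or "Is".
# It returns one entry per line: a hex code point, or a start and end separated by a tab.
sub InBad {
# This range spans all of Unicode, so InBad matches every code point.
return <<"BAD";
0000\t10FFFF
BAD
}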
sub InEvil {
return <<"EVIL";
0488
0489
EVIL
}
sub InStupid {
return <<"STUPID";
E630\tE64F
F8D0\tF8FF
STUPID
}
⋮
die 'No.' if $tring =~ /
(?: \p{Characters::Sid_com::InBad}
| \p{Characters::Sid_com::InEvil}
| \p{Characters::Sid_com::InStupid}
)
/x;
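As a self-contained variant of the snippet above, the following sketch defines two properties and tests a string against them. The package name My::Props and the helper has_rejected_char are hypothetical names chosen for illustration; the code point ranges are copied from InEvil and InStupid. Which characters actually belong in such a property still depends on your context.

```perl
use strict;
use warnings;

package My::Props;    # hypothetical package name

# Two single code points, one per line (same as InEvil above).
sub InEvil {
    return "0488\n0489\n";
}

# Two Private Use Area ranges; a tab separates start and end (same as InStupid above).
sub InStupid {
    return "E630\tE64F\nF8D0\tF8FF\n";
}

package main;

# Returns 1 if the string contains a character from either property, else 0.
sub has_rejected_char {
    my ($text) = @_;
    return $text =~ /[\p{My::Props::InEvil}\p{My::Props::InStupid}]/ ? 1 : 0;
}
```

A caller could then write something like: die 'No.' if has_rejected_char($input);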

daxim
-
Clever, but pushes the responsibility to define what's bad back to the user (unless somebody already did this for precisely the scenario the OP is not revealing in spite of multiple questions about it). – tripleee Sep 01 '11 at 06:33
3
I think "no" is an understatement for an answer, but there you have it. No, Unicode does not have a concept of "bad" or "good" characters (let alone "ugly" ones).
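Since Unicode itself cannot say what is bad, one option is the allowlist style mentioned in the comments: describe what is acceptable for one specific field and reject the rest. The sketch below assumes a hypothetical "name" field that may contain letters, combining marks, spaces, apostrophes and hyphens; the rule and the function name launder_name are illustrative assumptions, not a recommendation for every form.

```perl
use strict;
use warnings;

# Hypothetical rule for a "name" field: letters (\p{L}), combining marks (\p{M}),
# spaces, apostrophes and hyphens, between 1 and 100 characters.
sub launder_name {
    my ($input) = @_;
    return unless defined $input;

    # Under taint mode (-T), a value extracted through a regex capture is untainted.
    if ($input =~ /\A([\p{L}\p{M}' -]{1,100})\z/) {
        return $1;
    }
    return;    # anything not explicitly allowed is rejected
}
```

The function expects an already decoded character string, and returns undef (in scalar context) when the input does not match the rule.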

tripleee
-
I didn't expect a Unicode property called "bad characters", but I thought there could have been an answer like: if you exclude this, this, and this Unicode property, you should be safe. – sid_com Sep 01 '11 at 05:48
-
@sid_com, All characters are safe in some circumstances, otherwise they wouldn't exist. What do you consider unsafe? – ikegami Sep 01 '11 at 06:12
-
Accepted this answer because I think it best matches my initial question. – sid_com Sep 01 '11 at 10:07
2
XML (and thus XHTML) can only contain these chars:
\x09 \x0A \x0D
\x{0020}-\x{D7FF}
\x{E000}-\x{FFFD}
\x{10000}-\x{10FFFF}
Of the above, the following should be avoided:
\x7F-\x84
\x86-\x9F
\x{FDD0}-\x{FDEF}
\x{1FFFE}-\x{1FFFF}
\x{2FFFE}-\x{2FFFF}
\x{3FFFE}-\x{3FFFF}
\x{4FFFE}-\x{4FFFF}
\x{5FFFE}-\x{5FFFF}
\x{6FFFE}-\x{6FFFF}
\x{7FFFE}-\x{7FFFF}
\x{8FFFE}-\x{8FFFF}
\x{9FFFE}-\x{9FFFF}
\x{AFFFE}-\x{AFFFF}
\x{BFFFE}-\x{BFFFF}
\x{CFFFE}-\x{CFFFF}
\x{DFFFE}-\x{DFFFF}
\x{EFFFE}-\x{EFFFF}
\x{FFFFE}-\x{FFFFF}
\x{10FFFE}-\x{10FFFF}
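The two lists above can be turned into checks along these lines. This is a sketch that assumes the input has already been decoded into a Perl character string; the function names is_xml_safe and has_discouraged_char are made up for the example.

```perl
use strict;
use warnings;

# Returns 1 if every character is in the first list above (allowed in XML), else 0.
sub is_xml_safe {
    my ($text) = @_;
    return $text =~ /\A[\x09\x0A\x0D\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*\z/
        ? 1 : 0;
}

# Returns 1 if the string contains a character from the "should be avoided" list.
sub has_discouraged_char {
    my ($text) = @_;
    return 1 if $text =~ /[\x7F-\x84\x86-\x9F\x{FDD0}-\x{FDEF}]/;

    # The remaining entries are the last two code points of each plane from 1 to 16,
    # for example U+1FFFE and U+1FFFF, so test the low 16 bits numerically.
    for my $ch (split //, $text) {
        my $cp = ord $ch;
        return 1 if $cp >= 0x10000 && ($cp & 0xFFFE) == 0xFFFE;
    }
    return 0;
}
```

Whether to reject, strip, or replace a failing character is left to the caller, since that again depends on context.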
If you are generating XHTML, you need to escape the following:
& ⇒ &amp;
< ⇒ &lt;
> ⇒ &gt; (optional)
" ⇒ &quot; (optional except in attribute values delimited with ")
' ⇒ &apos; (optional except in attribute values delimited with ')
HTML should have the same if not looser requirements, so if you stick to this, you should be safe.
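A minimal sketch of that escaping step, assuming you want to escape all five characters regardless of where the text ends up. The function name escape_xhtml is hypothetical; CPAN modules such as HTML::Entities offer ready-made equivalents.

```perl
use strict;
use warnings;

# Replaces the five characters listed above with their entity references.
sub escape_xhtml {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&apos;',
    );
    # A single pass, so the & introduced by one replacement is never escaped again.
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}
```

You pass it the whole string just before inserting that string into the document, rather than searching for the special characters yourself.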

ikegami
-
When someone suffers from cross-site scripting, could one reason be that they didn't escape form inputs the way you showed? – sid_com Sep 01 '11 at 07:44
-
@sid_com, Yes. If you insert text into HTML, you need to convert it to HTML first. – ikegami Sep 01 '11 at 08:20
-
I've read in an Ajax tutorial: "However, always use POST requests when:" ... 3. "Sending user input (which can contain unknown characters), POST is more robust and secure than GET". Does this concern a different kind of escaping? – sid_com Sep 01 '11 at 08:44
-
@sid_com, Both POST and GET use urlencoding, so that statement makes no sense to me. – ikegami Sep 01 '11 at 18:07