11

Which Unicode characters (more precisely, code points) are dangerous and should be blacklisted so that users cannot use them? I know that BIDI override characters and the "zero width space" are very prone to causing problems, but what others are there?

Thanks

federico-t
  • 2
    They can cause layout problems (like BIDI chars), allow posting empty comments, that sort of thing – federico-t Nov 04 '11 at 01:31
  • 1
    Those don’t sound dangerous to me. You just have to handle things carefully at times: “The Hebrew alphabet is ‪אָלֶף־בֵּית עִבְרִי‬ and is written from right to left.” – tchrist Nov 04 '11 at 01:37
  • You can’t stop people from posting “empty” comments, you know. – tchrist Nov 04 '11 at 01:37
  • I can try... I guess I'll just disallow every Unicode control character and create an option like "write right-to-left" so I can handle BIDI manually – federico-t Nov 04 '11 at 01:41
  • You don’t have to add an option. Just let them write however they please, and enclose the BC=R text with an RLE and a PDF the way I did above. And yes, those are control characters. So you will interfere with people trying to do the right thing. You really can’t do serious Unicode work in PHP though. You need to use real Perl. Otherwise you don’t have the property support, grapheme support, and a million other things you need for working with Unicode. – tchrist Nov 04 '11 at 01:50
  • 7
    I've heard U+2423 will try to stab you if you turn your back on it. – Cat Plus Plus Nov 04 '11 at 01:52
  • 2
    @CatPlusPlus That would be U+1F0AB, actually, especially when it follows U+100CB. – tchrist Nov 04 '11 at 02:06

5 Answers

5

Characters aren’t dangerous: only inappropriate uses of them are.

You might consider reading things like:

It is impossible to guess what you mean by dangerous.

tchrist
4

A golden rule in security is to whitelist instead of blacklist: rather than trying to cover all bad characters, it is much better to validate by ensuring the user only uses known-good characters.

There are solutions that help you build the large whitelist that is required for international whitelisting. For example, in .NET there is UnicodeCategory.

The idea is that instead of whitelisting thousands of individual characters, the library assigns them to categories such as alphanumeric characters, punctuation, and control characters.
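The same category-based idea can be sketched in JavaScript with Unicode property escapes in a regular expression. This is only an illustration: the helper name `isAllowedText` and the particular set of allowed categories (letters, marks, numbers, punctuation, space separators) are assumptions made for this example, not something taken from the linked tutorials.

```javascript
// Whitelist by Unicode general category instead of listing code points.
// Assumption: the allowed categories below are illustrative only.
const ALLOWED = /^[\p{L}\p{M}\p{N}\p{P}\p{Zs}]*$/u;

// Hypothetical helper: true when every code point is in an allowed category.
function isAllowedText(text) {
  return ALLOWED.test(text);
}

console.log(isAllowedText("Grüße, 世界!")); // true: letters, punctuation, space
console.log(isAllowedText("abc\u200Bdef")); // false: U+200B is category Cf
console.log(isAllowedText("abc\u202Edef")); // false: U+202E is category Cf
```

Note that a category whitelist like this still accepts U+3164 HANGUL FILLER, because its general category is Lo (other letter), so a whitelist may need to be combined with further checks.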

Tutorial on whitelisting international characters in .NET

Unicode Regex: Categories

Desmond Zhou
  • 3
    Yes, I know that would be much more secure. But at the same time, there are literally THOUSANDS of Unicode chars (for the many languages that are there), and I can't whitelist all of them. And if I did, I'd probably leave many languages out, so I prefer a blacklist – federico-t Nov 04 '11 at 01:27
  • 1
    There are solutions that help you build whitelists, I have updated an article that deals with this issue in .NET. I would imagine Java must also have libraries for this. – Desmond Zhou Nov 04 '11 at 01:33
  • 1
    Interesting.. I thought whitelists that large were utterly inefficient. I'll look that up. Too bad I'm using PHP though – federico-t Nov 04 '11 at 01:39
  • 1
    Well, with PHP at least you have tolerable regexes. – tchrist Nov 04 '11 at 01:42
  • The golden rule is defense in depth. If you can blacklist using ranges, do that before whitelisting. You cannot blacklist everything, but you can make sure that there is a moat outside your wall. – Anthony Rutledge Mar 31 '16 at 14:25
1

'HANGUL FILLER' (U+3164)

Since Unicode 1.1 in 1993, there has been a character that is wide but renders completely empty.

We can't see it, and we can't copy/paste it on its own because it can't be selected!

It needs to be generated, for example with the Unix keyboard shortcut: CTRL + SHIFT + u, then 3164

It can mess up pretty much anything: variables, function names, URLs, file names. It can mimic DNS names, invalidate hash strings, pollute database entries, blog posts and logins, allow seemingly identical accounts to be faked, etc.


DEMO 1: Altering variables

The second variable name contains a Hangul Filler character; the last console.log call refers to the name without that character:

const normal = "Hello w488ld"
const hijaㅤcked = "Hello w488ld"
console.log(normal)
console.log(hijacked) // not the name declared above: this one has no U+3164
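One way to catch names like this is to scan for code points that carry the Unicode property Default_Ignorable_Code_Point, which covers U+3164 as well as the zero width space and the BIDI overrides. The following is a minimal sketch; the helper name `findInvisible` is made up for this example, and the property choice is an assumption about what counts as "invisible", not an exhaustive rule.

```javascript
// Matches code points with the binary property
// Default_Ignorable_Code_Point (includes U+3164, U+200B, U+202E).
const INVISIBLE = /\p{Default_Ignorable_Code_Point}/gu;

// Hypothetical helper: list matching code points as "U+XXXX" strings.
function findInvisible(text) {
  return Array.from(text.matchAll(INVISIBLE), (m) =>
    "U+" + m[0].codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(findInvisible("hija\u3164cked")); // [ 'U+3164' ]
console.log(findInvisible("hijacked"));       // []
```

A non-empty result could be used to reject an identifier, a login name, or a file name, or just to flag it for review.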

DEMO 2: Hijacking URLs

These 3 URLs lead to xn--stackoverflow-fr16ea.com:

https://stackㅤㅤoverflow.com

https://stackㅤㅤoverflow.com

https://stackㅤㅤoverflow.com

NVRM
0

See Unicode Security Considerations Report.

It covers various aspects, from spoofing of rendered strings to dangers of processing UTF encodings in unsafe languages.

Kornel
0

U+2800 BRAILLE PATTERN BLANK - a Braille character without any "dots". It looks like a regular "space" but is not classified as one.
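A quick way to see this, and to catch comments that only look empty, is sketched below. The helper name `looksEmpty` and the set of code points treated as blank are assumptions for illustration; other blank-looking glyphs may exist that this list does not cover.

```javascript
// Code points treated as "blank" in this sketch: Unicode white space,
// default-ignorable code points, and U+2800 BRAILLE PATTERN BLANK,
// which is listed explicitly because it has neither property.
const BLANK = /[\p{White_Space}\p{Default_Ignorable_Code_Point}\u2800]/gu;

// Hypothetical helper: true when nothing visible is left after stripping.
function looksEmpty(text) {
  return text.replace(BLANK, "").length === 0;
}

console.log(/\s/.test("\u2800"));         // false: not whitespace for \s
console.log(looksEmpty("\u2800\u2800"));  // true
console.log(looksEmpty(" \u200B\u3164")); // true
console.log(looksEmpty("hi\u2800"));      // false
```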

Flopp