How does one allow a subset of UNICODE codepoints in input validation?

Question

I am creating a service that could "go international" to non-English speaking markets. I do not want to restrict usernames to the ASCII range of characters; I would like to let users specify their "natural" username. OK, so I'll use Unicode (with, say, UTF-8 as my username text encoding).

But! I don't want users to create "non-name" usernames that contain "symbol" code points. For instance, I don't want to allow a username like √√√√√√øøøøø.

Is there a list of "symbol" code points for UNICODE that I can check (perhaps with a regex) to accept/reject a given username?

Thanks!

score 5 · Accepted Answer · answered Oct 06 '09 at 15:51

Unicode assigns every code point a General Category (letters are L*, symbols are S*, and so on), so you can easily exclude symbols. How exactly to do that depends on the language you are using: some regex frameworks have category matching built in, some don't.
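
As a rough illustration in Python, the standard-library unicodedata module reports each character's category. The sketch below flags any character whose category starts with S (Sm, Sc, Sk, So); the function name has_symbol is made up for this example and is not part of any library.

```python
import unicodedata

def has_symbol(text):
    """Return True if any character falls in a Symbol category (Sm, Sc, Sk, So)."""
    # unicodedata.category() returns a two-letter code such as 'Ll' or 'Sm';
    # every Symbol category starts with 'S'.
    return any(unicodedata.category(ch).startswith("S") for ch in text)
```

With this check, √ (category Sm) is flagged, while ø is a lowercase letter (Ll) and passes, so √√√√√√øøøøø would be rejected only because of the √ characters. Rejecting symbols alone still lets punctuation, digits, and whitespace through, so an allow-list of letter categories is usually the stricter choice.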

Lukáš Lalinský

Ah, I had no idea about this! That's perfect. Thanks. – z8000 Oct 06 '09 at 15:52

I suppose for my purposes I'll allow codepoints in any of these categories: [Ll] Letter, Lowercase; [Lm] Letter, Modifier; [Lo] Letter, Other; [Lt] Letter, Titlecase; [Lu] Letter, Uppercase – z8000 Oct 06 '09 at 15:54
Well, for example, Perl supports a pseudo-category for regular expressions called *IsWord*, which is defined as: Ll+Lu+Lt+Lo+Nd – Lukáš Lalinský Oct 06 '09 at 16:00
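
The two category sets mentioned in the comments above could be expressed as allow-lists along these lines. This is only a sketch: the constant and function names are invented for illustration, and treating an empty string as invalid is an assumption, not something stated in the thread.

```python
import unicodedata

# The five letter categories listed in the comment above.
LETTER_CATEGORIES = frozenset({"Ll", "Lm", "Lo", "Lt", "Lu"})

# The set quoted above for Perl's IsWord: letters (without Lm) plus decimal digits.
WORD_CATEGORIES = frozenset({"Ll", "Lu", "Lt", "Lo", "Nd"})

def all_in_categories(text, allowed):
    """Return True if text is non-empty and every character's category is in allowed."""
    # Assumption: an empty username is rejected.
    return bool(text) and all(unicodedata.category(ch) in allowed for ch in text)
```

For example, all_in_categories(name, LETTER_CATEGORIES) accepts letters only, while passing WORD_CATEGORIES additionally accepts decimal digits such as those in "z8000".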

score 0 · Answer 2 · answered Jun 15 '17 at 17:16

In Python (per Input validation of free-form Unicode text in Python):

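import unicodedata

def only_letters(s):
    """
    Returns True if every character in s is a lowercase letter, an uppercase
    letter, or an "other" letter (such as an ideograph); False otherwise.
    """
    for c in s:
        cat = unicodedata.category(c)
        # Ll = lowercase letter, Lu = uppercase letter,
        # Lo = other letter (ideographs and letters of caseless scripts)
        if cat not in ('Ll', 'Lu', 'Lo'):
            return False
    return True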

> only_letters('Bzdrężyło')
True
> only_letters('He7lo') # we don't allow digits here
False
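
One caveat with a per-character check like only_letters above: a name entered in decomposed form (a base letter followed by a combining mark, category Mn) is rejected even though it displays the same as the precomposed spelling. A possible mitigation, sketched here under the assumption that NFC normalization is acceptable for your usernames, is to normalize before checking. The name only_letters_nfc is hypothetical.

```python
import unicodedata

def only_letters_nfc(s):
    """Normalize s to NFC, then require every character to be in Ll, Lu, or Lo."""
    # NFC composes sequences like 'e' + U+0301 into a single code point
    # (U+00E9) where a precomposed form exists.
    normalized = unicodedata.normalize("NFC", s)
    return all(unicodedata.category(c) in ("Ll", "Lu", "Lo") for c in normalized)
```

This only helps where a precomposed character exists. Scripts that rely on combining marks with no precomposed form would still fail, and supporting them would likely mean adding mark categories such as Mn to the allowed set.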
