
I am creating a service that could "go international" to non-English-speaking markets. I do not want to restrict usernames to the ASCII range of characters; I would like users to be able to specify their "natural" username. Fine: use Unicode (with, say, UTF-8 as my username text encoding).

But! I don't want users to create "non-name" usernames that contain "symbol" code points. For instance, I don't want to allow a username like √√√√√√øøøøø.

Is there a list of "symbol" code points in Unicode that I can check against (perhaps with a regex) to accept or reject a given username?

Thanks!

z8000

2 Answers


Unicode assigns every code point a general category (letter, mark, number, punctuation, symbol, and so on), so you can easily exclude symbols by category. How exactly to do that depends on the language you are using. Some regex frameworks have that feature built in, some don't.
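
As a minimal sketch of what checking categories looks like, assuming Python 3 and its standard unicodedata module (the helper names describe and has_symbol are made up for illustration): the snippet looks up the general category of each character and flags anything in a symbol category. Note that √ is a math symbol (Sm), while ø is an ordinary lowercase letter (Ll), so a category filter alone rejects the first but not the second.

```python
import unicodedata

def describe(text):
    """Return (character, general category) pairs for each character in text."""
    return [(ch, unicodedata.category(ch)) for ch in text]

def has_symbol(text):
    """Return True if any character is in a symbol category (Sm, Sc, Sk, So)."""
    return any(unicodedata.category(ch).startswith("S") for ch in text)

print(describe("√ø"))
print(has_symbol("√√√øøø"))
print(has_symbol("øøø"))
```

Rejecting by symbol category is a blacklist; for usernames it is usually simpler to whitelist the letter categories instead.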

Lukáš Lalinský
  • Ah, I had no idea about this! That's perfect. Thanks. – z8000 Oct 06 '09 at 15:52
  • I suppose for my purposes I'll allow code points in any of these categories: Ll (Letter, Lowercase), Lm (Letter, Modifier), Lo (Letter, Other), Lt (Letter, Titlecase), Lu (Letter, Uppercase). – z8000 Oct 06 '09 at 15:54
  • Well, for example, Perl supports a pseudo-category for regular expressions called *IsWord*, which is defined as: Ll+Lu+Lt+Lo+Nd – Lukáš Lalinský Oct 06 '09 at 16:00

In Python (per Input validation of free-form Unicode text in Python):

import unicodedata

def only_letters(s):
    """
    Return True if the input text consists of letters and ideographs only, False otherwise.
    """
    for c in s:
        cat = unicodedata.category(c)
        # Ll = lowercase letter, Lu = uppercase letter,
        # Lo = other letter (ideographs and caseless scripts)
        if cat not in ('Ll', 'Lu', 'Lo'):
            return False
    return True

> only_letters('Bzdrężyło')
True
> only_letters('He7lo') # we don't allow digits here
False
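
The function above also rejects titlecase (Lt) and modifier (Lm) letters, as well as any name typed with combining accents, since those are category Mn. A possible variant, sketched here on the assumption that NFC normalization is acceptable for usernames (the name is_letter_username is hypothetical), normalizes first and then accepts any category starting with L:

```python
import unicodedata

def is_letter_username(name):
    """Return True if name is non-empty and, after NFC normalization, holds only letters."""
    # NFC composes a base letter plus combining mark into one code point where possible.
    normalized = unicodedata.normalize("NFC", name)
    if not normalized:
        return False
    # Any category starting with "L": Lu, Ll, Lt, Lm, Lo.
    return all(unicodedata.category(ch).startswith("L") for ch in normalized)

print(is_letter_username("Bzdrężyło"))               # precomposed letters
print(is_letter_username("Bzdre\u0328z\u0307yło"))   # same name with combining marks
print(is_letter_username("He7lo"))                   # contains a digit
print(is_letter_username("√√√"))                     # math symbols
```

Sequences that have no precomposed form still contain a combining mark after normalization, so this sketch rejects them; whether to allow the mark categories as well is a policy decision.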
kravietz