Canonicalisation of usernames

Question

What is the best way to get a canonical representation of a username that is idempotent?

I want to avoid having the same issue as Spotify: http://labs.spotify.com/2013/06/18/creative-usernames/

I'm looking for a good library to do this in Python. I would prefer not to do what Spotify ended up doing (running the canonicalisation twice to test whether it is idempotent), and importing Twisted into my project just for this seems like overkill. Is there a stand-alone library for this?

Would using email addresses instead be preferred when it comes to usernames? How do major sites/companies deal with this?

Do you need to support non-ascii usernames? If the answer's yes, give up and do what they did, it's a nontrivial problem. If not, `''.join([c for c in orig_username.lower() if c in string.punctuation + string.ascii_lowercase + string.digits])`. — AdamKG, Jul 21 '13 at 22:51
Well, there you go then. As to how the major players handle it... I assume that for the most part they don't. Those that do probably spent about as much effort on it as Spotify did. I don't know of any standalone library, but wouldn't be surprised if one pops up now, using the approach from the Spotify article and just copying out the relevant code from Twisted (it's MIT). — AdamKG, Jul 22 '13 at 00:02
I want to put in my voice to agree with AdamKG. If you are allowing a variety of Unicode characters as input, this is a very difficult problem. And even if you were to find a library that did exactly what you want, are you willing to completely trust the integrity of your login system to the continuing correctness of that algorithm, or would you prefer to make one extra function call to verify that you're not opening up a security hole? — GrandOpener, Aug 03 '13 at 16:31

score 1 · Answer 1 · answered Aug 06 '13 at 17:45

First, you should read Wikipedia's article on Unicode equivalence. It explains the caveats and the normalization forms available for representing a Unicode string in its canonical form.

Then you can use Python's built-in module unicodedata to do the normalization of the Unicode string to your preferred normalization form.

A code example:

>>> import unicodedata
>>> unicodedata.normalize('NFKC', u'ﬀñⅨﬃ⁵KaÅéᴮᴵᴳᴮᴵᴿᴰ')
'ffñIXffi5KaÅéBIGBIRD'
>>> unicodedata.normalize('NFKC', u'ﬀñⅨﬃ⁵KaÅéᴮᴵᴳᴮᴵᴿᴰ').lower()
'ffñixffi5kaåébigbird'
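
To show how that normalization would be used for account lookups, here is a minimal sketch that wraps the two calls above in a helper and compares usernames by their canonical form. The names `canonical_username` and `same_account` are made up for illustration, and NFKC followed by `lower()` is only the approach from this answer, not a complete username policy (it does nothing about confusable characters from different scripts, for example).

```python
import unicodedata


def canonical_username(name: str) -> str:
    """NFKC-normalise, then lowercase (hypothetical helper)."""
    return unicodedata.normalize('NFKC', name).lower()


def same_account(a: str, b: str) -> bool:
    """Treat two usernames as the same account if their canonical forms match."""
    return canonical_username(a) == canonical_username(b)


# The superscript spelling collapses onto the plain ASCII name.
print(canonical_username('ᴮᴵᴳᴮᴵᴿᴰ'))
print(same_account('ᴮᴵᴳᴮᴵᴿᴰ', 'BigBird'))
print(same_account('bigbird', 'bigbirb'))
```

Storing and indexing only the canonical form, while keeping the original spelling for display, is one common way to apply this.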

repole · Answer 2 · 2013-11-19T16:50:25.737

For anyone reading this a few months later:

The module that Spotify uses isn't all that hard to pull out of Twisted without a whole bunch of dependencies. Twisted itself can be removed with close to no effort, since it's only imported for version-check purposes. zope.interface is the only dependency left behind, though it should be removable with a decent bit of effort.

The heart of that module is unicodedata.normalize(), so if you want to roll your own implementation, that's where you should start. But like others have said, be careful: this is an area that's open to easy exploits.
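
If you do roll your own, a rough sketch of the "one extra function call" safeguard mentioned in the comments might look like the following. This is not the Twisted module's code: `canonicalise` is a stand-in built directly on `unicodedata.normalize()`, and `canonicalise_checked` is a hypothetical wrapper that rejects any name whose canonical form changes when canonicalised a second time.

```python
import unicodedata


def canonicalise(name: str) -> str:
    """Stand-in canonicalisation: NFKC, then lowercase (an assumption)."""
    return unicodedata.normalize('NFKC', name).lower()


def canonicalise_checked(name: str) -> str:
    """Return the canonical form, refusing names that are not stable.

    Running the function on its own output must give the same result;
    otherwise two different stored names could map to each other.
    """
    once = canonicalise(name)
    twice = canonicalise(once)
    if once != twice:
        raise ValueError(f'username {name!r} does not canonicalise stably')
    return once


print(canonicalise_checked('ᴮᴵᴳᴮᴵᴿᴰ'))
```

Rejecting unstable names at signup is cheaper than trying to prove the canonicalisation is idempotent for every possible input.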

EDIT: I stripped out the zope and twisted dependencies: https://gist.github.com/repole/7548478
