Regular expression to confirm whether a string is a valid Python identifier?

Question

I have the following definition for an Identifier:

Identifier --> letter{ letter| digit}

Basically I have an identifier function that gets a string from a file and tests it to make sure that it's a valid identifier as defined above.

I've tried this:

if re.match('\w+(\w\d)?', i):     
  return True
else:
  return False

but when I run my program every time it meets an integer it thinks that it's a valid identifier.

For example

c = 0 ;

it prints c as a valid identifier which is fine, but it also prints 0 as a valid identifer.

What am I doing wrong here?

You know your definition isn't the same as Python's, right? Python allows underscores too. — Tom Zych, Mar 29 '11 at 14:32
All of the regex answers are not quite right, [see below](https://stackoverflow.com/a/54059733/3346095Z). — Hatshepsut, Mar 18 '19 at 21:01

MestreLion · Answer 1 · 2021-05-14T22:48:53.833

Question was made 10 years ago, when Python 2 was still dominant. As many comments in the last decade demonstrated, my answer needed a serious update, starting with a big heads up:

No single regex will properly match all (and only) valid Python identifiers. It didn't for Python 2, it doesn't for Python 3.

The reasons are:

As @JoeCondron pointed out, Python reserved keywords such as True, if, return, are not valid identifiers, and regexes alone are unable to handle this, so additional filtering is required.
Python 3 allows non-ascii letters and numbers in an identifier, but the Unicode categories of letters and numbers accepted by the lexical parser for a valid identifier do not match the same categories of \d, \w, \W in the re module, as demonstrated in @martineau's counter-example and explained in great detail by @Hatshepsut's amazing research.

While we could try to solve the first issue using keyword.iskeyword(), as @Alexander Huszagh suggested, and workaround the other by limiting to ascii-only identifiers, why bother using a regex at all?

As Hatshepsut said:

str.isidentifier() works

Just use it, problem solved.

As requested by the question, my original 2012 answer presents a regular expression based on the Python's 2 official definition of an identifier:

identifier ::=  (letter|"_") (letter | digit | "_")*

Which can be expressed by the regular expression:

^[^\d\W]\w*\Z

Example:

import re
identifier = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)

tests = [ "a", "a1", "_a1", "1a", "aa$%@%", "aa bb", "aa_bb", "aa\n" ]
for test in tests:
    result = re.match(identifier, test)
    print("%r\t= %s" % (test, (result is not None)))

Result:

'a'      = True
'a1'     = True
'_a1'    = True
'1a'     = False
'aa$%@%' = False
'aa bb'  = False
'aa_bb'  = True
'aa\n'   = False

I might be worth mentioning that this matches kewords such as `True`, `return` etc. I'm not suggesting a change to the regex but just that the OP might want to bear that in mind. — JoeCondron, Jun 08 '16 at 12:43
@JoeCondron This is also very easy to do, since Python contains the `keyword.iskeyword` function, which is merely a wrapper around the keyword list frozenset. — Alex Huszagh, Dec 31 '17 at 22:35
In Python 3.6 at least, this doesn't work for the Unicode string `'℘᧚'` even though that **is** a valid identifier in Python 3 (and isn't a keyword). — martineau, Jun 19 '18 at 19:45

Hatshepsut · Answer 2 · 2019-04-08T17:22:02.313

str.isidentifier() works. The regex answers incorrectly fail to match some valid python identifiers and incorrectly match some invalid ones.

str.isidentifier() Return true if the string is a valid identifier according to the language definition, section Identifiers and keywords.

Use keyword.iskeyword() to test for reserved identifiers such as def and class.

@martineau's comment gives the example of '℘᧚' where the regex solutions fail.

>>> '℘᧚'.isidentifier()
True
>>> import re
>>> bool(re.search(r'^[^\d\W]\w*\Z', '℘᧚'))
False

Why does this happen?

Lets define the sets of code points that match the given regular expression, and the set that match str.isidentifier.

import re
import unicodedata

chars = {chr(i) for i in range(0x10ffff) if re.fullmatch(r'^[^\d\W]\w*\Z', chr(i))}
identifiers = {chr(i) for i in range(0x10ffff) if chr(i).isidentifier()}

How many regex matches are not identifiers?

In [26]: len(chars - identifiers)                                                                                                               
Out[26]: 698

How many identifiers are not regex matches?

In [27]: len(identifiers - chars)                                                                                                               
Out[27]: 4

Interesting -- which ones?

In [37]: {(c, unicodedata.name(c), unicodedata.category(c)) for c in identifiers - chars}                                                       
Out[37]: 
set([
    ('\u1885', 'MONGOLIAN LETTER ALI GALI BALUDA', 'Mn'),
    ('\u1886', 'MONGOLIAN LETTER ALI GALI THREE BALUDA', 'Mn'),
    ('℘', 'SCRIPT CAPITAL P', 'Sm'),
    ('℮', 'ESTIMATED SYMBOL', 'So'),
])

What's different about these two sets?

They have different Unicode "General Category" values.

In [31]: {unicodedata.category(c) for c in chars - identifiers}                                                                                 
Out[31]: set(['Lm', 'Lo', 'No'])

From wikipedia, that's Letter, modifier; Letter, other; Number, other. This is consistent with the re docs, since \d is only decimal digits:

\d Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])

What about the other way?

In [32]: {unicodedata.category(c) for c in identifiers - chars}                                                                                 
Out[32]: set(['Mn', 'Sm', 'So'])

That's Mark, nonspacing; Symbol, math; Symbol, other.

Where is this all documented?

In the Python Language Reference
In PEP 3131 - Supporting non-ascii identifiers

Where is it implemented?

https://github.com/python/cpython/commit/47383403a0a11259acb640406a8efc38981d2255

I still want a regular expression

Look at the regex module on PyPI.

This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality.

It includes filters for "General Category".

Can you provide an example of when this works but the regex(es) fail? — martineau, Mar 18 '19 at 18:41
Indeed, you're right — but that surprises me because the `re` documentation seems to indicate that it support Unicode strings (even without a `re.UNICODE` flag in Python 3.x). — martineau, Mar 18 '19 at 19:55
@martineau Out of curiosity, how did you run across that particular one? — Hatshepsut, Mar 18 '19 at 21:14
I was developing a regex to recognize Python "special" method names i.e those start and end with two underscore characters. `__` — aka "dunder" names, so was searching this site for a general one that recognized any valid identifier. Guess, I'll have to give-up on doing it with an `re` regex...in fact, I now suspect the module's limitation/bug may be why the string`isidentifier` method was added in Python 3. — martineau, Mar 18 '19 at 21:30
...following on after your recent update: Yes, I'm aware there are third-party regex libraries, but would prefer to limit what I'm doing to the standard library — and so will use a combination of `str.isidentifer()` along with `str.startswith()` & `str.endswith()` to detect dunder names (the code not being especially speed critical). Thank you for the responses (and answer updates). — martineau, Mar 18 '19 at 22:04
What an amazing research and answer! Made me completely revamp my own.I wasn't aware of `str.isidentifier()`, not sure if it was available in 2012. But in 2021 it really makes no sense using a regex at all. — MestreLion, May 05 '21 at 02:40

Tim Pietzcker · Answer 3 · 2012-04-13T05:48:47.353

3

For Python 3, you need to handle Unicode letters and digits. So if that's a concern, you should get along with this:

re_ident = re.compile(r"^[^\d\W]\w*$", re.UNICODE)

[^\d\W] matches a character that is not a digit and not "not alphanumeric" which translates to "a character that is a letter or underscore".

edited Apr 13 '12 at 05:48

answered Mar 29 '11 at 14:37

Tim Pietzcker

328,213
58
503
561

2

Almost there... but not quite... it will fail for single-letter identifiers "a", and it also allows "aa@#$%" as a valid identifier – MestreLion Apr 13 '12 at 02:47

Joe · Answer 4 · 2012-04-13T13:19:51.663

2

\w matches digits and characters. Try ^[_a-zA-Z]\w*$

edited Apr 13 '12 at 13:19

answered Mar 29 '11 at 14:21

Joe

56,979
9
128
135

5

Careful, Python 3 allows all Unicode letters and digits in its identifiers. – Tim Pietzcker Mar 29 '11 at 14:33
Should it be "[_a-zA-Z]\w*" since you want to match 0 or more after the initial character? – William Stein Jun 03 '11 at 00:56
1

That will match "a$%@#%" as a valid identifier – MestreLion Apr 13 '12 at 02:49

keepAlive · Answer 5 · 2021-05-08T22:35:42.347

^{The question is about regex, so my answer may look out of subject. The point is that regex is simply not the right approach.}

Interested in getting the problematic characters ?

Using str.isidentifier, one can perform the check character by character, prefixing them with, say, an underscore to avoid false positive such as digits and so on... How could a name be valid if one of its (prefixed) component is not (?) E.g.

def checker(str_: str) -> 'set[str]':
    return {
        c for i, c in enumerate(str_)
        if not (f'_{c}' if i else c).isidentifier()
    }

>>> checker('℘3᧚₂')
{'₂'}

Which solution deals with unauthorised first characters, such as digits or e.g. ᧚. See

>>> checker('᧚℘3₂')
{'₂', '᧚'}
>>> checker('3᧚℘₂')
{'3', '₂'}
>>> checker("a$%@#%\n")
{'@', '#', '\n', '$', '%'}

To be improved, since it does check neither for reserved names, nor tells anything about why ᧚ is sometime problematic, whereas ₂ always is... but here is my without-regex approach.

My answer in your terms:

if not checker(i):
    return True
else:
    return False

which could be contracted into

return not checker(i)

score 0 · Answer 6 · answered Jul 24 '23 at 14:17

I needed a working regex (i.e. I couldn't just use str.isidentifier) because I needed to find all identifiers embedded in a string, not just test if a whole string was a valid identifier. I also couldn't use the ast module because I expected the string to not be valid Python syntax. So the existing answers didn't help, and I wasn't satisfied with 'use the regex package'. So here's an actual regex that does the job, along with the code for constructing it and testing it.

# coding: utf-8
import itertools
import re

full_pattern = r"[A-Z_a-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶ-ͷͻ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-Ֆՙՠ-ֈא-תׯ-ײؠ-يٮ-ٯٱ-ۓەۥ-ۦۮ-ۯۺ-ۼۿܐܒ-ܯݍ-ޥޱߊ-ߪߴ-ߵߺࠀ-ࠕࠚࠤࠨࡀ-ࡘࡠ-ࡪࢠ-ࢴࢶ-ࣇऄ-हऽॐक़-ॡॱ-ঀঅ-ঌএ-ঐও-নপ-রলশ-হঽৎড়-ঢ়য়-ৡৰ-ৱৼਅ-ਊਏ-ਐਓ-ਨਪ-ਰਲ-ਲ਼ਵ-ਸ਼ਸ-ਹਖ਼-ੜਫ਼ੲ-ੴઅ-ઍએ-ઑઓ-નપ-રલ-ળવ-હઽૐૠ-ૡૹଅ-ଌଏ-ଐଓ-ନପ-ରଲ-ଳଵ-ହଽଡ଼-ଢ଼ୟ-ୡୱஃஅ-ஊஎ-ஐஒ-கங-சஜஞ-டண-தந-பம-ஹௐఅ-ఌఎ-ఐఒ-నప-హఽౘ-ౚౠ-ౡಀಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹಽೞೠ-ೡೱ-ೲഄ-ഌഎ-ഐഒ-ഺഽൎൔ-ൖൟ-ൡൺ-ൿඅ-ඖක-නඳ-රලව-ෆก-ะาเ-ๆກ-ຂຄຆ-ຊຌ-ຣລວ-ະາຽເ-ໄໆໜ-ໟༀཀ-ཇཉ-ཬྈ-ྌက-ဪဿၐ-ၕၚ-ၝၡၥ-ၦၮ-ၰၵ-ႁႎႠ-ჅჇჍა-ჺჼ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏽᏸ-ᏽᐁ-ᙬᙯ-ᙿᚁ-ᚚᚠ-ᛪᛮ-ᛸᜀ-ᜌᜎ-ᜑᜠ-ᜱᝀ-ᝑᝠ-ᝬᝮ-ᝰក-ឳៗៜᠠ-ᡸᢀ-ᢨᢪᢰ-ᣵᤀ-ᤞᥐ-ᥭᥰ-ᥴᦀ-ᦫᦰ-ᧉᨀ-ᨖᨠ-ᩔᪧᬅ-ᬳᭅ-ᭋᮃ-ᮠᮮ-ᮯᮺ-ᯥᰀ-ᰣᱍ-ᱏᱚ-ᱽᲀ-ᲈᲐ-ᲺᲽ-Ჿᳩ-ᳬᳮ-ᳳᳵ-ᳶᳺᴀ-ᶿḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼⁱⁿₐ-ₜℂℇℊ-ℓℕ℘-ℝℤΩℨK-ℹℼ-ℿⅅ-ⅉⅎⅠ-ↈⰀ-Ⱞⰰ-ⱞⱠ-ⳤⳫ-ⳮⳲ-ⳳⴀ-ⴥⴧⴭⴰ-ⵧⵯⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞ々-〇〡-〩〱-〵〸-〼ぁ-ゖゝ-ゟァ-ヺー-ヿㄅ-ㄯㄱ-ㆎㆠ-ㆿㇰ-ㇿ㐀-䶿一-鿼ꀀ-ꒌꓐ-ꓽꔀ-ꘌꘐ-ꘟꘪ-ꘫꙀ-ꙮꙿ-ꚝꚠ-ꛯꜗ-ꜟꜢ-ꞈꞋ-ꞿꟂ-ꟊꟵ-ꠁꠃ-ꠅꠇ-ꠊꠌ-ꠢꡀ-ꡳꢂ-ꢳꣲ-ꣷꣻꣽ-ꣾꤊ-ꤥꤰ-ꥆꥠ-ꥼꦄ-ꦲꧏꧠ-ꧤꧦ-ꧯꧺ-ꧾꨀ-ꨨꩀ-ꩂꩄ-ꩋꩠ-ꩶꩺꩾ-ꪯꪱꪵ-ꪶꪹ-ꪽꫀꫂꫛ-ꫝꫠ-ꫪꫲ-ꫴꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮꬰ-ꭚꭜ-ꭩꭰ-ꯢ가-힣ힰ-ퟆퟋ-ퟻ豈-舘並-龎ﬀ-ﬆﬓ-ﬗיִײַ-ﬨשׁ-זּטּ-לּמּנּ-סּףּ-פּצּ-ﮱﯓ-ﱝﱤ-ﴽﵐ-ﶏﶒ-ﷇﷰ-ﷹﹱﹳﹷﹹﹻﹽﹿ-ﻼＡ-Ｚａ-ｚｦ-ﾝﾠ-ﾾￂ-ￇￊ-ￏￒ-ￗￚ-ￜ-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------][0-9A-Z_a-zªµ·ºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮ̀-ʹͶ-ͷͻ-ͽͿΆ-ΊΌΎ-ΡΣ-ϵϷ-ҁ҃-҇Ҋ-ԯԱ-Ֆՙՠ-ֈ֑-ֽֿׁ-ׂׄ-ׇׅא-תׯ-ײؐ-ؚؠ-٩ٮ-ۓە-ۜ۟-۪ۨ-ۼۿܐ-݊ݍ-ޱ߀-ߵߺ߽ࠀ-࠭ࡀ-࡛ࡠ-ࡪࢠ-ࢴࢶ-ࣇ࣓-ࣣ࣡-ॣ०-९ॱ-ঃঅ-ঌএ-ঐও-নপ-রলশ-হ়-ৄে-ৈো-ৎৗড়-ঢ়য়-ৣ০-ৱৼ৾ਁ-ਃਅ-ਊਏ-ਐਓ-ਨਪ-ਰਲ-ਲ਼ਵ-ਸ਼ਸ-ਹ਼ਾ-ੂੇ-ੈੋ-੍ੑਖ਼-ੜਫ਼੦-ੵઁ-ઃઅ-ઍએ-ઑઓ-નપ-રલ-ળવ-હ઼-ૅે-ૉો-્ૐૠ-ૣ૦-૯ૹ-૿ଁ-ଃଅ-ଌଏ-ଐଓ-ନପ-ରଲ-ଳଵ-ହ଼-ୄେ-ୈୋ-୍୕-ୗଡ଼-ଢ଼ୟ-ୣ୦-୯ୱஂ-ஃஅ-ஊஎ-ஐஒ-கங-சஜஞ-டண-தந-பம-ஹா-ூெ-ைொ-்ௐௗ௦-௯ఀ-ఌఎ-ఐఒ-నప-హఽ-ౄె-ైొ-్ౕ-ౖౘ-ౚౠ-ౣ౦-౯ಀ-ಃಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹ಼-ೄೆ-ೈೊ-್ೕ-ೖೞೠ-ೣ೦-೯ೱ-ೲഀ-ഌഎ-ഐഒ-ൄെ-ൈൊ-ൎൔ-ൗൟ-ൣ൦-൯ൺ-ൿඁ-ඃඅ-ඖක-නඳ-රලව-ෆ්ා-ුූෘ-ෟ෦-෯ෲ-ෳก-ฺเ-๎๐-๙ກ-ຂຄຆ-ຊຌ-ຣລວ-ຽເ-ໄໆ່-ໍ໐-໙ໜ-ໟༀ༘-༙༠-༩༹༵༷༾-ཇཉ-ཬཱ-྄྆-ྗྙ-ྼ࿆က-၉ၐ-ႝႠ-ჅჇჍა-ჺჼ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚ፝-፟፩-፱ᎀ-ᎏᎠ-Ᏽᏸ-ᏽᐁ-ᙬᙯ-ᙿᚁ-ᚚᚠ-ᛪᛮ-ᛸᜀ-ᜌᜎ-᜔ᜠ-᜴ᝀ-ᝓᝠ-ᝬᝮ-ᝰᝲ-ᝳក-៓ៗៜ-៝០-៩᠋-᠍᠐-᠙ᠠ-ᡸᢀ-ᢪᢰ-ᣵᤀ-ᤞᤠ-ᤫᤰ-᤻᥆-ᥭᥰ-ᥴᦀ-ᦫᦰ-ᧉ᧐-᧚ᨀ-ᨛᨠ-ᩞ᩠-᩿᩼-᪉᪐-᪙ᪧ᪰-᪽ᪿ-ᫀᬀ-ᭋ᭐-᭙᭫-᭳ᮀ-᯳ᰀ-᰷᱀-᱉ᱍ-ᱽᲀ-ᲈᲐ-ᲺᲽ-Ჿ᳐-᳔᳒-ᳺᴀ-᷹᷻-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼ‿-⁀⁔ⁱⁿₐ-ₜ⃐-⃥⃜⃡-⃰ℂℇℊ-ℓℕ℘-ℝℤΩℨK-ℹℼ-ℿⅅ-ⅉⅎⅠ-ↈⰀ-Ⱞⰰ-ⱞⱠ-ⳤⳫ-ⳳⴀ-ⴥⴧⴭⴰ-ⵧⵯ⵿-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⷠ-ⷿ々-〇〡-〯〱-〵〸-〼ぁ-ゖ゙-゚ゝ-ゟァ-ヺー-ヿㄅ-ㄯㄱ-ㆎㆠ-ㆿㇰ-ㇿ㐀-䶿一-鿼ꀀ-ꒌꓐ-ꓽꔀ-ꘌꘐ-ꘫꙀ-꙯ꙴ-꙽ꙿ-꛱ꜗ-ꜟꜢ-ꞈꞋ-ꞿꟂ-ꟊꟵ-ꠧ꠬ꡀ-ꡳꢀ-ꣅ꣐-꣙꣠-ꣷꣻꣽ-꤭ꤰ-꥓ꥠ-ꥼꦀ-꧀ꧏ-꧙ꧠ-ꧾꨀ-ꨶꩀ-ꩍ꩐-꩙ꩠ-ꩶꩺ-ꫂꫛ-ꫝꫠ-ꫯꫲ-꫶ꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮꬰ-ꭚꭜ-ꭩꭰ-ꯪ꯬-꯭꯰-꯹가-힣ힰ-ퟆퟋ-ퟻ豈-舘並-龎ﬀ-ﬆﬓ-ﬗיִ-ﬨשׁ-זּטּ-לּמּנּ-סּףּ-פּצּ-ﮱﯓ-ﱝﱤ-ﴽﵐ-ﶏﶒ-ﷇﷰ-ﷹ︀-️︠-︯︳-︴﹍-﹏ﹱﹳﹷﹹﹻﹽﹿ-ﻼ０-９Ａ-Ｚ＿ａ-ｚｦ-ﾾￂ-ￇￊ-ￏￒ-ￗￚ-ￜ-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]*"


def chars():
    for i in itertools.count():
        try:
            yield chr(i)
        except ValueError:
            break


def make_full_pattern():
    def make_pattern(is_valid):
        pattern = ""

        for is_identifier, group in itertools.groupby(chars(), is_valid):
            if is_identifier:
                group = list(group)
                if len(group) == 1:
                    pattern += group[0]
                else:
                    pattern += group[0] + "-" + group[-1]

        return "[" + pattern + "]"

    return make_pattern(str.isidentifier) + make_pattern(lambda c: ("x" + c).isidentifier()) + "*"


def test_pattern():
    assert full_pattern == make_full_pattern()
    identifier_regex = re.compile(full_pattern)

    for char in chars():
        for string in [char, "x" + char]:
            assert bool(identifier_regex.fullmatch(string)) == string.isidentifier()


test_pattern()

score -1 · Answer 7 · answered Dec 27 '18 at 15:44

-1

Works like a charm: r'[^\d\W][\w\d]+'

answered Dec 27 '18 at 15:44

acesaif

192
1
3
16

Doesn't match single chars, and has a bunch of other omissions like Unicode. – lionello Apr 21 '23 at 17:13

Regular expression to confirm whether a string is a valid Python identifier?

7 Answers7

Why does this happen?

What's different about these two sets?

Where is this all documented?

Where is it implemented?

I still want a regular expression

Linked

Related