16

I have the following definition for an Identifier:

Identifier --> letter{ letter| digit}

Basically I have an identifier function that gets a string from a file and tests it to make sure that it's a valid identifier as defined above.

I've tried this:

if re.match('\w+(\w\d)?', i):     
  return True
else:
  return False

but when I run my program every time it meets an integer it thinks that it's a valid identifier.

For example

c = 0 ;

it prints c as a valid identifier which is fine, but it also prints 0 as a valid identifier.

What am I doing wrong here?

martineau
user682194

7 Answers

29

This question was asked 10 years ago, when Python 2 was still dominant. As many comments over the last decade have shown, my answer needed a serious update, starting with a big heads-up:

No single regex will properly match all (and only) valid Python identifiers. It didn't for Python 2, it doesn't for Python 3.

The reasons are:

  • As @JoeCondron pointed out, Python reserved keywords such as True, if, return, are not valid identifiers, and regexes alone are unable to handle this, so additional filtering is required.

  • Python 3 allows non-ascii letters and numbers in an identifier, but the Unicode categories of letters and numbers accepted by the lexical parser for a valid identifier do not match the same categories of \d, \w, \W in the re module, as demonstrated in @martineau's counter-example and explained in great detail by @Hatshepsut's amazing research.

While we could try to solve the first issue using keyword.iskeyword(), as @Alexander Huszagh suggested, and work around the other by limiting ourselves to ASCII-only identifiers, why bother using a regex at all?

As Hatshepsut said:

str.isidentifier() works

Just use it, problem solved.
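A minimal sketch of that approach, combining str.isidentifier() with keyword.iskeyword() so that reserved words are rejected too. The helper name is_valid_identifier is only an illustration, not something from the standard library:

```python
import keyword


def is_valid_identifier(name: str) -> bool:
    """Return True if name is a valid identifier and not a reserved keyword."""
    # str.isidentifier() applies the language's own lexical rules;
    # keyword.iskeyword() filters out reserved words such as "if" or "True".
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ["c", "0", "_a1", "1a", "aa bb", "if", "True"]:
    print("%r\t= %s" % (candidate, is_valid_identifier(candidate)))
```

With the input from the question, c is accepted and 0 is rejected.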


As requested by the question, my original 2012 answer presents a regular expression based on Python 2's official definition of an identifier:

identifier ::=  (letter|"_") (letter | digit | "_")*

Which can be expressed by the regular expression:

^[^\d\W]\w*\Z

Example:

import re
identifier = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)

tests = [ "a", "a1", "_a1", "1a", "aa$%@%", "aa bb", "aa_bb", "aa\n" ]
for test in tests:
    result = re.match(identifier, test)
    print("%r\t= %s" % (test, (result is not None)))

Result:

'a'      = True
'a1'     = True
'_a1'    = True
'1a'     = False
'aa$%@%' = False
'aa bb'  = False
'aa_bb'  = True
'aa\n'   = False
MestreLion
  • 6
    It might be worth mentioning that this matches keywords such as `True`, `return` etc. I'm not suggesting a change to the regex but just that the OP might want to bear that in mind. – JoeCondron Jun 08 '16 at 12:43
  • 2
    @JoeCondron This is also very easy to do, since Python contains the `keyword.iskeyword` function, which is merely a wrapper around the keyword list frozenset. – Alex Huszagh Dec 31 '17 at 22:35
  • 2
    In Python 3.6 at least, this doesn't work for the Unicode string `'℘᧚'` even though that **is** a valid identifier in Python 3 (and isn't a keyword). – martineau Jun 19 '18 at 19:45
14

str.isidentifier() works. The regex answers incorrectly fail to match some valid Python identifiers and incorrectly match some invalid ones.

str.isidentifier() Return true if the string is a valid identifier according to the language definition, section Identifiers and keywords.

Use keyword.iskeyword() to test for reserved identifiers such as def and class.

@martineau's comment gives the example of '℘᧚' where the regex solutions fail.

>>> '℘᧚'.isidentifier()
True
>>> import re
>>> bool(re.search(r'^[^\d\W]\w*\Z', '℘᧚'))
False

Why does this happen?

Let's define the set of code points that match the given regular expression, and the set that match str.isidentifier.

import re
import unicodedata

chars = {chr(i) for i in range(0x10ffff) if re.fullmatch(r'^[^\d\W]\w*\Z', chr(i))}
identifiers = {chr(i) for i in range(0x10ffff) if chr(i).isidentifier()}

How many regex matches are not identifiers?

In [26]: len(chars - identifiers)
Out[26]: 698

How many identifiers are not regex matches?

In [27]: len(identifiers - chars)
Out[27]: 4

Interesting -- which ones?

In [37]: {(c, unicodedata.name(c), unicodedata.category(c)) for c in identifiers - chars}
Out[37]: 
set([
    ('\u1885', 'MONGOLIAN LETTER ALI GALI BALUDA', 'Mn'),
    ('\u1886', 'MONGOLIAN LETTER ALI GALI THREE BALUDA', 'Mn'),
    ('℘', 'SCRIPT CAPITAL P', 'Sm'),
    ('℮', 'ESTIMATED SYMBOL', 'So'),
])

What's different about these two sets?

They have different Unicode "General Category" values.

In [31]: {unicodedata.category(c) for c in chars - identifiers}                                                                                 
Out[31]: set(['Lm', 'Lo', 'No'])

According to Wikipedia, that's Letter, modifier; Letter, other; Number, other. This is consistent with the re docs, since \d is only decimal digits:

\d Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])
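As a concrete instance of this direction, here is a small sketch using SUPERSCRIPT TWO, a character I picked for illustration (it is not taken from the sets computed above). It has category No, so \d does not exclude it and \w accepts it, yet Python does not allow it in an identifier:

```python
import re
import unicodedata

pattern = re.compile(r'^[^\d\W]\w*\Z')

ch = '²'  # SUPERSCRIPT TWO, chosen as an example of category "No"
category = unicodedata.category(ch)
regex_says_identifier = bool(pattern.match(ch))
python_says_identifier = ch.isidentifier()

print(unicodedata.name(ch), category)  # SUPERSCRIPT TWO No
print(regex_says_identifier)           # True: not a decimal digit, but a word character
print(python_says_identifier)          # False: not allowed at the start of an identifier
```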

What about the other way?

In [32]: {unicodedata.category(c) for c in identifiers - chars}                                                                                 
Out[32]: set(['Mn', 'Sm', 'So'])

That's Mark, nonspacing; Symbol, math; Symbol, other.

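Where is this all documented?

In the "Identifiers and keywords" section of the Python Language Reference, and in PEP 3131, "Supporting Non-ASCII Identifiers".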

Where is it implemented?

https://github.com/python/cpython/commit/47383403a0a11259acb640406a8efc38981d2255

I still want a regular expression

Look at the regex module on PyPI.

This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality.

It includes filters for "General Category".

Hatshepsut
  • Can you provide an example of when this works but the regex(es) fail? – martineau Mar 18 '19 at 18:41
  • Indeed, you're right — but that surprises me because the `re` documentation seems to indicate that it supports Unicode strings (even without a `re.UNICODE` flag in Python 3.x). – martineau Mar 18 '19 at 19:55
  • @martineau Out of curiosity, how did you run across that particular one? – Hatshepsut Mar 18 '19 at 21:14
  • I was developing a regex to recognize Python "special" method names, i.e. those that start and end with two underscore characters (`__`), aka "dunder" names, so I was searching this site for a general one that recognized any valid identifier. Guess I'll have to give up on doing it with an `re` regex...in fact, I now suspect the module's limitation/bug may be why the string `isidentifier` method was added in Python 3. – martineau Mar 18 '19 at 21:30
  • ...following on after your recent update: Yes, I'm aware there are third-party regex libraries, but would prefer to limit what I'm doing to the standard library — and so will use a combination of `str.isidentifier()` along with `str.startswith()` & `str.endswith()` to detect dunder names (the code not being especially speed critical). Thank you for the responses (and answer updates). – martineau Mar 18 '19 at 22:04
  • What amazing research, and what an answer! It made me completely revamp my own. I wasn't aware of `str.isidentifier()`, not sure if it was available in 2012. But in 2021 it really makes no sense to use a regex at all. – MestreLion May 05 '21 at 02:40
3

For Python 3, you need to handle Unicode letters and digits. So if that's a concern, you should get along with this:

re_ident = re.compile(r"^[^\d\W]\w*$", re.UNICODE)

[^\d\W] matches a character that is not a digit and not "not alphanumeric" which translates to "a character that is a letter or underscore".
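One detail worth checking with this pattern: `$` also matches just before a trailing newline, whereas `\Z` matches only at the very end of the string. The sketch below compares the two on a string ending in "\n" (the variable names are only illustrative):

```python
import re

with_dollar = re.compile(r"^[^\d\W]\w*$")
with_z = re.compile(r"^[^\d\W]\w*\Z")

name = "aa\n"  # an identifier followed by a stray newline

dollar_match = bool(with_dollar.match(name))
z_match = bool(with_z.match(name))
dollar_fullmatch = bool(with_dollar.fullmatch(name))

print(dollar_match)      # True: "$" is satisfied before the final newline
print(z_match)           # False: "\Z" requires the true end of the string
print(dollar_fullmatch)  # False: fullmatch() must consume the newline as well
```

If the input may carry a trailing newline, either `\Z` or `fullmatch()` avoids accepting it.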

Tim Pietzcker
  • 2
    Almost there... but not quite... it will fail for single-letter identifiers "a", and it also allows "aa@#$%" as a valid identifier – MestreLion Apr 13 '12 at 02:47
2

\w matches digits as well as letters and the underscore, so on its own it cannot rule out a leading digit. Try ^[_a-zA-Z]\w*$
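To see why the pattern from the question accepts 0, and how an ASCII-only check might look, here is a small sketch. The re.ASCII flag, the use of fullmatch() and the helper name is_ascii_identifier are assumptions added for illustration; without re.ASCII, \w in Python 3 also matches non-ASCII word characters.

```python
import re

# The pattern from the question: \w+ happily matches a lone digit,
# and the optional group never has to match anything.
question_accepts_zero = bool(re.match(r'\w+(\w\d)?', '0'))
print(question_accepts_zero)  # True

# First character: letter or underscore; the rest: ASCII word characters.
ASCII_IDENTIFIER = re.compile(r'[_a-zA-Z]\w*', re.ASCII)


def is_ascii_identifier(text):
    """Return True if the whole string is an ASCII-only identifier."""
    return ASCII_IDENTIFIER.fullmatch(text) is not None


for candidate in ["c", "0", "a1", "1a", "aa bb", "aa\n"]:
    print("%r\t= %s" % (candidate, is_ascii_identifier(candidate)))
```

This does not reject keywords and deliberately ignores non-ASCII identifiers.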

Joe
0

The question is about regex, so my answer may look off topic. The point is that a regex is simply not the right approach.

Interested in getting the problematic characters?

Using str.isidentifier, one can perform the check character by character, prefixing each one with, say, an underscore to avoid false positives such as digits, which are valid anywhere but in first position. After all, how could a name be valid if one of its (prefixed) components is not? E.g.

def checker(str_: str) -> 'set[str]':
    return {
        c for i, c in enumerate(str_)
        if not (f'_{c}' if i else c).isidentifier()
    }
>>> checker('℘3᧚₂')
{'₂'}

This solution also deals with characters that are not allowed in first position, such as digits or e.g. ᧚. See:

>>> checker('᧚℘3₂')
{'₂', '᧚'}
>>> checker('3᧚℘₂')
{'3', '₂'}
>>> checker("a$%@#%\n")
{'@', '#', '\n', '$', '%'}

This could be improved, since it neither checks for reserved names nor says anything about why ᧚ is only sometimes problematic whereas ₂ always is... but here is my regex-free approach.


My answer in your terms:

if not checker(i):
    return True
else:
    return False

which could be contracted into

return not checker(i)
keepAlive
0

I needed a working regex (i.e. I couldn't just use str.isidentifier) because I needed to find all identifiers embedded in a string, not just test if a whole string was a valid identifier. I also couldn't use the ast module because I expected the string to not be valid Python syntax. So the existing answers didn't help, and I wasn't satisfied with 'use the regex package'. So here's an actual regex that does the job, along with the code for constructing it and testing it.

# coding: utf-8
import itertools
import re

full_pattern = r"[A-Z_a-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶ-ͷͻ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-Ֆՙՠ-ֈא-תׯ-ײؠ-يٮ-ٯٱ-ۓەۥ-ۦۮ-ۯۺ-ۼۿܐܒ-ܯݍ-ޥޱߊ-ߪߴ-ߵߺࠀ-ࠕࠚࠤࠨࡀ-ࡘࡠ-ࡪࢠ-ࢴࢶ-ࣇऄ-हऽॐक़-ॡॱ-ঀঅ-ঌএ-ঐও-নপ-রলশ-হঽৎড়-ঢ়য়-ৡৰ-ৱৼਅ-ਊਏ-ਐਓ-ਨਪ-ਰਲ-ਲ਼ਵ-ਸ਼ਸ-ਹਖ਼-ੜਫ਼ੲ-ੴઅ-ઍએ-ઑઓ-નપ-રલ-ળવ-હઽૐૠ-ૡૹଅ-ଌଏ-ଐଓ-ନପ-ରଲ-ଳଵ-ହଽଡ଼-ଢ଼ୟ-ୡୱஃஅ-ஊஎ-ஐஒ-கங-சஜஞ-டண-தந-பம-ஹௐఅ-ఌఎ-ఐఒ-నప-హఽౘ-ౚౠ-ౡಀಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹಽೞೠ-ೡೱ-ೲഄ-ഌഎ-ഐഒ-ഺഽൎൔ-ൖൟ-ൡൺ-ൿඅ-ඖක-නඳ-රලව-ෆก-ะาเ-ๆກ-ຂຄຆ-ຊຌ-ຣລວ-ະາຽເ-ໄໆໜ-ໟༀཀ-ཇཉ-ཬྈ-ྌက-ဪဿၐ-ၕၚ-ၝၡၥ-ၦၮ-ၰၵ-ႁႎႠ-ჅჇჍა-ჺჼ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏽᏸ-ᏽᐁ-ᙬᙯ-ᙿᚁ-ᚚᚠ-ᛪᛮ-ᛸᜀ-ᜌᜎ-ᜑᜠ-ᜱᝀ-ᝑᝠ-ᝬᝮ-ᝰក-ឳៗៜᠠ-ᡸᢀ-ᢨᢪᢰ-ᣵᤀ-ᤞᥐ-ᥭᥰ-ᥴᦀ-ᦫᦰ-ᧉᨀ-ᨖᨠ-ᩔᪧᬅ-ᬳᭅ-ᭋᮃ-ᮠᮮ-ᮯᮺ-ᯥᰀ-ᰣᱍ-ᱏᱚ-ᱽᲀ-ᲈᲐ-ᲺᲽ-Ჿᳩ-ᳬᳮ-ᳳᳵ-ᳶᳺᴀ-ᶿḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼⁱⁿₐ-ₜℂℇℊ-ℓℕ℘-ℝℤΩℨK-ℹℼ-ℿⅅ-ⅉⅎⅠ-ↈⰀ-Ⱞⰰ-ⱞⱠ-ⳤⳫ-ⳮⳲ-ⳳⴀ-ⴥⴧⴭⴰ-ⵧⵯⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞ々-〇〡-〩〱-〵〸-〼ぁ-ゖゝ-ゟァ-ヺー-ヿㄅ-ㄯㄱ-ㆎㆠ-ㆿㇰ-ㇿ㐀-䶿一-鿼ꀀ-ꒌꓐ-ꓽꔀ-ꘌꘐ-ꘟꘪ-ꘫꙀ-ꙮꙿ-ꚝꚠ-ꛯꜗ-ꜟꜢ-ꞈꞋ-ꞿꟂ-ꟊꟵ-ꠁꠃ-ꠅꠇ-ꠊꠌ-ꠢꡀ-ꡳꢂ-ꢳꣲ-ꣷꣻꣽ-ꣾꤊ-ꤥꤰ-ꥆꥠ-ꥼꦄ-ꦲꧏꧠ-ꧤꧦ-ꧯꧺ-ꧾꨀ-ꨨꩀ-ꩂꩄ-ꩋꩠ-ꩶꩺꩾ-ꪯꪱꪵ-ꪶꪹ-ꪽꫀꫂꫛ-ꫝꫠ-ꫪꫲ-ꫴꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮꬰ-ꭚꭜ-ꭩꭰ-ꯢ가-힣ힰ-ퟆퟋ-ퟻ豈-舘並-龎ff-stﬓ-ﬗיִײַ-ﬨשׁ-זּטּ-לּמּנּ-סּףּ-פּצּ-ﮱﯓ-ﱝﱤ-ﴽﵐ-ﶏﶒ-ﷇﷰ-ﷹﹱﹳﹷﹹﹻﹽﹿ-ﻼA-Za-zヲ-ンᅠ-하-ᅦᅧ-ᅬᅭ-ᅲᅳ-ᅵ-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------][0-9A-Z_a-zªµ·ºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮ̀-ʹͶ-ͷͻ-ͽͿΆ-ΊΌΎ-ΡΣ-ϵϷ-ҁ҃-҇Ҋ-ԯԱ-Ֆՙՠ-ֈ֑-ֽֿׁ-ׂׄ-ׇׅא-תׯ-ײؐ-ؚؠ-٩ٮ-ۓە-ۜ۟-۪ۨ-ۼۿܐ-݊ݍ-ޱ߀-ߵߺ߽ࠀ-࠭ࡀ-࡛ࡠ-ࡪࢠ-ࢴࢶ-ࣇ࣓-ࣣ࣡-ॣ०-९ॱ-ঃঅ-ঌএ-ঐও-নপ-রলশ-হ়-ৄে-ৈো-ৎৗড়-ঢ়য়-ৣ০-ৱৼ৾ਁ-ਃਅ-ਊਏ-ਐਓ-ਨਪ-ਰਲ-ਲ਼ਵ-ਸ਼ਸ-ਹ਼ਾ-ੂੇ-ੈੋ-੍ੑਖ਼-ੜਫ਼੦-ੵઁ-ઃઅ-ઍએ-ઑઓ-નપ-રલ-ળવ-હ઼-ૅે-ૉો-્ૐૠ-ૣ૦-૯ૹ-૿ଁ-ଃଅ-ଌଏ-ଐଓ-ନପ-ରଲ-ଳଵ-ହ଼-ୄେ-ୈୋ-୍୕-ୗଡ଼-ଢ଼ୟ-ୣ୦-୯ୱஂ-ஃஅ-ஊஎ-ஐஒ-கங-சஜஞ-டண-தந-பம-ஹா-ூெ-ைொ-்ௐௗ௦-௯ఀ-ఌఎ-ఐఒ-నప-హఽ-ౄె-ైొ-్ౕ-ౖౘ-ౚౠ-ౣ౦-౯ಀ-ಃಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹ಼-ೄೆ-ೈೊ-್ೕ-ೖೞೠ-ೣ೦-೯ೱ-ೲഀ-ഌഎ-ഐഒ-ൄെ-ൈൊ-ൎൔ-ൗൟ-ൣ൦-൯ൺ-ൿඁ-ඃඅ-ඖක-නඳ-රලව-ෆ්ා-ුූෘ-ෟ෦-෯ෲ-ෳก-ฺเ-๎๐-๙ກ-ຂຄຆ-ຊຌ-ຣລວ-ຽເ-ໄໆ່-ໍ໐-໙ໜ-ໟༀ༘-༙༠-༩༹༵༷༾-ཇཉ-ཬཱ-྄྆-ྗྙ-ྼ࿆က-၉ၐ-ႝႠ-ჅჇჍა-ჺჼ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚ፝-፟፩-፱ᎀ-ᎏᎠ-Ᏽᏸ-ᏽᐁ-ᙬᙯ-ᙿᚁ-ᚚᚠ-ᛪᛮ-ᛸᜀ-ᜌᜎ-᜔ᜠ-᜴ᝀ-ᝓᝠ-ᝬᝮ-ᝰᝲ-ᝳក-៓ៗៜ-៝០-៩᠋-᠍᠐-᠙ᠠ-ᡸᢀ-ᢪᢰ-ᣵᤀ-ᤞᤠ-ᤫᤰ-᤻᥆-ᥭᥰ-ᥴᦀ-ᦫᦰ-ᧉ᧐-᧚ᨀ-ᨛᨠ-ᩞ᩠-᩿᩼-᪉᪐-᪙ᪧ᪰-᪽ᪿ-ᫀᬀ-ᭋ᭐-᭙᭫-᭳ᮀ-᯳ᰀ-᰷᱀-᱉ᱍ-ᱽᲀ-ᲈᲐ-ᲺᲽ-Ჿ᳐-᳔᳒-ᳺᴀ-᷹᷻-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ
-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼ‿-⁀⁔ⁱⁿₐ-ₜ⃐-⃥⃜⃡-⃰ℂℇℊ-ℓℕ℘-ℝℤΩℨK-ℹℼ-ℿⅅ-ⅉⅎⅠ-ↈⰀ-Ⱞⰰ-ⱞⱠ-ⳤⳫ-ⳳⴀ-ⴥⴧⴭⴰ-ⵧⵯ⵿-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⷠ-ⷿ々-〇〡-〯〱-〵〸-〼ぁ-ゖ゙-゚ゝ-ゟァ-ヺー-ヿㄅ-ㄯㄱ-ㆎㆠ-ㆿㇰ-ㇿ㐀-䶿一-鿼ꀀ-ꒌꓐ-ꓽꔀ-ꘌꘐ-ꘫꙀ-꙯ꙴ-꙽ꙿ-꛱ꜗ-ꜟꜢ-ꞈꞋ-ꞿꟂ-ꟊꟵ-ꠧ꠬ꡀ-ꡳꢀ-ꣅ꣐-꣙꣠-ꣷꣻꣽ-꤭ꤰ-꥓ꥠ-ꥼꦀ-꧀ꧏ-꧙ꧠ-ꧾꨀ-ꨶꩀ-ꩍ꩐-꩙ꩠ-ꩶꩺ-ꫂꫛ-ꫝꫠ-ꫯꫲ-꫶ꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮꬰ-ꭚꭜ-ꭩꭰ-ꯪ꯬-꯭꯰-꯹가-힣ힰ-ퟆퟋ-ퟻ豈-舘並-龎ff-stﬓ-ﬗיִ-ﬨשׁ-זּטּ-לּמּנּ-סּףּ-פּצּ-ﮱﯓ-ﱝﱤ-ﴽﵐ-ﶏﶒ-ﷇﷰ-ﷹ︀-️︠-︯︳-︴﹍-﹏ﹱﹳﹷﹹﹻﹽﹿ-ﻼ0-9A-Z_a-zヲ-하-ᅦᅧ-ᅬᅭ-ᅲᅳ-ᅵ-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]*"


def chars():
    for i in itertools.count():
        try:
            yield chr(i)
        except ValueError:
            break


def make_full_pattern():
    def make_pattern(is_valid):
        pattern = ""

        for is_identifier, group in itertools.groupby(chars(), is_valid):
            if is_identifier:
                group = list(group)
                if len(group) == 1:
                    pattern += group[0]
                else:
                    pattern += group[0] + "-" + group[-1]

        return "[" + pattern + "]"

    return make_pattern(str.isidentifier) + make_pattern(lambda c: ("x" + c).isidentifier()) + "*"


def test_pattern():
    assert full_pattern == make_full_pattern()
    identifier_regex = re.compile(full_pattern)

    for char in chars():
        for string in [char, "x" + char]:
            assert bool(identifier_regex.fullmatch(string)) == string.isidentifier()


test_pattern()
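The test above only exercises fullmatch. For the stated goal of finding identifiers embedded in a larger string, something along these lines may work. It is a sketch that rebuilds the two character classes at runtime rather than relying on the pasted literal; the names char_class, START, CONTINUE and IDENTIFIER, as well as the negative lookbehind, are illustrative assumptions and not part of the code above.

```python
import itertools
import re
import sys


def char_class(is_valid):
    """Build a regex character class from every code point is_valid accepts."""
    parts = []
    code_points = (chr(i) for i in range(sys.maxunicode + 1))
    for ok, group in itertools.groupby(code_points, is_valid):
        if ok:
            run = list(group)
            # A run of consecutive code points becomes a single "a-z" style range.
            parts.append(run[0] if len(run) == 1 else run[0] + "-" + run[-1])
    return "[" + "".join(parts) + "]"


START = char_class(str.isidentifier)
CONTINUE = char_class(lambda c: ("x" + c).isidentifier())

# The lookbehind stops a match from starting in the middle of a token
# such as "2baz", which is not an identifier as a whole.
IDENTIFIER = re.compile(f"(?<!{CONTINUE}){START}{CONTINUE}*")

found = IDENTIFIER.findall("total = price_1 * 2baz + ℘᧚")
print(found)  # ['total', 'price_1', '℘᧚']
```

Building the classes scans every code point, so it takes a moment; keywords are still matched and would need keyword.iskeyword() as a separate filter.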
Alex Hall
-1

Works like a charm: r'[^\d\W][\w\d]+'

acesaif