22

Is there any way to beat the 100-group limit for regular expressions in Python? Also, could someone explain why there is a limit?

serv-inc
Evan Fosmark
  • Wow, 100 is a big number for a regex :O – Andrea Ambu Jan 25 '09 at 22:52
  • Yet again, I wish I'd dare to tag this you-dont-want-this. – phihag Jan 27 '09 at 00:42
  • Can you explain why you need more than 100 groups? Perhaps we can help you find an alternate solution. – Suraj Jan 26 '09 at 21:45
  • I have the same problem. I'm trying to make a regex that matches subsequences for a set of strings `(a?b?)|(b?a?b?)|(a?a?a?c?)|...` and I need the groups to retrieve which original string the subsequence was a part of. – Thomas Ahle Aug 12 '13 at 11:28
  • I ran into this problem when searching a large document for a large (externally provided) list of words in an efficient way by building a single RE from the list. To find out which word was found, I wrapped each word in a named group. – Feuermurmel May 21 '15 at 23:34

11 Answers

10

There is a limit because it would take too much memory to store the complete state machine efficiently. I'd say that if you have more than 100 groups in your re, something is wrong either in the re itself or in the way you are using them. Maybe you need to split the input and work on smaller chunks or something.
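
For reference, a minimal way to trip the limit on a Python 2 interpreter (the pattern here is an arbitrary illustration):

import re

# Build a pattern with 101 capturing groups: (0)|(1)|...|(100)
pattern = "|".join("(%d)" % i for i in range(101))

# On Python 2 this raises:
#   AssertionError: sorry, but this version only supports 100 named groups
re.compile(pattern)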

Keltia
  • I agree with your sentiment. If you're hitting the 100 group regex limit, I think there's something wrong with the design. – Kamil Kisiel Jan 25 '09 at 23:09
  • Sorry, but I disagree - what is "too much memory" and why should the module hard-code this threshold? There are (rare) cases when this usage is justified. I have (sadly) come across such a case myself. I'm parsing a complex grammar with pyparsing and (alas) found out that pyparsing is too slow. I'm now auto-generating a regular expression to match my grammar (and I've hit the hard-coded `100` brick wall). – Tal Weiss Apr 05 '11 at 21:07
  • I'm also using autogenerated RegExps, but to check for file inclusion of certain files. Each file is separated with an or (`|`) operator, and then the file is searched for by using `((catalog1/)?catalog2/)?file.hh`, if the relative path of the file is `catalog1/catalog2/file.hh`. This is because I want to match `file.hh`, `catalog2/file.hh` and `catalog1/catalog2/file.hh` alike. Since I have quite a lot of files to check for, this adds up to quite a lot of groups... – HelloGoodbye Jan 27 '14 at 09:25
9

I found the easiest way was to

import regex as re

instead of

import re

The default _MAXCACHE for regex is 500 instead of 100, I believe. This is one of the many reasons I find regex to be a better module than re.
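
For example (assuming the third-party regex package is installed, e.g. via pip install regex; the pattern itself is an invented illustration), a pattern with well over 100 capturing groups compiles without complaint:

import regex as re  # third-party drop-in replacement for the stdlib re

# 200 capturing groups: fine with regex, over the limit for Python 2's re
pattern = re.compile("|".join("(%d)" % i for i in range(200)))

m = pattern.match("7")
print(m.group(8))  # the eighth group, "(7)", is the one that matched: prints 7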

zanbri
6

If I'm not mistaken, the "new" regex module (currently third-party, but intended to eventually replace the re module in the stdlib) does not have this limit, so you might give that a try.

Steven
5

I doubt you really need to process 100 named groups in subsequent commands, or to use them in a regexp replacement; that would be quite impractical. If you just need groups to express rich conditions in the regexp, you can use non-capturing groups.

(?:word1|word2)(?:word3|word4)

etc. Complex scenarios, including nested groups, are possible. There is no limit on non-capturing groups.
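
A quick illustration (the pattern contents here are placeholders): since only capturing groups count toward the limit, even Python 2's re happily compiles hundreds of non-capturing groups:

import re

# 200 non-capturing groups compile fine even on Python 2, because only
# capturing groups count toward the 100-group limit
pattern = re.compile("".join("(?:a|b)" for _ in range(200)))

print(pattern.match("ab" * 100).group(0))  # matches all 200 characters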

rolish
    This is not an answer to the question. – jogo Dec 02 '15 at 15:39
  • True. I faced a similar situation when I needed multiple groups, and non-capturing groups were the solution. I struggle to believe more than 100 named groups are really needed in good application design; I think the correct solution is to avoid using so many named groups in the first place. Non-capturing groups are a solution - there is no limit on those. – rolish Dec 02 '15 at 16:11
5

I'm not sure what you're doing exactly, but try using a single group with a lot of OR clauses inside... so `(this)|(that)` becomes `(this|that)`. You can do clever things with the results by passing a function that does something with the particular word that is matched:

# m.group(0) is the exact text that matched; look up its replacement
newContents, num = cregex.subn(lambda m: replacements[m.group(0)], contents)

If you really need so many groups, you'll probably have to do it in stages... one pass for a dozen big groups, then another pass inside each of those groups for all the details you want.
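
Here's a self-contained sketch of that idea; the word list and input text are invented for illustration:

import re

replacements = {"cat": "dog", "red": "blue"}

# One capturing group full of OR clauses instead of one group per word
cregex = re.compile("(" + "|".join(map(re.escape, replacements)) + ")")

contents = "the red cat sat on the red mat"
newContents, num = cregex.subn(lambda m: replacements[m.group(0)], contents)
print(newContents)  # the blue dog sat on the blue mat
print(num)          # 3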

Jim Carroll
  • 2,320
  • 17
  • 23
3

First, as others have said, there are probably good alternatives to using 100 groups. The re.findall method might be a useful place to start. If you really need more than 100 groups, the only workaround I see is to modify the core Python code.

In [python-install-dir]/lib/sre_compile.py, simply modify the compile() function by removing the following lines:

# in lib/sre_compile.py
if pattern.groups > 100:
    raise AssertionError(
        "sorry, but this version only supports 100 named groups"
        )

For a slightly more flexible version, just define a constant at the top of the sre_compile module, and have the above line compare to that constant instead of 100.
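
Something like this (a hypothetical sketch, not the actual stdlib code):

# hypothetical: near the top of lib/sre_compile.py
MAXGROUPS = 100  # raise this if you really need more capturing groups

# ... then in compile(), compare against the constant instead of 100
if pattern.groups > MAXGROUPS:
    raise AssertionError(
        "sorry, but this version only supports %d named groups" % MAXGROUPS
        )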

Funnily enough, the (Python 2.5) source contains a comment indicating that the 100-group limit is scheduled to be removed in future versions.

Kenan Banks
  • I would discourage anyone from modifying the standard library as this causes the application to work only on the local install. This leads to a maintenance nightmare. – HelloGoodbye Jan 29 '14 at 08:04
1

I've found that Python 3 doesn't have this limitation, whereas the same code run on the latest 2.7 raises this error.

GDR
0

When I ran into this, I had a really complex pattern that was actually composed of a bunch of high-level patterns joined by ORs, like this:

pattern_string = u"pattern1|" \
    u"pattern2|" \
    u"patternN"
pattern = re.compile(pattern_string, re.UNICODE)

for match in pattern.finditer(string_to_search):
    pass # Extract data from the groups in the match.

As a workaround, I turned the pattern into a list and I used that list as follows:

pattern_strings = [
    u"pattern1",
    u"pattern2",
    u"patternN",
]
patterns = [re.compile(pattern_string, re.UNICODE) for pattern_string in pattern_strings]

for pattern in patterns:
    for match in pattern.finditer(string_to_search):
        pass # Extract data from the groups in the match.
    string_to_search = pattern.sub(u"", string_to_search)

Gallaecio
-1

In my case, I have a dictionary of n words and want to create a single regex that matches all of them. I.e., if my dictionary is

hello
goodbye

my regex would be `(^|\s)hello($|\s)|(^|\s)goodbye($|\s)` ... it's the only way to do it, and it works fine on small dictionaries, but when you have more than 50 words, well...
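
As the first comment below points out, a single alternation inside one pair of word boundaries avoids the per-word groups entirely; a minimal sketch of that suggestion:

import re

words = ["hello", "goodbye"]

# One non-capturing alternation with word boundaries instead of
# a pair of groups per word
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")

print(pattern.findall("hello there, goodbye now"))  # ['hello', 'goodbye']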

Martijn Pieters
alex
  • Sounds like the same problem to me, with the same solution: `\b(hello|goodbye|whatever)\b` If that doesn't work, ask a new question so we can help you properly. – Alan Moore May 26 '10 at 06:01
  • Yours is an example of how not to write a regular expression. – Gallaecio Oct 23 '14 at 05:54
-1

I would say you could reduce the number of groups by using non-capturing parentheses, but whatever it is that you're doing, it seems like you want all these groupings.

orip
-2

It's very easy to resolve this error: open the re module and you'll see the constant `_MAXCACHE = 100`. Change the value to 1000, for example, and do a test.

Egon
  • You generally don't want to change built-in classes, as this causes an application to work only on your own install. This leads to a maintenance nightmare. – Egon Nov 19 '12 at 17:14
  • Writing `re._MAXCACHE=1000` doesn't seem to do the trick. The error still happens – Thomas Ahle Aug 12 '13 at 11:24
  • The OP did not complain about the cache - he complained about the limit on the number of groups. Sadly, that number is hard-coded in the module. See http://stackoverflow.com/a/478849/78234 – Tal Weiss Aug 21 '13 at 20:00
  • See @Triptych's answer. `_MAXCACHE` is not used to create the 100-group limit. Did you try this out yourself, or did you just see that constant in the code and guess? – HelloGoodbye Jan 29 '14 at 07:59