22

Is there any way to beat the 100-group limit for regular expressions in Python? Also, could someone explain why there is a limit?

serv-inc
Evan Fosmark
  • Wow, 100 is a big number for a regex :O – Andrea Ambu Jan 25 '09 at 22:52
  • Yet again, I wish I'd dare to tag this you-dont-want-this. – phihag Jan 27 '09 at 00:42
  • Can you explain why you need more than 100 groups? Perhaps we can help you find an alternate solution. – Suraj Jan 26 '09 at 21:45
  • I have the same problem. I'm trying to make a regex that matches subsequences for a set of strings `(a?b?)|(b?a?b?)|(a?a?a?c?)|...` and I need the groups to retrieve which original string the subsequence was a part of. – Thomas Ahle Aug 12 '13 at 11:28
  • I ran into this problem when searching a large document for a large (externally provided) list of words in an efficient way by building a single RE from the list. To find out which word was found, I wrapped each word in a named group. – Feuermurmel May 21 '15 at 23:34

11 Answers

10

There is a limit because it would take too much memory to store the complete state machine efficiently. I'd say that if you have more than 100 groups in your re, something is wrong either in the re itself or in the way you are using them. Maybe you need to split the input and work on smaller chunks or something.
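
For reference, a minimal way to trip the limit on a Python 2 interpreter (the pattern here is an arbitrary illustration):

import re

# Build a pattern with 101 capturing groups: (0)|(1)|...|(100)
pattern = "|".join("(%d)" % i for i in range(101))

# On Python 2 this raises:
#   AssertionError: sorry, but this version only supports 100 named groups
re.compile(pattern)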

Keltia
  • I agree with your sentiment. If you're hitting the 100 group regex limit, I think there's something wrong with the design. – Kamil Kisiel Jan 25 '09 at 23:09
  • Sorry, but I disagree - what is "too much memory" and why should the module hard-code this threshold? There are (rare) cases when this usage is justified. I have (sadly) come across such a case myself. I'm parsing a complex grammar with pyparsing and (alas) found out that pyparsing is too slow. I'm now auto-generating a regular expression to match my grammar (and I've hit the hard-coded `100` brick wall). – Tal Weiss Apr 05 '11 at 21:07
  • I'm also using autogenerated RegExps, but to check for file inclusion of certain files. Each file is separated with an or (`|`) operator, and then the file is searched for by using `((catalog1/)?catalog2/)?file.hh`, if the relative path of the file is `catalog1/catalog2/file.hh`. This is because I want to match `file.hh`, `catalog2/file.hh` and `catalog1/catalog2/file.hh` alike. Since I have quite a lot of files to check for, this adds up to quite a lot of groups... – HelloGoodbye Jan 27 '14 at 09:25
9

I found the easiest way was to

import regex as re

instead of

import re

The default _MAXCACHE for regex is 500 instead of 100, I believe. This is one of the many reasons I find regex to be a better module than re.
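
For example (assuming the third-party regex package is installed, e.g. via pip install regex; the pattern itself is an invented illustration), a pattern with well over 100 capturing groups compiles without complaint:

import regex as re  # third-party drop-in replacement for the stdlib re

# 200 capturing groups: fine with regex, over the limit for Python 2's re
pattern = re.compile("|".join("(%d)" % i for i in range(200)))

m = pattern.match("7")
print(m.group(8))  # the eighth group, "(7)", is the one that matched: prints 7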

zanbri
6

If I'm not mistaken, the "new" regex module (currently third-party, but intended to eventually replace the re module in the stdlib) does not have this limit, so you might give that a try.

Steven
5

I doubt you really need to process 100 named groups in subsequent commands, or to use them in a regexp replacement; that would be quite impractical. If you just need groups to express rich conditions in the regexp, you can use non-capturing groups.

(?:word1|word2)(?:word3|word4)

etc. Complex scenarios, including nested groups, are possible. There is no limit on non-capturing groups.
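
A quick illustration (the pattern contents here are placeholders): since only capturing groups count toward the limit, even Python 2's re happily compiles hundreds of non-capturing groups:

import re

# 200 non-capturing groups compile fine even on Python 2, because only
# capturing groups count toward the 100-group limit
pattern = re.compile("".join("(?:a|b)" for _ in range(200)))

print(pattern.match("ab" * 100).group(0))  # matches all 200 characters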

rolish
    This is not an answer to the question. – jogo Dec 02 '15 at 15:39
  • True. I faced a similar situation when I needed multiple groups, and non-capturing groups were the solution. I struggle to believe more than 100 named groups are really needed in good application design; I think the correct solution is to avoid using so many named groups in the first place. Non-capturing groups are a solution - there is no limit on those. – rolish Dec 02 '15 at 16:11
5

I'm not sure what you're doing exactly, but try using a single group with a lot of OR clauses inside... so `(this)|(that)` becomes `(this|that)`. You can do clever things with the results by passing a function that does something with the particular word that is matched:

# m.group(0) is the exact text that matched; look up its replacement
newContents, num = cregex.subn(lambda m: replacements[m.group(0)], contents)

If you really need so many groups, you'll probably have to do it in stages... one pass for a dozen big groups, then another pass inside each of those groups for all the details you want.
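
Here's a self-contained sketch of that idea; the word list and input text are invented for illustration:

import re

replacements = {"cat": "dog", "red": "blue"}

# One capturing group full of OR clauses instead of one group per word
cregex = re.compile("(" + "|".join(map(re.escape, replacements)) + ")")

contents = "the red cat sat on the red mat"
newContents, num = cregex.subn(lambda m: replacements[m.group(0)], contents)
print(newContents)  # the blue dog sat on the blue mat
print(num)          # 3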

Jim Carroll
  • 2,320
  • 17
  • 23
3

First, as others have said, there are probably good alternatives to using 100 groups. The re.findall method might be a useful place to start. If you really need more than 100 groups, the only workaround I see is to modify the core Python code.

In [python-install-dir]/lib/sre_compile.py, simply modify the compile() function by removing the following lines:

# in lib/sre_compile.py
if pattern.groups > 100:
    raise AssertionError(
        "sorry, but this version only supports 100 named groups"
        )

For a slightly more flexible version, just define a constant at the top of the sre_compile module, and have the above line compare to that constant instead of 100.
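
Something like this (a hypothetical sketch, not the actual stdlib code):

# hypothetical: near the top of lib/sre_compile.py
MAXGROUPS = 100  # raise this if you really need more capturing groups

# ... then in compile(), compare against the constant instead of 100
if pattern.groups > MAXGROUPS:
    raise AssertionError(
        "sorry, but this version only supports %d named groups" % MAXGROUPS
        )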

Funnily enough, the (Python 2.5) source contains a comment indicating that the 100-group limit is scheduled to be removed in future versions.

Kenan Banks
  • I would discourage anyone from modifying the standard library as this causes the application to work only on the local install. This leads to a maintenance nightmare. – HelloGoodbye Jan 29 '14 at 08:04
1

I've found that Python 3 doesn't have this limitation, whereas the same code run on the latest 2.7 raises this error.

GDR
0

When I ran into this, I had a really complex pattern that was actually composed of a bunch of high-level patterns joined by ORs, like this:

pattern_string = u"pattern1|" \
    u"pattern2|" \
    u"patternN"
pattern = re.compile(pattern_string, re.UNICODE)

for match in pattern.finditer(string_to_search):
    pass # Extract data from the groups in the match.

As a workaround, I turned the pattern into a list and I used that list as follows:

pattern_strings = [
    u"pattern1",
    u"pattern2",
    u"patternN",
]
patterns = [re.compile(pattern_string, re.UNICODE) for pattern_string in pattern_strings]

for pattern in patterns:
    for match in pattern.finditer(string_to_search):
        pass # Extract data from the groups in the match.
    string_to_search = pattern.sub(u"", string_to_search)

Gallaecio
-1

In my case, I have a dictionary of n words and want to create a single regex that matches all of them. I.e., if my dictionary is

hello
goodbye

my regex would be `(^|\s)hello($|\s)|(^|\s)goodbye($|\s)` ... it's the only way to do it, and it works fine on small dictionaries, but when you have more than 50 words, well...
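
As the first comment below points out, a single alternation inside one pair of word boundaries avoids the per-word groups entirely; a minimal sketch of that suggestion:

import re

words = ["hello", "goodbye"]

# One non-capturing alternation with word boundaries instead of
# a pair of groups per word
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")

print(pattern.findall("hello there, goodbye now"))  # ['hello', 'goodbye']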

Martijn Pieters
alex
  • Sounds like the same problem to me, with the same solution: `\b(hello|goodbye|whatever)\b` If that doesn't work, ask a new question so we can help you properly. – Alan Moore May 26 '10 at 06:01
  • Yours is an example of how not to write a regular expression. – Gallaecio Oct 23 '14 at 05:54
-1

I would say you could reduce the number of groups by using non-capturing parentheses, but whatever it is that you're doing, it seems like you want all these groupings.

orip
-2

It's very easy to resolve this error: open the re module and you'll see the constant `_MAXCACHE = 100`. Change the value to 1000, for example, and do a test.

Egon
  • You generally don't want to change built-in classes, as this causes an application to work only on your own install. This leads to a maintenance nightmare. – Egon Nov 19 '12 at 17:14
  • Writing `re._MAXCACHE=1000` doesn't seem to do the trick. The error still happens – Thomas Ahle Aug 12 '13 at 11:24
  • The OP did not complain about the cache - he complained about the limit on the number of groups. Sadly, that number is hard-coded in the module. See http://stackoverflow.com/a/478849/78234 – Tal Weiss Aug 21 '13 at 20:00
  • See @Triptych's answer. `_MAXCACHE` is not used to create the 100-group limit. Did you try this out yourself, or did you just see that constant in the code and guess? – HelloGoodbye Jan 29 '14 at 07:59