
I'm trying to build a regular expression that captures any number (an integer or a float, with or without scientific notation). I'm building it from named groups so that if I need to change something, I only have to update one line. Here's what I'm doing:

intNumber = r"(?P<Integer>-?(0|[1-9]+[0-9]*))" # Integer
floatNumber = r"(?P<Float>"+intNumber+r"\.[0-9]+)" # Float
sciNumber = r"(?P<Scientific>"+floatNumber+r"(e|E)(-|\+)?[0-9]+)" # Scientific
anyNumber = r"(?P<AnyNumber>"+sciNumber+"|(?P=Integer)|(?P=Float))" # Any number

The problem is that although each regex works on its own, when I combine them all in anyNumber using alternation (|), it captures only numbers in scientific notation and not the rest. What am I doing wrong?

Edit: To refine my question, is it possible to have a dynamically generated regex (with the goal of simple, single-spot maintenance) that is also flexible enough to let me use its components separately, without problems like redefinition of groups, and with convenient group names? I know I may be asking too much.

ekad
capitan

2 Answers


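(?P=Integer) is a named backreference: it matches the same text that the capturing group named "Integer" has already matched; it does not re-apply that group's subpattern. The same goes for (?P=Float). In anyNumber, those two groups only capture inside the scientific alternative, so when that alternative fails, the backreferences have nothing to refer to and cannot match either. That means you need to use the patterns themselves, not backreferences.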
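To make the difference concrete, here is a minimal sketch of backreference behavior. The group name `num` and the test strings are made up for illustration and are not part of the question's patterns.

```python
import re

# (?P=num) must repeat the exact text captured by (?P<num>...);
# it does not mean "match [0-9]+ again".
pattern = re.compile(r"(?P<num>[0-9]+)-(?P=num)")

same = pattern.fullmatch("12-12")       # match: the second part repeats "12"
different = pattern.fullmatch("12-34")  # None: "34" is not the text captured by num

print(same is not None, different is None)  # True True
```

If the intent is "another number here", the subpattern has to be spelled out again, which is what string concatenation of the pattern pieces gives you.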

Also, you cannot use named backreferences if you plan to build the regex dynamically that way. Use non-capturing groups instead, and your pattern building will look similar to this:

import re
intNumber = r"-?(?:0|[1-9]+[0-9]*)" # Integer
floatNumber = intNumber+r"\.[0-9]+" # Float
sciNumber = floatNumber+r"[eE][-+]?[0-9]+" # Scientific
anyNumber = r"{0}|{1}|{2}".format(sciNumber,floatNumber,intNumber) # Any number
print(re.findall(anyNumber, '12 12.34 12.34E-34'))

See the Python demo
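A note on the order used in anyNumber: `re` tries alternatives from left to right and takes the first one that succeeds, which is why the longest form (scientific) comes first. The sketch below reuses the patterns above to show what happens otherwise; the name `wrongOrder` is only for illustration.

```python
import re

intNumber = r"-?(?:0|[1-9]+[0-9]*)"           # Integer
floatNumber = intNumber + r"\.[0-9]+"         # Float
sciNumber = floatNumber + r"[eE][-+]?[0-9]+"  # Scientific

text = "12 12.34 12.34E-34"

# Longest alternative first: each number is matched whole.
anyNumber = "|".join([sciNumber, floatNumber, intNumber])
print(re.findall(anyNumber, text))   # ['12', '12.34', '12.34E-34']

# Integer first: the integer alternative always wins and numbers get split up.
wrongOrder = "|".join([intNumber, floatNumber, sciNumber])
print(re.findall(wrongOrder, text))  # ['12', '12', '34', '12', '34', '-34']
```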

Wiktor Stribiżew
  • Ok thank you! But if I also want to refer to intNumber group Integer, or group Float, because I'll use them separately as well (and not just for making the anyNumber regex) then I'll need to have capturing groups. In that case it would be great if I could have named capturing groups. See my question edit :) – capitan Oct 26 '16 at 09:24
  • You simply cannot define two named groups with the same name in a single `re` pattern. If you build `anyNumber` from your named patterns as `r"{0}|{1}|{2}".format(sciNumber,floatNumber,intNumber)`, you will get an exception. You may consider using the PyPI regex module, or give up on naming at such a deep level. – Wiktor Stribiżew Oct 26 '16 at 09:48
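The exception mentioned in the comment above can be reproduced with a short sketch. It uses the question's named patterns (with the exponent part written as `[eE][-+]?`) and only records whether compilation fails; the variable `error_message` is introduced here for illustration.

```python
import re

intNumber = r"(?P<Integer>-?(0|[1-9]+[0-9]*))"
floatNumber = r"(?P<Float>" + intNumber + r"\.[0-9]+)"
sciNumber = r"(?P<Scientific>" + floatNumber + r"[eE][-+]?[0-9]+)"

# After formatting, "Integer" appears three times and "Float" twice in one pattern.
anyNumber = r"{0}|{1}|{2}".format(sciNumber, floatNumber, intNumber)

try:
    re.compile(anyNumber)
    error_message = None
except re.error as exc:
    # re refuses to compile a pattern that reuses a group name.
    error_message = str(exc)

print(error_message)
```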

I ended up doing the following:

import re

intNumber_re = r"(?P<Integer>-?(0|[1-9]+[0-9]*))" # Integer
floatNumber_re = r"(?P<Float>"+intNumber_re+r"\.[0-9]+)" # Float
sciNumber_re = r"(?P<Scientific>"+floatNumber_re+r"[eE][-\+]?[0-9]+)" # Scientific
groupNames_re = r'(\?P<Integer>)|(\?P<Float>)|(\?P<Scientific>)' # Matches the group-name prefixes
anyNumber_re = r"(?P<AnyNumber>{0}|{1}|{2})".format(re.sub(groupNames_re,'?:',sciNumber_re),
               re.sub(groupNames_re,'?:',floatNumber_re),re.sub(groupNames_re,'?:',intNumber_re)) # Any number

Effectively I'm removing the group names (regex for those is in groupNames_re) when I construct the anyNumber RE with the re.sub() functions. It is a bit ugly but it works and gives me the flexibility that I want. Thanks Wiktor for your input, I ended up using a bit of your code :)
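A possible variation on the same idea, offered only as a sketch: keep the unnamed fragments as the single place to maintain each sub-pattern, and add a name only where a pattern is used on its own. The helper `named` and the constants `INT`, `FLOAT` and `SCI` are hypothetical names introduced here, not something from the answers above.

```python
import re

# Unnamed building blocks: the single place to maintain each sub-pattern.
INT = r"-?(?:0|[1-9]+[0-9]*)"
FLOAT = INT + r"\.[0-9]+"
SCI = FLOAT + r"[eE][-+]?[0-9]+"

def named(name, body):
    """Wrap a pattern fragment in a named capturing group (hypothetical helper)."""
    return "(?P<{0}>{1})".format(name, body)

# Each public pattern carries exactly one name, so no group is ever redefined.
intNumber_re = named("Integer", INT)
floatNumber_re = named("Float", FLOAT)
sciNumber_re = named("Scientific", SCI)
anyNumber_re = named("AnyNumber", "|".join([SCI, FLOAT, INT]))

match = re.search(anyNumber_re, "value = 12.34E-34")
print(match.group("AnyNumber"))                             # 12.34E-34
print(re.fullmatch(floatNumber_re, "-0.5").group("Float"))  # -0.5
```

The trade-off is that the nested names are gone: a match of `floatNumber_re` exposes only the Float group, not an inner Integer group, so this only fits if the inner names are not needed.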

capitan