Python regex: Lookbehind + Lookahead with characterset

Question

I would like to get the string 10M5D8P into a dictionary:

M:10, D:5, P:8 etc. ...

The string could be longer, but it's always a number followed by a single letter from this alphabet: MIDNSHP=X

As a first step I wanted to split the string with a lookbehind and lookahead, in both cases matching this regex: [0-9]+[MIDNSHP=X]

My current, non-working attempt looks like this:

import re

re.compile("(?<=[0-9]+[MIDNSHP=X])(?=[0-9]+[MIDNSHP=X])").split("10M5D8P")

It gives me an error message that I do not understand: "look-behind requires fixed-width pattern"

Avinash Raj · Accepted Answer · score 2 · 2015-07-20T04:39:09.853

You may use re.findall.

>>> import re
>>> s = "10M5D8P"
>>> {i[-1]:i[:-1] for i in re.findall(r'[0-9]+[MIDNSHP=X]', s)}
{'M': '10', 'P': '8', 'D': '5'}
>>> {i[-1]:int(i[:-1]) for i in re.findall(r'[0-9]+[MIDNSHP=X]', s)}
{'M': 10, 'P': 8, 'D': 5}

Your regex won't work because the re module doesn't support variable-length lookbehind assertions. In addition, before Python 3.7 re.split did not split on zero-width matches, so splitting on (?<=\d)(?=[A-Z]) was not possible either.
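To see the lookbehind limitation concretely, here is a minimal sketch (not part of the original answer; the variable names are only for illustration). It compiles the pattern from the question, catches the resulting re.error, and then shows that a lookbehind of exactly one character compiles fine, even when the lookahead is variable-width:

```python
import re

# The pattern from the question: [0-9]+ inside the lookbehind can match
# one or more characters, so its width is not fixed.
try:
    re.compile(r"(?<=[0-9]+[MIDNSHP=X])(?=[0-9]+[MIDNSHP=X])")
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)

# A lookbehind that always matches exactly one character is accepted.
# The lookahead has no such restriction and may stay variable-width.
fixed = re.compile(r"(?<=[MIDNSHP=X])(?=[0-9]+[MIDNSHP=X])")
print(fixed.search("10M5D8P").start())
```

Only the lookbehind has to be fixed-width; the restriction does not apply to the lookahead.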

edited Jul 20 '15 at 04:39

answered Jul 20 '15 at 04:32

Avinash Raj

That's a nice solution, thanks! But still I'm very curious why mine doesn't work ... ? – user3182532 Jul 20 '15 at 04:36
See this [answer](http://stackoverflow.com/a/23782359/3451543): Python's re module, like the regex engines of most languages (with the notable exception of .NET), doesn't support variable-length lookbehind. – changhwan Jul 20 '15 at 04:37
And a question regarding your solution: Why doesn't it work anymore if I say myregex = re.compile("[0-9]+[MIDNSHP=X]") and give that as first argument to re.findall ? – user3182532 Jul 20 '15 at 04:41
Oh yes, sorry, my mistake ... used the up arrow keys and therefore redefined the string with an old (different) alphabet, didn't notice. That's why the result was an empty dict. Thanks again! – user3182532 Jul 20 '15 at 04:51

score 2 · Answer 2 · answered Jul 20 '15 at 04:41

"look-behind requires fixed-width pattern" means exactly what it says: in Python's re engine, a look-behind pattern must match a fixed number of characters. In particular, it may not contain variable-length quantifiers such as ?, + or *. Thus, we should pick a fixed-width piece to use as our lookbehind:

(?<=[MIDNSHP=X])(?=\d)

This uses just the single character as the lookbehind and a single digit as the lookahead. However, if you try to split with this expression on a Python version before 3.7, it will not work because of Python bug 3262 (re.split does not split on zero-width matches). There you need to use a workaround like this instead:

>>> re.compile(r"(?<=[MIDNSHP=X])(?=\d)").sub('|', '10M5D8P').split("|")
['10M', '5D', '8P']

but this is pretty ugly. A simpler solution is to use findall to extract what you want:

>>> re.findall('([0-9]+)([MIDNSHP=X])', '10M5D8P')
[('10', 'M'), ('5', 'D'), ('8', 'P')]

from which you can pretty easily create a dictionary:

>>> {k:int(v) for v,k in re.findall('([0-9]+)([MIDNSHP=X])', '10M5D8P')}
{'P': 8, 'M': 10, 'D': 5}
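For completeness, a hedged sketch of the split-based approach: as far as I know, since Python 3.7 re.split does split on zero-width matches, so the fixed-width pattern can be used directly. The helper name split_tokens and the version check are my own additions for illustration, and the fallback assumes that '|' never appears in the input:

```python
import re
import sys

# Zero-width split point: right after an operation letter, right before a digit.
BOUNDARY = re.compile(r"(?<=[MIDNSHP=X])(?=\d)")

def split_tokens(s):
    """Split a string such as '10M5D8P' into its number+letter tokens."""
    if sys.version_info >= (3, 7):
        # re.split handles empty matches from Python 3.7 onwards.
        return BOUNDARY.split(s)
    # Older versions: mark each boundary with '|' and split on that instead.
    return BOUNDARY.sub("|", s).split("|")

tokens = split_tokens("10M5D8P")
pairs = {t[-1]: int(t[:-1]) for t in tokens}
print(tokens)
print(pairs)
```

The findall version is still shorter and does not depend on the Python version, so I would keep this only if you specifically want the split tokens.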

I like this one too because it's more generic, i.e. the capture groups allow identifiers of different lengths (e.g. 10CG80M10GHJ). I don't need that in my case, but I might in the future. – user3182532 Jul 20 '15 at 04:55
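A minimal sketch of that more generic variant, assuming identifiers are runs of uppercase ASCII letters (the pattern [A-Z]+ and the function name parse_counts are assumptions for illustration, not from the answer; note that [A-Z] does not include the '=' from the original alphabet):

```python
import re

# Assumption: an identifier is one or more uppercase ASCII letters.
TOKEN = re.compile(r"([0-9]+)([A-Z]+)")

def parse_counts(s):
    """Map each identifier to the integer that precedes it."""
    # findall returns (number, identifier) tuples because of the two groups.
    return {name: int(count) for count, name in TOKEN.findall(s)}

result = parse_counts("10CG80M10GHJ")
print(result)
```

If the same identifier occurs more than once in the string, the dict keeps only the last count, so a different container would be needed in that case.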
