I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex, so that the result matches only a subset of the original matches.

Take the following example:

The original regex is cat|car|bat, so the matching output is:

cat
car
bat

I want to add to this regex so that it outputs only the matches that start with 'ca':

cat
car

I specifically don't want to parse the whole regex (which could be quite a long operation) and then change its internal content to produce the output, as in:

^ca[tr]

or run the original regex and then the second one over the results. I'm taking the original regex as an argument in Python but want to 'prefilter' the matches by adding the additional pattern to it.

This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.

Things I've tried:

^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))

It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.

A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches against 6-letter strings starting with ca, not 4-letter ones.
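
The behaviour of these attempts can be reproduced with a short Python sketch. The test strings are made up for illustration, and Python's `re` module is assumed as the engine; other engines may treat lookbehinds differently.

```python
import re

# First attempt: the literal "ca" is consumed *before* the original pattern,
# so the subject must be "ca" followed by one of the three words.
prepended = re.compile(r"^ca(cat|car|bat)")
cat_result = prepended.search("cat")      # no match: "cat" is not "ca" + word
cacat_result = prepended.search("cacat")  # matches the 5-letter string

# Lookbehind attempt: it asserts that "ca" plus 4 letters (6 characters)
# lie behind the current position, and it consumes nothing itself.
lookbehind = re.compile(r"(?<=^ca([a-z]{4}))")
four_letters = lookbehind.search("cabc")   # only 4 characters available
six_letters = lookbehind.search("cabcde")  # zero-width match at index 6

print(cat_result, cacat_result)
print(four_letters, six_letters)
```

In both cases the added "ca" sits in front of the original pattern instead of overlapping with it, which is why the required length grows by two characters.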

Thanks for any solutions and/or opinions on it.

EDIT: See solution including @Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.

Halfcard

2 Answers

You were not far off with what you tried. You need a lookahead assertion rather than a lookbehind, and one parenthesis was misplaced. Put the original pattern in parentheses and prepend (?=ca):

(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})

In the second example, which has no top-level | alternation, the parentheses around the original pattern aren't required.
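
Since the original regex arrives as an argument in Python, the wrapping can be done mechanically. The following is only a minimal sketch: `with_prefix_filter` is a hypothetical helper name, and it assumes the prefix is literal text (hence `re.escape`). It uses a non-capturing group `(?:...)` instead of plain parentheses so that any group numbers inside the original pattern are not shifted.

```python
import re

def with_prefix_filter(original: str, prefix: str) -> re.Pattern:
    """Hypothetical helper: keep `original` intact, require `prefix` via lookahead."""
    # The lookahead checks the prefix without consuming it; the group keeps a
    # top-level `|` in the original from splitting off the lookahead.
    return re.compile(f"(?={re.escape(prefix)})(?:{original})")

words = ["cat", "car", "bat", "cab"]
pattern = with_prefix_filter("cat|car|bat", "ca")

# match() tries the pattern at the start of each word only.
matches = [w for w in words if pattern.match(w)]
print(matches)  # ['cat', 'car']
```

Because the lookahead consumes no characters, the original pattern still has to match from the same position, so "cab" is rejected even though it starts with "ca".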

Armali
  • Thanks. It looks like it's close to what I need, but the second solution, for example, matches cabcdefg, whereas I'm only looking for results within strings that match [a-z]{4}, i.e. strings 4 characters long. – Halfcard Nov 22 '19 at 11:50
  • You need to add anchors (`^` and `$`) to the start and end of the regex: https://regex101.com/r/ZFMu16/1 – Nick Nov 22 '19 at 12:05
  • Thanks Nick. Looks like we got there at the same time, but I'll tag this as the correct answer. – Halfcard Nov 22 '19 at 12:06

OK, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm testing this with the great exrex tool to generate matching strings, and it's producing strings that are 6 characters long rather than 4. This may be a limitation of exrex rather than of the regex, which seems to work in other cases.

See @Nick's comment.

I've also raised an issue on the exrex GitHub for this.
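
As a rough check of the anchored pattern in Python's own `re` module (not exrex), here is a small sketch. The candidate strings are invented for illustration, and the use of `fullmatch` is an alternative I'm assuming is acceptable when the whole string must match.

```python
import re

candidates = ["cabc", "cabcdefg", "bcab", "dogs"]

unanchored = re.compile(r"(?=ca)(?:[a-z]{4})")
anchored = re.compile(r"(?=ca)(?:^[a-z]{4}$)")

# search() may match anywhere, so "cabc" is found inside "cabcdefg".
found_unanchored = [s for s in candidates if unanchored.search(s)]
# With ^ and $ the whole string has to be exactly 4 letters.
found_anchored = [s for s in candidates if anchored.search(s)]
# fullmatch() gives the same restriction without editing the pattern.
found_fullmatch = [s for s in candidates if unanchored.fullmatch(s)]

print(found_unanchored)  # ['cabc', 'cabcdefg']
print(found_anchored)    # ['cabc']
print(found_fullmatch)   # ['cabc']
```

Using `fullmatch` leaves the supplied regex untouched, which may be convenient when the original pattern is passed in as an argument.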

Halfcard