
I'm trying to write some regex that will allow me to do a negative lookbehind on a capture group so that I can extract possible references from emails. I need to know how to look behind from a certain point to the first white space. If a digit is found, I don't want the reference to be extracted.

I have got as far as shown below. I have 2 capture groups - 'PreRef' and 'Ref'. I don't want a 'Ref' match to be found if 'PreRef' contains a digit. What I've got so far only checks if the character immediately before the colon is a digit.

(?<PreRef>\S+)(?<![\d]):(?<Ref>\d{5})

A 'Ref' match of 12345 should be found here:

This is a reference:12345

But not here (there's a 5 in the word 'reference'):

This is not a ref5rence:12345
Not a real meerkat
Lank

3 Answers

3

You can exclude digits from the \S class, then surround the expression
with whitespace boundaries, and voilà:

(?<!\S)(?<PreRef>[^\s\d]+):(?<Ref>\d{5})(?!\S)

https://regex101.com/r/JrU7Kd/1

Explained

 (?<! \S )                     # Whitespace boundary
 (?<PreRef> [^\s\d]+ )         # (1), Not whitespace nor digit
 :                             # Colon
 (?<Ref> \d{5} )               # (2), Five digits
 (?! \S )                      # Whitespace boundary
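
As a quick check, here is a minimal JavaScript sketch of the pattern above. It assumes an engine that supports lookbehind and named groups (recent Node.js or a modern browser); `extractRef` is a hypothetical helper name, not something from the question.

```javascript
// The pattern from this answer, written as a JavaScript regex literal.
// Assumes lookbehind and named capture groups are available.
const refPattern = /(?<!\S)(?<PreRef>[^\s\d]+):(?<Ref>\d{5})(?!\S)/;

// Hypothetical helper: returns the five-digit reference,
// or null when the token before the colon contains a digit.
function extractRef(text) {
  const m = refPattern.exec(text);
  return m ? m.groups.Ref : null;
}

console.log(extractRef("This is a reference:12345"));     // 12345
console.log(extractRef("This is not a ref5rence:12345")); // null
console.log(extractRef("This is 1 reference:12345"));     // 12345
```

The last call shows that a digit earlier in the sentence does not block the match; only the whitespace-delimited token directly before the colon is checked.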
2

Do you need a negative lookbehind? It's easier to just exclude digits from the PreRef capture. [^\W\d] will match word characters but not digits. Then you just need to add a \b or other similar word boundary assertion to make sure what does match is a full word.

\b(?<PreRef>[^\W\d]+):(?<Ref>\d{5})
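
A small JavaScript sketch of this pattern, assuming an engine with named capture groups; `extractWordRef` is a hypothetical helper name used only for illustration.

```javascript
// The \b-based pattern from this answer as a JavaScript regex literal.
const wordRefPattern = /\b(?<PreRef>[^\W\d]+):(?<Ref>\d{5})/;

// Hypothetical helper: returns the captured reference or null.
function extractWordRef(text) {
  const m = wordRefPattern.exec(text);
  return m ? m.groups.Ref : null;
}

console.log(extractWordRef("This is a reference:12345"));     // 12345
console.log(extractWordRef("This is not a ref5rence:12345")); // null
```

Note that this pattern has no assertion after the digits, so a longer run such as `reference:123456` would still yield its first five digits. Appending `\b` or `(?!\d)` is one way to tighten that, if it matters for your data.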
John Kugelman
1

I agree with John. If digits are not allowed anywhere on the line before the :, a simpler expression works:

^\D+:(\d{5})

or:

^\D+:(\d{5})$

If we wish to add more boundaries, we can surely do that too.


Test

const regex = /^\D+:(\d{5})/gm;
const str = `This is a reference:12345
This is not a ref5rence:12345`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
Emma
    This solution didn't quite work for me because it prevents the reference being extracted where there's a digit anywhere before it, i.e. earlier in the sentence, e.g. 'This is 1 reference:12345'. – Lank Jun 03 '19 at 15:30
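
The behaviour described in this comment can be reproduced with a short JavaScript sketch. It compares the line-anchored pattern with a token-bounded one of the kind suggested in the other answers; the variable names are illustrative only, and lookbehind support is assumed.

```javascript
// Line-anchored pattern: \D+ must cover everything from the start of the line.
const lineAnchored = /^\D+:(\d{5})/;
// Token-bounded pattern: only the token directly before the colon is checked.
const tokenBounded = /(?<!\S)[^\s\d]+:(\d{5})(?!\S)/;

const sample = "This is 1 reference:12345";

const anchoredResult = lineAnchored.exec(sample);
const boundedResult = tokenBounded.exec(sample);

console.log(anchoredResult);                          // null
console.log(boundedResult ? boundedResult[1] : null); // 12345
```

The anchored pattern fails because `\D+` cannot step over the `1` earlier in the sentence, whereas the bounded pattern starts matching at `reference`.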