
I'm trying to do something fairly simple with regular expressions in Python... or so I thought.

I want to match a word in a string only if it is preceded and followed by whitespace. If the word is at the beginning of the string, no whitespace is required before it; if it is at the end, none is required after it.

Example:

"WordA WordB WordC-WordD WordE"

I want to match WordA, WordB and WordE.

I only came up with an overcomplicated way of doing this:

(?<=(?<=^)|(?<=\s))\w+(?=(?=\s)|(?=$))

It seems to me there has to be a simpler way to solve such a simple problem. I figured I could just start with (?<=\s|^), but that doesn't seem possible because "look-behind requires fixed-width pattern".
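The error quoted above can be reproduced with a short check. This is only a sketch: the helper name `compiles` is made up for illustration, and the behavior shown is what the standard `re` module does on the CPython versions I am aware of, where the branches of a lookbehind must all have the same width.

```python
import re

def compiles(pattern: str) -> bool:
    """Hypothetical helper: report whether `re` accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised, for example, when a lookbehind has branches of different widths.
        return False

# \s is one character wide, ^ is zero wide, so the lookbehind is rejected.
simple_ok = compiles(r"(?<=\s|^)\w+")

# Every branch here is itself a zero-width assertion, so the widths agree.
nested_ok = compiles(r"(?<=(?<=^)|(?<=\s))\w+(?=(?=\s)|(?=$))")

print(simple_ok, nested_ok)
```

This is why the nested version compiles while the shorter one does not: wrapping each alternative in its own lookbehind makes both branches zero-width.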

SyntaxError

1 Answer


You seem to be working in Python, since (?<=^|\s) is perfectly valid in PCRE, Java and Ruby (and the .NET regex engine supports infinite-width lookbehind patterns).

Use

(?<!\S)\w+(?!\S)

It matches one or more word characters that are surrounded by whitespace or by the start/end of the string.

See the regex demo.

Pattern details:

  • (?<!\S) - a negative lookbehind that fails the match once the engine finds a non-whitespace char immediately to the left of the current location
  • \w+ - 1 or more word chars
  • (?!\S) - a negative lookahead that fails the match once the engine finds a non-whitespace char immediately to the right of the current location.
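As a quick check in Python, the pattern can be run against the example string from the question. This is a minimal sketch; the variable names are only illustrative.

```python
import re

text = "WordA WordB WordC-WordD WordE"

# (?<!\S): no non-whitespace char immediately to the left
# \w+    : one or more word chars
# (?!\S) : no non-whitespace char immediately to the right
pattern = re.compile(r"(?<!\S)\w+(?!\S)")

matches = pattern.findall(text)
print(matches)  # ['WordA', 'WordB', 'WordE']
```

WordC and WordD are skipped because each has the hyphen, a non-whitespace char, on one side.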
Wiktor Stribiżew
  • that makes sense! Thanks. I guess searching for nonwhitespace instead of whitespace is much easier. – SyntaxError Jul 19 '17 at 12:18
  • Not sure it is easier, but it is more efficient. – Wiktor Stribiżew Jul 19 '17 at 12:22
  • I don't understand why simply \s+ surrounding what we need does not work – B Furtado Jul 30 '21 at 15:17
  • @BFurtado Because `\s` consumes a whitespace. Look at [this demo](https://regex101.com/r/EwvuHA/47): there is only one match because the `\s` on both ends *requires* a whitespace on the left and right. `WordA` and `WordE` have no whitespace on one end. You might think `(\s|^)\w+(\s|$)` will work, but [it does not match consecutive occurrences](https://regex101.com/r/EwvuHA/48) because `(\s|$)` consumes the whitespace after `WordA` and thus `(\s|^)` cannot find the `WordB` match. – Wiktor Stribiżew Jul 30 '21 at 15:53
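The difference between consuming whitespace and merely asserting it can be sketched in Python with the example string from the question. The variable names are illustrative, and the alternation groups are written as non-capturing `(?:...)` so that `findall` returns whole matches.

```python
import re

text = "WordA WordB WordC-WordD WordE"

# \s on both sides consumes a whitespace char on each end.
both_sides = re.findall(r"\s\w+\s", text)
print(both_sides)  # [' WordB ']

# Allowing ^ and $ helps at the edges, but the space after WordA is
# consumed by the first match, so WordB has nothing left to start from.
with_anchors = re.findall(r"(?:\s|^)\w+(?:\s|$)", text)
print(with_anchors)  # ['WordA ', ' WordE']

# Lookarounds only check the neighbors and consume nothing.
lookarounds = re.findall(r"(?<!\S)\w+(?!\S)", text)
print(lookarounds)  # ['WordA', 'WordB', 'WordE']
```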
  • Thank you very much @WiktorStribiżew. I have struggled with regex countless times. The official documentation https://docs.python.org/3/howto/regex.html says nothing about consuming space. There is a rather cryptic mention of zero-width (which seems out of context to me, but may resemble what you are kindly explaining). Best, – B Furtado Jul 30 '21 at 16:05
  • @BFurtado I will try to explain it on my YouTube channel and share a link with you (the channel link is in my profile). – Wiktor Stribiżew Jul 30 '21 at 16:06