I need to emulate the behavior of \b at the start of a string, where I'm adding additional characters to the set that count as a word boundary. Right now I'm using something like:

"(?<=\\W|\\p{InCJKUnifiedIdeographs})foo"

This works as I would like, except when foo is at the very start of the string being matched: there the lookbehind has no preceding character to test, so the assertion fails and I don't get a hit. What I want is for the pattern to match if foo is at the start of the string, or is preceded by a non-word character or an ideograph. But I can't get the right incantation to support that.

Any thoughts? Or is this impossible?

Thanks in advance.

TreeRex
  • What do you mean by match if i am at the start of the string? That would capture all strings because all strings have a 'start of string' – Jaskirat Jan 11 '11 at 17:51
  • It doesn't: if I use the aforementioned regex against the string "foo foobar baz" it will *not* find 'foo' because the look behind fails. – TreeRex Jan 11 '11 at 18:06
  • In most cases, you can get what you want by reversing the condition: `(?<![\w\P{InCJKUnifiedIdeographs}])`. I'd add it as an answer, but I don't have time to test it. – Kobi Jan 11 '11 at 21:03

1 Answer

"(?<=^|\\W|\\p{InCJKUnifiedIdeographs})foo"

Just add the start-of-string anchor to the lookbehind conditions.
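
To make the difference concrete, here is a minimal Java sketch comparing the pattern from the question with the one above. The class name `BoundaryLookbehindDemo` and the helper `matchStarts` are hypothetical names used only for illustration, and the test strings (including the one from the comments) are just sample input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryLookbehindDemo {
    // Pattern from the question: the lookbehind needs a preceding character,
    // so it can never succeed at index 0.
    static final Pattern WITHOUT_ANCHOR =
            Pattern.compile("(?<=\\W|\\p{InCJKUnifiedIdeographs})foo");

    // Pattern from the answer: ^ is added as a zero-width alternative.
    static final Pattern WITH_ANCHOR =
            Pattern.compile("(?<=^|\\W|\\p{InCJKUnifiedIdeographs})foo");

    // Hypothetical helper: collects the start offset of every match.
    static List<Integer> matchStarts(Pattern pattern, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String input = "foo foobar baz";
        System.out.println(matchStarts(WITHOUT_ANCHOR, input)); // [4]
        System.out.println(matchStarts(WITH_ANCHOR, input));    // [0, 4]
        // U+4E2D is a CJK unified ideograph; "xfoo" must not match.
        System.out.println(matchStarts(WITH_ANCHOR, "\u4e2dfoo xfoo")); // [1]
    }
}
```

Without the anchor only the `foo` inside `foobar` is found; with it, the leading `foo` is found as well.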

RobertB
  • Thanks Robert, that works like a charm. Somehow in the various combinations I experimented with I didn't try the most obvious. – TreeRex Jan 11 '11 at 18:14
  • Adding a caret leads to an error in my case `((?<=^| )is(?= |$)` https://regex101.com/r/vD5iH9/21 – Sashko Lykhenko Jan 30 '15 at 16:20
  • @СашкоЛихенко That's a limitation of Python's regex engine. It allows only "fixed width" look-behinds, and the length of `^` (zero/null?/NaN?) is obviously different than ` ` (one). – RobertB Feb 09 '15 at 16:20
  • @СашкоЛихенко See if using the word-boundary match will work for you, e.g. `\bis\b` or something like that. – RobertB Feb 09 '15 at 16:30
  • Thanks, this was not very easy to find in documents... I tried variants like `(?<=\\^|\\W...` or `(?<=(\\^|\\W...` etc. – turingtested Aug 15 '17 at 06:45
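
For engines that only accept fixed-width lookbehind, as described in the comments above, one possible workaround is to invert the condition into a negative lookaround on a single character: "not preceded by a non-space" holds both at the start of the input and after a space. The sketch below is written in Java to stay consistent with the rest of the thread; the class name `FixedWidthBoundaryDemo` and the helper `matchStarts` are hypothetical, and the pattern is an illustration of the idea rather than something posted by the commenters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FixedWidthBoundaryDemo {
    // (?<![^ ]) : not preceded by a non-space (true at start of input or after a space)
    // (?![^ ])  : not followed by a non-space (true at end of input or before a space)
    // Each lookaround inspects exactly one character, so both are fixed-width.
    static final Pattern STANDALONE_IS = Pattern.compile("(?<![^ ])is(?![^ ])");

    // Hypothetical helper: collects the start offset of every match.
    static List<Integer> matchStarts(String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = STANDALONE_IS.matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // "this" and "it" are skipped; the standalone "is" at 0, 8 and 14 match.
        System.out.println(matchStarts("is this is it is")); // [0, 8, 14]
    }
}
```

Because the negative lookbehind never needs an alternation of different lengths, the same pattern text should also be usable in engines with the fixed-width restriction.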