using regex to skip ahead all characters until a specific sequence of letters is found using negative lookahead

Question

I'm alright with basic regular expressions, but I get a bit lost around pos/neg look aheads/behinds.

I'm trying to pull the id # from this:

[keyword stuff=otherstuff id=123 morestuff=stuff]

There could be unlimited amounts of "stuff" before or after. I've been using The Regex Coach to help debug what I've tried, but I'm not moving forward anymore...

So far I have this:

\[keyword (?:id=([0-9]+))?[^\]]*\]

Which takes care of any extra attributes after the id, but I can't figure out how to ignore everything between keyword and id. I know I can't go [^id]*. I believe I need to use a negative lookahead like this: (?!id)*, but I guess since it's zero-width, it doesn't move forward from there. This doesn't work either:

\[keyword[A-z0-9 =]*(?!id)(?:id=([0-9]+))?[^\]]*\]

I've been looking all over for examples, but haven't found any. Or perhaps I have, but they went so far over my head I didn't even realize what they were.

Help! Thanks.

EDIT: It has to match [keyword stuff=otherstuff] as well, where id= doesn't exist at all, so I have to have a 1 or 0 on the id # group. There are also other [otherkeywords id=32] which I do not want to match. The document needs to match multiple [keyword id=3] throughout the documents using preg_match_all.

The solutions provided work great and probably benchmark faster than using any type of lookahead, I'll definitely be doing it that way. But for my own curiosity, and perhaps anyone who hits this question with google in the distant future, is the method I attempted at getting at possible? That is, can lookaheads be used to skip some stuff until a particular word is hit? — phazei, Jul 20 '10 at 02:10

Wrikken · Accepted Answer · 2010-07-20T01:20:57.550

2

No lookahead/behind required:

/\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]/

Added the ending '[^\]]*?\]' to check for a real tag end; it could be unnecessary.

Edit: added the \b to id as otherwise it could match [keyword you-dont-want-this-guid=123123-132123-123 id=123]
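
To see what the \b buys you, here is a small sketch that runs the pattern with and without it against the string from the edit note. The variable names are made up for illustration.

```php
<?php
// Input from the edit note: "guid=" appears before the real "id=".
$subject = '[keyword you-dont-want-this-guid=123123-132123-123 id=123]';

// Same pattern, with and without the word boundary in front of "id=".
$withBoundary    = '/\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]/';
$withoutBoundary = '/\[keyword(?:[^\]]*?id=([0-9]+))?[^\]]*?\]/';

preg_match($withBoundary, $subject, $good);
preg_match($withoutBoundary, $subject, $bad);

echo $good[1], "\n"; // 123    (the real id)
echo $bad[1], "\n";  // 123123 (the digits right after "guid=")
```

Without the \b, the lazy run stops at the first "id=" it can find, which is the tail of "guid=".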

$ php -r 'preg_match_all("/\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]/","[keyword stuff=otherstuff morestuff=stuff]",$matches);var_dump($matches);'
array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(42) "[keyword stuff=otherstuff morestuff=stuff]"
  }
  [1]=>
  array(1) {
    [0]=>
    string(0) ""
  }
}
$ php -r 'var_dump(preg_match_all("/\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]/","[keyword stuff=otherstuff id=123 morestuff=stuff]",$matches),$matches);'
int(1)
array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(49) "[keyword stuff=otherstuff id=123 morestuff=stuff]"
  }
  [1]=>
  array(1) {
    [0]=>
    string(3) "123"
  }
}

edited Jul 20 '10 at 01:20

answered Jul 20 '10 at 00:50

Wrikken


I was thinking that was working, but after testing it, it seems id isn't optional and it needs to be. – phazei Jul 20 '10 at 01:04
Oh, did not get that, will fix, – Wrikken Jul 20 '10 at 01:05
Fixed (in a non-capturing subpattern) – Wrikken Jul 20 '10 at 01:12
Tried it, but it doesn't get any matches on the id. – phazei Jul 20 '10 at 01:12
Seriously? (did 2 edits in quick succession 9 mins ago b.t.w, the first did indeed not work). What string doesn't match? Entered 2 teststrings which seem to work here. – Wrikken Jul 20 '10 at 01:22
Ah, sorry, I must have missed a \ or something. Just got home from work and tried it again, seems to hit right on :) Awesome, thanks! I'm not too sure I understand the first "[^]]*" and why it doesn't match until the last ]. I noticed that the ] from ^] can really be any character that's not used. – phazei Jul 20 '10 at 02:02
1

Be careful with that last remark: `[keyword ][keyword id=123]` will suddenly have only 1 match instead of 2 if you don't use [^\]]. It doesn't match until the last `]` because it's ungreedy (the `?`), so it stops matching as soon as what comes after matches the next part, which is also why you couldn't just put the whole \bid etc. in a non-required subpattern of its own. – Wrikken Jul 20 '10 at 02:14
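
A quick sketch of that last point, comparing the pattern from the answer with a variant whose character class excludes some unrelated character instead of `]`. The `~` is an arbitrary stand-in for "any character that's not used", and the variable names are hypothetical.

```php
<?php
$subject = '[keyword ][keyword id=123]';

// Pattern from the answer: the lazy runs cannot cross a "]".
$strict = '/\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]/';
// Variant: "~" is excluded instead, so the lazy run is free to walk
// past the first "]" while it searches for "id=".
$loose = '/\[keyword(?:[^~]*?\bid=([0-9]+))?[^~]*?\]/';

$strictCount = preg_match_all($strict, $subject, $strictMatches);
$looseCount  = preg_match_all($loose, $subject, $looseMatches);

echo $strictCount, "\n"; // 2: "[keyword ]" and "[keyword id=123]"
echo $looseCount, "\n";  // 1: both tags swallowed as a single match
```

So the lazy `?` alone isn't enough. The optional id group is tried first, and it will keep expanding until it finds an "id=" unless the class forbids crossing the closing bracket.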

Peter Ajtai · Answer 2 · 2010-07-20T04:08:04.597

2

You do not need look ahead / behind.

Since the question is tagged PHP, use preg_match_all() and store the match in $matches.

Here's how:

<?php

// Store the string. I single quote it, in case there are backslashes I
// didn't see.
$string = 'blah blah[keyword stuff=otherstuff id=123 morestuff=stuff]
           blah blah[otherkeyword stuff=otherstuff id=555 morestuff=stuff]
           blah blah[keyword stuff=otherstuff id=444 morestuff=stuff]';

// The pattern is '[keyword', followed by any number of characters that
// are not ']', then a space and 'id='.
// The space before id is important, so you don't catch 'guid', etc.
// If '[keyword' is always at the beginning of a line, you can use
// '^\[keyword'
$pattern = '/\[keyword[^\]]* id=([0-9]+)/';

// Find every single $pattern in $string and store it in $matches
preg_match_all($pattern, $string, $matches);

// The only tricky part you have to know is that each entire match is stored in
// $matches[0][x], and the part of the match in the parentheses, which is what
// you want, is stored in $matches[1][x]. The braces are optional, since it's
// only one line.
foreach ($matches[1] as $value)
{
    echo $value . "<br/>";
}
?>

Output:

123
444

( 555 is skipped, as it should be)

PS

You can also use \b instead of a literal space if there could be a tab instead. \b represents a word boundary... in this case the beginning of a word.

$pattern = '/\[keyword[^\]]*\bid=([0-9]+)/';

edited Jul 20 '10 at 04:08

answered Jul 20 '10 at 00:51

Peter Ajtai


That won't work, because I'm using preg_match_all on a large document that could have [otherkeyword id=324] which I can't match. Also, I have to match [keyword stuff=otherstuff] where there is no id. – phazei Jul 20 '10 at 01:05
@phazei Edited my answer to show multiple answers and ignore otherkeyword. – Peter Ajtai Jul 20 '10 at 01:22
Cool. You skipped everything after the id, though I need to keep that since I'm using it to replace the entire [keyword x=x] section, but that's no problem for me to change. I see that you fixed the biggest issue I was having the same way Wrikken did with [^]]* right after the keyword. Why does that work and not cause it to skip everything till the last "]"? – phazei Jul 20 '10 at 02:07
I skipped everything after the ID, since you said, "I'm trying to pull the id #" and the stuff after the ID isn't the ID #. '[^\]]*\bid=' means any number of things that aren't a close square bracket followed by a whitespace and 'id='.... so it can't skip till the last ']' due to it having to look for '\bid=' – Peter Ajtai Jul 20 '10 at 02:51
1

@Peter, `\b` doesn't match whitespace; you're thinking of `\s`. See here for what `\b` really does: http://www.regular-expressions.info/wordboundaries.html – Alan Moore Jul 20 '10 at 03:16
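
For anyone following along, a minimal sketch of the difference between `\b` and `\s`. The test strings are made up for illustration.

```php
<?php
// \b is zero-width: it consumes nothing, so it cannot stand in for the space.
$boundaryOnly = preg_match('/keyword\bid=/', 'keyword id=1'); // 0
// \s consumes exactly one whitespace character.
$whitespace   = preg_match('/keyword\sid=/', 'keyword id=1'); // 1

// \b holds after any non-word character, not only after whitespace...
$afterHyphenB = preg_match('/\bid=/', 'data-id=1'); // 1
$afterHyphenS = preg_match('/\sid=/', 'data-id=1'); // 0
// ...and it fails in the middle of a word such as "guid".
$insideWord   = preg_match('/\bid=/', 'guid=1');    // 0

var_dump($boundaryOnly, $whitespace, $afterHyphenB, $afterHyphenS, $insideWord);
```

Note the hyphen case: with `\b`, an attribute like `data-id=1` would still be picked up as an id, so `\s` is the stricter choice if that matters.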

score 0 · Answer 3 · answered Jul 20 '10 at 04:37

I think this is what you're getting at:

\[keyword(?:\s+(?!id\b)[A-Za-z]+=[^\]\s]+)*(?:\s+id=([0-9]+))?[^\]]*\]

(I'm assuming attribute names can only contain ASCII letters, while the values can contain any non-whitespace character except ].)

(?:\s+(?!id\b)[A-Za-z]+=[^\]\s]+)* matches any number of attribute=value pairs (and the whitespace preceding them), as long as the attribute name isn't id. The \b (word boundary) is there just in case there are attribute names that start with id, like idiocy. There's no need to put a \b in front of the attribute name this time, because you know any name it matches will be preceded by whitespace. But, as you've learned, the lookahead approach is overkill in this case.
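
Here is a rough sketch of that pattern run through preg_match_all against the cases from the question's edit. The sample document and variable names are invented for the test, and the same assumptions about attribute names and values apply.

```php
<?php
$pattern = '/\[keyword(?:\s+(?!id\b)[A-Za-z]+=[^\]\s]+)*(?:\s+id=([0-9]+))?[^\]]*\]/';

// Made-up sample covering: id in the middle, no id at all,
// a different keyword, and an attribute name that starts with "id".
$document = 'a [keyword stuff=otherstuff id=123 morestuff=stuff] '
          . 'b [keyword stuff=otherstuff] '
          . 'c [otherkeyword id=32] '
          . 'd [keyword idiocy=1 id=7]';

$count = preg_match_all($pattern, $document, $matches);

echo $count, "\n";     // 3 (the [otherkeyword ...] tag is not matched)
print_r($matches[1]);  // '123', '' (no id in that tag), '7'
```

The lookahead doesn't skip anything by itself. It only vetoes an attribute name at the current position, and the `[A-Za-z]+=[^\]\s]+` after it is what actually moves forward.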

Now, about this:

[A-z0-9 =]

That A-z is either a typo or an error. If you're expecting it to match all uppercase and lowercase letters, well, it does. But it also matches

'[', ']', '^', '_', '`' and '\'

...because their code points lie between those of the uppercase letters and the lowercase letters. ASCII letters, that is.
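
A short sketch to check this yourself. The test string holds just the six ASCII characters that sit between 'Z' and 'a', and the variable names are hypothetical.

```php
<?php
// The six ASCII characters between "Z" (90) and "a" (97): [ \ ] ^ _ `
$between = '[\\]^_`';

// A-z spans the whole range, so every one of them matches.
$looseCount  = preg_match_all('/[A-z]/', $between, $looseMatches);
// A-Za-z covers the letters only, so none of them match.
$strictCount = preg_match_all('/[A-Za-z]/', $between, $strictMatches);

echo $looseCount, "\n";  // 6
echo $strictCount, "\n"; // 0
```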
