I have a string and a list of names. I compare the string against the list using preg_match_all, which returns the matches. However, some names in the list are a first name or a last name only, while others are full names. See my example below.

$names = 'jon|jon snow|lana|smith|lana smith|megan';
$string = 'Jon Snow and Lana Smith met up with Lana and Megan.';
preg_match_all("~\b($names)\b~i", $string, $matches);

With my current expression, the example above returns the single-word names (Jon, Lana, Smith, Lana, Megan) and never the full names, which isn't what I want.

What I want returned: jon snow, lana smith, lana, megan.

What I don't want returned: jon, smith

Draken
Jesse
  • Why do you have names you don't want in the `$names`? – 4castle Jul 21 '16 at 04:26
  • I would like to see the real-world application. –  Jul 21 '16 at 04:32
  • I was going to suggest that you somehow remove non name words, then split on a regex of more than one space. What you would be left with would be a name. But alas, I don't see an easy way to distinguish between name nouns and other types of nouns. – Tim Biegeleisen Jul 21 '16 at 04:35
  • @4castle All names in $names are wanted. I'm checking model names against content. Some models' names are a single word only. A real-world example would be the model "Kat" and the model "Kat Dior": two separate individuals. – Jesse Jul 21 '16 at 05:28
  • Just to expand on that given real world example. If I have content that contains the words "Kat Dior", both "Kat" and "Kat Dior" would be returned as matches. Which is the problem. But "Kat" is still very much a wanted name to search for. – Jesse Jul 21 '16 at 05:36
  • That still makes little sense in a real-world scenario. Simply search for Kat or Dior. –  Jul 21 '16 at 05:48
  • You have to order your alternation differently, put the longest possible patterns first. – Sebastian Proske Jul 21 '16 at 06:10
  • @Dagon That would return incorrect matches for way too many models. I have an array of model names; all names are wanted, and some names are one word while others are not. There are hundreds of models and thousands of strings. "Kat" is a model and "Kat Dior" is another. A string containing the words "Kat Dior" should ONLY return Kat Dior as a match, because Kat and Kat Dior are two different people. That said, if both are found separately in a single string, both should be returned. Does that make more sense? It needs to be precise given the amount of content it will sift through. – Jesse Jul 21 '16 at 06:21
  • @SebastianProske I thought of something along those lines but wasn't sure how to do it. Any advice would be appreciated. – Jesse Jul 21 '16 at 06:23
  • @Jesse for your sample: `jon snow|lana smith|jon|lana|smith|megan` – Sebastian Proske Jul 21 '16 at 06:25
  • Oh I see, I didn't think of it like that. So by ordering the list of names by longest to shortest I can match the longest sets of names first. Thank you! – Jesse Jul 21 '16 at 06:31

1 Answer

It seems you're looking for negative lookaround assertions.

For example, jon(?! snow) matches "jon", but only if " snow" does not follow.

$names = 'jon(?! snow)|jon snow|lana(?! smith)|(?<!lana )smith|lana smith|megan';

Test it live on regex101.com.
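
As a quick check, the lookaround pattern can be run against the string from the question. This is a minimal sketch that reuses the question's own variable names and delimiters; nothing else is assumed:

```php
<?php
// Names from the question, with lookarounds guarding the single-word entries
// that also appear as part of a full name.
$names  = 'jon(?! snow)|jon snow|lana(?! smith)|(?<!lana )smith|lana smith|megan';
$string = 'Jon Snow and Lana Smith met up with Lana and Megan.';

// ~ as delimiter, \b for whole-word matches, i for case-insensitive matching.
preg_match_all("~\b($names)\b~i", $string, $matches);

// $matches[0] holds the full matches in order of appearance.
echo implode(', ', $matches[0]), "\n";
// Jon Snow, Lana Smith, Lana, Megan
```

The drawback is that every single-word name needs a lookaround for each full name it is part of, which gets hard to maintain by hand for a long list.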

Another possibility - less explicit but with comparable results - is to ensure that the "composite" terms are tested first:

$names = 'jon snow|jon|lana smith|lana|smith|megan';

Test it live on regex101.com.
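
For a long list, the ordering does not have to be done by hand. The sketch below assumes the names are available as an array (`$nameList` is a hypothetical variable; the question only shows the pipe-delimited string), sorts them longest first, and builds the alternation from that:

```php
<?php
// Hypothetical input: the question's names as an array, in arbitrary order.
$nameList = ['jon', 'jon snow', 'lana', 'smith', 'lana smith', 'megan'];
$string   = 'Jon Snow and Lana Smith met up with Lana and Megan.';

// Sort longest first so full names are tried before the names they contain.
usort($nameList, function ($a, $b) {
    return strlen($b) - strlen($a);
});

// Escape each name in case it contains regex metacharacters or the delimiter.
$quoted = array_map(function ($name) {
    return preg_quote($name, '~');
}, $nameList);

preg_match_all('~\b(' . implode('|', $quoted) . ')\b~i', $string, $matches);

echo implode(', ', $matches[0]), "\n";
// Jon Snow, Lana Smith, Lana, Megan
```

This works because the regex engine tries alternatives from left to right and takes the first one that succeeds, so "jon snow" is consumed as a whole before "jon" gets a chance to match.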

Tim Pietzcker
  • Thanks Tim, this is exactly what I was looking for. Just didn't know how to word it. Works well. – Jesse Jul 21 '16 at 18:06