Getting all names from regex

Question

I have made a regex for all kinds of names in a string:

$nameRegex = "/[A-Z-ÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ']" .
    "[.A-Z-ÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽa-z-àáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšž']" .
    '+\b(?: \b' .
    "[A-Z-ÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ']?[van|de]" .
    "[A-Z-ÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽa-z-àáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšž']+\b)*/u";

I am trying to match all non-standard cases such as:

John Doe waves                        | John Doe
Bakary N'Diaye says hello             | Bakary N'Diaye
Iván Aguilar goes well                | Iván Aguilar
Cisteró shot                          | Cisteró
Dan I Soylu shots                     | Dan I Soylu
Mike van der Hoorn with a cross       | Mike van der Hoorn
M.J. Williams takes a shot            | M.J. Williams
Donny van de Beek left foot           | Donny van de Beek
Mike van der Hoorn hello              | Mike van der Hoorn
Artak G. Grigoryan with through ball  | Artak G. Grigoryan
Trent Alexander-Arnold after a break  | Trent Alexander-Arnold

However, my regex does a poor job of matching these names; you can see it in action at https://regexr.com/4qgbt.

How can I improve my regex so that it catches all the names? (The names are at the beginning of the sentences.)

My name is `John Doe waves`, why don't you accept it? Have a look at https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/ — Toto, Dec 10 '19 at 16:22
Matching **all** names is going to be **a lot** more difficult than you think; see https://shinesolutions.com/2018/01/08/falsehoods-programmers-believe-about-names-with-examples/ — IMSoP, Dec 10 '19 at 16:23
If the name is guaranteed to be at the start of the string then don't forget to anchor your regex with `^` — MonkeyZeus, Dec 10 '19 at 16:26
@Toto By `John Doe waves`, meant the guy's name is John Doe. — senty, Dec 10 '19 at 16:28
The guy's name is, may be, `John Doe` but **my** name is `John Doe waves` — Toto, Dec 10 '19 at 16:29
@senty You missed the point: if someone's name is "John Doe waves", your program will misname them as "John Doe", assuming the "waves" is not part of the name. A more plausible example might be "John Pleads the Third", which could quite reasonably be the third generation of men named "John Pleads", or a headline about someone called John invoking the Third Amendment of the US Constitution. Or perhaps a sentence starting with "May", which may or may not be a name... — IMSoP, Dec 10 '19 at 16:31
As others have mentioned, this is not practical, but `(?:(?:\p{Lu}\p{L}*|van|der?)[.' -]*)+` — ctwheels, Dec 10 '19 at 16:37
Hmm, thanks for your inputs - I see your point. I'll check other approaches — senty, Dec 10 '19 at 16:37

Emma · Accepted Answer · 2019-12-10T21:13:34.663

An expression similar to

^([\p{L} '.-]+?)(?:\s[a-z]+)*\h*$

might be worth looking into (with preg_match_all). It has two groups: the capturing group on the left collects the name, and the non-capturing group on the right consumes the lowercase words that follow, which we are not interested in. Note that the test below ends the pattern with \s*$ rather than \h*$, so the last match also swallows the final newline.

RegEx Demo 1

Test 1

// m: ^ and $ match at every line; u: treat pattern and subject as UTF-8,
// so that \p{L} also matches accented letters such as á and ó.
$re = '/^([\p{L} \'.-]+?)(?:\s[a-z]+)*\s*$/mu';
$str = 'John Doe waves
Bakary N\'Diaye says hello
Iván Aguilar goes well
Cisteró shot
Dan I Soylu shots
Mike van der Hoorn with a cross
M.J. Williams takes a shot
Donny van de Beek left foot
Mike van der Hoorn hello
Artak G. Grigoryan with through ball
Trent Alexander-Arnold after a break
';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);

Output 1

array(11) {
  [0]=>
  array(2) {
    [0]=>
    string(14) "John Doe waves"
    [1]=>
    string(8) "John Doe"
  }
  [1]=>
  array(2) {
    [0]=>
    string(25) "Bakary N'Diaye says hello"
    [1]=>
    string(14) "Bakary N'Diaye"
  }
  [2]=>
  array(2) {
    [0]=>
    string(23) "Iván Aguilar goes well"
    [1]=>
    string(13) "Iván Aguilar"
  }
  [3]=>
  array(2) {
    [0]=>
    string(13) "Cisteró shot"
    [1]=>
    string(8) "Cisteró"
  }
  [4]=>
  array(2) {
    [0]=>
    string(17) "Dan I Soylu shots"
    [1]=>
    string(11) "Dan I Soylu"
  }
  [5]=>
  array(2) {
    [0]=>
    string(31) "Mike van der Hoorn with a cross"
    [1]=>
    string(18) "Mike van der Hoorn"
  }
  [6]=>
  array(2) {
    [0]=>
    string(26) "M.J. Williams takes a shot"
    [1]=>
    string(13) "M.J. Williams"
  }
  [7]=>
  array(2) {
    [0]=>
    string(27) "Donny van de Beek left foot"
    [1]=>
    string(17) "Donny van de Beek"
  }
  [8]=>
  array(2) {
    [0]=>
    string(24) "Mike van der Hoorn hello"
    [1]=>
    string(18) "Mike van der Hoorn"
  }
  [9]=>
  array(2) {
    [0]=>
    string(36) "Artak G. Grigoryan with through ball"
    [1]=>
    string(18) "Artak G. Grigoryan"
  }
  [10]=>
  array(2) {
    [0]=>
    string(37) "Trent Alexander-Arnold after a break
"
    [1]=>
    string(22) "Trent Alexander-Arnold"
  }
}
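
Since the names contain accented letters, it may help to see what the u (UTF-8) modifier changes for \p{L}. The following is only a small sketch: the variable names are made up for illustration, and the á is written as the escape "\u{E1}" (PHP 7.0+) so the subject is UTF-8 however the file is saved. Without u, PCRE reads the subject byte by byte, so the two bytes of á are tested separately and the second one is not a letter; I would therefore expect the first call to find no match and the second to capture the name.

```php
<?php
// Same pattern twice; the only difference is the trailing "u" modifier.
$withoutU = '/^([\p{L} \'.-]+?)(?:\s[a-z]+)*\s*$/';
$withU    = '/^([\p{L} \'.-]+?)(?:\s[a-z]+)*\s*$/u';

// "\u{E1}" is "á" written as an escape sequence, so the string is UTF-8
// regardless of the encoding of this source file.
$line = "Iv\u{E1}n Aguilar goes well";

$countWithoutU = preg_match($withoutU, $line, $m1);
$countWithU    = preg_match($withU, $line, $m2);

var_dump($countWithoutU);   // number of matches without "u"
var_dump($countWithU);      // number of matches with "u"
var_dump($m2[1] ?? null);   // captured name when the "u" pattern matches
```

The same consideration applies to any of the patterns here that rely on \p{L} or \p{Lu}.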

On the left side of each input line there is no problem, because every line starts with a name. On the right side, though, there is a run of lowercase words separated by spaces. We can write one sub-expression to find those, for instance as a positive lookahead:

(?=(?:\s[a-z]+)*\h*$)

then with a second statement,

^[\p{L} '.-]+?

we'd collect the names, and our final expression would become:

^[\p{L} '.-]+?(?=(?:\s[a-z]+)*\h*$)

RegEx Demo 2 with positive lookahead

Test 2

// m: ^ and $ match at every line; u: treat pattern and subject as UTF-8,
// so that \p{L} also matches accented letters such as á and ó.
$re = '/^[\p{L} \'.-]+?(?=(?:\s[a-z]+)*\h*$)/mu';
$str = 'John Doe waves
Bakary N\'Diaye says hello
Iván Aguilar goes well
Cisteró shot
Dan I Soylu shots
Mike van der Hoorn with a cross
M.J. Williams takes a shot
Donny van de Beek left foot
Mike van der Hoorn hello
Artak G. Grigoryan with through ball
Trent Alexander-Arnold after a break
';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);

Output 2

array(11) {
  [0]=>
  array(1) {
    [0]=>
    string(8) "John Doe"
  }
  [1]=>
  array(1) {
    [0]=>
    string(14) "Bakary N'Diaye"
  }
  [2]=>
  array(1) {
    [0]=>
    string(13) "Iván Aguilar"
  }
  [3]=>
  array(1) {
    [0]=>
    string(8) "Cisteró"
  }
  [4]=>
  array(1) {
    [0]=>
    string(11) "Dan I Soylu"
  }
  [5]=>
  array(1) {
    [0]=>
    string(18) "Mike van der Hoorn"
  }
  [6]=>
  array(1) {
    [0]=>
    string(13) "M.J. Williams"
  }
  [7]=>
  array(1) {
    [0]=>
    string(17) "Donny van de Beek"
  }
  [8]=>
  array(1) {
    [0]=>
    string(18) "Mike van der Hoorn"
  }
  [9]=>
  array(1) {
    [0]=>
    string(18) "Artak G. Grigoryan"
  }
  [10]=>
  array(1) {
    [0]=>
    string(22) "Trent Alexander-Arnold"
  }
}

Method 3

We could also use preg_replace: forget about the names entirely, match only the text to the right of the name in each line, and remove it. A simple expression for that might be:

(?:\s[a-z]+){0,}\h*$

or:

(?:\s*\b[a-z]+){0,}\h*$

RegEx Demo

Test 3

$re = '/(?:\s[a-z]+){0,}\h*$/m';
$str = 'John Doe waves
Bakary N\'Diaye says hello
Iván Aguilar goes well
Cisteró shot
Dan I Soylu shots
Mike van der Hoorn with a cross
M.J. Williams takes a shot
Donny van de Beek left foot
Mike van der Hoorn hello
Artak G. Grigoryan with through ball
Trent Alexander-Arnold after a break ';

echo preg_replace($re, '', $str);

Output 3

John Doe
Bakary N'Diaye
Iván Aguilar
Cisteró
Dan I Soylu
Mike van der Hoorn
M.J. Williams
Donny van de Beek
Mike van der Hoorn
Artak G. Grigoryan
Trent Alexander-Arnold

RegEx Demo 3 for `preg_replace`

Method 4

This may be the easiest and fastest way, provided that nothing after the name contains an uppercase letter. A greedy .* backtracks to the last uppercase letter in the line, and a trailing \S+ or \S* then consumes the rest of that word:

^.*\p{Lu}\S+

or,

^.*\p{Lu}\S*

RegEx Demo 4

or with a numeric quantifier:

^.{0,50}\p{Lu}\S*

RegEx Demo 5
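
Method 4 has no PHP test above, so here is a minimal sketch of how it might be run with preg_match_all. The m and u modifiers are my assumption (m so that ^ matches at each line, u for UTF-8 input), the subject uses "\u{F3}" for ó so that it is UTF-8 regardless of file encoding, and the last input line is a hypothetical one added only to show where the approach breaks down.

```php
<?php
// Greedy ".*" runs to the end of the line, then backtracks to the last
// uppercase letter; "\S*" finishes that word. "m" makes "^" match at every
// line start, "u" enables UTF-8 so accented letters are single characters.
$re = '/^.*\p{Lu}\S*/mu';

$str = "John Doe waves\n"
     . "Cister\u{F3} shot\n"
     . "Mike van der Hoorn with a cross\n"
     . "John Doe passes to Mike\n";   // hypothetical line, not from the question

preg_match_all($re, $str, $matches);

foreach ($matches[0] as $name) {
    echo $name, "\n";
}
// John Doe
// Cisteró
// Mike van der Hoorn
// John Doe passes to Mike   <- overshoots: "Mike" holds the last capital
```

The last line shows the trade-off: the pattern is short, but any capitalized word after the name extends the match.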

If you wish to simplify, update, or explore the expression, it is explained in the top-right panel of regex101.com. You can watch the matching steps or modify them in this debugger link, if you are interested. The debugger shows how a regex engine might consume some sample input strings step by step and perform the matching.

Yours is doing a pretty good job. I'll try it with a larger set and try to understand your regex, I'll give a shout if I can't figure out some cases (hope that's alright) — senty, Dec 10 '19 at 16:35