An expression similar to
^([\p{L} '.-]+?)(?:\s[a-z]+)*\h*$
might be worth trying with preg_match_all. It has two groups: the first, anchored at the start of the line, captures the name; the second is a non-capturing group that consumes the trailing lowercase words, which we are not interested in. (Test 1 below uses \s* rather than \h* before the $, which is why its last full match includes the trailing newline.)
Test 1
$re = '/^([\p{L} \'.-]+?)(?:\s[a-z]+)*\s*$/m';
$str = 'John Doe waves
Bakary N\'Diaye says hello
Iván Aguilar goes well
Cisteró shot
Dan I Soylu shots
Mike van der Hoorn with a cross
M.J. Williams takes a shot
Donny van de Beek left foot
Mike van der Hoorn hello
Artak G. Grigoryan with through ball
Trent Alexander-Arnold after a break
';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
Output 1
array(9) {
[0]=>
array(2) {
[0]=>
string(14) "John Doe waves"
[1]=>
string(8) "John Doe"
}
[1]=>
array(2) {
[0]=>
string(25) "Bakary N'Diaye says hello"
[1]=>
string(14) "Bakary N'Diaye"
}
[2]=>
array(2) {
[0]=>
string(17) "Dan I Soylu shots"
[1]=>
string(11) "Dan I Soylu"
}
[3]=>
array(2) {
[0]=>
string(31) "Mike van der Hoorn with a cross"
[1]=>
string(18) "Mike van der Hoorn"
}
[4]=>
array(2) {
[0]=>
string(26) "M.J. Williams takes a shot"
[1]=>
string(13) "M.J. Williams"
}
[5]=>
array(2) {
[0]=>
string(27) "Donny van de Beek left foot"
[1]=>
string(17) "Donny van de Beek"
}
[6]=>
array(2) {
[0]=>
string(24) "Mike van der Hoorn hello"
[1]=>
string(18) "Mike van der Hoorn"
}
[7]=>
array(2) {
[0]=>
string(36) "Artak G. Grigoryan with through ball"
[1]=>
string(18) "Artak G. Grigoryan"
}
[8]=>
array(2) {
[0]=>
string(37) "Trent Alexander-Arnold after a break
"
[1]=>
string(22) "Trent Alexander-Arnold"
}
}
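Output 1 contains 9 matches for 11 input lines: "Iván Aguilar" and "Cisteró" are missing. The most likely reason is that the pattern has no u modifier, so PCRE reads the UTF-8 bytes of á and ó one at a time and \p{L} does not accept all of them. A minimal sketch of a possible fix, assuming UTF-8 input and a shortened sample string (the u modifier and the $names variable are additions, not part of the test above):

```php
<?php
// Pattern from the top of this answer, plus the "u" modifier (an assumed fix)
// so that \p{L} works on whole UTF-8 characters instead of single bytes.
$re = '/^([\p{L} \'.-]+?)(?:\s[a-z]+)*\h*$/mu';

// Shortened sample input; assumed to be UTF-8 encoded.
$str = "Iván Aguilar goes well\nCisteró shot\nJohn Doe waves";

preg_match_all($re, $str, $matches, PREG_SET_ORDER);

// Collect capture group 1 (the name) from every match.
$names = array_column($matches, 1);
print_r($names);
```

The same presumably applies to the lookahead pattern in Test 2, whose output skips the same two names.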
On the left side of each input line there is no problem, because every line starts with a name. On the right side, however, there is a run of lowercase words, each preceded by a space. We could write a sub-expression to find those, for example as a positive lookahead:
(?=(?:\s[a-z]+)*\h*$)
then, with a second sub-expression,
^[\p{L} '.-]+?
we'd collect the names, and our final expression would become:
^[\p{L} '.-]+?(?=(?:\s[a-z]+)*\h*$)
Test 2
$re = '/^[\p{L} \'.-]+?(?=(?:\s[a-z]+)*\h*$)/m';
$str = 'John Doe waves
Bakary N\'Diaye says hello
Iván Aguilar goes well
Cisteró shot
Dan I Soylu shots
Mike van der Hoorn with a cross
M.J. Williams takes a shot
Donny van de Beek left foot
Mike van der Hoorn hello
Artak G. Grigoryan with through ball
Trent Alexander-Arnold after a break
';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
Output 2
array(9) {
[0]=>
array(1) {
[0]=>
string(8) "John Doe"
}
[1]=>
array(1) {
[0]=>
string(14) "Bakary N'Diaye"
}
[2]=>
array(1) {
[0]=>
string(11) "Dan I Soylu"
}
[3]=>
array(1) {
[0]=>
string(18) "Mike van der Hoorn"
}
[4]=>
array(1) {
[0]=>
string(13) "M.J. Williams"
}
[5]=>
array(1) {
[0]=>
string(17) "Donny van de Beek"
}
[6]=>
array(1) {
[0]=>
string(18) "Mike van der Hoorn"
}
[7]=>
array(1) {
[0]=>
string(18) "Artak G. Grigoryan"
}
[8]=>
array(1) {
[0]=>
string(22) "Trent Alexander-Arnold"
}
}
Method 3
We could also use the preg_replace function: forget about the names entirely and match only what follows the name on each line, then replace it with an empty string. A simple expression for that might be:
(?:\s[a-z]+){0,}\h*$
or:
(?:\s*\b[a-z]+){0,}\h*$
Test 3
$re = '/(?:\s[a-z]+){0,}\h*$/m';
$str = 'John Doe waves
Bakary N\'Diaye says hello
Iván Aguilar goes well
Cisteró shot
Dan I Soylu shots
Mike van der Hoorn with a cross
M.J. Williams takes a shot
Donny van de Beek left foot
Mike van der Hoorn hello
Artak G. Grigoryan with through ball
Trent Alexander-Arnold after a break ';
echo preg_replace($re, '', $str);
Output 3
John Doe
Bakary N'Diaye
Iván Aguilar
Cisteró
Dan I Soylu
Mike van der Hoorn
M.J. Williams
Donny van de Beek
Mike van der Hoorn
Artak G. Grigoryan
Trent Alexander-Arnold
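If the names are needed as an array rather than as one cleaned-up string, the result of preg_replace can be split on newlines. A small sketch, assuming a shortened sample string and lines separated by "\n" (the $clean and $names variables are illustrative):

```php
<?php
// Same expression as in Test 3: strip the trailing lowercase words on each line.
$re = '/(?:\s[a-z]+){0,}\h*$/m';

// Shortened sample input; note the trailing space on the last line.
$str = "John Doe waves\nCisteró shot\nTrent Alexander-Arnold after a break ";

// Remove everything after the name on every line.
$clean = preg_replace($re, '', $str);

// One name per line, so splitting on "\n" yields the list of names.
$names = explode("\n", $clean);
print_r($names);
```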
Method 4
This may be the easiest and fastest way. A greedy .* runs up to the last uppercase letter in the line, and a trailing \S+ or \S* then consumes the rest of that word:
^.*\p{Lu}\S+
or,
^.*\p{Lu}\S*
or with a numeric quantifier:
^.{0,50}\p{Lu}\S*
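Method 4 has no test above, so here is a minimal sketch of how it might be run. It assumes UTF-8 input, uses the m and u modifiers (both are assumptions, since the expressions above are shown without delimiters or flags), and relies on the text after the name containing no uppercase letters:

```php
<?php
// Greedy .* backtracks to the last uppercase letter on the line;
// \S* then takes the rest of that word. "m" makes ^ match at each line start,
// "u" lets \p{Lu} and \S handle multibyte characters such as "ó".
$re = '/^.*\p{Lu}\S*/mu';

// Shortened sample input taken from the tests above.
$str = "John Doe waves\nIván Aguilar goes well\nCisteró shot\n"
     . "Mike van der Hoorn with a cross\nM.J. Williams takes a shot";

preg_match_all($re, $str, $matches);

// $matches[0] holds the full match for each line, i.e. the name.
print_r($matches[0]);
```

A line such as "John Doe passes to Smith" would break this approach, because the last uppercase letter would then belong to "Smith" rather than to the name at the start.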
If you wish to simplify, update, or explore the expression, regex101.com explains each part of it in its top-right panel. Its debugger also lets you step through how a regex engine consumes sample input strings and performs the match.