regular expression to match name initials - PCRE

Question

I have a regular expression to get the initials of a name like below:

/\b\p{L}\./gu

It works fine with English and other languages until graphemes made of combined characters occur. For example,
क in Hindi and
ಕ in Kannada are matched,
but
के in Hindi and
ಕೆ in Kannada are not matched by this regex.
I am trying to get the initials from a name like J.P.Morgan.
Any help would be greatly appreciated.

Wiktor Stribiżew · Accepted Answer · 2019-01-17T07:41:36.833

2

You need to match diacritic marks after base letters using \p{M}*:

'~\b(?<!\p{M})\p{L}\p{M}*\.~u'

The pattern matches

\b - a word boundary
(?<!\p{M}) - the char before the current position must not be a diacritic mark (without it, \b can succeed right after a mark, so a match can start inside a single word)
\p{L} - any base Unicode letter
\p{M}* - zero or more diacritic marks
\. - a dot.

See the PHP demo online:

$s = "क. ಕ. के. ಕೆ. ";
echo preg_replace('~\b(?<!\p{M})\p{L}\p{M}*+\.~u', '<pre>$0</pre>', $s); 
// => <pre>क.</pre> <pre>ಕ.</pre> <pre>के.</pre> <pre>ಕೆ.</pre>
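
To see what the lookbehind contributes, here is a minimal sketch that wraps the same pattern in a helper. The function name `find_initials` and its `$guardMarks` switch are hypothetical and exist only for this illustration. The last call turns the lookbehind off; what it returns may differ between PCRE versions, so no output is claimed for it.

```php
<?php
// Hypothetical helper (not part of the original answer): returns every
// "letter + optional combining marks + dot" sequence found in $text.
function find_initials(string $text, bool $guardMarks = true): array
{
    // (?<!\p{M}) rejects a start position that directly follows a combining mark.
    $guard = $guardMarks ? '(?<!\p{M})' : '';
    $pattern = '~\b' . $guard . '\p{L}\p{M}*+\.~u';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

print_r(find_initials('J.P.Morgan'));         // the two initials: J. and P.
print_r(find_initials('के. ಕೆ.'));             // both initials, vowel signs included
print_r(find_initials('ಹೇಳಿದ್ದೇನೆ.'));          // with the lookbehind: no match
print_r(find_initials('ಹೇಳಿದ್ದೇನೆ.', false));   // lookbehind off, for comparison
```

The possessive `\p{M}*+` is kept from the demo above; it stops the engine from giving marks back once they are consumed.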

edited Jan 17 '19 at 07:41

answered Jan 14 '19 at 09:39

Wiktor Stribiżew


I have a doubt, @Wiktor. This one also matches the `last character of words`, like this: **ಹೇಳಿದ್ದೇ`ನೆ.`** Even though there is no word break, it picks up this combination. Any idea? – Prashanth Benny Jan 16 '19 at 13:05

@PrashanthBenny Right, it is due to the diacritics. Add a negative lookbehind: `'~\b(?<!\p{M})\p{L}\p{M}*+\.~u'` – Wiktor Stribiżew Jan 16 '19 at 13:08
