I have a regular expression to get the initials of a name like below:

/\b\p{L}\./gu

It works fine with English and other languages until grapheme clusters with combining characters occur. A single base letter, like क in Hindi and ಕ in Kannada, is matched. But के in Hindi and ಕೆ in Kannada are not matched by this regex.
I am trying to get the initials from a name like J.P.Morgan, etc.
Any help would be greatly appreciated.

Prashanth Benny

1 Answer

You need to match diacritic marks after base letters using \p{M}*:

'~\b(?<!\p{M})\p{L}\p{M}*\.~u'

The pattern matches

  • \b - a word boundary
  • (?<!\p{M}) - the char before the current position must not be a diacritic char (without it, a match can occur within a single word)
  • \p{L} - any base Unicode letter
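  • \p{M}* - zero or more diacritic marks (the demo below writes it as the possessive \p{M}*+, which matches the same text without backtracking)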
  • \. - a dot.

See the PHP demo online:

$s = "क. ಕ. के. ಕೆ. ";
echo preg_replace('~\b(?<!\p{M})\p{L}\p{M}*+\.~u', '<pre>$0</pre>', $s); 
// => <pre>क.</pre> <pre>ಕ.</pre> <pre>के.</pre> <pre>ಕೆ.</pre> 
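
To get the initials as a list rather than wrapping them in tags, the same pattern can be passed to preg_match_all. The following is a minimal sketch, not part of the original demo: the helper name extractInitials and the sample strings are hypothetical and chosen only for illustration.

```php
<?php
// Hypothetical helper (not from the original answer): returns every
// "initial", i.e. one letter plus any combining marks, followed by a dot.
function extractInitials(string $name): array
{
    // Same pattern as in the demo above. The u modifier makes PCRE treat
    // the subject as UTF-8 and enables \p{...} property classes.
    $pattern = '~\b(?<!\p{M})\p{L}\p{M}*+\.~u';
    preg_match_all($pattern, $name, $matches);
    return $matches[0]; // full matches only
}

print_r(extractInitials('J.P.Morgan')); // two matches: "J." and "P."
print_r(extractInitials('के. ಕೆ.'));     // two matches: "के." and "ಕೆ."
```

"Morgan" is skipped because the M is followed by another letter instead of a dot.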
Wiktor Stribiżew
  • I have a doubt, @Wiktor. This one also matches the last character of words, like the `ನೆ.` at the end of **ಹೇಳಿದ್ದೇನೆ.** Even though there is no word break there, it picks up this combination. Any idea? – Prashanth Benny Jan 16 '19 at 13:05
  • @PrashanthBenny Right, it is due to the diacritics. Add a negative lookbehind: `'~\b(?<!\p{M})\p{L}\p{M}*+\.~u'` – Wiktor Stribiżew Jan 16 '19 at 13:08
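
The effect of the lookbehind can be checked by running the pattern with and without it on the word from the comment. This is only a sketch: the variable names are made up for illustration, and the result of the first call is assumed to depend on how the installed PCRE build classifies combining marks for \b.

```php
<?php
// The word from the comment above; it ends in a letter, a vowel sign and a dot.
$word = 'ಹೇಳಿದ್ದೇನೆ.';

// Without the lookbehind: \b may report a boundary right after a combining
// mark inside the word, so the final syllable plus dot can match.
preg_match_all('~\b\p{L}\p{M}*+\.~u', $word, $withoutLookbehind);

// With the lookbehind: a letter directly preceded by a mark is rejected,
// so nothing inside the word is picked up.
preg_match_all('~\b(?<!\p{M})\p{L}\p{M}*+\.~u', $word, $withLookbehind);

var_dump($withoutLookbehind[0]); // may contain the trailing syllable
var_dump($withLookbehind[0]);    // empty array
```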