Join separated grapheme cluster

Question

I have some Burmese text that was split into individual characters so that characters outside the relevant Unicode block could be checked for and removed, e.g. removing Latin characters from Burmese text. The result (if I am using the correct term) is that the grapheme clusters have been separated, like this:

ေမာင္ေကာင္းၫိႈ႕မွဴးႏိုင္

I believe that where the dotted circles appear, the two characters should be one Unicode character as opposed to two.

Correctly rendered Burmese shouldn't have these dotted circles, for example:

ယနေ့ မြန်မာမှုအဖြစ် ပုံဖော်ပေးခဲ့သည့် ယဉ်ကျေးမှုမှာ နှစ်ပေါင်း အတော်အတန်ကြာမြင့်နေပြီဖြစ်ကြောင်း

Any ideas on how this could be fixed?
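
One way to sidestep the splitting step altogether is to delete the unwanted characters in place, so the remaining code points keep their original order. Below is a minimal sketch in Python (not the PHP used in the original workflow), assuming "the relevant Unicode block" means the main Myanmar block U+1000 to U+109F and that whitespace should be kept. `keep_myanmar` is a hypothetical helper name.

```python
import re

# Assumption: the block of interest is the main Myanmar block, U+1000-U+109F.
# Whitespace (\s) is kept so that word boundaries survive the filtering.
NON_MYANMAR = re.compile(r"[^\u1000-\u109F\s]+")


def keep_myanmar(text: str) -> str:
    """Remove every code point outside the Myanmar block.

    The string is never split into single characters, so the surviving
    code points stay in exactly the order they had in the input.
    """
    return NON_MYANMAR.sub("", text)


# Latin text on both sides of a short Myanmar word (written with escapes).
sample = "abc \u1019\u103c\u1014\u103a\u1019\u102c xyz"
cleaned = keep_myanmar(sample).strip()
```

This only avoids damaging the text during filtering. It cannot repair text whose code points were already stored in a non-Unicode order.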

Are you sure you removed only non-Myanmar characters? Many of the combining marks in the string you’ve given don’t have a letter before them, which is why they display with the dotted circle placeholder. Can you post the original string that led to this result? — CharlotteBuff, Aug 05 '17 at 16:30
Yes, but I no longer have the original source. The problem is I split to single characters using PHP's preg_split and mb_regex_encoding UTF-8. This causes the clusters to be broken. Using UTF-16 leaves the clusters intact. — Kohjah Breese, Aug 06 '17 at 07:15
I just had a look at the string you provided on my phone and it seems like it’s encoded in Zawgyi and never was Unicode to begin with. — CharlotteBuff, Sep 10 '17 at 15:21
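
The point about combining marks with no letter before them can be checked by listing the code points. Below is a rough Python sketch, assuming a mark counts as "orphaned" when it is the first character or follows something that is neither a letter nor another mark. `orphaned_marks` is a hypothetical helper, and this is only a heuristic, not a Zawgyi detector.

```python
import unicodedata


def orphaned_marks(text: str) -> list[int]:
    """Return the indexes of combining marks that have no base before them.

    Assumption: a mark (general category Mn, Mc or Me) is "orphaned" if it is
    the first character, or if the previous character is neither a letter (L*)
    nor another mark (M*).
    """
    found = []
    for i, ch in enumerate(text):
        if not unicodedata.category(ch).startswith("M"):
            continue
        prev_cat = unicodedata.category(text[i - 1]) if i > 0 else ""
        if not prev_cat.startswith(("L", "M")):
            found.append(i)
    return found


# Vowel sign E (U+1031) stored before the consonant MA (U+1019): flagged.
visual_order = "\u1031\u1019\u102c"
# The same characters with the vowel sign after the consonant: not flagged.
logical_order = "\u1019\u1031\u102c"

flagged = orphaned_marks(visual_order)
clean = orphaned_marks(logical_order)
```

In Unicode Myanmar text the vowel sign E (U+1031) is stored after its consonant, so a string that begins with U+1031, as the one in the question appears to, gets flagged by this check.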


0 Answers