I have some Burmese text that was split into individual characters in order to find and remove characters outside the relevant Unicode block, e.g. removing Latin characters from Burmese text. The result (if I am using the correct term) is that the grapheme clusters have been separated, like this:

ေမာင္ေကာင္းၫိႈ႕မွဴးႏိုင္

I believe that wherever a dotted circle appears, the two characters should have been combined into one Unicode character rather than left as two.

Correctly rendered Burmese shouldn't have these dotted circles. For example:

ယနေ့ မြန်မာမှုအဖြစ် ပုံဖော်ပေးခဲ့သည့် ယဉ်ကျေးမှုမှာ နှစ်ပေါင်း အတော်အတန်ကြာမြင့်နေပြီဖြစ်ကြောင်း

Any ideas on how this could be fixed?
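
For illustration, the filtering step described above does not need the text to be split into single characters first. The following is a minimal sketch in Python (not the PHP setup mentioned in the comments), assuming that "the relevant Unicode block" means the main Myanmar block U+1000 to U+109F and that whitespace should be kept. The name `keep_myanmar` is hypothetical.

```python
import re

# Assumption: only the main Myanmar block (U+1000-U+109F) counts as relevant;
# the Myanmar Extended blocks are deliberately left out of this sketch.
NON_MYANMAR = re.compile(r"[^\u1000-\u109F\s]+")


def keep_myanmar(text):
    """Drop every code point outside the Myanmar block, keeping whitespace."""
    # The substitution runs over the whole string, so the relative order of
    # base letters and their combining marks is never disturbed.
    return NON_MYANMAR.sub("", text)
```

Because the string is never taken apart and reassembled, a filter like this cannot reorder or separate the code points of a cluster. It only removes the ones outside the chosen range.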

Cœur
Kohjah Breese
  • Are you sure you removed only non-Myanmar characters? Many of the combining marks in the string you’ve given don’t have a letter before them, which is why they display with the dotted circle placeholder. Can you post the original string that led to this result? – CharlotteBuff Aug 05 '17 at 16:30
  • Yes, but I no longer have the original source. The problem is I split to single characters using PHP's preg_split and mb_regex_encoding UTF-8. This causes the clusters to be broken. Using UTF-16 leaves the clusters intact. – Kohjah Breese Aug 06 '17 at 07:15
  • I just had a look at the string you provided on my phone and it seems like it’s encoded in Zawgyi and never was Unicode to begin with. – CharlotteBuff Sep 10 '17 at 15:21
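
The diagnosis in the comments (combining marks with no letter before them) can be checked by inspecting the code points directly. Below is a minimal sketch in Python using the standard `unicodedata` module. The helper names `orphan_marks` and `dump` are hypothetical, and the check is only a rough heuristic. In Unicode-encoded Burmese the vowel sign U+1031 is stored after its consonant, so a mark at the start of a word is one common sign that the text uses a different encoding, such as Zawgyi.

```python
import unicodedata


def orphan_marks(text):
    """Return indices of combining marks that have no base letter before them."""
    orphans = []
    has_base = False  # True while a letter (plus any marks on it) precedes
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category.startswith("M"):  # Mn, Mc, Me: combining marks
            if not has_base:
                orphans.append(i)
        else:
            # Only letters count as a base; spaces, digits, etc. reset the state.
            has_base = category.startswith("L")
    return orphans


def dump(text):
    """Print one line per code point: index, code point, category, name."""
    for i, ch in enumerate(text):
        name = unicodedata.name(ch, "<unnamed>")
        print(f"{i:3d} U+{ord(ch):04X} {unicodedata.category(ch)} {name}")
```

A non-empty result from `orphan_marks` on text that was never split suggests the dotted circles come from the source encoding rather than from the splitting step.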
