
I use the mb_convert_encoding function to convert UTF-8 characters to SJIS characters.

Before conversion:でんぱ組 出会いの歌26 カミソヤマ ユニ

After conversion: て?んは?組 出会いの歌26 カミソヤマ ユニ

Non-convertible characters: て?んは?

Code used to convert:

$str = mb_convert_encoding('でんぱ組 出会いの歌26 カミソヤマ ユニ', "SJIS", "UTF-8");
Ihenry
  • So most likely there is no valid conversion for those characters. Which is why they have to be transcribed. – arkascha Jul 15 '22 at 10:34

1 Answer

で as 1 grapheme is only a rendering of composing the 2 Unicode codepoints て and ◌゙ (not to be confused with the codepoint ゛ that can't be combined). The former can be translated from UTF-8 to Shift-JIS, the latter cannot.

Same thing with ぱ: it's combined from は and ◌゚ instead of being one single character:

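| | で | て | ◌゙ | ぱ | は | ◌゚ |
|---|---|---|---|---|---|---|
| Unicode | U+3067 | U+3066 | U+3099 | U+3071 | U+306F | U+309A |
| UTF-8 | e3 81 a7 | e3 81 a6 | e3 82 99 | e3 81 b1 | e3 81 af | e3 82 9a |
| Shift-JIS or CP932 | 82 c5 | 82 c4 | (doesn't exist) | 82 cf | 82 cd | (doesn't exist) |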

Just because you see 1 grapheme (e.g. で or ぱ) in Unicode text (e.g. in UTF-8) doesn't mean it is built from 1 codepoint. You can trust neither your eyes nor your users' input: a grapheme may really be 1 codepoint, or it may not. You have to normalize your UTF-8 text (e.g. to the NFC form) before converting it to Shift-JIS, as then the 2 codepoints (U+3066 and U+3099) for 1 grapheme become 1 codepoint (U+3067), which can then be translated to Shift-JIS without problems (82 c5).
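To see the difference that the eye cannot, it can help to inspect the codepoints directly. The following is a minimal sketch, assuming the mbstring and intl extensions are loaded; the variable names are only illustrative, and the strings are written with `\u{...}` escapes so that the editor cannot silently change their form.

```php
<?php
// Sketch only: assumes the mbstring and intl extensions are available.

// で as a single precomposed codepoint (U+3067).
$composed = "\u{3067}";
// て (U+3066) followed by the combining dakuten (U+3099); renders the same.
$decomposed = "\u{3066}\u{3099}";

// Count codepoints, not bytes.
echo mb_strlen($composed, 'UTF-8'), PHP_EOL;    // 1
echo mb_strlen($decomposed, 'UTF-8'), PHP_EOL;  // 2

// The raw UTF-8 bytes differ as well.
echo bin2hex($composed), PHP_EOL;    // e381a7
echo bin2hex($decomposed), PHP_EOL;  // e381a6e38299

// intl can report whether a string is already in NFC form.
var_dump(normalizer_is_normalized($decomposed, Normalizer::FORM_C)); // bool(false)
```

A check like this on the question's input string would show whether it arrived in decomposed form.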

In PHP the intl extension must be installed; then you can use normalizer_normalize(). The result of that function can then be fully converted to Shift-JIS.
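Putting the two steps together might look like the sketch below. It assumes the intl and mbstring extensions are loaded; the helper name `utf8_to_sjis` is hypothetical and not part of PHP.

```php
<?php
// Sketch only: assumes the intl and mbstring extensions are available.

// Hypothetical helper: normalize to NFC first, then convert to Shift-JIS.
function utf8_to_sjis(string $utf8): string
{
    // NFC composes base character + combining mark into one codepoint
    // wherever a precomposed form exists (e.g. U+3066 U+3099 -> U+3067).
    $nfc = normalizer_normalize($utf8, Normalizer::FORM_C);
    if ($nfc === false) {
        throw new InvalidArgumentException('Input could not be normalized (invalid UTF-8?)');
    }
    return mb_convert_encoding($nfc, 'SJIS', 'UTF-8');
}

// Decomposed input: て + ◌゙, ん, は + ◌゚
$input = "\u{3066}\u{3099}\u{3093}\u{306F}\u{309A}";

echo bin2hex(utf8_to_sjis($input)), PHP_EOL; // 82c582f182cf
```

Without the normalization step, the two combining marks have no Shift-JIS equivalent, which is where the `?` characters in the question come from.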

AmigoJack