PHP mb_convert_encoding convert from UTF-8 to SHIFT JIS is wrong

Question

I use the mb_convert_encoding function to convert UTF-8 characters to SJIS characters.

Before conversion:でんぱ組出会いの歌26 カミソヤマ　ユニ

After conversion: て?んは?組出会いの歌26 カミソヤマ　ユニ

Non-convertible characters: て?んは?

Code used to convert:

$str = mb_convert_encoding('でんぱ組 出会いの歌26 カミソヤマ　ユニ', "SJIS", "UTF-8");

So most likely there is no valid conversion for those characters. Which is why they have to be transcribed. — arkascha, Jul 15 '22 at 10:34

score 0 · Answer 1 · answered Jul 17 '22 at 13:22

The で in your string looks like 1 grapheme, but it is only the rendering of 2 Unicode codepoints composed together: て and the combining mark ◌゙ (not to be confused with the spacing codepoint ゛, which can't be combined). The former can be converted from UTF-8 to Shift-JIS, the latter cannot.

The same applies to ぱ: it is composed of は and ◌゚ instead of being one single character:

	で	て	◌゙	ぱ	は	◌゚
Unicode	U+3067	U+3066	U+3099	U+3071	U+306F	U+309A
UTF-8	`e3 81 a7`	`e3 81 a6`	`e3 82 99`	`e3 81 b1`	`e3 81 af`	`e3 82 9a`
Shift-JIS or CP932	`82 c5`	`82 c4`	(doesn't exist)	`82 cf`	`82 cd`	(doesn't exist)
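
To check which codepoints a string really contains, you can dump them yourself. This is only a minimal sketch: codepoints() is a hypothetical helper name, not a built-in, and it assumes the mbstring extension (PHP 7.4 or later for mb_str_split). The literals are written as \u{...} escapes so that no editor can silently recompose them.

```php
<?php
// Hypothetical helper: list each codepoint of a UTF-8 string as "U+XXXX".
function codepoints(string $utf8): array
{
    $out = [];
    // mb_str_split() splits by codepoint, not by visible grapheme.
    foreach (mb_str_split($utf8, 1, 'UTF-8') as $char) {
        $out[] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
    }
    return $out;
}

// Decomposed form: base letter followed by the combining voiced sound mark.
$decomposed = "\u{3066}\u{3099}";
// Precomposed form: one codepoint that renders the same way.
$precomposed = "\u{3067}";

echo implode(' ', codepoints($decomposed)), ' = ', bin2hex($decomposed), PHP_EOL;
// U+3066 U+3099 = e381a6e38299
echo implode(' ', codepoints($precomposed)), ' = ', bin2hex($precomposed), PHP_EOL;
// U+3067 = e381a7
```

Both strings look identical on screen, but only the second one consists of a single codepoint.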

Just because you see 1 grapheme (e.g. で or ぱ) in Unicode text (e.g. in UTF-8) doesn't mean it is built from 1 codepoint. You can trust neither your eyes nor your users' input: a grapheme may or may not be a single codepoint. You have to normalize your UTF-8 text (e.g. to the NFC form) before converting it to Shift-JIS, as then the 2 codepoints (U+3066 and U+3099) that form 1 grapheme become 1 codepoint (U+3067), which can be converted to Shift-JIS without problems (82 c5).

In PHP the intl extension must be installed; then you can use normalizer_normalize(), and the result of that function can be fully converted to Shift-JIS.
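
A minimal sketch of that approach, assuming both intl and mbstring are loaded. utf8_to_sjis() is a hypothetical helper name, and the sample string is the decomposed spelling of the first three characters from the question, written as escapes.

```php
<?php
// Hypothetical helper: compose the text to NFC, then convert UTF-8 to Shift-JIS.
function utf8_to_sjis(string $utf8): string
{
    $nfc = normalizer_normalize($utf8, Normalizer::FORM_C);
    if ($nfc === false) {
        // normalizer_normalize() returns false on failure, e.g. invalid UTF-8.
        throw new InvalidArgumentException('Input could not be normalized');
    }
    return mb_convert_encoding($nfc, 'SJIS', 'UTF-8');
}

// Decomposed input: base letters followed by combining sound marks.
$decomposed = "\u{3066}\u{3099}\u{3093}\u{306F}\u{309A}";

// Without normalization the combining marks have no Shift-JIS mapping,
// so mb_convert_encoding() replaces them with its substitute character.
echo bin2hex(mb_convert_encoding($decomposed, 'SJIS', 'UTF-8')), PHP_EOL;

// With normalization every character converts.
echo bin2hex(utf8_to_sjis($decomposed)), PHP_EOL; // 82c582f182cf
```

Normalizing inside the helper means callers don't need to know whether their input arrived composed or decomposed.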
