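で looks like 1 grapheme, but it may only be a rendering of 2 composed Unicode codepoints: て followed by the combining mark ◌゙ (U+3099, not to be confused with the standalone codepoint ゛, U+309B, which can't be combined). The single-codepoint で can be converted from UTF-8 to Shift-JIS; the two-codepoint sequence cannot, because the combining mark has no Shift-JIS equivalent.
The same applies to ぱ: it may be composed of は and ◌゚ instead of being one single character: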
| | で | て | ◌゙ | ぱ | は | ◌゚ |
|---|---|---|---|---|---|---|
| Unicode | U+3067 | U+3066 | U+3099 | U+3071 | U+306F | U+309A |
| UTF-8 | e3 81 a7 | e3 81 a6 | e3 82 99 | e3 81 b1 | e3 81 af | e3 82 9a |
| Shift-JIS or CP932 | 82 c5 | 82 c4 | (doesn't exist) | 82 cf | 82 cd | (doesn't exist) |
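To see the difference for yourself, you can dump both forms of で as bytes. This is only a minimal sketch: `describe()` is a hypothetical helper name, and it assumes PHP 7 or later (for the `\u{...}` string escapes).

```php
<?php
// Both strings render as で, but they are different codepoint sequences.
$precomposed = "\u{3067}";         // で as one single codepoint
$decomposed  = "\u{3066}\u{3099}"; // て followed by the combining mark U+3099

// Hypothetical helper: count the codepoints and show the raw UTF-8 bytes.
function describe(string $utf8): string
{
    // The /u modifier makes PCRE split on codepoints instead of bytes.
    $codepoints = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);

    return sprintf(
        '%d codepoint(s), UTF-8 bytes: %s',
        count($codepoints),
        bin2hex($utf8)
    );
}

echo describe($precomposed), "\n"; // 1 codepoint(s), UTF-8 bytes: e381a7
echo describe($decomposed), "\n";  // 2 codepoint(s), UTF-8 bytes: e381a6e38299
var_dump($precomposed === $decomposed); // bool(false)
```

The two strings look identical on screen, yet a plain string comparison already treats them as different.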
Just because you see 1 grapheme (e.g. で or ぱ) in Unicode text (e.g. in UTF-8) doesn't mean it is built from 1 codepoint. You can trust neither your eyes nor your user's input: it may really be 1 codepoint, or it may not. You have to normalize your UTF-8 text (e.g. to the NFC form) before converting it to Shift-JIS. Normalization turns the 2 codepoints (U+3066 and U+3099) for that 1 grapheme into 1 codepoint (U+3067), which can then be converted to Shift-JIS without problems (82 c5).
In PHP the intl extension must be installed; then you can use normalizer_normalize(). The result of that function can then be fully converted to Shift-JIS.
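Put together, it could look roughly like the sketch below. `utf8ToShiftJis()` is a hypothetical helper name, and the conversion step assumes the mbstring extension is available in addition to intl; `SJIS-win` is mbstring's name for CP932, so use `SJIS` instead if you need plain Shift-JIS.

```php
<?php
// Hypothetical helper: normalize to NFC first, then convert to CP932.
// Assumes the intl and mbstring extensions are both loaded.
function utf8ToShiftJis(string $utf8): string
{
    // NFC composes a base character and its combining mark into
    // one codepoint wherever a precomposed codepoint exists.
    $nfc = normalizer_normalize($utf8, Normalizer::FORM_C);
    if ($nfc === false) {
        throw new InvalidArgumentException('Input could not be normalized (invalid UTF-8?).');
    }

    return mb_convert_encoding($nfc, 'SJIS-win', 'UTF-8');
}

$decomposed = "\u{3066}\u{3099}"; // て + combining mark, renders as で
echo bin2hex(utf8ToShiftJis($decomposed)), "\n"; // 82c5

$decomposedPa = "\u{306F}\u{309A}"; // は + combining mark, renders as ぱ
echo bin2hex(utf8ToShiftJis($decomposedPa)), "\n"; // 82cf
```

Without the normalization step the combining mark has no target in Shift-JIS, so the converter has to substitute or drop it, depending on how mbstring is configured.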