How to find the missing components of Chinese characters within Unicode?

Question

I am currently working on the decomposition of Chinese characters (Japanese kanji, to be more exact), and I have found a few components that seemingly are either not included in the Unihan database or cannot be displayed properly with any font I am aware of. Is there some way to locate these characters in Unicode (UTF-8 or UTF-16) and have them displayed properly as characters? The list of components is provided below:

渋 ---> 氵+ 止 + ??? ... I have not managed to find those four dots in the Unihan database ... even here the authors had to encode the component ... the same issue appears in kanji 楽 and 摂 and 率

龍 ---> + ??? .... the component on the right hand side seems not to be in the Unicode ... the same goes for 拝 or 継

制 ---> ??? + 刂 ... left component seems not to be in the Unicode (the closest probably is 韦) ... the same goes for the kanji 段 ---> ??? + 殳

祭 ---> ??? + 示

留 ---> ??? + 田 (it is possible to decompose into three components + 刀 + 田, but two would be better)

Thank you very much for your advice :-)

I went through the whole Unihan database (over 90 000 characters) and did not manage to find the missing components. I tried installing various fonts (BabelStone Han, simch5100, etc.), but none of them covers 100% of Unicode. Nevertheless, I am afraid that some of these components are not encoded in Unicode on their own and can be displayed only as part of another character.

Looking at your link, I think you're looking for Ideographic Description Sequences (your link includes IDCs, even though your examples don't). I believe your answer is best given here: https://japanese.stackexchange.com/questions/14702/decomposition-of-kanji (and the short version is "yes, there are missing components"). — Rob Napier, Nov 11 '22 at 22:30

score 2 · Answer 1 · answered Nov 11 '22 at 22:52

You may want to have a look at the IDS.TXT data file maintained by Andrew West (BabelStone), which provides Ideographic Description Sequences (IDS) for all the 97,058 CJK unified ideographs defined in Unicode version 15.0.

It makes use of about 120 "numbered components", which are characters not yet defined in Unicode (although it seems they may be added later on, according to some official proposal). They are currently represented by glyphs in an associated Private Use Area (PUA) font named BabelStone Han PUA, which can be freely downloaded from the bottom of the page.
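As a rough illustration of how such data could be consumed, here is a minimal Python sketch. It assumes each data line is tab-separated as `U+XXXX`, the character, then one or more IDS fields, with each IDS optionally wrapped in `^...$` and followed by source tags in parentheses; please check the header of IDS.TXT for the actual layout before relying on this. The helper names and the sample lines are hypothetical (the second one uses U+E000 purely as a stand-in for an unencoded component) and are not copied from the file.

```python
import unicodedata

# Ideographic Description Characters U+2FF0..U+2FFB
# (later Unicode versions add a few more outside this range).
IDC_RANGE = range(0x2FF0, 0x2FFC)


def is_private_use(ch):
    """True if ch is a Private Use code point (general category Co)."""
    return unicodedata.category(ch) == "Co"


def parse_ids_line(line):
    """Split one data line into (character, [ids, ...]).

    ASSUMED layout: 'U+XXXX<TAB>char<TAB>ids[<TAB>ids...]'.
    Returns None for blank lines, comments, and lines with too few fields.
    """
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None
    return fields[1], fields[2:]


def components(ids):
    """Return the non-IDC characters of one IDS field, in order."""
    # Drop trailing source tags such as '(GHTJKPV)' and the ^...$ wrapper
    # (both are assumptions about the file's notation).
    core = ids.split("(")[0].strip("^$")
    return [ch for ch in core if ord(ch) not in IDC_RANGE]


# Hypothetical sample lines, for illustration only.
sample = [
    "# comment line",
    "U+597D\t好\t^⿰女子$(GHTJKPV)",
    "U+0000\tX\t⿰氵\ue000",  # made-up entry with a PUA stand-in
]

for raw in sample:
    parsed = parse_ids_line(raw)
    if parsed is None:
        continue
    char, ids_list = parsed
    for ids in ids_list:
        parts = components(ids)
        pua = [f"U+{ord(c):04X}" for c in parts if is_private_use(c)]
        print(char, parts, "PUA components:", pua)
```

Any component reported as PUA here has no standard Unicode code point, so it will only render correctly with a font that assigns a glyph to that private code point, such as the PUA font mentioned above.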

There is also an open-source application called Unicopedia Sinica, available on GitHub, which makes extensive use of this data in a graphical way.
