Is there a way to programmatically combine Korean unicode into one?

Question

Using a Korean Input Method Editor (IME), it's possible to type 버리 + 어 and it will automatically become 버려.

Is there a way to programmatically do that in Python?

>>> x, y = '버리', '어'
>>> z = '버려'
>>> ord(z[-1])
47140
>>> ord(x[-1]), ord(y)
(47532, 50612)

Is there a way to compute that 47532 + 50612 -> 47140?

Here are some more examples:

가보 + 아 -> 가봐

끝나 + ㄹ -> 끝날

The relations between the characters are not part of the Unicode standard. — Peter Wood, Feb 27 '17 at 07:19
As Peter said, there's no such relation in Unicode. The only relations the Unicode standard has are between single Jamo characters and precomposed Hangul syllables; you can combine isolated jamo into complete syllables. Here you want to combine two syllables (리 + 어) into a different syllable. You'll need to come up with your own table (or find one elsewhere). — R. Martinho Fernandes, Feb 27 '17 at 12:50
@PeterWood: I think his question is more about "is there any library that already handles all the mapping?" — justhalf, Mar 01 '17 at 05:39
Section 18.6 of the Unicode 9.0 standard covers Hangul Syllables, which seems to describe most of the code points in the question. It talks about 'jamo' and the 'Johab' set of modern Hangul syllables (399 possible two-jamo syllable blocks and 10,773 possible three-jamo syllable blocks). It references section 3.12, Conjoining Jamo Behavior. This looks like a complicated area. (You can find the chapters at the [Unicode](http://www.unicode.org/versions/Unicode9.0.0/#Chapters_nb) web site, as you probably know.) — Jonathan Leffler, Mar 01 '17 at 23:47
@justhalf: Nominally, if the question is asking for "an existing library", it is at least in danger of being closed 'off-topic' because _Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam._ — Jonathan Leffler, Mar 01 '17 at 23:48
When I view '버리 + 어 ⟶ 버려', I see the Unicode code points U+BC84 U+B9AC + U+C5B4 ⟶ U+BC84 U+B824. Is that correct? So you start with two adjacent Hangul syllables, add a third, and end up with just two adjacent Hangul syllables — is that correct? — Jonathan Leffler, Mar 01 '17 at 23:56
@JonathanLeffler: That's probably why the question is phrased as "is there a way to programmatically combine Korean unicode into one?" =) — justhalf, Mar 02 '17 at 05:18
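For what it's worth, the jamo-to-syllable relation mentioned in the comments above can be checked with Python's standard unicodedata module. This is only a minimal sketch; the sample strings are illustrations chosen here, not something from the standard:

```python
import unicodedata

# NFD splits a precomposed syllable into conjoining jamo; NFC recombines them.
jamo = unicodedata.normalize('NFD', '려')
print([hex(ord(c)) for c in jamo])                 # ['0x1105', '0x1167']
print(unicodedata.normalize('NFC', jamo) == '려')   # True

# A trailing conjoining jamo (U+11AF, final ㄹ) after 나 composes to 날.
print(unicodedata.normalize('NFC', '나' + '\u11af'))  # 날

# Two complete syllables are left alone: normalization never merges 리 + 어.
print(unicodedata.normalize('NFC', '리어'))  # 리어
```

So Unicode normalization covers the 끝나 + ㄹ kind of case only if ㄹ is given as a conjoining jamo, and it does nothing for 버리 + 어.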

score 6 · Accepted Answer · answered Mar 08 '17 at 19:15

I'm Korean. First, if you type 버리 + 어, it becomes 버리어, not 버려. 버려 is a contraction of 버리어 and is not generated automatically. For the same reason, 가보아 does not automatically become 가봐 while typing.

Second, by contrast, 끝나 + ㄹ becomes 끝날 because 나 has no jongseong (종성). Note that one Hangul character is made up of a choseong (초성), a jungseong (중성), and an optional jongseong. The choseong and jongseong are consonants; the jungseong is a vowel. See more at Wikipedia. So only when the last character has no jongseong while typing (as in 끝나) can the next consonant (ㄹ) become its jongseong.
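The choseong / jungseong / jongseong structure described above is also visible in the code points themselves. Here is a rough sketch using the arithmetic of the Unicode Hangul Syllables block (19 initials × 21 vowels × 28 finals, starting at U+AC00). The function and table names are made up for this example:

```python
# Sketch: split and join precomposed Hangul syllables by code point arithmetic.
CHO = 'ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ'        # 19 initial consonants
JUNG = 'ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ'    # 21 vowels
# 28 finals; the empty string at index 0 means "no jongseong"
JONG = [''] + list('ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ')
BASE = 0xAC00  # code point of 가, the first precomposed syllable


def split_syllable(ch):
    """Return (choseong, jungseong, jongseong) for one precomposed syllable."""
    index = ord(ch) - BASE
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError('not a precomposed Hangul syllable: %r' % ch)
    cho, rest = divmod(index, 21 * 28)
    jung, jong = divmod(rest, 28)
    return CHO[cho], JUNG[jung], JONG[jong]


def join_syllable(cho, jung, jong=''):
    """Inverse of split_syllable."""
    index = (CHO.index(cho) * 21 + JUNG.index(jung)) * 28 + JONG.index(jong)
    return chr(BASE + index)


print(split_syllable('나'))              # ('ㄴ', 'ㅏ', '')
print(split_syllable('날'))              # ('ㄴ', 'ㅏ', 'ㄹ')
print(join_syllable('ㄴ', 'ㅏ', 'ㄹ'))    # 날
```

The empty third element for 나 is exactly the "no jongseong" condition that lets ㄹ attach.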

If you want to turn 버리 + 어 into 버려, you need to implement some Korean grammar rules, in this case the contraction of jungseong. For example, ㅣ + ㅓ = ㅕ and ㅗ + ㅏ = ㅘ, as in your examples. 한글 맞춤법 (the Korean orthography rules), chapter 4, section 5 (I can't find English pages right now) defines contractions like these. It's possible, but not an easy job, especially for non-Koreans.
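To make the two jungseong rules above concrete, here is a very rough sketch that hard-codes only ㅣ + ㅓ = ㅕ and ㅗ + ㅏ = ㅘ. The function names and the rule table are assumptions for illustration. Real Korean contraction has many more rules and exceptions, and in many cases the uncontracted form is also valid, so this is nowhere near a complete implementation:

```python
# Sketch only: two contraction rules, no irregular verbs, no exceptions.
JUNG = 'ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ'    # 21 vowels in Unicode order
BASE = 0xAC00   # code point of 가
IEUNG = 11      # index of the silent initial ㅇ among the 19 choseong
CONTRACTIONS = {('ㅣ', 'ㅓ'): 'ㅕ', ('ㅗ', 'ㅏ'): 'ㅘ'}  # hypothetical rule table


def parts(ch):
    """Return (choseong, jungseong, jongseong) indices of one syllable."""
    index = ord(ch) - BASE
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError('not a precomposed Hangul syllable: %r' % ch)
    cho, rest = divmod(index, 21 * 28)
    jung, jong = divmod(rest, 28)
    return cho, jung, jong


def contract(stem, ending):
    """Merge the last syllable of stem with the first syllable of ending
    when one of the hard-coded vowel rules applies; otherwise concatenate."""
    if not stem or not ending:
        return stem + ending
    cho1, jung1, jong1 = parts(stem[-1])
    cho2, jung2, jong2 = parts(ending[0])
    merged = CONTRACTIONS.get((JUNG[jung1], JUNG[jung2]))
    if jong1 != 0 or cho2 != IEUNG or merged is None:
        return stem + ending  # no rule applies
    index = (cho1 * 21 + JUNG.index(merged)) * 28 + jong2
    return stem[:-1] + chr(BASE + index) + ending[1:]


print(contract('버리', '어'))  # 버려
print(contract('가보', '아'))  # 가봐
print(contract('먹', '어'))    # 먹어 (the stem has a jongseong, so nothing merges)
```

With the numbers from the question: 리 is 47532 and 어 is 50612, and the merged syllable 려 comes out as 47140.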

Next, if all you want is to turn 끝나 + ㄹ into 끝날, that is relatively easy, since there are libraries that handle composition and decomposition of choseong, jungseong, and jongseong. For Python, I found hgtk. You can try something like this (a sketch, not practical code):

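import hgtk

# hgtk methods take one character at a time
cjj1 = hgtk.letter.decompose('나')  # ('ㄴ', 'ㅏ', '')
cjj2 = hgtk.letter.decompose('ㄹ')  # ('ㄹ', '', '')
# Merge only if the first letter has no jongseong and the second has no jungseong
if cjj1[2] == '' and cjj2[1] == '':
    cjj = (cjj1[0], cjj1[1], cjj2[0])
    cjj2 = None
    print(hgtk.letter.compose(*cjj))  # 날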

Still, without proper knowledge of Hangul, it will be very hard to get it done.
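If installing hgtk is not an option, the 끝나 + ㄹ step can also be sketched with code point arithmetic alone, working on whole strings. The name attach_final is hypothetical, and the sketch assumes the consonant is given as a compatibility jamo such as ㄹ (U+3139), which is what a keyboard normally produces:

```python
# Sketch: attach a consonant as jongseong to the last syllable of a string.
# 28 finals in Unicode order; the empty string at index 0 means "no jongseong"
JONG = [''] + list('ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ')
BASE = 0xAC00  # code point of 가


def attach_final(text, consonant):
    """Return text with consonant merged into its last syllable when that
    syllable has no jongseong; otherwise just concatenate."""
    if not text:
        return consonant
    index = ord(text[-1]) - BASE
    is_syllable = 0 <= index < 19 * 21 * 28
    has_no_final = is_syllable and index % 28 == 0
    if not has_no_final or consonant not in JONG[1:]:
        return text + consonant
    # Finals are consecutive code points, so adding the index is enough.
    return text[:-1] + chr(ord(text[-1]) + JONG.index(consonant))


print(attach_final('끝나', 'ㄹ'))  # 끝날
print(attach_final('끝', 'ㄹ'))    # 끝ㄹ (끝 already has a jongseong)
```

Consonants that cannot be a jongseong (ㄸ, ㅃ, ㅉ) fall through to plain concatenation.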

Thanks @SangbokLee for the explanation!! – alvas Mar 09 '17 at 01:24

score 3 · Answer 2 · answered Mar 02 '17 at 17:24

You could use your own translation table.
The drawback is that you have to enter all the pairs manually, or have a file to load them from.
For instance:

# Sample Korean strings to map: ((stem, suffix), combined form)
k = [(('버리', '어'), '버려'), (('가보', '아'), '가봐'), (('끝나', 'ㄹ'), '끝날')]

class Korean(object):
    def __init__(self):
        # Key: stem + suffix concatenated; value: the combined form
        self.map = {}

        for (stem, suffix), combined in k:
            self.map[stem + suffix] = combined

    def __getitem__(self, item):
        return self.map[item]

    def translate(self, s):
        # Look up every token; raises KeyError for pairs not in the table
        return [self.map[token] for token in s]

if __name__ == '__main__':
    k_map = Korean()
    k_chars = [stem + suffix for (stem, suffix), _ in k]

    print('Input: %s' % k_chars)
    print('Output: %s' % k_map.translate(k_chars))

    one_char_3 = k[0][0][0] + k[0][0][1]
    print('%s = %s' % (one_char_3, k_map[one_char_3]))

Input: ['버리어', '가보아', '끝나ㄹ']
Output: ['버려', '가봐', '끝날']
버리어 = 버려

Tested with Python:3.4.2

stovfl

14,998
7
24
51

This is impractical since the number of combinations is very large in Hangul. – Sangbok Lee Mar 08 '17 at 18:17
I pointed that out. How large? Tell me the number. Think of it more like a dictionary: all the dictionaries in the world have been written manually. One of the terms, left or right, should already exist, so it's only half of the work. The samples given are unordered; I think ordered samples will show a mathematical correlation. – stovfl Mar 08 '17 at 19:09
See the Wikipedia link in my answer for the details. And the number is [1,638,394](https://namu.wiki/w/%EC%9C%A0%EB%8B%88%EC%BD%94%EB%93%9C#s-4.2.2) (Korean page). – Sangbok Lee Mar 08 '17 at 19:15
The number is based on Unicode. If you take a rather old character subset, it can be reduced, but it is still impractical. – Sangbok Lee Mar 08 '17 at 19:25
