I am trying to query and display UTF-8 encoded characters in a GUI built on tkinter, and thus on Tcl/Tk. However, I have found that tkinter cannot display characters that take 4 bytes in UTF-8, i.e. Unicode code points greater than U+FFFF. Why is this the case? What limitations stand in the way of supporting non-BMP characters in Tcl?

I can't query non-BMP characters through my GUI, but if one comes up in a result I can copy and paste it and see the character and its code point on unicode-table.com, even though my system does not display it. So it seems that the character is displayed as U+FFFD but stored in the view with the correct code point.

I am running a Python 3.6.4 script on Windows 7.

Update: Here is the error I get, for context. The code point of the 4-byte character is outside the BMP range and can't be handled by Tcl:

  File "Project/userInterface.py", line 569, in populate_tree
    iids.append(self.detailtree.insert('', 'end', values=entry))
  File "C:\Program Files (x86)\Python36-32\Lib\tkinter\ttk.py", line 1343, in insert
    res = self.tk.call(self._w, "insert", parent, index, *opts)
_tkinter.TclError: character U+1f624 is above the range (U+0000-U+FFFF) allowed by Tcl
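
As a quick check of why this particular character trips the limit, here is a small sketch (not part of the original script) that inspects U+1F624 in plain Python. The variable names are illustrative only.

```python
ch = '\U0001F624'  # the character named in the TclError above

code_point = ord(ch)
utf8_len = len(ch.encode('utf-8'))             # bytes needed in UTF-8
utf16_units = len(ch.encode('utf-16-le')) // 2  # 16-bit code units in UTF-16

print(hex(code_point))      # 0x1f624
print(code_point > 0xFFFF)  # True: outside the Basic Multilingual Plane
print(utf8_len)             # 4
print(utf16_units)          # 2, i.e. a surrogate pair
```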

I handle this by using a regular expression to replace out-of-range Unicode characters with the replacement character U+FFFD.

    import re  # at module level

    # Matches code points above U+FFFF (4 bytes in UTF-8) and lone surrogates,
    # which tkinter/Tcl cannot handle or display.
    NON_BMP_RE = re.compile('[^\u0000-\uD7FF\uE000-\uFFFF]')

    for item in entries:
        entry = list(item)
        for i, col in enumerate(entry):
            if col and isinstance(col, str):
                # Replace U+10000 and above with the replacement character U+FFFD
                entry[i] = NON_BMP_RE.sub('\uFFFD', col)
        iids.append(self.detailtree.insert('', 'end', values=tuple(entry)))
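
For anyone who wants to try the substitution without the GUI, the same idea can be sketched as a standalone helper. The names `replace_non_bmp` and `sanitize_row` are hypothetical and not from the project; the pattern is the one used above.

```python
import re

# Anything outside U+0000-U+D7FF and U+E000-U+FFFF: non-BMP characters
# and lone surrogates.
_NON_BMP_RE = re.compile('[^\u0000-\uD7FF\uE000-\uFFFF]')


def replace_non_bmp(text, replacement='\uFFFD'):
    """Return text with every out-of-range character replaced."""
    return _NON_BMP_RE.sub(replacement, text)


def sanitize_row(row):
    """Apply replace_non_bmp to the string columns of one row."""
    return tuple(
        replace_non_bmp(col) if isinstance(col, str) else col
        for col in row
    )


# U+1F624 is swapped for U+FFFD; other columns are left alone.
print(sanitize_row(('ok \U0001F624', 42, None)) == ('ok \uFFFD', 42, None))
```

Note that this is lossy: the original character cannot be recovered from the value stored in the widget.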
Alec White
  • Ask the authors why. Create a minimal working example of your problem. – furas Jan 17 '18 at 21:22
  • You could use pillow to solve that problem. – Xantium Jan 17 '18 at 22:17
  • @Simon Pillow is Python 3's version of the Python Imaging Library, correct? Here BMP does not mean an image bitmap but the Basic Multilingual Plane (Plane 0), i.e. characters that take at most 3 bytes in UTF-8. Would Pillow help with extending Tcl to 4-byte characters? – Alec White Jan 17 '18 at 23:12
  • Ah I misunderstood. – Xantium Jan 17 '18 at 23:13
  • In Windows, tkinter should support non-BMP characters if you pass UTF-16 surrogate codes in the string. Python allows this with the `'surrogatepass'` error handler. I don't think this is possible with UTF-8 in Unix. For example: `title_bytes = '\U0001F60A'.encode('utf-16le');` `title = ''.join(title_bytes[n:n+2].decode('utf-16le', 'surrogatepass') for n in range(0, len(title_bytes), 2));` `root = Tk();` `root.title(title);` `root.mainloop()`. – Eryk Sun Jan 18 '18 at 01:37
  • Tkinter only supports the BMP because Tk (the library that Tkinter is a wrapper around) only supports the BMP. That's a known issue with Tk that should be minimally fixed (provided you don't poke too closely) in 8.7. Encoding as surrogate pairs should work for now. – Donal Fellows Jan 18 '18 at 09:12
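
The surrogate-pair approach mentioned in the comments can be sketched as a small conversion function. This is only an illustration, assuming Python 3.4 or later (where `'surrogatepass'` works with the UTF-16 codecs). `to_surrogate_pairs` is a hypothetical name, and whether Tk then renders the character depends on the platform and Tk version, which this sketch does not test.

```python
def to_surrogate_pairs(text):
    """Re-express each non-BMP character as two UTF-16 surrogate code points."""
    data = text.encode('utf-16-le', 'surrogatepass')
    # Decode every 16-bit code unit on its own so that the two halves of a
    # surrogate pair stay separate characters in the resulting str.
    return ''.join(
        data[i:i + 2].decode('utf-16-le', 'surrogatepass')
        for i in range(0, len(data), 2)
    )


converted = to_surrogate_pairs('hi \U0001F60A')
print(len(converted))                          # 5: 'h', 'i', ' ', and two surrogates
print(all(ord(c) <= 0xFFFF for c in converted))  # True
```

Unlike substituting U+FFFD, this keeps the information needed to recover the original character.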

0 Answers