I am trying to tokenize Thai-language text using deepcut in Python, and I am getting a UnicodeDecodeError.
This is what I have tried:
import deepcut
thai = 'ตัดคำได้ดีมาก'
result = deepcut.tokenize(thai)
Expected output:
['ตัดคำ','ได้','ดี','มาก']
To decode each item, I tried:
for i in result:
    print(i.decode('utf-8'))
Error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 0: unexpected end of data
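It seems this error is expected for a single item: 0xe0 is only the first byte of a three-byte UTF-8 sequence for a Thai character, so one item on its own cannot be decoded. A minimal check (assuming Python 2, where the literal above is a byte string):

print('\xe0'.decode('utf-8'))  # raises the same UnicodeDecodeError

Printing the raw items shows that each element is a single byte: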
print([i for i in result])
Output: ['\xe0', '\xb8', '\x95', '\xe0', '\xb8', '\xb1', '\xe0', '\xb8', '\x94', '\xe0', '\xb8', '\x84', '\xe0', '\xb8', '\xb3', '\xe0', '\xb9', '\x84', '\xe0', '\xb8', '\x94', '\xe0', '\xb9', '\x89', '\xe0', '\xb8', '\x94', '\xe0', '\xb8', '\xb5', '\xe0', '\xb8', '\xa1', '\xe0', '\xb8', '\xb2', '\xe0', '\xb8', '\x81']
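If I join the items back together and decode the whole byte string, I get the original text back, which suggests tokenize received the individual bytes of the str rather than whole Thai characters (again assuming Python 2):

joined = ''.join(result)
print(joined.decode('utf-8'))  # prints ตัดคำได้ดีมาก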
How can I get it to display the proper tokenized results, or is there a better way to tokenize Thai-language text?