UnicodeDecodeError when tokenizing Thai language text in Python

Question

I am trying to tokenize Thai text using deepcut in Python, and I am getting a UnicodeDecodeError.

This is what I have tried:

import deepcut

thai = 'ตัดคำได้ดีมาก'
result = deepcut.tokenize(thai)

Expected output:

['ตัดคำ', 'ได้', 'ดี', 'มาก']

Tried:

for i in result:
  print(i.decode('utf-8'))

Error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 0: unexpected end of data

print([i for i in result])

Output: ['\xe0', '\xb8', '\x95', '\xe0', '\xb8', '\xb1', '\xe0', '\xb8', '\x94', '\xe0', '\xb8', '\x84', '\xe0', '\xb8', '\xb3', '\xe0', '\xb9', '\x84', '\xe0', '\xb8', '\x94', '\xe0', '\xb9', '\x89', '\xe0', '\xb8', '\x94', '\xe0', '\xb8', '\xb5', '\xe0', '\xb8', '\xa1', '\xe0', '\xb8', '\xb2', '\xe0', '\xb8', '\x81']

How can I get it to display the properly tokenized results, or is there a better way to tokenize Thai text?

On my machine, `print(result)` gives `['ตัด', 'คำ', 'ได้', 'ดี', 'มาก']`. Your for loop raises `AttributeError: 'str' object has no attribute 'decode'`, so please provide steps to reproduce the error. — Morse, Mar 20 '18 at 14:16
@EmilyE. This is the exact example with errors except it runs on a Databricks notebook. — Cryssie, Mar 20 '18 at 23:23

score -1 · Answer 1 · answered Mar 20 '18 at 13:09


You don't need to decode the tokens from UTF-8. Just try:

import deepcut

thai = 'ตัดคำได้ดีมาก'
result = deepcut.tokenize(thai)

print([i for i in result])

output:

['ตัด', 'คำ', 'ได้', 'ดี', 'มาก']

Apart from that, you can also try this Thai NLP module.
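If the list still comes back as single bytes such as '\xe0', one possible explanation (an assumption, not verified on Databricks) is that the notebook runs Python 2, where a plain '...' literal is a byte string rather than Unicode text. The sketch below does not use deepcut at all; it only shows, with the standard library, why decoding one byte at a time fails while decoding the whole byte sequence works. The variable names are illustrative.

```python
# Unicode text from the question: 13 code points.
text = u'ตัดคำได้ดีมาก'

# The UTF-8 encoded form, i.e. what a Python 2 plain '...' literal would hold.
raw = text.encode('utf-8')

# Each Thai code point here takes 3 bytes in UTF-8, so 13 code points -> 39 bytes.
print(len(text), len(raw))

# Decoding a single byte fails, because 0xe0 only starts a 3-byte sequence.
try:
    raw[0:1].decode('utf-8')
    single_byte_failed = False
except UnicodeDecodeError as exc:
    single_byte_failed = True
    print(exc)

# Decoding the complete byte sequence recovers the original text.
decoded = raw.decode('utf-8')
print(decoded == text)
```

If that is the cause, the thing to try would be passing Unicode text to `deepcut.tokenize`, for example a `u'...'` literal or `thai.decode('utf-8')` on the whole string before tokenizing, rather than decoding each element of the result.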


Aaditya Ura


I tried printing the list without decoding, but I get a series of single bytes, ['\xe0', '\xb8', '\x95', ..., as output. I think an encoding problem may be preventing the Thai text from being tokenized properly. – Cryssie Mar 20 '18 at 13:31
I get the same output with just `print(result)` – Morse Mar 20 '18 at 14:10
