
I am trying to tokenize Thai-language text using deepcut in Python, and I am getting a UnicodeDecodeError.

This is what I have tried:

import deepcut

thai = 'ตัดคำได้ดีมาก'
result = deepcut.tokenize(thai)

Expected output:

['ตัดคำ','ได้','ดี','มาก']

Tried:

for i in result:
  print(i.decode('utf-8'))

Error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 0: unexpected end of data

print([i for i in result])

Output: ['\xe0', '\xb8', '\x95', '\xe0', '\xb8', '\xb1', '\xe0', '\xb8', '\x94', '\xe0', '\xb8', '\x84', '\xe0', '\xb8', '\xb3', '\xe0', '\xb9', '\x84', '\xe0', '\xb8', '\x94', '\xe0', '\xb9', '\x89', '\xe0', '\xb8', '\x94', '\xe0', '\xb8', '\xb5', '\xe0', '\xb8', '\xa1', '\xe0', '\xb8', '\xb2', '\xe0', '\xb8', '\x81']

How can I get it to display the properly tokenized results, or is there a better way to tokenize Thai-language text?

Cryssie
  • On my machine, printing `result` gives `['ตัด', 'คำ', 'ได้', 'ดี', 'มาก']`. Your for loop raises `AttributeError: 'str' object has no attribute 'decode'`, so please provide steps to reproduce the error. – Morse Mar 20 '18 at 14:16
  • Can you provide the output of `print(result)`? – Morse Mar 20 '18 at 14:33
  • A [mcve] please. –  Mar 20 '18 at 17:19
  • @EmilyE. This is the exact example with errors except it runs on a Databricks notebook. – Cryssie Mar 20 '18 at 23:23

1 Answer


You don't need to decode the result as UTF-8. Just try:

import deepcut

thai = 'ตัดคำได้ดีมาก'
result = deepcut.tokenize(thai)

print([i for i in result])

Output:

['ตัด', 'คำ', 'ได้', 'ดี', 'มาก']

Apart from that, you can also try this Thai NLP module.
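If the result instead comes back as single bytes such as '\xe0', '\xb8', '\x95', one plausible cause is that the tokenizer received a UTF-8 byte string rather than text. That is what a plain '...' literal holds under Python 2, and it is only an assumption about the asker's environment, not something confirmed here. The sketch below does not call deepcut at all; it only illustrates why decoding one byte at a time fails while decoding the whole string first works. The helper name `to_text` is hypothetical.

```python
# -*- coding: utf-8 -*-
# Sketch only: shows bytes vs. text for the sample string; deepcut is not used.

thai_text = 'ตัดคำได้ดีมาก'               # text (str in Python 3)
thai_bytes = thai_text.encode('utf-8')   # the same content as UTF-8 bytes

# Each of these Thai characters takes three bytes in UTF-8.
print(len(thai_text), len(thai_bytes))


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: decode bytes to text, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A lone 0xe0 byte is only the start of a three-byte sequence,
# so decoding it on its own raises UnicodeDecodeError.
try:
    thai_bytes[:1].decode('utf-8')
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False

# Decoding the complete byte string before any splitting succeeds.
decoded = to_text(thai_bytes)
print(lone_byte_ok, decoded == thai_text)
```

Under Python 2, the equivalent step would be to pass text rather than bytes, for example with a `u'...'` literal or by calling `.decode('utf-8')` on the whole string before tokenizing. Whether that resolves the deepcut output in a given notebook environment is untested here.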

Aaditya Ura
  • I tried printing the list without decoding, but I get output like ['\xe0', '\xb8', '\x95', ... instead. I think an encoding problem may be preventing the Thai text from being tokenized properly. – Cryssie Mar 20 '18 at 13:31
  • I get the same output with just `print(result)` – Morse Mar 20 '18 at 14:10