I would like to reverse the tokenization that I have applied to my data.

data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'a', 'sentence', '2']]

Expected output:

['this is a sentence', 'this is a sentence 2']

I tried to do this with the following code block:

from nltk.tokenize.treebank import TreebankWordDetokenizer
data_untoken= []
for i, text in enumerate(data):
    data_untoken.append(text)
    data_untoken = TreebankWordDetokenizer().detokenize(text)

But I get the following error:

'str' object has no attribute 'append'
Tazz

1 Answer

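Use str.join(). The error occurs because the line data_untoken = TreebankWordDetokenizer().detokenize(text) rebinds data_untoken from a list to a string, so the append call on the next iteration fails. Joining each token list with a space avoids the problem: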

def untokenize(data):
    for tokens in data:
        yield ' '.join(tokens)


data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'a', 'sentence', '2']]
untokenized_data = list(untokenize(data))
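
If you would rather keep the loop-and-append shape from the question, a minimal sketch could look like the following. It assumes plain space-separated output is what you want, and the names sentence and data_untoken_short are only illustrative. The key point is that the result list and the per-sentence string live under different names, so the list is never overwritten.

```python
data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'a', 'sentence', '2']]

# Loop form: the list (data_untoken) and the joined string (sentence)
# use separate names, so append always runs on a list.
data_untoken = []
for tokens in data:
    sentence = ' '.join(tokens)
    data_untoken.append(sentence)

# Equivalent list comprehension.
data_untoken_short = [' '.join(tokens) for tokens in data]

print(data_untoken)
```

The same loop shape should also work if you replace ' '.join(tokens) with TreebankWordDetokenizer().detokenize(tokens) from the question, which may be preferable when the tokens include punctuation.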
Code-Apprentice