Undo the tokenization in Python

Question

I would like to reverse the tokenization that I have applied to my data.

data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'a', 'sentence', '2']]

Expected output:

['this is a sentence', 'this is a sentence 2']

I tried to do this with the following code block:

from nltk.tokenize.treebank import TreebankWordDetokenizer
data_untoken= []
for i, text in enumerate(data):
    data_untoken.append(text)
    data_untoken = TreebankWordDetokenizer().detokenize(text)

But I get the following error:

'str' object has no attribute 'append'

`' '.join(data)` – Amit Vikram Singh Apr 13 '21 at 21:24

Code-Apprentice · Accepted Answer · 2021-04-13T22:00:21.020

Use str.join() on each inner list of tokens:

def untokenize(data):
    for tokens in data:
        yield ' '.join(tokens)


data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'a', 'sentence', '2']]
untokenized_data = list(untokenize(data))
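
For comparison, here is a minimal sketch of two equivalent forms, assuming the same `data` as in the question. The first is a list comprehension. The second keeps the shape of the loop from the question but appends each joined string to the list. The original loop assigned the detokenized string to `data_untoken`, so on the second iteration `data_untoken` was a `str` and `.append` failed. If punctuation needs to be reattached, `TreebankWordDetokenizer().detokenize(tokens)` could likely be used in place of `' '.join(tokens)`, though that is not tested here.

```python
data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'a', 'sentence', '2']]

# Equivalent one-liner: join each inner token list with a single space.
untokenized_data = [' '.join(tokens) for tokens in data]

# Same loop shape as in the question, but the joined string is appended
# to the result list instead of being assigned over it.
data_untoken = []
for tokens in data:
    data_untoken.append(' '.join(tokens))

print(untokenized_data)
# ['this is a sentence', 'this is a sentence 2']
```

Both forms build the whole list in memory at once, whereas the generator version above produces one sentence at a time, which may matter for large datasets.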

edited Apr 13 '21 at 22:00

answered Apr 13 '21 at 21:29
