I have a list of strings and would like to detokenize (join) only specific strings in it. Given the list below, I would like to join the words "my" and "apple" only when they appear next to each other in that order. I was thinking of using the detokenize function from the question "Python Untokenize a sentence". Here is some reproducible code:

target = "my apple"
words = ['this', 'is', 'my', 'apple', 'and', 'this', 'is', 'not', 'your', 'apple']

Using the detokenizer:

>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> TreebankWordDetokenizer().detokenize(['my', 'apple'])
'my apple'

But I am not sure how to apply this to a list of many strings while specifying a target phrase. Here is the desired output:

target_output = ['this', 'is', 'my apple', 'and', 'this', 'is', 'not', 'your', 'apple']
['this', 'is', 'my apple', 'and', 'this', 'is', 'not', 'your', 'apple']

So I was wondering if anyone knows how to detokenize some specific words only if they are next to each other in a list?

user2390182
Quinten

1 Answer

The following seems simple enough:

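def detokenize(sent, tgt):
    """Yield the tokens of sent, merging each adjacent, in-order occurrence of tgt."""
    i = 0
    tgt_len = len(tgt.split())  # number of tokens in the target; allows phrases longer than 2
    while i < len(sent):
        # Compare the window of tgt_len tokens starting at i with the target phrase.
        if " ".join(sent[i:i+tgt_len]) == tgt:
            yield tgt
            i += tgt_len  # skip past the tokens that were just merged
        else:
            yield sent[i]
            i += 1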

>>> list(detokenize(words, "my apple"))
['this', 'is', 'my apple', 'and', 'this', 'is', 'not', 'your', 'apple']
>>> list(detokenize(words, "this is not"))
['this', 'is', 'my', 'apple', 'and', 'this is not', 'your', 'apple']
user2390182