I would like to untokenize (re-join) multiple combinations of tokens in a list. For example, I would like to merge the consecutive tokens "my" and "apple" into "my apple", and the consecutive tokens "this", "is" and "not" into "this is not". I am currently using this nicely written function by @user2390182, but it handles only one target phrase per call, so when I loop over several targets, each one is merged in a separate output list instead of all of them in the same list. Here is some reproducible code:

def detokenize(sent, tgt):
    i = 0
    tgt_len = len(tgt.split())  # this allows for phrases longer than 2
    while i < len(sent):
        if " ".join(sent[i:i+tgt_len]) == tgt:
            yield tgt
            i += tgt_len
        else:
            yield sent[i]
            i += 1

target = ["my apple", "this is not"]
words = ['this', 'is', 'my', 'apple', 'and', 'this', 'is', 'not', 'your', 'apple']

for i in target:
    print(list(detokenize(words, i)))

Returns:

['this', 'is', 'my apple', 'and', 'this', 'is', 'not', 'your', 'apple']
['this', 'is', 'my', 'apple', 'and', 'this is not', 'your', 'apple']

But I would like to have it in one list like this:

target_output = ['this', 'is', 'my apple', 'and', 'this is not', 'your', 'apple']

Does anyone know how to untokenize multiple target combinations in a single output list, as shown above?
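For reference, one possible direction is to check every target phrase at each position instead of calling the function once per target. The following is only a minimal sketch under my own assumptions: the name `detokenize_many` is hypothetical (not from the original function), longer phrases are tried first so overlapping targets resolve to the longest match, and merged phrases are re-joined with single spaces.

```python
def detokenize_many(sent, targets):
    # Hypothetical helper: merge any of several target phrases in one pass.
    # Split each phrase into a tuple of tokens; try longer phrases first
    # (an assumption about how overlapping targets should be resolved).
    phrases = sorted((tuple(t.split()) for t in targets), key=len, reverse=True)
    i = 0
    while i < len(sent):
        for phrase in phrases:
            n = len(phrase)
            # Compare the next n tokens with the phrase; skip empty phrases.
            if n and tuple(sent[i:i + n]) == phrase:
                yield " ".join(phrase)
                i += n
                break
        else:
            # No phrase matched at this position: emit the token unchanged.
            yield sent[i]
            i += 1


target = ["my apple", "this is not"]
words = ['this', 'is', 'my', 'apple', 'and', 'this', 'is', 'not', 'your', 'apple']

result = list(detokenize_many(words, target))
print(result)
```

With the `words` and `target` lists from above, this should produce a single list equal to `target_output`. Whether longest-match-first is the right rule for overlapping phrases depends on the use case.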

Quinten