I would like to untokenize multiple combinations of strings in a list of tokens. For example, I would like to merge the consecutive tokens "my" and "apple" into "my apple", and the consecutive tokens "this", "is" and "not" into "this is not". I am currently using this nicely written function by @user2390182, but it handles only one target at a time: when I loop over several targets, each call returns its own list instead of one list with every combination merged. Here is some reproducible code:
def detokenize(sent, tgt):
    i = 0
    tgt_len = len(tgt.split())  # this allows for phrases longer than 2
    while i < len(sent):
        if " ".join(sent[i:i+tgt_len]) == tgt:
            yield tgt
            i += tgt_len
        else:
            yield sent[i]
            i += 1

target = ["my apple", "this is not"]
words = ['this', 'is', 'my', 'apple', 'and', 'this', 'is', 'not', 'your', 'apple']
for i in target:
    print(list(detokenize(words, i)))
This prints:
['this', 'is', 'my apple', 'and', 'this', 'is', 'not', 'your', 'apple']
['this', 'is', 'my', 'apple', 'and', 'this is not', 'your', 'apple']
But I would like to have it in one list like this:
target_output = ['this', 'is', 'my apple', 'and', 'this is not', 'your', 'apple']
Does anyone know how to untokenize multiple target combinations in a single list, as in the desired output above?
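One possible direction is to check every target at each position in a single pass, rather than looping over the targets one at a time. The following is only a minimal sketch under assumptions: `detokenize_many` is a hypothetical name (not part of the original function), targets are assumed to be separated by single spaces, and longer phrases are tried first so that overlapping targets resolve to the longest match.

```python
def detokenize_many(sent, targets):
    """Merge every occurrence of any target phrase in a single pass (sketch)."""
    # Pre-split each target into a tuple of tokens and try longer phrases
    # first, so overlapping candidates resolve to the longest match.
    phrases = sorted((tuple(t.split()) for t in targets), key=len, reverse=True)
    i = 0
    while i < len(sent):
        for phrase in phrases:
            n = len(phrase)
            # Skip empty targets; compare the next n tokens to the phrase.
            if n and tuple(sent[i:i + n]) == phrase:
                yield " ".join(phrase)
                i += n
                break
        else:
            # No target starts at position i, so emit the token unchanged.
            yield sent[i]
            i += 1


target = ["my apple", "this is not"]
words = ['this', 'is', 'my', 'apple', 'and', 'this', 'is', 'not', 'your', 'apple']
result = list(detokenize_many(words, target))
print(result)
# ['this', 'is', 'my apple', 'and', 'this is not', 'your', 'apple']
```

The `for ... else` clause runs only when no target matched at the current position, which keeps the single-token fallback from the original `detokenize` intact.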