43

There are so many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.

 import nltk
 words = nltk.word_tokenize("I've found a medicine for my disease.")

The result I get is: `['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']`

Is there any function that reverts the tokenized sentence to the original state? The function tokenize.untokenize() for some reason doesn't work.

Edit:

I know that I can, for example, do the following, and this probably solves the problem, but I am curious whether there is a built-in function for this:

result = ' '.join(sentence).replace(' , ',',').replace(' .','.').replace(' !','!')
result = result.replace(' ?','?').replace(' : ',': ').replace(' \'', '\'')   
Noam Hacker
Brana
  • How did you get `'ve` from a sentence that used `have`? Is that what nltk actually does, or a transcription error? – user2357112 Feb 22 '14 at 00:46
  • I have modified the tokenised result. Anyway it is for a general case so you can put I've in the original sentence. – Brana Feb 22 '14 at 00:55
  • I am pretty sure that what you are requesting isn't possible. If you just have the bare strings `"I"` and `"'ve"` it is easy for a human to look at them and say "Oh, those two should go together without a space" but no simple program could figure that out. If the original parts-of-speech information that NLTK figured out from the original sentence was available, that could be used to untokenize, but `tokenize.untokenize()` was designed to work with `tokenize.tokenize()` and not `nltk.tokenize()`. You might want to read the free online book for NLTK: http://nltk.org/book – steveha Feb 25 '14 at 05:38
  • I edited the question so the source text has `'ve` to match the answer text. – steveha Feb 25 '14 at 05:39

10 Answers

74

You can use the Treebank detokenizer, TreebankWordDetokenizer:

from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'the quick brown'

There is also MosesDetokenizer, which was in nltk but got removed because of licensing issues; it is now available as the standalone Sacremoses package.
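For example, here is a minimal sketch using the standalone package (assuming sacremoses has been installed, e.g. via pip install sacremoses):

from sacremoses import MosesDetokenizer

# lang='en' selects the English detokenization rules
md = MosesDetokenizer(lang='en')
md.detokenize(['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.'])
# expected: "I've found a medicine for my disease."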

alecxe
  • This is very helpful :) hopefully more people will upvote this – windweller Feb 06 '17 at 04:31
  • 1
    It's installable using `pip install nltk` now (v3.2.2). – Kirill Bulygin Feb 07 '17 at 14:57
  • 1
    @KirillBulygin thanks for the update! I've put this information into the answer. – alecxe Feb 07 '17 at 15:00
  • Using `l` as a variable name is confusing. I mistook it for a `1`. – Nick Graham Apr 05 '18 at 12:19
  • @NickGraham right, that, though, renders half of stackoverflow confusing :) Changing to `data`. Thanks. – alecxe Apr 05 '18 at 12:20
  • 4
    As of April 10, 2018, moses is not available in NLTK due to a licensing issue https://github.com/nltk/nltk/issues/2000 – gss Jun 21 '18 at 16:09
  • 2
    But it seems to have moved here https://github.com/alvations/sacremoses – gss Jun 21 '18 at 16:34
  • 1
    Per @fearwig above, this is not "nowadays" the correct answer any more; use [Uri's](https://stackoverflow.com/a/51007595/240443) answer. – Amadan Jul 03 '19 at 05:11
  • 1
    @Amadan thanks for flagging. Updated the answer accordingly. – alecxe Jul 03 '19 at 13:56
  • It's nice there is finally a solution to this problem. Moses is quite fast to load, I believe faster than NLTK; I did not test recently, but when I did, NLTK load time was over 1 sec, maybe even 2 sec. – Brana Jul 29 '19 at 16:16
  • 2
    When I use detokenize, sometimes I get a space before punctuation (before a period or comma) which I don't want. Anyone else have this problem or know what might be the issue? – Ken Mar 14 '21 at 21:27
12

To reverse word_tokenize from nltk, I suggest looking at http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize and doing some reverse engineering.

Short of doing crazy hacks on nltk, you can try this:

>>> import nltk
>>> import string
>>> nltk.word_tokenize("I've found a medicine for my disease.")
['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
>>> tokens = nltk.word_tokenize("I've found a medicine for my disease.")
>>> "".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
"I've found a medicine for my disease."
alvas
6

Use token_utils.untokenize from here:

import re
def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
         "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

tokenized = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
untokenize(tokenized)
# "I've found a medicine for my disease."
Renklauf
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/low-quality-posts/10825891) – Łukasz Rogalski Jan 08 '16 at 22:06
  • @Rogalski Suggested changes made. – Renklauf Jan 11 '16 at 02:29
4
from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'the quick brown'
Uri
  • 2
    While this code may answer the question, it is better to explain how to solve the problem and provide the code as an example or reference. Code-only answers can be confusing and lack context. – Robert Columbia Jun 24 '18 at 11:34
  • 2
    There is no not-redundant sentence to add. – Uri Jun 24 '18 at 14:01
1

I propose keeping offsets during tokenization: (token, offset). I think this information is useful for processing over the original sentence.

import re
from nltk.tokenize import word_tokenize

def offset_tokenize(text):
    tail = text
    accum = 0
    tokens = word_tokenize(text)
    info_tokens = []
    for tok in tokens:
        escaped_tok = re.escape(tok)
        m = re.search(escaped_tok, tail)
        start, end = m.span()
        # global offsets
        gs = accum + start
        ge = accum + end
        accum += end
        # keep searching in the rest
        tail = tail[end:]
        info_tokens.append((tok, (gs, ge)))
    return info_tokens

sent = '''I've found a medicine for my disease.

This is line:3.'''

toks_offsets = offset_tokenize(sent)

for t in toks_offsets:
    (tok, offset) = t
    print(tok == sent[offset[0]:offset[1]], tok, sent[offset[0]:offset[1]])

Gives:

True I I
True 've 've
True found found
True a a
True medicine medicine
True for for
True my my
True disease disease
True . .
True This This
True is is
True line:3 line:3
True . .
alemol
1

For me, it worked when I installed Python nltk 3.2.5:

pip install -U nltk

then,

import nltk
nltk.download('perluniprops')

from nltk.tokenize.moses import MosesDetokenizer

If you are using it inside a pandas DataFrame, then:

detokenizer = MosesDetokenizer()
df['detoken'] = df['token_column'].apply(lambda x: detokenizer.detokenize(x, return_str=True))
DIF
  • '''import nltk; nltk.download('perluniprops'); nltk.download('nonbreaking_prefixes')'''; from nltk.tokenize.moses import MosesTokenizer; from nltk.tokenize.moses import MosesDetokenizer; text = 'Pete ate a large cake. Sam has a big mouth.'; text_ = MosesTokenizer().tokenize(text); text1 = ' '.join(MosesDetokenizer().detokenize(text_)) # works for multiple sentences as well while the other methods (except Renklauf's) don't. – mikey Sep 12 '19 at 07:36
1

The reason there is no simple answer is you actually need the span locations of the original tokens in the string. If you don't have that, and you aren't reverse engineering your original tokenization, your reassembled string is based on guesses about the tokenization rules that were used. If your tokenizer didn't give you spans, you can still do this if you have three things:

1) The original string

2) The original tokens

3) The modified tokens (I'm assuming you have changed the tokens in some way, because that is the only application for this I can think of if you already have #1)

Use the original token set to identify spans (wouldn't it be nice if the tokenizer did that?) and modify the string from back to front so the spans don't change as you go.

Here I'm using TweetTokenizer, but it shouldn't matter, as long as the tokenizer you use doesn't alter the tokens so that they no longer appear verbatim in the original string.

import nltk

tokenizer = nltk.tokenize.casual.TweetTokenizer()
string = "One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin."
tokens = tokenizer.tokenize(string)
replacement_tokens = list(tokens)
replacement_tokens[-3] = "cute"

def detokenize(string, tokens, replacement_tokens):
    spans = []
    cursor = 0
    for token in tokens:
        # advance the cursor until the token is found in the original string
        while not string[cursor:cursor + len(token)] == token and cursor < len(string):
            cursor += 1
        if cursor == len(string):
            break
        newcursor = cursor + len(token)
        spans.append((cursor, newcursor))
        cursor = newcursor
    # replace from back to front so earlier spans stay valid
    i = len(tokens) - 1
    for start, end in spans[::-1]:
        string = string[:start] + replacement_tokens[i] + string[end:]
        i -= 1
    return string

>>> detokenize(string,tokens,replacement_tokens)
'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a cute vermin.'
gss
0

The reason tokenize.untokenize does not work is that it needs more information than just the words. Here is an example program using tokenize.untokenize:

from io import StringIO
import tokenize

sentence = "I've found a medicine for my disease.\n"
tokens = tokenize.generate_tokens(StringIO(sentence).readline)
print(tokenize.untokenize(tokens))


Additional Help: Tokenize - Python Docs | Potential Problem

dparpyani
  • Thanks, but I have to convert specifically the output back to a sentence. Is there any way to add the necessary info to the tokenized output - ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.'] – Brana Feb 22 '14 at 02:57
  • I would do this in the way shown in the update, but I found it really strange that nltk doesn't have such a method. – Brana Feb 22 '14 at 02:59
  • @Brana Sorry, I am not too familiar with `nltk`. I tried looking through the docs, but couldn't find untokenize. – dparpyani Feb 22 '14 at 03:02
  • Thanks. I didn't either, so I thought it was just me. – Brana Feb 22 '14 at 22:00
0

I am using the following code, without any major library function, for detokenization purposes. I am using detokenization for some specific tokens.

_SPLITTER_ = r"([-.,/:!?\";)(])"

def basic_detokenizer(sentence):
    """This is the basic detokenizer; it helps us resolve the issues created by our tokenizer."""
    detokenize_sentence = []
    words = sentence.split(' ')
    pos = 0
    while pos < len(words):
        if words[pos] in '-/.' and pos > 0 and pos < len(words) - 1:
            # glue an infix symbol to both neighbours, e.g. 'a - b' -> 'a-b'
            left = detokenize_sentence.pop()
            detokenize_sentence.append(left + ''.join(words[pos:pos + 2]))
            pos += 1
        elif words[pos] in '[(' and pos < len(words) - 1:
            # attach opening brackets to the following word
            detokenize_sentence.append(''.join(words[pos:pos + 2]))
            pos += 1
        elif words[pos] in ']).,:!?;' and pos > 0:
            # attach closing brackets and punctuation to the preceding word
            left = detokenize_sentence.pop()
            detokenize_sentence.append(left + ''.join(words[pos:pos + 1]))
        else:
            detokenize_sentence.append(words[pos])
        pos += 1
    return ' '.join(detokenize_sentence)
Asad
-3

Use the join function:

You could just do a ' '.join(words) to get back the original string.
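
For example, with the tokens from the question:

' '.join(['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.'])
# "I 've found a medicine for my disease ."

Note that this leaves a space before the punctuation and the clitic, so it is not exactly the original string.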

shaktimaan