How to prevent NLTK from splitting specific words?

Question

I have a list of Stack Overflow tags: [javascript, node.js, c++, amazon-s3, ...].

I want to tokenize a Stack Overflow question: "what do I prefer ? javascript, node.js, c++ or amazon-S3 ? This is dummy."

I want NLTK to keep 'node.js' as a single token, "node.js", rather than splitting it into 'node' and 'js'.

How can I tell NLTK not to split a word if it is in my tag list?

I have read this possible duplicate, and the question seems to be the same, but the answer, which is based on the Multi-Word Expression Tokenizer, doesn't meet my need.

In fact, if I use this solution, I think I'll have to rebuild every tag manually from its pieces, for example:

import nltk
tokenizer = nltk.tokenize.MWETokenizer()
# add_mwe() takes one tuple holding the pieces of the expression
tokenizer.add_mwe(('Python', '-', '3', '.', 'x'))

What I need is for all existing tags to be treated as "untokenizable".

Possible duplicate of [How to define special "untokenizable" words for nltk.word_tokenize](https://stackoverflow.com/questions/45618528/how-to-define-special-untokenizable-words-for-nltk-word-tokenize) – Isaac B, Jan 31 '19 at 21:16
Thanks @Isaac, but I don't think it answers my need. I edited my question to add a detailed explanation. – Brigitte Maillère, Feb 01 '19 at 11:38

score 1 · Accepted Answer · answered Feb 01 '19 at 15:55

I don't know the full range of tags that you're looking to retain as whole tokens, but it seems that NLTK's basic word_tokenize() function will preserve those particular items as tokens, without any tag list defined.

import nltk
sentence = "what do I prefer ? javascript, node.js, c++ or amazon-S3 ? This is dummy."
tokens = nltk.word_tokenize(sentence)
print(tokens)

Output:

['what', 'do', 'I', 'prefer', '?', 'javascript', ',', 'node.js', ',', 'c++', 'or', 'amazon-S3', '?', 'This', 'is', 'dummy', '.']
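
If some tag does end up being split, one possible fallback is to tokenize with a regular expression built from the tag list, so that every known tag is tried before the generic word and punctuation rules. The following is only a minimal sketch using Python's standard `re` module, not something NLTK provides: `build_tag_tokenizer` is a hypothetical helper name, the `c#` tag is an assumed extra entry added for illustration, and the fallback rule (a run of word characters, or one punctuation mark) is much cruder than what `word_tokenize()` does.

```python
import re

def build_tag_tokenizer(tags):
    """Return a function that tokenizes text while keeping known tags whole.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    # Fallback rule: a run of word characters, or one punctuation mark.
    fallback = r"\w+|[^\w\s]"
    if not tags:
        return re.compile(fallback).findall
    # Try longer tags first so a long tag wins over a shorter one
    # that happens to be its prefix.
    escaped = [re.escape(t) for t in sorted(tags, key=len, reverse=True)]
    # A tag only counts when it is not glued to other word characters.
    tag_part = r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)"
    # IGNORECASE lets the tag "amazon-s3" match "amazon-S3" in the text.
    return re.compile(tag_part + "|" + fallback, re.IGNORECASE).findall

# "c#" is an assumed extra tag, used only to show a tag containing punctuation.
tags = ["javascript", "node.js", "c++", "amazon-s3", "c#"]
tokenize = build_tag_tokenizer(tags)

sentence = "what do I prefer ? javascript, node.js, c++ or amazon-S3 ? This is dummy."
tokens = tokenize(sentence)
print(tokens)
print(tokenize("I like c# too."))
```

The pattern matches tags case-insensitively but returns the text as it was written, so "amazon-S3" keeps its capital S. The same pattern string could probably also be passed to `nltk.tokenize.RegexpTokenizer`, although that has not been tried here.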

answered Feb 01 '19 at 15:55

Isaac B


Yes, thanks @Isaac B. It seems that I never even tried it, as I was "sure" that it wasn't possible :( . Lesson learned. But I wonder how NLTK handles this correctly, and where it is documented (I have searched but have not found it) – Brigitte Maillère Feb 04 '19 at 19:06
