I have a list of Stack Overflow tags: [javascript, node.js, c++, amazon-s3, ...].

I want to tokenize a Stack Overflow question: "what do I prefer ? javascript, node.js, c++ or amazon-S3 ? This is dummy."

I want NLTK to tokenize 'node.js' as a single token, "node.js", not as 'node' and 'js'.

How can I tell NLTK not to split a word if it is in my tag list?

I have read the possible duplicate, and the question seems to be the same, but the answer based on the Multi-Word Expression Tokenizer doesn't meet my need.

In fact, if I use that solution, I think I'll have to rebuild every tag manually from its pieces, for example:

import nltk
tokenizer = nltk.tokenize.MWETokenizer()
# add_mwe() takes one tuple holding the pieces of the expression
tokenizer.add_mwe(('Python', '-', '3', '.', 'x'))

What I need is for all existing tags to be treated as "untokenizable".

Brigitte Maillère
  • Possible duplicate of [How to define special "untokenizable" words for nltk.word\_tokenize](https://stackoverflow.com/questions/45618528/how-to-define-special-untokenizable-words-for-nltk-word-tokenize) – Isaac B Jan 31 '19 at 21:16
  • Thanks @Isaac, but I think it doesn't answer my need. I edited my question to add a detailed explanation. – Brigitte Maillère Feb 01 '19 at 11:38

1 Answer

I don't know the full range of tags that you're looking to retain as whole tokens, but it seems that NLTK's basic word_tokenize() function will preserve those particular items as tokens, without any tag list defined.

import nltk
sentence = "what do I prefer ? javascript, node.js, c++ or amazon-S3 ? This is dummy."
tokens = nltk.word_tokenize(sentence)
print(tokens)

Output:

['what', 'do', 'I', 'prefer', '?', 'javascript', ',', 'node.js', ',', 'c++', 'or', 'amazon-S3', '?', 'This', 'is', 'dummy', '.']
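
If some tags in the full list do turn out to be split (tags containing a character such as `#` may be affected), one possible fallback is to protect the tag list with a regular expression. The sketch below is an assumption on my part, not an NLTK feature: it uses only Python's `re` module, and `build_tag_tokenizer` is a hypothetical helper name. It tries the known tags first, longest first, and otherwise falls back to runs of word characters or single punctuation marks. NLTK's `RegexpTokenizer` accepts a pattern of this kind, though that is not tested here.

```python
import re


def build_tag_tokenizer(tags):
    """Return a function that tokenizes text while keeping known tags whole.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    # Fallback: a run of word characters, or one punctuation character.
    fallback = r"\w+|[^\w\s]"
    if not tags:
        return re.compile(fallback).findall
    # Try longer tags first so a long tag is never shadowed by a shorter one.
    ordered = sorted(tags, key=len, reverse=True)
    tag_alt = "|".join(re.escape(tag) for tag in ordered)
    # A tag only counts when it is not glued to other word characters.
    pattern = re.compile(
        rf"(?<!\w)(?:{tag_alt})(?!\w)|{fallback}",
        re.IGNORECASE,  # so the tag "amazon-s3" also matches "amazon-S3"
    )
    return pattern.findall


tags = ["javascript", "node.js", "c++", "c#", "amazon-s3"]
tokenize = build_tag_tokenizer(tags)
sentence = "what do I prefer ? javascript, node.js, c++ or amazon-S3 ? This is dummy."
tokens = tokenize(sentence)
print(tokens)
```

On the sentence from the question this yields the same token list as above. It does not reproduce the rest of word_tokenize()'s behaviour (contractions, quotes and so on), so it is better suited to text where keeping the tags intact matters more than Treebank-style splitting.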
Isaac B
  • Yes, thanks @Isaac B. It seems I didn't even try it, as I was "sure" that it wasn't possible :( . Lesson learned. But I wonder how NLTK handles this correctly, and where it is documented (I searched but found nothing). – Brigitte Maillère Feb 04 '19 at 19:06