I have a list of Stack Overflow tags: [javascript, node.js, c++, amazon-s3,....].
I want to tokenize a Stack Overflow question such as: "what do I prefer ? javascript, node.js, c++ or amazon-S3 ? This is dummy."
I want nltk to keep 'node.js' as a single token, "node.js", instead of splitting it into 'node' and 'js'.
How can I tell nltk not to split a word if it is in my tag list?
I have read this possible duplicate, and the question seems to be the same, but the answer based on the Multi-Word Expression Tokenizer (MWETokenizer) doesn't meet my need.
With that solution, I think I would have to rebuild every tag by hand, token by token, for example:
import nltk
tokenizer = nltk.tokenize.MWETokenizer()
# add_mwe takes a single tuple of tokens, not separate arguments
tokenizer.add_mwe(('Python', '-', '3', '.', 'x'))
What I need is for every existing tag to be treated as "untokenizable".
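
To make the expected behaviour concrete, here is a minimal sketch of what I am after. It uses only Python's standard `re` module rather than nltk, so it is an illustration of the desired output and not an nltk solution. The names `TAGS`, `build_tag_pattern` and `tokenize` are hypothetical, and the short tag list stands in for the full one. It assumes that tags should match case-insensitively and only when they are not glued to other word characters.

```python
import re

# Hypothetical sample; the real list would hold every Stack Overflow tag.
TAGS = ["javascript", "node.js", "c++", "amazon-s3"]


def build_tag_pattern(tags):
    # Try longer tags first so a long tag wins over a shorter tag that is its prefix.
    escaped = [re.escape(tag) for tag in sorted(tags, key=len, reverse=True)]
    tag_alternatives = "|".join(escaped)
    # 1) a known tag that is not attached to other word characters,
    # 2) otherwise a run of word characters,
    # 3) otherwise a single punctuation character.
    return re.compile(
        r"(?<!\w)(?:" + tag_alternatives + r")(?!\w)|\w+|[^\w\s]",
        re.IGNORECASE,
    )


def tokenize(text, tags=TAGS):
    # findall returns the matched substrings in order, with their original casing.
    return build_tag_pattern(tags).findall(text)


if __name__ == "__main__":
    question = "what do I prefer ? javascript, node.js, c++ or amazon-S3 ? This is dummy."
    print(tokenize(question))
```

With this sketch, 'node.js', 'c++' and 'amazon-S3' each come out as one token, while a word that merely resembles a tag (for instance 'nodes.js') is still split normally. Ideally I would like the same result from an nltk tokenizer without listing each tag's sub-tokens by hand.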