Most NLTK documentation and examples are devoted to lemmatization and stemming, but they say very little about other normalization tasks, such as:
- converting all letters to lower or upper case
- removing punctuation
- converting numbers into words
- removing accent marks and other diacritics
- expanding abbreviations
- removing stopwords or "too common" words
- text canonicalization (tumor = tumour, it's = it is)
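
To make the list concrete, here is a minimal sketch of a few of these steps (lowercasing, stripping diacritics, removing punctuation, canonicalization, stopword removal) using only the Python standard library, not NLTK. The names `normalize`, `strip_diacritics`, `STOPWORDS`, and `CANONICAL` are my own, and the two lookup tables are tiny placeholders assumed for illustration; real ones would be much larger.

```python
import string
import unicodedata

# Hypothetical, deliberately tiny tables; assumed for illustration only.
STOPWORDS = {"the", "a", "an", "is", "of"}
CANONICAL = {"tumour": "tumor"}


def strip_diacritics(text):
    """Drop accent marks by decomposing characters and removing combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text):
    """Return a list of normalized tokens for a piece of text."""
    # Lowercase first, then remove diacritics.
    text = strip_diacritics(text.lower())
    # Delete ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and map spelling variants to one canonical form.
    tokens = [CANONICAL.get(tok, tok) for tok in text.split()]
    # Drop stopwords.
    return [tok for tok in tokens if tok not in STOPWORDS]


print(normalize("The Tumour of a naïve café!"))
# ['tumor', 'naive', 'cafe']
```

Note that this sketch deletes punctuation outright, so contractions such as "it's" would have to be expanded before that step. Number-to-word conversion and abbreviation expansion are not covered here.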
Please point me to where in NLTK I should dig. Any NLTK equivalents (in Java or any other language) for the aforementioned purposes are welcome. Thanks.
UPD: I have written a Python library for text normalization for text-to-speech purposes: https://github.com/soshial/text-normalization. It might suit you as well.