
Most NLTK documentation and examples are devoted to lemmatization and stemming, but are very sparse on normalization tasks such as:

  • converting all letters to lower or upper case
  • removing punctuation
  • converting numbers into words
  • removing accent marks and other diacritics
  • expanding abbreviations
  • removing stopwords or "too common" words
  • text canonicalization (tumor = tumour, it's = it is)

Please point me to where in NLTK to dig. Any NLTK equivalents (Java or any other language) for the aforementioned purposes are welcome. Thanks.

UPD. I have written a Python library for text normalization for text-to-speech purposes: https://github.com/soshial/text-normalization. It might suit you as well.

soshial

3 Answers


Besides NLTK, a lot of these (sub-)tasks can be solved using pure Python methods.

a) converting all letters to lower or upper case

text = 'aiUOd'
print(text.lower())
>> aiuod
print(text.upper())
>> AIUOD

b) removing punctuation

text = 'She? Hm, why not!'
puncts = '.,?!'
for sym in puncts:
    text = text.replace(sym, ' ')
print(text)
>> She  Hm  why not 
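The loop above only strips the characters you list by hand. A more general sketch (assuming plain ASCII punctuation) uses `str.translate` together with the standard library's `string.punctuation`:

```python
import string

text = 'She? Hm, why not!'
# Build a deletion table: every ASCII punctuation character maps to None.
cleaned = text.translate(str.maketrans('', '', string.punctuation))
print(cleaned)  # She Hm why not
```

Note that this deletes the characters instead of replacing them with a space, so e.g. "don't" becomes "dont"; pick whichever behavior suits your pipeline.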

c) converting numbers into words

Here it would not be that easy to write a few-liner, but there are a lot of existing solutions if you google it: code snippets, libraries, etc.
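As one direction to go, here is a minimal, purely illustrative sketch for small integers (0–999); for real use, an existing package such as num2words covers large numbers, ordinals, and many languages:

```python
ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
        'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
        'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen']
TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty',
        'sixty', 'seventy', 'eighty', 'ninety']

def number_to_words(n):
    """Spell out an integer in the range 0-999 (illustrative sketch only)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tail = '-' + ONES[n % 10] if n % 10 else ''
        return TENS[n // 10] + tail
    tail = ' ' + number_to_words(n % 100) if n % 100 else ''
    return ONES[n // 100] + ' hundred' + tail

print(number_to_words(42))   # forty-two
print(number_to_words(305))  # three hundred five
```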

d) removing accent marks and other diacritics

Look up point b): create a list with the diacritical characters instead of puncts. A more robust alternative is Unicode normalization via the standard unicodedata module.
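A sketch of the unicodedata approach: decompose each character (NFD), then drop the combining marks. Note this only strips accents; it does not transliterate letters like `ß` or `ø`:

```python
import unicodedata

def strip_accents(text):
    # NFD splits 'e-acute' into 'e' plus a combining accent character;
    # unicodedata.combining() is nonzero only for the accent part.
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('café naïve résumé'))  # cafe naive resume
```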

e) expanding abbreviations

Create a dictionary with abbreviations:

text = 'USA and GB are ...'
abbrevs = {'USA': 'United States', 'GB': 'Great Britain'}
for abbrev, expansion in abbrevs.items():
    text = text.replace(abbrev, expansion)
print(text)
>> United States and Great Britain are ...
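One caveat: plain str.replace also fires inside longer tokens ('GB' inside 'RGB'). A sketch with word boundaries via the re module avoids that:

```python
import re

text = 'USA and GB are ...'
abbrevs = {'USA': 'United States', 'GB': 'Great Britain'}

# \b anchors ensure only whole tokens are replaced.
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, abbrevs)) + r')\b')
expanded = pattern.sub(lambda m: abbrevs[m.group(1)], text)
print(expanded)  # United States and Great Britain are ...
print(pattern.sub(lambda m: abbrevs[m.group(1)], 'RGB screen'))  # RGB screen
```

This still cannot resolve genuine ambiguity ('US' as a pronoun vs. the country); that needs context-aware disambiguation.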

f) removing stopwords or "too common" words

Create a list with stopwords:

text = 'Mary had a little lamb'
temp_corpus = text.split(' ')
stops = ['a', 'the', 'had']
corpus = [token for token in temp_corpus if token not in stops]
print(corpus)
>> ['Mary', 'little', 'lamb']

g) text canonicalization (tumor = tumour, it's = it is)

For tumor → tumour, use regular expressions; for contractions like it's → it is, a small replacement dictionary works.
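A minimal sketch of both, assuming the tumor/tumour pair and a naive contraction rule (naive because "it's" can also mean "it has"):

```python
import re

text = "it's likely a tumour, but the second tumor is benign"
# Collapse the British spelling onto one canonical form.
text = re.sub(r'\btumou?r\b', 'tumor', text)
# Naive contraction expansion via a small replacement dict.
contractions = {"it's": 'it is', "don't": 'do not'}
for short, full in contractions.items():
    text = text.replace(short, full)
print(text)  # it is likely a tumor, but the second tumor is benign
```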

Last but not least, please note that all of the examples above usually need calibration on real texts; I wrote them as a direction to go.

Max Li
  • As I see it, an NLP toolkit should be able to do all processing operations that might involve linguistic data. It means that I thought, and still think, that NLTK already has dictionaries of equivalent words, an abbreviation dictionary, a canonicalization dict, conversion into text of numbers, **dates**, temperatures, **currencies**, and so on... Maybe we just do not know it well? – soshial Feb 13 '12 at 16:41
  • I'm sure you can't solve casemapping in the general case with just `.lower()` and `.upper()`. Consider Turkish `I`=`ı`, `İ`=`i`; German `ß`=`SS`; Greek `Σ`=both `ς` and `σ`. – hippietrail May 12 '13 at 08:38
  • Resolving abbreviations is risky. How do you know 'US' stands for 'United States'? 'You and me: US!' --> 'You and me: United States!' – imrek Oct 09 '15 at 17:37
  • @hippietrail, https://docs.python.org/3/library/stdtypes.html#str.casefold – brunsgaard Dec 30 '16 at 10:39
  • Of course this task can be done without NLTK, but you would have to create your own data (e.g. lists of stopwords, abbreviations). IMHO the OP means to ask for ready to use methods, e.g. http://www.nltk.org/api/nltk.stem.html. – Martin May 17 '17 at 12:13

I might be a little late, but this may be helpful. Here are stop words for several languages (English, French, German, Finnish, Hungarian, Turkish, Russian, Czech, Greek, Arabic, Chinese, Japanese, Korean, Catalan, Polish, Hebrew, Norwegian, Swedish, Italian, Portuguese and Spanish): https://pypi.python.org/pypi/many-stop-words


I suggest using NLTK's stopwords.words() for stopword removal. It supports the following languages: Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Russian, Spanish and Swedish.

wishiknew