I'm trying to add lemmatization to CountVectorizer from scikit-learn, as follows:

import nltk
from pattern.es import lemma
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer

class LemmaTokenizer(object):
    def __call__(self, text):
        return [lemma(t) for t in word_tokenize(text)]

vectorizer = CountVectorizer(stop_words=stopwords.words('spanish'),
                             tokenizer=LemmaTokenizer())

sentence = ["EVOLUCIÓN de los sucesos y la EXPANSIÓN, ellos juegan y yo les dije lo que hago","hola, qué tal vas?"]

vectorizer.fit_transform(sentence)

This is the output:

[u',', u'?', u'car', u'decir', u'der', u'evoluci\xf3n', u'expansi\xf3n', u'hacer', u'holar', u'ir', u'jugar', u'lar', u'ler', u'sucesos', u'tal', u'yar']

UPDATED

These are the stopwords that appear in the output after being lemmatized:

u'lar', u'ler', u'der'

It lemmatizes all the words but doesn't remove the stopwords. So, any idea?

  • You have not specified the `LemmaTokenizer` in `CountVectorizer` here. And I am not getting the same output as yours on this code. – Vivek Kumar May 03 '18 at 12:39
  • Sorry, my mistake. But if you reproduce the code, it doesn't work: it just doesn't remove the stopwords. – ambigus9 May 03 '18 at 12:49
  • Again, I tried the new code and did not find any stopwords in the output, i.e. words present both in stopwords.words('spanish') and in the output. Can you pinpoint which stop word in the output is not removed? – Vivek Kumar May 03 '18 at 13:21
  • Thanks. Updated. – ambigus9 May 03 '18 at 13:56

1 Answer

That's because lemmatization is done before stopword removal, so the lemmatized stopwords are no longer found in the stopword list provided by stopwords.words('spanish').
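
You can see the mismatch directly. A minimal sketch, assuming pattern.es and the NLTK Spanish stopwords are set up as in the question (the exact lemmas printed depend on pattern's Spanish model):

from pattern.es import lemma
from nltk.corpus import stopwords

spanish_stops = set(stopwords.words('spanish'))

# Each stopword is lemmatized first; the resulting string is then
# looked up in the raw stopword list, where it may no longer appear
for word in ['la', 'les', 'de']:
    print(word, '->', lemma(word), '| in stopwords:', lemma(word) in spanish_stops)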

For the complete order of operations inside CountVectorizer, please refer to my other answer here. It's about TfidfVectorizer, but the order is the same. In that answer, step 3 is lemmatization and step 4 is stopword removal.

So now to remove the stopwords, you have two options:

1) Lemmatize the stopword list itself, and then pass it to the stop_words param in CountVectorizer.

my_stop_words = [lemma(t) for t in stopwords.words('spanish')]
vectorizer = CountVectorizer(stop_words=my_stop_words, 
                             tokenizer=LemmaTokenizer())
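
If you want to see which stopwords the lemmatizer actually changes (a quick sanity check, not required for the fix):

# Stopwords whose lemma differs from their surface form; only these
# could have slipped through the original setup
changed = {w: lemma(w) for w in stopwords.words('spanish') if lemma(w) != w}
print(changed)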

2) Include the stopword removal in the LemmaTokenizer itself.

class LemmaTokenizer(object):
    def __call__(self, text):
        # Build the stopword set once per call instead of once per token
        stop_set = set(stopwords.words('spanish'))
        return [lemma(t) for t in word_tokenize(text) if t not in stop_set]
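
A quick way to verify, reusing the sentence list from the question (note that stop_words is no longer passed, since the tokenizer now handles it):

vectorizer = CountVectorizer(tokenizer=LemmaTokenizer())
vectorizer.fit_transform(sentence)
# The lemmatized stopwords (u'lar', u'ler', u'der') should be gone now
print(vectorizer.get_feature_names())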

Try these and comment if they don't work.

  • Thanks. I tried with this: `tokenizer=lambda text: [lemma(t) for t in word_tokenize(text) if (t not in stopwords.words('spanish')) and (t not in punctuation)]` and it works. What do you think? – ambigus9 May 03 '18 at 15:09
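
The lambda in that comment also needs punctuation to be imported. A self-contained sketch of that approach, assuming punctuation refers to string.punctuation:

from string import punctuation

from nltk import word_tokenize
from nltk.corpus import stopwords
from pattern.es import lemma
from sklearn.feature_extraction.text import CountVectorizer

# Skip stopwords and punctuation tokens, lemmatize everything else
spanish_stops = set(stopwords.words('spanish'))
vectorizer = CountVectorizer(
    tokenizer=lambda text: [lemma(t) for t in word_tokenize(text)
                            if t not in spanish_stops and t not in punctuation])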