I am using scikit-learn's TfidfVectorizer with the following parameters:
smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word', ngram_range=(1,2)
I am vectorizing the following text: "red sun, pink candy. Green flower."
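Here is roughly how I set it up (the other documents in my corpus, which account for the extra terms such as 'coffee' and 'hate' in the output below, are omitted for brevity):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Only the relevant document is shown; my real corpus contains a few more.
corpus = ["red sun, pink candy. Green flower."]

vectorizer = TfidfVectorizer(
    smooth_idf=False,
    sublinear_tf=False,
    norm=None,
    analyzer='word',
    ngram_range=(1, 2),
)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
```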
Here is the output of get_feature_names():
['candy', 'candy green', 'coffee', 'flower', 'green', 'green flower', 'hate', 'icecream', 'like', 'moon', 'pink', 'pink candy', 'red', 'red sun', 'sun', 'sun pink']
Since "candy" and "green" are part of the separate sentences, why is "candy green" n-gram created?
Is there a way to prevent the creation of n-grams spanning multiple sentences?
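One workaround I considered is splitting the text into sentences first and passing each sentence as a separate document (the regex split below is naive and only for illustration), but then every sentence becomes its own row in the tf-idf matrix, which changes the document statistics and is not what I want:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

text = "red sun, pink candy. Green flower."

# Naive sentence split on '.', just for illustration.
sentences = [s.strip() for s in re.split(r'\.', text) if s.strip()]

vectorizer = TfidfVectorizer(
    smooth_idf=False,
    sublinear_tf=False,
    norm=None,
    analyzer='word',
    ngram_range=(1, 2),
)
vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names())
# No 'candy green' here, but each sentence is now treated as its own document.
```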