I am using scikit-learn's TfidfVectorizer with the following parameters:
smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word', ngram_range=(1,2)
I am vectorizing the following text: "red sun, pink candy. Green flower."
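Here is roughly how I set it up (the other documents in my corpus, which account for the extra terms such as 'coffee' and 'hate' in the output below, are omitted for brevity):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Only the relevant document is shown; my real corpus contains a few more.
corpus = ["red sun, pink candy. Green flower."]

vectorizer = TfidfVectorizer(
    smooth_idf=False,
    sublinear_tf=False,
    norm=None,
    analyzer='word',
    ngram_range=(1, 2),
)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
```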
Here is the output of get_feature_names():
['candy', 'candy green', 'coffee', 'flower', 'green', 'green flower', 'hate', 'icecream', 'like', 'moon', 'pink', 'pink candy', 'red', 'red sun', 'sun', 'sun pink']
Since "candy" and "green" are part of the separate sentences, why is "candy green" n-gram created?
Is there a way to prevent the creation of n-grams spanning multiple sentences?
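One workaround I considered is splitting the text into sentences first and passing each sentence as a separate document (the regex split below is naive and only for illustration), but then every sentence becomes its own row in the tf-idf matrix, which changes the document statistics and is not what I want:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

text = "red sun, pink candy. Green flower."

# Naive sentence split on '.', just for illustration.
sentences = [s.strip() for s in re.split(r'\.', text) if s.strip()]

vectorizer = TfidfVectorizer(
    smooth_idf=False,
    sublinear_tf=False,
    norm=None,
    analyzer='word',
    ngram_range=(1, 2),
)
vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names())
# No 'candy green' here, but each sentence is now treated as its own document.
```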