I am using TfidfVectorizer from scikit-learn to extract features, with the following settings:
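import re
import nltk
from nltk.stem.snowball import EnglishStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = []
    for token in tokens:
        # Strip every non-letter character, then stem whatever is left
        token = re.sub("[^a-zA-Z]", "", token)
        stems.append(EnglishStemmer().stem(token))
    return stems

vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=True, stop_words='english')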
After feeding the training set to the vectorizer, I call
vectorizer.get_feature_names()
The output contains duplicate words that differ only by a leading or trailing space, plus empty and whitespace-only entries, e.g.:
u'', u' ', u' low', u' lower', u'lower', u'lower ', u'lower high', u'lower low'
The expected output is:
u'low', u'lower', u'lower high', u'lower low'
How can I solve that? Thank you.
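One possible cause, stated as an assumption rather than a confirmed diagnosis: the re.sub call turns punctuation-only or digit-only tokens into empty strings. If the vectorizer is also building bigrams (which features such as u'lower high' suggest), those empty strings get joined to their neighbours with a space, which would explain u' low' and u'lower '. A minimal sketch of filtering them out is below; clean_tokens is a hypothetical helper, and the hard-coded list stands in for the output of nltk.word_tokenize.

```python
import re

def clean_tokens(tokens):
    """Strip non-letter characters and drop tokens that end up empty."""
    cleaned = []
    for token in tokens:
        token = re.sub("[^a-zA-Z]", "", token)
        if token:  # skip tokens such as "," or "42" that became ""
            cleaned.append(token)
    return cleaned

# Stand-in for nltk.word_tokenize output (assumed sample, not real data)
raw = ["lower", ",", "low", "42", "high"]
print(clean_tokens(raw))  # ['lower', 'low', 'high']
```

Inside tokenize, the same `if token:` check could go right before the stems.append(...) line, so that only non-empty tokens are stemmed and returned.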