
I am using TfidfVectorizer from scikit-learn to extract features, with the following settings:

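import re
import nltk
from nltk.stem.snowball import EnglishStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = []
    for token in tokens:
        # strip every non-letter character, then stem what is left
        token = re.sub("[^a-zA-Z]","", token)
        stems.append(EnglishStemmer().stem(token))
    return stems

vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=True, stop_words='english')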

After feeding the training set to the vectorizer, I call

vectorizer.get_feature_names()

The output contains empty entries and duplicate words that differ only by a leading or trailing space, e.g.:

u'', u' ', u' low', u' lower', u'lower', u'lower ', u'lower high', u'lower low'

The expected output is:

u'low', u'lower', u'lower high', u'lower low'

How can I solve that? Thank you.

James
  • The input is a bunch of tweets from stocktwits.com which contains a lot of slang – James Feb 12 '15 at 03:26
  • The `stems` list in your `tokenize` function is a local variable and is born and dies with each call of the function. Why are you bothering to build that list at all? It can't possibly serve any purpose. – Alex Martelli Feb 12 '15 at 03:36
  • Sorry, I missed the return statement. – James Feb 12 '15 at 03:39

1 Answer


You could strip each entry, drop the empty ones, and de-duplicate with a set, like this:

>>> l = ['lower low', 'lower high','lower ', ' lower', u'lower', ' ', '', 'low']
>>> list(set(i.strip() for i in l if i!=' ' and i))
['lower', 'lower low', 'lower high', 'low']
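
The one-liner above cleans the list after the fact. The entries could also be avoided at the source: in the question's tokenizer, re.sub turns punctuation-only tokens into empty strings, and an empty string joined to a neighbouring word with a space would give entries like ' low' or 'lower '. Below is a minimal sketch of that effect using only the standard library. It assumes the vectorizer was also given an ngram_range that includes bigrams (the posted settings do not show one), and the helper names clean_tokens and bigrams are made up for illustration; stemming is left out.

```python
import re

def clean_tokens(tokens):
    """Strip non-letters from each token and drop tokens that end up empty."""
    cleaned = (re.sub("[^a-zA-Z]", "", token) for token in tokens)
    return [token for token in cleaned if token]

def bigrams(tokens):
    """Join each pair of adjacent tokens with a single space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Hypothetical tokenizer output for a text such as "lower, low!"
raw = ["lower", ",", "low", "!"]

# Same substitution as in the question, without filtering:
unfiltered = [re.sub("[^a-zA-Z]", "", t) for t in raw]
print(unfiltered)                  # ['lower', '', 'low', '']
print(bigrams(unfiltered))         # ['lower ', ' low', 'low ']

# With empty tokens dropped before the n-grams are built:
print(bigrams(clean_tokens(raw)))  # ['lower low']
```

In the question's tokenize function the same idea would amount to appending the stem only when the token is non-empty after the substitution.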
Avinash Raj
  • So, what is the regular expression for words without punctuation or a leading space? – James Feb 12 '15 at 04:30
  • What do you want to do with the above? What's your expected output? – Avinash Raj Feb 12 '15 at 04:36
  • I read the sklearn documentation; maybe I can set token_pattern as a regex. See http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html – James Feb 12 '15 at 04:51
  • I just answered this: how to remove duplicates and empty strings in `u'', u' ', u' low', u' lower', u'lower', u'lower ', u'lower high', u'lower low'` – Avinash Raj Feb 12 '15 at 04:51
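
On the token_pattern idea from the comments: as far as I know, TfidfVectorizer only applies token_pattern when no custom tokenizer is passed, so it would not change anything while tokenizer=tokenize is set. A letters-only regex can still do the tokenizing itself. The sketch below is a standard-library illustration of such a pattern; the pattern and the sample text are assumptions rather than something taken from the thread, and stemming is again left out.

```python
import re

# Letters-only pattern: each match is a run of ASCII letters, so
# punctuation, digits and whitespace never become tokens.
LETTERS_ONLY = re.compile(r"[a-zA-Z]+")

def letters_only_tokens(text):
    """Lowercase the text and return its letter-only tokens."""
    return LETTERS_ONLY.findall(text.lower())

sample = "Lower, low!! $AAPL 123"
print(letters_only_tokens(sample))  # ['lower', 'low', 'aapl']
```

Because findall only returns non-empty matches, no empty or space-padded tokens reach the later steps.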