How to fix token pattern in scikit-learn?

Question

I am using TfidfVectorizer from scikit-learn to extract features, with the following settings:

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = []
    for token in tokens:
        token = re.sub("[^a-zA-Z]","", token)
        stems.append(EnglishStemmer().stem(token))
    return stems

vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=True, stop_words='english')

After feeding the training set to the vectorizer, I call

vectorizer.get_feature_names()

The output contains empty strings and duplicate words with leading or trailing spaces, e.g.:

u'', u' ', u' low', u' lower', u'lower', u'lower ', u'lower high', u'lower low'

The expected output is:

u'low', u'lower', u'lower high', u'lower low'

How can I solve that? Thank you.

The input is a bunch of tweets from stocktwits.com which contains a lot of slang — James, Feb 12 '15 at 03:26
The `stems` list in your `tokenize` function is a local variable and is born and dies with each call of the function. Why are you bothering to build that list at all? It can't possibly serve any purpose. — Alex Martelli, Feb 12 '15 at 03:36

score 0 · Answer 1 · answered Feb 12 '15 at 03:30

You could strip the whitespace and drop the empty strings and duplicates like this:

>>> l = ['lower low', 'lower high','lower ', ' lower', u'lower', ' ', '', 'low']
>>> list(set(i.strip() for i in l if i!=' ' and i))
['lower', 'lower low', 'lower high', 'low']
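
The blank entries probably come from the tokenizer itself: re.sub turns a token such as "$" or "!" into an empty string, and when word n-grams are built the tokens are joined with a single space, so an empty token next to "low" gives " low". Below is a minimal sketch of filtering those tokens out before they reach the vectorizer, assuming that is the cause. `clean_tokens` and the sample token list are made up for illustration, and the stemming step from the question is left out so the snippet runs without nltk.

```python
import re

def clean_tokens(tokens):
    """Strip non-letters from each token and skip tokens that end up empty."""
    cleaned = []
    for token in tokens:
        token = re.sub("[^a-zA-Z]", "", token)
        if token:  # "!" or "123" becomes "" after the substitution; drop it
            cleaned.append(token.lower())
    return cleaned

# Hypothetical tokens, similar to what a tweet might produce
tokens = ["Lower", "high", "!", "$", "low", "123"]
print(clean_tokens(tokens))

# Why unfiltered empty tokens show up as padded features:
# joining an empty token with a real one leaves a stray space.
print(repr(" ".join(["", "low"])))
```

In the question's tokenize function, the same `if token:` check would go just before the stems.append(...) call.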

Avinash Raj


So, what is the regular expression for a word without punctuation or a leading space? – James Feb 12 '15 at 04:30
What do you want to do with the above? What's your expected output? – Avinash Raj Feb 12 '15 at 04:36
I read the sklearn documentation; maybe I can set token_pattern as a regex. See http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html – James Feb 12 '15 at 04:51
I just answered this: 'how to remove duplicates and empty strings in' `u'', u' ', u' low', u' lower', u'lower', u'lower ', u'lower high', u'lower low'` – Avinash Raj Feb 12 '15 at 04:51
