
I have a custom tokenizer function with some keyword arguments:

def tokenizer(text, stem=True, lemmatize=False, char_lower_limit=2, char_upper_limit=30):
    # do things...
    return tokens

Now, how can I pass this tokenizer with all its arguments to CountVectorizer? Nothing I tried works; this did not work either:

from sklearn.feature_extraction.text import CountVectorizer
args = {"stem": False, "lemmatize": True}
count_vect = CountVectorizer(tokenizer=tokenizer(**args), stop_words='english', strip_accents='ascii', min_df=0, max_df=1., vocabulary=None)

Any help is much appreciated. Thanks in advance.

JRun

1 Answer


The tokenizer should be a callable or None.

(Note that tokenizer=tokenizer(**args) calls the function immediately, which raises a TypeError because the required text argument is missing, instead of passing a callable.)

You can try this:

count_vect = CountVectorizer(tokenizer=lambda text: tokenizer(text, **args), stop_words='english', strip_accents='ascii', min_df=0, max_df=1., vocabulary=None)
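An equivalent approach is functools.partial, which binds the keyword arguments and returns a callable taking just the text. Unlike a lambda, a partial of a top-level function can be pickled, which matters if you later serialize the vectorizer. A minimal sketch with a stand-in tokenizer (the length-filtering body here is illustrative, not the asker's real implementation):

    from functools import partial

    # Stand-in for the asker's tokenizer; the real one stems/lemmatizes.
    def tokenizer(text, stem=True, lemmatize=False,
                  char_lower_limit=2, char_upper_limit=30):
        # Keep whitespace-separated tokens within the length limits.
        return [t.lower() for t in text.split()
                if char_lower_limit <= len(t) <= char_upper_limit]

    args = {"stem": False, "lemmatize": True}

    # partial pre-binds the keyword arguments; the result takes only `text`,
    # which is exactly what CountVectorizer expects:
    #   CountVectorizer(tokenizer=partial(tokenizer, **args), ...)
    tokenize = partial(tokenizer, **args)

    print(tokenize("a quick brown fox"))  # 'a' is dropped (shorter than 2 chars)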
yangjie