
I have a pandas dataframe with the following structure:

    pd.DataFrame({"user_id": ['user_id1', 'user_id1', 'user_id1', 'user_id2', 'user_id2'],
      'meeting': ['text1', 'text2', 'text3', 'text4', 'text5'], 'label': ['a,b', 'a', 'a,c', 'x', 'x,y' ]})

There are a total of 12 user_ids. I have a pipeline as follows:

    knn_tfidf = Pipeline([('tf_idf', TfidfVectorizer(stop_words='english')),
                          ('model', OneVsRestClassifier(KNeighborsClassifier()))])

a parameter grid as follows:

    param_grid_1 = {'tf_idf__max_df': (0.25, 0.5, 0.75),
                    'tf_idf__ngram_range': [(1, 1), (1, 2), (2, 2), (1, 3)],
                    'model__estimator__n_neighbors': np.arange(1, 30)}

And finally GridSearchCV:

    Grid_Search_tune = GridSearchCV(knn_tfidf, param_grid_1, cv=2)

I need to create a model for each user with the corresponding X and y values. For one user, I can do the following:

    t = df[df.user_id == 'user_id1']

Extract X and y from t, pass y to a MultiLabelBinarizer(), and then, after instantiating the pipeline, param_grid and GridSearchCV, I can do:

    Grid_Search_tune.fit(X, y)
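
Put together, the single-user version is roughly the following. This is only a sketch: the column names 'meeting' and 'label' are taken from the sample frame above, and knn_tfidf / param_grid_1 are the pipeline and grid already defined.

    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import MultiLabelBinarizer

    # Single-user flow; 'meeting' and 'label' follow the sample frame above.
    t = df[df.user_id == 'user_id1']

    X = t['meeting'].values                      # raw meeting texts
    labels = [s.split(',') for s in t['label']]  # 'a,b' -> ['a', 'b']

    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(labels)                # binary indicator matrix

    # knn_tfidf and param_grid_1 are the pipeline and grid defined above.
    Grid_Search_tune = GridSearchCV(knn_tfidf, param_grid_1, cv=2)
    Grid_Search_tune.fit(X, y)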

Doing this 12 times, once for each user, is repetitive, so I looped over the grouped pandas DataFrame. Here is what I have done:

    g = df.groupby('user_id')

    for names, groups in g:
        X = groups.meeting.as_matrix()
        labels = [x.split(',') for x in groups.label.tolist()]
        mlb = MultiLabelBinarizer()
        y = mlb.fit_transform(labels)

        knn_tfidf = Pipeline([('tf_idf', TfidfVectorizer(stop_words='english')),
                              ('model', OneVsRestClassifier(KNeighborsClassifier()))])

        param_grid_1 = {'tf_idf__max_df': (0.25, 0.5, 0.75),
                        'tf_idf__ngram_range': [(1, 2), (2, 2), (1, 3)],
                        'model__estimator__n_neighbors': np.arange(1, 4)}

        Grid_Search_tune = GridSearchCV(knn_tfidf, param_grid_1, cv=2)

        all_estimators = Grid_Search_tune.fit(X, y)

        best_of_all_estimators = Grid_Search_tune.best_estimator_

        print(names)
        print(best_of_all_estimators)

This gives me an output like:

    user_id1
    Pipeline(memory=None,
         steps=[('tf_idf', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
            dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
            lowercase=True, max_df=0.25, max_features=None, min_df=1,
            ngram_range=(2, 2), norm=u'l2', preprocessor=None, smooth_idf=T...tric_params=None, n_jobs=1, n_neighbors=1, p=2,
               weights='uniform'),
              n_jobs=1))])

    user_id2
    Pipeline(memory=None,
         steps=[('tf_idf', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
            dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
            lowercase=True, max_df=0.25, max_features=None, min_df=1,
            ngram_range=(1, 2), norm=u'l2', preprocessor=None, smooth_idf=T...tric_params=None, n_jobs=1, n_neighbors=1, p=2,
               weights='uniform'),
              n_jobs=1))])

And so on, up to user_id12 and its corresponding pipeline. I don't know if this is the correct way of doing it, and from here on I am lost. If I do:

    best_of_all_estimators.predict(['some_text_string'])

I get predictions from all 12 models. How do I key or index my models with the for-loop variable 'names', so that when I do:

    str(raw_input('Choose user_id from above list:'))

Say I choose user_id3, and then

    str(raw_input('Enter text string:'))

I enter 'some random string'. I want the model trained on the X and y belonging to user_id3 to be pulled up and the prediction made with that model only, not with all the models. A very similar question is linked here: training an ML model on selected parts of a data frame. I am a beginner and I'm really struggling! Please help! Thanks a ton in advance.

1 Answer

Apparently a Pipeline doesn't support steps that change the number of samples, such as a groupby or other aggregation.

Here is a similar question and possible workaround.

sklearn: Have an estimator that filters samples
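
So the groupby has to stay outside the pipeline. Since you already fit one GridSearchCV per user inside the loop, the simplest way to key the models is a plain dict indexed by the groupby name; at prediction time you look up only that user's estimator. A rough sketch (reusing knn_tfidf and param_grid_1 from your question; untested against your real data):

    models = {}

    for names, groups in df.groupby('user_id'):
        X = groups.meeting.values
        mlb = MultiLabelBinarizer()
        y = mlb.fit_transform([s.split(',') for s in groups.label])

        search = GridSearchCV(knn_tfidf, param_grid_1, cv=2)  # pipeline/grid from the question
        search.fit(X, y)
        models[names] = (search.best_estimator_, mlb)         # key the fitted model by user_id

    # Later: pull up only the model belonging to the chosen user.
    user = str(raw_input('Choose user_id from above list:'))
    text = str(raw_input('Enter text string:'))
    est, mlb = models[user]
    print(mlb.inverse_transform(est.predict([text])))

models[user] holds only the estimator fitted on that user's rows, so the prediction never touches the other eleven models.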

Bert Kellerman
  • I have edited the question, added some code, and linked to a very similar question. Except here I am not using Spark, and looping for me is fine. Please help! Thank you. – Shiva Kumar Dec 12 '17 at 04:48