
I am currently trying to implement an NER model using the sklearn_crfsuite library.

The training code is as follows:

import sklearn_crfsuite

for repeat in range(10):
    crf = sklearn_crfsuite.CRF(
        algorithm='lbfgs',
        c1=0.1,
        c2=0.1,
        max_iterations=100,
        all_possible_transitions=True,
        verbose=True,
    )
    crf.fit(X_train, y_train)
    pred_list = crf.predict(X_test)

The code trains the model ten times; my goal is to observe 10 different scores and average them as a final score. However, each repeat gives the same score, even though I reinitialize the model in each loop.

The question is: how do I properly set a random seed so that each repeat gives different results?

NOTE: After shuffling the training data in each loop, I still got the same results. Finally, I changed the training algorithm from 'lbfgs' (gradient descent using the L-BFGS method) to 'l2sgd' (Stochastic Gradient Descent with L2 regularization), and then I started to obtain different results.
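This behaviour makes sense: 'lbfgs' is a deterministic batch optimizer, so the order of the training instances does not affect the objective it minimizes, whereas 'l2sgd' processes instances sequentially and is therefore sensitive to their order. Below is a minimal sketch of the setup described in the note, reshuffling with a different seed per repeat and training with 'l2sgd'; the `X_train`, `y_train`, `X_test`, and `y_test` variables are assumed from the surrounding code, and the averaging step is an illustration rather than the asker's exact code:

import numpy as np
import sklearn_crfsuite
from sklearn.utils import shuffle
from sklearn_crfsuite import metrics

scores = []
for repeat in range(10):
    # reshuffle the training data with a different seed on every repeat
    X_shuf, y_shuf = shuffle(X_train, y_train, random_state=repeat)
    crf = sklearn_crfsuite.CRF(
        algorithm='l2sgd',   # order-sensitive, unlike the batch 'lbfgs'
        c2=0.1,              # 'l2sgd' uses L2 regularization only (no c1)
        max_iterations=100,
        all_possible_transitions=True,
    )
    crf.fit(X_shuf, y_shuf)
    pred_list = crf.predict(X_test)
    # y_test is assumed to hold the gold labels for X_test
    scores.append(metrics.flat_f1_score(y_test, pred_list, average='weighted'))

print(np.mean(scores))  # final score: the average over the 10 repeats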

  • Well, as I understand it, you are re-creating the model with the same parameters in each iteration of the loop and fitting it with the same training data, so you will probably get the same results every time. My question is: why do you want to average the scores over 10 runs? And is it an option for you to use `cross_validation` with `cv=10` instead? – AlirezaAsadi Jan 05 '22 at 09:30
  • I could use `cross_validation`, but I preferred not to. However, after your suggestion I figured out that I should shuffle the training data with a random seed before each initialization of the CRF. – Oguzhan Jan 05 '22 at 10:58
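For reference, the `cross_validation` route suggested in the first comment could look roughly like this. This is a sketch, not the asker's code: `labels` is assumed to be the list of entity tags to score, and `X_train`/`y_train` come from the question.

import sklearn_crfsuite
from sklearn_crfsuite import metrics
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True,
)
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)

# 10-fold cross-validation: 10 scores from 10 different train/test splits
scores = cross_val_score(crf, X_train, y_train, cv=10, scoring=f1_scorer)
print(scores.mean(), scores.std())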

1 Answer


You are not really looking for a random seed; you are probably looking for cross-validation.

You can find the full documentation here.

If you want to run 10 different iterations, you can use:

import scipy.stats
import sklearn_crfsuite
from sklearn_crfsuite import metrics
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True,
    verbose=True,
)

# sample c1 and c2 from exponential distributions instead of fixing them
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

# use the same metric for evaluation
# (labels: the list of entity tags to score, e.g. the tag set without 'O')
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)

# randomized search over the parameter space with 10-fold cross-validation
rs = RandomizedSearchCV(crf, params_space,
                        cv=10,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit(X_train, y_train)

and you will get the best parameters.
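For example, once the search above has finished, the standard `RandomizedSearchCV` attributes give you the winning configuration and a refitted model (this usage is an illustration, not part of the original answer):

print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)

# rs.best_estimator_ is a CRF refitted on all of X_train with the best params
pred_list = rs.best_estimator_.predict(X_test)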

  • Thank you for your suggestion; however, I am currently trying several NER models with random seeds to compare the results. I could have chosen cross-validation, which is a better way, but I preferred not to for several reasons. – Oguzhan Jan 05 '22 at 11:00