
I'm currently working on a multi-label text classification task. I have a dataframe with an ID column, a text column, and several label columns containing only 1 or 0.

I used an existing solution from this website, Kaggle Toxic Comment Classification using BERT, which outputs for each text its degree of belonging to each label as a percentage.

Now that I've trained my model, I would like to test it on a few unlabeled text extracts in order to obtain the percentage of belonging to each label:

I've tried this solution:

def getPrediction(in_sentences):
  label = ['S1', 'S2', 'S3']
  input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, labels=label) for x in in_sentences]
  input_features = run_classifier.convert_examples_to_features(input_examples, LABEL_COLUMNS, MAX_SEQ_LENGTH, tokenizer)
  predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=False)
  predictions = estimator.predict(predict_input_fn)
  return [(sentence, prediction['probabilities'], labels[prediction['labels']]) for sentence, prediction in zip(in_sentences, predictions)]

pred_sentences = [
  "here is an exemple of sentence"]

pred_sentences = ''.join(pred_sentences)

predictions = getPrediction(pred_sentences)

And I got:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-490-770bf0871d3e> in <module>
----> 1 predictions = getPrediction(pred_sentences)

<ipython-input-486-3de7328d60db> in getPrediction(in_sentences)
      2   label = ['S1','S2',
      3    'S3']
----> 4   input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, labels=label) for x in in_sentences]
      5   input_features = run_classifier.convert_examples_to_features(input_examples, LABEL_COLUMNS, MAX_SEQ_LENGTH, tokenizer)
      6   predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=False)

<ipython-input-486-3de7328d60db> in <listcomp>(.0)
      2   label = ['S1','S2',
      3    'S3']
----> 4   input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, labels=label) for x in in_sentences]
      5   input_features = run_classifier.convert_examples_to_features(input_examples, LABEL_COLUMNS, MAX_SEQ_LENGTH, tokenizer)
      6   predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=False)

TypeError: __init__() got an unexpected keyword argument 'labels'

Any idea what I need to change to make the last part of my algorithm functional?

JEG

1 Answer


You have made a typo: `InputExample` expects a keyword argument named `label`, not `labels`:

[run_classifier.InputExample(guid="", text_a = x, text_b = None, labels=label) for x in in_sentences]
                                                                 ^^^^^^
BioGeek
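Beyond the keyword fix, the comments below show that at prediction time a placeholder label in the same shape as the training labels is usually what is wanted. As a minimal sketch (the names `LABEL_COLUMNS` and `make_dummy_label` are assumptions for illustration, not part of the Kaggle kernel's API), the idea is to build a multi-hot vector of zeros, one entry per label column, and pass that as the dummy label:

```python
# Hypothetical sketch: build a dummy multi-hot label for prediction.
# LABEL_COLUMNS stands in for the label column names of the training
# dataframe; the zeros are placeholders, since the true labels are
# unknown at prediction time.
LABEL_COLUMNS = ["S1", "S2", "S3"]

def make_dummy_label(columns):
    # One 0 per label column, matching the multi-hot training format.
    return [0] * len(columns)

dummy = make_dummy_label(LABEL_COLUMNS)
print(dummy)  # [0, 0, 0]
```

This keeps the placeholder the same length as the training target, which is what a multi-label head expects (see the shape error discussed in the comments below).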
  • In fact, I've already tried that, but I got the error: TypeError: unhashable type: 'list' – JEG Jun 25 '20 at 11:42
  • That is because the keyword argument `label` expects a string, not a list. See https://github.com/google-research/bert/blob/master/run_classifier.py#L139 – BioGeek Jun 25 '20 at 11:46
  • I've tried adding: def getPrediction(in_sentences): label = ["S1","S2","S3"] label = " ".join(str(x) for x in label), which converts the list to a string, but I still get an error: KeyError: 'S1 S2 S3' – JEG Jun 25 '20 at 12:35
  • `InputExample` is for a **single** training/test example, so it should only receive a single label. – BioGeek Jun 25 '20 at 12:39
  • See the `Data Preprocessing` step in this notebook: https://github.com/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb – BioGeek Jun 25 '20 at 12:41
  • I'm using the same code available on this website: https://www.kaggle.com/javaidnabi/toxic-comment-classification-using-bert/output – JEG Jun 29 '20 at 14:05
  • but it is different from the code available on GitHub https://github.com/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb (which allows testing the trained algorithm on new sentences). – JEG Jun 29 '20 at 14:17
  • My goal here is to adapt the part of the code starting with "def getPrediction(in_sentences):" to my main algorithm (the Kaggle one). To my mind, the major difference is that the GitHub code only takes one label column into account, via a new variable "label_list", whereas my main code has many label columns and no label_list. – JEG Jun 29 '20 at 14:17
  • So I tried, for example, changing the label_list parameter to label_columns; I've also tried changing label = 0 to label = labels where labels = [S1, S2, S3], but in that case I got an error saying that the label variable only accepts strings, as you mentioned, so I tried converting the list to a string with labels = ''.join(labels), but I got "KeyError: 'S1S2S3'". – JEG Jun 29 '20 at 14:17
  • Finally, I also tried label='S1' (which makes no sense in my case because I have 3 labels, not just one), but I got "ValueError: logits and labels must have the same shape ((?, 3) vs (?,))" – JEG Jun 29 '20 at 14:19
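That last shape error is the multi-label mismatch in miniature: a multi-label head produces one logit per label, each squashed independently through a sigmoid, so the target must be a multi-hot vector of the same length, not a single scalar label. A self-contained sketch of that relationship (plain Python, no TensorFlow; the logit values are made up for illustration):

```python
import math

# Sketch of why the shapes must match: in a multi-label head, each
# example has one logit per label, and a sigmoid turns each logit into
# an independent "degree of belonging" for that label.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logits = [2.0, -1.0, 0.5]   # one logit per label (S1, S2, S3)
labels = [1, 0, 0]          # multi-hot target, same length as the logits
probs = [sigmoid(z) for z in logits]

# A scalar label like 'S1' gives shape (?,) against logits of shape
# (?, 3), which is exactly the ValueError quoted above.
assert len(labels) == len(logits)

print([round(p, 3) for p in probs])  # [0.881, 0.269, 0.622]
```

This is why, at prediction time, a dummy label still needs to be a vector of length 3 rather than a single string.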