
I want to fine-tune a BERT model for an NER task using Hugging Face Transformers, but most of my texts are longer than 512 tokens and I would rather not truncate or chunk them. So I tried to implement a sliding window approach, but it doesn't seem to work. Is a sliding window even applicable in my case?
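
For reference, the fast tokenizers in transformers can produce overlapping windows on their own via `stride` together with `return_overflowing_tokens`; a minimal sketch of that call, with the stride value chosen only as an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")

# One over-length example, already split into words (shortened here).
words = [["patient", "found", "by", "her", "husband", "after", "a", "fall"]]

encoded = tokenizer(
    words,
    is_split_into_words=True,
    truncation=True,                 # each window is cut at max_length
    max_length=512,
    stride=128,                      # overlap between consecutive windows
    return_overflowing_tokens=True,  # emit one row per window instead of dropping the overflow
)

# Every window becomes its own row; overflow_to_sample_mapping says which original
# example a window came from, and word_ids() still works per window.
for i in range(len(encoded["input_ids"])):
    sample_idx = encoded["overflow_to_sample_mapping"][i]
    window_word_ids = encoded.word_ids(batch_index=i)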

These are the code snippets of what I tried to implement:

The DataFrame contains two columns: `token`, which holds a list of tokens per row, and `ner_tags`, which holds the corresponding list of labels in BIO format.



from transformers import AutoTokenizer, AutoModelForTokenClassification, Trainer

tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
    num_labels=3,  # O, B-Disease_disorder, I-Disease_disorder
)

custom_labels = ['O', 'B-Disease_disorder', 'I-Disease_disorder']
label_encoding_dict = {'O': 0, 'B-Disease_disorder': 1, 'I-Disease_disorder': 2}

def tokenize_and_align_labels(examples):
    label_all_tokens = True   # label every sub-token of a word, not just the first
    stride = 200              # step between consecutive window starts
    window_size = 512         # number of word ids per window

    tokenized_inputs = tokenizer(
        list(examples["token"]),
        truncation=False,          # keep the full sequence; max_length is not applied without truncation
        is_split_into_words=True,
        padding=False,
        max_length=512,
    )

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        for start in range(0, len(word_ids), stride):
            end = start + window_size
            window_word_ids = word_ids[start:end]
            window_label = label[start:end]

            previous_word_idx = None
            for j, word_idx in enumerate(window_word_ids):
                if j >= len(window_label):
                    break

                if word_idx is None:
                    label_ids.append(-100)
                elif window_label[j] == 'O':
                    label_ids.append(0)
                elif word_idx != previous_word_idx:
                    label_ids.append(label_encoding_dict[window_label[j]])
                else:
                    label_ids.append(label_encoding_dict[window_label[j]] if label_all_tokens else -100)
                previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
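
For completeness, the tokenized datasets passed to the Trainer below would be built roughly like this (a sketch; `train_df` and `test_df` are assumed names for the two DataFrames):

from datasets import Dataset

# train_df / test_df are assumed names for pandas DataFrames with the
# "token" and "ner_tags" columns described above.
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

train_tokenized_datasets = train_dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=train_dataset.column_names,  # keep only the tokenizer output plus labels
)
test_tokenized_datasets = test_dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=test_dataset.column_names,
)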


trainer = Trainer(
    model,
    args,
    train_dataset=train_tokenized_datasets,
    eval_dataset=test_tokenized_datasets,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

I keep getting two different errors:
1. ValueError: expected sequence of length 4079 at dim 1 (got 5846)
2. IndexError: Invalid key: 49 is out of bounds for size 0
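
A quick way to see the mismatch behind the ValueError, assuming the `train_dataset` from the sketch above:

# Run the function on a small batch and compare lengths per example.
out = tokenize_and_align_labels(train_dataset[:4])
for i in range(len(out["input_ids"])):
    print(len(out["input_ids"][i]), len(out["labels"][i]))
# input_ids holds one un-truncated sequence per example, while the sliding-window
# loop above appends every window's labels to the same label_ids list, so for
# long texts the labels end up longer than the inputs (e.g. 4079 vs. 5846).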

  • Can you please add a small sample of your data (preferably one that throws an error)? – cronoik May 29 '23 at 11:20
  • You think it's a data issue rather than a wrong implementation? My data is huge, but it looks something like this. Tokens: ['75m', 's/p', 'fall', '4', 'steps', 'at', 'home,', 'found', 'by', 'her', 'husband', 'in', 'name2', '(ni)', '573', 'and', 'vomiting', ',', 'no', 'recollection', 'of', 'event,', 'transfer', 'from', 'osh', 'w/sdh,', 'SAH', ',', 'large', 'occipital', 'laceration', '.', 'was', 'hypotensive']; labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Disease_disorder', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Disease_disorder', 'O', 'O', 'B-Disease_disorder', 'I-Disease_disorder', 'O', 'O', 'O'] – Danial Jun 01 '23 at 06:23
  • No, I think it is an implementation issue. It is just that I don't want to waste time with guessing your data. The `labels` you showed are the `ner_tags` of the `examples` dict? – cronoik Jun 02 '23 at 10:40
  • Yes, they are the same. – Danial Jun 03 '23 at 10:47
