
How do I work with sequences longer than 512 tokens? I don't want to use truncation=True; I actually want to handle the longer sequences.

Jyoti yadav

2 Answers


You can use the stride parameter together with max_length to handle larger documents.

encoding = processor(images, words, boxes=boxes, word_labels=word_labels, truncation=True, padding="max_length", max_length=512, stride=128, return_overflowing_tokens=True, return_offsets_mapping=True)

This lets you handle the larger files.
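For illustration, here is a rough sketch of what the resulting encoding contains; the chunk counts in the comments are made-up examples and depend on your document length:

# Each element of input_ids is one 512-token window; long documents produce several,
# and consecutive windows overlap by `stride` tokens.
print(len(encoding["input_ids"]))              # e.g. 3 windows for a document of ~1,200 tokens

# overflow_to_sample_mapping tells you which original sample each window came from.
print(encoding["overflow_to_sample_mapping"])  # e.g. [0, 0, 0] if all windows belong to image 0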

Let me know if this is useful.

Chetan

I had the same issue with LayoutLMv3, and because I think this problem is common in document information extraction tasks, I will describe how I dealt with it:

1. Training:

As you may know, first of all we have to change the processor configuration by using stride, padding, and offset mapping:

....
......

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

encoding = processor(images, words, boxes=boxes, word_labels=word_labels, truncation=True, stride=128,
         padding="max_length", max_length=512, return_overflowing_tokens=True, return_offsets_mapping=True)

offset_mapping = encoding.pop('offset_mapping')

overflow_to_sample_mapping = encoding.pop('overflow_to_sample_mapping')

I'm not completely sure why we should set return_offsets_mapping to True when we have to pop offset_mapping right after encoding, but I think it's necessary. To clarify, if you followed NielsRogge's notebook (you can find the notebook here), we have to change the prepare_examples method like this:

def prepare_examples(examples):
  images = [Image.open(path).convert("RGB") for path in examples['image_path']] 
  words = examples[text_column_name]
  boxes = examples[boxes_column_name]
  word_labels = examples[label_column_name]
  encoding = processor(images, words, boxes=boxes, word_labels=word_labels, truncation=True, stride=128,
         padding="max_length", max_length=512, return_overflowing_tokens=True, return_offsets_mapping=True)
  offset_mapping = encoding.pop('offset_mapping')
  overflow_to_sample_mapping = encoding.pop('overflow_to_sample_mapping')
  return encoding

Next, you have to follow the remaining steps as usual, without any changes, and train the model!

Note: It's completely normal if the number of rows in your dataset after mapping doesn't match the number of your documents. That's because if a document has more than 512 tokens, the overflowing tokens are stored in additional rows (from token 512 to 1024 in the next row, then 1024 to 1536, and so on).
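As a minimal sketch of that mapping step (following NielsRogge's notebook; the dataset variable, split name, and remove_columns here are assumptions and may differ in your setup):

# Sketch only: prepare_examples is the function defined above; a long document
# yields several encoded rows, so the row count can exceed the document count.
train_dataset = dataset["train"].map(
    prepare_examples,
    batched=True,
    remove_columns=dataset["train"].column_names,
)
print(len(dataset["train"]), "documents ->", len(train_dataset), "training rows")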

2. Inference:

The inference section is a little bit tricky. Again, I will describe the setup based on the notebook mentioned above. Obviously, we have to configure our processor with stride and padding just like in the training phase.

encoding = processor(images, words, boxes=boxes, word_labels=word_labels, truncation=True, stride=128,
         padding="max_length", max_length=512, return_overflowing_tokens=True, return_offsets_mapping=True)

offset_mapping = encoding.pop('offset_mapping')

overflow_to_sample_mapping = encoding.pop('overflow_to_sample_mapping')

Next, we have to change the shape of the encoding to handle multiple pages (as I said in part 1, we split the long token sequence into separate chunks, so now the results are 2D). I have reshaped them this way:

# change the shape of pixel_values: stack the per-chunk image tensors into a single tensor
x = [encoding['pixel_values'][i] for i in range(len(encoding['pixel_values']))]
x = torch.stack(x)
encoding['pixel_values'] = x

So if we print the encoding items, we will have something like this:

for k,v in encoding.items():
  print(k,v.shape)

results:

input_ids torch.Size([3, 512])
attention_mask torch.Size([3, 512])
bbox torch.Size([3, 512, 4])
pixel_values torch.Size([3, 3, 224, 224])

As we can see, in my case the document is divided into 3 parts (for example, input_ids has size [3, 512], i.e. 3x512, whereas with normal processing we would get just a single [1, 512] array in every case). So we're doing fine so far. Now we have to pass the encoding to the model to get the predictions:

with torch.no_grad():
  outputs = model(**encoding)

# The model outputs logits of shape (batch_size, seq_len, num_labels).
logits = outputs.logits
print(logits.shape)

# We take the highest score for each token, using argmax. This serves as the predicted label for each token.
predictions = logits.argmax(-1).squeeze().tolist()
token_boxes = encoding.bbox.squeeze().tolist()

if (len(token_boxes) == 512):
  predictions = [predictions]
  token_boxes = [token_boxes]

The last lines (the if clause) in the code above are there because, when the number of tokens is less than 512, we get a 1D array, and we have to wrap it in a list to prevent errors in the next step.

Finally, now that we have predictions and token_boxes from the model, you can also get the text of each bbox by using processor.tokenizer.decode(encoding["input_ids"][i][j]), where i and j correspond to the entity whose text you want to extract. Just as an example, we could print the predictions by traversing token_boxes with a for loop (you could do whatever you want, because we needed the predictions and bboxes and we have them now! Processing them is up to you ;) )

# this is just an example, change this code for your project!
for i in range(0, len(token_boxes)):
   for j in range(0, len(token_boxes[i])):
        print("label is: {}, bbox is: {} and the text is: {}".format(predictions[i][j], 
                    token_boxes[i][j],  processor.tokenizer.decode(encoding["input_ids"][i][j]) )
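
As a follow-up (this filtering is my own addition, not part of the notebook): with the default LayoutLMv3 tokenizer settings, padding and special tokens get the dummy box [0, 0, 0, 0], so you can skip them and, assuming your fine-tuned model's config has id2label populated, print readable label names:

# rough sketch: skip padding / special tokens (dummy box [0, 0, 0, 0]) and
# translate each predicted label id into its label name via the model config
id2label = model.config.id2label
for i in range(len(token_boxes)):
    for j in range(len(token_boxes[i])):
        if token_boxes[i][j] == [0, 0, 0, 0]:
            continue
        text = processor.tokenizer.decode(encoding["input_ids"][i][j])
        print("label is: {}, bbox is: {} and the text is: {}".format(id2label[predictions[i][j]],
                    token_boxes[i][j], text))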
Ali Tavana
  • I get an error if I try to pop "offset_mapping", but it works fine without that line. Note that you don't need to assign the return of pop to any variable, so instead of overflow_to_sample_mapping = encoding.pop('overflow_to_sample_mapping') simply encoding.pop('overflow_to_sample_mapping') is good. – Pablo Apr 21 '23 at 10:07
  • I forgot to mention that I also need to remove return_offsets_mapping=True to avoid errors. – Pablo Apr 21 '23 at 10:22
  • I have one question: if I am passing multiple images into my processor, how can I distinguish the predictions by image? – infinity911 Jun 20 '23 at 16:24