
I am trying to fine-tune a Donut (Document Understanding Transformer) Hugging Face model, but am getting hung up trying to create a DonutDataset object. I have the following code (running in Google Colab):

!pip install transformers datasets sentencepiece donut-python

from google.colab import drive
from donut.util import DonutDataset
from transformers import DonutProcessor, VisionEncoderDecoderModel, VisionEncoderDecoderConfig

drive.mount('/content/drive/')
projectdir = 'drive/MyDrive/donut'


donut_version = 'naver-clova-ix/donut-base-finetuned-cord-v2'  # 'naver-clova-ix/donut-base'
config = VisionEncoderDecoderConfig.from_pretrained(donut_version)
config.decoder.max_length = 768

processor = DonutProcessor.from_pretrained(donut_version)
model = VisionEncoderDecoderModel.from_pretrained(donut_version, config=config)


train_dataset = DonutDataset(f'{projectdir}/input_doc_images', 
                             model,
                             #'naver-clova-ix/donut-base-finetuned-cord-v2',
                             max_length=config.decoder.max_length,
                             split="train", 
                             task_start_token="", 
                             prompt_end_token="",
                             sort_json_key=True,
                             )

...however, the last line is throwing the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-9d831be996e6> in <cell line: 4>()
      2 
      3 max_length = 768
----> 4 train_dataset = DonutDataset(f'{projectdir}/input_doc_images', 
      5                              model,
      6                              #'naver-clova-ix/donut-base-finetuned-cord-v2',

2 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in __getattr__(self, name)
   1612             if name in modules:
   1613                 return modules[name]
-> 1614         raise AttributeError("'{}' object has no attribute '{}'".format(
   1615             type(self).__name__, name))
   1616 

AttributeError: 'VisionEncoderDecoderModel' object has no attribute 'json2token'

I'm a little confused, because my model object is a 'naver-clova-ix/donut-base-finetuned-cord-v2' model, which this line from model.py in the Donut GitHub repo seems to suggest does in fact have a json2token method?

What am I missing?

Btw, you can view/copy my underlying data (images and a JSON-lines metadata file) from my Google Drive 'donut' folder here: https://drive.google.com/drive/folders/1Gsr7d7Exvtx5PqjZQv2nXP9-pPDUEIOx?usp=sharing


1 Answer


To use DonutDataset correctly, you should use the model class from donut instead of transformers; then the json2token call will work correctly, e.g.:

from donut.util import DonutDataset
from donut import DonutModel
from transformers import DonutProcessor, VisionEncoderDecoderModel, VisionEncoderDecoderConfig

import torch

pretrained_model = DonutModel.from_pretrained(
    "naver-clova-ix/donut-base-finetuned-rvlcdip",
    ignore_mismatched_sizes=True)

pretrained_model.encoder.to(torch.bfloat16)


train_dataset = DonutDataset('my_dataset/',
                             pretrained_model,
                             max_length=768,  # the decoder max_length set in the question's config
                             split="train",
                             task_start_token="",
                             prompt_end_token="",
                             sort_json_key=True,
                             )

Note that the json2token function is defined in the donut repo on the DonutModel object: https://github.com/clovaai/donut/blob/master/donut/model.py#L498

And if we look at transformers, there is no json2token on the VisionEncoderDecoderModel object: https://github.com/huggingface/transformers/blob/main/src/transformers/models/vision_encoder_decoder/modeling_vision_encoder_decoder.py#L151
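
For context, json2token recursively flattens a ground-truth JSON object into the tag-style token sequence that Donut's decoder is trained on, which is why DonutDataset expects a DonutModel rather than a VisionEncoderDecoderModel. A simplified sketch of the idea (not the actual source; the real method at the link above also registers each <s_key>/</s_key> pair as special tokens on the decoder and can sort the keys):

# Simplified illustration of what DonutModel.json2token does.
def json2token(obj):
    if isinstance(obj, dict):
        # each key becomes an opening/closing tag pair around its converted value
        return "".join(f"<s_{k}>{json2token(v)}</s_{k}>" for k, v in obj.items())
    if isinstance(obj, list):
        # list items are joined with a separator token
        return "<sep/>".join(json2token(item) for item in obj)
    return str(obj)

print(json2token({"menu": [{"nm": "Latte", "price": "4.50"}]}))
# <s_menu><s_nm>Latte</s_nm><s_price>4.50</s_price></s_menu>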


To use the model from transformers instead of donut, you would need to read the data differently: skip donut.util.DonutDataset and re-create the dataset as a Hugging Face-friendly Dataset, like this:


import PIL.Image

from datasets import Dataset

i1 = PIL.Image.open('my_dataset/alex_cannon_dep_first_page.png')
i2 = PIL.Image.open('my_dataset/mcentee_dep_first_page.png')

train_dataset = Dataset.from_dict({'images': [i1, i2]})

Then you have to do all the feature processing yourself before feeding the data to the model; see https://huggingface.co/docs/transformers/model_doc/donut
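
For example, here is a rough sketch of that manual preprocessing, assuming the cord-v2 processor/model from the question. The image path comes from the dataset above, and target_sequence (including the <s_cord-v2> prompt and the <s_total> field) is only an illustrative placeholder that you would replace with a token sequence built from your own JSON-lines ground truth:

import PIL.Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained('naver-clova-ix/donut-base-finetuned-cord-v2')
model = VisionEncoderDecoderModel.from_pretrained('naver-clova-ix/donut-base-finetuned-cord-v2')

image = PIL.Image.open('my_dataset/alex_cannon_dep_first_page.png').convert('RGB')
target_sequence = "<s_cord-v2><s_total>100.00</s_total></s>"  # illustrative ground truth only

# image -> pixel_values tensor of shape (1, 3, H, W)
pixel_values = processor(image, return_tensors="pt").pixel_values

# target token sequence -> label ids, with padding masked out of the loss
labels = processor.tokenizer(target_sequence,
                             max_length=768,
                             padding="max_length",
                             truncation=True,
                             return_tensors="pt").input_ids
labels[labels == processor.tokenizer.pad_token_id] = -100

outputs = model(pixel_values=pixel_values, labels=labels)
print(outputs.loss)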

  • Hey thanks a lot Alvas! This is definitely on the path of what I was looking for, as I was tying myself in knots going down the more complicated way via transformers. I now have this warning tho: ``` Some weights of DonutModel were not initialized from the model checkpoint at naver-clova-ix/donut-base-finetuned-rvlcdip and are newly initialized because the shapes did not match: ``` with several lines that look like this: ``` - encoder.model.layers.1.downsample.reduction.weight: found shape torch.Size([512, 1024]) in the checkpoint and torch.Size([256, 512]) in the model instantiated ``` – Max Power Jun 12 '23 at 16:05
  • ...the warning ends with "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference." ...but I'm thinking I have definitely enough data to fine-tune for my task, but not sure I have enough data to retrain several layers from scratch/initialization. Can you help me understand how I might remedy this? Or if I'm overestimating the issue here? – Max Power Jun 12 '23 at 16:07