Fine-Tune/Train EASY OCR on any language/Korean handwritten dataset

Question

I would like to fine-tune the EasyOCR library on Korean handwritten samples. I am assuming that the pre-trained model is already trained on Korean and English samples.

My goal is to improve EasyOCR's accuracy on Korean handwriting. How can I achieve this? I know how to train custom models, but because the English datasets are so large, I don't want to train on Korean and English from scratch. I already have 10M Korean handwritten images.

Easy OCR Custom Training from Scratch

https://github.com/JaidedAI/EasyOCR/blob/master/custom_model.md

Khawar Islam · Answer 1 · 2023-04-25T07:16:18.787

Step 1: Dataset Generation

First, generate a Korean handwritten dataset from a Hangul dictionary (a collection of words). The dataset should contain at least 10M samples to obtain reasonably satisfactory results. You can generate the dataset with the repositories below:

https://github.com/parksunwoo/ocr_kor
https://github.com/clovaai/synthtiger

I do not recommend SynthTIGER (Synthetic Text Image Generator), because it generates very distorted images, which directly hurts the training loss and leads to poor results. After generation, the final dataset contains a train folder and a val folder, each holding the images together with a labels.csv file in the same folder. The directory structure looks like this:

train
├── 1.jpeg
├── 2.jpeg
├── 3.jpeg
└── labels.csv

The CSV contains two columns (filename, words). Each row holds an image name and the words shown in that image, for example 44.png,남항법.
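As a rough illustration of the layout described above, the following Python sketch writes a labels.csv with the two columns (filename, words) and checks that images and label rows match. The helper names write_labels and check_split are hypothetical, not part of EasyOCR. The header row and the accepted image extensions are assumptions, so verify them against the trainer you actually use.

```python
import csv
from pathlib import Path

# Assumption: images use one of these extensions (the answer shows .jpeg and .png).
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def write_labels(split_dir, labels):
    """Write labels.csv (columns: filename, words) into split_dir.

    `labels` maps an image file name to the text shown in that image.
    """
    split_dir = Path(split_dir)
    split_dir.mkdir(parents=True, exist_ok=True)
    csv_path = split_dir / "labels.csv"
    # UTF-8 keeps Hangul intact; newline="" is what the csv module recommends.
    with csv_path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        # Assumption: the trainer expects a header row with these column names.
        writer.writerow(["filename", "words"])
        for filename, words in sorted(labels.items()):
            writer.writerow([filename, words])
    return csv_path


def check_split(split_dir):
    """Return (images without a label row, label rows without an image)."""
    split_dir = Path(split_dir)
    with (split_dir / "labels.csv").open(encoding="utf-8", newline="") as f:
        labelled = {row["filename"] for row in csv.DictReader(f)}
    on_disk = {
        p.name for p in split_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    }
    return sorted(on_disk - labelled), sorted(labelled - on_disk)
```

Running check_split on both train and val before training is a cheap way to catch missing or orphaned labels. Note that csv.writer quotes any label containing a comma, and whether the training script parses quoted fields the same way is something to confirm separately.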
