
I am training a model to extract all the necessary fields from a resume, for which I am using Mask R-CNN to detect the fields in the image. I have trained my Mask R-CNN model on 1000 training samples with 49 fields to extract, but I am unable to improve the accuracy. How can I improve the model? Are there any pretrained weights that may help?

[image: detection results]

The model has difficulty reading the following text:

[image: hard-to-read text sample]

hR 312
  • What augmentations are you currently applying? – Bharath M Shetty Nov 03 '19 at 10:59
  • I am not using augmentation for now. – hR 312 Nov 03 '19 at 11:01
  • Feed more examples by trying different augmentations, like cropping subsections of the resume and feeding them into the network to extract the field. Try adaptive learning rates. – Bharath M Shetty Nov 03 '19 at 11:04
  • Okay, I am training on images of size (256, 256, 3); should I increase the size of the images as well? Also, what should be the minimum size of the training samples? – hR 312 Nov 03 '19 at 11:08
  • Is Mask RCNN a good approach? or should I go for some other method like yolo maybe? – hR 312 Nov 03 '19 at 11:09
  • Mask R-CNN is a bit of an overkill; are the fields segmented, or are they square boxes? – Eliethesaiyan Nov 14 '19 at 05:44
  • To be honest, I don't know where to start. I believe you can only get somewhat satisfactory answers if you detail your question. – Daniel Möller Nov 14 '19 at 14:38
  • @Eliethesaiyan Square boxes – hR 312 Nov 15 '19 at 06:32
  • @DanielMöller I want to create a portal in which a user uploads their resume and my AI engine reads all the fields, like their name and their skills, by itself, which isn't possible to achieve with just pretrained OCR engines. So I am using an object detection algorithm to detect the skill portion, the name portion, etc., and using OCR to produce the output. I'll update the question with screenshots of the detected regions. – hR 312 Nov 15 '19 at 06:36
  • If you want to identify name, skill etc, what is the logic behind treating the resume as an image and extracting image sections compared to reading text out of the resume and then trying to classify the text? – Deepak Garud Nov 15 '19 at 06:49
  • @hR312, I understand, but you are facing two major problems. Are all CVs in the same format? If so, OCR will be the best fit. Object detection is related to the form of the object rather than the content (text), so I think this will be hard to achieve. How will you prepare the ground truth dataset? – Eliethesaiyan Nov 15 '19 at 06:57
  • Future resumes can be of any format @Eliethesaiyan, and all resumes are images only – hR 312 Nov 15 '19 at 08:07
  • The solution I am developing is for resumes as images only; that's the challenging part @DeepakGarud – hR 312 Nov 15 '19 at 08:08
  • @hR312, I meant the resume formatting style: is the skills section always in the same area of the image (top, bottom, middle)? I believe not, since someone with more education or skills will make the section longer – Eliethesaiyan Nov 15 '19 at 08:36
  • Yes, correct @Eliethesaiyan. Then which approach, in your view, is suitable for this task? – hR 312 Nov 15 '19 at 08:42
  • I am not really sure, but this repo seems to do something close to what you want to achieve: https://github.com/elifesciences/sciencebeam – Eliethesaiyan Nov 15 '19 at 09:02
  • I have to complete the task using an ML algorithm from scratch, actually. So it'll be great if you can suggest an ML way. – hR 312 Nov 15 '19 at 12:15
  • @hR312 Extracting text using OCR and then doing text classification makes more sense to me, because as you said the resume format can differ, but the text inside will be more or less similar for each class. – Vivek Mehta Nov 17 '19 at 05:47
  • Indeed, fields like skills will be easier, @VivekMehta, but fields like name and job role and many more will be more difficult to identify with text classification. – Hrithik Puri Nov 18 '19 at 11:53
  • @DanielMöller The problem is to parse resumes which are in image format and store them in structured form. – Hrithik Puri Nov 18 '19 at 12:00
  • @HrithikPuri No, actually, this problem is one of text classification only. Using something like object detection/masking would not be the right choice, especially with varying templates. – Vivek Mehta Nov 19 '19 at 05:26
  • @VivekMehta How will we extract a table column-wise? Kindly refer to the updated image in the question. – hR 312 Nov 19 '19 at 05:28
  • @VivekMehta And OCR might fail to read a word appropriately; won't that affect the text classifier? – hR 312 Nov 19 '19 at 05:29
  • In the end, OCR will be used to read the field; Mask R-CNN is there to guide the model to a suitable region where the skills or some other field might be – hR 312 Nov 19 '19 at 05:31
  • @hR312 many OCR engines (tesseract for example) have many page segmentation modes which can be used to extract tabular format text. Also, of course there will be challenges, but given that you have 1000 (which is limited) images and as per your above input that _"Future resume can be of any format"_ doing OCR first and then classifying text is more suitable in your case. – Vivek Mehta Nov 19 '19 at 05:38
  • Will increasing my database and data augmentation help? – hR 312 Nov 19 '19 at 05:40
  • Is it possible to share the code on GitHub, so we can help more? – Bill Chen Nov 20 '19 at 21:01
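
The cropping augmentation suggested in the comments can be sketched in pure NumPy (a minimal sketch, assuming images are H×W×3 arrays; in a real Mask R-CNN pipeline the box/mask annotations would need to be cropped and shifted the same way):

```python
import numpy as np

def random_crop(image, crop_h, crop_w, rng=None):
    """Randomly crop a subsection of a resume image, as suggested in the
    comments, to generate extra training samples from limited data."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

# Example: crop a 256x256 patch from a larger (hypothetical) resume scan.
resume = np.zeros((1024, 768, 3), dtype=np.uint8)
patch = random_crop(resume, 256, 256)
print(patch.shape)  # (256, 256, 3)
```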

3 Answers


It looks like you want to do text classification/processing: you need to extract details from the text, but you are applying object detection algorithms. I believe you need to use OCR to extract the text (if you have the CV as an image) and then use a text classification model. Check out the links below for more information about text classification:

https://medium.com/@armandj.olivares/a-basic-nlp-tutorial-for-news-multiclass-categorization-82afa6d46aa5

https://www.tensorflow.org/tutorials/tensorflow_text/intro
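
A minimal sketch of this OCR-then-classify idea, using toy labelled snippets as stand-ins for OCRed resume text (in practice the strings could come from e.g. `pytesseract.image_to_string`; the snippets and labels here are hypothetical):

```python
# Sketch of the pipeline: OCR each resume region, then classify the text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled snippets standing in for OCRed resume sections.
texts = [
    "Python Java SQL machine learning",
    "C++ TensorFlow deep learning Keras",
    "B.Tech Computer Science 2015-2019",
    "M.Sc Mathematics University 2012",
]
labels = ["skills", "skills", "education", "education"]

# TF-IDF features + a simple linear classifier as a baseline.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["Python deep learning"])[0])  # skills
```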

Suman
  • Upvoted. Exactly what I said last time this same question was asked (for which I was downvoted): https://stackoverflow.com/questions/58748719/what-are-some-good-data-augmentation-techniques-for-document-images/58768294#58768294 – ezekiel Nov 18 '19 at 16:05
  • @ezekiel How will we extract a table column-wise? Kindly refer to the updated image in the question. Also, OCR might fail to read a word appropriately; won't that affect the text classifier? And in the end, OCR will be used to read the field; Mask R-CNN is there to guide the model to a suitable region where the skills or some other field might be. – hR 312 Nov 19 '19 at 05:32
  • So all CVs are in an identical format? It's difficult to suggest an approach without more information. It might make sense to do line detection using OpenCV or similar and use that to specify areas to be fed to OCR. You could perhaps use regular expressions and dictionaries or similar to try to deal with small mistakes in reading. – ezekiel Nov 19 '19 at 15:11
  • @ezekiel I tried line detection, but it doesn't work on examples like the one given in the question, and I haven't found a solution for such problems – hR 312 Jan 17 '20 at 13:25

You can break up the problem in two different ways:

Step 1: OCR seems to be the most direct way to get to your data. But increase the image size, and thus the resolution; otherwise, you may lose data.

Step 2: Store the coordinates of each OCRed word. This is valuable information in this context; how words line up has significance.

Step 3: At this point you can try basic positional clustering to group words. However, this can easily fail on a columnar vs. row-based distribution of related text.

Step 4: See if you can identify which of the 49 tags these clusters belong to. Look at text classification with Hidden Markov models and the Baum-Welch algorithm, i.e. go for basic models first.

OR: the above ignores the inherent classification opportunity that is the image of a, well, properly formatted CV.

Step 1: Train your model to partition the image into sections without OCR. A good model should not break up sentences, tables, etc. This approach may leverage separator lines and the like. There is also an opportunity to decrease the size of your image, since you are not OCRing yet.

Step 2: OCR the image sections and try to classify them, similar to the above.
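
Steps 2 and 3 of the first approach can be sketched as follows. The `(word, x, y)` tuples are hard-coded stand-ins for what an OCR engine would return (e.g. the word boxes from `pytesseract.image_to_data`); the tolerance value is an assumption:

```python
# Sketch of positional clustering: group OCRed words into text lines
# by their vertical coordinate, then sort each line left-to-right.
def cluster_by_line(words, y_tol=10):
    """words: list of (text, x, y) tuples from OCR.
    Returns lists of words grouped into approximate text lines."""
    lines = []
    for word, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        # Start a new line unless this word sits within y_tol pixels
        # of the first word of the current line.
        if lines and abs(lines[-1][0][2] - y) <= y_tol:
            lines[-1].append((word, x, y))
        else:
            lines.append([(word, x, y)])
    return [[w for w, _, _ in sorted(line, key=lambda t: t[1])]
            for line in lines]

# Hypothetical word boxes from a resume scan.
words = [("Skills:", 10, 100), ("Python,", 80, 102),
         ("Name:", 10, 20), ("Jane", 70, 22)]
print(cluster_by_line(words))  # [['Name:', 'Jane'], ['Skills:', 'Python,']]
```

As Step 3 warns, this simple y-tolerance scheme breaks down on multi-column layouts, where words on the same visual line belong to different logical sections.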

Natre
  • Thanks for the input. The last approach is exactly what I am doing, and my question is how I can improve the Mask R-CNN model for the same. – hR 312 Nov 21 '19 at 11:50

Another option is to use a neural network like PixelLink: Detecting Scene Text via Instance Segmentation

https://arxiv.org/pdf/1801.01315.pdf

Suman