
I have several types of images from which I need to extract text. I can manually classify the images into 3 categories based on the background noise:

  1. Images with no noise.
  2. Images with some light noise in the background.
  3. Images with heavy noise in the background.

For the category 1 images, OCR works fine without any preprocessing (the basic case).

For the category 2 images and some of the category 3 images, I can extract the text by applying the following methods (see the sketch after this list):

  • Grayscale, Gaussian blur, Otsu's threshold
  • Morphological open to remove noise, then invert the image and perform text extraction.
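
For reference, a minimal OpenCV sketch of this pipeline; the file path and kernel sizes are placeholders, not exact values:

```python
import cv2

# Load, grayscale, and blur to suppress high-frequency noise
image = cv2.imread("doc.png")  # placeholder path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3, 3), 0)

# Otsu's method picks the binarization threshold automatically
_, thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Morphological open removes small noise specks, then invert back
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
result = 255 - opening  # black text on white background for OCR
```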

For the OCR task, a single noise-removal method obviously does not work for all images. Is there any method for classifying the background noise level of the images?

All suggestions are welcome. Thanks in advance.

  • OCR based on CNNs applied to the whole image (instead of thresholding and ripping the picture into pieces) should be able to stomach all four, but the last picture is very marginal. There is no need to "remove noise"; that's what the network is trained for. That specific image could be thresholded sharply because the text is near-perfect black while most of the noise isn't, and where it is, it's just a pixel or two. That can be removed by erasing all connected components/contours that have too small an area. – Christoph Rackwitz Jan 06 '22 at 10:01
  • @ChristophRackwitz Thank you for your comment. However, I am using Tesseract to extract text from document images. Tesseract works well on text-block images, but when there is too much variation in fonts, sizes, etc., it can't OCR the whole image correctly. Besides, to extract some specific text fields, I think it is still necessary to crop the whole document image into text-block images. What I am targeting here is to apply appropriate image-processing methods to get better OCR results. It comes back to my original question: how to measure the background noise level of the images. :( – NguyenHai Jan 06 '22 at 10:41

2 Answers


Following up on your comment from the other question, here are some things you could try. Some combination of the ideas below should help.

Image Embedding and Vector Clustering

Manual

Use a pretrained network such as a ResNet trained on ImageNet (which may not work well here) or a simple network pretrained on MNIST/EMNIST.

Extract and concatenate the flattened feature vectors from some layers toward the end of the network. Apply dimensionality reduction, then use nearest-neighbor/approximate-nearest-neighbor algorithms to find the closest matches. Set the number of clusters to 3, since you have 3 types of images. A rough sketch of this idea follows.
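
A minimal sketch assuming torchvision and scikit-learn; the model choice, input size, PCA dimensionality, and file paths are all assumptions, not a definitive recipe:

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Pretrained ResNet-18 with its classification head removed -> 512-d embeddings
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(paths):
    """Return one embedding vector per image path."""
    feats = []
    with torch.no_grad():
        for p in paths:
            x = preprocess(Image.open(p).convert("RGB")).unsqueeze(0)
            feats.append(model(x).squeeze(0).numpy())
    return np.stack(feats)

paths = ["img1.png", "img2.png"]  # placeholder: use all your images
embeddings = embed(paths)
# Dimensionality reduction, then cluster into the 3 noise categories
# (PCA needs at least as many samples as components)
reduced = PCA(n_components=16).fit_transform(embeddings)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(reduced)
```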

For nearest neighbor, start with KNN. There are also many libraries on GitHub that may help, such as faiss and annoy.

More can be found at:

https://github.com/topics/nearest-neighbor-search

https://github.com/topics/approximate-nearest-neighbor-search

If the result of the above is not good enough, try fine-tuning only the last few layers of the MNIST/EMNIST-trained network, as in the sketch below.
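
A minimal PyTorch sketch of that setup, shown on a torchvision ResNet for concreteness (the layer names for an MNIST/EMNIST network would differ):

```python
import torch
import torchvision.models as models

# Pretrained backbone with a fresh 3-way head for the noise categories
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Linear(model.fc.in_features, 3)

# Freeze everything, then unfreeze only the last block and the new head
for param in model.parameters():
    param.requires_grad = False
for param in list(model.layer4.parameters()) + list(model.fc.parameters()):
    param.requires_grad = True

# Optimize only the trainable parameters
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```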

Using Existing Libraries

For grouping/finding similar images, look into:

https://github.com/jina-ai/jina

You should be able to find more similarity-clustering projects using the neural-search and image-search tags on GitHub:

https://github.com/topics/neural-search

https://github.com/topics/image-search

OCR

  • Try easyocr, as it worked better for me than Tesseract the last time I used OCR (a minimal usage sketch follows this list).
  • Run it on the whole document first to see if your requirements are met.
  • Instead of tight cropping, use some/large padding around the text where possible, with no other text nearby. Alternatively, try adding padding in all directions to a tightly cropped text image to see if it improves the OCR result.
  • For Tesseract, see if the tools mentioned in the improving quality documentation help.
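
A sketch of the easyocr and padding suggestions; the language list, file paths, and padding size are assumptions:

```python
import cv2
import easyocr

reader = easyocr.Reader(["en", "ja"])  # assumed languages; pick yours

# Run on the whole document first
results = reader.readtext("doc.png")  # list of (bbox, text, confidence)

# For a tightly cropped text region, add padding in all directions
crop = cv2.imread("crop.png")
padded = cv2.copyMakeBorder(crop, 20, 20, 20, 20,
                            cv2.BORDER_CONSTANT, value=(255, 255, 255))
results_crop = reader.readtext(padded)
```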

Classification

Noise Removal

  • Here you can also go with a neural network. Train a denoising autoencoder on Category 1 images, corrupting them by adding noise that mimics the Category 2 and Category 3 images. This way the neural network can handle the 3 image categories without you needing to manually create a dataset, and in post-processing you can use another neural network or an image-processing method to remove noise based on the category type (a sketch follows this list). (Image from https://keras.io/examples/vision/autoencoder/)

  • Try existing libraries or pretrained networks from GitHub to remove noise in the whole document/cropped region. Look into rembg to see whether it works on text documents.
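
A sketch loosely following the Keras example linked above; the patch size and the Gaussian noise used as a stand-in for the real Category 2/3 degradation are assumptions:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def add_noise(images, factor=0.4):
    """Synthetic corruption; replace with noise that mimics Category 2/3."""
    noisy = images + factor * np.random.normal(size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

# Small convolutional denoising autoencoder
inputs = keras.Input(shape=(128, 128, 1))  # assumed patch size
x = layers.Conv2D(32, 3, activation="relu", padding="same", strides=2)(inputs)
x = layers.Conv2D(32, 3, activation="relu", padding="same", strides=2)(x)
x = layers.Conv2DTranspose(32, 3, strides=2, activation="relu", padding="same")(x)
x = layers.Conv2DTranspose(32, 3, strides=2, activation="relu", padding="same")(x)
outputs = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# clean = array of Category 1 patches scaled to [0, 1], shape (N, 128, 128, 1)
# autoencoder.fit(add_noise(clean), clean, epochs=50, batch_size=32)
```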

  • Hi @B200011011, thank you very much for your suggestions. I am currently targeting the classification task, but I can't go for ML methods since I don't have enough data or computation resources. Any other ideas I could use for classifying the type 2 and 3 images with non-ML methods? – NguyenHai Jan 22 '22 at 01:04
  • If you have at least 500-1000 images, I suggest trying deep learning training in Colab or Kaggle. If not, first try the other methods I suggested and update the question so others can give ideas. You also have to provide more example images from each category. I am adding another method you could try. – B200011011 Jan 22 '22 at 11:33
  • Hi @B200011011, thank you for your comment. I currently don't have 500 to 1000 images. These cropped images were extracted from scanned documents. I definitely want to try the DL methods you suggested once I have collected enough data. For now, I am hoping to use some other techniques to separate the type 3 images from the type 2 ones. Since these Kanji characters are pretty sensitive, I hope to use different noise-cleaning methods for the different types of images. – NguyenHai Jan 22 '22 at 12:56
  • Hi @B200011011, thank you for your last reply. I have added more images of categories 2 and 3. I have checked your suggested answer and its references; however, I could not find a satisfactory classification method for the current images. Your answer from the other post is still the closest one I have so far. I wonder if you could suggest some further non-ML/DL approaches. – NguyenHai Jan 24 '22 at 04:59

Your samples are not very convincing. All images binarize easily (threshold 25).

(Image: the sample images binarized at threshold 25.)
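
For reference, that binarization is a one-liner in OpenCV; the file paths are placeholders:

```python
import cv2

gray = cv2.imread("sample.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
_, binary = cv2.threshold(gray, 25, 255, cv2.THRESH_BINARY)  # fixed threshold 25
cv2.imwrite("binary.png", binary)
```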

  • Thank you for your comment. As I mentioned, applying grayscale, Gaussian blur, and thresholding can remove noise pretty well in some cases. However, I do not wish to apply the same techniques to all types of images. The samples are just general examples to explain the problem. As you can see, binarizing at a specific threshold value affects all the images. My point is that for category 1 images I do not need any preprocessing, and I want to try different processing methods for category 3 images. So how can I check the noise level of the background? – NguyenHai Jan 07 '22 at 01:23
  • @NguyenHai: please post real images, not fakes. – Yves Daoust Jan 07 '22 at 07:43
  • Hi Yves Daoust, I updated some real samples for your references. – NguyenHai Jan 07 '22 at 12:13
  • @NguyenHai: ahem, these images are even less convincing, sorry. They are not noisy, and they are very well contrasted. As regards the textured backgrounds, you might try to erase them with morphological operations (open/close). – Yves Daoust Jan 07 '22 at 13:05
  • Hi, Yves Daoust. Yes, you are right these images were already processed and contrasted. I can also process and erase certain types of background, but is there any ways to differentiate them ? By that, I can avoid doing unnecessary processing to certain types of images. Thank you for your comment. I really appreciate it. – NguyenHai Jan 07 '22 at 23:50
  • @NguyenHai: you may try the total (absolute) gradient of the image, possibly rescaled by the standard deviation for normalization (a sketch follows these comments). – Yves Daoust Jan 08 '22 at 14:15
  • Hi Yves Daoust, Thanks for the suggestion. Do you have any references or some examples? – NguyenHai Jan 09 '22 at 03:41
  • @NguyenHai: no, just thinking. – Yves Daoust Jan 09 '22 at 11:26
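
A minimal sketch of the measure suggested in the comments above (total absolute gradient rescaled by the standard deviation); the score is only a starting point, and the thresholds separating the three categories would have to be chosen empirically:

```python
import cv2
import numpy as np

def noise_score(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype(np.float64)
    # Total absolute gradient: mean of |dI/dx| + |dI/dy| over the image
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    total = np.abs(gx).mean() + np.abs(gy).mean()
    # Rescale by the standard deviation to normalize across contrast levels
    return total / (gray.std() + 1e-9)
```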