I am using OCR to read text from images. The challenge is that the documents (for example, a passport) have a background image behind the text, and the image quality is also poor. Can you suggest any techniques that would help the OCR read all of the text clearly? Any suggestions are welcome, for example increasing the brightness. Please don't mark this as a duplicate: similar questions exist, but the challenge here is different. Below is the code, which I found on Stack Overflow itself.

protected void Button1_Click(object sender, EventArgs e)
{
    string filePath = Server.MapPath("~/Uploads/" 
                    + Path.GetFileName(FileUpload1.PostedFile.FileName));
    FileUpload1.SaveAs(filePath);
    string extractText = this.ExtractTextFromImage(filePath);
    lblText.Text = extractText.Replace(Environment.NewLine, "<br />");
}

private string ExtractTextFromImage(string filePath)
{
    // MODI (Microsoft Office Document Imaging) is accessed through COM interop.
    Document modiDocument = new Document();
    try
    {
        modiDocument.Create(filePath);
        modiDocument.OCR(MiLANGUAGES.miLANG_ENGLISH);
        // Only the first page of the document is read.
        MODI.Image modiImage = (MODI.Image)modiDocument.Images[0];
        return modiImage.Layout.Text;
    }
    finally
    {
        // Close the document even if OCR throws.
        modiDocument.Close();
    }
}
Mayur

1 Answer

You may refer to Tesseract OCR's suggested methods. Apologies for not showing any code for improving the image quality for scanning, but I think the idea is there in the article.

Also, given the code you have, it appears it's using MODI, which has not been supported since 2010. In my case we used the Tesseract wrapper for .NET, which is quite active (main branch) and supports a wide range of languages and dialects.
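For illustration, here is a minimal sketch of one common preprocessing step: converting the scan to grayscale and applying a fixed threshold so that a faint background pattern turns white while darker text turns black. The class name `OcrPreprocessing`, the method `ToBlackAndWhite`, and the default threshold of 128 are assumptions made for this example, not something taken from the linked article. The result is kept as a 24bpp bitmap rather than an indexed format; whether a given OCR engine copes better with that is something to verify on your own samples.

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;

public static class OcrPreprocessing
{
    // Hypothetical helper: returns a pure black/white copy of the source image
    // stored as a 24bpp bitmap. The threshold value is an assumption and will
    // usually need tuning per document type.
    public static Bitmap ToBlackAndWhite(Bitmap source, byte threshold = 128)
    {
        var result = new Bitmap(source.Width, source.Height, PixelFormat.Format24bppRgb);
        for (int y = 0; y < source.Height; y++)
        {
            for (int x = 0; x < source.Width; x++)
            {
                Color c = source.GetPixel(x, y);
                // Standard luminance weights for converting RGB to gray.
                int gray = (int)(0.299 * c.R + 0.587 * c.G + 0.114 * c.B);
                // Pixels darker than the threshold become black, the rest white.
                result.SetPixel(x, y, gray < threshold ? Color.Black : Color.White);
            }
        }
        return result;
    }
}
```

A call such as `OcrPreprocessing.ToBlackAndWhite(new Bitmap(filePath), 140)` followed by `Save(...)` would produce a file that can then be passed to the OCR step. `GetPixel`/`SetPixel` are slow on large scans, so this is meant to show the idea rather than to be production code.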

My 2 cents :)

ken lacoste
  • `Apologies for no[t] showing any code [..] but I think the idea is there in the article.` Answers should contain the answer itself, not just a link to the answer. Links can decay, thus rendering your answer potentially useless in the future. – Flater Apr 17 '18 at 09:02
  • @ken lacoste: I tried Tesseract OCR, but the output is worse than MODI. With MODI about 25-30% of the text is read, but with Tesseract it is only about 10%. Please suggest some other safer, better-quality method – Mayur Apr 25 '18 at 02:46
  • @Mayur then it may mean that the image is that bad. Have you tried the improvements in the suggested methods? It seems you need to make certain adjustments to the image before passing it through OCR. – ken lacoste Apr 25 '18 at 07:20
  • I tried grayscaling the image. The grayscale version picks up some values that are missed in the original image, but it also skips some values that were picked up in the original. I also tried sharpening the image, but when I pass the sharpened image to OCR (through the MODI DLL), it does not read the data at all and throws an error. I promised the team that I would resolve this issue, but I am completely stuck. Please guide me in resolving this ASAP. Thank you for the support – Mayur Apr 30 '18 at 03:49
  • The clients are providing scanned samples at different DPI settings to test this project. Of those samples, the 400 dpi scans (colour and black-and-white) are read somewhat OK, I would say around 50%; at other DPI settings the OCR reads far less. The images are mostly foreign passports, which have a background image that adds a lot of noise. Since these are passports, I cannot share a sample here – Mayur Apr 30 '18 at 03:55
  • @Mayur that's quite tough, since you need at least 60% to 80% of it readable. In my experience, only part of the image needs to be read, so as long as that part is clear, everything is good. Apologies man, I can't help you any further with that; it seems you need a very clean black-and-white image for the reading to reach at least 60%. – ken lacoste Apr 30 '18 at 06:24
  • I tried one thing: changing the scanned output to black and white using the following code: `var color = new Bitmap(img); var bw = color.Clone(new Rectangle(0, 0, color.Width, color.Height), PixelFormat.Format1bppIndexed);` – Mayur Apr 30 '18 at 10:38
  • After running this code the image is converted to black and white and the passport's background image is cleared, so I get a clean output with the text standing out. But when I pass this image to OCR, it doesn't read it, – Mayur Apr 30 '18 at 10:42
  • even though the image looks very clear to read. I don't know why the OCR fails on the same image after it is made black and white. The error occurs on this line: `modiDocument.OCR(MiLANGUAGES.miLANG_ENGLISH);` – Mayur Apr 30 '18 at 10:42
  • And the error is: "COM exception was unhandled by user code. An exception of type 'System.Runtime.InteropServices.COMException' occurred in OCRGuardian.dll but was not handled in user code. Additional information: OCR running error" – Mayur Apr 30 '18 at 10:42
  • Sorry for commenting in parts; the comment section has a length limit. Thanks for the support – Mayur Apr 30 '18 at 10:47
  • I've never been fond of MS OCR, but if I remember correctly you need to feed it .TIFF / .TIF files? – ken lacoste May 01 '18 at 17:08
  • No, not exactly. If I pass a .jpeg image, it reads it; the readability percentage is low only because the image is not good. Otherwise it reads .jpeg as well – Mayur May 02 '18 at 04:48
  • Hmm, if only you could share the problem image, I'd try to troubleshoot it for you. But as I read, it's a passport, so that's totally not possible. :( – ken lacoste May 02 '18 at 09:28
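Two observations from the comments above (400 dpi scans read best, and TIFF input may be preferred) suggest normalising every scan before OCR. The following is only a sketch under those assumptions: the class `ScanNormalizer`, the method `Resample`, the 400 dpi default, and the 96 dpi fallback are all hypothetical choices for illustration. Resampling cannot add detail that the scanner did not capture, and whether MODI then accepts the result without the COM exception has not been tested here.

```csharp
using System;
using System.Drawing;
using System.Drawing.Drawing2D;
using System.Drawing.Imaging;

public static class ScanNormalizer
{
    // Hypothetical helper: resamples a scan to targetDpi and returns a 24bpp bitmap.
    public static Bitmap Resample(Bitmap source, float targetDpi = 400f)
    {
        // Assumption: fall back to 96 dpi if the file reports no usable resolution.
        float sourceDpi = source.HorizontalResolution > 0 ? source.HorizontalResolution : 96f;
        float scale = targetDpi / sourceDpi;
        int width = Math.Max(1, (int)Math.Round(source.Width * scale));
        int height = Math.Max(1, (int)Math.Round(source.Height * scale));

        var result = new Bitmap(width, height, PixelFormat.Format24bppRgb);
        result.SetResolution(targetDpi, targetDpi);
        using (Graphics g = Graphics.FromImage(result))
        {
            // Bicubic interpolation keeps character edges smoother than the default.
            g.InterpolationMode = InterpolationMode.HighQualityBicubic;
            g.DrawImage(source, 0, 0, width, height);
        }
        return result;
    }
}
```

The resampled bitmap could then be written with `Save(path, ImageFormat.Tiff)` and that path handed to `ExtractTextFromImage`, which would also show whether the TIFF suggestion makes any difference.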