0

We have a requirement to extract text from TIFF and scanned PDF documents.

I have searched the internet and forums and learned that Tesseract offers the best approach, with the most accurate results.

The problem is that I have already developed some earlier extraction programs in C#. Is there any way to use Tesseract from C#?

halfer
S.P Singh

3 Answers

0

The best way is to use the latest version of Visual Studio (2022): search for the Tesseract NuGet package, version 4.1.1, and add it to your project directly from Visual Studio (Tools menu, NuGet Package Manager).

Add: using Tesseract; // at the top of your source file

See my article "Export Images to PDF" for a code sample.

Best regards, Francis

0

To extract images from a PDF, the best way is to execute a Python script from C#.

You can use the Python script below. Install Python, then install the required libraries with pip from the command prompt so that the imports are available:

pip install pillow    # for PIL
pip install pymupdf   # for fitz

To change the resolution, edit the matrix line. Here the output is 150 dpi; you can use 200 or more.

page.get_pixmap(matrix=fitz.Matrix(150/72,150/72))
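The 150/72 in that line is a scale factor: PDF page sizes are measured in points, 72 per inch, so dividing the target dpi by 72 gives the zoom to pass to fitz.Matrix. A small sketch of that arithmetic, without touching PyMuPDF itself; the helper names zoom_for_dpi and pixel_size are made up for illustration and the pixel sizes are approximate, since the renderer does its own rounding:

```python
PDF_POINTS_PER_INCH = 72  # PDF page sizes are expressed in points, 72 per inch


def zoom_for_dpi(dpi):
    """Return the scale factor to pass to fitz.Matrix for a target dpi."""
    return dpi / PDF_POINTS_PER_INCH


def pixel_size(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a page rendered at the given dpi."""
    return (round(width_pt * dpi / PDF_POINTS_PER_INCH),
            round(height_pt * dpi / PDF_POINTS_PER_INCH))


# A US Letter page is 612 x 792 points.
print(zoom_for_dpi(150))
print(pixel_size(612, 792, 150))
print(pixel_size(612, 792, 200))
```

For a US Letter page this gives roughly 1275 x 1650 pixels at 150 dpi and 1700 x 2200 at 200 dpi, which is worth keeping in mind because larger images usually help OCR but cost more time and disk space.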

Working Python script:

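from PIL import Image  # Pillow (installed above; not referenced directly below)
import fitz  # PyMuPDF

# Render every page of the PDF to a JPEG file at 150 dpi.
doc = fitz.open("c:/temp/pdfSample.pdf")
zoom = 150 / 72  # PDF pages are measured in points, 72 per inch
for p, page in enumerate(doc, start=1):
    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
    pix.save("c:/temp/out" + str(p) + ".jpg")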
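The script above hard-codes its paths, which is awkward when it is launched from a C# program. Here is a minimal sketch of a parameterised variant, assuming the caller passes the PDF path, an output directory and an optional dpi on the command line. The argument names and the helpers parse_args, page_path, render_pdf and main are illustrative choices, not part of the original answer:

```python
import argparse
import os


def parse_args(argv):
    """Parse the arguments a calling process (for example C#) would pass."""
    parser = argparse.ArgumentParser(description="Render PDF pages to JPEG files")
    parser.add_argument("pdf_path")
    parser.add_argument("out_dir")
    parser.add_argument("--dpi", type=int, default=150)
    return parser.parse_args(argv)


def page_path(out_dir, page_number):
    """Build the output file name for one page, e.g. <out_dir>/out1.jpg."""
    return os.path.join(out_dir, "out" + str(page_number) + ".jpg")


def render_pdf(pdf_path, out_dir, dpi):
    """Render each page at the requested dpi; return the number of pages written."""
    import fitz  # PyMuPDF, imported here so the helpers above work without it

    zoom = dpi / 72  # PDF pages are measured in points, 72 per inch
    count = 0
    for count, page in enumerate(fitz.open(pdf_path), start=1):
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        pix.save(page_path(out_dir, count))
    return count


def main(argv):
    """Entry point: parse the arguments, render, and report the page count."""
    args = parse_args(argv)
    pages = render_pdf(args.pdf_path, args.out_dir, args.dpi)
    print(pages)  # the caller can read the page count from standard output
    return pages
```

To use it as a script, call main(sys.argv[1:]) under an if __name__ == "__main__": guard. The C# side can then start the Python interpreter with the script path and these arguments, and read the printed page count to know which image files to pass on to Tesseract.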
-1

I looked into OCR options for C# and found the following link. Please check it; it may be useful.

https://code.google.com/p/tesseract-ocr/

S.P Singh