Convert scanned pdf to .txt files using tesseract

Question

I have to convert a .pdf file containing scanned images into .txt files. Tesseract OCR only converts images to .txt, so I first need to extract the pages as .tif images and then convert them. Can anyone help me with this?

score 22 · Accepted Answer · answered Jan 31 '14 at 11:11

Use ImageMagick:

convert -density 600 input.pdf output.tif

Density is in DPI; in my experience, 600 DPI works best.
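To go all the way from the PDF to .txt files, the two steps can be chained in a small script. The following is only a sketch, assuming both `convert` and `tesseract` are installed and on the PATH; the function names, the `pages` working directory and the `page-%04d.tif` naming are made up for illustration. It relies on the `tesseract <image> <output base>` form of the Tesseract command line, which writes `<output base>.txt`.

```python
import subprocess
from pathlib import Path


def convert_cmd(pdf, pattern, density=600):
    """Build the ImageMagick command that renders the PDF pages to TIFF."""
    # A %04d in the output name makes ImageMagick write one numbered file per page.
    return ["convert", "-density", str(density), str(pdf), str(pattern)]


def tesseract_cmd(image):
    """Build the Tesseract command; the second argument is the output base name."""
    image = Path(image)
    # Tesseract appends ".txt" to the output base by itself.
    return ["tesseract", str(image), str(image.with_suffix(""))]


def pdf_to_text(pdf, workdir="pages", density=600):
    """Render the PDF to TIFFs, OCR each page, and return the .txt paths."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    # Step 1: one TIFF per page (hypothetical naming: page-0000.tif, ...).
    subprocess.run(convert_cmd(pdf, workdir / "page-%04d.tif", density), check=True)
    texts = []
    # Step 2: OCR every page image in page order.
    for tif in sorted(workdir.glob("page-*.tif")):
        subprocess.run(tesseract_cmd(tif), check=True)
        texts.append(tif.with_suffix(".txt"))
    return texts
```

Calling `pdf_to_text("input.pdf")` should leave one .txt file per page in `pages/`. On ImageMagick 7 the command is named `magick` rather than `convert`, so the first list element may need to be changed.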

Karol S


Can the convert command be used to produce multiple output files? Please help me with its usage. – Ganesh Nannaware Apr 12 '14 at 07:28

@GaneshNannaware Yes, it can. Put `%04d` in the name of the output file and see how it works. – Karol S Apr 12 '14 at 07:46
