I'm using Tesseract on a project and want to know which image input type gives the best OCR output. Is a binary TIFF the best input, or is there something else?
3 Answers
I had excellent results using TIFF in the past for a similar task. At the time I did some pre-processing with OpenCV and exported the result to a TIFF file, which was then passed to Tesseract. The results were pretty good.
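A minimal sketch of that kind of workflow, assuming OpenCV's Python bindings (`cv2`) and the `tesseract` command-line tool are installed. The answer does not say which pre-processing steps were used, so the grayscale load and Otsu threshold here are assumptions, and the function and file names are hypothetical.

```python
import subprocess


def binarize_to_tiff(src_path, tiff_path):
    """Load an image as grayscale, apply an Otsu threshold, save as TIFF.

    The choice of Otsu thresholding is an assumption for illustration.
    """
    import cv2  # imported here so the module loads even without OpenCV

    gray = cv2.imread(src_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(src_path)
    # With THRESH_OTSU the threshold value 0 is ignored and computed automatically.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    if not cv2.imwrite(tiff_path, binary):
        raise OSError(f"could not write {tiff_path}")
    return tiff_path


def tesseract_command(image_path, output_base):
    """Build the argument list; tesseract writes <output_base>.txt by default."""
    return ["tesseract", image_path, output_base]


def ocr(src_path, tiff_path="page.tif", output_base="page"):
    """Pre-process, run tesseract, and return the path of the text output."""
    binarize_to_tiff(src_path, tiff_path)
    subprocess.run(tesseract_command(tiff_path, output_base), check=True)
    return output_base + ".txt"
```

Calling `ocr("scan.jpg")` would write `page.tif` and then `page.txt`; whether thresholding helps depends on the scans, so it is worth comparing against the unprocessed image.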

I've found that TIFF gives far better results than JPEG, and better results than every other format I have tried.
The original Tesseract program only worked with TIFF files, which leads me to believe it is the most appropriate input format.

The advantages of using .tif are that (1) ScanTailor outputs .tif files and (2) it is possible to use tiffcp to merge individual .tif files into a single multi-page file that can be fed to Tesseract. The difficulty is that if you have Tesseract output a .pdf, you have no control over the type of .pdf created. Using pdfimages -list, I find that it outputs a combination of CCITT and JPEG images at the same dpi as the input. Attempting to use ImageMagick to convert the result to a lower dpi or another compression then gives poor results.
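The multi-page workflow described above can be sketched as follows, assuming `tiffcp` (libtiff), `tesseract`, and `pdfimages` (Poppler) are on the PATH. The helper names and the default file names are hypothetical.

```python
import subprocess


def tiffcp_command(page_tiffs, merged_tiff):
    """tiffcp takes the input files first and the output file last."""
    return ["tiffcp", *page_tiffs, merged_tiff]


def tesseract_pdf_command(image_path, output_base):
    """The trailing 'pdf' config makes tesseract write <output_base>.pdf."""
    return ["tesseract", image_path, output_base, "pdf"]


def pdfimages_list_command(pdf_path):
    """List the images embedded in a PDF, including encoding and resolution."""
    return ["pdfimages", "-list", pdf_path]


def ocr_multipage(page_tiffs, merged_tiff="book.tif", output_base="book"):
    """Merge the pages, OCR them into one PDF, and return the image listing."""
    subprocess.run(tiffcp_command(page_tiffs, merged_tiff), check=True)
    subprocess.run(tesseract_pdf_command(merged_tiff, output_base), check=True)
    listing = subprocess.run(
        pdfimages_list_command(output_base + ".pdf"),
        check=True,
        capture_output=True,
        text=True,
    )
    return listing.stdout
```

The returned listing is where the embedded image encodings and resolutions can be checked.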
The alternative I found is to first use ImageMagick to convert all the .tif files to .png, then feed the .png files to Tesseract one by one, producing a .pdf for each .png. In that case, the .pdf files contain raster images, which can then be combined and re-encoded with ImageMagick.
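A rough sketch of the per-page alternative, assuming ImageMagick's `convert` command (named `magick` in ImageMagick 7) and `tesseract` are installed. The helper names are hypothetical, and the final combine and re-encode step is left out because the answer does not give the options used.

```python
import subprocess
from pathlib import Path


def png_path_for(tiff_path):
    """Return the .png path that sits next to the given .tif file."""
    return str(Path(tiff_path).with_suffix(".png"))


def convert_command(tiff_path, png_path):
    """ImageMagick picks the output format from the file extension."""
    return ["convert", tiff_path, png_path]


def tesseract_pdf_command(png_path, output_base):
    """The trailing 'pdf' config makes tesseract write <output_base>.pdf."""
    return ["tesseract", png_path, output_base, "pdf"]


def ocr_pages(tiff_paths):
    """Convert each page to PNG, OCR it, and return the per-page PDF paths."""
    pdf_paths = []
    for tiff_path in tiff_paths:
        png_path = png_path_for(tiff_path)
        output_base = str(Path(png_path).with_suffix(""))
        subprocess.run(convert_command(tiff_path, png_path), check=True)
        subprocess.run(tesseract_pdf_command(png_path, output_base), check=True)
        pdf_paths.append(output_base + ".pdf")
    return pdf_paths
```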
The only downside I can see is that if Tesseract learns as it OCRs the document (I don't know that it does, but it may), then we would want to give it the whole document at once rather than one page at a time.
