Command line software to batch convert TIFF to indexable PDF

Question

I need a utility to batch convert TIFF files to indexable PDFs. The software needs to run on Linux and must work from the command line. It does not need to be open source. I've tried the conversion using tesseract and hocr2pdf; however, they produce PDFs with garbled text (note: the text is only garbled if you "select all" text in the PDF). I've found other utilities, but they only run under Windows or don't work from the command line. Thanks in advance.

Perhaps [this](http://www.moreno.marzolla.name/software/scan_to_pdf/) can help. – Fredrik Pihl, May 29 '12 at 15:03
As I noted in my question, I already wrote a program to do the conversion using tesseract and hocr2pdf. To my knowledge, hocr2pdf is the only open source tool capable of making an indexable PDF document. Your link doesn't outline anything I didn't already know, sorry. – William Seemann, May 29 '12 at 15:12
There are two problems here: getting the OCR done, then converting to PDF. I wonder if the problem would be easier to search for if you look to OCR your TIFF into plain text first, and then use something like `wkhtmltopdf` to convert it to a PDF afterwards? – halfer, May 29 '12 at 16:01
Also a good suggestion; however, wkhtmltopdf doesn't maintain the integrity of the original document. It creates a new PDF using only the text from the original TIFF file. – William Seemann, May 29 '12 at 17:24

score 1 · Answer 1 · answered May 29 '12 at 15:09

Mogrify should be able to help you:

http://linux.die.net/man/1/mogrify

Herr von Wurst

I don't see an option to make the converted image indexable. Can you provide a sample usage? – William Seemann May 29 '12 at 15:15

score 1 · Answer 2 · answered May 30 '12 at 12:05

This is exactly what you are looking for:

http://ocr4linux.com/en:start

It is a command-line OCR tool for Linux, based on ABBYY's best-on-the-market OCR engine. (Disclaimer: I work for ABBYY.)

Tomato

Thanks, but I tried purchasing this software and my experience was awful. It took several days for a salesperson to even respond, and I was quoted twice what was listed on the website. Apparently they have different pricing for people in Europe and the United States. – William Seemann Jul 03 '12 at 04:56
Why didn't you just purchase online? Sales does not deal with this product very often, so sometimes there can be confusion. – Tomato Jul 04 '12 at 12:18

score 0 · Answer 3 · answered May 29 '12 at 15:14

This answer is oblique and only partial. Disregard if it does not apply to you.

Such software may exist, but I am not familiar with it. If your need is strong enough that you are willing to write 2000 lines of code or so to meet it, then there is the Linux-oriented Libpoppler, which gives you an interface for writing a program that produces its own custom PDF, exactly the way you want it. Unfortunately, Libpoppler, though valuable, is not particularly pleasant to code to, and if you do code to it, you will probably find yourself reading long tracts of the PDF standard.

If you do write such software, you might consider publishing it as open source.

Good luck.

score 0 · Accepted Answer · answered Jul 03 '12 at 05:00

After trying several tools (including ABBYY), I decided on Vividata. They have decent pricing, run under Linux, and don't have a page-per-year limit.

William Seemann

score 0 · Answer 5 · answered Sep 11 '16 at 15:51

I wrote a bash script that uses Tesseract 3 or Abbyy OCR 11. It can batch convert or run in directory monitor mode.

In your case:

pmocr.sh --batch --target=PDF /path/to/tiff/files

See the script here: https://github.com/deajan/pmOCR
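For readers who only need the Tesseract path, the batch idea can be sketched in a few lines of shell. This is not taken from pmOCR; it is a minimal illustration that assumes a Tesseract build shipping the `pdf` output config (3.03 or later, as far as I know). The function name `tiff_to_pdf_batch` and the `DRY_RUN` switch are hypothetical names made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only: batch-convert every TIFF in a directory to a searchable PDF.
# Assumption: the installed tesseract supports the "pdf" output config.
# tiff_to_pdf_batch and DRY_RUN are hypothetical names for illustration.

tiff_to_pdf_batch() {
    local src_dir="$1"
    local out_dir="$2"
    local lang="${3:-eng}"      # OCR language, defaults to English
    local tiff base

    mkdir -p "$out_dir" || return 1

    for tiff in "$src_dir"/*.tif "$src_dir"/*.tiff; do
        [ -e "$tiff" ] || continue          # skip glob patterns that matched nothing
        base="$(basename "${tiff%.*}")"     # file name without directory or extension
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Print the command instead of running it
            echo "tesseract $tiff $out_dir/$base -l $lang pdf"
        else
            # Tesseract appends ".pdf" to the output base itself
            tesseract "$tiff" "$out_dir/$base" -l "$lang" pdf
        fi
    done
}
```

Calling `DRY_RUN=1 tiff_to_pdf_batch /path/to/tiff/files /path/to/pdfs` prints the commands without running them, which is handy for checking paths before a long OCR run. Whether the resulting text layer avoids the "select all" garbling described in the question depends on the Tesseract version and the scans, so test on a few pages first.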

Orsiris de Jong
