Ephesoft error with learning tiff documents that have been converted from PDF

Question

I am using the Ephesoft Community edition on a Windows Server 2003 AWS instance. I am having issues with Ephesoft reading certain TIFF documents. I have about 100 different TIFF documents, and about 70% of them work. These TIFF documents were originally PDFs that we converted using the latest version of Ghostscript and cleaned up using ImageMagick from Ephesoft. We are using the following options with Ghostscript:

-dNOPAUSE -r300 -sDEVICE=tiffg4 -dBATCH

With ImageMagick we are using the following option:

-compress group4

When learning one of the TIFF files that isn't working, we get the following error in the log files:

Drop Box Link to Stack Trace

And this is one of the TIFF documents we are trying to have Ephesoft learn:

Drop Box Link to Tiff Document

Is there something that I can do with Ghostscript, ImageMagick, or any other software to fix this, or do I need to modify Ephesoft in some way?

What do you mean by 'the latest version of Ghostscript'? The latest release, the HEAD of the development branch? Whatever your package manager has as the latest packaged version? It would be **much** better to state the actual version. I can't see anything immediately wrong with the TIFF file; your best bet is probably to have someone tell you what Ephesoft doesn't like about the image. – KenS, Jan 24 '15 at 13:16
I am using Ghostscript 9.15. Sorry about the confusion, but I meant the latest stable version from http://ghostscript.com/download/. I would be interested in finding out what Ephesoft doesn't like about the TIFF and how to solve this, so that I can either fix the TIFF document or fix Ephesoft. Any advice on how to figure this out? – craig_nelson, Jan 24 '15 at 22:53
After doing some additional research, it looks like Tesseract is putting a > character in for the word Texas. It appears that when Ephesoft tries to ingest the HTML file, it doesn't account for characters like > in its hOCR files. Do you know of a way to remove < > or any other XML special characters from Tesseract's output? – craig_nelson, Jan 25 '15 at 01:01

score 1 · Accepted Answer · answered Jan 25 '15 at 03:28


I found the solution by doing some more research.

The problem didn't involve Ghostscript or ImageMagick. It involved Tesseract and the hOCR file it creates. When Tesseract creates the hOCR file, it resolves the word Texas as Te>. The Community edition of Ephesoft cannot handle a special XML character like that and throws the error as a result.

The solution was to set a Tesseract property that blacklists the < and > symbols, so that Tesseract neither outputs them nor resolves other text to them. My PDFs seem to be working correctly now and I am able to process them.
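
For anyone looking for the concrete setting: the relevant Tesseract variable is `tessedit_char_blacklist`. Below is a minimal Python sketch of one way to apply it, assuming a Tesseract 3.x command line where config file names are listed after the output base. The file name `no_angle_brackets.cfg`, the image name `page.tif`, and both helper functions are hypothetical and only for illustration; how Ephesoft itself passes the property to Tesseract depends on your installation and is not shown here.

```python
from typing import List


def blacklist_config_text(chars: str = "<>") -> str:
    """Return the contents of a Tesseract config file that blacklists `chars`."""
    # Tesseract config files hold one "name value" pair per line.
    return f"tessedit_char_blacklist {chars}\n"


def build_tesseract_command(image: str, output_base: str, config_path: str) -> List[str]:
    """Build (but do not run) a Tesseract command that emits hOCR output."""
    # "hocr" is the stock config that switches on hOCR output;
    # config_path is assumed to point at a file holding blacklist_config_text().
    return ["tesseract", image, output_base, "-l", "eng", "hocr", config_path]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(blacklist_config_text(), end="")
    print(" ".join(build_tesseract_command("page.tif", "page", "no_angle_brackets.cfg")))
```

Save the config text to the file you pass as the last argument, then run the printed command against one of the failing TIFFs to check that the resulting hOCR file no longer contains stray < or > characters.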


craig_nelson


Hi, I am getting the following error [ERROR] [pool-2-thread-1] [com.ephesoft.dcma.util.ProcessUtils] - Error occured while executing the command: [tesseract, /opt/Ephesoft/SharedFolders/BC7/lucene-search-classification-sample/in/in_First_Page/1.tiff, /opt/Ephesoft/SharedFolders/BC7/lucene-search-classification-sample/in/in_First_Page/1, -l, eng, -psm, 4, +hocr.txt] in the working directory: /opt/Ephesoft/Dependencies/tesseract-ocr java.io.IOException: Cannot run program "tesseract" (in directory "/opt/Ephesoft/Dependencies/tesseract-ocr"): error=2, No such file or directory – Kintu Barot Apr 05 '17 at 10:13
