I'm using Ephesoft Community Edition 4.0.2.0 with TIF images (tested by Ephesoft). The problem is that Ephesoft can classify or extract data from certain images but not from others, and there is no error message in the log files. I don't know why.

When I click on Learn Files, the generated HOCR and HTML files are empty: they contain no data, only metadata, like this:

Application_Checklist_HOCR.xml :

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<HocrPages><HocrPage>
<Title></Title><Spans/>
<HocrContent></HocrContent>
</HocrPage></HocrPages>

For US-invoice_HOCR.xml, however, Ephesoft can learn the file, and the HOCR output looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><HocrPages><HocrPage>
<Title></Title><Spans><Span><Value>INVOICE</Value><Coordinates><x0>579</x0>
<y0>247</y0><x1>881</x1><y1>304</y1></Coordinates></Span><Span>
<Value>ACME</Value><Coordinates><x0>168</x0><y0>394</y0><x1>311</x1><y1>431</y1>
</Coordinates></Span><Span><Value>Company</Value><Coordinates><x0>329</x0>
<y0>395</y0><x1>541</x1><y1>442</y1></Coordinates></Span><Span>
<Value>lnvoice</Value><Coordinates>............
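
To tell quickly which learned files came back empty, a small check like the one below may help. It is only a sketch: it assumes the HOCR XML uses the element names visible in the two samples (`Span`, `Value`), and the function name `count_text_spans` is made up for illustration, not part of Ephesoft.

```python
import xml.etree.ElementTree as ET


def _local_name(tag: str) -> str:
    # Drop a "{namespace}" prefix if the real files happen to use one (assumption).
    return tag.rsplit("}", 1)[-1]


def count_text_spans(hocr_xml: bytes) -> int:
    """Count <Span> elements that contain a non-empty <Value> child."""
    root = ET.fromstring(hocr_xml)
    count = 0
    for elem in root.iter():
        if _local_name(elem.tag) != "Span":
            continue
        for child in elem:
            if _local_name(child.tag) == "Value" and (child.text or "").strip():
                count += 1
                break
    return count


if __name__ == "__main__":
    import sys

    # Usage: python check_hocr.py Application_Checklist_HOCR.xml US-invoice_HOCR.xml
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(path, count_text_spans(handle.read()))
```

A count of zero corresponds to the `<Spans/>` case, meaning the OCR step recognised no words on that page.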
KenS
salah eddine
  • I have exactly the same problem. If I run Tesseract directly from the command line, it works fine: tesseract myfile.tif myfile hocr. But when it is run from Ephesoft, it produces a useless HOCR HTML file. – ElArbi Jun 06 '16 at 16:21
  • You can edit the Tesseract config file at /Path-To-Ephesoft/Application/WEB-INF/classes/META-INF/dcma-tesseract/tesseract-reader.properties and comment out the line tesseract.command_parameters=-psm 4 (so that it reads #tesseract.command_parameters=-psm 4) to let Tesseract use its default segmentation. – salah eddine Jun 24 '16 at 11:05

1 Answer

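You can edit the Tesseract config file at /Path-To-Ephesoft/Application/WEB-INF/classes/META-INF/dcma-tesseract/tesseract-reader.properties and comment out the line tesseract.command_parameters=-psm 4, so that it reads #tesseract.command_parameters=-psm 4. The -psm 4 option tells Tesseract to assume a single column of text of variable sizes; with the line commented out, Tesseract uses its default segmentation.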
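
If you prefer to script the change instead of editing by hand, here is a minimal sketch. The helper name `comment_out_property` is hypothetical, and the path is the placeholder from above that you must adapt to your install. It only prefixes the active line with `#` and leaves everything else in the file as it is.

```python
def comment_out_property(properties_text: str, key: str) -> str:
    """Prefix every active `key=...` (or `key:...`) line with '#'."""
    result = []
    for line in properties_text.splitlines(keepends=True):
        stripped = line.lstrip()
        rest = stripped[len(key):].lstrip()
        # Match the exact key only; lines that are already commented start with '#'.
        if stripped.startswith(key) and rest.startswith(("=", ":")):
            result.append("#" + line)
        else:
            result.append(line)
    return "".join(result)


if __name__ == "__main__":
    # Placeholder path taken from the answer; replace /Path-To-Ephesoft yourself.
    path = ("/Path-To-Ephesoft/Application/WEB-INF/classes/META-INF/"
            "dcma-tesseract/tesseract-reader.properties")
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(comment_out_property(original, "tesseract.command_parameters"))
```

Keep a backup copy of the properties file before running anything like this, so the original setting can be restored if the default segmentation gives worse results on your other documents.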

salah eddine