I use Ghostscript to strip the images out of PDF files as JPGs, then run Tesseract on them to save the text content. The setup:

  • Ghostscript located in c:\engine\gs\
  • Tesseract located in c:\engine\tesseract\
  • The web-side directory for the PDF/JPG/TXT files is file/tmp/

Code:

$pathgs = "c:\\engine\\gs\\";
$pathtess = "c:\\engine\\tesseract\\";
$pathfile = "file/tmp/";

// Strip images
putenv("PATH=".$pathgs);
$exec = "gs -dNOPAUSE -sDEVICE=jpeg -r300 -sOutputFile=".$pathfile."strip%d.jpg ".$pathfile."upload.pdf -q -c quit";
shell_exec($exec);

// OCR
putenv("PATH=".$pathtess);
$exec = "tesseract.exe '".$pathfile."strip1.jpg' '".$pathfile."ocr' -l eng";
exec($exec, $msg);
print_r($msg);
echo file_get_contents($pathfile."ocr.txt");

Stripping the image (it's just 1 page) works fine, but Tesseract echoes:

Array
  (
    [0] => Tesseract Open Source OCR Engine v3.01 with Leptonica
    [1] => Cannot open input file: 'file/tmp/strip1.jpg'
  )

and no ocr.txt file is generated, which leads to a 'failed to open stream' error in PHP.

  • Copying strip1.jpg into the c:/engine/tesseract/ folder and running Tesseract from the command line (tesseract strip1.jpg ocr.txt -l eng) works without any issue.
  • Replacing the putenv() call with exec(c:/engine/tesseract/tesseract ... ) returns the error shown above.
  • Keeping strip1.jpg in the Tesseract folder and running exec(tesseract 'c:/engine/tesseract/strip1.jpg' ... ) returns the same error.
  • Leaving out the apostrophes around path/strip1.jpg returns an empty array as the message and does not create the ocr.txt file.
  • Writing the command directly into the exec() call instead of using $exec makes no difference.

What am I doing wrong?

halfer
droehn
  • Rather than a relative path (file/tmp/strip1.jpg), try a fully-qualified path? – halfer Apr 17 '12 at 21:12
  • @halfer: I have tried many different paths, including the full path from c: to tmp, with and without apostrophes, but it made no difference at all. Having apostrophes around the path/file name was wrong, so I removed them all. exec(dir path) clearly lists the content of the /file/tmp folder, including strip1.jpg. It looks like Tesseract finds the file but crashes before it starts working, returning no $msg and no ocr.txt. But why does it work from the command line and not in PHP? Ghostscript has no trouble with this at all. – droehn Apr 19 '12 at 17:50

2 Answers

Halfer, you made my day:-)

Not exactly the way described in your post, but like this:

$path = str_replace("index.php", "../".$pathfile, $_SERVER['SCRIPT_FILENAME']);

$descriptors = array(
   0 => array("pipe", "r"),
   1 => array("pipe", "w"),
   2 => array("pipe", "w")
);
$cwd = $pathtess;
$command = "tesseract ".$path."strip1.jpg ".$path."ocr -l eng";

$process = proc_open($command, $descriptors, $pipes, $cwd);

if(is_resource($process)) {
    fclose($pipes[0]);
    fclose($pipes[1]);
    fclose($pipes[2]);
    proc_close($process);
}

echo file_get_contents($path."ocr.txt");
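
The snippet above closes all three pipes without reading them, so anything Tesseract prints is lost. A minimal sketch of a variant that keeps the output for debugging follows. run_command is a hypothetical helper name, not part of the original answer, and it assumes the command's output is small enough that reading stdout before stderr does not block.

```php
<?php
// Hypothetical helper (not from the original answer): run a command through
// proc_open() and return its exit code, stdout, and stderr.
function run_command($command, $cwd = null)
{
    $descriptors = array(
        0 => array("pipe", "r"), // child's stdin
        1 => array("pipe", "w"), // child's stdout
        2 => array("pipe", "w"), // child's stderr
    );

    $process = proc_open($command, $descriptors, $pipes, $cwd);
    if (!is_resource($process)) {
        return null; // the process could not be started
    }

    fclose($pipes[0]); // nothing to send to the child

    // Assumption: the output is small, so reading the pipes one after
    // the other is safe.
    $stdout = stream_get_contents($pipes[1]);
    $stderr = stream_get_contents($pipes[2]);
    fclose($pipes[1]);
    fclose($pipes[2]);

    $exitCode = proc_close($process);

    return array("exit" => $exitCode, "stdout" => $stdout, "stderr" => $stderr);
}
```

Calling run_command($command, $pathtess) and printing both the "stdout" and "stderr" entries should show whatever Tesseract reports, such as the 'Cannot open input file' message from the question.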
halfer
droehn
  • Out of interest, what was the problem? I can't see any environment stuff being set in there. – halfer Apr 19 '12 at 21:33
  • If only I knew; I formerly experimented with the full path to /file/tmp, but under exec() it did not work. With proc_open it works, and that's the main thing. Anyway, I will try running this path under exec() again to rule out mistakes made during my experiments. – droehn Apr 20 '12 at 07:02
  • Well... whatever I did during what felt like 2,000 attempts, I produced complete bullshit, wasting my own and other people's time :-s Running $command with exec() works absolutely fine, no complaints, perfect! I suppose I typed rubbish during my attempts with the full path, or kept apostrophes around it, or whatever. Well, at least I learned something about proc_open... Please accept my apologies! brgds David – droehn Apr 20 '12 at 17:32
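
The outcome of the comments above (a fully qualified path, no apostrophes) can be sketched as follows. On Windows, cmd.exe does not treat apostrophes as quoting characters, so they reach the program as part of the file name, which is likely why the quoted paths failed. PHP's escapeshellarg() chooses quoting for the platform it runs on. build_tesseract_command is a hypothetical helper name, not something from the thread.

```php
<?php
// Hypothetical helper: build the Tesseract command line with each argument
// quoted by escapeshellarg() instead of hand-written apostrophes.
function build_tesseract_command($imagePath, $outputBase, $lang = "eng")
{
    return "tesseract "
        . escapeshellarg($imagePath) . " "   // input image
        . escapeshellarg($outputBase)        // output base name (".txt" is appended by Tesseract)
        . " -l " . escapeshellarg($lang);    // language
}
```

With fully qualified paths for strip1.jpg and ocr (for example built from realpath($pathfile)), the result can be passed to exec($command, $msg) as in the question.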

Perhaps the missing environment variables in PHP are the problem here. Have a look at my question here to see if setting HOME or PATH sorts this out?
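
As a sketch of the PATH idea: putenv("PATH=".$pathgs) in the question replaces the whole search path, so one option is to prepend the tool directory and keep the existing entries. prepend_to_path is a hypothetical helper name, and the thread does not establish that missing variables were actually the cause here.

```php
<?php
// Hypothetical helper: put $dir at the front of PATH for this PHP process
// and the child processes it starts, keeping the existing entries.
function prepend_to_path($dir)
{
    $current = getenv("PATH");
    $new = ($current === false || $current === "")
        ? $dir
        : $dir . PATH_SEPARATOR . $current; // ";" on Windows, ":" elsewhere
    putenv("PATH=" . $new);
    return $new;
}
```

For example, prepend_to_path($pathgs) before the gs call and prepend_to_path($pathtess) before the tesseract call would make both tools findable without discarding the rest of PATH.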

Community
halfer