
I need to convert a multi-page PDF to text. To do that I use ImageMagick, which converts the PDF to an image, and then I extract the text from the image very easily. The problem is that if the PDF has more than 40 pages, only the last page is converted. What should I do: convert each page into a separate image, or make a single image of all the PDF pages? How should I do that?

Here is what I have done for a single-page PDF. Each PDF is converted into an image and stored in the uploads folder:

    $image = new Imagick(__DIR__.'/'.$target_file);
    $image->setImageFormat('png');
    $imageName = __DIR__.'/uploads/'.time().'.png';
    $image->writeImage($imageName);

Please help me. I am waiting for the response.

Rahul Sinha
  • Why is this question tagged `JSON`? – arkascha Feb 24 '17 at 10:32
  • The answer depends on the OCR solution you use, I'd say. More specifically, on which input format delivers the best recognition result. – arkascha Feb 24 '17 at 10:33
  • I just want to know how to convert all pages of a PDF into image format. That is, should I convert all pages of the PDF into separate image files, or should I make a single image of the whole PDF file? What should I do in this case? – Rahul Sinha Feb 24 '17 at 11:06
  • We understood that question very well, there is no sense in repeating it. But as said: we cannot give a recommendation, since the preferred approach obviously depends on what OCR solution you use and what input that requires. We do not know your setup, you did not answer my questions about that, so we still cannot help. Sorry. – arkascha Feb 24 '17 at 11:12
  • So far I have extracted data from images, and now I want to extract data from PDFs. For this I use the Google API, but I only get the content of the last page, because when I convert my PDF file to an image only the last page is converted. How do I convert all pages so that I can easily extract the content from the PDF? – Rahul Sinha Feb 24 '17 at 11:19
  • Ah, ok you do not use any local OCR solution at all but rely on the google service. So in that case my question morphs to: what is easier to feed into that API and what delivers better results? That certainly would be the preferred approach for you then... – arkascha Feb 24 '17 at 11:21
  • Another question: if you have PDF documents, then why don't you simply extract the text from them instead of taking the two additional steps to first convert it into an image only to then re-extract the text from that? Or are those "cheap PDFs" that don't contain text but only images of texts? – arkascha Feb 24 '17 at 11:22
  • Yes, there are some images in that PDF, so I want to convert the pages to image format. Can you suggest how to do that? That is, how do I convert all pages of a PDF to images with ImageMagick? – Rahul Sinha Feb 24 '17 at 11:25
  • Oh, so your real question is _not_ as you wrote above whether you should convert the document to one single or to multiple images, but how to convert to multiple images _at all_? Why don't you say so in your question? – arkascha Feb 24 '17 at 11:32
  • Actually my question is right, because I want to know how to implement this in my code. With the code above I can only convert the last page of my PDF file to an image, but I need to do it for all pages. – Rahul Sinha Feb 24 '17 at 11:35
  • Or do you know another way to extract text directly from a PDF file in PHP? If so, please let me know. – Rahul Sinha Feb 24 '17 at 11:36
  • Your question clearly asks "what to do" - single image for full document or separate image for pages. – arkascha Feb 24 '17 at 11:36
  • Certainly you can extract text embedded in a PDF. _Unless_ that is not real text but an image of a text. – arkascha Feb 24 '17 at 11:37
  • There are some parsers for PDF documents on the internet. A simple google search will find them. This might be a starting point: http://www.pdfparser.org/ – arkascha Feb 24 '17 at 11:38
  • This might give you an idea how to extract single pages of the PDF into separate images: http://stackoverflow.com/questions/20598936/saving-each-pdf-page-to-an-image-using-imagick – arkascha Feb 24 '17 at 11:39
  • Once more: the first question to answer is: is there text in the PDF or is that text on images in the PDF? If it is real text, then do not convert to images but extract the text using a PDF parser. If that PDF contains images containing text, then extract the images and pump them through the OCR. – arkascha Feb 24 '17 at 11:40
  • OK, thank you very much. I solved my own problem, but I will also use your idea of using a PDF parser. – Rahul Sinha Feb 24 '17 at 12:16
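
The advice in the comments (try a PDF parser first, fall back to images plus OCR only when the PDF holds no real text) could be sketched roughly as follows. This is only an illustration under assumptions: it assumes the `smalot/pdfparser` Composer package (which appears to be the library behind the linked pdfparser.org) is installed and that Composer's `vendor/autoload.php` has already been loaded by the caller. The function names `extractPdfText` and `hasExtractableText` and the 20-character threshold are hypothetical and not taken from the thread.

```php
<?php
use Smalot\PdfParser\Parser;

// Hypothetical helper: returns the text embedded in a PDF via smalot/pdfparser.
// Assumes the caller has already required Composer's vendor/autoload.php.
function extractPdfText(string $pdfPath): string
{
    $parser = new Parser();
    $pdf = $parser->parseFile($pdfPath);
    return $pdf->getText();
}

// Hypothetical heuristic: treat the PDF as containing "real text" when the
// parser returned at least $minChars non-whitespace characters.
function hasExtractableText(string $text, int $minChars = 20): bool
{
    return strlen(preg_replace('/\s+/', '', $text)) >= $minChars;
}
```

A caller might run `extractPdfText()` first and only convert the pages to images for OCR when `hasExtractableText()` returns false. The threshold is arbitrary and would need tuning for real documents.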

1 Answer


I solved my problem. If anyone faces a similar problem, here is what worked for me:

    // Load the PDF; Imagick holds each page as one image in a sequence.
    $image = new Imagick(__DIR__.'/'.$target_file);
    $num_pages = $image->getNumberImages();
    for ($i = 0; $i < $num_pages; $i++)
    {
        // Select page $i, then write only that page as its own PNG.
        $image->setIteratorIndex($i);
        $image->setImageFormat('png');
        $imageName = __DIR__.'/uploads/'.$i.time().'.png';
        $image->writeImage($imageName);
    }
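
For anyone who wants this in reusable form, a variant of the same loop might look like the sketch below. It is not a definitive implementation: the function names `pageImagePath` and `pdfToPngPages`, the `prefix-index.png` naming scheme, and the 150 DPI default are my own assumptions. It relies on the Imagick extension (with Ghostscript available for PDF input) and sets the resolution before reading, which may help OCR quality on small print.

```php
<?php
// Hypothetical helper: builds the output path for one page,
// e.g. pageImagePath('/uploads', 'scan', 3) gives "/uploads/scan-3.png".
function pageImagePath(string $dir, string $prefix, int $pageIndex): string
{
    return rtrim($dir, '/') . '/' . $prefix . '-' . $pageIndex . '.png';
}

// Hypothetical helper: writes each PDF page to its own PNG and returns the paths.
// Requires the Imagick extension; the 150 DPI default is an assumption.
function pdfToPngPages(string $pdfPath, string $outDir, string $prefix, int $dpi = 150): array
{
    $pdf = new Imagick();
    // The resolution has to be set before reading so the PDF is rasterized at this DPI.
    $pdf->setResolution($dpi, $dpi);
    $pdf->readImage($pdfPath);

    $written = [];
    // Iterating an Imagick object visits each image (page) in the sequence.
    foreach ($pdf as $index => $page) {
        $page->setImageFormat('png');
        $path = pageImagePath($outDir, $prefix, $index);
        $page->writeImage($path);
        $written[] = $path;
    }
    $pdf->clear();
    return $written;
}
```

With the variables from the question, a call could look like `$paths = pdfToPngPages(__DIR__.'/'.$target_file, __DIR__.'/uploads', (string) time());`, after which each entry in `$paths` can be sent to the OCR step one page at a time.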
Rahul Sinha