How to convert a multi-page PDF to multiple images with ImageMagick in PHP

Question

I need to convert a multi-page PDF to text. To do that I use ImageMagick to convert the PDF to an image, and then I extract the text from the image, which works well. The problem is that when the PDF has more than 40 pages, only the last page is converted. Should I convert each page into a separate image, or make a single image of all the PDF pages? How do I do that?

Here is what I have for a single-page PDF. The PDF is converted to an image and stored in the uploads folder:

$image = new Imagick(__DIR__.'/'.$target_file);
$image->setImageFormat('png');   
$imageName = __DIR__.'/uploads/'.time().'.png';
$image->writeImage($imageName);

Please help me. I am waiting for the response.

The answer depends on the OCR solution you use, I'd say. More specifically, on which input format delivers the best recognition result. — arkascha, Feb 24 '17 at 10:33
I just want to know how to convert all pages of a PDF into image format. That is, should I convert every page into a separate image file, or make a single image of the whole PDF? What should I do in this case? — Rahul Sinha, Feb 24 '17 at 11:06
We understood the question very well; there is no sense in repeating it. But as I said, we cannot give a recommendation, since the preferred approach obviously depends on which OCR solution you use and what input it requires. We do not know your setup and you did not answer my questions about it, so we still cannot help. Sorry. — arkascha, Feb 24 '17 at 11:12
I already extract data from images, and now I want to extract data from PDFs. For this I use a Google API, but I only get the content of the last page, because when I convert the PDF to an image only the last page is converted. How do I convert all pages so that I can extract the content of the whole PDF? — Rahul Sinha, Feb 24 '17 at 11:19
Ah, ok you do not use any local OCR solution at all but rely on the google service. So in that case my question morphs to: what is easier to feed into that API and what delivers better results? That certainly would be the preferred approach for you then... — arkascha, Feb 24 '17 at 11:21
Another question: if you have PDF documents, then why don't you simply extract the text from them instead of taking the two additional steps to first convert it into an image only to then re-extract the text from that? Or are those "cheap PDFs" that don't contain text but only images of texts? — arkascha, Feb 24 '17 at 11:22
Yes, there are some images in that PDF, so I want to convert the pages to image format. Can you suggest how to do that? That is, how do I convert all pages of a PDF to images with ImageMagick? — Rahul Sinha, Feb 24 '17 at 11:25
Oh, so your real question is _not_ as you wrote above whether you should convert the document to one single or to multiple images, but how to convert to multiple images _at all_? Why don't you say so in your question? — arkascha, Feb 24 '17 at 11:32
Actually my question is right, because I want to know how to do that with this code. With the code above I can only convert the last page of my PDF to an image, but I need to do it for all pages. — Rahul Sinha, Feb 24 '17 at 11:35
Or do you know another way to extract text directly from a PDF file in PHP? If so, let me know. — Rahul Sinha, Feb 24 '17 at 11:36
Your question clearly asks "what to do" - single image for full document or separate image for pages. — arkascha, Feb 24 '17 at 11:36
Certainly you can extract text embedded in a PDF. _Unless_ that is not real text but an image of a text. — arkascha, Feb 24 '17 at 11:37
There are some parsers for PDF documents on the internet. A simple google search will find them. This might be a starting point: http://www.pdfparser.org/ — arkascha, Feb 24 '17 at 11:38
This might give you an idea how to extract single pages of the PDF into separate images: http://stackoverflow.com/questions/20598936/saving-each-pdf-page-to-an-image-using-imagick — arkascha, Feb 24 '17 at 11:39
Once more: the first question to answer is: is there text in the PDF or is that text on images in the PDF? If it is real text, then do not convert to images but extract the text using a PDF parser. If that PDF contains images containing text, then extract the images and pump them through the OCR. — arkascha, Feb 24 '17 at 11:40
OK, thank you very much. I solved my own problem, but I also used your idea of using a PDF parser. — Rahul Sinha, Feb 24 '17 at 12:16

Rahul Sinha · Accepted Answer · 2017-03-10T14:51:58.553

2

I solved my problem. If someone faces a similar problem, here is the solution. The loop selects each page of the PDF in turn and writes it to its own PNG file:

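    // Load the PDF; Imagick holds each page as one image in a sequence.
    $image = new Imagick(__DIR__.'/'.$target_file);
    $num_pages = $image->getNumberImages();
    $stamp = time(); // one timestamp shared by all pages of this run
    for ($i = 0; $i < $num_pages; $i++)
    {
        $image->setIteratorIndex($i);   // select page $i (zero-based)
        $image->setImageFormat('png');
        // Keep timestamp and page index apart so the file names stay readable.
        $imageName = __DIR__.'/uploads/'.$stamp.'-'.$i.'.png';
        $image->writeImage($imageName); // writes only the currently selected page
    }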
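For readers who want something reusable, here is a minimal sketch of the same page-by-page idea wrapped in functions. The single-page code in the question most likely wrote only the last page because `writeImage` saves just the currently selected image of the sequence, and no page was ever selected explicitly. The function names `pageImagePath` and `pdfToPngPages`, the 150 DPI default, and the `name-NN.png` file pattern are assumptions made for illustration and are not part of the original answer. It also assumes the Imagick extension can read PDFs on your system, which usually requires Ghostscript.

```php
<?php
// Sketch only: function names, the 150 DPI default and the file-name
// pattern are illustrative assumptions, not part of the original answer.

/**
 * Build a zero-padded output path such as "/out/scan-03.png" so that
 * the files sort in page order. $page is zero-based.
 */
function pageImagePath(string $dir, string $base, int $page, int $total): string
{
    $width  = strlen((string) $total);   // digits needed for the highest page number
    $number = str_pad((string) ($page + 1), $width, '0', STR_PAD_LEFT);
    return rtrim($dir, '/') . '/' . $base . '-' . $number . '.png';
}

/**
 * Write every page of $pdfPath as a PNG into $outDir and return the paths.
 */
function pdfToPngPages(string $pdfPath, string $outDir, int $dpi = 150): array
{
    $pdf = new Imagick();
    $pdf->setResolution($dpi, $dpi);     // has to be set before the PDF is read
    $pdf->readImage($pdfPath);           // loads all pages as an image sequence

    $total   = $pdf->getNumberImages();
    $base    = pathinfo($pdfPath, PATHINFO_FILENAME);
    $written = [];

    foreach ($pdf as $index => $page) {  // Imagick is iterable, one step per page
        $page->setImageFormat('png');
        $path = pageImagePath($outDir, $base, $index, $total);
        $page->writeImage($path);        // saves the current page only
        $written[] = $path;
    }

    $pdf->clear();                       // release ImageMagick resources
    return $written;
}
```

A call such as `pdfToPngPages(__DIR__.'/'.$target_file, __DIR__.'/uploads')` should then return one path per page, which can be passed to the OCR step one file at a time. Setting the resolution before reading is a design choice aimed at OCR quality: the default rendering density is often too coarse for small print, so a higher value may improve recognition at the cost of larger files.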

edited Mar 10 '17 at 14:51

answered Feb 24 '17 at 12:47

Rahul Sinha

