-1

I need to extract the same rectangular area (in the same position) on different pages in a PDF file with several hundred pages.

I am running Linux and have found a way to do this manually using Tesseract and the front-end gImageReader. I am now looking for a way to automate the process.

The information I need to extract is Hindi text (written in Devanagari), so extracting the data as text (without Hindi OCR) would probably yield bad results. Extracting it as an image would also be fine; I could then OCR the collected images with Tesseract in a separate step.

So what I am looking for is a way to copy the same area from different pages of a PDF and output the results to another file (another PDF or an image file, for example).

I have seen other similar questions posted, but they are asking specifically to extract text, which is not necessarily needed in this case.

If there is a way to do this by converting the PDF to image files, that would also be interesting.

PS: I am now looking at doing this in the terminal (using Gimp), along the lines of what Dmitrii Z. is proposing.

For those interested in a GUI, I have found Phatch for Linux, which is great for batch processing images, as well as for (batch) cropping PDF files directly.

If someone knows of a way to extract 2 different rectangular areas from 1 image, that would be helpful.

badaboum
  • Are you looking only for existing tools, or would you also do some programming? – mkl Dec 30 '17 at 23:28
  • I am open to anything (as long as it is not too complex). I am currently looking at using Gimp in the terminal (similar to what Dmitrii Z. mentions below, I guess). – badaboum Dec 30 '17 at 23:46

2 Answers

1

The solution consists of 2 steps:

1) Convert the PDF to an image

The most common tool for that is ImageMagick. You can use it as a command-line tool:

$ convert foo.pdf foo.png

or through one of its APIs, for example from Python. There is also a C++ API (Magick++), but unfortunately I don't have much experience with it.

You might need to install Ghostscript so that ImageMagick can read PDF files.

2) Extract the region of interest (ROI) from the image

You can use ImageMagick here as well. The option

-extract widthxheight{{+-}offset}

is one way to do it, for example:

convert -extract 640x480+1280+960 bigImage.rgb extractedImage.rgb
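For a PDF with several hundred pages, the two steps can probably be combined into one ImageMagick call that crops the same rectangle from every page. The following is only a minimal Python sketch under stated assumptions: it assumes the `convert` binary (ImageMagick 6 naming) and Ghostscript are installed, and the function name `build_crop_command`, the file names, and the 300 dpi density are placeholders chosen for illustration. It uses `-crop` with `+repage` rather than `-extract`, and it only builds the argument list; actually running it is left to `subprocess.run`.

```python
import subprocess


def build_crop_command(pdf_path, geometry, out_pattern, density=300):
    """Build an ImageMagick argv that crops the same region from every page.

    geometry is a "WxH+X+Y" string in pixels at the given density.
    out_pattern should contain a printf-style counter such as %03d so that
    each page is written to its own file.
    """
    return [
        "convert",
        "-density", str(density),  # rasterization resolution, given before the input
        pdf_path,
        "-crop", geometry,         # same rectangle applied to every page
        "+repage",                 # drop the virtual canvas offset left by -crop
        out_pattern,
    ]


if __name__ == "__main__":
    cmd = build_crop_command("foo.pdf", "640x480+1280+960", "crop-%03d.png")
    print(" ".join(cmd))
    # Uncomment to actually run it (requires ImageMagick and Ghostscript):
    # subprocess.run(cmd, check=True)
```

The resulting PNG files can then be passed to Tesseract in a separate step.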

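Another option is to use OpenCV. In C++ it is pretty easy:

#include <opencv2/opencv.hpp>
int main() {
    cv::Mat image = cv::imread("yourimage.png");
    int x = 10, y = 20, w = 100, h = 100;  // top-left corner and size of the ROI, in pixels
    cv::imwrite("roiImage.png", image(cv::Rect(x, y, w, h)));  // imwrite picks the format from the file extension
}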
Dmitrii Z.
  • Is there a way to extract 2 different rectangular areas from 1 image? – badaboum Dec 30 '17 at 23:50
  • `@badaboum` You can always do the same procedure for one region a second time for the second region. If you could extract two different regions simultaneously, what would you need to do with them? That is, do they need to be combined into one image? If so, how? What spacing? Please be more specific! – fmw42 Dec 31 '17 at 00:36
  • `@badaboum` If you specify a -density in the ImageMagick command before the input, then with a little computation and a trial run, you can figure out the crop coordinates needed to extract the region. In the same command you can then output back to PDF. Note, however, that ImageMagick is not a vector-to-vector processor: it rasterizes your PDF when reading it and then writes a raster image embedded in a PDF vector shell. The output size will therefore be dramatically larger, unless the input was also a raster image in a PDF vector shell. `convert -density X image.pdf -crop WxH+X+Y +repage output.pdf` – fmw42 Dec 31 '17 at 00:40
  • @fmw42 I am now batch cropping the 2 areas in two steps, was just wondering if there was a way to do it in 1 step. I have got pages with one word on the upper left corner, and one on the upper right corner, and want to extract these words from the PDF/images, and get them into a list. – badaboum Dec 31 '17 at 00:51
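The "little computation" mentioned in the comments can be sketched as follows. PDF page dimensions are expressed in points (1/72 inch), so a length in points scales by density/72 when the page is rasterized. This Python sketch assumes the region is measured in points from the top-left corner of the page; the helper names `points_to_pixels` and `crop_geometry` are made up for illustration, and the result should still be checked with a trial crop.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are expressed in points (1/72 inch)


def points_to_pixels(value_pt, density):
    """Convert a length in PDF points to pixels at the given density (dpi)."""
    return round(value_pt * density / POINTS_PER_INCH)


def crop_geometry(x_pt, y_pt, width_pt, height_pt, density):
    """Return a "WxH+X+Y" geometry string for ImageMagick's -crop option.

    Assumes x_pt and y_pt are measured from the top-left corner of the page.
    """
    w = points_to_pixels(width_pt, density)
    h = points_to_pixels(height_pt, density)
    x = points_to_pixels(x_pt, density)
    y = points_to_pixels(y_pt, density)
    return f"{w}x{h}+{x}+{y}"


if __name__ == "__main__":
    # A 2 in x 1 in region located 1 in from the left and 0.5 in from the top,
    # rasterized at 300 dpi.
    print(crop_geometry(72, 36, 144, 72, 300))  # prints 600x300+300+150
```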
1

You can crop two (or more) regions in the same ImageMagick command as follows:

convert image +write mpr:img +delete \
\( mpr:img -crop W1xH1+X1+Y1 +repage +write out1 \) \
\( mpr:img -crop W2xH2+X2+Y2 +repage +write out2 \) \
null:

or

convert image \
\( -clone 0 -crop W1xH1+X1+Y1 +repage +write out1 \) \
\( -clone 0 -crop W2xH2+X2+Y2 +repage +write out2 \) \
null:
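When the list of regions varies, the second form can be generated programmatically. The following is a hedged Python sketch, not part of ImageMagick itself: the function name `build_multi_crop_command` and the file names are hypothetical, and it assumes one single-page image per call. Because the arguments are passed as a list, the parentheses need no shell escaping.

```python
def build_multi_crop_command(image_path, regions):
    """Build a `convert` argv that writes one cropped file per region.

    regions is a sequence of (geometry, output_path) pairs, where geometry
    is a "WxH+X+Y" string in pixels.
    """
    argv = ["convert", image_path]
    for geometry, out_path in regions:
        # Each parenthesized group clones the input, crops it, and writes it out.
        argv += ["(", "-clone", "0", "-crop", geometry,
                 "+repage", "+write", out_path, ")"]
    argv.append("null:")  # discard the remaining image list instead of writing it
    return argv


if __name__ == "__main__":
    cmd = build_multi_crop_command(
        "page.png",
        [("200x80+0+0", "left.png"), ("200x80+1000+0", "right.png")],
    )
    print(" ".join(cmd))
    # To run it (requires ImageMagick): subprocess.run(cmd, check=True)
```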
fmw42