I found some libraries for extracting images from PDF or Word files, such as docx2txt and pdfimages. But how can I get the content around each image (for example, there may be a title below the image)? Or get the page number of each image?

Some other tools, such as PyPDF2 and minecart, can extract images page by page. However, I could not get that code to run successfully.

Is there a good way to get some information about the images, either from the images extracted by docx2txt or pdfimages, or by some other way of extracting images together with their info?

Cindy Meister

3 Answers

0

I looked at the source code of docx2txt, and it simply parses the XML inside the docx file. So this is actually a fairly easy task.

Ref: docx2txt
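
A minimal sketch of that idea using only the Python standard library. It is not the docx2txt code; the function name `images_with_context` and the shape of the returned records are my own choices for illustration. It assumes the usual docx layout (`word/document.xml` plus `word/_rels/document.xml.rels`) and treats the paragraphs directly before and after an image as its "surrounding content". As far as I know a docx file does not store page numbers for its content, since pagination is computed when the document is rendered, so this only reports neighbouring text.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by WordprocessingML, DrawingML and package relationships.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"
REL = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def images_with_context(docx_path):
    """Return one record per embedded image with the neighbouring paragraph text."""
    with zipfile.ZipFile(docx_path) as z:
        body = ET.fromstring(z.read("word/document.xml"))
        rels = ET.fromstring(z.read("word/_rels/document.xml.rels"))

    # Map relationship ids (e.g. "rId5") to targets such as "media/image1.png".
    targets = {rel.get("Id"): rel.get("Target")
               for rel in rels.iter(REL + "Relationship")}

    paragraphs = list(body.iter(W + "p"))
    texts = ["".join(t.text or "" for t in p.iter(W + "t")) for p in paragraphs]

    records = []
    for i, p in enumerate(paragraphs):
        # Each embedded picture shows up as an a:blip element with an r:embed id.
        for blip in p.iter(A + "blip"):
            records.append({
                "image": targets.get(blip.get(R + "embed")),
                "paragraph": texts[i],
                "before": texts[i - 1] if i > 0 else "",
                "after": texts[i + 1] if i + 1 < len(texts) else "",
            })
    return records
```

Calling `images_with_context("report.docx")` (a hypothetical file name) gives a list of dicts, so a caption below a picture would normally appear under `"after"`. Paragraphs nested inside text boxes are not handled specially here, so images in such documents may be reported more than once.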

0

docx2python pulls the images into a folder and leaves -----image1.png---- markers in the extracted text. This might get you close to where you'd like to go.

Shay
0

A few months ago, I reworked docx2python to produce a structured (multi-level) XML file from a docx file, and it works quite well on many files.

As far as I know, a paragraph contains several runs, and each run contains a single piece of text and sometimes images. See this document for details: https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.paragraph?view=openxml-2.8.1
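
To illustrate the paragraph/run structure, here is a rough sketch with the standard library only. The helper name `run_items` is hypothetical, and it assumes `paragraph` is an already parsed `w:p` element from `word/document.xml`. It walks the runs in order and reports each piece of text or image reference, which shows where an image sits relative to the text of the same paragraph.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"


def run_items(paragraph):
    """Yield ("text", str) or ("image", relationship_id) in document order."""
    # Only runs that are direct children of the paragraph are visited;
    # runs wrapped in other elements (e.g. hyperlinks) are skipped in this sketch.
    for run in paragraph.findall(W + "r"):
        for node in run.iter():
            if node.tag == W + "t" and node.text:
                yield ("text", node.text)
            elif node.tag == A + "blip":
                # r:embed is a relationship id that points at the image file.
                yield ("image", node.get(R + "embed"))
```

The relationship id returned for an image still has to be looked up in `word/_rels/document.xml.rels` to get the actual file name.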

docx2python supports extracting images together with the text around them. When you read the paragraphs with docx2python, an image placeholder such as ----media/imagen---- (with n being the image number) appears in the text at the position of the image. The image files themselves are available if you set extract_image=True. So you get the name of each image inside the paragraph text, plus a list of image files. Match them as you like.
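
A hedged sketch of the matching step. The placeholder pattern is taken from the examples in this thread and may differ between docx2python versions, so treat the regex as an assumption. `image_context` and `load_paragraphs` are hypothetical helper names; `load_paragraphs` assumes that docx2python accepts an `image_folder` argument and that the `.text` attribute of the result separates paragraphs with blank lines.

```python
import re

# Matches e.g. "----media/image1.png----" or "-----image1.png----" (assumed format).
IMAGE_MARK = re.compile(r"-{4,}(?:media/)?(.+?)-{4,}")


def image_context(paragraphs):
    """For each image placeholder, collect candidate caption text.

    paragraphs: flat list of paragraph strings in document order.
    """
    found = []
    for i, text in enumerate(paragraphs):
        for name in IMAGE_MARK.findall(text):
            # Text sharing the paragraph with the placeholder.
            same = IMAGE_MARK.sub("", text).strip()
            # First non-empty paragraph after the image, often the caption.
            following = next((p.strip() for p in paragraphs[i + 1:] if p.strip()), "")
            found.append({"image": name,
                          "same_paragraph": same,
                          "next_paragraph": following})
    return found


def load_paragraphs(docx_path, image_folder):
    """Read a docx with docx2python (third-party) and split its text into paragraphs."""
    # Imported lazily so image_context can be used without the package installed.
    from docx2python import docx2python
    result = docx2python(docx_path, image_folder=image_folder)
    return result.text.split("\n\n")
```

Usage would be something like `image_context(load_paragraphs("report.docx", "images"))`, where both arguments are placeholder names. The image name in each record can then be matched against the files written to the image folder.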

Szymon