How to extract images from PDF or Word, together with the text around images?

Question

I found that there are some libraries for extracting images from PDF or Word files, such as docx2txt and pdfimages. But how can I get the content around each image (for example, a title below the image)? Or the page number of each image?

Some other tools, such as PyPDF2 and minecart, can extract images page by page. However, I could not get that code to run successfully.

Is there a good way to get information about the images, either for the images already extracted by docx2txt or pdfimages, or through another way of extracting images along with that information?

score 0 · Answer 1 · answered Apr 12 '19 at 13:46

I looked at the source code of docx2txt, and it simply parses the XML inside the .docx file. So this is actually a fairly easy task.

Ref: docx2txt
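
To illustrate what "parsing the XML" can look like, here is a minimal sketch using only the Python standard library. It is not the code from the library referenced above. It assumes images are embedded through DrawingML a:blip elements and that the relationships live in word/_rels/document.xml.rels, which is the usual layout of a .docx package; the function name images_with_context is made up for this example. Images referenced in other ways (for example legacy VML pictures) are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used inside a .docx package.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def images_with_context(docx_path):
    """Hypothetical helper: list each embedded image with nearby paragraph text."""
    with zipfile.ZipFile(docx_path) as zf:
        document = ET.fromstring(zf.read("word/document.xml"))
        rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))

    # Map relationship ids (e.g. "rId5") to targets (e.g. "media/image1.png").
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{PKG_REL}}}Relationship")
    }

    # Collect the text and the embedded image ids of every paragraph, in order.
    paragraphs = []
    for p in document.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        embeds = [b.get(f"{{{R}}}embed") for b in p.iter(f"{{{A}}}blip")]
        paragraphs.append((text, embeds))

    results = []
    for i, (text, embeds) in enumerate(paragraphs):
        for rid in embeds:
            results.append({
                "image": targets.get(rid),
                "same_paragraph": text,
                "before": paragraphs[i - 1][0] if i > 0 else "",
                "after": paragraphs[i + 1][0] if i + 1 < len(paragraphs) else "",
            })
    return results
```

A caption usually ends up in "after" or "same_paragraph". Note that a .docx file does not store page numbers, since pagination is decided when the document is rendered, so this approach cannot report the page of an image.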

answered Apr 12 '19 at 13:46

Jinyu Liu

score 0 · Answer 2 · answered Jul 10 '19 at 21:10

docx2python pulls the images into a folder and leaves ----image1.png---- markers in the extracted text. This might get you close to where you'd like to go.

answered Jul 10 '19 at 21:10

Shay

score 0 · Answer 3 · answered Jan 06 '22 at 02:44

A few months ago, I adapted docx2python to produce a structured (hierarchical) XML file from a .docx file, which works quite well on many files.

As far as I know, a paragraph contains several runs, and each run contains a single text element and sometimes images. You can read this document for details: https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.paragraph?view=openxml-2.8.1

docx2python supports extracting images together with the text around them. When you read the paragraphs with docx2python, ----media/imagen---- shows up in the text as an image placeholder. You can then get at the image itself if you set extract_image=True. In other words, you get the name of each image in the paragraph text, plus a list of image files. Match them up as you like.
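
The "match as you like" step could look roughly like the sketch below. It works on plain strings and does not call docx2python itself. The marker pattern (four dashes, an optional media/ prefix, a file name, four dashes) is an assumption based on the placeholders quoted in this thread, and the function name images_with_neighbors is made up for this example.

```python
import re

# Assumed placeholder shape: ----image1.png---- or ----media/image1.png----
MARKER = re.compile(r"----(?:media/)?([\w.]+)----")


def _clean(paragraph):
    """Return the paragraph text with image placeholders removed."""
    return MARKER.sub("", paragraph).strip()


def images_with_neighbors(paragraphs, image_names):
    """Hypothetical helper: pair each known image with the nearest text around it."""
    known = set(image_names)
    found = []
    for i, para in enumerate(paragraphs):
        for match in MARKER.finditer(para):
            name = match.group(1)
            if name not in known:
                continue  # placeholder without a matching extracted file
            # Nearest non-empty text before and after the placeholder's paragraph.
            before = next((_clean(p) for p in reversed(paragraphs[:i]) if _clean(p)), "")
            after = next((_clean(p) for p in paragraphs[i + 1:] if _clean(p)), "")
            found.append({
                "image": name,
                "inline": _clean(para),  # text sharing the paragraph with the image
                "before": before,
                "after": after,
            })
    return found
```

With docx2python, the paragraphs would plausibly come from splitting the extracted text on blank lines, and image_names from the names of the extracted image files. Check both against the version you have installed, since the placeholder format and the available options have changed between releases.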
