I am trying to copy images from one Word document to another. To do that, I extracted all the images from the Word document into a folder (img_folder) using the following code:

import docx2txt
docx2txt.process('Sample2.docx', 'img_folder/')

With the above code, the images are saved into img_folder in jpeg/png format. However, when I try to insert the images from this folder, I get the following error:

File "D:/Pycharm Projects/pydocx/newaddImg.py", line 60, in copy
    newp.add_run().add_picture(imgPath)
  File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\text\run.py", line 62, in add_picture
    inline = self.part.new_pic_inline(image_path_or_stream, width, height)
  File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\parts\story.py", line 56, in new_pic_inline
    rId, image = self.get_or_add_image(image_descriptor)
  File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\parts\story.py", line 29, in get_or_add_image
    image_part = self._package.get_or_add_image_part(image_descriptor)
  File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\package.py", line 31, in get_or_add_image_part
    return self.image_parts.get_or_add_image_part(image_descriptor)
  File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\package.py", line 74, in get_or_add_image_part
    image = Image.from_file(image_descriptor)
  File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\image\image.py", line 55, in from_file
    return cls._from_stream(stream, blob, filename)
  File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\image\image.py", line 176, in _from_stream
    image_header = _ImageHeaderFactory(stream)
  File "D:\Pycharm Projects\pydocx\env\lib\site-packages\docx\image\image.py", line 199, in _ImageHeaderFactory
    raise UnrecognizedImageError
docx.image.exceptions.UnrecognizedImageError

Can anyone suggest how to resolve this error?

3 Answers

Please check if your question is a duplicate of this. In either case, the same answer should be able to give you a lot more insight into the problem you seem to be facing currently.

veedata
  • I don't think OP has that issue. That question clearly states that the image format is .wmf, whereas OP's images are in jpeg/png format. – imxitiz Jul 24 '21 at 19:00
  • True, I agree, but the solution there mentions that the library recognises images by their image header. So it would be preferable to first resave the images with something like Pillow to rule out header issues. I realise my wording made my takeaway ambiguous, but it might be a potential problem. – veedata Jul 24 '21 at 19:04
According to the documentation, you can insert images like this:

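from docx import Document
import os

document = Document()

directory = "img_folder"

# Add every JPEG/PNG file in the folder; match extensions case-insensitively
# and include ".jpeg", since the extracted images may use that suffix.
for name in sorted(os.listdir(directory)):
    if name.lower().endswith((".jpg", ".jpeg", ".png")):
        document.add_picture(os.path.abspath(os.path.join(directory, name)))

document.save("output.docx")  # example output name (an assumption)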

This picks up every JPEG or PNG file in the img_folder directory (matched by file extension) and adds each image to the document.

imxitiz
There are some images, perhaps taken on an older iPhone, that python-docx can't recognize. These images omit some parts of the standard image header and that prevents python-docx from recognizing them.

The fix is to load the image into Pillow and then save it as the same type. This process restores the header and python-docx should then be able to read it without error.

Something roughly like:

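from PIL import Image

# Image.open() returns an image object; saving that object makes Pillow
# write the file again with a complete, standard header.
with Image.open("corrupted-header.jpg") as img:
    img.save("fixed-header.jpg")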

You could do this in a try-except block and only process the files that raised UnrecognizedImageError, or you could pre-emptively pre-process all the image files.

The relevant parts of the Pillow documentation are here:
https://pillow.readthedocs.io/en/stable/reference/Image.html

scanny