I'm using Scrapy and want to save some of the .svg images from a webpage locally on my computer. The URLs for these images have the structure '__.com/svg/4/8/3/1425.svg' (each one is a full, working URL, https included).

I've defined the item in my items.py file:

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

I've added the following to my settings:

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

IMAGES_STORE = '../Data/Silks'
MEDIA_ALLOW_REDIRECTS = True

In the main parse function I'm calling:

imageItem = ImageItem()
imageItem['image_urls'] = [url]

yield imageItem

But it doesn't save the images. I've followed the documentation and tried numerous things, but I keep getting the following error:

StopIteration: <200 https://www.________.com/svg/4/8/3/1425.svg>

During handling of the above exception, another exception occurred:
......
......
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x1139233b0>

Am I missing something? Can anyone help? I am fully stumped.

Maximilian
  • Can you at least post the image URL before redirection? – wishmaster Aug 28 '20 at 00:31
  • Maybe Pillow (PIL) does not support SVG images, and the image pipeline uses it to convert all images to JPEG. If you are fine with downloading the images as SVG, then use the FilesPipeline instead, which does not try to convert downloaded files. – Gallaecio Aug 28 '20 at 10:56

2 Answers

Gallaecio was right! Scrapy's ImagesPipeline was having an issue with the .svg file type. I changed the ImagesPipeline to the FilesPipeline and it works!

For anyone stuck, this is covered in the Scrapy documentation under "Downloading and processing files and images".
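For reference, the switch looks roughly like the sketch below. It assumes the same storage directory as in the question; FilesPipeline reads URLs from a `file_urls` field and writes its results to `files` (instead of `image_urls` / `images`). The `make_file_item` helper is a hypothetical name used only for illustration; a plain dict is used here because Scrapy accepts dicts as items, but a `scrapy.Item` with `file_urls` and `files` fields works the same way.

```python
# settings.py (sketch): use FilesPipeline so the SVG is stored untouched
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "../Data/Silks"  # assumption: same directory as IMAGES_STORE in the question
MEDIA_ALLOW_REDIRECTS = True


def make_file_item(url):
    """Hypothetical helper: build the item to yield from parse()."""
    # FilesPipeline looks for download URLs under the "file_urls" key.
    return {"file_urls": [url]}
```

In the spider, `yield make_file_item(url)` then takes the place of building and yielding the `ImageItem`.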

Maximilian

Python Imaging Library (PIL), which is used by the ImagesPipeline, does not support vector images.

If you still want to benefit from the ImagesPipeline capabilities rather than switch to the more general FilesPipeline, you can do something along these lines:

from io import BytesIO

import scrapy
from reportlab.graphics import renderPM
from scrapy.pipelines.images import ImagesPipeline
from svglib.svglib import svg2rlg

class SvgCompatibleImagesPipeline(ImagesPipeline):

    def get_images(self, response, request, info, *, item=None):
        """
        Add processing of SVG images to the standard images pipeline
        """
        if isinstance(response, scrapy.http.TextResponse) and response.text.startswith('<svg'):
            b = BytesIO()
            renderPM.drawToFile(svg2rlg(BytesIO(response.body)), b, fmt='PNG')
            res = response.replace(body=b.getvalue())
        else:
            res = response

        return super().get_images(res, request, info, item=item)

This replaces the SVG image in the response body with a PNG version of it, which can then be processed by the regular ImagesPipeline.
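One caveat: the `startswith('<svg')` check in the pipeline above only matches files whose very first characters are `<svg`. Many SVG files begin with an XML declaration, a comment, or a DOCTYPE, and would be passed through unconverted. A slightly more tolerant check could look like the sketch below; `looks_like_svg` and `probe_size` are hypothetical names, and this is a heuristic, not a full XML parse.

```python
def looks_like_svg(body: bytes, probe_size: int = 1024) -> bool:
    """Heuristically decide whether a response body is an SVG document."""
    # Inspect only the first bytes; drop a UTF-8 BOM and leading whitespace.
    head = body[:probe_size].lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    if head.startswith(b"<svg"):
        return True
    # Allow an XML prolog, comment, or DOCTYPE ahead of the <svg> root element.
    return head.startswith((b"<?xml", b"<!--", b"<!doctype")) and b"<svg" in head
```

Inside `get_images`, the condition could then be `looks_like_svg(response.body)` in place of the `isinstance(...) and response.text.startswith('<svg')` test.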

runningwild