I have gone through all the Scrapy examples and tutorials I could find, and I couldn't find one that shows how to get all the URLs of the image, CSS, and JS files being sent from the server.

Is there a way to do that with scrapy? If not with scrapy, then is there a way to do it with something else?

I basically want to crawl my website, collect all the resource URLs, and output them to a log file.

airborne4
  • You can list all the URLs from a website. You can try the code in this question: [http://stackoverflow.com/questions/9561020/how-do-i-use-the-python-scrapy-module-to-list-all-the-urls-from-my-website?rq=1] – Jose Raul Barreras Apr 24 '15 at 00:29
  • @JoseRaulBarreras I very much appreciate the response! However, it's the resources of the website I want the URLs for, not the page URLs. I was already able to go through and get all the page URLs; I just don't know how to get the resources' URLs, if that makes sense. – airborne4 Apr 24 '15 at 00:40

1 Answer

You can use a link extractor (more specifically, I've found that LxmlParserLinkExtractor works better for this kind of thing), customizing the tags and attributes it looks at, like this:

from scrapy.contrib.linkextractors.lxmlhtml import LxmlParserLinkExtractor

# tags and attributes that typically reference page resources (images, CSS, JS)
tags = ['img', 'embed', 'link', 'script']
attrs = ['src', 'href']

# accept a link only when both its tag and its attribute are in the lists above
extractor = LxmlParserLinkExtractor(lambda x: x in tags, lambda x: x in attrs)
resource_urls = [l.url for l in extractor.extract_links(response)]
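
To get those URLs into a log file, here is a minimal spider sketch built around the extractor above. It assumes a Scrapy version old enough to still ship scrapy.contrib (roughly the 0.24 era, matching the import in this answer); the spider name ResourceSpider and the start URL are placeholders you'd swap for your own site:

from scrapy.spider import Spider
from scrapy.contrib.linkextractors.lxmlhtml import LxmlParserLinkExtractor

tags = ['img', 'embed', 'link', 'script']
attrs = ['src', 'href']

class ResourceSpider(Spider):
    # hypothetical spider; name and start_urls are placeholders
    name = 'resources'
    start_urls = ['http://example.com/']

    def parse(self, response):
        extractor = LxmlParserLinkExtractor(lambda x: x in tags, lambda x: x in attrs)
        for link in extractor.extract_links(response):
            # each resource URL goes to the Scrapy log
            self.log(link.url)

Running it as scrapy crawl resources --logfile=resources.log should send the output to a file, since --logfile is a standard Scrapy command-line option.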
Elias Dorneles
  • Thank you! I think this is my answer - won't have time to implement it until tomorrow but this seems to be the right direction. – airborne4 Apr 24 '15 at 00:58