Is there a proper way, using Python urllib, to retrieve (without saving/downloading anything locally) all the files needed to properly display a given HTML page, together with their information (size, etc.)? This includes things such as inlined images, sounds, and referenced stylesheets.
I searched and found that wget can do this with the --page-requisites flag, but the performance is not what I need and I don't want to download anything locally. Furthermore, combining it with -O /dev/null does not give me what I want either.
My final goal is to hit the page (hosted locally), gather its info, and move on.
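Here is a rough sketch of the approach I have in mind, using only the standard library (urllib.request plus html.parser). It only looks at img/script/audio/source/link tags, uses HEAD requests to get sizes without transferring the bodies, and the localhost URL at the bottom is just a placeholder:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class RequisiteCollector(HTMLParser):
    """Collect URLs of resources referenced by the page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script", "audio", "source") and attrs.get("src"):
            self.resources.add(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            # stylesheets, icons, etc.
            self.resources.add(urljoin(self.base_url, attrs["href"]))


def page_requisites(url):
    # Fetch the page itself; everything stays in memory, nothing is written to disk
    with urlopen(url) as resp:
        html = resp.read()
    info = {url: len(html)}

    parser = RequisiteCollector(url)
    parser.feed(html.decode("utf-8", errors="replace"))

    # Ask for each resource's size via a HEAD request so the body is never
    # transferred; size is None if the server does not report Content-Length
    for res_url in parser.resources:
        with urlopen(Request(res_url, method="HEAD")) as resp:
            size = resp.headers.get("Content-Length")
        info[res_url] = int(size) if size is not None else None
    return info


if __name__ == "__main__":
    for resource, size in page_requisites("http://localhost:8000/index.html").items():
        print(size, resource)
```

This misses resources referenced from CSS (url(...)) or loaded by JavaScript, and some servers reject HEAD requests, so I'm not sure it's the right way to go.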
Any tips or reading references are appreciated.