
I'm trying to parse the titles of the different listings from this webpage. The titles are not rendered dynamically; they are present in the page source. However, the site requires cookies to be sent in the first place before it will serve the listings. I've tried the following way to scrape the titles, but it doesn't seem to work.

My attempt so far:

import scrapy
from scrapy.crawler import CrawlerProcess

class ControllerSpider(scrapy.Spider):
    name = 'controller'
    start_urls = [
        'https://www.controller.com/listings/aircraft/for-sale/list?SortOrder=23&scf=False&page=1'
    ]

    def start_requests(self):
        # Give each start URL its own cookie jar so sessions stay separate.
        for i, url in enumerate(self.start_urls):
            yield scrapy.Request(url, meta={'cookiejar': i}, callback=self.parse)

    def parse(self, response):
        for item in response.css(".listing-name > a[href]::text").getall():
            yield {"title": item}

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    })
    c.crawl(ControllerSpider)
    c.start()
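For context: Scrapy's cookies middleware already stores and resends cookies automatically between requests (COOKIES_ENABLED defaults to True), so the cookiejar meta key is mainly useful for keeping multiple independent sessions apart. The mechanics are the same as standard-library cookie handling; a minimal, self-contained sketch of that round trip (the cookie name and value below are made up for illustration, not taken from the site):

```python
from http.cookies import SimpleCookie

# A hypothetical Set-Cookie header, similar in shape to what the site's
# first response might send (name/value invented for illustration).
raw = "sessionid=abc123; Path=/; HttpOnly"

jar = SimpleCookie()
jar.load(raw)

# Echo the name=value pairs back as a Cookie header on the next request,
# which is exactly what Scrapy's middleware does behind the scenes.
cookie_header = "; ".join(f"{key}={morsel.value}" for key, morsel in jar.items())
print(cookie_header)  # sessionid=abc123
```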

How can I grab the titles of different listings from that webpage making use of cookies?

P.S. I do not wish to hardcode the cookies.

SMTH
    I tried it but it's better to use selenium – bigbounty Jul 14 '20 at 18:52
  • To clarify, why do you not want to hard-code the cookies? – zmike Jul 16 '20 at 04:55
  • If I hardcode the cookies now and run the script, it will succeed. However, when I run it some other time, it may stop working until I renew the cookies, as cookies are not always static. I hope that explains why I don't wish to go for hardcoded cookies, @zmike. Thanks. – SMTH Jul 16 '20 at 07:53
  • 2
    I agree with what @bigbounty said: use selenium. Also note that this specific website seems to undergo "renovation" and that you might want to rather write your scraping process for the preview at beta.controller.com/listings/search instead – Asmus Jul 20 '20 at 06:09

1 Answer


If you use a scraping browser extension, you don't have to deal with the cookies manually: visiting the site normally sets the cookies, and you can scrape it afterwards.

https://github.com/get-set-fetch/extension is an open source extension that can handle your scenario just by specifying CSS selectors for link navigation and content extraction.

I've played a bit with the site and created a scraping configuration for you containing the required CSS selectors for navigation (next page, aircraft detail page) and scraping (year, model, manufacturer, price):

"eLtI4gnapZQ9b8MgEIb/CuoQtQO4SZolktWta4Z2zELwYSNhjI5z3P77AmrS2ulH0gwI7OM9ne7eh+PVVH9qPKBQXXsCBRmywKSrWMxopGWub3eAR1YaIh/WRTEMg1CjVIU1gYyrQyENKpSaCt0hD9JCDj0+d0gbrADLxXIWlC6fkhdmPjannI8ZrIF4iEsnYPghEA8oPfBoUTKKf3r4Gl7v/wPoVPcDAlPXj5BQjXQ1vHw1djblxgNmzsP0+Tjb62JHjsfmt+kgHLxSRD5vqdlbV5m9+JgWT7mZjPHD1PKdES2rKSzflnUuHKLl1Hk+XwmlmcBuWDtquGqMrW4Xd0wED4rvpe3TO/UGErfuN8lyKmml63Wsp0f4Q/pwIu3ihKLGo1Fw/Ju/bi4g9TIy3wGjz0AS"

Inside the extension do: New Project > Config Hash > paste the above hash (without the quotes) > Save, Scrape, View Results > Export as csv.

Each csv row will contain year, manufacturer, model, and price. I've set some limits so only the first 4 result pages are scraped, but you can disable this by setting the corresponding value to -1.
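Once exported, the csv is easy to post-process with the standard library. A small sketch assuming the column names described above (the row values here are invented purely for illustration):

```python
import csv
import io

# Two invented rows in the shape described above: year, manufacturer, model, price.
sample = (
    "year,manufacturer,model,price\n"
    "2006,Cessna,Citation CJ3,4500000\n"
    "1979,Piper,Navajo,300000\n"
)

# csv.DictReader maps each row to the header names, so columns can be
# accessed by name regardless of their order in the export.
rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    print(row["year"], row["manufacturer"], row["model"], row["price"])
```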

Disclaimer: I'm the extension author.

a1sabau
    Please consider adding a disclaimer that you wrote that extension, like you [did in your other answer](https://stackoverflow.com/a/62952414/565489) – Asmus Jul 20 '20 at 06:03