
I want to get a JSON response from a link using proxies. When I gather all the proxies and loop through them, I get a valid JSON response after 2 to 4 attempts, and at that point I want to quit.

But my spider keeps running even after I try to close it when the condition is met (i.e. after getting a 200 response). I have tried sys.exit() and raise CloseSpider(reason), but nothing works for me. Here is my code:

import scrapy
from scrapy.crawler import CrawlerProcess
import json
from scrapy.exceptions import CloseSpider
import sys

class ScrapyProxy(scrapy.Spider):
    name = 'scrapy_proxy'
    start_urls = ['https://free-proxy-list.net']
    
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en;q=0.9',
        'cache-control': 'no-cache',
        'pragma': 'no-cache',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'none',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    
    def parse(self, response):
        table = response.css('table')
        rows = table.css('tr')
        cols = [row.css('td::text').getall() for row in rows]
        
        proxies = []
        
        for col in cols:
            if col and col[4] == 'elite proxy' and col[6] == 'yes':
                proxies.append('https://' + col[0] + ':' + col[1])
            
        print('proxies:', len(proxies))
        
        for proxy in proxies[0:5]:
            print(proxy)
            
            url = 'https://shopee.com.my/api/v2/search_items/?by=sales&limit=50&match_id=2426&newest=0&order=desc&page_type=search&version=2'
            
            yield scrapy.Request(url, dont_filter=True, headers=self.headers, meta={'proxy': proxy}, callback=self.check_response)
            
         
    def check_response(self, response):
        print('\n\nRESPONSE:', response.status)
        try:
            data = json.loads(response.body)
            if data['items']:
                print(f'Received data with: {len(data["items"])} items.')
                # HERE I WANT TO CLOSE MY SPIDER
                # self.close(reason='Closing spider')
                # sys.exit('Exiting from the spider')
                # raise CloseSpider(reason='Closing the spider')
        except:
            print(f'got error in url {response.url}')

# run spider
process = CrawlerProcess()
process.crawl(ScrapyProxy)
process.start()

This is a standalone spider. Please help me terminate it. Thanks in advance.

FAIZ AHMED
  • What happens when you try this `os._exit(0)`? – SIM Oct 24 '20 at 17:19
  • Does this answer your question? [Unable to make my script stop when some urls are scraped](https://stackoverflow.com/questions/55792062/unable-to-make-my-script-stop-when-some-urls-are-scraped) – SIM Oct 24 '20 at 18:55

1 Answer


I wonder why raise CloseSpider doesn't work; according to the docs it should. See Georgiy's comment below.

The reason sys.exit() does not work is probably that the resulting SystemExit is caught by Twisted. You could try getting the reactor and stopping it:

from twisted.internet import reactor
...
reactor.stop() 

If that doesn't work, try reactor.crash().
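
To make that concrete, here is a minimal sketch of how the callback could look when stopping the reactor from inside the spider. The spider name and structure are my own illustration, not your exact code, and stopping the reactor tears down the whole CrawlerProcess, which is usually fine for a one-off standalone spider:

import json

from scrapy import Spider
from twisted.internet import reactor


class StopOnJsonSpider(Spider):
    # Hypothetical spider, only to show where reactor.stop() would go.
    name = 'stop_on_json'

    def check_response(self, response):
        try:
            data = json.loads(response.body)
        except json.JSONDecodeError:
            # Catch only the JSON error so other exceptions still propagate.
            self.logger.info('got error in url %s', response.url)
            return

        if data.get('items'):
            self.logger.info('Received %d items, shutting down.', len(data['items']))
            # Stopping the reactor ends the whole CrawlerProcess.
            if reactor.running:
                reactor.stop()

The reactor.running check avoids a ReactorNotRunning error in case a second valid response arrives after the first one has already stopped the reactor.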

Raphael
  • No. In the [CloseSpider extension docs](https://docs.scrapy.org/en/latest/topics/extensions.html#scrapy.extensions.closespider.CloseSpider) (not the CloseSpider exception docs mentioned in your answer) there is a note: "When a certain closing condition is met, requests which are currently in the downloader queue (up to CONCURRENT_REQUESTS requests) are still processed." That is exactly what happened here. – Georgiy Oct 25 '20 at 15:10
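
Following up on that note, a common workaround is to combine raise CloseSpider with a guard flag so that callbacks for requests already in the downloader queue return immediately, and to catch only json.JSONDecodeError so that the CloseSpider exception is not swallowed by a bare except. A rough sketch (the GuardedSpider name and the done flag are my own, not from the question):

import json

from scrapy import Spider
from scrapy.exceptions import CloseSpider


class GuardedSpider(Spider):
    # Hypothetical spider, only to illustrate the guard-flag pattern.
    name = 'guarded'
    done = False  # set once a valid JSON response has been seen

    def check_response(self, response):
        # Requests already in the downloader queue still reach this
        # callback after CloseSpider is raised; bail out early for those.
        if self.done:
            return

        try:
            data = json.loads(response.body)
        except json.JSONDecodeError:
            self.logger.info('got error in url %s', response.url)
            return

        if data.get('items'):
            self.done = True
            # CloseSpider must not be caught by a broad except clause,
            # otherwise the engine never sees it and the spider keeps running.
            raise CloseSpider(reason='Got a valid JSON response')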