
As stated above, after running the code for some time it fails. The logs do not show anything; it simply ceases to work.

I will show some of the warnings and errors I got as well as the code and settings file.

Keep in mind the code is fully functional and can scrape the website without any issue, but after some amount of time it fails.

I have had periods where the scraper works for 2+ hours, and times where it fails within a few minutes. I have 6 user agents in use and 150 proxies in rotation. When it fails, I immediately go to the website manually with the proxies that were being used to test whether they are the issue; they always work, so it is unlikely the proxies are the problem, and the site seems to have very low protection against scrapers and crawlers.
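
For reference, the proxies are supplied to scrapy-rotating-proxies through the list file referenced in settings.py below, one proxy per line; the entries shown here are placeholders rather than my real proxies:

http://user:pass@111.111.111.111:8080
http://user:pass@222.222.222.222:8080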

Spider file:

# -*- coding: utf-8 -*-
import datetime

import discord
from discord import SyncWebhook
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler

namelist = []
timelist = []

def send_embed(price, name, stock, image, response):
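    # Send a Discord embed for a product. Alerts are de-duplicated: a product
    # name that has already been posted is only re-posted once more than 120
    # seconds have passed since its last alert.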
    neg = ['sticker', 'pivot tool', 't-shirt']
    #neg = []

    if (price and name and stock) and (
        not any(x in name.lower() for x in neg)
        or (
            "https://www.scottycameron.com/store/speed-shop-creations/"
            in str(response.request.headers.get('Referer', None))
            and "t-shirt" not in name.lower()
        )
    ):
        
        temptime = datetime.datetime.now()

        global namelist
        global timelist

        if name not in namelist:
            namelist.append(name)
            timelist.append(temptime)

            stock = stock.replace('(', '')
            stock = stock.replace(')', '')
            image = image.replace(' ', '%20')
            webhook = SyncWebhook.from_url('REDACTED')
            embed = discord.Embed(
                title=str(name),
                url=str(response.request.url),
                colour=0xDB0B23
            )
            embed.add_field(name = "Price", value = str(price), inline = True)
            embed.add_field(name = "Stock", value = str(stock), inline = True)
            embed.set_thumbnail(url = str(image))
            embed.set_footer(text = "Notify Test Monitors")
            webhook.send(embed = embed)
        else:
            index = namelist.index(name)
            diff = (temptime - timelist[index]).total_seconds()
            if diff > 120:

                timelist[index] = temptime

                stock = stock.replace('(', '')
                stock = stock.replace(')', '')
                image = image.replace(' ', '%20')
                webhook = SyncWebhook.from_url('REDACTED')
                embed = discord.Embed(
                    title=str(name),
                    url=str(response.request.url),
                    colour=0xDB0B23
                )
                embed.add_field(name = "Price", value = str(price), inline = True)
                embed.add_field(name = "Stock", value = str(stock), inline = True)
                embed.set_thumbnail(url = str(image))
                embed.set_footer(text = "Notify Test Monitors")
                webhook.send(embed = embed)

class scottycameronSpider(CrawlSpider):
    name = 'scottycameron'
    allowed_domains = ['scottycameron.com']
    start_urls = ['https://www.scottycameron.com/']

    rules = (
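        # Follow every link under /store/ and run parse() on each fetched page.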
        Rule(LinkExtractor(allow = 'store/'), callback = 'parse', follow = True),
    )

    def parse(self, response):
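        # NOTE: the Scrapy docs warn against using 'parse' as a rule callback,
        # since CrawlSpider uses parse() internally to implement its logic.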
        for products in response.xpath('//*[@id="layout-content"]'):
            price = products.xpath('//*[@id="product_Detail_Price_Div"]/p/text()').get()
            name = products.xpath('//*[@id="layout-product"]/div[2]/div/div[2]/h1/text()').get()
            stock = products.xpath('//*[@id="dynamic-inventory"]/span/text()').get()
            image = products.xpath('//*[@id="product-image"]/@src').get()

            send_embed(price, name, stock, image, response)

    def close(self, reason):
        start_time = self.crawler.stats.get_value('start_time')
        finish_time = self.crawler.stats.get_value('finish_time')
        with open("spiders/test_scrapy/times.txt", 'a') as f:
            f.write(str(finish_time - start_time) + "\n")

process = CrawlerProcess(get_project_settings())
scheduler = TwistedScheduler()
scheduler.add_job(process.crawl, 'interval', args=[scottycameronSpider], seconds=5)
scheduler.start()
process.start(False)
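
For what it's worth, the scheduler block above starts a brand-new crawl every 5 seconds regardless of whether the previous crawl has finished, so crawls can overlap. A minimal sketch of an alternative that only queues the next run once the current crawl completes, using CrawlerRunner and the Deferred returned by crawl() (untested in my setup):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

def crawl_forever(_result=None):
    # crawl() returns a Deferred that fires when the crawl finishes;
    # only then is the next run scheduled, 5 seconds later.
    d = runner.crawl(scottycameronSpider)
    d.addCallback(lambda _: reactor.callLater(5, crawl_forever))
    return d

crawl_forever()
reactor.run()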

Settings.py

# Scrapy settings for scrapy_monitors project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

import asyncio
asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

BOT_NAME = 'scrapy_monitors'

SPIDER_MODULES = ['scrapy_monitors.spiders']
NEWSPIDER_MODULE = 'scrapy_monitors.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_monitors (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100
CONCURRENT_ITEMS = 100

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 80
#CONCURRENT_REQUESTS_PER_IP = 32

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapy_monitors.middlewares.ScrapyMonitorsSpiderMiddleware': 543,
#}

# Comment this out for more log output while debugging (Scrapy's default level is DEBUG)
LOG_LEVEL = "WARNING"
LOG_FILE = "spiders/test_scrapy/log.txt"

# Insert Your List of Proxies Here
ROTATING_PROXY_LIST_PATH = 'spiders/proxies.txt'

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    #'scrapy_monitors.middlewares.ScrapyMonitorsDownloaderMiddleware': 543,
    #'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    #'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

# Used for User Agents
DOWNLOADER_MIDDLEWARES.update({
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
})

USER_AGENTS = [
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/57.0.2987.110 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.79 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) '
     'Gecko/20100101 '
     'Firefox/55.0'),  # firefox
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.91 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/62.0.3202.89 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/63.0.3239.108 '
     'Safari/537.36'),  # chrome
    # ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    #  'AppleWebKit/537.36 (KHTML, like Gecko) '
    #  'Chrome/58.0.3029.110 '
    #  'Safari/537.36'),  # chrome
    # ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:53.0) '
    #  'Gecko/20100101 '
    #  'Firefox/53.0'),  # firefox
    # ('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0) '),
    # ('Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0; MDDCJS) '),
    # ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    #  'AppleWebKit/537.36 (KHTML, like Gecko) '
    #  'Chrome/51.0.2704.79 '
    #  'Safari/537.36 '
    #  'Edge/14.14393'),  # chrome
    # ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) '),
]
# Used for User Agents

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
    #'scrapy.telnet.TelnetConsole': None
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'scrapy_monitors.pipelines.ScrapyMonitorsPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Schedule order
#SCHEDULER_ORDER = 'BFO'

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = '2.7'
#TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

As stated above, I tried to fix the issue based on the errors I got, but had no luck. I tested the proxies after the errors appeared and all of them worked just fine, and I tried multiple user agents to see if that fixed it. I cannot get enough out of the logger to give me a proper diagnosis, so if there are also suggestions on how to log better, I would love to hear them so I can better understand the issue.
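
For reference, the only logging configuration I have is the LOG_LEVEL / LOG_FILE pair shown in settings.py above. A minimal sketch of a more verbose setup, assuming Scrapy 2.6+ for LOG_FILE_APPEND:

LOG_LEVEL = "DEBUG"      # or "INFO"; at WARNING most request/response activity is hidden
LOG_FILE = "spiders/test_scrapy/log.txt"
LOG_FILE_APPEND = True   # keep history across the scheduled re-runs
DUPEFILTER_DEBUG = True  # log every request dropped as a duplicate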

I WILL ATTACH ERROR LOG 1 IN THE COMMENTS

Error Log 2: after the scraper had run for a little while with the first error, this was produced:

Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\twisted\internet\tcp.py", line 1334, in startListening
    skt.bind(addr)
OSError: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\defer.py", line 292, in maybeDeferred_coro
    result = f(*args, **kw)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\extensions\telnet.py", line 65, in start_listening
    self.port = listen_tcp(self.portrange, self.host, self)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\reactor.py", line 23, in listen_tcp
    return reactor.listenTCP(x, factory, interface=host)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\twisted\internet\posixbase.py", line 369, in listenTCP
    p.startListening()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\twisted\internet\tcp.py", line 1336, in startListening
    raise CannotListenError(self.interface, self.port, le)
twisted.internet.error.CannotListenError: Couldn't listen on 127.0.0.1:6073: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted.
2023-01-27 17:17:02 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method TelnetConsole.start_listening of <scrapy.extensions.telnet.TelnetConsole object at 0x0000028AE20831F0>>

I attempted to fix this error myself, but that ultimately did not resolve it; I believe the issue stemmed from my code failing and continuously trying to connect.

kian
  • Edit: Cannot easily add the other "error"; it's more so just showing the flow of the code and then where it just stops, without showing any warnings or issues in the log. I'll post a pastebin of the log: https://pastebin.com/tGc68013 – kian Jan 28 '23 at 08:28
  • What is the TwistedScheduler supposed to be doing? – Alexander Jan 28 '23 at 10:54
  • @Alexander It is used to continuously run the script every 5 seconds – kian Jan 28 '23 at 18:30
  • But why do you run the same script every 5 seconds? Does the information change that often? – Alexander Jan 28 '23 at 18:34
  • I use it as a website monitor to check for new stock, so ideally I want it to be fast. There is probably a much better way to do this, but I am new to the realm of web scraping/web crawling. This is the furthest I have gotten in terms of a functional monitor for a website; I may look to monitor for site changes rather than scrape the full site (with some narrowed-down search), but being new there is just a lot I still need to look into. If you have any suggestions I would be very grateful. – kian Jan 28 '23 at 18:39

1 Answer


Error log 2:

This error is caused because you "ran out of ports" for the TelnetConsole: every crawl that CrawlerProcess starts brings up its own telnet console, and since your scheduler launches a new crawl every 5 seconds while earlier ones may still be running, the available ports eventually all get taken.

If you look at the telnet console page in the documentation, you can see the default port range:

portrange = [6023, 6073]

Either extend the port range by adding, for example, TELNETCONSOLE_PORT = [6023, 7000] to your settings.py file, or, even better, just disable the telnet console with TELNETCONSOLE_ENABLED = False.
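
In settings.py that would look like:

# Option 1: widen the port range the telnet console can bind to
TELNETCONSOLE_PORT = [6023, 7000]

# Option 2 (simpler if you never use the console): disable it entirely
TELNETCONSOLE_ENABLED = False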

SuperUser
  • Hey, thanks for the reply. I did set the extension for it to None but never set the other setting to False, so I went ahead and did that; we'll see if it fixes my issue. Thanks! This is what I originally did. I thought that maybe it was just an error stemming from another error running for too long, which caused some issues with too many requests to the port. `EXTENSIONS = { 'scrapy.extensions.telnet.TelnetConsole': None, #'scrapy.telnet.TelnetConsole': None }` – kian Jan 30 '23 at 16:12
  • Ok, so it seems like it did not fix the issue @SuperUser, as I was assuming it wouldn't, since the second error stemmed from something that I can't quite seem to log properly. – kian Jan 30 '23 at 17:28