
I have a list of data objects, each containing a URL to be scraped. Some of these URLs are not valid, but I still want the data object to fall through and reach the item pipelines.

After @tomáš-linhart's reply I understood that using a middleware will not work in this case, as Scrapy will not allow me to create the request object in the first place.

An alternative is to yield an item instead of a request if the URL is not valid.

Here is my code:

def start_requests(self):
    rurls = json.load(open(self.data_file))
    for data in rurls[:100]:
        url = data['Website'] or ''
        rid = data['id']

        # skip creating requests for invalid urls
        if not (url and validators.url(url)):
            yield self.create_item(rid, url)
            continue

        # create request object
        request_object = scrapy.Request(url=url, callback=self.parse, errback=self.errback_httpbin)

        # populate request object
        request_object.meta['rid'] = rid

        self.logger.info('REQUEST QUEUED for RID: %s', rid)
        yield request_object

The above code throws the error shown below. More than the error itself, I am not sure how to trace the issue back to its origin. :(

2017-09-22 12:44:38 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method RefererMiddleware.request_scheduled of <scrapy.spidermiddlewares.referer.RefererMiddleware object at 0x10f603ef0>>
Traceback (most recent call last):
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
    *arguments, **named)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/spidermiddlewares/referer.py", line 343, in request_scheduled
    redirected_urls = request.meta.get('redirect_urls', [])
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/item.py", line 74, in __getattr__
    raise AttributeError(name)
AttributeError: meta

2017-09-22 12:44:38 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/commands/crawl.py", line 58, in run
    self.crawler_process.start()
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/twisted/internet/base.py", line 1243, in run
    self.mainLoop()
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/twisted/internet/base.py", line 1252, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/twisted/internet/base.py", line 878, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/utils/reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/core/engine.py", line 135, in _next_request
    self.crawl(request, spider)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/core/engine.py", line 210, in crawl
    self.schedule(request, spider)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/core/engine.py", line 216, in schedule
    if not self.slot.scheduler.enqueue_request(request):
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/core/scheduler.py", line 54, in enqueue_request
    if not request.dont_filter and self.df.request_seen(request):
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/item.py", line 74, in __getattr__
    raise AttributeError(name)
builtins.AttributeError: dont_filter
comiventor

3 Answers


You can't achieve the goal using your current approach, as the error you are getting is raised in the constructor of Request; see the code.

Anyway, I don't understand why you would even want to do it this way. Based on your requirement:

I have a list of data objects, each containing a URL to be scraped. Some of these URLs are not valid, but I still want the data object to fall through and reach the item pipelines.

If I understand it correctly, you already have a complete item (data object, in your terminology) and you just want it to go through the item pipelines. Then do the URL validation in the spider, and if it's not valid, just yield the item instead of yielding a request for the URL it contains. No need for a spider middleware.

Tomáš Linhart
  • I tried to return the item in the start_requests method, but I am getting an "AttributeError: meta" and a "builtins.AttributeError: dont_filter". – comiventor Sep 22 '17 at 07:01
  • I meant yield the item. Let me update my original post. – comiventor Sep 22 '17 at 07:08
  • "Then do the URL validation in the spider, and if it's not valid, just yield the item instead of yielding a request for the URL it contains." I am not sure where I can yield an item instead of a request. Could you point to sample code? The only place I know I can yield both is the parse function, which means I would have to fake at least one request. – comiventor Oct 03 '17 at 14:38

You cannot yield an Item object from the start_requests method, only a Request object. That is exactly what your traceback shows: the engine treats whatever start_requests yields as a Request and accesses attributes like meta and dont_filter, which an Item does not have.

Verz1Lka
  • I knew this part, but I am unable to follow what Tomáš is pointing to. He is saying "do the URL validation in the spider, and if it's not valid, just yield the item instead of yielding a request for the URL it contains". Not sure how one can achieve that? – comiventor Sep 23 '17 at 07:04

It is late to answer your question, but I did it this way:

import scrapy

class ImageSpider(scrapy.Spider):
    name = "image"
    allowed_domains = []

    def start_requests(self):
        # Yield a single dummy request; its only purpose is to reach
        # the parse callback, where items can be yielded freely.
        yield scrapy.Request("https://www.example.org", callback=self.parse)

    def parse(self, response):
        while True:
            # get_random_task() and ImageItem come from my own project.
            task = get_random_task()
            yield ImageItem(image_urls=task.pic_urls.split(","), mid=task.mid)

Simply make a dummy request; Scrapy then invokes the parse callback, from which you can yield items.
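
Applied to the original question, the same trick might look roughly like this. This is only a sketch: the spider and callback names are made up, https://www.example.org is a placeholder dummy URL, and create_item, errback_httpbin, data_file, and the JSON layout are assumed from the asker's snippet:

import json

import scrapy
import validators


class DataSpider(scrapy.Spider):
    name = "data"

    def start_requests(self):
        # Yield a single dummy request; items are not allowed here,
        # but they are allowed in the callback it triggers.
        yield scrapy.Request("https://www.example.org", callback=self.parse_data)

    def parse_data(self, response):
        # self.parse, self.create_item, and self.errback_httpbin are the
        # asker's existing methods, assumed to be defined on the spider.
        rurls = json.load(open(self.data_file))
        for data in rurls[:100]:
            url = data['Website'] or ''
            rid = data['id']

            # Invalid URL: yield the item directly so it still
            # reaches the item pipelines.
            if not (url and validators.url(url)):
                yield self.create_item(rid, url)
                continue

            # Valid URL: yield a real request, as in the original code.
            request_object = scrapy.Request(url=url, callback=self.parse,
                                            errback=self.errback_httpbin)
            request_object.meta['rid'] = rid
            yield request_object

The only cost is one extra request at startup; after that, invalid URLs produce items immediately while valid ones go through the normal download cycle.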

novice