I am trying to use Scrapy to scrape this website's search results for any search query - http://www.bewakoof.com.

The website uses AJAX (in the form of XHR) to display the search results. I managed to track down the XHR request, and you can see it in my code below (inside the for loop, where I store the URL in temp and increment 'i' in the loop):

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess, CrawlerRunner
import scrapy
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
import datetime
from multiprocessing import Process, Queue
import os
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import re

query='shirt'
query1=query.replace(" ", "+")  


class DmozItem(scrapy.Item):

    productname = scrapy.Field()
    product_link = scrapy.Field()
    current_price = scrapy.Field()
    mrp = scrapy.Field()
    offer = scrapy.Field()
    imageurl = scrapy.Field()
    outofstock_status = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://www.bewakoof.com"]

    def start_requests(self):

        task_urls = [
        ]
        i=1
        for i in range(1,2):
            temp=( "http://www.bewakoof.com/search/searchload/search_text/" + query + "/page_num/" + str(i) )
            task_urls.append(temp)
            i=i+1

        start_urls = (task_urls)
        p=len(task_urls)
        print 'hi'
        return [ Request(url = start_url) for start_url in start_urls ]
        print 'hi'

    def parse(self, response):
        print 'hi'
        print response
        items = []
        for sel in response.xpath('//html/body/div[@class="main-div-of-product-item"]'):
            item = DmozItem()
            item['productname'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@title').extract())[17:-6]
            item['product_link'] = "http://www.bewakoof.com"+str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@href').extract())[3:-2]
            item['current_price']='Rs. ' + str(sel.xpath('div[1]/div[@class="product_info"]/div[@class="product_price_nomrp"]/span[1]/text()').extract())[3:-2]

            item['mrp'] = item['current_price']

            item['offer'] = str('No additional offer available')

            item['imageurl'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@data-original').extract())[3:-2]
            item['outofstock_status'] = str('In Stock')
            items.append(item)


spider1 = DmozSpider()
settings = Settings()
settings.set("PROJECT", "dmoz")
settings.set("DOWNLOAD_DELAY" , 5)
crawler = CrawlerProcess(settings)
crawler.crawl(spider1)
crawler.start()

Now, when I execute this, I get unexpected errors:

2015-07-09 11:46:01 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-09 11:46:01 [scrapy] INFO: Optional features available: ssl, http11
2015-07-09 11:46:01 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 5}
2015-07-09 11:46:02 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-09 11:46:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-09 11:46:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-09 11:46:02 [scrapy] INFO: Enabled item pipelines: 
hi
2015-07-09 11:46:02 [scrapy] INFO: Spider opened
2015-07-09 11:46:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-09 11:46:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-09 11:46:03 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:09 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] ERROR: Error downloading <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] INFO: Closing spider (finished)
2015-07-09 11:46:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
 'downloader/request_bytes': 780,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 9, 6, 16, 13, 793446),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 7, 9, 6, 16, 2, 890066)}
2015-07-09 11:46:13 [scrapy] INFO: Spider closed (finished)

As you can see in my code, I have also set DOWNLOAD_DELAY=5, but it still gives the same errors as when I didn't set it. I also increased DOWNLOAD_DELAY to 10, and it still gives the same errors. I have read many related questions on Stack Overflow and on GitHub, but none of them seem to help.

I read in one of the answers that Tor with Polipo can help. But I am a bit doubtful about using it, because I don't know whether it is legal to use the combination of Tor with Polipo to scrape websites using Scrapy. (I don't want to run into any legal trouble.) That is why I preferred not to use it. So, if it is legal, please provide code for my SPECIFIC CASE, using Tor and Polipo.

Or, if that is illegal, help me resolve it without using them.

Please help me resolve these errors!

EDIT:

This is my updated code:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess, CrawlerRunner
import scrapy
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
import datetime
from multiprocessing import Process, Queue
import os
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import re

query='shirt'
query1=query.replace(" ", "+")  


class DmozItem(scrapy.Item):

    productname = scrapy.Field()
    product_link = scrapy.Field()
    current_price = scrapy.Field()
    mrp = scrapy.Field()
    offer = scrapy.Field()
    imageurl = scrapy.Field()
    outofstock_status = scrapy.Field()




class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://www.bewakoof.com"]

    def _monkey_patching_HTTPClientParser_statusReceived(self):

        from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError
        old_sr = HTTPClientParser.statusReceived
        def statusReceived(self, status):
            try:
                return old_sr(self, status)
            except ParseError, e:
                if e.args[0] == 'wrong number of parts':
                    return old_sr(self, status + ' OK')
                raise
        statusReceived.__doc__ = old_sr.__doc__
        HTTPClientParser.statusReceived = statusReceived




    def start_requests(self):

        task_urls = [
        ]
        i=1
        for i in range(1,2):
            temp = "http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1"
            task_urls.append(temp)
            i=i+1

        start_urls = (task_urls)
        p=len(task_urls)
        print 'hi'
        self._monkey_patching_HTTPClientParser_statusReceived()
        return [ Request(url = start_url) for start_url in start_urls ]
        print 'hi'

    def parse(self, response):
        print 'hi'
        print response
        items = []
        for sel in response.xpath('//html/body/div[@class="main-div-of-product-item"]'):
            item = DmozItem()
            item['productname'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@title').extract())[17:-6]
            item['product_link'] = "http://www.bewakoof.com"+str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@href').extract())[3:-2]
            item['current_price']='Rs. ' + str(sel.xpath('div[1]/div[@class="product_info"]/div[@class="product_price_nomrp"]/span[1]/text()').extract())[3:-2]

            item['mrp'] = item['current_price']

            item['offer'] = str('No additional offer available')

            item['imageurl'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@data-original').extract())[3:-2]
            item['outofstock_status'] = str('In Stock')
            items.append(item)

        print (items)

spider1 = DmozSpider()
settings = Settings()
settings.set("PROJECT", "dmoz")
settings.set("DOWNLOAD_DELAY" , 5)
crawler = CrawlerProcess(settings)
crawler.crawl(spider1)
crawler.start()

And this is my updated output, as displayed in the terminal:

2015-07-10 13:06:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-10 13:06:00 [scrapy] INFO: Optional features available: ssl, http11
2015-07-10 13:06:00 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 5}
2015-07-10 13:06:01 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-10 13:06:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-10 13:06:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-10 13:06:01 [scrapy] INFO: Enabled item pipelines: 
hi
2015-07-10 13:06:01 [scrapy] INFO: Spider opened
2015-07-10 13:06:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-10 13:06:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-10 13:06:02 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:08 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:12 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:12 [scrapy] ERROR: Error downloading <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:13 [scrapy] INFO: Closing spider (finished)
2015-07-10 13:06:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
 'downloader/request_bytes': 780,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 10, 7, 36, 13, 11023),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 7, 10, 7, 36, 1, 114912)}
2015-07-10 13:06:13 [scrapy] INFO: Spider closed (finished)

So, as you can see, the errors are still the same! :( Please help me resolve this!

UPDATED:

This is the output when I try to catch the exception, as @JoeLinux suggested:

>>> try:
...     fetch("http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1")
... except Exception as e:
...     e
... 
2015-07-10 17:51:13 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 17:51:14 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 17:51:15 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
ResponseFailed([<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>],)
>>> print e.reasons[0].getTraceback()
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 614, in _doReadOrWrite
    why = selectable.doRead()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 214, in doRead
    return self._dataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 220, in _dataReceived
    rval = self.protocol.dataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/endpoints.py", line 114, in dataReceived
    return self._wrappedProtocol.dataReceived(data)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 1523, in dataReceived
    self._parser.dataReceived(bytes)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 382, in dataReceived
    HTTPParser.dataReceived(self, data)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 571, in dataReceived
    why = self.lineReceived(line)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 271, in lineReceived
    self.statusReceived(line)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 409, in statusReceived
    raise ParseError("wrong number of parts", status)
twisted.web._newclient.ParseError: ('wrong number of parts', 'HTTP/1.1 500')
Ashutosh Saboo

2 Answers

I got the same error

[<twisted.python.failure.Failure twisted.web._newclient.ParseError: (u'wrong number of parts', 'HTTP/1.1 302')>]

and after the changes below, it now works.

I think you could try this:

  • in the method _monkey_patching_HTTPClientParser_statusReceived, change

        from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError

    to

        from twisted.web._newclient import HTTPClientParser, ParseError

  • in the method start_requests, call _monkey_patching_HTTPClientParser_statusReceived for every request in start_urls, for example:

        def start_requests(self):
            for url in self.start_urls:
                self._monkey_patching_HTTPClientParser_statusReceived()
                yield Request(url, dont_filter=True)

Hope it helps.
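
For reference, here is a minimal sketch of the spider with both changes applied. It assumes Scrapy 1.0 with Twisted's twisted.web._newclient module available, and it keeps the asker's hard-coded search URL; it is meant to show where the pieces go, not to be a drop-in final version. (Note that allowed_domains is given without the http:// scheme here, since Scrapy expects bare domain names in that list.)

import scrapy
from scrapy.http import Request


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["bewakoof.com"]
    start_urls = [
        "http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1",
    ]

    def _monkey_patching_HTTPClientParser_statusReceived(self):
        # Patch Twisted's parser directly (not scrapy.xlib.tx), so a malformed
        # status line such as 'HTTP/1.1 500' (only two parts, no reason phrase)
        # is padded to 'HTTP/1.1 500 OK' instead of raising ParseError.
        from twisted.web._newclient import HTTPClientParser, ParseError
        old_sr = HTTPClientParser.statusReceived

        def statusReceived(self, status):
            try:
                return old_sr(self, status)
            except ParseError as e:
                if e.args[0] == 'wrong number of parts':
                    return old_sr(self, status + ' OK')
                raise
        statusReceived.__doc__ = old_sr.__doc__
        HTTPClientParser.statusReceived = statusReceived

    def start_requests(self):
        # Apply the patch before each request, as suggested above.
        for url in self.start_urls:
            self._monkey_patching_HTTPClientParser_statusReceived()
            yield Request(url, dont_filter=True)

    def parse(self, response):
        # Just confirm the page now downloads; keep your existing XPaths here.
        print response.status, len(response.body)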

Michael

I was able to replicate your situation in scrapy shell. Here is the error I received in the interactive shell:

$ scrapy shell 
...
>>> try:
>>>    fetch("http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1")
>>> except Exception as e:
>>>    e
2015-07-09 13:53:37-0400 [default] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 13:53:38-0400 [default] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 13:53:38-0400 [default] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
>>> print e.reasons[0].getTraceback()
...
twisted.web._newclient.ParseError: ('wrong number of parts', 'HTTP/1.1 500')

Note that where I put ..., there are lines of text that aren't as important. That last line shows "wrong number of parts". After a little googling, I found this issue:

Error download page: twisted.python.failure.Failure 'scrapy.xlib.tx._newclient.ParseError'

The best that was suggested there was a monkeypatch. Read through the thread and give that a shot.
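
If you want to see for yourself what "wrong number of parts" means here, a quick raw-socket check (standard library only, outside Scrapy) prints the exact status line the server sends back. This is just a diagnostic sketch; the response you get may differ.

import socket

host = "www.bewakoof.com"
path = "/search/searchload/search_text/shirt/page_num/1"

# Send a bare GET and read only the status line. A well-formed line has three
# parts, e.g. 'HTTP/1.1 500 Internal Server Error'; a two-part line such as
# 'HTTP/1.1 500' is what makes Twisted's parser raise ParseError.
s = socket.create_connection((host, 80), timeout=10)
s.sendall("GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n" % (path, host))
status_line = s.makefile().readline().rstrip("\r\n")
s.close()

print repr(status_line)
print len(status_line.split(" ")), "parts"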

JoeLinux
  • Please check my question! I have edited the above question! Please help me resolve it, as I am still unable to resolve it! :( – Ashutosh Saboo Jul 10 '15 at 07:38
  • What do you get when you follow my troubleshooting steps above? – JoeLinux Jul 10 '15 at 10:38
  • As in which ones? I didn't understand which ones you mean. – Ashutosh Saboo Jul 10 '15 at 10:39
  • The first ones I mentioned, in order to get the specific reasons for the ParseError. – JoeLinux Jul 10 '15 at 10:40
  • No, but since you haven't shown those '...' lines, what should I change in my code, and where should I put it? Please elaborate a bit using my code; I still didn't understand what code to put where. I am sorry, but if you could provide code showing what to change and where to put it in my code, that would be better. – Ashutosh Saboo Jul 10 '15 at 10:45
  • Maybe you could provide the code after pasting it at paste.ubuntu.com . – Ashutosh Saboo Jul 10 '15 at 10:49
  • Type `scrapy shell` in a terminal, then type every command I showed you following the `>>>` prompt. It's all there. Everything I excluded with "..." is output, not input. – JoeLinux Jul 10 '15 at 11:18
  • Please check out my question again. I have updated it with the output you suggested I print. This is what I wrote - http://postimg.org/image/yzaf7v2wp/ - and then I pressed Ctrl+Enter to execute it. – Ashutosh Saboo Jul 10 '15 at 12:18
  • You still didn't do the last command: `print e.reasons[0].getTraceback()`. That's the important one. – JoeLinux Jul 10 '15 at 12:19
  • Oh! Sorry! Now, I have updated the question. Please check it out! – Ashutosh Saboo Jul 10 '15 at 12:23
  • You're getting a "500" error from their server. There could be any number of reasons for that. You should try modifying your USER_AGENT string, and maybe inspecting the headers and cookies that get sent in a regular browser session and duplicating some of them. But we might be at about the limit of what I can help you with, unfortunately. – JoeLinux Jul 10 '15 at 12:36
  • But if you look closely, your output and my output are the same. You also got a 500 error. So does that mean the website's server doesn't allow Scrapy bots at all? – Ashutosh Saboo Jul 10 '15 at 12:41
  • That's what I mean when I say modify your request headers. If they reject Scrapy bots, then don't make the request look like it came from a Scrapy bot. Make it look like a Firefox request by modifying the USER_AGENT string and duplicating some valid request headers that you can find yourself from a valid browser session. Check in DevTools/Firebug to see what headers might be present to indicate a valid session to the web server (a sketch of this follows the thread). – JoeLinux Jul 10 '15 at 12:45
  • Ohh! I will try that out, and maybe let you know once I have! Meanwhile, I may sound a bit pushy, but could you try helping me with this - http://stackoverflow.com/questions/31340373/scrapy-spider-scripts-issue-when-called-from-a-single-script-multiprocessing-i ? Maybe I shouldn't ask, but since you were really helpful and willing to help, could you help me with that question too? – Ashutosh Saboo Jul 10 '15 at 12:52
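
Following up on the header suggestion in the last few comments, here is a minimal sketch of how the crawler setup from the question could send browser-like headers. The User-Agent string and header values are illustrative examples only (copy real ones from a browser session in DevTools/Firebug), and there is no guarantee the site will accept them.

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

# Example values only -- replace with headers observed in a real browser session.
settings = Settings()
settings.set("DOWNLOAD_DELAY", 5)
settings.set("USER_AGENT",
             "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) "
             "Gecko/20100101 Firefox/39.0")
settings.set("DEFAULT_REQUEST_HEADERS", {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "X-Requested-With": "XMLHttpRequest",  # the search endpoint is an XHR call
})

crawler = CrawlerProcess(settings)
crawler.crawl(DmozSpider)  # the spider class from the question
crawler.start()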