0

I have to scrape a page on a site to which I need to POST a parameter, and I have an array of values to request the same page with. I don't want to scrape the page for each value of the array sequentially; I want to scrape them in parallel (that is, request "google.com/query=a" and "google.com/query=b" at the same time). At the moment I can't do this because of the site's login restriction: the site allows only one login session per account at a time, so if the account is logged in through one browser and we log in again from another browser (e.g. incognito mode), the first session is logged out.

As I see it, there are two possibilities. First, I could save the cookie values to a file once logged in, so that I can reuse the same cookie when the same spider runs with a different parameter. I tried to set the cookie manually, but it is not being attached to the request properly, so the request fails. (If I could save the cookie and set it on the request manually, I planned to run the spider via scrapyd with different parameter values.)

Second, I could scrape the page in parallel using multiprocessing or another parallelism approach from Python's standard library.

But from some posts I have read, multiprocessing is not recommended for scraping.

Does anyone have an idea about this, or about saving a cookie so it can be used by the next instance of the spider? I can post all the code I have tried, but it is quite long, so I will wait for someone's input and then update the question with the relevant parts.

Manikandan Arunachalam

1 Answer

0

Scrapy executes requests concurrently by default. There are 8 concurrent requests per domain by default; you can raise this by changing the CONCURRENT_REQUESTS_PER_DOMAIN setting.
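
For example, in settings.py (the numbers here are only illustrative; tune them to what the target site tolerates):

# settings.py -- illustrative values, not a recommendation for any specific site
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 32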

Cookies are managed by the default cookie middleware, and it is very rare to need to store them between crawls. Just log in at the start of every crawl and it should be fine: log in when the spider starts, then process all your requests concurrently, reusing one spider session.

from scrapy import Spider, Request, FormRequest


class SomeSpider(Spider):
    name = "some_spider"

    def start_requests(self):
        # log in first; the login URL and form field names are placeholders
        yield FormRequest("http://login.here.com",
                          formdata={"user": "foo", "password": "bar"},
                          callback=self.logged_in)

    def logged_in(self, response):
        # check if login went fine, then
        # make all requests you need; they will be scheduled concurrently
        for url in self.list_of_urls_to_start_with:
            yield Request(url, callback=self.process_data)

    def process_data(self, response):
        # extract things, yield items
        pass

Persisting cookies between different crawls is currently not supported out of the box. If you want to store cookies between spider runs, you have to serialize the cookielib.Cookie and cookielib.CookieJar objects yourself, e.g. to JSON or pickle, store them somewhere (a small database or key-value store such as redis or anydbm would work), and load them back into the spider. This can be done in the cookie middleware: load the cookies on spider_opened and save them on spider_closed. If you really need this functionality, subclass the default cookie middleware and add the features you need, as in the sketch below.
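
A rough sketch of such a subclass, assuming pickle serialization to a plain file rather than a database; self.jars and the default jar key (None) are middleware internals that may differ between Scrapy versions, and the COOKIES_FILE name is made up for illustration:

import pickle

from scrapy import signals
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware


class PersistentCookiesMiddleware(CookiesMiddleware):

    COOKIES_FILE = "cookies.pkl"  # hypothetical path for the serialized cookies

    @classmethod
    def from_crawler(cls, crawler):
        mw = super(PersistentCookiesMiddleware, cls).from_crawler(crawler)
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def spider_opened(self, spider):
        # load previously saved cookies into the default cookie jar
        try:
            with open(self.COOKIES_FILE, "rb") as f:
                for cookie in pickle.load(f):
                    self.jars[None].set_cookie(cookie)
        except IOError:
            pass  # first run, nothing saved yet

    def spider_closed(self, spider):
        # dump every cookie from the default jar to disk
        with open(self.COOKIES_FILE, "wb") as f:
            pickle.dump(list(self.jars[None]), f)

To use it you would also replace the stock scrapy.downloadermiddlewares.cookies.CookiesMiddleware entry in DOWNLOADER_MIDDLEWARES with this class.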

Pawel Miech
  • I am getting this error when using yield. Filtered duplicate request: - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates) – Manikandan Arunachalam Apr 06 '16 at 07:17
  • If you're issuing duplicate requests, they are filtered. Why are you issuing a request to the same URL twice? If you have good reasons for that, disable the duplicate filter with the dont_filter=True argument for Request() – Pawel Miech Apr 06 '16 at 07:19
  • I need to request the same URL with different parameters to be POSTed. – Manikandan Arunachalam Apr 06 '16 at 07:20
  • Querystring parameters are not collapsed by the duplicate filter, e.g. http://foo.com?a=b and http://foo.com?a=alfa will both be issued, not filtered. POST requests with different bodies are also not filtered out by default. – Pawel Miech Apr 06 '16 at 07:26
  • OK, now I am not getting any errors, but my requests are not executed concurrently. How can I make the requests run concurrently? – Manikandan Arunachalam Apr 06 '16 at 07:28
  • Why do you think they are not concurrent? To check this you can telnet into the spider and use scrapy.utils.trackref to see which requests are being downloaded. Usually there will be a bunch of them: http://doc.scrapy.org/en/latest/topics/leaks.html#debugging-memory-leaks-with-trackref – Pawel Miech Apr 06 '16 at 08:21
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/108376/discussion-between-manikandan-arunachalam-and-pawel-miech). – Manikandan Arunachalam Apr 06 '16 at 09:25