I am new to Scrapy. I am trying to scrape an ASP.NET website that contains various profiles, spread over a total of 259 pages. To navigate between the pages there are links at the bottom (1, 2, 3, and so on), and these links use __doPostBack:

href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$RepeaterPaging$ctl00$Pagingbtn','')"

For each page, only the bold part changes. How do I use Scrapy to iterate over the pages and extract the information? The form data is as follows:

__EVENTTARGET: ctl00%24ContentPlaceHolder1%24RepeaterPaging%24ctl01%24Pagingbtn
__EVENTARGUMENT: 
__VIEWSTATE: %2FwEPDwUKMTk1MjIxNTU1Mw8WAh4HdG90cGFnZQKDAhYCZg9kFgICAw9kFgICAQ9kFgoCAQ8WAh4LXyFJdGVtQ291bnQCFBYoZg9kFgJmDxUFCDY0MzMuanBnCzggR2VtcyBMdGQuCzggR2VtcyBMdGQuBDY0MzMKOTgyOTEwODA3MGQCAQ9kFgJmDxUFCDMzNTkuanBnCDkgSmV3ZWxzCDkgSmV3ZWxzBDMzNTkKOTg4NzAwNzg4OGQCAg9kFgJmDxUFCDc4NTEuanBnD0EgLSBTcXVhcmUgR2Vtcw9BIC0gU3F1YXJlIEdlbXMENzg1MQo5OTI5NjA3ODY4ZAIDD2QWAmYPFQUIMTg3My5qcGcLQSAmIEEgSW1wZXgLQSAmIEEgSW1wZXgEMTg3Mwo5MzE0Njk1ODc0ZAIED2QWAmYPFQUINzc5Ni5qcGcTQSAmIE0gR2VtcyAmIEpld2VscxNBICYgTSBHZW1zICYgSmV3ZWxzBDc3OTYKOTkyOTk0MjE4NWQCBQ9kFgJmDxUFCDc2NjYuanBnDEEgQSBBICBJbXBleAxBIEEgQSAgSW1wZXgENzY2Ngo4MjkwNzkwNzU3ZAIGD2QWAmYPFQUINjM2OC5qcGcaQSBBIEEgJ3MgIEdlbXMgQ29ycG9yYXRpb24aQSBBIEEgJ3MgIEdlbXMgQ29ycG9yYXRpb24ENjM2OAo5ODI5MDU2MzM0ZAIHD2QWAmYPFQUINjM2OS5qcGcPQSBBIEEgJ3MgSmV3ZWxzD0EgQSBBICdzIEpld2VscwQ2MzY5Cjk4MjkwNTYzMzRkAggPZBYCZg8VBQg3OTQ3LmpwZwxBIEcgIFMgSW1wZXgMQSBHICBTIEltcGV4BDc5NDcKODk0Nzg2MzExNGQCCQ9kFgJmDxUFCDc4ODkuanBnCkEgTSBCIEdlbXMKQSBNIEIgR2VtcwQ3ODg5Cjk4MjkwMTMyODJkAgoPZBYCZg8VBQgzNDI2LmpwZxBBIE0gRyAgSmV3ZWxsZXJ5EEEgTSBHICBKZXdlbGxlcnkEMzQyNgo5MzE0NTExNDQ0ZAILD2QWAmYPFQUIMTgyNS5qcGcWQSBOYXR1cmFsIEdlbXMgTi4gQXJ0cxZBIE5hdHVyYWwgR2VtcyBOLiBBcnRzBDE4MjUKOTgyODAxMTU4NWQCDA9kFgJmDxUFCDU3MjYuanBnC0EgUiBEZXNpZ25zC0EgUiBEZXNpZ25zBDU3MjYAZAIND2QWAmYPFQUINzM4OS5qcGcOQSBSYXdhdCBFeHBvcnQOQSBSYXdhdCBFeHBvcnQENzM4OQBkAg4PZBYCZg8VBQg1NDcwLmpwZxBBLiBBLiAgSmV3ZWxsZXJzEEEuIEEuICBKZXdlbGxlcnMENTQ3MAo5OTI4MTA5NDUxZAIPD2QWAmYPFQUIMTg5OS5qcGcSQS4gQS4gQS4ncyBFeHBvcnRzEkEuIEEuIEEuJ3MgRXhwb3J0cwQxODk5Cjk4MjkwNTYzMzRkAhAPZBYCZg8VBQg0MDE5LmpwZwpBLiBCLiBHZW1zCkEuIEIuIEdlbXMENDAxOQo5ODI5MDE2Njg4ZAIRD2QWAmYPFQUIMzM3OS5qcGcPQS4gQi4gSmV3ZWxsZXJzD0EuIEIuIEpld2VsbGVycwQzMzc5Cjk4MjkwMzA1MzZkAhIPZBYCZg8VBQgzMTc5LmpwZwxBLiBDLiBSYXRhbnMMQS4gQy4gUmF0YW5zBDMxNzkKOTgyOTY2NjYyNWQCEw9kFgJmDxUFCDc3NTEuanBnD0EuIEcuICYgQ29tcGFueQ9BLiBHLiAmIENvbXBhbnkENzc1MQo5ODI5MTUzMzUzZAIDDw8WAh4HRW5hYmxlZGhkZAIFDw8WAh8CaGRkAgcPPCsACQIADxYEHghEYXRhS2V5cxYAHwECCmQBFgQeD0hvcml6b250YWxBbGlnbgsqKVN5c3RlbS5XZWIuVUkuV2ViQ29udHJvbHMuSG9yaXpvbnRhbEFsaWduAh4EXyFTQgKAgAQWFGYPZBYCAgEPDxYKHg9Db21tYW5kQXJndW1lbnQFATAeBFRleHQFATEeCUJhY2tDb2xvcgoAHwJoHwUCCGRkAgEPZBYCAgEPDxYEHwYFATEfBwUBMmRkAgIPZBYCAgEPDxYEHwYFATIfBwUBM2RkAgMPZBYCAgEPDxYEHwYFATMfBwUBNGRkAgQPZBYCAgEPDxYEHwYFATQfBwUBNWRkAgUPZBYCAgEPDxYEHwYFATUfBwUBNmRkAgYPZBYCAgEPDxYEHwYFATYfBwUBN2RkAgcPZBYCAgEPDxYEHwYFATcfBwUBOGRkAggPZBYCAgEPDxYEHwYFATgfBwUBOWRkAgkPZBYCAgEPDxYEHwYFATkfBwUCMTBkZAINDw8WAh8HBQ1QYWdlIDEgb2YgMjU5ZGRkfEDzDJt%2FoSnSGPBGHlKDPRi%2Fbk0%3D
__EVENTVALIDATION: %2FwEWDALTg7oVAsGH9qQBAsGHisMBAsGHjuEPAsGHotEBAsGHpu8BAsGHupUCAsGH%2FmACwYeS0QICwYeW7wIC%2FLHNngECkI3CyQtVVahoNpNIXsQI6oDrxjKGcAokIA%3D%3D

I have looked at several posts suggesting to inspect the parameters of the POST call and reuse them, but I am not able to make sense of the parameters shown above.


1 Answer

In short, all you need to do is send __EVENTTARGET, __EVENTARGUMENT, __VIEWSTATE and __EVENTVALIDATION (see the sketch after this list):

  • __EVENTTARGET: ctl00$ContentPlaceHolder1$RepeaterPaging$ctl00$Pagingbtn; change the ctl00 index (the bold part) to request different pages.
  • __EVENTARGUMENT: always empty
  • __VIEWSTATE: in an input tag with id __VIEWSTATE
  • __EVENTVALIDATION: in an input tag with id __EVENTVALIDATION
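
If you do not want to copy __VIEWSTATE and __EVENTVALIDATION by hand, scrapy.FormRequest.from_response can pre-fill them from the page's <form> for you. Here is a minimal sketch of that alternative (not part of the demo below; the spider name, the logging and the ctl01 target for page 2 are just illustrative, and it assumes the member list form is the first <form> on the page):

import scrapy

class MemberListSpider(scrapy.Spider):
  name = "member_list"  # illustrative name
  start_urls = ["http://www.jajaipur.com/Member_List.aspx"]

  def parse(self, response):
    # from_response copies the hidden ASP.NET fields (__VIEWSTATE,
    # __EVENTVALIDATION, ...) from the form, so we only add the
    # postback target ourselves.
    yield scrapy.FormRequest.from_response(
      response,
      formdata={
        # ctl01 -> page 2, just as an example
        "__EVENTTARGET": "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl01$Pagingbtn",
        "__EVENTARGUMENT": "",
      },
      dont_click=True,  # a __doPostBack is not a submit-button click
      callback=self.parse_page,
    )

  def parse_page(self, response):
    # same XPath as used further down; adjust it if the markup differs
    names = response.xpath('//*[@id="aspnetForm"]/div/section/div/div/div[1]/div/h3/text()').extract()
    self.logger.info(names)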

It's worth mentioning that when you extract the names, the actual XPath may differ from the one you copy from Chrome.

Actual xpath: //*[@id="aspnetForm"]/div/section/div/div/div[1]/div/h3/text()
Chrome version: //*[@id="aspnetForm"]/div[3]/section/div/div/div[1]/div/h3/text()

Update: for pages beyond page 5, you should re-extract __VIEWSTATE and __EVENTVALIDATION every time, and use "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl06$Pagingbtn" as the __EVENTTARGET to get the next page.

The 00 part in __EVENTTARGET is relative to the current page, for example (a small helper putting this rule into code follows the diagram):

 1  2  3  4  5  6  7  8  9 10
00 01 02 03 04 05 06 07 08 09
               ^^
To get page 7: use index 06
------------------------------
 2  3  4  5  6  7  8  9 10 11
00 01 02 03 04 05 06 07 08 09
               ^^
To get page 8: use index 06
------------------------------
12 13 14 15 16 17 18 19 20 21
00 01 02 03 04 05 06 07 08 09
               ^^
To get page 18: use index 06
------------------------------
current page: ^^
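
If you prefer to have that rule in code, here is a hypothetical helper (not something the site exposes, it just mirrors the diagram above) that returns the __EVENTTARGET for the link leading to the next page:

def next_page_eventtarget(current_page):
  # Mirrors the pager layout above: for pages 1-5 the window is fixed at
  # 1-10, so page p sits at repeater index p-1 and the next page at index p;
  # from page 6 onwards the current page is always rendered at index 05,
  # so the next page is always index 06.
  index = current_page if current_page <= 5 else 6
  return "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl%02d$Pagingbtn" % index

# next_page_eventtarget(4)  -> ctl04 (page 5)
# next_page_eventtarget(18) -> ctl06 (page 19)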

The other part of __EVENTTARGET stays the same, which means the current page is encoded in __VIEWSTATE (and __EVENTVALIDATION? not so sure, but it doesn't matter). We can extract them and send them back to show the server that we are now at page 10, 100, ...

To get the next page, we can use a fixed __EVENTTARGET: ctl00$ContentPlaceHolder1$RepeaterPaging$ctl06$Pagingbtn.

Of course, you can use ctl00$ContentPlaceHolder1$RepeaterPaging$ctl07$Pagingbtn to jump ahead two pages.


Here is a demo (updated):

# SO Debug Spider
# OUTPUT: 2018-07-22 10:54:31 [SOSpider] INFO: ['Aadinath Gems & Jewels']
# The first person of page 4 is Aadinath Gems & Jewels
#
# OUTPUT: 2018-07-23 10:52:07 [SOSpider] ERROR: ['Ajay Purohit']
# The first person of page 12 is Ajay Purohit

import scrapy

class SOSpider(scrapy.Spider):
  name = "SOSpider"
  url = "http://www.jajaipur.com/Member_List.aspx"

  def start_requests(self):
    yield scrapy.Request(url=self.url, callback=self.parse_form_0_5)

  def parse_form_0_5(self, response):
    selector = scrapy.Selector(response=response)
    VIEWSTATE = selector.xpath('//*[@id="__VIEWSTATE"]/@value').extract_first()
    EVENTVALIDATION = selector.xpath('//*[@id="__EVENTVALIDATION"]/@value').extract_first()

    # It's fine to use this method from page 1 to page 5
    formdata = {
      # change pages here
      "__EVENTTARGET": "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl03$Pagingbtn",
      "__EVENTARGUMENT": "",
      "__VIEWSTATE": VIEWSTATE,
      "__EVENTVALIDATION": EVENTVALIDATION,
    }
    yield scrapy.FormRequest(url=self.url, formdata=formdata, callback=self.parse_0_5)

    # After page 5, you should try this
    # get page 6
    formdata["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl05$Pagingbtn"
    yield scrapy.FormRequest(url=self.url, formdata=formdata, callback=self.parse, meta={"PAGE": 6})

  def parse(self, response):
    # use a metadata to control when to break
    currPage = response.meta["PAGE"]
    if (currPage == 15):
      return

    # extract names here
    selector = scrapy.Selector(response=response)
    names = selector.xpath('//*[@id="aspnetForm"]/div/section/div/div/div[1]/div/h3/text()').extract()
    self.logger.error(names)

    # parse VIEWSTATE and EVENTVALIDATION again, 
    # which contain current page
    VIEWSTATE = selector.xpath('//*[@id="__VIEWSTATE"]/@value').extract_first()
    EVENTVALIDATION = selector.xpath('//*[@id="__EVENTVALIDATION"]/@value').extract_first()

    # get next page
    formdata = {
      # 06 is the next 1 page, 07 is the next 2 page, ...
      "__EVENTTARGET": "ctl00$ContentPlaceHolder1$RepeaterPaging$ctl06$Pagingbtn",
      "__EVENTARGUMENT": "",
      "__VIEWSTATE": VIEWSTATE,
      "__EVENTVALIDATION": EVENTVALIDATION,
    }
    yield scrapy.FormRequest(url=self.url, formdata=formdata, callback=self.parse, meta={"PAGE": currPage+1})

  def parse_0_5(self, response):
    selector = scrapy.Selector(response=response)
    # only extract name
    names = selector.xpath('//*[@id="aspnetForm"]/div/section/div/div/div[1]/div/h3/text()').extract()
    self.logger.error(names)
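
If you want to try the demo outside a Scrapy project, scrapy runspider should be enough (for example scrapy runspider sospider.py, with the filename being whatever you saved it as); inside a project you would run scrapy crawl SOSpider instead.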
  • Thank you so much, it worked like a charm :), but I am facing another issue: it's doable up to page 9, after which "__EVENTTARGET" contains the same values as before (00, 01, 02, ...), even for pages 250, 251, 252. So my question is, how do I extract data beyond page 9? – Pramay Nikhade Jul 22 '18 at 09:36
  • @PramayNikhade I updated my answer. You can parse and send the `__VIEWSTATE` and `__EVENTVALIDATION` every time to show the server which page you are visiting. Then you can use index 06 to get the next page. – Vic Jul 23 '18 at 03:35