
I followed the post "Scraping Websites Based on ViewStates with Scrapy" to scrape a site that is almost identical. It works well, but my site has many items and therefore many pages. I can navigate to other pages, but only those whose links are visible in the pager of the page I am on. The pager shows up to 10 pages at a time, so the ViewState from page 1 only works for the first set of 10. When I request a page in the next set, say page 14, no data comes back because the request still uses the ViewState from page 1.

Here is the code. First I load page 1, then use it to jump to the last page to determine the number of pages. Then I loop through each page. In the loop, the response being passed is the one from the last page, so it only works for the last 10 pages, which are the ones visible in the last page's pager.

def parse(self, response):
    # Fetch the first page from the site
    formdata = update_formdata(FORM_DATA, response)
    formdata["ctl00$Body$ButtonSubmit"] = "Submit"

    # Pass the formdata to mimic what a user does in the browser
    yield scrapy.FormRequest(
        response.url, formdata=formdata, callback=self.parse_first_page
    )

def parse_first_page(self, response):
    # get first page actual data
    yield from get_data(response)

    # Check if there is a last page
    last_page = response.css(
        "tr.pager table td:last-child a::text"
    ).get() or response.css(
        "tr.pager table td:last-child a font::text"
    ).get()

    if last_page is not None:
        last_page = last_page.strip().lower()

        if last_page == "last page":
            # Load the data for the last page
            formdata = update_formdata(FORM_DATA, response)
            formdata["__EVENTARGUMENT"] = "Page$Last"
            formdata["__EVENTTARGET"] = "ctl00$Body$GridView1"
            if formdata.get("ctl00$Body$ButtonSubmit", None) is not None:
                del formdata["ctl00$Body$ButtonSubmit"]
            yield scrapy.FormRequest(
                response.url,
                formdata=formdata,
                callback=self.parse_last_page,
            )

        elif last_page.isdigit():
            last_page_num = int(last_page)
            yield from self.parse_other_pages(
                last_page_num, response
            )
        else:
            self.logger.error("No last Page")

def parse_last_page(self, response):
    # Get last page actual data
    yield from get_data(response)

    # Get the last page number
    last_page_num = response.css(
        "tr.pager table td:last-child span::text"
    ).get()
    if last_page_num is not None:
        counter = int(last_page_num) - 1
        yield from self.parse_other_pages(counter, response)

def parse_other_pages(self, page_num, response):  # response is the last page
    # Loop from the given page number down to page 2
    # NOTE: every request below reuses the last page's response;
    # this probably needs to be the current page's response instead
    while page_num >= 2:
        formdata = update_formdata(FORM_DATA, response)
        formdata["__EVENTTARGET"] = "ctl00$Body$GridView1"
        if formdata.get("ctl00$Body$ButtonSubmit", None) is not None:
            del formdata["ctl00$Body$ButtonSubmit"]

        formdata.update(__EVENTARGUMENT="Page$" + str(page_num))
        page_num_cpy = page_num
        page_num -= 1
        yield scrapy.FormRequest(
            response.url,
            method="POST",
            formdata=formdata,
            callback=self.parse_results,
            dont_filter=True,
            headers=HEADERS,
            priority=1,
            meta={"page_num": page_num_cpy},
        )

def parse_results(self, response):  # response is the current page
    # Get the actual data for all the other pages
    yield from get_data(response)

EDIT: The question "how to scrape a page request using Viewstate parameter?" explains what I am already doing. My issue is not how to fetch the ViewState from a response and pass it to the next request; I can already do that. My issue is that I need to update the response inside the loop so that each request carries the ViewState of the page before it. Right now every request carries the ViewState of the last page, which stops being valid once I go more than about 10 pages away from it.
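To make concrete what I mean by "update the response within the loop", here is a minimal sketch of the chaining I am after, written without Scrapy so it can be run on its own. It assumes that the hidden fields of page N are always enough to request page N + 1. `next_page_formdata`, `walk_pages`, and the `fetch` callable are hypothetical names used only for illustration; the field names are the ones from my code above.

```python
from typing import Callable, Dict, Iterator, Optional, Tuple

GRID_TARGET = "ctl00$Body$GridView1"  # same target as in the spider above


def next_page_formdata(
    hidden_fields: Dict[str, str], current_page: int, last_page: int
) -> Optional[Dict[str, str]]:
    """Build the postback payload for page current_page + 1.

    hidden_fields must come from the response for current_page, so the
    ViewState matches the pager that the server rendered most recently.
    Returns None when current_page is already the last page.
    """
    if current_page >= last_page:
        return None
    formdata = dict(hidden_fields)  # copy so the caller's dict is untouched
    formdata.pop("ctl00$Body$ButtonSubmit", None)  # only used for the first submit
    formdata["__EVENTTARGET"] = GRID_TARGET
    formdata["__EVENTARGUMENT"] = f"Page${current_page + 1}"
    return formdata


def walk_pages(
    fetch: Callable[[Dict[str, str]], Dict[str, str]],
    first_hidden_fields: Dict[str, str],
    last_page: int,
) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Visit pages 2..last_page one at a time.

    `fetch` is a hypothetical stand-in for one request/response round trip:
    it takes a payload and returns the hidden fields of the page it loaded.
    """
    hidden, page = first_hidden_fields, 1
    while True:
        payload = next_page_formdata(hidden, page, last_page)
        if payload is None:
            break
        hidden = fetch(payload)  # the new page's fields replace the old ones
        page += 1
        yield page, payload
```

In a Scrapy spider there is no blocking `fetch`, so I assume the equivalent would be for `parse_results` to yield the request for the next page itself (carrying the page number in `meta`), instead of `parse_other_pages` yielding every request up front from a single response.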

The site I am scraping is https://www.mevzuat.gov.tr/Kanunlar.aspx

Phillis Peters
  • Why don't you use `scrapy.FormRequest.from_response`? It will fill almost all fields for you automatically. – gangabass Nov 18 '19 at 00:38
  • Because `scrapy.FormRequest.from_response` takes the response and its data (which is the last page), so it never goes to the next page that I am assigning inside the while loop. – Phillis Peters Nov 18 '19 at 07:09
  • Does this answer your question? [how to scrape a page request using Viewstate parameter?](https://stackoverflow.com/questions/26034607/how-to-scrape-a-page-request-using-viewstate-parameter) – Gallaecio Nov 18 '19 at 11:17
  • @Gallaecio No, it does not answer my question. I have edited the question to explain. – Phillis Peters Nov 18 '19 at 12:04
  • Can you show your target website? – gangabass Nov 21 '19 at 12:33
  • @gangabass, here https://www.mevzuat.gov.tr/Kanunlar.aspx I have added it to the question as well. – Phillis Peters Nov 21 '19 at 15:18
