I followed the post "Scraping Websites Based on ViewStates with Scrapy" to scrape a site that is almost identical. It works well, but my site has many items and therefore many result pages. I can go to another page only if its link is visible in the pager of the page I am on. The pager shows at most 10 page links at a time, so the ViewState from page 1 only works for the first set of pages. When I request a page in the next set, say page 14, it is unable to get the data because the request still carries the ViewState from page 1.
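To make the limitation concrete, here is a minimal sketch of which page numbers can be requested with the ViewState of a given page. It assumes the pager shows fixed blocks of 10 numbers (1-10, 11-20, and so on), which is my reading of the site's behaviour and not something I have confirmed; `reachable_pages` and its parameters are hypothetical names.

```python
def reachable_pages(current_page, last_page, window=10):
    """Page numbers linked from the pager while ``current_page`` is shown.

    Assumption: the pager displays fixed blocks of ``window`` page numbers
    (1-10, 11-20, ...), clipped at ``last_page``.
    """
    block_start = ((current_page - 1) // window) * window + 1
    block_end = min(block_start + window - 1, last_page)
    return range(block_start, block_end + 1)


# With the ViewState of page 1, page 14 is outside the visible block;
# with the ViewState of page 11 it is inside.
print(14 in reachable_pages(1, last_page=50))
print(14 in reachable_pages(11, last_page=50))
```

Under that assumption, a request for page 14 has to be built from a response whose pager block contains 14.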
Here is the code. First I fetch page 1, then use it to go to the last page to determine the number of pages. Then I loop through each page. In the loop, the response passed in is the one from the last page, which only works for the last 10 pages, i.e. the ones visible in the last page's pager.

    def parse(self, response):
        # Fetch the first page from the site
        formdata = update_formdata(FORM_DATA, response)
        formdata["ctl00$Body$ButtonSubmit"] = "Submit"
        # Pass the formdata to mimic what a user does in the browser
        yield scrapy.FormRequest(
            response.url, formdata=formdata, callback=self.parse_first_page
        )

    def parse_first_page(self, response):
        # Get first page actual data
        yield from get_data(response)
        # Check if there is a last page
        last_page = response.css(
            "tr.pager table td:last-child a::text"
        ).get() or response.css(
            "tr.pager table td:last-child a font::text"
        ).get()
        if last_page is not None:
            last_page = last_page.strip().lower()
            if last_page == "last page":
                # Load the data for the last page
                formdata = update_formdata(FORM_DATA, response)
                formdata["__EVENTARGUMENT"] = "Page$Last"
                formdata["__EVENTTARGET"] = "ctl00$Body$GridView1"
                if formdata.get("ctl00$Body$ButtonSubmit", None) is not None:
                    del formdata["ctl00$Body$ButtonSubmit"]
                yield scrapy.FormRequest(
                    response.url,
                    formdata=formdata,
                    callback=self.parse_last_page,
                )
            elif last_page.isdigit():
                last_page_num = int(last_page)
                yield from self.parse_other_pages(
                    last_page_num, response
                )
        else:
            self.logger.error("No last Page")

    def parse_last_page(self, response):
        # Get last page actual data
        yield from get_data(response)
        # Get the last page number
        last_page_num = response.css(
            "tr.pager table td:last-child span::text"
        ).get()
        if last_page_num is not None:
            counter = int(last_page_num) - 1
            yield from self.parse_other_pages(counter, response)

    def parse_other_pages(self, page_num, response):  # last page response
        # Get the number of pages and loop through all the pages
        # Uses the last page response; needs to change to the current page response??
        while page_num >= 2:
            formdata = update_formdata(FORM_DATA, response)
            formdata["__EVENTTARGET"] = "ctl00$Body$GridView1"
            if formdata.get("ctl00$Body$ButtonSubmit", None) is not None:
                del formdata["ctl00$Body$ButtonSubmit"]
            formdata.update(__EVENTARGUMENT="Page$" + str(page_num))
            page_num_cpy = page_num
            page_num -= 1
            yield scrapy.FormRequest(
                response.url,
                method="POST",
                formdata=formdata,
                callback=self.parse_results,
                dont_filter=True,
                headers=HEADERS,
                priority=1,
                meta={"page_num": page_num_cpy},
            )

    def parse_results(self, response):  # current page response
        # Get the actual data for all the other pages
        yield from get_data(response)

EDIT: The question "How to scrape a page request using Viewstate parameter?" explains what I am already doing. My issue is not how to fetch the ViewState from the response and pass it to the next request; I can already do that. My issue is that I need to update the response inside the loop so that each request carries the ViewState of the page fetched just before it. Right now every request carries the ViewState of the last page, which stops working after about 10 pages.
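What I think I need, roughly, is to chain the requests instead of looping: each callback would build the request for the neighbouring page from its own response. Below is a minimal, untested sketch of that idea. `previous_page_formdata` and `stop_page` are hypothetical names I made up for illustration; the control names come from my code above, and the commented-out callback assumes my existing `update_formdata`, `FORM_DATA` and `get_data` helpers.

```python
GRID_TARGET = "ctl00$Body$GridView1"    # grid control name from the code above
SUBMIT_KEY = "ctl00$Body$ButtonSubmit"  # submit button name from the code above


def previous_page_formdata(fields, current_page, stop_page=2):
    """Build the postback form for the page just before ``current_page``.

    ``fields`` is the form data extracted from the response for
    ``current_page`` (hidden fields such as __VIEWSTATE included).
    Returns None once the target page would fall below ``stop_page``.
    """
    target_page = current_page - 1
    if target_page < stop_page:
        return None
    formdata = dict(fields)          # copy, so the caller's dict is untouched
    formdata.pop(SUBMIT_KEY, None)   # a pager postback must not press Submit
    formdata["__EVENTTARGET"] = GRID_TARGET
    formdata["__EVENTARGUMENT"] = f"Page${target_page}"
    return formdata


# Hypothetical use inside the spider (not executed here): the callback
# schedules exactly one follow-up request, built from its own response.
#
#     def parse_results(self, response):
#         yield from get_data(response)
#         page_num = response.meta["page_num"]
#         formdata = previous_page_formdata(
#             update_formdata(FORM_DATA, response), page_num
#         )
#         if formdata is not None:
#             yield scrapy.FormRequest(
#                 response.url,
#                 formdata=formdata,
#                 callback=self.parse_results,
#                 dont_filter=True,
#                 meta={"page_num": page_num - 1},
#             )
```

With this shape, `parse_last_page` would yield a single request for the page before the last one instead of calling the loop, and the pages would be fetched one after another, not concurrently. Whether the server accepts each step this way is an assumption on my part.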
The site I am scraping is https://www.mevzuat.gov.tr/Kanunlar.aspx