The problem:
I want to extract data from the URL shown in the code below. I already have fully working code in Selenium, but I want to use Scrapy because it should be faster. With my present code, however, I can't get past the first stage: the callback prints "None" instead of the data I expect.
What I tried:
This and this and this. Almost all the answers point to setting some form values manually, but in this case that seems impossible. Or maybe I am missing something; if I understood it, I could set some values manually, take some through XPath, and leave some blank.
Interestingly, the point I want to reach right now (the data table for the dates set in the form, for each unit (district)) doesn't need JavaScript. The data still gets populated when I click the search button in the browser, even with JavaScript disabled. But I cannot replicate that in Scrapy. So I also tried going without Splash. The exact error I mentioned in that question has since been solved with the help of the comments and the answer. I am sharing the link here to show my efforts without Splash.
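To make the "some values manually, some from the page" idea concrete, here is a minimal sketch of what I have in mind. It uses only the standard library so it runs without Scrapy, and `SAMPLE_HTML`, `HiddenFieldCollector` and `hidden_fields` are made-up names with placeholder values, not the real page. As far as I understand, `FormRequest.from_response` already pre-fills the fields found in the form in a similar way, so this only illustrates the principle.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Attribute values can be None (e.g. a bare attribute), hence the "or".
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    parser = HiddenFieldCollector()
    parser.feed(html)
    return parser.fields


# Made-up stand-in for the real page; the values are placeholders.
SAMPLE_HTML = """
<form id="form1" method="post" action="./PublishedFIRs.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dummy-viewstate" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="dummy-validation" />
  <input type="text" name="ctl00$ContentPlaceHolder1$txtFirno" value="" />
</form>
"""

# Values taken from the page (these change on every response).
fields = hidden_fields(SAMPLE_HTML)
# Values chosen by hand go on top of the scraped ones.
fields["ctl00$ContentPlaceHolder1$ddlDistrict"] = "19372"
print(fields)
```

The point of the split is that tokens such as `__VIEWSTATE` come from the response every time, while only the fields I actually want to control (district, dates) are typed in.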
My Code:
    import scrapy
    from scrapy_splash import SplashFormRequest, SplashRequest


    class ExampleSpider(scrapy.Spider):
        name = 'example'

        script = '''
            function main(splash, args)
                assert(splash:go(args.url))
                assert(splash:wait(0.5))
                return splash:html()
            end
        '''

        def start_requests(self):
            yield SplashRequest(
                url='https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx',
                headers={
                    'Referer': 'https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx'
                },
                endpoint='execute',
                args={
                    'lua_source': self.script
                },
                callback=self.parse
            )

        def parse(self, response):
            yield SplashFormRequest.from_response(
                response,
                formid='form1',
                formdata={
                    '__EVENTTARGET': "ctl00$ContentPlaceHolder1$ddlDistrict",
                    '__EVENTARGUMENT': "",
                    '__LASTFOCUS': "",
                    '__VIEWSTATE': response.xpath('//*[@id="__VIEWSTATE"]/@value').get(),
                    '__VIEWSTATEGENERATOR': "6F2EA376",
                    '__PREVIOUSPAGE': response.xpath('//*[@id="__PREVIOUSPAGE"]/@value').get(),
                    '__EVENTVALIDATION': response.xpath('//*[@id="__EVENTVALIDATION"]/@value').get(),
                    'ctl00$hdnSessionIdleTime': "",
                    'ctl00$hdnUserUniqueId': "",
                    'ctl00$ContentPlaceHolder1$txtDateOfRegistrationFrom': "03/07/2020",
                    'ctl00$ContentPlaceHolder1$meeDateOfRegistrationFrom_ClientState': "",
                    'ctl00$ContentPlaceHolder1$txtDateOfRegistrationTo': "03/07/2020",
                    'ctl00$ContentPlaceHolder1$meeDateOfRegistrationTo_ClientState': "",
                    'ctl00$ContentPlaceHolder1$ddlDistrict': "19372",
                    'ctl00$ContentPlaceHolder1$ddlPoliceStation': "Select",
                    'ctl00$ContentPlaceHolder1$txtFirno': "",
                    'ctl00$ContentPlaceHolder1$ucRecordView$ddlPageSize': "0",
                    'ctl00$ContentPlaceHolder1$ucGridRecordView$txtPageNumber': ""
                },
                callback=self.after_login,
            )

        def after_login(self, response):
            police_stations = response.xpath('//*[@id="ContentPlaceHolder1_ddlPoliceStation"]/@value').get()
            print(police_stations)
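One thing I may be getting wrong in the last callback: in HTML markup a `<select>` element normally carries no `value` attribute of its own, and the values sit on its `<option>` children. The sketch below shows that on a made-up, well-formed snippet (the station names and the values `1001` and `1002` are invented), using the standard library instead of Scrapy. I have not confirmed that the real response contains populated options, so this may be only part of the problem.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the dropdown; station names and values are invented.
SAMPLE = """
<div>
  <select id="ContentPlaceHolder1_ddlPoliceStation"
          name="ctl00$ContentPlaceHolder1$ddlPoliceStation">
    <option value="Select">Select</option>
    <option value="1001">Station A</option>
    <option value="1002">Station B</option>
  </select>
</div>
"""

root = ET.fromstring(SAMPLE)
select = root.find(".//select[@id='ContentPlaceHolder1_ddlPoliceStation']")

# The <select> element itself has no value attribute, so this is None.
select_value = select.get("value")
print(select_value)

# The values are attributes of the <option> children.
station_values = [option.get("value") for option in select.findall("option")]
print(station_values)
```

If this is the cause, the Scrapy equivalent would presumably be `response.xpath('//*[@id="ContentPlaceHolder1_ddlPoliceStation"]/option/@value').getall()`, which returns a list (empty if the dropdown was never populated) rather than a single value.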
Terminal:
2020-07-16 22:34:15 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: first)
2020-07-16 22:34:15 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 27 2020, 15:53:34) - [GCC 9.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.8, Platform Linux-5.4.0-42-generic-x86_64-with-glibc2.29
2020-07-16 22:34:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-07-16 22:34:15 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'first',
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'NEWSPIDER_MODULE': 'first.spiders',
'SPIDER_MODULES': ['first.spiders'],
'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:79.0) Gecko/20100101 '
'Firefox/79.0'}
2020-07-16 22:34:15 [scrapy.extensions.telnet] INFO: Telnet Password: 4b34176c2fa9d5f5
2020-07-16 22:34:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-07-16 22:34:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-16 22:34:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-16 22:34:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-16 22:34:15 [scrapy.core.engine] INFO: Spider opened
2020-07-16 22:34:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-16 22:34:15 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-16 22:34:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx via http://localhost:8050/execute> (referer: None)
2020-07-16 22:34:19 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx via http://localhost:8050/render.html> (referer: None)
None
2020-07-16 22:34:19 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-16 22:34:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6420,
'downloader/request_count': 2,
'downloader/request_method_count/POST': 2,
'downloader/response_bytes': 51224,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 3.544418,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 7, 16, 17, 4, 19, 333815),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'memusage/max': 53067776,
'memusage/startup': 53067776,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'splash/execute/request_count': 1,
'splash/execute/response_count/200': 1,
'splash/render.html/request_count': 1,
'splash/render.html/response_count/200': 1,
'start_time': datetime.datetime(2020, 7, 16, 17, 4, 15, 789397)}
2020-07-16 22:34:19 [scrapy.core.engine] INFO: Spider closed (finished)
Kindly guide.