
The problem:

I want to extract data from the URL used in the code below. I already wrote fully functional code in Selenium, but I want to use Scrapy for better performance in terms of time. However, I can't get past the first stage with my present code: I get None in return, so I am blocked and have no way to go ahead.

What I tried:

This and this and this. Almost all answers point to setting some form values manually. But in this case that seems impossible, or maybe I am missing something; if I understood it, I could set some values manually, fill some through XPath, and leave some blank.

Interestingly, up to the point where I want to extract data right now (a table with records for the chosen dates, one row per unit/district), JavaScript isn't needed. I mean the data still gets populated when I click the search button in the browser even with JavaScript disabled. But I could not replicate that in Scrapy, so I also tried going without Splash. The exact error mentioned in that question of mine is now solved with the help of the comments and the answer; I share the link here to show my efforts without Splash.

My Code:

import scrapy

from scrapy_splash import SplashFormRequest, SplashRequest


class ExampleSpider(scrapy.Spider):
    name = 'example'

    script = '''
        function main(splash, args)
          assert(splash:go(args.url))
          assert(splash:wait(0.5))
          return splash:html()
        end
    '''

    def start_requests(self):
        yield SplashRequest(
            url='https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx',
            headers={
                'Referer': 'https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx'
            },
            endpoint='execute',
            args={
                'lua_source': self.script
            },
            callback=self.parse
        )

    def parse(self, response):
        yield SplashFormRequest.from_response(
            response,
            formid='form1',
            formdata={
                '__EVENTTARGET': "ctl00$ContentPlaceHolder1$ddlDistrict",
                '__EVENTARGUMENT': "",
                '__LASTFOCUS': "",
                '__VIEWSTATE': response.xpath('//*[@id="__VIEWSTATE"]/@value').get(),
                '__VIEWSTATEGENERATOR': "6F2EA376",
                '__PREVIOUSPAGE': response.xpath('//*[@id="__PREVIOUSPAGE"]/@value').get(),
                '__EVENTVALIDATION': response.xpath('//*[@id="__EVENTVALIDATION"]/@value').get(),
                'ctl00$hdnSessionIdleTime': "",
                'ctl00$hdnUserUniqueId': "",
                'ctl00$ContentPlaceHolder1$txtDateOfRegistrationFrom': "03/07/2020",
                'ctl00$ContentPlaceHolder1$meeDateOfRegistrationFrom_ClientState': "",
                'ctl00$ContentPlaceHolder1$txtDateOfRegistrationTo': "03/07/2020",
                'ctl00$ContentPlaceHolder1$meeDateOfRegistrationTo_ClientState': "",
                'ctl00$ContentPlaceHolder1$ddlDistrict': "19372",
                'ctl00$ContentPlaceHolder1$ddlPoliceStation': "Select",
                'ctl00$ContentPlaceHolder1$txtFirno': "",
                'ctl00$ContentPlaceHolder1$ucRecordView$ddlPageSize': "0",
                'ctl00$ContentPlaceHolder1$ucGridRecordView$txtPageNumber': ""
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        police_stations = response.xpath('//*[@id="ContentPlaceHolder1_ddlPoliceStation"]/@value').get()
        print(police_stations)

Terminal:

2020-07-16 22:34:15 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: first)
2020-07-16 22:34:15 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 27 2020, 15:53:34) - [GCC 9.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f  31 Mar 2020), cryptography 2.8, Platform Linux-5.4.0-42-generic-x86_64-with-glibc2.29
2020-07-16 22:34:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-07-16 22:34:15 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'first',
 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
 'NEWSPIDER_MODULE': 'first.spiders',
 'SPIDER_MODULES': ['first.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:79.0) Gecko/20100101 '
               'Firefox/79.0'}
2020-07-16 22:34:15 [scrapy.extensions.telnet] INFO: Telnet Password: 4b34176c2fa9d5f5
2020-07-16 22:34:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-07-16 22:34:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-16 22:34:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-16 22:34:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-16 22:34:15 [scrapy.core.engine] INFO: Spider opened
2020-07-16 22:34:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-16 22:34:15 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-16 22:34:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx via http://localhost:8050/execute> (referer: None)
2020-07-16 22:34:19 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx via http://localhost:8050/render.html> (referer: None)
None
2020-07-16 22:34:19 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-16 22:34:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6420,
 'downloader/request_count': 2,
 'downloader/request_method_count/POST': 2,
 'downloader/response_bytes': 51224,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 3.544418,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 7, 16, 17, 4, 19, 333815),
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'memusage/max': 53067776,
 'memusage/startup': 53067776,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'splash/execute/request_count': 1,
 'splash/execute/response_count/200': 1,
 'splash/render.html/request_count': 1,
 'splash/render.html/response_count/200': 1,
 'start_time': datetime.datetime(2020, 7, 16, 17, 4, 15, 789397)}
2020-07-16 22:34:19 [scrapy.core.engine] INFO: Spider closed (finished)

Kindly guide.

sangharsh
  • Are you getting any response back from the callback after_login ? – AaronS Jul 16 '20 at 20:22
  • For now, as terminal output indicates above, it's "None". – sangharsh Jul 16 '20 at 20:37
  • Regarding the response values (form data): if I submit the form, I notice a difference only in the __EVENTTARGET field, which becomes empty. And again, the same output in the terminal. – sangharsh Jul 16 '20 at 20:43
  • What I meant was, if you put def after_login(self,response): print(response.text), do you get any html at all ? We're assuming here that you are getting the correct response to scrape the data you're after when you log in. I'd also check your formdata XPATH selectors, make sure they're correct. That would be a simple fix in that case. – AaronS Jul 16 '20 at 20:47
  • ...And response.text says: Invalid postback or callback argument – sangharsh Jul 16 '20 at 21:08

0 Answers