
Very new to coding and especially new to Scrapy.

I have a simple scraper written with Scrapy, in the hope of producing a list of options (from the 'x' element). But it doesn't produce anything, and I haven't been able to make any headway in solving the issue.

I am pasting the full code followed by the full output. (Note: in settings.py I changed ROBOTSTXT_OBEY from True to False.)
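For reference, the relevant line in settings.py:

# settings.py
ROBOTSTXT_OBEY = False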

from scrapy import FormRequest
import scrapy

class FillFormSpider(scrapy.Spider):
    name = 'fill_form'
    allowed_domains = ['citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx']
    start_urls = ['http://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx/']

    def parse(self, response):
        yield FormRequest.from_response(
            response,
            formid='form1',
            formdata={
                'ctl00$ContentPlaceHolder1$txtDateOfRegistrationFrom': "01/07/2020",
                'ContentPlaceHolder1_txtDateOfRegistrationTo': "03/07/2020",
                'ctl00$ContentPlaceHolder1$ddlDistrict': "19372"},
            callback=self.after_login
        )

    def after_login(self, response):
        print(response.xpath('//*[@id="ContentPlaceHolder1_ddlPoliceStation"]/text()'))

And the output in the terminal after running the command scrapy crawl fill_form:

2020-07-12 21:32:21 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: maha_police)
2020-07-12 21:32:21 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 27 2020, 15:53:34) - [GCC 9.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f  31 Mar 2020), cryptography 2.8, Platform Linux-5.4.0-40-generic-x86_64-with-glibc2.29
2020-07-12 21:32:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-07-12 21:32:21 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'maha_police',
 'NEWSPIDER_MODULE': 'maha_police.spiders',
 'SPIDER_MODULES': ['maha_police.spiders']}
2020-07-12 21:32:21 [scrapy.extensions.telnet] INFO: Telnet Password: 02d86496f7d27542
2020-07-12 21:32:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-07-12 21:32:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-12 21:32:22 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-12 21:32:22 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-12 21:32:22 [scrapy.core.engine] INFO: Spider opened
2020-07-12 21:32:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-12 21:32:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-12 21:32:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://citizen.mahapolice.gov.in/Citizen/MH/index.aspx> from <GET http://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx/>
2020-07-12 21:32:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://citizen.mahapolice.gov.in/Citizen/MH/index.aspx> (referer: None)
2020-07-12 21:32:26 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'citizen.mahapolice.gov.in': <POST https://citizen.mahapolice.gov.in/Citizen/MH/index.aspx>
2020-07-12 21:32:26 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-12 21:32:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 501,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 8149,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 1,
 'elapsed_time_seconds': 3.542722,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 7, 12, 16, 2, 26, 43681),
 'log_count/DEBUG': 3,
 'log_count/INFO': 10,
 'memusage/max': 54427648,
 'memusage/startup': 54427648,
 'offsite/domains': 1,
 'offsite/filtered': 1,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2020, 7, 12, 16, 2, 22, 500959)}

I wanted to get the list of police stations, so I tried to extract the text from the related element. Where am I going wrong? Also, if I want to click on Search, how should I proceed? Thanks for the kind help.

– sangharsh

1 Answer


As you can see in this line:

2020-07-12 21:32:26 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'citizen.mahapolice.gov.in': <POST https://citizen.mahapolice.gov.in/Citizen/MH/index.aspx>

Your request is being filtered. In your case, it is filtered because it is considered offsite by the OffsiteMiddleware. I suggest you change your allowed_domains from:

allowed_domains = ['citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx']

to:

allowed_domains = ['citizen.mahapolice.gov.in']

This change alone already worked for me, so it's the preferable solution.

If you check the docs you will see that this field is supposed to contain the domains that this spider is allowed to crawl.

What is happening to your spider is that the OffsiteMiddleware compiles a regex from the domains in the allowed_domains field, and while processing requests it checks the should_follow method (code) to decide whether a request should be executed. Because your entry contains a path, the request's hostname never matches the regex, the match returns None, and the request gets filtered.
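To illustrate, here is a minimal, simplified sketch of that host check (not the actual Scrapy source), showing why an allowed_domains entry that contains a path can never match:

import re
from urllib.parse import urlparse

# Simplified version of the host check done by OffsiteMiddleware:
# the allowed domains are compiled into a host regex, and each
# request's hostname is matched against it.
allowed_domains = ['citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx']
host_regex = re.compile(r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains))

host = urlparse('https://citizen.mahapolice.gov.in/Citizen/MH/index.aspx').hostname
print(bool(host_regex.search(host)))  # False: a hostname never contains '/', so the path entry can't match
print(bool(re.compile(r'^(.*\.)?(citizen\.mahapolice\.gov\.in)$').search(host)))  # True with the bare domain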

You could also pass the dont_filter parameter when building your request; this works as well, although the solution above is better. It would look like this:

        yield FormRequest.from_response(
            response,
            formid='form1',
            formdata={
                'ctl00$ContentPlaceHolder1$txtDateOfRegistrationFrom': "01/07/2020",
                'ContentPlaceHolder1_txtDateOfRegistrationTo': "03/07/2020",
                'ctl00$ContentPlaceHolder1$ddlDistrict': "19372"},
            callback=self.after_login,
            dont_filter=True, # <<< Here 
        )
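As a side note on the follow-up question in the comments below, about populating the list of police stations: assuming the POST response actually comes back with the populated dropdown (which, as also noted in the comments, may not be the case if the request body is wrong), a sketch of reading the <option> elements could look like this:

def after_login(self, response):
    # Sketch: iterate over the <option> children of the dropdown and
    # yield their value/text, assuming the <select> element is present
    # and populated in the response.
    for option in response.xpath('//select[@id="ContentPlaceHolder1_ddlPoliceStation"]/option'):
        yield {
            'value': option.xpath('./@value').get(),
            'name': option.xpath('./text()').get(),
        }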
– renatodvc
  • Thanks. But I failed to understand this: you have asked me to change allowed_domains, but the one you suggested is already there. Am I missing something? – sangharsh Jul 13 '20 at 00:26
  • @sangharsh I suggested you use **only** the domain in your `allowed_domains` field, instead of the full URL. Will add more detail to the answer. – renatodvc Jul 13 '20 at 00:46
  • Got the point! Thanks. Also, if I want to populate the list of police stations from the response, how should I go about it? Can you suggest anything? – sangharsh Jul 13 '20 at 00:51
  • I'm not sure I understand your question; I suggest you open a new question with more details on what you want to do and what you tried. When I tested your spider, I didn't get the list response. The most common issue is that the `from_response` method includes fields in the body that shouldn't be there. You can check this by printing `response.request.body` and comparing it with your browser's request (see the sketch below the comments). Sometimes mimicking the request's headers is also needed. – renatodvc Jul 13 '20 at 01:01
  • I have started another question - https://stackoverflow.com/q/62912458/3604513 – sangharsh Jul 15 '20 at 12:16
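For reference, a minimal sketch of the debugging step suggested in the comments above (comparing the body Scrapy actually sent with the browser's request); the callback name matches the spider in the question:

def after_login(self, response):
    # Log the exact POST body that FormRequest.from_response built, so it
    # can be compared field by field with the request the browser sends
    # (visible in the browser's developer tools, Network tab).
    self.logger.info(response.request.body)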