
Note:

The page I am crawling doesn't use JavaScript up to the point I have reached. I have also tried scrapy-splash but got the same error. I have relied on this course for starting the spider.

The issue:

The Scrapy spider raises this error:

raise TypeError('to_bytes must receive a str or bytes '
TypeError: to_bytes must receive a str or bytes object, got Selector

What I want:

The label's text as output, which includes "some number of records".

What I tried:

This and this, and other similar questions - they don't address the issue I am facing.

My Code:

import scrapy
from scrapy import FormRequest


class abcSpider(scrapy.Spider):
    name = 'abc'
    allowed_domains = ['citizen.mahapolice.gov.in']

    def start_requests(self):
        yield scrapy.Request(
            url='http://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx',
            headers={
                'Referer': 'https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx'
            },
            callback=self.parse
        )

    def parse(self, response):

        yield FormRequest.from_response(
            response,
            formid='form1',
            formdata={
                '__EVENTTARGET': response.xpath("//input[@name='__EVENTTARGET']/@value"),
                '__EVENTARGUMENT': response.xpath("//*[@id='__EVENTARGUMENT']/@value"),
                '__LASTFOCUS': response.xpath("//*[@id='__LASTFOCUS']/@value"),
                '__VIEWSTATE':response.xpath("//*[@id='__VIEWSTATE']/@value"),
                '__VIEWSTATEGENERATOR': "6F2EA376",
                '__PREVIOUSPAGE': response.xpath("//*[@id='__PREVIOUSPAGE']/@value"),
                '__EVENTVALIDATION': response.xpath("//*[@id='__EVENTVALIDATION']/@value"),
                'ctl00$hdnSessionIdleTime': response.xpath("//*[@id='hdnSessionIdleTime']/@value"),
                'ctl00$hdnUserUniqueId': response.xpath("//*[@id='hdnUserUniqueId']/@value"),
                'ctl00$ContentPlaceHolder1$meeDateOfRegistrationFrom_ClientState': response.xpath(
                    "//*[@id='ContentPlaceHolder1_meeDateOfRegistrationFrom_ClientState']/@value"),
                'ctl00$ContentPlaceHolder1$txtDateOfRegistrationFrom': "01/07/2020",
                'ctl00$ContentPlaceHolder1$meeDateOfRegistrationTo_ClientState':
                    response.xpath(
                        "//*[@id='ContentPlaceHolder1_meeDateOfRegistrationTo_ClientState']/@value"),
                'ctl00$ContentPlaceHolder1_txtDateOfRegistrationTo': "03/07/2020",
                'ctl00$ContentPlaceHolder1$ddlDistrict': "19409",
                'ctl00$ContentPlaceHolder1$ddlPoliceStation': "",
                'ctl00$ContentPlaceHolder1$txtFirno': "",
                'ctl00$ContentPlaceHolder1$btnSearch': "Search",
                'ctl00$ContentPlaceHolder1$ucRecordView$ddlPageSize': "0",
                'ctl00$ContentPlaceHolder1$ucGridRecordView$txtPageNumber': ""
            },
            callback=(self.after_login),

        )

    def after_login(self, response):

        police_stations = response.xpath(
            '//*[@id="ContentPlaceHolder1_lbltotalrecord"]/text()').get()
        print(police_stations)

Terminal Output:

2020-07-15 15:11:37 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: xyz)
2020-07-15 15:11:37 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 27 2020, 15:53:34) - [GCC 9.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f  31 Mar 2020), cryptography 2.8, Platform Linux-5.4.0-40-generic-x86_64-with-glibc2.29
2020-07-15 15:11:37 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-07-15 15:11:37 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'xyz',
 'NEWSPIDER_MODULE': 'xyz.spiders',
 'SPIDER_MODULES': ['xyz.spiders']}
2020-07-15 15:11:38 [scrapy.extensions.telnet] INFO: Telnet Password: db3dd9550774d0ab
2020-07-15 15:11:38 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-07-15 15:11:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-15 15:11:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-15 15:11:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-15 15:11:39 [scrapy.core.engine] INFO: Spider opened
2020-07-15 15:11:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-15 15:11:39 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-15 15:11:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://citizen.mahapolice.gov.in/Citizen/MH/index.aspx> from <GET http://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx>
2020-07-15 15:11:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://citizen.mahapolice.gov.in/Citizen/MH/index.aspx> (referer: https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx)
2020-07-15 15:11:40 [scrapy.core.scraper] ERROR: Spider error processing <GET https://citizen.mahapolice.gov.in/Citizen/MH/index.aspx> (referer: https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx)
Traceback (most recent call last):
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
    yield next(it)
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/utils/python.py", line 346, in __next__
    return next(self.data)
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/utils/python.py", line 346, in __next__
    return next(self.data)
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 340, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/home/sangharshmanuski/Documents/delet/xyz/xyz/spiders/abc.py", line 20, in parse
    yield FormRequest.from_response(
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/http/request/form.py", line 58, in from_response
    return cls(url=url, method=method, formdata=formdata, **kwargs)
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/http/request/form.py", line 31, in __init__
    querystr = _urlencode(items, self.encoding)
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/http/request/form.py", line 71, in _urlencode
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/http/request/form.py", line 71, in <listcomp>
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "/home/sangharshmanuski/.local/lib/python3.8/site-packages/scrapy/utils/python.py", line 104, in to_bytes
    raise TypeError('to_bytes must receive a str or bytes '
TypeError: to_bytes must receive a str or bytes object, got Selector
2020-07-15 15:11:40 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-15 15:11:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 648,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 8150,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 1,
 'elapsed_time_seconds': 1.116569,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 7, 15, 9, 41, 40, 607840),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'memusage/max': 52281344,
 'memusage/startup': 52281344,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2020, 7, 15, 9, 41, 39, 491271)}
2020-07-15 15:11:40 [scrapy.core.engine] INFO: Spider closed (finished)
sangharsh
  • Probably it is the problem I mentioned in your previous question - you have to use `.get()` when you extract values: `response.xpath(...).get()` in `formdata={...}`. Using only `response.xpath(...)` you get a `Selector`, which is what `TypeError: to_bytes must receive a str or bytes object, got Selector` refers to. – furas Jul 15 '20 at 10:18

1 Answer


You have the problem I mentioned in a comment on your previous question.

You have to use .get() when you extract values - response.xpath(...).get() - in formdata={...}.
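The mechanics of the error can be reproduced in isolation. The snippet below is a simplified stand-in, not Scrapy's actual source: `to_bytes` mimics the type check in `scrapy.utils.python.to_bytes`, and `Selector` is a dummy class standing in for the `parsel.Selector` that `response.xpath(...)` returns.

```python
def to_bytes(text, encoding='utf-8'):
    # Simplified stand-in for scrapy.utils.python.to_bytes -- illustrative only.
    if isinstance(text, bytes):
        return text
    if not isinstance(text, str):
        raise TypeError('to_bytes must receive a str or bytes '
                        'object, got %s' % type(text).__name__)
    return text.encode(encoding)


class Selector:
    """Dummy stand-in for parsel.Selector, which response.xpath(...) returns."""


# Without .get(): formdata values are Selector objects, so urlencoding fails.
try:
    to_bytes(Selector())
except TypeError as e:
    print(e)  # -> to_bytes must receive a str or bytes object, got Selector

# With .get(): you pass a plain str, which encodes fine.
print(to_bytes('01/07/2020'))  # -> b'01/07/2020'
```

This is exactly why every `response.xpath(...)` in `formdata` needs a trailing `.get()`.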


BTW:

You still have a mistake in a field name:

 'ctl00$ContentPlaceHolder1_txtDateOfRegistrationTo': "03/07/2020",

It has to be:

'ctl00$ContentPlaceHolder1$txtDateOfRegistrationTo': "03/07/2020",

(a `$`, not an underscore, before `txtDateOfRegistrationTo`).

And you have to use https:// instead of http:// in the start URL:

url='https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx',

If you use http://, the server redirects the request to the main page

https://citizen.mahapolice.gov.in/Citizen/MH/index.aspx

and you then submit the form to index.aspx instead of PublishedFIRs.aspx.
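If you want to guard against that redirect in code, the scheme can be normalized before yielding the request. A small stdlib-only sketch (the `force_https` helper is hypothetical, not part of Scrapy):

```python
from urllib.parse import urlparse, urlunparse


def force_https(url):
    """Rewrite an http:// URL to https:// so the server's 302 redirect
    to index.aspx is never triggered (hypothetical helper)."""
    parts = urlparse(url)
    if parts.scheme == 'http':
        parts = parts._replace(scheme='https')
    return urlunparse(parts)


print(force_https('http://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx'))
# -> https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx
```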


Below is minimal working code that you can put in a single file and run with python script.py, without creating a project.

It fixes the previous errors and submits to the correct URL, but there is still a problem with the __VIEWSTATE and __EVENTVALIDATION values. If I copy all the values from the web browser then it works, but with the values Scrapy receives the page returns error 500. The page probably uses JavaScript to generate these values.

#!/usr/bin/env python3

import scrapy
from scrapy import FormRequest


class abcSpider(scrapy.Spider):
    name = 'abc'
    allowed_domains = ['citizen.mahapolice.gov.in']

    def start_requests(self):
        yield scrapy.Request(
            url='https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx',
            headers={
                'User-Agent': 'Mozilla/5.0',
                'Referer': 'https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx'
            },
            callback=self.parse
        )

    def parse(self, response):

        yield FormRequest.from_response(
            response,
            formid='form1',
            formdata={
                '__EVENTTARGET': response.xpath("//input[@name='__EVENTTARGET']/@value").get(),
                '__EVENTARGUMENT': response.xpath("//*[@id='__EVENTARGUMENT']/@value").get(),
                '__LASTFOCUS': response.xpath("//*[@id='__LASTFOCUS']/@value").get(),
                '__VIEWSTATE':response.xpath("//*[@id='__VIEWSTATE']/@value").get(),
                '__VIEWSTATEGENERATOR': "6F2EA376",
                '__PREVIOUSPAGE': response.xpath("//*[@id='__PREVIOUSPAGE']/@value").get(),
                '__EVENTVALIDATION': response.xpath("//*[@id='__EVENTVALIDATION']/@value").get(),
                'ctl00$hdnSessionIdleTime': response.xpath("//*[@id='hdnSessionIdleTime']/@value").get(),
                'ctl00$hdnUserUniqueId': response.xpath("//*[@id='hdnUserUniqueId']/@value").get(),
                'ctl00$ContentPlaceHolder1$meeDateOfRegistrationFrom_ClientState': 
                    response.xpath("//*[@id='ContentPlaceHolder1_meeDateOfRegistrationFrom_ClientState']/@value").get(),
                'ctl00$ContentPlaceHolder1$txtDateOfRegistrationFrom': "01/07/2020",
                'ctl00$ContentPlaceHolder1$meeDateOfRegistrationTo_ClientState':
                     response.xpath("//*[@id='ContentPlaceHolder1_meeDateOfRegistrationTo_ClientState']/@value").get(),
                #'ContentPlaceHolder1_txtDateOfRegistrationTo': "03/07/2020",
                'ctl00$ContentPlaceHolder1$txtDateOfRegistrationTo': "03/07/2020",
                'ctl00$ContentPlaceHolder1$ddlDistrict': "19409",
                'ctl00$ContentPlaceHolder1$ddlPoliceStation': "",
                'ctl00$ContentPlaceHolder1$txtFirno': "",
                'ctl00$ContentPlaceHolder1$btnSearch': "Search",
                'ctl00$ContentPlaceHolder1$ucRecordView$ddlPageSize': "0",
                'ctl00$ContentPlaceHolder1$ucGridRecordView$txtPageNumber': ""
            },
            callback=(self.after_login),
        )

    def after_login(self, response):

        police_stations = response.xpath(
            '//*[@id="ContentPlaceHolder1_lbltotalrecord"]/text()').get()
        print(police_stations)

# --- run standalone, without creating a project ---


from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(abcSpider)
c.start() 
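As a debugging aid: instead of one hand-written XPath per hidden field, the ASP.NET state fields can be collected generically from the HTML (this is essentially what `FormRequest.from_response` already does internally, so the sketch below only shows the mechanism). It is stdlib-only, and the sample HTML is made up, standing in for the real PublishedFIRs.aspx page:

```python
from html.parser import HTMLParser


class HiddenInputs(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> fields."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        a = dict(attrs)
        if a.get('type') == 'hidden' and 'name' in a:
            self.fields[a['name']] = a.get('value', '')


# Made-up sample HTML standing in for the real page.
sample = '''
<form id="form1">
  <input type="hidden" name="__VIEWSTATE" value="/wEPDw..." />
  <input type="hidden" name="__EVENTVALIDATION" value="/wEdAF..." />
  <input type="text" name="ctl00$ContentPlaceHolder1$txtFirno" />
</form>
'''

p = HiddenInputs()
p.feed(sample)
print(p.fields)
# -> {'__VIEWSTATE': '/wEPDw...', '__EVENTVALIDATION': '/wEdAF...'}
```

Printing these fields from the response lets you compare what the server actually sent against the values the browser submits.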

EDIT: code with hard-coded values that gives me a result, but I don't know how long these values will stay valid or whether they will work with different dates.

#!/usr/bin/env python3

import scrapy
from scrapy import FormRequest


class abcSpider(scrapy.Spider):
    name = 'abc'
    allowed_domains = ['citizen.mahapolice.gov.in']

    def start_requests(self):
        yield scrapy.Request(
            url='https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx',
            headers={
                'Referer': 'https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx'
            },
            callback=self.parse
        )

    def parse(self, response):

        yield FormRequest.from_response(
            response,
            formid='form1',
            formdata={
                '__EVENTTARGET': '',
                '__EVENTARGUMENT': '',
                '__LASTFOCUS': '',
                '__VIEWSTATE': '/wEPDwUKLTIwNzQyOTkwOA9kFgJmD2QWAgIDD2QWIAIRDw8WAh4EVGV4dAUyPGgxPk1haGFyYXNodHJhIFBvbGljZSAtIFNlcnZpY2VzIGZvciBDaXRpemVuPC9oMT5kZAITDw8WAh8ABT88aDI+Q3JpbWUgYW5kIENyaW1pbmFsIFRyYWNraW5nIE5ldHdvcmsgYW5kIFN5c3RlbXMgKENDVE5TKTxoMj5kZAIVDw8WAh8ABSLigJxFbXBvd2VyaW5nIFBvbGljZSBUaHJvdWdoIElU4oCdZGQCFw8PFgIeCEltYWdlVXJsBRV+L0ltYWdlcy90YWJfSG9tZS5wbmcWBB4Lb25tb3VzZW92ZXIFI3RoaXMuc3JjPScuLi9JbWFnZXMvdGFiX0hvbWVSTy5wbmcnHgpvbm1vdXNlb3V0BSF0aGlzLnNyYz0nLi4vSW1hZ2VzL3RhYl9Ib21lLnBuZydkAhkPDxYCHwEFGH4vSW1hZ2VzL3RhYl9BYm91dFVzLnBuZxYEHwIFJnRoaXMuc3JjPScuLi9JbWFnZXMvdGFiX0Fib3V0VXNSTy5wbmcnHwMFJHRoaXMuc3JjPScuLi9JbWFnZXMvdGFiX0Fib3V0VXMucG5nJ2QCGw8PFgIfAQUffi9JbWFnZXMvdGFiX0NpdGl6ZW5DaGFydGVyLnBuZxYEHwIFLXRoaXMuc3JjPScuLi9JbWFnZXMvdGFiX0NpdGl6ZW5DaGFydGVyUk8ucG5nJx8DBSt0aGlzLnNyYz0nLi4vSW1hZ2VzL3RhYl9DaXRpemVuQ2hhcnRlci5wbmcnZAIdDw8WAh8BBRx+L0ltYWdlcy90YWJfQ2l0aXplbkluZm8ucG5nFgQfAgUqdGhpcy5zcmM9Jy4uL0ltYWdlcy90YWJfQ2l0aXplbkluZm9STy5wbmcnHwMFKHRoaXMuc3JjPScuLi9JbWFnZXMvdGFiX0NpdGl6ZW5JbmZvLnBuZydkAh8PDxYCHwEFKH4vSW1hZ2VzL3RhYl9PbmxpbmVTZXJ2aWNlc19FbmdfYmx1ZS5wbmcWBB8CBTJ0aGlzLnNyYz0nLi4vSW1hZ2VzL3RhYl9PbmxpbmVTZXJ2aWNlc19FbmdfUk8ucG5nJx8DBTR0aGlzLnNyYz0nLi4vSW1hZ2VzL3RhYl9PbmxpbmVTZXJ2aWNlc19FbmdfYmx1ZS5wbmcnZAIhDw8WAh8BBR9+L0ltYWdlcy90YWJfT25saW5lU2VydmljZXMucG5nFgQfAgUtdGhpcy5zcmM9Jy4uL0ltYWdlcy90YWJfT25saW5lU2VydmljZXNSTy5wbmcnHwMFK3RoaXMuc3JjPScuLi9JbWFnZXMvdGFiX09ubGluZVNlcnZpY2VzLnBuZydkAiMPZBYCAgEPZBYIAgEPZBYIAgEPZBYMAgMPDxYEHgdUb29sVGlwBRpFbnRlciBEYXRlIG9mIFJlZ2lzdHJhdGlvbh4JTWF4TGVuZ3RoZmRkAgkPFggeDERpc3BsYXlNb25leQspggFBamF4Q29udHJvbFRvb2xraXQuTWFza2VkRWRpdFNob3dTeW1ib2wsIEFqYXhDb250cm9sVG9vbGtpdCwgVmVyc2lvbj00LjEuNDA0MTIuMCwgQ3VsdHVyZT1uZXV0cmFsLCBQdWJsaWNLZXlUb2tlbj0yOGYwMWIwZTg0YjZkNTNlAB4OQWNjZXB0TmVnYXRpdmULKwQAHg5JbnB1dERpcmVjdGlvbgsphgFBamF4Q29udHJvbFRvb2xraXQuTWFza2VkRWRpdElucHV0RGlyZWN0aW9uLCBBamF4Q29udHJvbFRvb2xraXQsIFZlcnNpb249NC4xLjQwNDEyLjAsIEN1bHR1cmU9bmV1dHJhbCwgUHVibGljS2V5VG9rZW49MjhmMDFiMGU4NGI2ZDUzZQAeCkFjY2VwdEFtUG1oZAITDw8WBB8EBRpFbnRlciBEYXRlIG9mIFJlZ2lzdHJhdGlvbh8FZmRkAhkPFggfBgsrBAAfBwsrBAAfCAsrBQAfCWhkAiEPEA8WBh4ORGF0YVZhbHVlRmllbGQFC0RJU1RSSUNUX0NEHg1EYXRhVGV4dEZpZWxkBQhESVNUUklDVB4LXyFEYXRhQm91bmRnZBAVMQZTZWxlY3QKQUhNRUROQUdBUgVBS09MQQ1BTVJBVkFUSSBDSVRZDkFNUkFWQVRJIFJVUkFMD0FVUkFOR0FCQUQgQ0lUWRBBVVJBTkdBQkFEIFJVUkFMBEJFRUQIQkhBTkRBUkESQlJJSEFOIE1VTUJBSSBDSVRZCEJVTERIQU5BCkNIQU5EUkFQVVIFREhVTEUKR0FEQ0hJUk9MSQZHT05ESUEHSElOR09MSQdKQUxHQU9OBUpBTE5BCEtPTEhBUFVSBUxBVFVSC05BR1BVUiBDSVRZDE5BR1BVUiBSVVJBTAZOQU5ERUQJTkFORFVSQkFSC05BU0hJSyBDSVRZDE5BU0hJSyBSVVJBTAtOQVZJIE1VTUJBSQlPU01BTkFCQUQHUEFMR0hBUghQQVJCSEFOSRBQSU1QUkktQ0hJTkNIV0FECVBVTkUgQ0lUWQpQVU5FIFJVUkFMBlJBSUdBRBJSQUlMV0FZIEFVUkFOR0FCQUQOUkFJTFdBWSBNVU1CQUkOUkFJTFdBWSBOQUdQVVIMUkFJTFdBWSBQVU5FCVJBVE5BR0lSSQZTQU5HTEkGU0FUQVJBClNJTkRIVURVUkcMU09MQVBVUiBDSVRZDVNPTEFQVVIgUlVSQUwKVEhBTkUgQ0lUWQtUSEFORSBSVVJBTAZXQVJESEEGV0FTSElNCFlBVkFUTUFMFTEGU2VsZWN0BTE5MzcyBTE5MzczBTE5ODQyBTE5Mzc0BTE5NDA5BTE5Mzc1BTE5Mzc3BTE5Mzc2BTE5Mzc4BTE5Mzc5BTE5MzgxBTE5MzgyBTE5NDAzBTE5ODQ1BTE5ODQ2BTE5Mzg0BTE5MzgwBTE5Mzg2BTE5NDA1BTE5Mzg3BTE5Mzg4BTE5Mzg5BTE5ODQ0BTE5NDA4BTE5MzkwBTE5ODQxBTE5MzkxBTE5MzcxBTE5MzkyBTE5ODQ3BTE5MzkzBTE5Mzk0BTE5Mzg1BTE5ODQ4BTE5NDA0BTE5NDAyBTE5MzgzBTE5Mzk1BTE5Mzk2BTE5Mzk3BTE5NDA2BTE5NDEwBTE5Mzk4BTE5Mzk5BTE5NDA3BTE5NDAwBTE5ODQzBTE5NDAxFCsDMWdnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2cWAWZkAicPEGQQFQEGU2VsZWN0FQEGU2VsZWN0FCsDAWdkZAIDDw8WAh8ABQZTZWFyY2hkZAIFDw8WAh8ABQVDbGVhcmRkAgcPDxYCHwAFBUNsb3NlZGQCAw9kFgJmD2QWAgIDDxBkDxYBZhYBBQtWaWV3IFJlY29yZBYBZmQCCQ88KwARAgEQFgAWABYADBQrAABkAgsPDxYCHgdWaXNpYmxlZ2QWAgIBD2QWAgIFDw8WAh8ABQJHb2RkAiUPDxYCHwAFB1NpdGVNYXBkZAInDw8WAh8ABRRQb2xpY2UgVW5pdHMgV2Vic2l0ZWRkAikPDxYCHwAFC0Rpc2NsYWltZXJzZGQCKw8PFgIfAAUDRkFRZGQCLQ8PFgIfAAUKQ29udGFjdCBVc2RkAi8PDxYCHwAFBzgzNDkxNDhkZBgCBR5fX0NvbnRyb2xzUmVxdWlyZVBvc3RCYWNrS2V5X18WCAUNY3RsMDAkbG1nSG9tZQUMY3RsMDAkbG1nQWJ0BQ1jdGwwMCRsbWdDaHJ0BQ1jdGwwMCRsbWdJbmZvBQ5jdGwwMCRsbWdEd25sZAULY3RsMDAkbG1nT1MFM2N0bDAwJENvbnRlbnRQbGFjZUhvbGRlcjEkaW1nRGF0ZU9mUmVnaXN0cmF0aW9uRnJvbQUxY3RsMDAkQ29udGVudFBsYWNlSG9sZGVyMSRpbWdEYXRlT2ZSZWdpc3RyYXRpb25UbwUlY3RsMDAkQ29udGVudFBsYWNlSG9sZGVyMSRnZHZEZWFkQm9keQ9nZJLUBB4bd3CH8EeW9a0lIRLz9afH',
                '__VIEWSTATEGENERATOR': '6F2EA376',
                '__PREVIOUSPAGE': '6Fkypj_FbKCMscMOIEbFwiAIl-t4XMDVxhwkenT13SdXVANmcLkeKVNreNUcxzCFPd2Pxt-oh_2N7OVcM2YpQJ9h0re0OFqkn5XLvLpF1J-DFQ0h0',
                '__EVENTVALIDATION': '/wEdAFbuJNLDGfYOJbFWhYC0CtoGCssMeMRH46lUxWxNoH/QjR5JLHBufgCBaXKcLsIHFZg2MfFCqAQ55R5q232FZgK2qoCdmcL8o03Ga7p3SNpVviXoWLdz7AIdB4qHlFb/Ei9/1ch/aUhwAcGED/suJluf7ISsvoU9AiyuaEemMV5BBJnd8M9l/EB8CbzCs/Qj58HeW1DBXpopxThMkmM3IaEA4f83zm8GjIMpdMbZJo0bg/ou0osxK9vw1/I5QAXjT4WAelg7J4xZgxz60IVmQFQVBwFQHg/XFH9pTR8T2Gs+V8qukw1XTUYPesJgPqkOxZQh262jaQ7BxUOV7QoxeNck2w47G8rm/lqu6eH38UvMjATEI1G+tctApp1T0wcXwuNCLn3Z0VPV65eVNYp7hMU8lDrezCJH7PKOMYlCjf6maxW322Wg8dLjJ0oAXaSslqZHs1bB/7i2oDFBz4DJ85TGKEfqFutX9Sc8iba6A2UA3Jbp98jppZoyKABVAKm4ScwkZSsqCnmWlHZE1g1cl5KTdz2wIx74ktDJhIyxSHIwnuUrnMnZVi7M1yfB08jysAZLiaKqyALYmaPTP4iB7/cEzRldPEjwCvWpP992wRUTSVioExXj+mq+aV3ovp1s3PYdGAfIul3shD4atfGh7x1DmI0SjJjBG5MN09bwTja6X4d1tYyTUWpH5kv7kquz0k9MSPwDuX8kZiAr7Go4LvLA1v8x//T/i3cmHZhqsqcHSaOvUIY7oYzJpYB6269Eg/Eet2MzUbATNVMVJ6z0ps0G3+QTnao16M2kNs5Amrrfs7FS6VV+1VO0VlVoBMI3MVL2a5ZuFE0VYjpXs1Ie80zilwTI+Q87lRt0RiHvm7no9Ryh+i/NQ0SvqV6XUmpTvyESyCOHyB4V0JKFy3ngQefeiU1Bhw0YKqnM8XuA/3OuCrtvoVW4iPEPfzfW0U992cke6MjKSj+bFRXox1RixVsclKaaKq2cbpV4jLUztc0v0FIrdBwoILAS39YNKuaLebG44MFbUIrfl2XY7SNDPfSk0ikZQF6oN5BHioH7XmNMzwk0vSNeQ5gKYDKx4xnue5CFyfuTFLCx43hksDtRhvJXkn2iJAzo5kx+7Oa9LqM3/7ZUva29woTjHIRcXIx5V+VUFaPbjpSXixVRCTuVCcHTNPAsoz+6EiXvsfi8lrX0f3D7YCO4ridVhQClK705rktyQAmmeO0iV/Vh5DSf8FhvD58uSORbTqGUZryylC9SPojWj+h3++zOroq6bTLe/itZW7f6vF0eyAgMysofFozRdBhZo5tdiQR7X5+feZXm8Mh9dkmrkjndCY6MJW+Z6GMDEkD2DRN460MRst3Ymkivnm8me7KLtZghplypPrBnBqKdsArB4XzeK7XbSYhMVY6qipQKdH6cU6XeeZcmTS57SquMwbHZEhKbKL6YxYuvZmZZF8nlCQZL4zlr//3g5nyOTFulzGhY80/Z1HbCJJ6LxQbS0yD9Thl7sm6WVjYxB23A6c0dbgG4R+nkAQKMqcH6ZVn78Nu9BSKKrOVmNjQwbSsS5vUv6MFDROG0CrK/eNGU0C14yGuWM5HkGE/DyCzIKRYuDsUr1CVxXS+jyCWdB8LwDiFJV7yxNt0d/PtikuBIdrCGbTGK/JJV58CVginDn/qsq9scauaAbl2FvBQQCQMuNszcsKvvFie32VnIgxjp9PYR0Y1JxT7s4XE1eEASLLIarsRVQGJxRqon8iLGYHzEO3PB1DG6typAyQ+VpxaMiZBUOTlCcsXDdY08Kwd7PKgvFhd/UCrh6PvV7qCAPcsiiHYjV/MyKFDCcqDP506hiuHs8/lYYzvu5lRwgpFGVVnV',
                'ctl00$hdnSessionIdleTime': '',
                'ctl00$hdnUserUniqueId': '',
                'ctl00$ContentPlaceHolder1$txtDateOfRegistrationFrom': '01/07/2020',
                'ctl00$ContentPlaceHolder1$meeDateOfRegistrationFrom_ClientState': '',
                'ctl00$ContentPlaceHolder1$txtDateOfRegistrationTo': '03/07/2020',
                'ctl00$ContentPlaceHolder1$meeDateOfRegistrationTo_ClientState': '',
                'ctl00$ContentPlaceHolder1$ddlDistrict': '19372',
                'ctl00$ContentPlaceHolder1$ddlPoliceStation': 'Select',
                'ctl00$ContentPlaceHolder1$txtFirno': '',
                'ctl00$ContentPlaceHolder1$btnSearch': 'Search',
                'ctl00$ContentPlaceHolder1$ucRecordView$ddlPageSize': '0',
                'ctl00$ContentPlaceHolder1$ucGridRecordView$txtPageNumber': '',
            },
            callback=(self.after_login),
        )

    def after_login(self, response):

        police_stations = response.xpath(
            '//*[@id="ContentPlaceHolder1_lbltotalrecord"]/text()').get()
        print(police_stations)

# --- run standalone, without creating a project ---


from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(abcSpider)
c.start() 
furas
  • If I use `__VIEWSTATE` and other values from a browser instance and use Splash with Scrapy, I get `None` as a result – sangharsh Jul 15 '20 at 12:01
  • Tried all the ways you suggested - a million thanks - but no answer as of yet! – sangharsh Jul 16 '20 at 04:19
  • I answered the main question (`TypeError: ..., got Selector`) :) but I still have a problem with `__VIEWSTATE` and I don't know what the cause is - maybe the server has a more complex security system that recognizes bots and sends a wrong `__VIEWSTATE`, or it expects JavaScript to recalculate it. If you get `None`, then check whether you send to the correct URL - as I mentioned in the answer, if you use `http` at the start it redirects to a different page and the code then sends the form to the wrong URL. So check `response.url` to confirm you send the form to the correct URL. – furas Jul 16 '20 at 06:59
  • :D I checked http and changed it as per your suggestion. No luck - it's the same, "None". – sangharsh Jul 16 '20 at 08:37
  • First you should check `response.url` to confirm you really send it to the correct URL - later you can try to find the problem and resolve it. – furas Jul 16 '20 at 08:45
  • Yes, the website is probably using some complex method. I wonder why? It is government data, freely available to all, and yet they have so much complexity! – sangharsh Jul 16 '20 at 08:47
  • Pages are created for people and the authors want to give people full access. They may block bots/scripts because bots use connections, the server's resources and power, and the authors have to pay for this. And some bots may try to inject code to steal passwords and other secret data. – furas Jul 16 '20 at 09:56
  • Okay, thanks. That helps me understand the situation. – sangharsh Jul 16 '20 at 11:04
  • If I use Splash I can evade error 500, but then the result is None! The HTTP is the same as in the post. – sangharsh Jul 16 '20 at 15:02