31

I am trying to crawl the latest reviews from the Google Play store, and to get them I need to make a POST request.

With Postman it works and I get the desired response.


But a POST request from the terminal gives me a server error.

For example, for this page: https://play.google.com/store/apps/details?id=com.supercell.boombeach

curl -H "Content-Type: application/json" -X POST -d '{"id": "com.supercell.boombeach", "reviewType": '0', "reviewSortOrder": '0', "pageNum":'0'}' https://play.google.com/store/getreviews

gives a server error, and Scrapy just ignores this line:

frmdata = {"id": "com.supercell.boombeach", "reviewType": 0, "reviewSortOrder": 0, "pageNum": 0}
url = "https://play.google.com/store/getreviews"
yield Request(url, callback=self.parse, method="POST", body=urllib.urlencode(frmdata))
Amit Tripathi

3 Answers

48

The other answers do not really solve the problem. They send the data as form-encoded parameters instead of as JSON in the body of the request.

From http://bajiecc.cc/questions/1135255/scrapy-formrequest-sending-json:

import json
import scrapy

my_data = {'field1': 'value1', 'field2': 'value2'}
# url is the endpoint you are posting to
request = scrapy.Request(url, method='POST',
                         body=json.dumps(my_data),
                         headers={'Content-Type': 'application/json'})
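
To make the difference concrete, here is a minimal standard-library sketch (no Scrapy needed) that encodes the payload from the question both ways. The variable names `form_body` and `json_body` are only illustrative. A server that expects JSON will usually not accept the form-encoded version, even though both carry the same fields.

```python
import json
from urllib.parse import urlencode

# Payload taken from the question.
payload = {
    "id": "com.supercell.boombeach",
    "reviewType": 0,
    "reviewSortOrder": 0,
    "pageNum": 0,
}

# Form encoding: what FormRequest or urlencode() puts in the request body.
form_body = urlencode(payload)

# JSON encoding: what json.dumps() puts in the request body.
json_body = json.dumps(payload)

print(form_body)
print(json_body)
```

Whichever body you send, the Content-Type header should match it: application/x-www-form-urlencoded for the first, application/json for the second.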
aitorhh
  • How can I get the result of the request? I use `request.body`, but it returns the form data... – Yuda Prawira Jun 06 '17 at 00:07
  • 1
    If you want the result of the request, you have to get it from the response. scrapy.Request accepts a 'callback' argument, which is called once the request has been yielded ('yield request') and the response has been received. To read the data in the callback function (for example 'def parse_entry(self, response)'), just use response.body. I used 'jsonresponse = json.loads(response.body_as_unicode())' because I get JSON back – aitorhh Jun 07 '17 at 17:18
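
The point made in the comments, that the result lives on the response and not on the request, can be sketched roughly as below. This is an illustration under assumptions: `fake_response` and its JSON content are made up so the snippet runs without Scrapy, and in a real spider `parse_entry` would be a method taking `self` and a Scrapy response, whose decoded body is available as `response.text`.

```python
import json
from types import SimpleNamespace


def parse_entry(response):
    """Decode a JSON response body into Python objects."""
    # On Scrapy text responses, response.text is the decoded body.
    return json.loads(response.text)


# Stand-in for a real Scrapy response; the payload here is invented.
fake_response = SimpleNamespace(text='{"reviews": [], "pageNum": 0}')
data = parse_entry(fake_response)
print(data["pageNum"])
```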
33

Make sure that each element in your formdata is of type string/unicode

frmdata = {"id": "com.supercell.boombeach", "reviewType": '0', "reviewSortOrder": '0', "pageNum":'0'}
url = "https://play.google.com/store/getreviews"
yield FormRequest(url, callback=self.parse, formdata=frmdata)

I think this will do

In [1]: from scrapy.http import FormRequest

In [2]: frmdata = {"id": "com.supercell.boombeach", "reviewType": '0', "reviewSortOrder": '0', "pageNum":'0'}

In [3]: url = "https://play.google.com/store/getreviews"

In [4]: r = FormRequest(url, formdata=frmdata)

In [5]: fetch(r)
2015-05-20 14:40:09+0530 [default] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f3ea4258890>
[s]   item       {}
[s]   r          <POST https://play.google.com/store/getreviews>
[s]   request    <POST https://play.google.com/store/getreviews>
[s]   response   <200 https://play.google.com/store/getreviews>
[s]   settings   <scrapy.settings.Settings object at 0x7f3eaa205450>
[s]   spider     <Spider 'default' at 0x7f3ea3449cd0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
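
If your values start out as integers, as in the question, one way to meet the string-only requirement is to convert them before building the request. This is just a sketch; `stringify_formdata` is a hypothetical helper name, not part of Scrapy.

```python
def stringify_formdata(data):
    """Return a copy of data with every value converted to str."""
    return {key: str(value) for key, value in data.items()}


# Integer values as used in the question.
raw = {"id": "com.supercell.boombeach", "reviewType": 0, "reviewSortOrder": 0, "pageNum": 0}
frmdata = stringify_formdata(raw)
print(frmdata)
```

The resulting dict can then be passed as formdata=frmdata to FormRequest.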
Jithin
  • Thanks. I am still not able to see the response data. How do I get it? – Amit Tripathi May 20 '15 at 15:48
  • 2
    response.body will give you the complete HTML of the response. If you want specific entries, you can use `response.xpath(YOUR_XPATH_HERE)`. – Jithin May 21 '15 at 04:09
  • This is what I am getting from r.body: 'pageNum=0&id=com.supercell.boombeach&reviewType=0&reviewSortOrder=0' – Amit Tripathi May 21 '15 at 12:22
  • Are you getting the html data with r.body? – Amit Tripathi May 21 '15 at 12:23
  • 1
    I ran `fetch(r)` after that and then checked `response.body`; that should give you the results. In your code, instead of fetch, you can directly use `yield FormRequest(url=url, formdata=frmdata, callback=your_callback_func)`. I tested this in the Scrapy shell, where I cannot use a callback function. – Jithin May 21 '15 at 12:25
  • I was a bit busy, so I only tried response.body, and it worked. But now when I try `yield FormRequest(url=url, formdata=frmdata, callback=your_callback_func)`, "your_callback_func" is not getting called. It seems like Scrapy simply ignores it. – Amit Tripathi Jun 01 '15 at 20:41
  • Maybe something is wrong with your code; that should work. Without the complete code I can't say anything specific. – Jithin Jun 03 '15 at 08:49
  • I just asked a question with some snippet http://stackoverflow.com/questions/30614560/crawling-dynamic-content-with-scrapy. Please see this. – Amit Tripathi Jun 03 '15 at 08:51
  • Are you sure that program control actually reaches the function from which you are yielding this `FormRequest`? – Jithin Jun 03 '15 at 09:05
0

Sample page traversal using POST in Scrapy:

from urllib.parse import urljoin

from scrapy import FormRequest, Request

def directory_page(self, response):
    if response:
        # Follow every profile link found on the current page.
        profiles = response.xpath("//div[@class='heading-h']/h3/a/@href").extract()
        for profile in profiles:
            yield Request(urljoin(response.url, profile), callback=self.profile_collector)

        # Request the next page via POST; the initial request must set meta['page'].
        page = response.meta['page'] + 1
        if page:
            yield FormRequest('https://rotmanconnect.com/AlumniDirectory/getmorerecentjoineduser',
                              formdata={'isSortByName': 'false', 'pageNumber': str(page)},
                              callback=self.directory_page,
                              meta={'page': page})
    else:
        print("No more page available")
Manoj Sahu