How can I write a rule when the website sends me a JSON response instead of HTML? The first response from the start URL is HTML, but when I navigate through the pages, I get JSON responses. Here is my rule:

 Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="GridTimeline-items"]'),
                        tags=('div'), attrs=('data-min-position'), allow=(r''),
                        process_value=my_process_value_friends),
      callback='parse_friends', follow=True),

My question is: how can I apply XPath to a JSON response?

Thank you,

Rocky
  • You should use `scrapy.linkextractors.LinkExtractor`, since `SgmlLinkExtractor` has been deprecated for a while now. The two are essentially the same thing, though. – Granitosaurus Sep 06 '16 at 06:27

1 Answer

You can't parse JSON with XPath or CSS selectors. You can, however, turn the JSON into a Python dictionary:

import json

def parse(self, response):
    # response.text is the decoded body; json.loads turns it into a dict
    data = json.loads(response.text)
    # then just parse it, e.g.
    item = dict()
    item['name'] = data['name']
    # ...

Or you can convert the JSON to XML and then parse it with Scrapy selectors. There are a lot of packages that do that, but I'll use dicttoxml in my example:

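import json
from dicttoxml import dicttoxml
from scrapy import Selector

def parse(self, response):
    data = json.loads(response.text)
    # dicttoxml returns the XML document as bytes
    data_xml = dicttoxml(data)
    # Selector takes text, not bytes; type="xml" keeps it out of HTML mode
    sel = Selector(text=data_xml.decode("utf-8"), type="xml")
    # then parse it
    item = dict()
    # .get() returns the first matching string (or None)
    item['name'] = sel.xpath("//name/text()").get()
    # ...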
Granitosaurus
  • Thank you, but I am looking for a solution for the rules, not for the parsing stage. – Rocky Sep 06 '16 at 13:39
  • @Reymark You can't use `restrict_xpaths` on a JSON source without extending how CrawlSpider works. The easy way, though, is to do it manually as I described in my answer. Just have a callback (e.g. `parse_friends`) on your rule and check at the beginning whether the page is JSON; if so, find the URLs in the JSON yourself, otherwise continue as normal. – Granitosaurus Sep 07 '16 at 08:24
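
The manual check described in the last comment could look roughly like the sketch below. It uses only the standard library so it can be tried outside a spider. The helper names `looks_like_json` and `next_page_urls`, the JSON key `min_position`, and the `?max_position=` query format are all assumptions made for illustration; the real key and URL scheme depend on what the site actually returns.

```python
import json
from urllib.parse import urljoin


def looks_like_json(content_type, body):
    """Guess whether a response is JSON from its Content-Type or its body."""
    if "json" in (content_type or "").lower():
        return True
    # Fallback heuristic: JSON documents start with an object or an array.
    return body.lstrip().startswith(("{", "["))


def next_page_urls(base_url, content_type, body, key="min_position"):
    """Return follow-up URLs for a JSON page, or [] for anything else.

    `key` and the `?max_position=` query format are hypothetical.
    """
    if not looks_like_json(content_type, body):
        return []  # HTML page: leave link extraction to the normal rules
    try:
        data = json.loads(body)
    except ValueError:
        return []  # looked like JSON but did not parse
    position = data.get(key) if isinstance(data, dict) else None
    if not position:
        return []  # no pagination cursor in this payload
    # Replace the query string of the current URL with the new cursor.
    return [urljoin(base_url, "?max_position={}".format(position))]
```

Inside the rule's callback, one way to use this would be to call `next_page_urls(response.url, response.headers.get('Content-Type', b'').decode(), response.text)` and yield a `scrapy.Request` for each returned URL, so JSON pages are paginated by hand while HTML pages still go through the link extractor.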