55

How do you use Scrapy to scrape web requests that return JSON? For example, the JSON would look like this:

{
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021"
    },
    "phoneNumber": [
        {
            "type": "home",
            "number": "212 555-1234"
        },
        {
            "type": "fax",
            "number": "646 555-4567"
        }
    ]
}

I would like to scrape specific items (e.g. the name and fax number in the above) and save them to CSV.

Artjom B.
Thomas Kingaroy

3 Answers

89

It works the same way as using Scrapy's HtmlXPathSelector for HTML responses. The only difference is that you parse the response body with the json module:

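import json

import scrapy


class MyItem(scrapy.Item):
    firstName = scrapy.Field()


class MySpider(scrapy.Spider):
    ...

    def parse(self, response):
        # json.loads needs the decoded body text, not the Response object
        jsonresponse = json.loads(response.text)

        item = MyItem()
        item["firstName"] = jsonresponse["firstName"]

        return item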
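For the specific fields the question asks about (name and fax number), a minimal sketch of the extraction step might look like the following. It uses only the standard library so that it runs outside a crawl. The helper names `extract_row` and `rows_to_csv` are hypothetical, and the payload is a trimmed copy of the one in the question. In a real spider you would more likely yield the row from `parse` and let Scrapy's feed export write the file (for example `scrapy crawl myspider -o people.csv`, where the spider and file names are placeholders).

```python
import csv
import io
import json

# Trimmed copy of the sample payload from the question.
SAMPLE = """
{
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "phoneNumber": [
        {"type": "home", "number": "212 555-1234"},
        {"type": "fax", "number": "646 555-4567"}
    ]
}
"""


def extract_row(data):
    """Pick the name and fax number out of the parsed JSON (hypothetical helper)."""
    # Take the first phone entry whose type is "fax"; fall back to "" if none.
    fax = next(
        (p.get("number", "") for p in data.get("phoneNumber", [])
         if p.get("type") == "fax"),
        "",
    )
    return {
        "firstName": data.get("firstName", ""),
        "lastName": data.get("lastName", ""),
        "fax": fax,
    }


def rows_to_csv(rows):
    """Serialize a list of row dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["firstName", "lastName", "fax"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


row = extract_row(json.loads(SAMPLE))
csv_text = rows_to_csv([row])
print(csv_text)
```

Inside `parse`, the same `extract_row` could be called on the parsed response and its result yielded as the item.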
starball
alecxe
  • 9
    You may want to use `json.loads(response.body_as_unicode())` as loads requires a `str` or `unicode` object, not a scrapy Response. – Shane Evans Aug 12 '13 at 10:32
  • 1
    folks, so now you have parsed a json response. how would i follow each link that is potentially in the json? – Cmag Sep 10 '14 at 16:19
  • 3
    @Cmag you would need to `return` or `yield` a `Request`, see more info [here](http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions). – alecxe Sep 10 '14 at 16:20
  • Using an ujson will be more effective – sakost Sep 25 '18 at 08:47
  • 4
    response.text is preferred over body_as_unicode(), see https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse.body_as_unicode – exic Dec 10 '18 at 13:56
  • MyItem() is not defined? – sampan0423 May 19 '21 at 08:06
20

You don't need the json module to parse the response object. Since Scrapy 2.2, text responses have a json() method that does it for you:

import scrapy


class MySpider(scrapy.Spider):
    ...

    def parse(self, response):
        # response.json() deserializes the JSON body into Python objects
        jsonresponse = response.json()

        item = MyItem()
        item["firstName"] = jsonresponse.get("firstName", "")

        return item
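The `.get(key, default)` call in this answer matters when a field is missing from some responses: indexing with `[]` raises `KeyError`, while `.get` returns the default. A small illustration on a plain dict, which is what parsing a JSON object yields; the payload is a trimmed copy of the one in the question, and `middleName` and `country` are made-up keys chosen because they are absent:

```python
import json

data = json.loads('{"firstName": "John", "address": {"city": "New York"}}')

first_name = data.get("firstName", "")
middle_name = data.get("middleName", "")  # absent key, so the default is used

# For nested objects, default to an empty dict before the second lookup.
city = data.get("address", {}).get("city", "")
country = data.get("address", {}).get("country", "")

# Plain indexing on an absent key raises instead of returning a default.
try:
    data["middleName"]
except KeyError:
    missing_raises = True
else:
    missing_raises = False
```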
HARVYS 789
0

One possible reason the JSON fails to load is that its strings are wrapped in single quotes rather than the double quotes JSON requires. Try this:

json.loads(response.text.replace("'", '"'))
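Note that replacing every apostrophe also rewrites apostrophes inside values (e.g. a surname like O'Brien). As an alternative sketch, not this answer's method: if the body is really a Python-style literal with single-quoted strings, the standard library's `ast.literal_eval` can parse it without touching the contents. The helper name `loads_lenient` is hypothetical, and the fallback does not understand JSON's `true`, `false`, or `null`.

```python
import ast
import json


def loads_lenient(text):
    """Parse strict JSON first, then fall back to a Python-style literal.

    Hypothetical helper; in a spider, `text` would be `response.text`.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # literal_eval only evaluates literals (dicts, lists, strings,
        # numbers, ...); unlike eval(), it does not run arbitrary code.
        return ast.literal_eval(text)


# A single-quoted body whose value itself contains an apostrophe.
single_quoted = "{'firstName': 'John', 'lastName': \"O'Brien\"}"
data = loads_lenient(single_quoted)
print(data["lastName"])
```

If neither parser accepts the text, `ast.literal_eval` raises `ValueError` or `SyntaxError`, which the caller may want to catch and log.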
dKen
Manoj Sahu
  • Also can be a strange json, e.g. with unnecessary brackets, then use: `json.loads(response.body_as_unicode().strip('()'))`. – bl79 Jul 13 '18 at 21:39