How do I extract a jsonObj out of a javascript with Scrapy

Question

I want to build a dictionary of the jsonObj. Here's what I have so far. I've not yet figured out how to extract the json in order to parse it.

    def parse_store(self, response):
        jsonobj = response.xpath('//script[@window.appData//text').extract()
        stores = json.loads(jsonobj.body_as_unicode())
        print(stores)
        for stores in response:
            stores = {}
            stores['stores'] = response['stores']
            stores['stores']['id'] = response['stores']['id']
            stores['stores']['name'] = response['stores']['name']
            stores['stores']['addr1'] = response['stores']['addr1']
            stores['stores']['city'] = response['stores']['city']
            stores['stores']['state'] = response['stores']['state']
            stores['stores']['country'] = response['stores']['country']
            stores['stores']['zipCode'] = response['stores']['zipCode']
            stores['stores']['phone'] = response['stores']['phone']
            stores['stores']['latitude'] = response['stores']['latitude']
            stores['stores']['longitude'] = response['stores']['longitude']
            stores['stores']['services'] = response['stores']['services']
        print(stores)

        return stores

paul trmbrth · Accepted Answer · 2016-10-06T13:55:05.530

One way to do this is to use js2xml (disclaimer: I wrote js2xml).

So let's assume you have a scrapy Selector with a <script> element with some JavaScript data:

>>> import scrapy
>>> html = '''<script>
... window.appData = {
...     "stores": [
...     {   "id": "952",
...         "name": "BAYTOWN TX",
...         "addr1": "4620 garth rd",
...         "city": "baytown",
...         "state": "TX",
...         "country": "US",
...         "zipCode": "77521",
...         "phone": "281-420-0079",
...         "locationType": "Store",
...         "locationSubType": "Big Box Store",
...         "latitude": "29.77313",
...         "longitude": "-94.97634"
...     }]
... }
... </script>'''
>>> selector = scrapy.Selector(text=html, type="html")

Let's extract that JavaScript bit from it:

>>> js = selector.xpath('//script/text()').extract_first()
>>> js
u'\nwindow.appData = {\n    "stores": [\n    {   "id": "952",\n        "name": "BAYTOWN TX",\n        "addr1": "4620 garth rd",\n        "city": "baytown",\n        "state": "TX",\n        "country": "US",\n        "zipCode": "77521",\n        "phone": "281-420-0079",\n        "locationType": "Store",\n        "locationSubType": "Big Box Store",\n        "latitude": "29.77313",\n        "longitude": "-94.97634"\n    }]\n}\n'

Now, import js2xml and call the .parse() function. You get an lxml tree back, representing the JavaScript code (sort of the AST of it):

>>> import js2xml
>>> jstree = js2xml.parse(js)
>>> jstree
<Element program at 0x7fc7f1ba3bd8>

If you're curious, here's what the tree looks like:

>>> print(js2xml.pretty_print(jstree))
<program>
  <assign operator="=">
    <left>
      <dotaccessor>
        <object>
          <identifier name="window"/>
        </object>
        <property>
          <identifier name="appData"/>
        </property>
      </dotaccessor>
    </left>
    <right>
      <object>
        <property name="stores">
          <array>
            <object>
              <property name="id">
                <string>952</string>
              </property>
              <property name="name">
                <string>BAYTOWN TX</string>
              </property>
              <property name="addr1">
                <string>4620 garth rd</string>
              </property>
              <property name="city">
                <string>baytown</string>
              </property>
              <property name="state">
                <string>TX</string>
              </property>
              <property name="country">
                <string>US</string>
              </property>
              <property name="zipCode">
                <string>77521</string>
              </property>
              <property name="phone">
                <string>281-420-0079</string>
              </property>
              <property name="locationType">
                <string>Store</string>
              </property>
              <property name="locationSubType">
                <string>Big Box Store</string>
              </property>
              <property name="latitude">
                <string>29.77313</string>
              </property>
              <property name="longitude">
                <string>-94.97634</string>
              </property>
            </object>
          </array>
        </property>
      </object>
    </right>
  </assign>
</program>

Next, you want the right-hand side of the window.appData assignment, which is a JavaScript object. You can select it with a regular XPath call:

>>> jstree.xpath('''
...     //assign[left//identifier[@name="appData"]]
...         /right
...             /*
...     ''')
[<Element object at 0x7fc7f257f5f0>]
>>>

(i.e. you want the <assign> node, filtering on the <left> part, and get the child of the <right> part, which is an <object>)

js2xml has helpers to convert <object> nodes into Python dicts and lists (we select the first result of the xpath() call with [0]):

>>> obj = jstree.xpath('//assign[left//identifier[@name="appData"]]/right/*')[0]
>>> from pprint import pprint
>>> pprint(js2xml.jsonlike.make_dict(obj))
{'stores': [{'addr1': '4620 garth rd',
             'city': 'baytown',
             'country': 'US',
             'id': '952',
             'latitude': '29.77313',
             'locationSubType': 'Big Box Store',
             'locationType': 'Store',
             'longitude': '-94.97634',
             'name': 'BAYTOWN TX',
             'phone': '281-420-0079',
             'state': 'TX',
             'zipCode': '77521'}]}
>>>
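
Once you have that dict, the loop from the question reduces to plain dictionary handling. Here is a minimal sketch, assuming `data` is the result of the make_dict call above (it is hard-coded below so the snippet runs on its own). `WANTED_FIELDS` and `iter_stores` are hypothetical names chosen for illustration; they are not part of js2xml or Scrapy.

```python
# Hypothetical helper names: nothing here comes from js2xml or Scrapy.
WANTED_FIELDS = (
    "id", "name", "addr1", "city", "state", "country",
    "zipCode", "phone", "latitude", "longitude", "services",
)


def iter_stores(data, fields=WANTED_FIELDS):
    """Yield one flat dict per entry in data['stores'].

    Fields missing from a store (such as 'services' in the sample
    data) are set to None instead of raising KeyError.
    """
    for store in data.get("stores", []):
        yield {field: store.get(field) for field in fields}


# Stand-in for the make_dict() result shown above, so the sketch is self-contained.
data = {
    "stores": [
        {
            "id": "952",
            "name": "BAYTOWN TX",
            "addr1": "4620 garth rd",
            "city": "baytown",
            "state": "TX",
            "country": "US",
            "zipCode": "77521",
            "phone": "281-420-0079",
            "locationType": "Store",
            "locationSubType": "Big Box Store",
            "latitude": "29.77313",
            "longitude": "-94.97634",
        }
    ]
}

stores = list(iter_stores(data))
```

In a spider callback you would typically yield each of these dicts as an item rather than collect and print them.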

I certainly appreciate the response. I've installed js2xml, and do think it'll help. I'm just not sure how to initially select the JS (window.appData) in order to iterate through it with js2xml. Is there a predicate that I can use to load the js? — rjdel, Oct 06 '16 at 16:49
You could test for content: `//script[contains(., "window.appData")]/text()` — paul trmbrth, Oct 07 '16 at 15:16
I see the power behind js2xml now. Everything returns as you've posted, thanks for that! One more thing, though: I believe [contains... is the point where I need to iterate to stores, which is the 9th object within 'window.appData'. So, when I use js2xml.make_dict I want to call stores at [8], but it returns empty. Should my statement to test content include 'stores' at some point? — rjdel, Oct 10 '16 at 13:09
Without the input data, it's difficult to say. You may want to post another question for this. — paul trmbrth, Oct 10 '16 at 13:11
I added all the data here: http://stackoverflow.com/questions/39960174/using-js2xml-and-scrapy-how-can-i-iterate-through-a-json-object-to-select-a-spe — rjdel, Oct 10 '16 at 14:07
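
On picking out the right script when a page has several: besides the `contains()` predicate suggested in the comments above, a standard-library-only alternative is to search the page text for the assignment and let `json.JSONDecoder.raw_decode` read the object that follows. This is not js2xml and not the approach the answer uses. It is only a sketch, and it only works when the JavaScript literal happens to be strict JSON, as in the sample data. `extract_app_data` is a hypothetical name, and the `html` string is made up for illustration.

```python
import json
import re

# Matches the start of the assignment; the object itself is left to the JSON decoder.
APP_DATA_RE = re.compile(r"window\.appData\s*=\s*")


def extract_app_data(page_text):
    """Return the object assigned to window.appData, or None if not found.

    Raises json.JSONDecodeError when what follows the assignment is not
    strict JSON (unquoted keys, single quotes, trailing commas, ...).
    """
    match = APP_DATA_RE.search(page_text)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever comes after it (a trailing ';', '</script>', ...).
    obj, _end = json.JSONDecoder().raw_decode(page_text, match.end())
    return obj


# Made-up page with two scripts; only the second one assigns window.appData.
html = '''<script>var unrelated = 1;</script>
<script>
window.appData = {"stores": [{"id": "952", "name": "BAYTOWN TX"}]};
</script>'''

app_data = extract_app_data(html)
```

In a Scrapy callback the page text would normally be `response.text`. If the literal is not strict JSON, the js2xml route in the answer is the safer choice.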
