One way to do this is to use js2xml (disclaimer: I wrote js2xml).
So let's assume you have a scrapy Selector with a <script> element containing some JavaScript data:
>>> import scrapy
>>> html = '''<script>
... window.appData = {
... "stores": [
... { "id": "952",
... "name": "BAYTOWN TX",
... "addr1": "4620 garth rd",
... "city": "baytown",
... "state": "TX",
... "country": "US",
... "zipCode": "77521",
... "phone": "281-420-0079",
... "locationType": "Store",
... "locationSubType": "Big Box Store",
... "latitude": "29.77313",
... "longitude": "-94.97634"
... }]
... }
... </script>'''
>>> selector = scrapy.Selector(text=html, type="html")
Let's extract that JavaScript bit from it:
>>> js = selector.xpath('//script/text()').extract_first()
>>> js
u'\nwindow.appData = {\n "stores": [\n { "id": "952",\n "name": "BAYTOWN TX",\n "addr1": "4620 garth rd",\n "city": "baytown",\n "state": "TX",\n "country": "US",\n "zipCode": "77521",\n "phone": "281-420-0079",\n "locationType": "Store",\n "locationSubType": "Big Box Store",\n "latitude": "29.77313",\n "longitude": "-94.97634"\n }]\n}\n'
Now, import js2xml and call the .parse() function. You get an lxml tree back, representing the JavaScript code (sort of an AST of it):
>>> import js2xml
>>> jstree = js2xml.parse(js)
>>> jstree
<Element program at 0x7fc7f1ba3bd8>
If you're curious, here's what the tree looks like:
>>> print(js2xml.pretty_print(jstree))
<program>
  <assign operator="=">
    <left>
      <dotaccessor>
        <object>
          <identifier name="window"/>
        </object>
        <property>
          <identifier name="appData"/>
        </property>
      </dotaccessor>
    </left>
    <right>
      <object>
        <property name="stores">
          <array>
            <object>
              <property name="id">
                <string>952</string>
              </property>
              <property name="name">
                <string>BAYTOWN TX</string>
              </property>
              <property name="addr1">
                <string>4620 garth rd</string>
              </property>
              <property name="city">
                <string>baytown</string>
              </property>
              <property name="state">
                <string>TX</string>
              </property>
              <property name="country">
                <string>US</string>
              </property>
              <property name="zipCode">
                <string>77521</string>
              </property>
              <property name="phone">
                <string>281-420-0079</string>
              </property>
              <property name="locationType">
                <string>Store</string>
              </property>
              <property name="locationSubType">
                <string>Big Box Store</string>
              </property>
              <property name="latitude">
                <string>29.77313</string>
              </property>
              <property name="longitude">
                <string>-94.97634</string>
              </property>
            </object>
          </array>
        </property>
      </object>
    </right>
  </assign>
</program>
Then, you want to get the right-hand side of the assignment to window.appData, which is a JavaScript object.
You can use a regular XPath call to select it:
>>> jstree.xpath('''
... //assign[left//identifier[@name="appData"]]
... /right
... /*
... ''')
[<Element object at 0x7fc7f257f5f0>]
>>>
(i.e. you want the <assign> node, filtering on its <left> part, and then the child of its <right> part, which is an <object>)
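To make the predicate in that XPath expression concrete, here is a minimal sketch of the same selection written out step by step in plain Python. It runs on a trimmed copy of the tree above using the standard library's xml.etree.ElementTree rather than lxml, and the helper name find_assigned_value is hypothetical, not part of js2xml. In practice the one-line xpath() call is the shorter option.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the js2xml tree shown above (only one property kept).
TREE_XML = """
<program>
  <assign operator="=">
    <left>
      <dotaccessor>
        <object><identifier name="window"/></object>
        <property><identifier name="appData"/></property>
      </dotaccessor>
    </left>
    <right>
      <object>
        <property name="stores"><array/></property>
      </object>
    </right>
  </assign>
</program>
"""


def find_assigned_value(root, name):
    """Return the node assigned to `name`, or None (hypothetical helper)."""
    for assign in root.iter("assign"):
        left = assign.find("left")
        right = assign.find("right")
        if left is None or right is None or len(right) == 0:
            continue
        # Same test as the predicate [left//identifier[@name="appData"]]:
        # any <identifier> below <left> carrying the requested name.
        if any(ident.get("name") == name for ident in left.iter("identifier")):
            # Counterpart of /right/*; only the first child is returned here.
            return right[0]
    return None


root = ET.fromstring(TREE_XML)
node = find_assigned_value(root, "appData")
print(node.tag)
```

Note that, like the XPath version, this matches any identifier on the left-hand side, so asking for "window" would select the same assignment.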
js2xml has helpers to convert <object> nodes into Python dicts and lists (we select the first result of the xpath() call with [0]):
>>> from pprint import pprint
>>> data = js2xml.jsonlike.make_dict(jstree.xpath('//assign[left//identifier[@name="appData"]]/right/*')[0])
>>> pprint(data)
{'stores': [{'addr1': '4620 garth rd',
             'city': 'baytown',
             'country': 'US',
             'id': '952',
             'latitude': '29.77313',
             'locationSubType': 'Big Box Store',
             'locationType': 'Store',
             'longitude': '-94.97634',
             'name': 'BAYTOWN TX',
             'phone': '281-420-0079',
             'state': 'TX',
             'zipCode': '77521'}]}
>>>
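If you wonder what such a conversion involves, the sketch below walks <object>, <array> and <string> nodes recursively and builds the matching Python values. It is only an illustration under stated assumptions: it is not js2xml's actual implementation, the function name to_python is made up for this example, it uses the standard library's xml.etree.ElementTree on a shortened copy of the tree, and it only handles the three node types that appear in the tree above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <object> node selected above (two properties kept).
SAMPLE = """
<object>
  <property name="stores">
    <array>
      <object>
        <property name="id"><string>952</string></property>
        <property name="city"><string>baytown</string></property>
      </object>
    </array>
  </property>
</object>
"""


def to_python(node):
    """Convert an <object>/<array>/<string> node into a dict/list/str."""
    if node.tag == "object":
        # Each <property name="..."> is assumed to wrap exactly one value node.
        return {
            prop.get("name"): to_python(prop[0])
            for prop in node.findall("property")
        }
    if node.tag == "array":
        return [to_python(child) for child in node]
    if node.tag == "string":
        return node.text or ""
    # Other node types are deliberately left out of this sketch.
    raise ValueError("unsupported node type: %s" % node.tag)


data = to_python(ET.fromstring(SAMPLE))
print(data)
```

For real pages, stick with the js2xml helper shown above; this sketch is only meant to show how the XML structure maps onto dicts and lists.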