0

Can we use Scrapy to get content from a web page that is loaded by JavaScript?

I'm trying to scrape usage examples from this page, but since they are loaded by JavaScript as a JSON object, I'm not able to get them with Scrapy.

Could you suggest the best way to deal with this kind of issue?

bouteillebleu
Ajay Singh

1 Answer

4

Open your browser's developer tools and look at the Network tab. If you hit the "next" button on that page enough, it'll send out a new request:

examples.json

After removing the JSONP parameter, the URL is pretty straightforward:

https://corpus.vocabulary.com/api/1.0/examples.json?query=unalienable&maxResults=24&startOffset=24&filter=0

Requesting this endpoint directly keeps the number of requests to a minimum, so your spider will be fast.
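As a rough sketch of how the paginated URLs could be generated, here is a small helper built only on the standard library. The parameter names come from the URL above; the function names `examples_url` and `page_urls` are made up for illustration, and the idea that `startOffset` advances in steps of `maxResults` starting from 0 is an assumption based on the single request seen in the Network tab.

```python
from urllib.parse import urlencode

BASE_URL = "https://corpus.vocabulary.com/api/1.0/examples.json"


def examples_url(query, start_offset=0, max_results=24):
    """Build one examples.json URL (hypothetical helper, not part of Scrapy)."""
    params = {
        "query": query,
        "maxResults": max_results,
        "startOffset": start_offset,
        "filter": 0,
    }
    # Dicts keep insertion order, so the query string matches the URL above.
    return BASE_URL + "?" + urlencode(params)


def page_urls(query, pages, max_results=24):
    """Assumption: each page starts max_results after the previous one."""
    return [
        examples_url(query, page * max_results, max_results)
        for page in range(pages)
    ]


for url in page_urls("unalienable", 2):
    print(url)
```

In a spider, each of these URLs would typically be passed to `scrapy.Request` with a callback that decodes the JSON body.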

If you want to just emulate a full browser and execute the JavaScript, you can use something like Selenium or Scrapinghub's Splash (and its corresponding Scrapy plugin).

Blender
  • I got that URL, but it returns a text file which again I'm able to scrape using Scrapy. I want to extract the sentences from the file. How can I do that? PS: Thanks for your answer :) – Ajay Singh Nov 22 '16 at 05:53
  • 1
    It's JSON. Parse it with `json.loads`: http://stackoverflow.com/questions/18171835/scraping-a-json-response-with-scrapy – Blender Nov 22 '16 at 05:55
  • I already tried that. `json.loads(response.body_as_unicode())` raises `ValueError: No JSON object could be decoded` – Ajay Singh Nov 22 '16 at 05:56
  • I don't know how, but it worked after I restarted the Scrapy shell. Thanks for your help! :) – Ajay Singh Nov 22 '16 at 06:05
  • Can you explain why you removed the JSONP parameter from the URL? – Ajay Singh Nov 22 '16 at 21:16
  • Because JSONP isn't JSON and wouldn't be parsed by a JSON parser. Leave it in and see what happens. – Blender Nov 22 '16 at 21:20
  • I did that first and it was not getting parsed. Then I removed it and was able to parse the response as a JSON object. – Ajay Singh Nov 22 '16 at 21:22
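
The JSONP point from the comments can be reproduced without touching the network. This is only a sketch: the payload below is invented (the thread never shows the real response fields), and `callback` is a placeholder for whatever function name the JSONP parameter would request. In a spider, the string passed in would be the decoded response body, for example `response.text`.

```python
import json

# Hypothetical payloads; the real response structure is not shown in the thread.
json_body = '{"sentences": ["We hold these truths to be self-evident."]}'
jsonp_body = "callback(" + json_body + ");"  # JSONP wraps JSON in a function call


def try_parse(body):
    """Return the decoded object, or None if the body is not valid JSON."""
    try:
        return json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


print(try_parse(json_body))   # plain JSON decodes to a dict
print(try_parse(jsonp_body))  # the JSONP wrapper makes decoding fail
```

Because the wrapper is JavaScript rather than JSON, dropping the JSONP parameter from the URL is simpler than stripping the wrapper afterwards.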