Scrapy: load 'ready-ed' DOM instead of Source

Question

Page Source

<html>
<title>Example Web</title>
<script>

$(document).ready(function(){
    document.getElementById('output').textContent = "Hi There"
}
)

</script>

<body>
<div id='output'></div>
</body>
</html>

As expected, the page DOM once loaded will be:

<html>
<title>Example Web</title>
<script>

$(document).ready(function(){
    document.getElementById('output').textContent = "Hi There"
}
)

</script>

<body>
<div id='output'>Hi There</div>
</body>
</html>

It seems that when crawling sites with Scrapy, the response is the page source rather than the page DOM. How can I make Scrapy return the page DOM so that I can extract the 'Hi There' string from the body?

Perhaps use something like phantomjs instead? – techfoobar May 19 '14 at 10:46

score 0 · Accepted Answer · edited May 23 '17 at 12:27

You cannot make Scrapy request the page DOM instead of the page source, because Scrapy is not a browser and cannot execute JavaScript. It simply builds an element tree from the response it receives.
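To illustrate the point, here is a minimal sketch of what any non-browser parser sees in the raw response. It uses Python's standard `html.parser` as a stand-in for Scrapy's own element tree (an assumption made so the example runs without Scrapy installed); `PAGE_SOURCE` is modelled on the question's page, and `OutputDivText` and `output_text` are hypothetical helper names.

```python
from html.parser import HTMLParser

# Raw page source, modelled on the page in the question.
PAGE_SOURCE = """<html>
<title>Example Web</title>
<script>
$(document).ready(function(){
    document.getElementById('output').textContent = "Hi There"
})
</script>
<body>
<div id='output'></div>
</body>
</html>"""


class OutputDivText(HTMLParser):
    """Collects the text found inside <div id="output">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "output":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        # The script body also arrives here, but only as plain text
        # outside the div; it is never executed.
        if self.inside:
            self.text += data


def output_text(html):
    """Return the text content of the 'output' div in the given HTML."""
    parser = OutputDivText()
    parser.feed(html)
    return parser.text


print(repr(output_text(PAGE_SOURCE)))
```

The div comes back empty because the script is treated as inert text. Only something that actually runs the JavaScript (a browser or a rendering service) would produce the 'Hi There' content.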

See the Google Groups discussion on Scrapy supporting JavaScript (https://groups.google.com/forum/#!topic/scrapy-users/tOVH-X7H3DI) and another Stack Overflow discussion on the same topic.

However, you might consider the external ScrapyJS middleware by Scrapinghub, which has Scrapy fetch pages through the Splash JavaScript rendering service.
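As a rough sketch of what wiring in that middleware might look like: ScrapyJS has since been published under the name scrapy-splash, and the setting names below are the ones I recall from that package's README, so check them against the version you install. The Splash address is a placeholder and `splash_meta` is a hypothetical helper, not part of the library.

```python
# settings.py fragment. Assumes the scrapy-splash package is installed and a
# Splash instance is running at the placeholder address below.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"


def splash_meta(wait_seconds=0.5):
    """Build request meta asking Splash for the rendered HTML (hypothetical helper).

    In a spider it would be passed along the lines of:
        scrapy.Request(url, callback=self.parse, meta=splash_meta())
    """
    return {
        "splash": {
            "endpoint": "render.html",        # return the HTML after rendering
            "args": {"wait": wait_seconds},   # give the page's scripts time to run
        }
    }
```

With this in place, the response passed to the spider callback should contain the HTML after the page's scripts have run, so a selector such as `response.css('#output::text')` would be expected to find 'Hi There'.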
