My scraper app requests a Vimeo search URL with a query string attached to it:

'http://vimeo.com/search?q=angularjs'

When I load that URL in Chrome, I can see a number of elements that do not show up when I request() that URL from my scraper. The HTML that both Chrome and my scraper receive seems to be the static elements, like the nav bar and footer. When I try to access any elements that Vimeo would generate from the query string search?q=angularjs, such as the video gallery grid that shows up in Chrome, my scraper does not get them. Here is my scraper so far:

var request = require('request'),
  cheerio = require('cheerio'),
  searchURL = 'http://vimeo.com/search?q=angularjs';

request(searchURL, function(err, resp, body){
  if(err)
    throw err;
  var $ = cheerio.load(body);
  console.log($('#site_header .join a').text());
  console.log($('#page_header h1').text());
  $('#browse_content .browse_videos li a').each(function(){
    console.log($(this).attr('href'));
  });
});

After loading the body into $ with Cheerio I run

console.log($('#site_header .join a').text());

which logs Join to the console. That works. Great. But if I do

console.log($('#page_header h1').text());

what I get logged to the console is Please Try Again, which I assume means that the query could not be fulfilled. But when I look at that bit of HTML in the page source in Chrome, I see:

<header id="page_header">
    <h1>Search videos for <mark class="txt_normal">angularjs</mark></h1>
</header>

And just to be certain I ran

console.log($('html').html());

which gave me back an HTML page that is missing the browse_content div, which contains the video thumbnail gallery grid. This is why the following code returns nothing:

$('#browse_content .browse_videos li a').each(function(){
  console.log($(this).attr('href'));
});

So how come Vimeo does not want to give my scraper the content it is requesting?

1 Answer


Without looking too closely at the details of your example, I suspect the video grid is generated by JavaScript, which request() does not execute. You'll probably need something like http://phantomjs.org/ to run the scripts on the Vimeo page. PhantomJS can hand back the rendered HTML, which you can then load into cheerio and query as usual.
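As a rough sketch of that idea, the script below is meant to be run with the phantomjs binary (not with node). It opens the search URL, polls until the rendered markup contains the browse_content element, and then prints the video links. The marker string, the polling interval, the retry limit, and the helper name looksRendered are assumptions made for illustration, and I have not verified that Vimeo actually renders the grid this way.

```javascript
// Sketch of a PhantomJS script (run with: phantomjs render.js), not a Node module.
// Assumption: the video grid lives in #browse_content, as in the question.
var SEARCH_URL = 'http://vimeo.com/search?q=angularjs';
var MARKER = 'id="browse_content"'; // hypothetical "page is ready" signal
var POLL_MS = 250;                  // hypothetical polling interval
var MAX_TRIES = 40;                 // give up after roughly 10 seconds

// True once the rendered markup contains the element we are waiting for.
function looksRendered(html) {
  return typeof html === 'string' && html.indexOf(MARKER) !== -1;
}

// Only run the browser part when executed by PhantomJS itself.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();

  page.open(SEARCH_URL, function (status) {
    if (status !== 'success') {
      console.log('Failed to load ' + SEARCH_URL);
      phantom.exit(1);
      return;
    }

    var tries = 0;
    var timer = setInterval(function () {
      tries += 1;
      if (looksRendered(page.content)) {
        clearInterval(timer);
        // This function runs inside the page, so it sees the DOM
        // after the page's own scripts have executed.
        var links = page.evaluate(function () {
          var anchors = document.querySelectorAll(
            '#browse_content .browse_videos li a');
          return Array.prototype.map.call(anchors, function (a) {
            return a.getAttribute('href');
          });
        });
        console.log(links.join('\n'));
        phantom.exit(0);
      } else if (tries >= MAX_TRIES) {
        clearInterval(timer);
        console.log('Gave up waiting for #browse_content');
        phantom.exit(1);
      }
    }, POLL_MS);
  });
}
```

If you would rather keep your cheerio selectors, you could print page.content instead of the links and pass that string to cheerio.load() in your Node scraper. If the script gives up without ever seeing browse_content, the missing grid is probably not a JavaScript rendering issue and something else is going on.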

finspin