Parse data from JavaScript of retrieved page

Question

I'm retrieving a web page with OpenURI:

require 'open-uri'
page = URI.open('http://www.example.com').read.scrub

Now I'd like to parse the values of the attributes playerurl, playerdata and pageurl of the retrieved page. They appear in a <script> tag:

<script>
..
..
  PlayerWatchdog.init({
      'playerurl': 'http://cdn.static.de/now/player.swf?ts=2011354353',
      'playerdata': 'http://www.example.com/player',
      'pageurl': 'http://www.example.com?test=2',
      });
..
..
</script>

What's the smartest way to accomplish this?

I'm not sure what you mean by "the three JS attributes". Is there an embedded script that you want to parse? Or are these inside HTML elements? — Max, Nov 03 '14 at 16:51
It's a script-tag inside the html-page itself. By attributes I mean the values of 'playerurl', 'playerdata' and 'pageurl' — Hedge, Nov 03 '14 at 17:07

the Tin Man · Answer 1 · 2014-11-03T18:29:30.467

You can use an HTML parser, such as Nokogiri, to take apart the HTML document, and quickly find the <script> tag you're after. The content inside a <script> tag is text, so Nokogiri's text method will return that. Then it's a matter of selectively retrieving the lines you want, which can be done by a simple regular expression:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <head>
    <script>
      PlayerWatchdog.init({
          'playerurl': 'http://cdn.static.de/now/player.swf?ts=2011354353',
          'playerdata': 'http://www.example.com/player',
          'pageurl': 'http://www.example.com?test=2',
          });
    </script>
  </head>
</html>
EOT

script_text = doc.at('script').text
playerurl, playerdata, pageurl = %w[
  playerurl
  playerdata
  pageurl
].map{ |i| script_text[/'#{ i }': '([^']+)'/, 1] }

playerurl # => "http://cdn.static.de/now/player.swf?ts=2011354353"
playerdata # => "http://www.example.com/player"
pageurl # => "http://www.example.com?test=2"

at returns the first matching <script> Node instance. Depending on the HTML, the first matching <script> might not be the one you want. In that case you can use search instead, which returns a NodeSet, similar to an array of Nodes, and grab a particular element from it. Alternatively, instead of a CSS selector, you can use XPath, which lets you easily specify a particular occurrence of the desired tag.

Once the tag is found, text returns its contents, and the task moves from Nokogiri to using a pattern to find what is desired. /'#{ i }': '([^']+)'/ is a simple pattern that looks for the word passed in as i, wrapped in single quotes and followed by : ', then captures everything up to, but not including, the next '. That pattern is passed to String's [] method, and the 1 tells it to return the first capture group.

score 1 · Accepted Answer · answered Nov 03 '14 at 17:31

Ruby has no built-in JavaScript parsing capabilities. You can use a regexp, though this will be rather sensitive to the formatting of the page (for example, it will break if the page starts using double quotes for strings):

playerurl = page[/'playerurl':\s*'([^']*)'/, 1]
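
To make that a little less brittle, the pattern could accept either quote style. The following is a rough sketch, not a JavaScript parser: js_string_value is a hypothetical helper name, the sample text is invented from the snippet in the question, and it still assumes quoted keys and values without escaped quotes inside them.

```ruby
# Hypothetical helper, not part of any library: returns the string value
# assigned to key inside a JavaScript object literal, accepting single or
# double quotes around both the key and the value.
def js_string_value(source, key)
  # \1 and \2 are backreferences, so each closing quote must match the
  # quote character that opened the key or the value.
  pattern = /(['"])#{Regexp.escape(key)}\1\s*:\s*(['"])(.*?)\2/
  source[pattern, 3]
end

# Made-up sample mixing quote styles and spacing.
sample = <<~JS
  PlayerWatchdog.init({
      'playerurl': 'http://cdn.static.de/now/player.swf?ts=2011354353',
      "playerdata": "http://www.example.com/player",
      'pageurl' : "http://www.example.com?test=2",
      });
JS

playerurl  = js_string_value(sample, 'playerurl')
playerdata = js_string_value(sample, 'playerdata')
pageurl    = js_string_value(sample, 'pageurl')
missing    = js_string_value(sample, 'nosuchkey') # nil when the key is absent
```

Since String's [] returns nil when nothing matches, it is probably worth checking the result before using it, in case the page layout changes.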
