0

I'm retrieving a web page with OpenURI:

require 'open-uri'
page = URI.open('http://www.example.com').read.scrub

Now I'd like to parse the values of the attributes playerurl, playerdata and pageurl of the retrieved page. They appear in a <script> tag:

<script>
..
..
  PlayerWatchdog.init({
      'playerurl': 'http://cdn.static.de/now/player.swf?ts=2011354353',
      'playerdata': 'http://www.example.com/player',
      'pageurl': 'http://www.example.com?test=2',
      });
..
..
</script>

What's the smartest way to accomplish this?

the Tin Man
Hedge
  • I'm not sure what you mean by "the three JS attributes". Is there an embedded script that you want to parse? Or are these inside HTML elements? – Max Nov 03 '14 at 16:51
  • It's a script tag inside the HTML page itself. By attributes I mean the values of 'playerurl', 'playerdata' and 'pageurl' – Hedge Nov 03 '14 at 17:07

2 Answers

3

You can use an HTML parser, such as Nokogiri, to take apart the HTML document, and quickly find the <script> tag you're after. The content inside a <script> tag is text, so Nokogiri's text method will return that. Then it's a matter of selectively retrieving the lines you want, which can be done by a simple regular expression:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <head>
    <script>
      PlayerWatchdog.init({
          'playerurl': 'http://cdn.static.de/now/player.swf?ts=2011354353',
          'playerdata': 'http://www.example.com/player',
          'pageurl': 'http://www.example.com?test=2',
          });
    </script>
  </head>
</html>
EOT

script_text = doc.at('script').text 
playerurl, playerdata, pageurl = %w[
  playerurl
  playerdata
  pageurl
].map { |i| script_text[/'#{ i }': '([^']+)'/, 1] }

playerurl # => "http://cdn.static.de/now/player.swf?ts=2011354353"
playerdata # => "http://www.example.com/player"
pageurl # => "http://www.example.com?test=2"

at returns the first matching <script> Node. Depending on the HTML, the first <script> might not be the one you want. In that case use search, which returns a NodeSet (similar to an array of Nodes), and pick a particular element from it. Alternatively, use XPath instead of a CSS selector, which makes it easy to specify a particular occurrence of the tag.

Once the tag is found, text returns its contents, and the task moves from Nokogiri to using a pattern to find what is desired. /'#{ i }': '([^']+)'/ is a simple pattern that looks for the quoted word passed in as i, followed by : ', and then captures everything up to (but not including) the next '. That pattern is passed to String's [] method, along with 1 to select the first capture group.

the Tin Man
1

Ruby has no built-in JavaScript parsing capabilities. You can use a regexp, though this will be rather sensitive to the formatting of the page (for example, it will break if the page starts using double quotes for strings):

playerurl = page[/'playerurl':\s*'([^']*)'/, 1]
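
If the quote style is a concern, the pattern can be loosened a little. The sketch below uses a hypothetical helper, js_value (not part of any library), that accepts either single or double quotes around the key and the value. It is still a regexp, so it will not cope with escaped quotes inside a value or with unquoted keys.

```ruby
# Hypothetical helper: returns the string value for a key, accepting
# either ' or " around the key and around the value.
def js_value(source, key)
  # \1 and \2 are backreferences, so each closing quote must match its opening quote.
  pattern = /(['"])#{Regexp.escape(key)}\1\s*:\s*(['"])(.*?)\2/
  source[pattern, 3]
end

# Sample input (assumed): the same call as in the question, with mixed quote styles.
page = <<~JS
  PlayerWatchdog.init({
      'playerurl': 'http://cdn.static.de/now/player.swf?ts=2011354353',
      "playerdata": "http://www.example.com/player",
      'pageurl': 'http://www.example.com?test=2',
      });
JS

values = %w[playerurl playerdata pageurl].to_h { |key| [key, js_value(page, key)] }
```

A key that is not present simply yields nil, which makes missing values easy to detect.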
Max