7

There is a large HTML file with many script tags in it. One of them declares a variable, foo, and I'm trying to scoop out the contents of that variable. The variable name stays the same, but the contents change on every request.

examplefile.html

<script type="text/javascript">//.... more js</script>
<script type="text/javascript">//.... more js</script>
<script type="text/javascript">var foo = {"b":"bar","c":"cat"}</script>
<script type="text/javascript">//.... more js</script>
<script type="text/javascript">//.... more js</script>
<script type="text/javascript">//.... more js</script>

desired console result

> var result = $('script').<some_selection_thingy>
result = {"b":"bar","c":"cat"}

Let me explain a little. By <some_selection_thingy> I mean that my question is: a) how do I select the element of the array that contains 'var foo', and b) how do I get the contents of the foo variable so that I can load that information into a local JSON variable for further processing?

When you run $('script') in the console, jQuery returns an array.

> $('script')
[<script type="text/javascript">//.... more js</script>,<script type="text/javascript">//.... more js</script>,<script type="text/javascript">var foo = {"b":"bar","c":"cat"}</script>,<script type="text/javascript">...</script>]

Because this is cheerio and not actually jQuery, the DOM isn't loaded, so I can't just do $(foo). I could use jsdom instead of cheerio, but I've read in other Stack Overflow answers (while researching this question) that it's less performant, so I'd prefer to learn the correct jQuery selectors I need to scoop out this variable.

server.js

// some cheerio node code
url = 'someurl';
request(url, function(error, response, html){
    var $ = cheerio.load(html);
    result = $('script').map(&:text).select{ |s| s['var foo'] }
    result = result[0]
//SyntaxError: Unexpected token &

That error is expected, of course: .map(&:text) is Ruby syntax, which is what I'd write if I were using xpath, and it doesn't work with cheerio (jQuery).

Falieson

3 Answers

15

I got it!

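// Returns the text between `variable` and the next ";", or null if either is missing.
function findTextAndReturnRemainder(target, variable){
    // indexOf matches literally; search() would treat `variable` as a regular expression
    var start = target.indexOf(variable);
    if (start === -1) { return null; }
    var chopFront = target.substring(start + variable.length);
    var end = chopFront.indexOf(";");
    return end === -1 ? null : chopFront.substring(0, end);
}
// .text() on the whole selection collects the text of every matched script tag into one string
var text = $('script').text();
var findAndClean = findTextAndReturnRemainder(text, "var foo =");
// the assignment must end with ";" and the JSON itself must not contain one
var result = JSON.parse(findAndClean);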
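
One caveat with cutting at the first ";" is that it breaks when the JSON value contains a semicolon inside a string, or when the statement has no trailing semicolon. A rough alternative is to walk the text and match braces instead. The sketch below is only an illustration and not part of the original answer: extractObjectLiteral is a hypothetical helper name, and it assumes the variable holds an object literal whose strings are double-quoted, as in JSON.

```javascript
// Hypothetical helper: returns the "{...}" literal that follows `marker`, or null.
// Assumes JSON-style double-quoted strings; single-quoted strings are not handled.
function extractObjectLiteral(text, marker) {
    var start = text.indexOf(marker);
    if (start === -1) { return null; }
    var open = text.indexOf('{', start + marker.length);
    if (open === -1) { return null; }

    var depth = 0;        // current brace nesting level
    var inString = false; // true while inside a double-quoted string
    var escaped = false;  // true right after a backslash inside a string

    for (var i = open; i < text.length; i++) {
        var ch = text[i];
        if (inString) {
            if (escaped) { escaped = false; }
            else if (ch === '\\') { escaped = true; }
            else if (ch === '"') { inString = false; }
        } else if (ch === '"') {
            inString = true;
        } else if (ch === '{') {
            depth++;
        } else if (ch === '}') {
            depth--;
            // The brace that closes the outermost object ends the literal.
            if (depth === 0) { return text.slice(open, i + 1); }
        }
    }
    return null; // braces never balanced
}
```

With the text collected as above, JSON.parse(extractObjectLiteral(text, "var foo =")) would then give the object, provided the helper did not return null.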
Falieson
5
var cheerio = require('cheerio');
$ = cheerio.load(html);

Then you should have your text by $('script')[0].text() for instance.

If it's always a "var foo = {"b":"bar","c":"cat"}" pattern that you parse then you could do something like this to get the object:

var text = $('script')[0].text();
// substring takes (start, end); substr would treat the second argument as a length
var str = text.substring(text.indexOf('{'), text.indexOf('}') + 1);
JSON.parse(str);
{ b: 'bar', c: 'cat' }
Jimmy Bernljung
  • Tried in cheerio and jquery console https://gist.github.com/Falieson/65ff35f28a91ae955429 – Falieson Feb 21 '15 at 22:50
  • That is strange, it works for me. Are you loading the html as a string into cheerio? – Jimmy Bernljung Feb 21 '15 at 22:51
  • (updated gist) url = 'someurl'; request(url, function(error, response, html){ var $ = cheerio.load(html); – Falieson Feb 21 '15 at 22:58
  • Looks straightforward enough. Any luck if you try to access .html() instead? – Jimmy Bernljung Feb 21 '15 at 23:01
  • I would like to include a little more context just to be certain I've correctly communicated what I'm attempting to do. There is a large html file with many javascript tags in it. I'm trying to scoop out the contents of that variable. The variable name stays the same but the contents change on every request. I'm a little confused why selecting $('script')[0].text() will result in the correct javascript tag which contains the 'var foo'. I read this as the first item in the array, which has a typeof object which you're trying to convert into a string... cheerio and jquery both complain about. – Falieson Feb 21 '15 at 23:07
  • Yeah, I think I get it, but to parse it we must first be able to access the content. ($("script")[0]).innerHTML or $("script")[0].html() perhaps? – Jimmy Bernljung Feb 21 '15 at 23:12
  • innerHTML -> undefined innerHTML() -> TypeError: Object # has no method 'innerHTML' same for HTML – Falieson Feb 21 '15 at 23:21
  • innerHTML() works in jquery console but not in the cheerio node code – Falieson Feb 21 '15 at 23:28
  • Okay, somehow you're not getting the element properly, not sure why. Based on your console output it would seem like you are in fact able to fetch the node but obviously that's not the case... Sorry, not sure what else to try. :/ Are you able to get properties from tags other than – Jimmy Bernljung Feb 21 '15 at 23:30
  • No innerHTML should be a property not a method, but that's not working for you either. Are you able to console log the script node ($("script")[0]) from inside node.js, not just in the browser console? – Jimmy Bernljung Feb 21 '15 at 23:34
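
For readers following the comments above: the errors reported there are consistent with how indexing works in cheerio. $('script')[0] is the underlying parsed node, not a cheerio object, so it has no .text() or .html() method and no innerHTML property. A minimal sketch of re-wrapping the node is below. firstScriptSource is a hypothetical name, and $ is assumed to be the value returned by cheerio.load(html).

```javascript
// Hypothetical helper; `$` is assumed to come from cheerio.load(html).
function firstScriptSource($) {
    var node = $('script')[0];   // plain parsed node: no .text(), .html() or innerHTML
    if (!node) { return null; }  // the document has no <script> tags
    return $(node).html();       // re-wrap the node to get cheerio's methods back
}
```

$('script').first().html() or $('script').eq(0).html() should be equivalent ways to stay inside a cheerio object instead of dropping down to the raw node.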
4

The accepted answer did not work for me in cheerio. Here's my solution:

var scripts = $('script').filter(function() {
    return ($(this).html().indexOf('var foo =') > -1);
});
if (scripts.length === 1) {
    var text = $(scripts[0]).html();
    // ...parse the text
}
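
The parse step is left open above. One possible way to fill it in, sketched here under assumptions, is to strip the "var foo =" prefix and an optional trailing semicolon with a regular expression and hand the rest to JSON.parse. parseAssignedJson is a hypothetical name, and the sketch assumes the script tag contains nothing but that single assignment and that the variable name has no regex-special characters.

```javascript
// Hypothetical helper: parses the JSON assigned to `varName` in a script
// that contains only "var <varName> = <json>" with an optional trailing ";".
function parseAssignedJson(scriptText, varName) {
    var pattern = new RegExp('var\\s+' + varName + '\\s*=\\s*([\\s\\S]*?);?\\s*$');
    var match = pattern.exec(scriptText);
    if (!match) { return null; }   // no such assignment in this script
    return JSON.parse(match[1]);   // throws if the value is not valid JSON
}
```

Inside the if block above this could be called as parseAssignedJson(text, 'foo').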
Joel