Some of the data I want to scrape is contained inside the page's JavaScript. It looks similar to this pattern:

<script type="text/javascript">
        arrayName["field1"] = 12;
        arrayName["field2"] = 42;
        arrayName["field3"] = 1442;
</script>
<script type="text/javascript">
        arrayName["field4"] = 62;
        arrayName["field5"] = 3;
        arrayName["field6"] = 542;
</script>

It's mixed in with a hell of a lot of other JavaScript. I need to get these values.

I started like so:

var dom = CQ.CreateFromUrl("http://somesite.xxx");

CQ script = dom["script[type='text/javascript']"];

But I can't work out how to grab this data from here. Is the only option to write a regex and loop over every script block, or is there another way that performs better?

I can't see how CSS selectors could be used on the actual JavaScript code. Should I try a different approach?

Guerrilla

2 Answers

It probably won't be very fast, but you could try using a WebBrowser control for this. Let it browse to the page, then execute your own JavaScript to retrieve the data. Example:

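// needs a reference to System.Windows.Forms
using System.Threading;
using System.Windows.Forms;

var url = "http://example.com";
// initialised so the compiler accepts reading it after the thread finishes
object arrayName = null;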
var thread = new Thread(() =>
{
    var browser = new WebBrowser { ScriptErrorsSuppressed = true };

    // prevent popups
    browser.NewWindow += (sender, e) =>
    {
        e.Cancel = true;
    };

    browser.DocumentCompleted += (sender, eventArgs) =>
    {
        // Call the JavaScript eval() function and pass it a string to evaluate.
        // Passing "arrayName" simply returns the value of that variable
        // in the global scope.
        arrayName = browser.Document.InvokeScript("eval", new object[] { "arrayName" });

        browser.Dispose();
        Application.ExitThread();
    };

    browser.Navigate(url);

    Application.Run();
});

// you need this when using a WebBrowser control in a console app
thread.SetApartmentState(ApartmentState.STA);
thread.Start();
thread.Join();

// now you should have something stored in the arrayName variable
Jonathan Amend
  • I need to scrape this data every time I get a request, so I think a WebBrowser control is too heavyweight. Ideally I want a way to navigate the JavaScript like I can the HTML; if not, I think it will have to be regex or something similar. – Guerrilla Oct 28 '14 at 03:30
  • If performance is a concern, then try using a JavaScript interpreter like [jint](https://www.nuget.org/packages/Jint) to evaluate the code that you find with your script tag selector. – Jonathan Amend Oct 28 '14 at 04:19
  • I had a brief look at jint, but the documentation was sparse and I couldn't see an easy way to drop all the script into it and do an eval. I'll have another look and see if I can find some examples online. Thanks – Guerrilla Oct 28 '14 at 13:51
It seems like you are really looking for a server-side JavaScript engine. CsQuery can get you the contents of the script tags easily enough, but then you need to actually run the script and be able to refer to the entities it creates. While in theory one could create some kind of query language to parse out lines of script, in reality that amounts to running it. If you only need to pull out particular lines containing simple assignments, and context isn't important, then you're probably looking at something as simple as regular expressions (or even grep) to filter out what you need.
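
For the simple-assignment case, a regular expression over the script text might look like the sketch below. It is only an illustration: the `ScriptValueExtractor` class and its `Extract` method are hypothetical names, and it assumes every value is an integer literal and every key is a double-quoted string, as in the question's sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of CsQuery or any other library.
public static class ScriptValueExtractor
{
    // Collects assignments of the form: arrayName["key"] = 123;
    // Assumes double-quoted keys and integer values; anything else is skipped.
    public static Dictionary<string, int> Extract(string scriptText, string arrayName)
    {
        var pattern = new Regex(
            Regex.Escape(arrayName) +
            @"\[""(?<key>[^""]+)""\]\s*=\s*(?<value>-?\d+)\s*;");

        var result = new Dictionary<string, int>();
        foreach (Match m in pattern.Matches(scriptText))
        {
            // A later assignment to the same key overwrites an earlier one.
            result[m.Groups["key"].Value] = int.Parse(m.Groups["value"].Value);
        }
        return result;
    }
}
```

You would pass in the text of each script block (or the whole page source) together with the variable name, for example `ScriptValueExtractor.Extract(pageSource, "arrayName")`. Because it never executes the script, it will miss values that are computed rather than written as literals.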

I have used the Noesis V8 wrapper (http://javascriptdotnet.codeplex.com/), which is also on NuGet as Noesis.Javascript.

It's as fast as anything (since it uses Google's V8 engine under the hood); the only real downside is it's not a pure .NET solution, but once set up it's pretty painless. An example of using it is in my project https://github.com/jamietre/SharpLinter which uses it to run JsHint.

There are a variety of 100% .NET JavaScript engines such as Jint, IronJS and Jurassic. I have used Jurassic before, and it's probably the fastest of these because it compiles to bytecode. It's surprisingly complete, but it is not really being actively developed, so it will probably be difficult to get much support. All of them are much, much slower than V8 and offer no real advantage other than having no non-.NET references.

Unless you really, really need it to be 100% .NET, just use JavascriptDotNet.

Jamie Treworgy
  • Thanks Jamie, I am going down this avenue. It's a bit tricky executing the script separately from the page, but I am using a regex to tidy things up and it appears to be working. Also, I just thought I'd say a big thank you for CsQuery. I was using HAP before, and CsQuery is so much better; it saves me a lot of time! – Guerrilla Oct 30 '14 at 17:42