The Chrome Dev Tools network tab has an initiator column that will show you exactly what code initiated the network request.

[Screenshot: Network tab of Chrome DevTools]

I'd like to get network request initiator information programmatically, so that I could run a script with a URL and a request search string as arguments and have it report where every request whose URL matches the search string came from on the page at that URL. For example, given the arguments www.stackoverflow.com and google, the output might look something like this (showing requesting URL, line number, and requested URL):

/   19  http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js
/   4291    http://www.google-analytics.com/analytics.js

I looked into PhantomJS, but its onResourceRequested callback doesn't provide any initiator information, or context from which it can be derived, according to the documentation: http://phantomjs.org/api/webpage/handler/on-resource-requested.html

Is it possible to do this with PhantomJS at all, or with some other tool or service such as Selenium?

UPDATE

From the comments and answers so far it seems as though this isn't currently supported by Phantom, Selenium or anything else. So here's an alternative approach that might work: Load the page, and all of the assets, and then find any occurrences of request search string in all of the files. How could I do that?

Ben Dowling
  • Somewhat related: http://stackoverflow.com/questions/17650466/how-to-retrieve-the-initiator-of-a-request-when-extending-chrome-devtool. I doubt you can get the initiators with Selenium since, for starters, both WebDriver and Chrome Developer Tools are Chrome debuggers and cannot be running at the same time: https://sites.google.com/a/chromium.org/chromedriver/help/devtools-window-keeps-closing. – alecxe Nov 29 '15 at 03:41
  • `window.performance.getEntries()` has the [`initiatorType`](https://w3c.github.io/resource-timing/#widl-PerformanceResourceTiming-initiatorType) for every entry, but no more than that and it's not exactly what you are looking for. – alecxe Nov 29 '15 at 03:44
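
To illustrate how coarse the Resource Timing data mentioned in the comment above is, here is a minimal sketch that groups resource URLs by `initiatorType`. The helper name `groupByInitiatorType` is hypothetical, and the sample entries are made up; in a browser you would pass in something like `performance.getEntriesByType('resource')` instead.

```javascript
// Hypothetical helper: group resource URLs by their Resource Timing initiatorType.
// `entries` is any array of objects with `name` and `initiatorType` properties.
function groupByInitiatorType(entries) {
    var groups = {};
    entries.forEach(function (entry) {
        // Entries without an initiatorType (e.g. marks) land under 'unknown'.
        var type = entry.initiatorType || 'unknown';
        (groups[type] = groups[type] || []).push(entry.name);
    });
    return groups;
}

// Made-up sample data standing in for real PerformanceResourceTiming entries.
var sampleEntries = [
    { name: 'http://example.com/a.js', initiatorType: 'script' },
    { name: 'http://example.com/b.css', initiatorType: 'link' },
    { name: 'http://example.com/c.js', initiatorType: 'script' }
];
console.log(JSON.stringify(groupByInitiatorType(sampleEntries), null, 2));
```

This only tells you the kind of element or API behind each request (script, link, img and so on), not the file and line that triggered it.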

3 Answers

You should file a feature request in the issue tracker against the DevTools. The initiator information is not exported in the HAR, so getting it out of there isn't going to work. As far as I know, no existing API allows for this either.

Garbee

I've been able to implement a solution that uses PhantomJS to get all of the URLs loaded by a page, and then use a combination of xargs, curl and grep to find the search string at those URLs.

The first piece is this PhantomJS script, which simply outputs every URL requested by a page:

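// Usage: phantomjs urls.js <url>
var system = require('system');
var page = require('webpage').create();

// Print the URL of every resource the page requests.
page.onResourceRequested = function (req) {
    console.log(req.url);
};

page.open(system.args[1], function (status) {
    // Exit non-zero only if the page failed to load.
    phantom.exit(status === 'success' ? 0 : 1);
});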

Here it is in action:

$ phantomjs urls.js http://www.stackoverflow.com | head -n6
http://www.stackoverflow.com/
http://stackoverflow.com/
http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js
http://cdn.sstatic.net/Js/stub.en.js?v=06bb9dbfaca7
http://cdn.sstatic.net/stackoverflow/all.css?v=af4b547e0e9f
http://cdn.sstatic.net/img/share-sprite-new.svg?v=d09c08f3cb07

For my problem I'm not interested in images, and those can be filtered out by adding the phantomjs argument --load-images=no.

The second piece is taking all of the URLs and searching their contents. It's not enough to just output the match: I also need to know which URL matched, the text surrounding the match, and ideally the line number too. Here's how to do that:

$ cat urls | xargs -I% sh -c "curl -s % | grep -E -n -o '(.{0,30})SEARCH_TERM(.{0,30})' | sed 's#^#% #'"

We can wrap this all up in a little script, where we'll pipe the output back through grep to get color highlighting on the search string:

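#!/bin/bash
# Arguments: $1 = page URL, $2 = search term. URLs and the term are quoted so that characters like & and ? survive.
phantomjs --load-images=no urls.js "$1" | xargs -I% sh -c "curl -s '%' | grep -E -n -o '(.{0,30})$2(.{0,30})' | sed 's#^#% #' | grep --color=always -e '$2'"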

We can then use it to search for any term on any site. Here we're looking for adzerk.net on stackoverflow.com:

[Screenshot: output of the script searching stackoverflow.com for adzerk.net]

So you can see that the adzerk.net request gets initiated somewhere around line 4158 of the main stackoverflow page. It's not a perfect solution, because the invocation might be somewhere completely different from where the URL is defined, but it's probably close, and certainly a good starting point for tracking down the exact invocation site.

There might be a better way to search the contents of each URL. It doesn't look like PhantomJS's onResourceReceived handler currently exposes the resource content, but there is ongoing work to address that, and once that's available all of this will be much simpler.
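
If the resource content were available in-process, the curl/grep/sed step could be replaced by something like the sketch below. It mirrors `grep -n -o '(.{0,30})TERM(.{0,30})'` and prefixes each hit with the URL. The function name `findMatches` and the sample body are hypothetical, and how the body text is obtained is deliberately left out, since that is the part PhantomJS does not currently provide.

```javascript
// Hypothetical helper: report every occurrence of `term` in `body`,
// with up to `context` characters on either side, as "url line:snippet".
function findMatches(url, body, term, context) {
    context = context === undefined ? 30 : context;
    // Escape regex metacharacters so the term is matched literally.
    var escaped = term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    var pattern = new RegExp(
        '.{0,' + context + '}' + escaped + '.{0,' + context + '}', 'g');
    var results = [];
    body.split('\n').forEach(function (line, index) {
        (line.match(pattern) || []).forEach(function (snippet) {
            // Line numbers are 1-based, as in grep -n.
            results.push(url + ' ' + (index + 1) + ':' + snippet);
        });
    });
    return results;
}

// Made-up page content for illustration.
var sampleBody = 'var a = 1;\nload("http://example.net/ads.js");\n';
findMatches('http://example.com/', sampleBody, 'example.net').forEach(function (hit) {
    console.log(hit);
});
```

Unlike the grep version, this treats the search term as a literal string rather than a regular expression, so the dot in a hostname only matches a dot.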

Ben Dowling

You can use Chrome's debugger protocol from a process external to Chrome or use the chrome.debugger API in a Chrome extension (see How to retrieve the Initiator of a request when extending Chrome DevTool?).
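
As a rough sketch of what the debugger-protocol route could look like: the protocol's Network.requestWillBeSent event carries an initiator object, and the functions below reduce such events to the requesting URL, line number and requested URL asked for in the question. Capturing the events (for example over a remote debugging connection) is left out. The helper names `describeInitiator` and `matchingInitiators` are hypothetical, the sample events are made up, and the field layout (`initiator.url`, `initiator.lineNumber`, `initiator.stack.callFrames`, 0-based line numbers) is an assumption based on the current protocol documentation, so check it against the Chrome version you target.

```javascript
// Hypothetical helper: summarize the initiator of one
// Network.requestWillBeSent event payload (`params`).
function describeInitiator(params) {
    var initiator = params.initiator || {};
    var frames = (initiator.stack && initiator.stack.callFrames) || [];
    // Script-initiated requests: use the top call frame.
    // Parser-initiated requests: url/lineNumber sit on the initiator itself.
    var source = frames.length > 0 ? frames[0] : initiator;
    return {
        requestedUrl: params.request.url,
        type: initiator.type || 'other',
        initiatorUrl: source.url || null,
        // Assumption: protocol line numbers are 0-based; report them 1-based.
        line: typeof source.lineNumber === 'number' ? source.lineNumber + 1 : null
    };
}

// Hypothetical helper: keep only requests whose URL contains searchString.
function matchingInitiators(events, searchString) {
    return events.filter(function (params) {
        return params.request.url.indexOf(searchString) !== -1;
    }).map(describeInitiator);
}

// Made-up events shaped like Network.requestWillBeSent payloads.
var sampleEvents = [
    { request: { url: 'http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js' },
      initiator: { type: 'parser', url: 'http://stackoverflow.com/', lineNumber: 18 } },
    { request: { url: 'http://www.google-analytics.com/analytics.js' },
      initiator: { type: 'script', stack: { callFrames: [
          { functionName: '', url: 'http://stackoverflow.com/', lineNumber: 4290, columnNumber: 0 }
      ] } } },
    { request: { url: 'http://cdn.sstatic.net/Js/stub.en.js' },
      initiator: { type: 'parser', url: 'http://stackoverflow.com/', lineNumber: 20 } }
];
matchingInitiators(sampleEvents, 'google').forEach(function (r) {
    console.log(r.initiatorUrl + '\t' + r.line + '\t' + r.requestedUrl);
});
```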

mfulton26
  • The answer you've linked to says it's not possible. – Ben Dowling Dec 04 '15 at 21:06
  • @BenDowling, it says that when using a [Chrome DevTools Extension](https://developer.chrome.com/extensions/devtools) and the exposed data through HAR that it isn't possible but it is possible through the debugger API. I suggest [sniffing the protocol](https://developer.chrome.com/devtools/docs/debugger-protocol#sniffing-the-protocol) as described under [Debugging over the wire](https://developer.chrome.com/devtools/docs/debugger-protocol#remote) to find the necessary APIs you will need to get network details, etc. and then implementing a chrome extension to capture & expose the data you want. – mfulton26 Dec 04 '15 at 21:14
  • I want to do this in a headless env. Would a chrome extension work for that? I think my suggested approach of just gripping all resources for the search string is looking like the simplest way to go. – Ben Dowling Dec 04 '15 at 21:22
  • I'm guessing you mean "grepping". How will you get the directory of all of the sources? I suspect you'll need to use a Chrome extension for that (or a DevTools extension). – mfulton26 Dec 04 '15 at 21:32
  • Yeah, autocorrect! :) using phantomjs. I'll write it up as an answer if nobody beats me to it – Ben Dowling Dec 04 '15 at 21:33
  • But yes, Chrome extensions will run regardless of whether the browser is running in a headless environment, and they can execute scripts in the background (e.g. attach to the debugger on tab creation, etc.). – mfulton26 Dec 04 '15 at 21:33