
I'm playing with Chromium's headless browser and its remote debugging (DevTools) API. Based on the chrome_remote_shell source code, I came up with the following script:

#!/usr/bin/env python

import json

import requests
import websocket

# Ask the DevTools HTTP endpoint for the list of open tabs.
tablist = requests.get("http://%s:%s/json" % ("localhost", 9222)).json()
print(tablist)

# Attach to the first tab over its WebSocket debugger URL.
wsurl = tablist[0]['webSocketDebuggerUrl']
conn = websocket.create_connection(wsurl)

# Subscribe to Network events, then navigate.
conn.send(json.dumps({"id": 0, "method": "Network.enable"}))
conn.send(json.dumps({"id": 1, "method": "Page.navigate", "params": {"url": "https://news.ycombinator.com/"}}))

# Print event names, or the whole packet for command responses.
while True:
    packet = json.loads(conn.recv())
    if 'method' in packet:
        print(packet['method'])
    else:
        print(packet)

Here's example output:

[{u'description': u'', u'title': u'Hacker News', u'url': u'https://news.ycombinator.com/', u'webSocketDebuggerUrl': u'ws://localhost:9222/devtools/page/7d03a57d-77a9-4ceb-b645-3b85461de5be', u'type': u'page', u'id': u'7d03a57d-77a9-4ceb-b645-3b85461de5be', u'devtoolsFrontendUrl': u'/devtools/inspector.html?ws=localhost:9222/devtools/page/7d03a57d-77a9-4ceb-b645-3b85461de5be'}]
{u'id': 0, u'result': {}}
Network.requestWillBeSent
{u'id': 1, u'result': {u'frameId': u'21045.1'}}
Network.responseReceived
Network.dataReceived
Network.dataReceived
Network.loadingFinished
Network.requestWillBeSent
Network.requestWillBeSent
Network.requestServedFromCache
Network.responseReceived
Network.dataReceived
Network.loadingFinished
Network.requestWillBeSent
Network.requestServedFromCache
Network.responseReceived
Network.dataReceived
Network.loadingFinished
Network.requestWillBeSent
Network.requestServedFromCache
Network.responseReceived
Network.dataReceived
Network.loadingFinished
Network.responseReceived
Network.dataReceived
Network.loadingFinished
Network.requestWillBeSent
Network.requestServedFromCache
Network.responseReceived
Network.dataReceived
Network.loadingFinished

I get a long stream of messages, the last of which is Network.loadingFinished, but that event arrives once per requestId, so I can't tell which one is final. How can I modify my script so that it detects when the page has fully loaded and breaks out of the loop?

d33tah

3 Answers


It turns out I should have also subscribed to page events via Page.enable:

#!/usr/bin/env python

import json
import sys

import requests
import websocket

# Ask the DevTools HTTP endpoint for the list of open tabs.
tablist = requests.get("http://%s:%s/json" % ("localhost", 9222)).json()
print(tablist)

# Attach to the first tab over its WebSocket debugger URL.
wsurl = tablist[0]['webSocketDebuggerUrl']
conn = websocket.create_connection(wsurl)

# Subscribe to Network and Page events, then navigate to the URL
# given as the first command-line argument.
conn.send(json.dumps({"id": 0, "method": "Network.enable"}))
conn.send(json.dumps({"id": 1, "method": "Page.enable"}))
conn.send(json.dumps({"id": 2, "method": "Page.navigate", "params": {"url": sys.argv[1]}}))

# Print every message until the page's load event fires.
while True:
    s = conn.recv()
    packet = json.loads(s)
    if packet.get('method') == 'Page.loadEventFired':
        break
    print(s)

What we're doing here is enabling notifications for both Page and Network events, then navigating to the website and reading every message that arrives afterwards. Page.loadEventFired is only sent if Page.enable was called first, which is why the original script never saw it. Once we receive Page.loadEventFired, we can assume that the page has finished loading, so we exit the loop and carry out any actions that depend on this condition.
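The loop above blocks forever if the load event never arrives. As a minimal sketch (Python 3 assumed), the wait could be wrapped in a helper with a deadline. The name wait_for_load, its parameters, and the 30-second default are hypothetical and not part of any library; recv stands for any zero-argument callable that returns one JSON message, such as conn.recv from the script above.

```python
import json
import time


def wait_for_load(recv, timeout=30.0, clock=time.monotonic):
    """Read DevTools messages until Page.loadEventFired arrives.

    recv    -- zero-argument callable returning one JSON message string
               (hypothetical parameter; e.g. conn.recv)
    timeout -- seconds to wait before giving up (assumed default)
    clock   -- time source, replaceable for testing

    Returns the list of packets seen before the load event.
    Raises TimeoutError if the deadline passes first.
    """
    deadline = clock() + timeout
    seen = []
    # The deadline is only checked between messages, so a recv() that
    # blocks indefinitely will still block here.
    while clock() < deadline:
        packet = json.loads(recv())
        if packet.get("method") == "Page.loadEventFired":
            return seen
        seen.append(packet)
    raise TimeoutError("Page.loadEventFired not seen within %s s" % timeout)
```

With the script above this would be called as wait_for_load(conn.recv). Because the deadline is only checked between messages, it probably makes sense to also set a receive timeout on the socket itself so that a silent connection cannot stall the loop.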

d33tah
  • Can you explain what is going on here please? Websockets? Why? – Fandango68 May 20 '21 at 23:32
  • @Fandango68 this is (or at least was; I haven't checked the current state) the standard way of communicating with Chrome in headless mode. – d33tah May 21 '21 at 08:22
  • Thanks, but I still don't understand how this waits for the page to load. Where is the code in relation to the Page.enable you mentioned? Again, I ask that you explain the code. – Fandango68 Dec 19 '21 at 01:28
  • @Fandango68 I just added some explanation. Does this address your problem? If not, please try to explain with as much detail as possible where your confusion comes from. – d33tah Dec 19 '21 at 18:55

In any general sense, you can't... not really.

Given dynamic web pages these days, you need to understand what the page is actually doing and look for some specific event / existence of a DOM element, or other clue.

As you see, you're getting lots of loadingFinished events, but how do you know which one is the "last"? You need to understand the page. For example, can you determine how many requests will be sent by observing that the page makes one request per specific DOM element class, or based on a JavaScript variable or an XHR response? If so, then you can stop once you get n responses. Or is there something special about the last request (target or payload) or the last response (e.g., zero length, or containing the text "last", ^D, or ^Z)?
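One generic clue, short of knowing the page in detail, is to track which requests are still in flight. The sketch below is only a heuristic under stated assumptions, not something the DevTools protocol offers as a "done" signal: it assumes both Network.enable and Page.enable were sent, and it treats the page as loaded once Page.loadEventFired has arrived and every requestId seen in Network.requestWillBeSent has been matched by Network.loadingFinished or Network.loadingFailed. The function name wait_until_idle and its recv parameter are hypothetical.

```python
import json


def wait_until_idle(recv):
    """Block until the load event has fired and no request is pending.

    recv -- zero-argument callable returning one JSON message string
            (hypothetical parameter; e.g. conn.recv)

    Returns the number of messages consumed.
    """
    pending = set()      # requestIds that were sent but have not finished
    load_fired = False
    consumed = 0
    while True:
        packet = json.loads(recv())
        consumed += 1
        method = packet.get("method")
        # Command responses carry no "params", hence the default.
        request_id = packet.get("params", {}).get("requestId")
        if method == "Network.requestWillBeSent":
            pending.add(request_id)
        elif method in ("Network.loadingFinished", "Network.loadingFailed"):
            pending.discard(request_id)
        elif method == "Page.loadEventFired":
            load_fired = True
        if load_fired and not pending:
            return consumed
```

This never returns for a page that keeps a request open indefinitely (long polling, streaming responses), which is exactly the case where "finished loading" has no clear meaning, so in practice it would need a timeout as well.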

Also, if the page is polling the server (often with sockets), what does "finish loading" even mean?

Update for onload

If you're looking for what would be the onload event, you don't have to do anything special. driver.get(<url>) blocks until then.

WebDriver will wait until the page has fully loaded (that is, the onload event has fired) before returning control to your test or script. It's worth noting that if your page uses a lot of AJAX on load then WebDriver may not know when it has completely loaded. If you need to ensure such pages are fully loaded then you can use waits.

pbuck
  • Thanks pbuck, I was thinking the same. I'd basically like to catch the moment a normal browser would hide the spinning "loading" symbol, signalling the end of the page load. Does Chrome expose it somehow? – d33tah Apr 21 '17 at 18:48
  • Other than the standard `onload` event? Spec says: "Fired at the Window when the document has finished loading", Mozilla says, "fires at the end of the document loading process. At this point, all of the objects in the document are in the DOM, and all images, scripts, links and sub-frames have finished loading." – pbuck Apr 21 '17 at 20:09
  • Yeah, the question is how to get this event from headless chrome. – d33tah Apr 22 '17 at 00:31
  • `onload` is essentially built into the `driver.get()` (I've updated answer with this info.) Still it may not be sufficient for what you're trying to do. – pbuck Apr 22 '17 at 01:58
  • Keep in mind that I meant the DevTools API, not Selenium's WebDriver, which is higher level. Anyway, I found an answer: http://stackoverflow.com/a/43554418/1091116 – d33tah Apr 22 '17 at 02:03
  • IMO this should have been the accepted answer. The OP's answer is very thin on why it's the solution. Might have to flag this unless it's edited. – Fandango68 Dec 19 '21 at 01:31

I'm not sure how WebSockets work, but with plain sockets, when you connect to the remote server you receive data in chunks. So to receive the whole response you should read in a loop until you get a chunk that is smaller than the chunk length. I mean, if your chunk size is 4096 bytes, then the last chunk will have length 0 or x < 4096, where x is the length of the received chunk. With that information you know that all of the data was received from the remote server. Please read about sockets.

  • I'm sorry, but that's completely not the case. I'm talking to the API of a headless browser, which is giving me signals about the numerous requests it's making. My problem is how I should determine that all requests have completed. – d33tah Apr 21 '17 at 18:19
  • Then I'm not sure what you want. My first idea would be to set a timeout, if you can. Can you point out which line of your code you're talking about? – Dawid Dave Kosiński Apr 21 '17 at 18:27
  • The problem is about the DevTools interface. I told the remote browser to enable network debugging and navigate to a certain URL. I also need to tell it to let me know when it would fire the "loaded" event, so I can catch that and escape the loop. – d33tah Apr 21 '17 at 18:28
  • Sorry, I can't help you because of my lack of knowledge :/ – Dawid Dave Kosiński Apr 21 '17 at 18:34