4

I've created a script in vba using IE to keep clicking on the Load more hits button located at the bottom of a webpage until there is no such button is left.

Here is how my script can populate that button: In the site's landing page there is a dropdown named Type. The script can click on that Type to unfold the dropdown then it clicks on some corporate bond checkbox among the options. Finally, it clicks on the apply button to populate the data. However, that load more hits button can be visible at the bottom now.

My script can follow almost all the steps exactly what I described above. The only thing I am struggling to solve is that the script seems to get stuck after clicking on that button 3/4 times.

How can I rectify my script to keep clicking on that Load more hits button until there is no such button is left?

Website link

I've tried so far:

Sub ExhaustLoadMore()
    Dim IE As New InternetExplorer, I As Long
    Dim Html As HTMLDocument, post As Object, elem As Object
    Dim CheckBox As Object, btnSelect As Object

    With IE
        .Visible = True
        .navigate "https://www.boerse-stuttgart.de/en/tools/product-search/bonds"
        While .Busy Or .readyState < 4: DoEvents: Wend
        Set Html = .document

        Do: Loop Until Html.querySelectorAll(".bsg-loader-ring__item").Length = 0

        Html.querySelector("#bsg-filters-btn-bgs-filter-3").Click
        Do: Set CheckBox = Html.querySelector("#bsg-checkbox-3053"): DoEvents: Loop While CheckBox Is Nothing
        CheckBox.Click

        Set btnSelect = Html.querySelector("#bsg-filters-menu-bgs-filter-3 .bsg-btn__label")
        Do: Loop While btnSelect.innerText = "Close"
        btnSelect.Click

        Do: Loop Until Html.querySelectorAll(".bsg-loader-ring__item").Length = 0
        Do: Set elem = Html.querySelector(".bsg-table__tr td"): DoEvents: Loop While elem Is Nothing

        Do
            Set post = Html.querySelector(".bsg-searchlist__load-more button.bsg-btn--juna")
            If Not post Is Nothing Then
                post.ScrollIntoView
                post.Click
                Application.Wait Now + TimeValue("00:00:05")
            Else: Exit Do
            End If
        Loop
    End With
End Sub

I've tried with selenium but that seems to be way slower. However, it keeps clicking on the load more button after a long wait in between even when no hardcoded wait within it. In case of selenium: I wish to have any solution which might help reduce it's execution time.

Sub ExhaustLoadMore()
    Const Url$ = "https://www.boerse-stuttgart.de/en/tools/product-search/bonds"
    Dim driver As New ChromeDriver, elem As Object, post As Object

    With driver
        .get Url
        Do: Loop Until .FindElementsByCss(".bsg-loader-ring__item").count = 0
        .FindElementByCss("#bsg-filters-btn-bgs-filter-3", timeOut:=10000).Click
        .FindElementByXPath("//label[contains(.,'Corporate Bond')]", timeOut:=10000).Click
        .FindElementByXPath("//*[@id='bsg-filters-menu-bgs-filter-3']//button", timeOut:=10000).Click
        Do: Loop Until .FindElementsByCss(".bsg-loader-ring__item").count = 0
        Set elem = .FindElementByCss(".bsg-table__tr td", timeOut:=10000)
        Do
            Set post = .FindElementByCss(".bsg-searchlist__load-more button.bsg-btn--juna", timeOut:=10000)
            If Not post Is Nothing Then
                post.ScrollIntoView
                .ExecuteScript "arguments[0].click();", post
                Do: Loop Until .FindElementsByCss("p.bsg-searchlist__info--load-more").count = 0
            Else: Exit Do
            End If
        Loop
        Stop
    End With
End Sub
MITHU
  • 113
  • 3
  • 12
  • 41
  • I would suggest implementing AJAX approach but seems WebSockets is used here, so it requires to install additional ActiveX component at least. – omegastripes Mar 09 '19 at 18:50
  • Would you be open to selenium approach? Despite the implementation of shoelace on this site the responsiveness is not there and the long script run becomes blocking. – QHarr Mar 10 '19 at 08:10
  • Yes, at this point I'm open to selenium approach as IE can't do the trick. Thanks @QHarr. – MITHU Mar 10 '19 at 09:47
  • to be honest it is a nightmare page. If @omegastripes posts an answer along the lines they suggest I would be very interested to see. I will have a play with selenium later. – QHarr Mar 10 '19 at 09:56
  • Check out the edit @QHarr. – MITHU Mar 10 '19 at 10:28
  • On what line does the script fail / quit clicking? (F12) –  Mar 10 '19 at 11:31
  • In my first script, It clicks once and then gets stuck and stays right there @peakpeak. Specific link `post.Click`. – MITHU Mar 10 '19 at 12:18
  • Did you got any errors? what if you try to debug the code step by step, Does it move further as expected? Try wait for some more seconds to see whether it helps to move further the execution. as a work around, Try to click on limited numbers of button and than again try to repeat the process to finish the remaining work. – Deepak-MSFT Mar 11 '19 at 09:00
  • 1
    It could be that the load more button is clicked while the list is still refreshing. Count the number of results and only click load more if the count is superior to the previous count. – Florent B. Mar 11 '19 at 10:38
  • Yes it is nightmare as said by @QHarr. Could not get past `bsg-checkbox-3054` both with your codes (html & selenium) and with different approaches in both IE and chrome over 3/4 hrs. While inspecting found blocked by `Failed to load resource: the server responded with a status of 404 (Not Found)` from `VisitorIdentificationCSS.aspx:1` May be your browser cookies helping you to proceed further. – Ahmed AU Mar 14 '19 at 05:07
  • However if you are able to click `button.bsg-btn—juna` even once, may try to introduce some delay between successive click. It is likely server using some type of preventive measures to block quick repetitive attempt ( to block DDoS attack). This approach (with around 10 sec delay) worked in similar cases (i.e. reportaziende.it/italia.php) and also pointed out by @peakpeak – Ahmed AU Mar 14 '19 at 05:07
  • 1
    I won’t have the time to do it soon, but I’ve checked quickly your website and they retrieve their prices making a call to their API (FinderApi.js, you can find the file under the sources of the webpage). Inside this file they “unresponsably” forgot the endpoints of the API hardcoded (something like “/api/v1/bonds..”). If you set a breakpoint inside the function “performSearch” (which is called every time you load more results), you can see the pagination of the API is set to 50. If you send a JavaScript via your VBA and remove that pagination, you will get everything at once directly. – Matteo NNZ Mar 15 '19 at 07:43
  • 2
    Have a look at @QHarr answers, he recently posted an answer where he showed how to feed JavaScript through VBA directly (I think it was something like “execScript”). I think you should just play around a bit with the website and figure out how to target their FinderApi to remove the pagination properties, the results will come all at once and you won’t need to click every time to reload the results. – Matteo NNZ Mar 15 '19 at 07:51

2 Answers2

3

I have studied a bit your website, and since I could not say all of this into a single comment I have decided to post an answer (even though it doesn't provide with a concrete solution, but just with an "answer" and maybe some tips).

The answer to your question

How can I rectify my script to keep clicking on that Load more hits button until there is no such button is left?

Unfortunately, it's just not your fault. The website you are targeting is working through WebSocket communication between the web client (your browser) and the web server providing with the prices you are trying to scrape. You can see it as follows:

enter image description here

Imagine it like this:

  • When you first load your webpage, the web socket is initialized and the first request is sent (Web client: "Hey server, give me the first X results", Web server: "Sure, here you go").
  • Every time you click on the "Load more results" button, the Web client (important: re-using the same WS connection) keeps on asking for X new results to the web server.

So, the communication keeps on going on for some time. At some point, out of your control, it happens that the web socket just dies. It's enough to look at the JavaScript console while clicking on the "Load more results" button: you will see the request going through until at some point you don't just see a NullPointerException raised:

enter image description here

If you click on the last line of the stack before the exception, you will see that it's because of the web socket:

enter image description here

The error speaks clearly: cannot read .send() on null, meaning that _ws (the web socket) is gone.

Starting from now, you can forget about your website. When you click on the button "Load more results", the web client will ask the web socket to deliver the new request to the web server, but the web socket is gone so goodbye communications between the two, and so (unfortunately) goodbye the rest of your data.

You can verify this by just going a bit upper in the stack:

enter image description here

As you can see above, we have:

  1. A message logged in the console saying "performSearch params ...") just before posting the new data request
  2. The post of the new data request
  3. A message logged in the console saying "performed search with result ...") just after posting the new data request

While the web socket is still alive, everytime you click on "Load more results" you will see these two messages in the console (with other messages in between printed over the rest of their code):

enter image description here

However, after the first crash of the web socket, no matter how many times you try to click on the button you will only get the first message (web client sends the request) but never will get the second message (request gets lost in the void):

enter image description here

Please note this corresponds to your behavior observed in VBA:

the script seems to get stuck after clicking on that button 3/4 times.

It doesn't get stuck, actually your script keeps on executing correctly. It's the website that times out.

I have tried to figure out why the web socket crashes, but no luck. It just seems a timeout (I've had this a lot more while debugging their JavaScript, so my breakpoints were causing the timeout) but I can't make you sure it's the only cause. Since you're not controlling the process between the web client and the web server, all you can do is to hope that it doesn't timeout.

Also, I believe using Selenium automatically sets some longer timeouts (because of the long execution time) and this somehow allows you to keep the web socket more tolerant with respect to the timeouts.

The only way I found to restore the connection after a crash of the web socket is completely reload the web page and restart the process from scratch.

My suggestions

I think you might go with building an XHR request and sending through JavaScript, because their API (through which the web client/web socket deliver the request to the web server) is pretty exposed in their front-end code.

If you open their file FinderAPI.js, you will see they've left the endpoints and API configurations harcoded:

var FinderAPI = {
  store: null,
  state: null,
  finderEndpoint: '/api/v1/bsg/etp/finder/list',
  bidAskEndpoint: '/api/v1/prices/bidAsk/get',
  instrumentNameEndpoint: '/api/products/ProductTypeMapping/InstrumentNames',
  nameMappingEndpoint: '/api/v1/bsg/general/namemapping/list',
  apiConfig: false,
  initialize: function initialize(store, finderEndpoint) {
    var apiConfig = arguments.length > 2 && arguments[2] !== undefined ? arguments[2] : false;
    this.store = store;
    this.state = store.getState();
    this.apiConfig = apiConfig;
    this.finderEndpoint = finderEndpoint;
  },

This means you know the URL to which you should send your POST request.

A request also requires a Bearer Token to be validated by the server. Lucky you, they have also forgot to protect their tokens providing (GORSH) a GET end point to get the token:

End-point: https://www.boerse-stuttgart.de/api/products

Response: {"AuthenticationToken":"JgACxn2DfHceHL33uJhNj34qSnlTZu4+hAUACGc49UcjUhmLutN6sqcktr/T634vaPVcNzJ8sHBvKvWz","Host":"frontgate.mdgms.com"}

You'll just have to play around with the website a little bit to figure out what is the body of your POST request, then create a new XmlHttpRequest and send those values inside it to retrieve the prices directly in your VBA without opening the webpage and robotic-scraping.

I suggest you start with a breakpoint on the file FinderAPI.js, line 66 (the line of code is this.post(this.finderEndpoint, params), params should lead you to the body of the request - I remember you can print the object as string with JSON.stringify(params)).

Also, please note that they use a pagination of 50 results each time, even though their API supports up to 500 of them. In other words, if you get to sweep the value 500 (instead of 50) into their pagination property sent to the API for the request:

enter image description here

... then you will get 500 results per time instead of 50, so reducing by 10 the time your code will spend scraping the webpage in case you decide not to go deeper into the XHR solution.

Matteo NNZ
  • 11,930
  • 12
  • 52
  • 89
0

Could you try to change

Do
    Set post = Html.querySelector(".bsg-searchlist__load-more button.bsg-btn--juna")
    If Not post Is Nothing Then
      post.ScrollIntoView
    post.Click
    Application.Wait Now + TimeValue("00:00:05")
    Else: Exit Do
  End If
Loop

to:

Set post = Html.querySelector(".bsg-searchlist__load-more button.bsg-btn--juna")
If Not post Is Nothing Then
      post.ScrollIntoView
      While Not post Is Nothing
        Debug.Print "Clicking"
        post.Click
        Application.Wait Now + TimeValue("00:00:05")
      Wend
      Debug.Print "Exited Click"
End If

(untested)

  • Thank you! Edited my post which is still untested –  Mar 10 '19 at 14:22
  • Your suggested script pretends to be clicking by printing this line `Clicking` but in reality it is not. I've checked it in the browser. It clicks on that button once and then gets stuck like what it was doing in my case. – MITHU Mar 10 '19 at 14:25
  • Did you run the edited code? If so, what is the debug output? –  Mar 11 '19 at 14:44
  • Yep, I ran the edited code and noticed that it pretends to be clicking by printing that very line in the immediate window but when I check out the browser then I could see that it has clicked there once and got stuck. – MITHU Mar 13 '19 at 08:04
  • Does it get stuck on *post.Click* or does it continue to the next line, *Application.Wait Now + TimeValue("00:00:05")* ? You can single-step (F8) to find out. Also, does the *post* element have a click handler? –  Mar 13 '19 at 13:29
  • Just checking: Does the *post* element have a click handler? Otherwise your code won't work.Maybe you have selected the wrong element. Can you show us an excerpt of that web page for that particular element and those surrounding it ? –  Mar 15 '19 at 13:48