-1

I need to access web-pages and copy their content (everything) into a string which I will then extract some figures from.

The web-page address changes each time, as I am basically accessing an online simulation tool and I have to specify the sim parameters each time. And the output is always a string of about 320 characters. the web page consists ONLY in that text.

Example of web address / query:

http://re.jrc.ec.europa.eu/pvgis5/PVcalc.php?lat=45&lon=8&peakpower=1&loss=14&optimalangles=1&outputformat=basic

Example of web-page content (string to retrieve): 37 0 1 54.9 72.1 7.21 2 73.1 96.0 12.0 3 114 149 15.5 4 121 160 17.9 5 140 185 11.3 6 142 188 9.31 7 161 212 10.2 8 149 197 10.0 9 123 162 10.3 10 83.0 109 15.5 11 55.8 73.3 13.5 12 55.8 73.2 9.47 Year 1270 1680 58.8 AOI loss: 2.7% Spectral effects: - Temperature and low irradiance loss: 8.0% Combined losses: 24.1%

Question to you

Is there a method to copy that string without having to open and close a browser each time? I have to repeat that operation (determine the query parameters, retrieve the relative string, extract the values that I need from the string) a total of 7200 times when I run my analysis and I'd like it to be as smooth and as fast as possible.

Note: I don't necessary need to save the string text on a document but it would be OK to do so if needed, and then open the file and retrieve my string. But that sounds so inefficient that I'm sure there must be a better way of doing it!

Teamothy
  • 2,000
  • 3
  • 16
  • 26
giagrifi
  • 21
  • 3

3 Answers3

2

I use the function variations below constantly for things like extracting html from a web page, or JSON results from querying an API, and so much more.


Late-Bound Version

This "stand-alone" version requires no references:

Public Function getHTTP(ByVal url As String) As String
'returns HTML from URL (works on *almost* any URL you throw at it)
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False
        .Send
        getHTTP = StrConv(.responseBody, vbUnicode)
    End With
End Function

Early-Bound Version

If you're going to be hitting several sites, it's more efficient (up to twice as fast and much easier on system resources) to use this version instead.
You'll need to add a reference to MS XML library (ToolsReferencesMicrosoft XML, v6.0).

Public Function getHTTP(ByVal url As String) As String  
'Returns HTML from a URL, early bound (requires reference to MS XML6)
    Dim msXML As New XMLHTTP60
    With msXML
        .Open "GET", url, False
        .Send
        getHTTP = StrConv(.responseBody, vbUnicode)
    End With
    Set msXML = Nothing
End Function

Returning Only Text

When using the functions above to call a web page, they will return the raw HTML source code. You can strip the HTML tags and be left with only the a e "plain text" version of the pagewith this nifty function from Tim Williams:

Function HtmlToText(sHTML) As String
'requires reference: Tools → References → "Microsoft HTML Object Library"
    Dim oDoc As HTMLDocument
    Set oDoc = New HTMLDocument
    oDoc.body.innerHTML = sHTML
    HtmlToText = oDoc.body.innerText
End Function

Example:

Putting it together, the example below returns the plain text of "this" web page.

Option Explicit
'requires reference: Tools > References > "Microsoft HTML Object Library"

Function HtmlToText(sHTML) As String
    Dim oDoc As HTMLDocument
    Set oDoc = New HTMLDocument
    oDoc.body.innerHTML = sHTML
    HtmlToText = oDoc.body.innerText
End Function

Public Function getHTTP(ByVal url As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False
        .Send
        getHTTP = StrConv(.responseBody, vbUnicode)
    End With
End Function

Sub Demo()
    Const url = "https://stackoverflow.com/questions/54670251"
    Dim html As String, txt As String
    
    html = getHTTP(url)
    txt = HtmlToText(html)
    
    Debug.Print txt & vbLf  'Hit CTRL+G to view output in Immediate Window
    Debug.Print "HTML source = " & Len(html) & " bytes"
    Debug.Print "Plain Text  = " & Len(txt) & " bytes"
End Sub

More Information:

ashleedawg
  • 20,365
  • 9
  • 72
  • 105
1

With this number of requests it would be better to use a class to hold the xmlhttp object rather than using a function (where you create and destroy the object each time). Then run a sub that passes all the urls to that object. Provide the class with a method to return the string.

Class module: clsHTTP

Option Explicit  
Private http As Object

Private Sub Class_Initialize()
    Set http = CreateObject("MSXML2.XMLHTTP")
End Sub

Public Function GetString(ByVal url As String) As String
    Dim sResponse As String
    With http
        .Open "GET", url, False
        .send
        GetString = .responseText
    End With
End Function

Standard module 1:

Option Explicit 
Public Sub GetStrings()
    Dim urls, ws As Worksheet, i As Long, http As clsHTTP
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Set http = New clsHTTP
    'read in from sheet the urls
    urls = Application.Transpose(ws.Range("A1:A2").Value) 'Alter range to get all urls
    Application.ScreenUpdating = False
    For i = LBound(urls) To UBound(urls)
        ws.Cells(i, 2) = http.GetString(urls(i))
    Next
    Application.ScreenUpdating = True
End Sub
QHarr
  • 83,427
  • 12
  • 54
  • 101
  • Hi there, thanks for your answer! In my case I do care create the object every time, I create it and the loop, asking multiple queries and always obtaining a short text back. As such, the first solution proposed by Ryan works fine (apart from the proxy access, which I haven’t figured out yet...). I would like to stress that is necessary to have a synchronous query, otherwise the – giagrifi Feb 17 '19 at 10:09
  • ... otherwise you get an error when trying to pull the answer since it’s not ready yet. That is ahieved with the “False” in the .Open method, missing in the answer from Ryan and causing me a couple of headaches. – giagrifi Feb 17 '19 at 10:17
  • I included False in my answer for that reason. You do not need to create each time in a loop, it is inefficient. That is the problem with the other answer as well. You only need to create it once and hold a reference to that object whilst you loop your request urls. Using a function in that manner is not good coding practice for large numbers of requests in particular. – QHarr Feb 17 '19 at 10:24
  • I create the object once only, outside the loop. And then loop... to give an idea of how long does it take, i looped 16471 times on a hotel (slow) wifi and only added a string manipulation and save in a worksheet cell, it took about 38 minutes - more than two thirds of it I estimate is due to the website response time, not much I can do about that, other than have them to agree releasing their code so ai can include it in mine – giagrifi Feb 17 '19 at 12:00
  • So you are not using the function as shown in the other answers? – QHarr Feb 17 '19 at 12:01
  • No, I create the object outside the loop and then use Ryan’s solution in the loop with synchronous access to the webpage. No need If a separate function to open and get the page content. – giagrifi Feb 18 '19 at 11:53
0

Yes there is a method to do this without using Internet Explorer, you can use a web request.

Here is a sample method. Basically, you are emulating the communication that would normally take place between a browser and server.

Option Explicit

Public Function getPageText(url As String)
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url
        .send
        getPageText = .responseText
    End With
End Function

Sub Example()
    Dim url As String: url = "http://re.jrc.ec.europa.eu/pvgis5/PVcalc.php?lat=45&lon=8&peakpower=1&loss=14&optimalangles=1&outputformat=basic"
    Debug.Print getPageText(url)
End Sub
Ryan Wildry
  • 5,612
  • 1
  • 15
  • 35
  • Great, if this helped have a look here. https://stackoverflow.com/help/someone-answers We like to mark questions as closed so we know your question has been answered. – Ryan Wildry Feb 13 '19 at 13:05
  • ACCEPTED EDIT: please can you let me know how to integrate a proxy connection? When I run the code on my company's network I need to go through a proxy server & port. – giagrifi Feb 13 '19 at 13:13
  • @giagrifi, this sounds like a new question. On stackoverflow we generally limit a single question to a post. Feel free to make a new question. – Ryan Wildry Feb 13 '19 at 13:16