How do I extract text of a single HTML element by tag name using MSXML in VBA?

Question

I'm trying to extract US Patent titles using MSXML6.

On the full-text html view of a patent document on the USPTO website, the patent title appears as the first and only "font" element that is a child of "body".

Here is my function that is not working (I get no error; the cell with the formula just stays blank).

Can somebody help me figure out what is wrong?

An example URL that I am feeding into the function is http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=6293874.PN.&OS=PN/6293874&RS=PN/6293874

Function getUSPatentTitle(url As String)
    Static colTitle As New Collection
    Dim title As String
    Dim pageSource As String

    Dim xDoc As MSXML2.DOMDocument
    Dim xNode As IXMLDOMNode

    On Error Resume Next

    title = colTitle(url)
    If Err.Number <> 0 Then
        Set html_doc = CreateObject("htmlfile")
        Set xml_obj = CreateObject("MSXML6.XMLHTTP60")

        xml_obj.Open "GET", url, False
        xml_obj.send
        pageSource = xml_obj.responseText
        Set xml_obj = Nothing

        Set xDoc = New MSXML2.DOMDocument
        If Not xDoc.LoadXML(pageSource) Then  
            Err.Raise xDoc.parseError.ErrorCode, , xDoc.parseError.reason
        End If

        Set xNode = xDoc.getElementsByTagName("font").Item(1)

        title = xNode.Text
        If Not title = "" Then colTitle.Add Item:=title, Key:=url
    End If

    On Error GoTo 0 ' I understand "GoTo" is dangerous coding but copied from somebody and so far haven't thought of a more natural substitute for a GoTo statement

    getUSPatentTitle = title
End Function

codersl · Accepted Answer · 2015-10-27T09:31:46.893

Just a few points:

"On Error Goto 0" is not really a traditional Goto statement - it's just how you turn off user error handling in VBA. There were a few errors in your code but the "On Error Resume Next" skipped them so you saw nothing.
The data from the web page is in HTML format not XML.
There were a few "font" elements before the one with the title.

This should work:

Function getUSPatentTitle(url As String)
    Static colTitle As New Collection
    Dim title As String
    Dim pageSource As String
    Dim errorNumber As Integer

    On Error Resume Next
    title = colTitle(url)
    errorNumber = Err.Number
    On Error GoTo 0

    If errorNumber <> 0 Then
        Dim xml_obj As XMLHTTP60
        Set xml_obj = CreateObject("MSXML2.XMLHTTP")
        xml_obj.Open "GET", url, False
        xml_obj.send
        pageSource = xml_obj.responseText
        Set xml_obj = Nothing

        Dim html_doc As HTMLDocument
        Set html_doc = CreateObject("HTMLFile")
        html_doc.body.innerHTML = pageSource

        Dim fontElement As IHTMLElement
        Set fontElement = html_doc.getElementsByTagName("font").Item(3)

        title = fontElement.innerText
        If Not title = "" Then colTitle.Add Item:=title, Key:=url
    End If

    getUSPatentTitle = title
End Function

Thanks codersl - I had to add a reference: Tools > References > Microsoft HTML Object Library, and it works. I knew there were earler "font" elements but had been trying to find the first one directly under "body" and forgot to change the index. Also I see it is apparently zero-based. Is there not a "select" method in VBA analogous to the Jsoup method in Java where I could say something like `Element element = Document.select("html > body > font").get(0) ` ? In this case that would work better because sometimes there may be one more "font" element above the title, but inside a table. — PatentWookiee, Oct 28 '15 at 16:18
Unfortunately I don't know of an equivalent "select" method in VBA. — codersl, Oct 29 '15 at 01:13

score 1 · Answer 2 · answered Jun 30 '18 at 15:47

CSS selector:

You can re-write what you described, which in fact is first font tag within a body tag as a CSS selector of:

body > font

CSS query:

VBA:

As it is the first match/only you want you can use the querySelector method of document to apply the selector and retrieve a single element.

Debug.Print html_doc.querySelector("body > font").innerText

You may need to add a reference to HTML Object Library and use an early bound call of Dim html_doc As HTMLDocument to access the method. The late bound method may expose the querySelector method but if the interface doesn't then use early binding.

How do I extract text of a single HTML element by tag name using MSXML in VBA?

2 Answers2