-2

I'm very much new to the VBA programming. I have a list of keywords that are to be sourced in a certain website and I want their sourced links as the end result. I want this to be happened for all the keywords in my list using a VBA program. I tried this with below code, but I'm not getting the desired result. Please advise where am I going wrong.

Sub GetCanonicalURL()
    On Error Resume Next

    Dim ie As New SHDocVw.InternetExplorer
    Dim mykeyword As String
    Dim result As String
    Dim lastrow As Integer
    Dim mylinks As Object
    Dim mylink As Object

    lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row
    For i = 2 To lastrow
        mykeyword = Sheet1.Cells(i, 1).Value
        ie.Visible = True
        ie.navigate "https://en.wikipedia.org/wiki/Home_page"

        Do While ie.readyState <> READYSTATE_COMPLETE
        Loop

        ie.document.forms("searchform").elements("searchInput").Value = mykeyword
        ie.document.forms("searchform").elements("searchButton").Click
        result = ie.document.body.innerHTML
        Set HTML = CreateObject("htmlfile")
        HTML.body.innerHTML = result
        Set mylinks = HTML.getElementsByTagName("link")

        For Each mylink In mylinks
            If mylink.hasAttribute("canonical") Then
                Sheet1.Cells(i, "B").Value = mylink

        Next mylink

        If i = lastrow Then
            ie.Quit
        End If

    Next i

End Sub

In the above code, I'm stuck in that empty part of the program (yellow color highlighted - see screenshots), confused on which strategy to use to get the canonical URL of each sourced websites. Please see the screenshots of the images attached below.

canonical href link:
[canonical href link1

Excel sheet where result link to be reflected:
Excel sheet where result link to be reflected

Approach 1(Not so efficient) - Error Approach(Efficient) 2 - Error

jaga jagga
  • 37
  • 3
  • Do you mean this https://yoast.com/what-is-a-canonical-url/ ? I'm not sure "canonical URL of a specific website" makes sense though - do you mean of a specific page? If it's not defined in the source then I'm not sure how you'd determine it. If you've already tried then it helps to tell us *what* you already tried and what happened when you tried that. – Tim Williams Jul 06 '22 at 18:40
  • https://stackoverflow.com/questions/62230977/web-url-document-querysellector – Cameron Critchlow Jul 06 '22 at 18:42
  • I don't know what the 'code markers' are supposed to do, but whatever it does is not right for code like this. Add 4 spaces to the front of each line. – david Jul 07 '22 at 05:33
  • @TimWilliams Yes i'am referencing the same canonical URL. Please see the screenshot images that i have attached.. U will see both canonical url that I'm looking for as well as the code i have tried – jaga jagga Jul 07 '22 at 07:24
  • @jagajagga, I'm not a VBA developer but I suspect `If mylink.hasAttribute("canonical")` isn't what you want. Those links _don't_ have an attribute called `canonical`. They have an attribute called _`rel`_ whose _value_ is `canonical`. Are you looking for something like `If LCase(mylink.getAttribute("rel")) = "canonical"`? If this gets reopened I'll add a proper answer. – ChrisGPT was on strike Jul 07 '22 at 14:39
  • @Chris - I'm facing the issue in accessing hasAttribute() method in vba program (as in the option 1 code below). The error that i'm getting is "object doesn't support this property or method" for the line - If mylink.hasAttribute("rel"). Is there a way to fix this issue that you might recommend? – jaga jagga Jul 22 '22 at 18:30

1 Answers1

2

There are a number of issues with your code.

  1. You don't declare all your variables
  2. You rely on auto-instantiation of the internet explorer instance
  3. You keep trying to make ie.visible inside the loop as well as navigating to the same homepage. These activities can be done before the loop
  4. You are looking in the wrong place for the element of interest
  5. There are more efficient ways to retrieve the data you want using css selectors to target by attribute = value and querySelector to match only first node. This is more efficient as rel=canonical should only be in the head part of the html and be present only once.

See 2 approaches below to keep the head content. One approach similar to yours. Another is a more efficient approach.


Option 1: Less efficient as involves looping, greater code complexity, multiple node matching, late binding etc.

Option Explicit

Public Sub GetCanonicalURL1()

    Dim ie As SHDocVw.InternetExplorer, html As Object, lastrow As Long
    
    lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row
    
    Set ie = New SHDocVw.InternetExplorer
    
    ie.Visible = True

    Dim i As Long
    
    For i = 2 To lastrow
        
        Dim mykeyword As String
        
        mykeyword = Sheet1.Cells(i, 1).Value
        
        ie.navigate "https://en.wikipedia.org/w/index.php?search=" & mykeyword
        
        Do While ie.readyState <> READYSTATE_COMPLETE Or ie.Busy: DoEvents: Loop

        Dim head As String, headAndBody As String
        
        head = ie.document.head.innerHTML        'canonical should be in head and only 1.
        headAndBody = ie.document.DocumentElement.innerHTML 'in case you wanted to know how to get both
        Set html = CreateObject("htmlfile")
        html.write head

        Dim mylinks As Object, mylink As Object

        Set mylinks = html.getElementsByTagName("link")
        'With loop. Inefficient.
        For Each mylink In mylinks
            If mylink.hasAttribute("rel") Then
                Dim linkText As String
                linkText = Trim$(mylink.getAttribute("rel"))
                If linkText = "canonical" Then
                    Sheet1.Cells(i, "B").Value = mylink.href 'assumes one canonical link
                    Exit For
                End If
            End If
        Next mylink
        
        If i = lastrow Then
            ie.Quit
        End If

    Next i

End Sub

Option 2: More efficient. Early binding, single node matching with optimized css selectors. Targeting only head html.

Option Explicit

Public Sub GetCanonicalURL()

    Dim ie As SHDocVw.InternetExplorer, lastrow As Long

    lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row
    
    Set ie = New SHDocVw.InternetExplorer
    
    ie.Visible = True

    Dim i As Long, html2 As MSHTML.HTMLDocument
        
    Set html2 = New MSHTML.HTMLDocument
    
    For i = 2 To lastrow
        
        Dim mykeyword As String

        mykeyword = Sheet1.Cells(i, 1).Value
        ie.navigate "https://en.wikipedia.org/w/index.php?search=" & mykeyword

        Do While ie.readyState <> READYSTATE_COMPLETE Or ie.Busy: Loop

        Dim head As String
        
        head = ie.document.head.innerHTML        'canonical should be in head and only 1.
  
        'Alternate without loop. Efficient.
        html2.body.innerHTML = head
        Sheet1.Cells(i, "B").Value = html2.querySelector("[rel=canonical]").href 'seek only 1 match as there should be only 1 and others are ignored in SEO for example.
 
        If i = lastrow Then
            ie.Quit
        End If

    Next i

End Sub
QHarr
  • 83,427
  • 12
  • 54
  • 101
  • I appreciate your thorough response; it provided some extremely useful information for those who are eager to learn programming(like me). Regarding the coding portion, Though I'm not yet seeing the desired outcome in either of the approaches that you have suggested, but the method used to retrieve the data is more logical and instructive. I'll have to thank you once again for it. I'll definitely work on it so that I can use the best strategy in the near future. These are the errors I'm getting in both the approaches, please see the screenshots attached and help me get through these. Thanku! – jaga jagga Jul 10 '22 at 04:57
  • What version of Office/Windows are you running? I tested both versions and they worked. I have edited in some improvements since first version as well. – QHarr Jul 10 '22 at 08:25
  • I'm running it on Windows 10 -1607, Excel 2010 version.. I'm not sure whether it will matter or not.. In the option 1 code - The error that i have is "object doesn't support this property or method" for the line - If mylink.hasAttribute("rel") & In the option 2 code - The error is "object variable or with block variable is not set" in the line "Sheet1.Cells(i, "B").Value = html2.querySelector("[rel=canonical]").href", Was there anything I missed in setting up the tools library's references in order to fix these errors? – jaga jagga Jul 12 '22 at 12:04
  • Not sure why you cannot access hasAttribute except as a versioning problem. For not set that doesn't make sense unless there was no match found which I know there is based on my own testing. You could set into a variable first with `Dim a As Object; Set a = html2.querySelector("[rel=canonical]"); Debug.Print a.href` And see where error occurs then. – QHarr Jul 12 '22 at 15:41
  • The second needs a reference to Microsoft HTML object library adding. – QHarr Jul 12 '22 at 15:47
  • If not for the versioning issue, is there any other approach you might think of to solve the access to hasAttribute() method problem? (I had already inserted the Microsoft HTML object library as a reference.). After trying what you have suggested for the second option code, I'm seeing the same error on this line "Set the following: a = html2.querySelector("[rel=canonical]]); Debug. Publish a.href". Please help me in getting through this as I cannot seem to find a means to move forward. – jaga jagga Jul 22 '22 at 18:24
  • `Debug.Print a.href` – QHarr Jul 22 '22 at 19:29
  • You could try replacing the `if` statement in the first with the following line `If InStr(mylink.outerHTML, "rel=""canonical""") > 0 Then` – QHarr Jul 22 '22 at 19:39
  • I ended up seeing no errors after replacing the if statement, however the resulting canonical link didn't appear in the "Sheet1.Cells (i, "B"). Value" cell... ! – jaga jagga Jul 23 '22 at 06:23
  • https://pastebin.com/nnWgqJqP – QHarr Jul 23 '22 at 16:46
  • Thank you for responding, It worked well till the last step and it also ran without any errors, but still, the end links(Canonical) for which the code was written aren't showing up on the resulting excel sheet, despite trying many ways, I was unable to find the solution. Please suggest a way forward..:-) – jaga jagga Oct 02 '22 at 02:19