0

Like many, I am using a combination of Excel and VBA to scrape public-agency websites for data. The immediate use is searching for corporate tax and penalty data at the SEC's EDGAR webpage. I successfully retrieve a webpage using the MSXML2 library, create a collection of the links on that page, and attempt to loop through the collection as follows:

Option Explicit

Private m_html As MSHTML.HTMLDocument
Private m_wsBuffer As Excel.Worksheet

Private Sub GetData(strSymbol As String)

  On Error GoTo ErrorHandler
  Dim html As MSHTML.HTMLDocument
  Dim elements, documents As IHTMLElementCollection
  Dim e As IHTMLElement
  Dim strUrl As String
  Dim iCol As Integer

  strUrl = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" & strSymbol & _
   "&type=10-k"

  GetHTML strURL

  Set elements = m_html.getElementsByTagName("a")

  For Each e In elements

    Debug.Print "Checking file at " & e.href

    ' Check whether the link is to an archive index page
    ' Not every agency uses this construct; but this is just for test purposes
    If InStr(1, e.href, "index.htm", 1) > 0 And InStr(1, e.href, "archive", 1) > 0 Then

      ' Use the GetHTML subroutine to retrieve the archive link
      GetHTML e.href

      Set documents = m_html.getElementsbyTagName("div")
      iCol = Excel.WorksheetFunction.Match("dateFiling", m_wsBuffer.Rows(1), 0)

      For Each d In documents
        If d.className = "infoHead" And d.innerText = "Filing Date" Then
          m_wsBuffer.Cells(2, iCol) = d.innerText
          Exit For
        End If
      Next d

      [additional, unrelated code that won't be added until I resolve the error]

    End If

  Next e

  Exit Sub

ErrorHandler:

  If Err.Number <> 91 Then
    Debug.Print "GetData: error " & Err.Number & ": " & Err.Description
    Resume Next
  End If

End Sub

Public Sub GetHTML(strUrl As String)

    'Debug.Print "GetHTML()"
    On Error GoTo ErrorHandler
    
    Dim x As MSXML2.XMLHTTP60
    Dim h As MSHTML.HTMLDocument
    Set x = New MSXML2.XMLHTTP60
    Set m_html = New MSHTML.HTMLDocument
    
    With x
        .Open "GET", strUrl, False
        .send
        If .readyState = 4 And .Status = 200 Then
            'Debug.Print "GetHTML: found html"
            m_html.body.innerHTML = .responseText
        Else
            Debug.Print "Error" & vbNewLine & "Ready state: " & .readyState & _
            vbNewLine & "HTTP request status: " & .Status
        End If
    End With
    
    Exit Sub
    
ErrorHandler:
    
    If Err.Number <> 91 Then
        Debug.Print "GetHTML: error " & Err.Number & ": " & Err.Description
        Resume Next
    End If

End Sub

Here's the thing: based on the debugging code (which I've largely omitted from the above for ease of reading) the retrieval of the second webpage in the 'e' loop by use of the GetHTML subroutine seems to go swimmingly; but it appears that, when the code returns to the top of the loop, it does not advance to the next item/link in the collection and when the "If InStr(1, e.href, "index.htm", 1) > 0 And InStr(1, e.href, "archive", 1) > 0 Then" line runs, the code throws the error. I'm thinking that the "permission denied" relates to the use of the 'e' variable because if I comment-out the debug line right before that "If InStr" line, one of the errors disappears in the debugging window.

Any thoughts as to why this might be happening? I feel like I must be making a mind-numbingly simple error; but I've been messing around with this now for a while and can't quite find my mistake or find this problem being answered in other questions.

  • First thing, your `elements` object is actually a `Variant`, not a `IHTMLElementCollection` as you probably intend (see [this answer](https://stackoverflow.com/a/55060194/4717755)). I'm also wary that you're re-using the global variable `m_html` every time you call `GetHTML`. Even though you have captured the `elements` from it before entering the loop, it's at least something I would check out during debug. – PeterT Aug 31 '21 at 14:44
  • Thanks, Peter. I'll give those both a close look. I will say that, if I remove the call GetHTML line from inside the 'e' loop, I don't get the error. So, perhaps there is something to that. – ValleyRunner Aug 31 '21 at 15:13

0 Answers0