Like many others, I am using a combination of Excel and VBA to scrape public-agency websites for data. My immediate use case is retrieving corporate tax and penalty data from the SEC's EDGAR site. I successfully retrieve a webpage using the MSXML2 library, build a collection of the links on that page, and attempt to loop through the collection as follows:
Option Explicit
Private m_html As MSHTML.HTMLDocument
Private m_wsBuffer As Excel.Worksheet
Private Sub GetData(strSymbol As String)
    On Error GoTo ErrorHandler
    ' Note: "Dim elements, documents As IHTMLElementCollection" would leave
    ' elements as a Variant; each variable needs its own As clause.
    Dim elements As IHTMLElementCollection
    Dim documents As IHTMLElementCollection
    Dim e As IHTMLElement
    Dim d As IHTMLElement
    Dim strUrl As String
    Dim iCol As Integer
    strUrl = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" & strSymbol & _
             "&type=10-k"
    GetHTML strUrl
    Set elements = m_html.getElementsByTagName("a")
    For Each e In elements
        Debug.Print "Checking file at " & e.href
        ' Check whether the link is to an archive index page.
        ' Not every agency uses this construct, but this is just for test purposes.
        If InStr(1, e.href, "index.htm", 1) > 0 And InStr(1, e.href, "archive", 1) > 0 Then
            ' Use the GetHTML subroutine to retrieve the archive link
            GetHTML e.href
            Set documents = m_html.getElementsByTagName("div")
            iCol = Excel.WorksheetFunction.Match("dateFiling", m_wsBuffer.Rows(1), 0)
            For Each d In documents
                If d.className = "infoHead" And d.innerText = "Filing Date" Then
                    m_wsBuffer.Cells(2, iCol) = d.innerText
                    Exit For
                End If
            Next d
            ' [additional, unrelated code that won't be added until I resolve the error]
        End If
    Next e
    Exit Sub
ErrorHandler:
    If Err.Number <> 91 Then
        Debug.Print "GetData: error " & Err.Number & ": " & Err.Description
        Resume Next
    End If
End Sub
Public Sub GetHTML(strUrl As String)
    'Debug.Print "GetHTML()"
    On Error GoTo ErrorHandler
    Dim x As MSXML2.XMLHTTP60
    Set x = New MSXML2.XMLHTTP60
    Set m_html = New MSHTML.HTMLDocument
    With x
        .Open "GET", strUrl, False
        .send
        If .readyState = 4 And .Status = 200 Then
            'Debug.Print "GetHTML: found html"
            m_html.body.innerHTML = .responseText
        Else
            Debug.Print "Error" & vbNewLine & "Ready state: " & .readyState & _
                vbNewLine & "HTTP request status: " & .Status
        End If
    End With
    Exit Sub
ErrorHandler:
    If Err.Number <> 91 Then
        Debug.Print "GetHTML: error " & Err.Number & ": " & Err.Description
        Resume Next
    End If
End Sub
Here's the thing: based on the debugging output (which I've largely omitted above for readability), the retrieval of the second webpage inside the 'e' loop via the GetHTML subroutine seems to go swimmingly. But when the code returns to the top of the loop, it does not appear to advance to the next link in the collection, and when the "If InStr(1, e.href, "index.htm", 1) > 0 And InStr(1, e.href, "archive", 1) > 0 Then" line runs, the code throws the error. I suspect the "permission denied" relates to the use of the 'e' variable, because if I comment out the Debug.Print line right before that "If InStr" line, one of the errors disappears in the Immediate window.
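To make my suspicion concrete, here is a minimal sketch (hypothetical, not my actual code, and the URL is a placeholder) of the pattern I think is biting me, assuming the collection returned by getElementsByTagName holds live references into whichever document produced it rather than an independent snapshot:

```vba
' Hypothetical sketch of the suspected stale-reference pattern.
' Assumption: `links` holds live pointers into the document object
' it was created from, not a detached copy.
Sub StaleCollectionSketch()
    Dim links As IHTMLElementCollection
    Dim lnk As IHTMLElement
    GetHTML "https://example.com/page1"          ' m_html now holds page 1
    Set links = m_html.getElementsByTagName("a") ' collection tied to page 1's DOM
    For Each lnk In links
        GetHTML lnk.href   ' replaces m_html with a NEW HTMLDocument
        ' If the collection is live, `links`/`lnk` may now point into the
        ' old, released document, so dereferencing lnk.href on the next
        ' iteration could plausibly raise "permission denied" or error 91.
    Next lnk
End Sub
```

If that is indeed the mechanism, it would also explain why commenting out the Debug.Print line merely moves the error to the next line that dereferences 'e'.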
Any thoughts as to why this might be happening? I feel like I must be making a mind-numbingly simple error, but I've been at this for a while now and can't quite find my mistake, nor can I find this problem answered in other questions.