I want Excel to parse an HTML file for a specific table.

My current method is to load the file into a DOM object and parse that. The problem is that DOMDocument60 throws a parse error ("Invalid Syntax") when I load the response. After some more research I found out that DOMDocument60 only accepts well-formed XML, which most HTML pages are not.

Are there any other options to get the DOM of an HTML file?

Sub myWebTest()
    On Error Resume Next
    Set File = CreateObject("Msxml2.XMLHTTP")

    File.setTimeout 2000, 2000, 2000, 2000
    File.Open "GET", "http://www.microsoft.com/en-au/default.aspx:80", False
    'User-Agent header copied from IE 8
    File.SetRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 1.1.4322; .NET CLR 3.5.30729; .NET CLR 3.0.30618; .NET4.0C; .NET4.0E; BCD2000; BCD2000)"
    File.Send

    On Error GoTo 0

    Set dom = CreateObject("Msxml2.DOMDocument")
    'Dim dom As New DOMDocument60
    dom.LoadXML File.ResponseText
    MsgBox dom.ChildNodes.Length
End Sub
Alter

1 Answer

If this is a one-time thing, you could try Excel's built-in import tool. Click Data | Get External Data (From Web). You can give it the URL of the HTML page.
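
For repeated use, the same import can be scripted. Below is a minimal sketch of roughly what the macro recorder produces for a web query; the URL, the table index "1", and the destination cell are placeholders for illustration, not values from the question.

```vba
Sub ImportWebTableSketch()
    Dim ws As Worksheet
    Dim qt As QueryTable

    Set ws = ActiveSheet

    ' The "URL;" prefix marks the connection as a web query.
    ' The address is a placeholder; substitute the real page.
    Set qt = ws.QueryTables.Add( _
        Connection:="URL;http://example.com/page1.html", _
        Destination:=ws.Range("A1"))

    With qt
        ' Import only specific tables instead of the whole page
        .WebSelectionType = xlSpecifiedTables
        ' Assumed: the wanted table is the first one on the page
        .WebTables = "1"
        .WebFormatting = xlWebFormattingNone
        ' Wait for the import to finish before continuing
        .Refresh BackgroundQuery:=False
    End With
End Sub
```

Looping over a list of URLs and calling this once per page would cover the multi-page case, though each call adds a new query table to the sheet.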

ariscris
  • Good idea, but there are 50 pages of the same format that I want to update weekly – Alter Oct 16 '14 at 18:40
  • You can record a macro using the Get External Data and then use that as the starting point for your script. – ariscris Oct 16 '14 at 18:45
  • It works for getting a table, but I really want the DOM representation of the entire file. I already have a code base that retrieves the HTML file from the web; I'm just trying to parse the response. Using the import tool complicates things more than writing my own parser would. – Alter Oct 16 '14 at 19:00
  • I see, sounds like you have to use MSXML2.XMLHTTP60 then. See http://stackoverflow.com/questions/25488687/parse-html-content-in-vba – ariscris Oct 16 '14 at 19:02
  • Nice. I was going over that post trying to get the doc object to work when I found out that apparently my problem goes away when I switch from DOMDocument60 to just DOMDocument. No idea why – Alter Oct 16 '14 at 19:30
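
One possible way to get an HTML DOM (rather than an XML one) in VBA is to hand the response text to a late-bound MSHTML document. The sketch below assumes a Windows machine where the "htmlfile" COM object is available; the URL is a placeholder and the procedure name is made up for illustration.

```vba
Sub ParseHtmlSketch()
    Dim http As Object
    Dim doc As Object
    Dim tables As Object
    Dim tbl As Object
    Dim r As Long, c As Long

    ' Fetch the page synchronously (placeholder URL)
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    http.Open "GET", "http://example.com/page1.html", False
    http.send

    ' "htmlfile" is MSHTML's HTML document; unlike DOMDocument,
    ' it tolerates markup that is not well-formed XML
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = http.responseText

    Set tables = doc.getElementsByTagName("table")
    If tables.Length = 0 Then
        MsgBox "No tables found"
        Exit Sub
    End If

    ' Assumed: the wanted table is the first one in the document.
    ' Copy it cell by cell to the active sheet, starting at A1.
    Set tbl = tables.Item(0)
    For r = 0 To tbl.Rows.Length - 1
        For c = 0 To tbl.Rows(r).Cells.Length - 1
            ActiveSheet.Cells(r + 1, c + 1).Value = _
                tbl.Rows(r).Cells(c).innerText
        Next c
    Next r
End Sub
```

Because only the body's innerHTML is set, anything in the page's head section is not part of the resulting DOM; that is usually fine when the goal is to pull out a table.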