Any good tutorial on parsing online HTML pages using msxml/IXMLDOMDocument?

I need to parse HTML pages using XPath expressions.

Some of the HTML pages will most likely not be 100% valid, so I need to configure the parser to be more lenient with such pages.

Any ideas?

rkosegi

1 Answer

You can clean up invalid HTML with Tidy or a Tidy wrapper library. After that, you can load the resulting XHTML with MSXML and query it by specifying the XHTML namespace.
EfTidy is a good, up-to-date open source Tidy wrapper for tidying up HTML.
Here is an example written in VBScript that uses XPath to get the title of this question.

'EfTidy constants
Const XhtmlOut = 1
Const DoctypeLoose = 3 'for transitional

Dim EfTidy, sInvalidHTML, sValidHTML

With CreateObject("MSXML2.XMLHTTP.6.0")
    'Third argument False = synchronous, so responseText is ready after .send
    .open "GET", "http://stackoverflow.com/q/12027205/", False
    .send
    sInvalidHTML = .responseText
End With

Set EfTidy = CreateObject("EfTidy.tidyCom")
With EfTidy.Option 'config
    .Clean = True
    .OutputType = XhtmlOut
    .DoctypeMode = DoctypeLoose
End With
sValidHTML = EfTidy.TidyMemToMem(sInvalidHTML)

With CreateObject("MSXML2.DomDocument.6.0")
    .async = False
    .validateOnParse = False
    .resolveExternals = True
    .setProperty "ProhibitDTD", False
    If .LoadXml(sValidHTML) Then
        .setProperty "SelectionLanguage", "XPath"
        'Tidy's XHTML output is in the XHTML namespace, so bind a prefix for use in XPath
        .setProperty "SelectionNamespaces", "xmlns:xhtml='http://www.w3.org/1999/xhtml'"
        WScript.Echo .SelectSingleNode("//xhtml:div[@id='question-header']/xhtml:h1").Text
    End If
End With
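
If LoadXml returns False, the document's parseError object usually says why. The snippet below is only a minimal sketch of that check: the variable names and the deliberately broken input string are made up for illustration and stand in for sValidHTML. Errors such as "Invalid at the top level of the document" typically point to non-XML text before the root element, so it is worth checking what string is actually being passed in.

```vbscript
'Hypothetical input: plain text before the root element, standing in for sValidHTML
Dim sXml, oDoc
sXml = "not xml <html/>"

Set oDoc = CreateObject("MSXML2.DOMDocument.6.0")
oDoc.async = False
oDoc.validateOnParse = False

If Not oDoc.loadXML(sXml) Then
    'parseError describes the first problem the parser ran into
    With oDoc.parseError
        WScript.Echo "Error code: 0x" & Hex(.errorCode)
        WScript.Echo "Reason:     " & .reason
        WScript.Echo "Line/col:   " & .line & "/" & .linepos
        WScript.Echo "Source:     " & .srcText
    End With
End If
```

The srcText and line/linepos values show where the parser stopped, which helps tell a Tidy configuration problem apart from a problem in how the string reaches MSXML.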

Hope it helps.

Kul-Tigin
  • +1, thanks for the proposal. I will try EfTidy and accept your answer once it works. – rkosegi Aug 20 '12 at 09:13
  • I still get the same error from the LoadXML method: MSG_E_INVALIDATROOTLEVEL. Invalid at the top level of the document. Any idea? – rkosegi Aug 20 '12 at 10:41
  • @rkosegi I guess you're developing in C++, but I'm not familiar with C++. Unfortunately, my MSXML experience is limited to using COM components, not developing them :). It would be better to add a `c++` tag to get more help from C++ specialists. – Kul-Tigin Aug 20 '12 at 10:58
  • OK, I need to accept your answer; the problem was on my side. Thanks! – rkosegi Aug 20 '12 at 12:17