I have a problem crawling a web page: the C# WebClient and WebBrowser classes cannot retrieve all the child elements in the HTML source.

When I inspect the page in Chrome, or even in Internet Explorer, I can expand all the HtmlElement nodes, but when I try to walk these elements from code I cannot get all the nodes.

I was using this routine to get the nodes:

string page = ConfigurationManager.AppSettings["url"];
// Navigate is asynchronous; this assumes the document has finished loading
// before DocumentStream is read.
webBrowser1.Navigate(page);
string directory = Directory.GetCurrentDirectory();
// using blocks close both streams even if an exception is thrown.
using (StreamReader myReader = new StreamReader(webBrowser1.DocumentStream))
using (StreamWriter myWriter = new StreamWriter(Path.Combine(directory, "pageSource.txt")))
{
    string line;
    while ((line = myReader.ReadLine()) != null)
    {
        myWriter.WriteLine(line);
    }
}

The file pageSource.txt doesn't contain all the lines of the original HTML source.

For example, this is the pageSource.txt content:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<noscript>
<meta HTTP-EQUIV="REFRESH" CONTENT="0;URL=index.jsp?noscript=1">
</noscript>
<title>Page</title>

</head>

<frameset id="indexFramst" onload="onloadHandler()" rows="135,24,*"  frameborder="0" framespacing="0" border=0 spacing=0>

    <frame name="Banner" title="Banner" src='banner.html'  tabIndex="3" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" noresize=0>
    <frame name="Search" title="Toolbar" src='archive=100' marginwidth="0" marginheight="0" scrolling="no" frameborder="0" noresize=0>
    <frame name="Bingo" title="BINGO" src='bingo.Html' marginwidth="0" marginheight="0" scrolling="no" frameborder="0" >
</frameset>
</html>

I expected each <frame> tag to have a closing tag and child elements, but the document in webBrowser1 doesn't contain those children.

In the original page, each frame tag contains <html> tags with further nested HTML documents.

If somebody knows why I can't retrieve these nodes, I would be very thankful for the tip.

Mr Lister
  • Can you show this "original page"? `frame` elements can't contain child elements. (They are void elements, like `img` and `hr` and so on.) So I'm not sure what the page is supposed to look like. – Mr Lister Dec 22 '15 at 16:48
  • This is a page that serves access to several web DOMs. Each frame shows an internal HTML document in a different place; for example, the last frame serves the HTML document, while the middle frame serves an index for navigating the other DOMs. – Jherom Chacon Dec 24 '15 at 12:53
  • I know what frames do. What I want to see is the original source; from your description it sounds like the original author treated frames as non-void elements. – Mr Lister Dec 24 '15 at 13:43
  • If my answer was sufficient can you mark it as the answer please? – sjdirect May 17 '16 at 05:27

1 Answer

It looks like frameset is not supported in HTML5. Maybe the WebBrowser class defaults to HTML5 even though the page identifies itself as HTML 4. You can try using another client to download and process the text. If you need the JavaScript rendered, try PhantomJS; or, if you are set on C#, you can try AbotX, which uses PhantomJS internally.
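
As a rough illustration of the "another client" route: the sketch below assumes each frame's content lives at the URL given in its src attribute, so the nested documents have to be requested separately from the frameset page. FrameSources.Extract is a hypothetical helper, not part of any library, and the regular expression is a simplification that only handles quoted src values; a real HTML parser would be more robust.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original post): finds the src of every
// <frame> tag and resolves it against the URL of the frameset page.
public static class FrameSources
{
    // Simplified pattern: matches <frame ... src='x'> or src="x" (quoted values only).
    // The \b after "frame" keeps it from matching <frameset>.
    private static readonly Regex FrameSrc = new Regex(
        @"<frame\b[^>]*?\bsrc\s*=\s*(['""])(?<src>.*?)\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<Uri> Extract(string framesetHtml, Uri pageUrl)
    {
        var result = new List<Uri>();
        foreach (Match m in FrameSrc.Matches(framesetHtml))
        {
            // Relative values such as 'banner.html' are resolved against pageUrl.
            result.Add(new Uri(pageUrl, m.Groups["src"].Value));
        }
        return result;
    }
}
```

Each returned Uri could then be fetched on its own, for example with WebClient.DownloadString, to get the HTML of the document shown inside that frame. This does not run any JavaScript, so it only helps if the frame documents are served as plain HTML.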

sjdirect