I have a problem crawling a web page: neither the C# WebClient class nor the WebBrowser class retrieves all the child elements in the HTML source.
When I inspect the page in Chrome or even Internet Explorer I can expand all the HTML element nodes, but when I try to expand the same elements from code I cannot get all the nodes.
I was using this routine to get the nodes:
// Load the page whose URL is stored under the "url" app setting.
string page = ConfigurationManager.AppSettings["url"];
webBrowser1.Navigate(page);
string directory = Directory.GetCurrentDirectory();
// Copy the browser's document stream to pageSource.txt, line by line.
using (StreamReader myReader = new StreamReader(webBrowser1.DocumentStream))
using (StreamWriter myWriter = new StreamWriter(Path.Combine(directory, "pageSource.txt")))
{
    string line;
    while ((line = myReader.ReadLine()) != null)
    {
        myWriter.WriteLine(line);
    }
}
The file pageSource.txt doesn't have all the lines that I see in the original HTML source.
For example, this is the pageSource.txt content:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<noscript>
<meta HTTP-EQUIV="REFRESH" CONTENT="0;URL=index.jsp?noscript=1">
</noscript>
<title>Page</title>
</head>
<frameset id="indexFramst" onload="onloadHandler()" rows="135,24,*" frameborder="0" framespacing="0" border=0 spacing=0>
<frame name="Banner" title="Banner" src='banner.html' tabIndex="3" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" noresize=0>
<frame name="Search" title="Toolbar" src='archive=100' marginwidth="0" marginheight="0" scrolling="no" frameborder="0" noresize=0>
<frame name="Bingo" title="BINGO" src='bingo.Html' marginwidth="0" marginheight="0" scrolling="no" frameborder="0" >
</frameset>
</html>
I expected each <frame> tag to have a closing tag and child items, but the document in webBrowser1 doesn't contain these children.
In the original page, each <frame> tag contains <html> tags with further nested HTML documents.
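One thing that might be worth checking: if the nested documents are actually loaded separately from each frame's src attribute (an assumption on my part, I have not confirmed it for this page), then the URLs to fetch could be listed roughly as in the sketch below. The class name FrameSources, the regular expression, and the base URL http://example.com/app/index.jsp are all hypothetical and only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FrameSources
{
    // Hypothetical helper: returns the absolute URL of every <frame src=...> in the markup.
    // Assumes src values are quoted with ' or "; a real HTML parser would be more robust.
    public static List<Uri> Extract(string html, Uri baseUri)
    {
        var result = new List<Uri>();
        // "\b" after "frame" keeps <frameset> from matching.
        var pattern = new Regex(@"<frame\b[^>]*?\bsrc\s*=\s*(['""])(.*?)\1",
                                RegexOptions.IgnoreCase);
        foreach (Match m in pattern.Matches(html))
        {
            // Resolve the (possibly relative) src against the page URL.
            result.Add(new Uri(baseUri, m.Groups[2].Value));
        }
        return result;
    }

    public static void Main()
    {
        // Shortened copy of the frameset from pageSource.txt.
        string html = "<frameset id=\"indexFramst\">"
                    + "<frame name=\"Banner\" src='banner.html'>"
                    + "<frame name=\"Bingo\" src='bingo.Html'>"
                    + "</frameset>";
        // Assumed page address; replace with the real "url" app setting.
        Uri pageUri = new Uri("http://example.com/app/index.jsp");
        foreach (Uri frameUri in Extract(html, pageUri))
        {
            Console.WriteLine(frameUri);
        }
    }
}
```

Each URL printed this way could then be downloaded on its own (for example with WebClient) to see whether it holds the nested HTML that the browser tools show inside the frame.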
If anybody knows why I can't retrieve these nodes, I would be very thankful for the tip.