1

I have a loading problem with this method. I want to load a webpage to get its HTML, but the page doesn't have time to load completely, so I want to add a Thread.Sleep() to this method. Do you know how I can do that?

            var html = await httpClient.GetStringAsync(url); 
            HtmlAgilityPack.HtmlDocument htmlDocument = new HtmlAgilityPack.HtmlDocument();
            htmlDocument.LoadHtml(html);
  • `url` is a string variable holding a real URL – Andréa Madrid Apr 21 '22 at 09:44
  • _"So I want to add a thread.sleep() to this method."_ - No, **you do not want that**. Please add a little more context. By awaiting the GetStringAsync call, it should get the complete text, already. – Fildor Apr 21 '22 at 09:45
  • The website is a React app that has a loading problem, so I have to include a Thread.Sleep() in the method – Andréa Madrid Apr 21 '22 at 09:47
  • No, do not mix Task/async with Thread.Sleep. Road to disaster. If you really, really need to, consider `await Task.Delay(TimeSpan)` – Fildor Apr 21 '22 at 09:48
  • I get the HTML, but not all of it, and this is a problem because I have to build a crawler that retrieves all the href links in the webpage – Andréa Madrid Apr 21 '22 at 09:49
  • OK, and do you know the contents of httpClient.GetStringAsync(url)? I have to override it to add the Task.Delay() and I can't find it on the web – Andréa Madrid Apr 21 '22 at 10:02
  • Cannot really follow right now. Where do you want to wait and why exactly? Does the website have dynamic content that is lazy loaded? I am afraid in that case, I do not have experience with that, sorry. I upvoted the question to attract users that do have knowledge on that. – Fildor Apr 21 '22 at 10:07
  • Yeah, I understand. I want to wait for all the elements of the page to load. For example, on a login page there is an email field and a password field, but there is also a forgotten-password button that contains an href link to another page. The crawler has to find all the href links, but httpClient.GetStringAsync(url) doesn't load all the elements, so the HTML I get is incomplete. Thank you for upvoting my question. – Andréa Madrid Apr 21 '22 at 10:15
  • This post might help in your case : https://stackoverflow.com/questions/64681732/how-to-make-an-httpclient-getasync-wait-for-a-webpage-that-loads-data-asynchrono – NTINNEV Apr 21 '22 at 10:57
  • Most likely the page is "fully" loaded and the remaining that is missing is javascript that dynamically fills the rest of the page. Use a WebClient or a WebBrowser or Selenium to load the page in its entirety. – Kent Kostelac Apr 21 '22 at 12:51
  • I tried to download the HTML file with WebClient and it doesn't work; I have the same problem. If you want, I will post screenshots in an answer. – Andréa Madrid Apr 21 '22 at 13:14
  • `httpClient.GetStringAsync(url)` is just getting string/html and saving it into a variable, not loading it into a browser which understands how to deal with that html, like loading css, javascript files, creating DOM, etc. – Brij Apr 21 '22 at 13:34
  • Sure, I do Selenium tests, so I load the webpage there. But the intelligent crawler that I have to build is not possible with Selenium – Andréa Madrid Apr 21 '22 at 13:41

2 Answers

0

Here are the screenshots:

The first one is the result of the download with WebClient. The second one is the original HTML of the webpage.

[Screenshot: the result of the download with WebClient]

[Screenshot: the original HTML of the webpage]

0

My boss and I found the solution. Selenium can return all the HTML of a page. Because Selenium drives a real browser, the page's JavaScript runs, and the elements it adds dynamically are present in the HTML that Selenium returns. Here is the code:

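// driver is an existing Selenium IWebDriver instance, created elsewhere.
driver.Navigate().GoToUrl(url);
driver.Manage().Window.Size = new System.Drawing.Size(1936, 1056);
// Implicit wait: how long FindElement keeps polling for an element before it throws.
driver.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(10);
// innerHTML of <body> is the browser's current DOM, including elements added by JavaScript so far.
var result = driver.FindElement(By.TagName("body")).GetAttribute("innerHTML");
await StartCrawlerasync(result);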

// Requires: using System; using System.IO; using System.Text;
// using System.Threading.Tasks; using HtmlAgilityPack;
public static Task StartCrawlerasync(string html)
{
    var csvContent = new StringBuilder();

    // Keep a copy of the rendered HTML for inspection (WriteAllText overwrites the file).
    string htmlPath = @"path\Test.html";
    File.WriteAllText(htmlPath, html);
    string csvPath = @"path\Tous_les_Liens.csv";

    // The HTML is already a complete string here, so no delay is needed before parsing it.
    var htmlDocument = new HtmlAgilityPack.HtmlDocument();
    htmlDocument.LoadHtml(html);

    // Select only anchors that carry an href; SelectNodes returns null when nothing matches.
    var anchors = htmlDocument.DocumentNode.SelectNodes("//a[@href]");
    if (anchors != null)
    {
        foreach (HtmlNode link in anchors)
        {
            string href = link.GetAttributeValue("href", string.Empty);
            csvContent.AppendLine(href);
            Console.WriteLine(href);
        }
    }
    else
    {
        Console.WriteLine("No links found");
    }

    // Written even when empty, so the CSV never keeps links from a previous run.
    File.WriteAllText(csvPath, csvContent.ToString());

    // Nothing in this method is asynchronous; returning a completed Task keeps existing `await` callers working.
    return Task.CompletedTask;
}
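
One caveat about the waiting in this answer: the implicit wait only applies to element lookups such as FindElement, and `<body>` exists as soon as the document loads, so it may not wait for links that the React app renders later. A possible refinement is an explicit wait that polls until the content you need is present. The sketch below is only an illustration under assumptions: the class name `RenderedPage`, the method `GetRenderedHtml`, the condition "at least one `<a>` element exists", the 10-second timeout and the example URL are all hypothetical, and `WebDriverWait` comes from the Selenium.Support NuGet package.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI; // WebDriverWait (Selenium.Support package)

public static class RenderedPage
{
    // Hypothetical helper: loads the page, waits for at least one link, returns the rendered HTML.
    public static string GetRenderedHtml(IWebDriver driver, string url, TimeSpan timeout)
    {
        driver.Navigate().GoToUrl(url);

        // Polls the condition until it returns true; throws WebDriverTimeoutException
        // if no <a> element has appeared when the timeout expires.
        var wait = new WebDriverWait(driver, timeout);
        wait.Until(d => d.FindElements(By.TagName("a")).Count > 0);

        // PageSource is the browser's serialization of the current document.
        return driver.PageSource;
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumption: Chrome and a matching driver are available on this machine.
        using (var driver = new ChromeDriver())
        {
            // Placeholder URL, not taken from the question.
            string html = RenderedPage.GetRenderedHtml(
                driver, "https://example.com/login", TimeSpan.FromSeconds(10));
            Console.WriteLine(html.Length);
        }
    }
}
```

The returned string can be passed to `StartCrawlerasync` in place of the `innerHTML` value. If the page always contains some links in its static markup, a more specific condition (for example an element that only exists after the app has rendered) would be a better thing to wait for.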