How to wait until web page is loaded before scraping HTML using Puppeteer in headless mode? (C#)

Question

I am trying to scrape a website (www.Vinted.co.uk) that uses JavaScript to load its data. Unfortunately, the data loaded by JavaScript is exactly what I need, so I have to wait for the page to finish loading before I scrape it.

At the moment I am using Puppeteer and have managed to get it working, but only with a visible browser window launching each time. In headless mode it does not wait until the web page has loaded, even though I pass WaitUntilNavigation.DOMContentLoaded in the navigation options, so the data is not yet in the HTML when I call GetContentAsync.

Here is my code (C#):

public static async Task<string> GetLoadedHTML(string url)
{
    try
    {
        await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);
        Browser browser = await Puppeteer.LaunchAsync(new LaunchOptions
        {
            Headless = false
        });
        var page = await browser.NewPageAsync();
        page.DefaultTimeout = 0;
        var navigation = new NavigationOptions
        {
            Timeout = 0,
            WaitUntil = new[] { WaitUntilNavigation.DOMContentLoaded }
        };
        await page.GoToAsync(url, navigation);
        string content = await page.GetContentAsync();
        await browser.CloseAsync();
        page.Dispose();

        return content;
    }
    catch (Exception ex)
    {
        log.Error(ex);
        throw ex;
    }
}

I am open to a different route than Puppeteer if anyone has recommendations. If possible, I would prefer not to open a visible browser each time, because I am hoping to run this as a service, where that would be problematic. Getting it working in headless mode should solve my issue, since no browser window would be launched.

Any help appreciated.

Looks like JavaScript is being run. Normally I loop until a specific tag is found. You will get null until the page fully loads. It is the only way I found to tell when the JavaScript completes. — jdweng, Aug 22 '22 at 12:13
@jdweng Good shout, thanks for the reply. Do you have an example of your code where you loop until the tag is found? — Ted Burgess, Aug 22 '22 at 12:20
The code will be different depending on the HTML parser you are using. Just search for a path using your parser. — jdweng, Aug 22 '22 at 12:38
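The "loop until the tag is found" idea from the comments above could be sketched roughly as follows. This is not jdweng's code; it is a minimal, parser-agnostic helper under the assumption that you can express "the tag exists" as an asynchronous check. The names Polling, PollUntilAsync and probe are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class Polling
{
    // Repeatedly runs 'probe' until it returns true or 'timeout' elapses.
    // Returns true if the probe succeeded, false if the timeout was reached first.
    public static async Task<bool> PollUntilAsync(
        Func<Task<bool>> probe, TimeSpan interval, TimeSpan timeout)
    {
        var clock = Stopwatch.StartNew();
        while (true)
        {
            if (await probe())
            {
                return true;
            }
            if (clock.Elapsed >= timeout)
            {
                return false;
            }
            // Pause between attempts so the loop does not spin the CPU.
            await Task.Delay(interval);
        }
    }
}
```

With PuppeteerSharp, the probe might be something like `async () => await page.QuerySelectorAsync(".foo") != null`, where `.foo` is a placeholder selector; QuerySelectorAsync returns null while no matching element exists.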

score 0 · Answer 1 · answered Aug 22 '22 at 12:35

You can wait for some selector that might tell you that the page is ready. e.g.:

await page.WaitForSelectorAsync(".someSelector");
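As a rough sketch of how that call might fit into the method from the question, running headless: the class name Scraper, the method name GetLoadedHtmlAsync, the selector parameter and the 30-second timeout are all assumptions for illustration, and the selector must be one that only appears once the page's JavaScript has rendered the data you need. The DownloadAsync call mirrors the question's code; newer PuppeteerSharp releases may expect a parameterless DownloadAsync() instead.

```csharp
using System;
using System.Threading.Tasks;
using PuppeteerSharp;

public static class Scraper
{
    // 'selector' is a hypothetical CSS selector for an element that the
    // page's JavaScript adds once the data has been loaded.
    public static async Task<string> GetLoadedHtmlAsync(string url, string selector)
    {
        // Same download call as in the question (older PuppeteerSharp API).
        await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);

        var browser = await Puppeteer.LaunchAsync(new LaunchOptions
        {
            Headless = true // no visible browser window
        });

        try
        {
            var page = await browser.NewPageAsync();

            await page.GoToAsync(url, new NavigationOptions
            {
                WaitUntil = new[] { WaitUntilNavigation.DOMContentLoaded }
            });

            // DOMContentLoaded fires before script-loaded data arrives, so
            // wait explicitly for the element that carries that data.
            await page.WaitForSelectorAsync(selector, new WaitForSelectorOptions
            {
                Timeout = 30000 // milliseconds; an assumed value, not a recommendation
            });

            return await page.GetContentAsync();
        }
        finally
        {
            // Close the browser even if navigation or the wait throws.
            await browser.CloseAsync();
        }
    }
}
```

It could then be called as `string html = await Scraper.GetLoadedHtmlAsync(url, ".foo");` with `.foo` replaced by a real selector from the target page. A finite timeout is used here instead of the question's Timeout = 0, so a service does not hang forever if the element never appears.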

answered Aug 22 '22 at 12:35

hardkoded

Thanks mate. In terms of what can be used as the selector, can this be a div class, for example? Would I just pass in the class name? – Ted Burgess Aug 22 '22 at 12:43
Yes, it works like a CSS selector. If the class is `foo`, the selector should be `.foo`. – hardkoded Aug 22 '22 at 13:58
