I am trying to scrape a website (www.Vinted.co.uk) that uses JavaScript to load its data. The data I need is the part loaded by JavaScript, so I have to wait for the page to finish loading before I scrape it.

I am currently using Puppeteer and have it working, but only with a visible browser window launching each time. It does not work in headless mode: there, it doesn't wait until the web page has loaded, even though I'm passing WaitUntilNavigation.DOMContentLoaded as the wait condition, so the data isn't in the HTML yet when I call GetContentAsync.

Here is my code (C#):

    public static async Task<string> GetLoadedHTML(string url)
    {
        try
        {
            // Download the browser build PuppeteerSharp expects, if not already present.
            await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);
            Browser browser = await Puppeteer.LaunchAsync(new LaunchOptions
            {
                Headless = false
            });
            var page = await browser.NewPageAsync();
            page.DefaultTimeout = 0;
            var navigation = new NavigationOptions
            {
                Timeout = 0,
                WaitUntil = new[] { WaitUntilNavigation.DOMContentLoaded }
            };
            await page.GoToAsync(url, navigation);
            string content = await page.GetContentAsync();
            await browser.CloseAsync();
            page.Dispose();

            return content;
        }
        catch (Exception ex)
        {
            log.Error(ex);
            throw; // rethrow without resetting the stack trace
        }
    }

I'm open to a different route than Puppeteer if anyone has recommendations. If possible, I'd prefer not to open a visible browser each time, because I'm hoping to run this as a service, where that would be a problem. Getting headless mode to work should solve my issue, since no browser window would be launched.

Any help appreciated.

  • Looks like JavaScript is being run. Normally I loop until a specific tag is found. You will get null until the page fully loads. It is the only way I found to tell when the JavaScript completes. – jdweng Aug 22 '22 at 12:13
  • @jdweng Good shout, thanks for the reply. Do you have an example of your code where you loop until the tag is found? – Ted Burgess Aug 22 '22 at 12:20
  • The code will be different depending on the HTML parser you are using. Just search for a path using your parser. – jdweng Aug 22 '22 at 12:38
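
The polling idea from the comments could look roughly like this with PuppeteerSharp itself, rather than a separate HTML parser. This is only a sketch: the helper name WaitForElementAsync, the 250 ms poll interval, and the selector are made up for illustration, and it assumes a PuppeteerSharp version that exposes the concrete Page class (newer releases use IPage instead). It relies on QuerySelectorAsync returning null while the element is not yet in the DOM.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using PuppeteerSharp;

public static class PagePolling
{
    // Hypothetical helper: polls the page until an element matching
    // `selector` exists, or until `timeout` has elapsed.
    // Returns true if the element appeared, false on timeout.
    public static async Task<bool> WaitForElementAsync(Page page, string selector, TimeSpan timeout)
    {
        var stopwatch = Stopwatch.StartNew();
        while (stopwatch.Elapsed < timeout)
        {
            // QuerySelectorAsync yields null until a matching element is present.
            var element = await page.QuerySelectorAsync(selector);
            if (element != null)
            {
                return true;
            }

            // Assumed poll interval; tune it for the target site.
            await Task.Delay(250);
        }

        return false;
    }
}
```

It would be called after GoToAsync and before GetContentAsync, for example `await PagePolling.WaitForElementAsync(page, ".someSelector", TimeSpan.FromSeconds(30));`.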

1 Answer

You can wait for a selector that indicates the page is ready, e.g.:

await page.WaitForSelectorAsync(".someSelector");
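
As a rough sketch of how that call might fit into the method from the question, with headless mode switched on: the class name, the readySelector parameter, and the 30-second timeout are assumptions for illustration, and the selector has to be replaced with one that only matches once the JavaScript-loaded data is present. Whether this alone fixes headless mode for this particular site is not something the sketch can guarantee.

```csharp
using System;
using System.Threading.Tasks;
using PuppeteerSharp;

public static class LoadedHtmlFetcher
{
    // readySelector is a hypothetical CSS selector for an element that only
    // exists after the JavaScript-rendered data has been added to the page.
    public static async Task<string> GetLoadedHtmlAsync(string url, string readySelector)
    {
        // Same download call as in the question.
        await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);

        var browser = await Puppeteer.LaunchAsync(new LaunchOptions
        {
            Headless = true // no visible browser window
        });

        try
        {
            var page = await browser.NewPageAsync();

            // Navigation completes once the initial DOM is parsed...
            await page.GoToAsync(url, new NavigationOptions
            {
                WaitUntil = new[] { WaitUntilNavigation.DOMContentLoaded }
            });

            // ...then wait for the element that signals the data has loaded.
            // Assumed timeout of 30 seconds; an exception is thrown if it elapses.
            await page.WaitForSelectorAsync(readySelector, new WaitForSelectorOptions
            {
                Timeout = 30000
            });

            return await page.GetContentAsync();
        }
        finally
        {
            // Always close the browser, even if navigation or the wait fails.
            await browser.CloseAsync();
        }
    }
}
```

Usage would be along the lines of `string html = await LoadedHtmlFetcher.GetLoadedHtmlAsync(url, ".someSelector");`.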
hardkoded