
I have a website that I'm testing. It works nicely in Chrome and I can inspect the HTML with Chrome's developer tools.

When I load the page with Playwright, I get a correct screenshot, but the HTML does not contain the content. I added a 10-second wait to see whether that changes anything, but it doesn't.

from playwright.sync_api import sync_playwright

def save_html_and_screenshot():
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=["--disable-gpu", "--no-zygote", "--single-process"],
        )
        page = browser.new_page()

        page.goto("https://www.3sonsbrewingco.com/menus")
        page.wait_for_load_state(state="networkidle")
        page.wait_for_timeout(10000)

        image_bytes_waited = page.screenshot(type="png", full_page=True)
        html_waited = page.content()

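        # Use context managers so both files are flushed and closed.
        with open("site.png", "wb") as f:
            f.write(image_bytes_waited)
        with open("site.html", "w", encoding="utf-8") as f:
            f.write(html_waited)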

        browser.close()

Actually, I can see the items in the devtools Elements tab but when I look at the page source in Chrome I do not see them.

So, how can I get the HTML displayed in the "Elements" tab of Chrome's devtools?

chhenning
  • I should mention that when viewing the page source in chrome the data is also missing. But I can see it in the Elements tab in the Dev Tools. – chhenning Mar 25 '23 at 15:53
  • Feel free to [edit] your post if you have clarifications. Hard to help without a [mcve]. Does it work if you use `headless=False`? – ggorlen Mar 25 '23 at 16:17
  • @ggorlen I have added the correct url. – chhenning Mar 25 '23 at 19:22
  • Thanks, that helps. It seems to work headfully as I suggested above. Often, [adding a user agent](https://stackoverflow.com/a/72999870/6243352) can help with bot detection if you want to stay headless. There's an iframe on the page, https://www-3sonsbrewingco-com.filesusr.com/html/9d37bf_49f25d4a5f14cfac36af950837ac78e9.html. You could navigate directly to that potentially. But what are you trying to accomplish here? Is there some particular data you want to scrape? – ggorlen Mar 26 '23 at 01:08
  • I'm trying to understand why I cannot see the menu items from the page source even after waiting for 10 secs using a headless browser. I can clearly see the items in the screenshot and using Chrome in the Elements tab of devtools. – chhenning Mar 26 '23 at 20:47
  • It's usually a matter of bot detection. An automated browser is treated differently than a normal user browser session, and a headless browser is treated differently than headful, so it's very normal to have missing information when trying to automate, especially headlessly. The iframe might also be at play here. – ggorlen Mar 26 '23 at 20:55
  • Again, it'd be really helpful if you could share your final goal to avoid an [xy problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). There's often a much easier way to achieve that than logging the whole page content, like navigating directly to the iframe or intercepting a network response. – ggorlen Mar 26 '23 at 21:01
  • I'm interested in extracting the menu items. Usually, I just get the HTML and convert it to markdown for further processing. But since the HTML doesn't have the menu items, the markdown is mostly empty. – chhenning Mar 26 '23 at 21:21
  • Thanks. Can you [edit] the post to show the exact output you expect? I've not heard of converting to markdown to scrape HTML before, but that seems rather roundabout. I'd just use normal Playwright selectors to do it. – ggorlen Mar 26 '23 at 21:22
  • @ggorlen Where do you see the iframe pointing to `https://www-3sonsbrewingco-com.filesusr.com/html/9d37bf_49f25d4a5f14cfac36af950837ac78e9.html` ? – chhenning Mar 26 '23 at 21:44
  • Open the page in a normal browser, pop open the dev tools console, type `document.querySelector("iframe").src` and press Enter. – ggorlen Mar 26 '23 at 21:50
  • Thanks that works. Is it possible to do the same in the headless browser? – chhenning Mar 26 '23 at 22:09
  • Sure, the same code works in a headless browser inside an `evaluate`, which can run basically any code you can run in DevTools. Or simply navigate to that page with `page.goto(thatURL)`, then scraping whatever data you may want is direct and easy. – ggorlen Mar 26 '23 at 22:10
  • Thanks, that worked. Any idea how I could get the output of the Elements tab of devtools using the headless browser? That is basically the question that I have. – chhenning Mar 27 '23 at 13:41
  • `page.content()`, with the caveat that devtools is a different thing than Playwright. `page.content()` is a snapshot of what Playwright sees as the DOM at the instant of the call. Anyway, as I've mentioned, it's sort of an antipattern to scrape with `page.content()`. Much of the time people are using it, there's a better way to get the data they want. It's generally best to use selectors and extract the data you want from the HTML. But I still haven't heard what data you're trying to get. – ggorlen Mar 27 '23 at 13:59
  • Sorry for being unclear. I'm trying to get what I see on the screen, including all menu items. By "what I see" I mean the complete html. Saving a screenshot shows that all data is loaded and the menu is rendered correctly. `page.content()` will not give me the complete html. – chhenning Mar 27 '23 at 14:30
  • [According to the docs](https://playwright.dev/docs/api/class-page#page-content): "`page.content()` gets the full HTML contents of the page, including the doctype." The page HTML on this site is a trainwreck. If you want to see the whole thing as-is (probably not desirable), then `page.content()` is the way to go. If you want to extract data from it, show the data you want and I can help you extract it. – ggorlen Mar 27 '23 at 14:32
  • Thanks again and sorry for being unclear. Can we do this in playwright? `document.querySelector('body').innerHTML`? Using Chrome I get exactly what I want. Here is a screenshot `https://snipboard.io/LgAp2w.jpg` – chhenning Mar 27 '23 at 15:03
  • That's close to what `page.content()` does. It's roughly `page.evaluate("document.documentElement.outerHTML")` plus the doctype, so it includes everything your code here returns. Going back to the top of the comments, if you're not seeing something you expect to see, it's either due to not waiting long enough for the load, being blocked as a bot (try running headfully or adding a user agent) or possibly not handling the iframe properly. – ggorlen Mar 27 '23 at 15:14
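
The iframe and user-agent suggestions from the comments can be combined in a rough sketch. `page.content()` serializes only the main document, while the Elements tab also shows the document nested inside each iframe, which may be why the menu items appear there but not in the saved HTML. The sketch below collects `frame.content()` for every frame on the page. The names `merge_frame_html`, `dump_all_frames`, and `USER_AGENT` are made up for illustration, the user agent string is an arbitrary example, and whether this recovers the menu on this particular site is an assumption that has not been verified.

```python
from typing import List, Tuple

# Hypothetical desktop user agent; the comments only suggest "adding a user agent".
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"
)


def merge_frame_html(parts: List[Tuple[str, str]]) -> str:
    """Join (frame_url, html) pairs into one string, labelling each frame."""
    sections = []
    for url, html in parts:
        sections.append(f"<!-- frame: {url} -->\n{html}")
    return "\n\n".join(sections)


def dump_all_frames(url: str) -> str:
    """Return the serialized DOM of the main document and of every iframe."""
    # Imported lazily so merge_frame_html works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(user_agent=USER_AGENT)
        page = context.new_page()
        page.goto(url)
        page.wait_for_load_state("networkidle")

        # page.frames lists the main frame as well as any nested iframes.
        parts = [(frame.url, frame.content()) for frame in page.frames]
        browser.close()

    return merge_frame_html(parts)
```

Calling `dump_all_frames("https://www.3sonsbrewingco.com/menus")` and writing the result to a file should give one HTML section per frame. If only the menu is needed, navigating straight to the iframe URL, as suggested above, is likely simpler.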

0 Answers