How to get the HTML from the Elements tab in Chrome's devtools?

Question

I have a website that I'm testing. It works nicely in Chrome and I can inspect the HTML with Chrome's developer tools.

When I load the page with Playwright I get a correct screenshot, but the HTML does not contain the content. I added a 10-second wait just to see if that changes anything, but it doesn't.

from playwright.sync_api import sync_playwright


def save_html_and_screenshot():
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=["--disable-gpu", "--no-zygote", "--single-process"],
        )
        page = browser.new_page()

        page.goto("https://www.3sonsbrewingco.com/menus")
        page.wait_for_load_state(state="networkidle")
        page.wait_for_timeout(10000)  # extra 10 s wait; makes no difference

        image_bytes_waited = page.screenshot(type="png", full_page=True)
        html_waited = page.content()  # serialized DOM of the main frame

        # Context managers make sure both files are flushed and closed.
        with open("site.png", "wb") as f:
            f.write(image_bytes_waited)
        with open("site.html", "w", encoding="utf-8") as f:
            f.write(html_waited)

        browser.close()

Actually, I can see the items in the devtools Elements tab but when I look at the page source in Chrome I do not see them.

So, how can I get the HTML displayed in the "Elements" tab of Chrome's devtools?

I should mention that when viewing the page source in chrome the data is also missing. But I can see it in the Elements tab in the Dev Tools. — chhenning, Mar 25 '23 at 15:53
Feel free to [edit] your post if you have clarifications. Hard to help without a [mcve]. Does it work if you use `headless=False`? — ggorlen, Mar 25 '23 at 16:17
Thanks, that helps. It seems to work headfully as I suggested above. Often, [adding a user agent](https://stackoverflow.com/a/72999870/6243352) can help with bot detection if you want to stay headless. There's an iframe on the page, https://www-3sonsbrewingco-com.filesusr.com/html/9d37bf_49f25d4a5f14cfac36af950837ac78e9.html. You could navigate directly to that potentially. But what are you trying to accomplish here? Is there some particular data you want to scrape? — ggorlen, Mar 26 '23 at 01:08
I'm trying to understand why I cannot see the menu items from the page source even after waiting for 10 secs using a headless browser. I can clearly see the items in the screenshot and using Chrome in the Elements tab of devtools. — chhenning, Mar 26 '23 at 20:47
It's usually a matter of bot detection. An automated browser is treated differently than a normal user browser session, and a headless browser is treated differently than headful, so it's very normal to have missing information when trying to automate, especially headlessly. The iframe might also be at play here. — ggorlen, Mar 26 '23 at 20:55
Again, it'd be really helpful if you could share your final goal to avoid an [xy problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). There's often a much easier way to achieve that than logging the whole page content, like navigating directly to the iframe or intercepting a network response. — ggorlen, Mar 26 '23 at 21:01
I'm interested in extracting the menu items. Usually, I just get the html and convert to markdown for further processing. But since the html doesn't have the menu item the markdown is mostly empty. — chhenning, Mar 26 '23 at 21:21
Thanks. Can you [edit] the post to show the exact output you expect? I've not heard of converting to markdown to scrape HTML before, but that seems rather roundabout. I'd just use normal Playwright selectors to do it. — ggorlen, Mar 26 '23 at 21:22
@ggorlen Where do you see the iframe pointing to `https://www-3sonsbrewingco-com.filesusr.com/html/9d37bf_49f25d4a5f14cfac36af950837ac78e9.html` ? — chhenning, Mar 26 '23 at 21:44
Open the page in a normal browser, pop open the dev tools console, type `document.querySelector("iframe").src` and press Enter. — ggorlen, Mar 26 '23 at 21:50
Thanks that works. Is it possible to do the same in the headless browser? — chhenning, Mar 26 '23 at 22:09
Sure, the same code works in a headless browser inside an `evaluate`, which can run basically any code you can run in DevTools. Or simply navigate to that page with `page.goto(thatURL)`, then scraping whatever data you may want is direct and easy. — ggorlen, Mar 26 '23 at 22:10
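A minimal sketch of what the comment above describes, assuming the menu lives in the first `iframe` on the page. The function names `resolve_iframe_url` and `fetch_iframe_html` are made up for illustration, and the `urljoin` step is only a defensive guard (the `src` property is normally already absolute).

```python
from urllib.parse import urljoin


def resolve_iframe_url(page_url: str, iframe_src: str) -> str:
    """Turn a possibly relative iframe src into an absolute URL."""
    return urljoin(page_url, iframe_src)


def fetch_iframe_html(page_url: str) -> str:
    """Load page_url headlessly, then return the HTML of its first iframe."""
    # Imported here so the helper above stays usable without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(page_url)
        page.wait_for_load_state("networkidle")

        # Same expression as in the DevTools console, run through evaluate.
        src = page.evaluate("() => document.querySelector('iframe')?.src")
        if not src:
            browser.close()
            raise RuntimeError("no iframe with a src found on the page")

        # Navigate straight to the iframe document and serialize it.
        page.goto(resolve_iframe_url(page_url, src))
        html = page.content()
        browser.close()
        return html
```

Calling `fetch_iframe_html("https://www.3sonsbrewingco.com/menus")` should then return the iframe document's HTML, provided the site serves the same content to a headless browser.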
Thanks that worked. Any idea how I could get the output in Elements tab of devtools using the headless browser. That is basically the question that I have. — chhenning, Mar 27 '23 at 13:41
`page.content()`, with the caveat that devtools is a different thing than Playwright. `page.content()` is a snapshot of what Playwright sees as the DOM at the instant of the call. Anyway, as I've mentioned, it's sort of an antipattern to scrape with `page.content()`. Much of the time people are using it, there's a better way to get the data they want. It's generally best to use selectors and extract the data you want from the HTML. But I still haven't heard what data you're trying to get. — ggorlen, Mar 27 '23 at 13:59
Sorry for being unclear. I'm trying to get what I see on the screen, including all menu items. By "what I see" I mean the complete html. Saving a screenshot shows that all data is loaded and the menu is rendered correctly. `page.content()` will not give me the complete html. — chhenning, Mar 27 '23 at 14:30
[According to the docs](https://playwright.dev/docs/api/class-page#page-content): "`page.content()` gets the full HTML contents of the page, including the doctype." The page HTML on this site is a trainwreck. If you want to see the whole thing as-is (probably not desirable), then `page.content()` is the way to go. If you want to extract data from it, show the data you want and I can help you extract it. — ggorlen, Mar 27 '23 at 14:32
Thanks again and sorry for being unclear. Can we do this in playwright? `document.querySelector('body').innerHTML`? Using Chrome I get exactly what I want. Here is a screenshot `https://snipboard.io/LgAp2w.jpg` — chhenning, Mar 27 '23 at 15:03
That's essentially what `page.content()` does. It serializes the whole document (roughly `document.documentElement.outerHTML` plus the doctype), which includes everything `page.evaluate("document.body.innerHTML")` would return. Going back to the top of the comments, if you're not seeing something you expect to see, it's either due to not waiting long enough for the load, being blocked as a bot (try running headfully or adding a user agent) or possibly not handling the iframe properly. — ggorlen, Mar 27 '23 at 15:14
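One possible way to get closer to what the Elements tab shows: `page.content()` only serializes the main frame, while the Elements tab also displays the documents nested inside each `iframe`. The sketch below walks `page.frames` and collects each frame's HTML separately. The helper names `collect_frame_html` and `dump_all_frames` are hypothetical, and whether this recovers the menu items on this particular site is an assumption, since bot detection may still change what a headless browser receives.

```python
def collect_frame_html(frames):
    """Return (url, html) pairs for every frame, in the order given."""
    return [(frame.url, frame.content()) for frame in frames]


def dump_all_frames(page_url: str):
    """Load page_url headlessly and return the HTML of every frame."""
    # Imported here so collect_frame_html works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(page_url)
        page.wait_for_load_state("networkidle")

        # page.frames lists the main frame and all child frames (iframes).
        result = collect_frame_html(page.frames)
        browser.close()
        return result
```

Each pair from `dump_all_frames("https://www.3sonsbrewingco.com/menus")` can then be written to its own file or searched for the menu items.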

0 Answers