
I'm trying to parse a section of the Coinbase blog at https://blog.coinbase.com/, the one below the first 9 posts, which starts with <div class="streamItem streamItem--section js-streamItem" data-action-scope="_actionscope_6">, to get the latest news. (I'm not sure how else to do this on Medium, where the Coinbase blog is hosted, because posts don't appear in date order on either the main page or in search.) For some reason I can't get this section. I first tried requests, which works, but only up to this section. I then tried Playwright with the following code:

#!/usr/bin/env python
# coding: utf-8
import asyncio
from playwright.async_api import async_playwright

async def parser():
    page_path = "https://blog.coinbase.com/"
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        # Open the page with a desktop Chrome user agent
        page = await browser.new_page(user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36')
        await page.goto(page_path)
        # Grab the full HTML of the page as currently rendered
        page_content = await page.content()
        await browser.close()
        print(page_content)

asyncio.run(parser())

The result is the same: it works only up to this section.

I also tried ScrapingAnt, as described at https://scrapingant.com/blog/scrape-dynamic-website-with-python, and it worked, but I need to solve this another way, with requests or Playwright, preferably requests.

Peter Trcka

1 Answer


I was able to get the titles of the news articles with the following code:

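from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto('https://blog.coinbase.com/', wait_until='domcontentloaded')
    # Select every element that carries a data-post-id attribute
    elements = page.query_selector_all('*[data-post-id]')
    titles = []
    for element in elements:
        # The title text sits in a div nested inside the element's h3
        title_element = element.query_selector('h3 div')
        if title_element is None:
            continue  # skip elements without a title
        title = title_element.text_content()
        # The same post can match more than once, so keep unique titles only
        if title not in titles:
            titles.append(title)
    print(titles)
    browser.close()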

It might not be exactly what you're looking for but hopefully it gets you in the right direction!
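If you would rather work on the HTML string (for example the output of page.content()) than on live elements, the same idea can be sketched with only the standard library. This is a rough sketch, not a tested scraper for the live site: it assumes the data-post-id attribute and the h3 title nesting seen in the selectors above, it collects all text inside the h3 rather than only the inner div, and the names extract_titles and TitleParser are hypothetical helpers made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TitleParser(HTMLParser):
    """Collects unique h3 text found inside elements with a data-post-id."""

    def __init__(self):
        super().__init__()
        self.stack = []       # open elements as (tag, is_post, is_title)
        self.post_depth = 0   # how many open ancestors carry data-post-id
        self.title_depth = 0  # how many open title h3 elements we are inside
        self.buffer = []      # text pieces of the title being read
        self.titles = []      # unique titles in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        is_post = any(name == "data-post-id" for name, _ in attrs)
        if is_post:
            self.post_depth += 1
        is_title = tag == "h3" and self.post_depth > 0
        if is_title:
            self.title_depth += 1
        self.stack.append((tag, is_post, is_title))

    def handle_endtag(self, tag):
        # Ignore void tags and stray closing tags with no matching opener.
        if tag in VOID_TAGS or all(t != tag for t, _, _ in self.stack):
            return
        # Pop until the matching opener, closing any unclosed children too.
        while self.stack:
            open_tag, is_post, is_title = self.stack.pop()
            if is_post:
                self.post_depth -= 1
            if is_title:
                self.title_depth -= 1
                if self.title_depth == 0:
                    self._flush_title()
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.title_depth > 0:
            self.buffer.append(data)

    def _flush_title(self):
        # Collapse runs of whitespace and skip empty or repeated titles.
        title = " ".join("".join(self.buffer).split())
        self.buffer = []
        if title and title not in self.titles:
            self.titles.append(title)


def extract_titles(html):
    """Return unique post titles from an HTML string, in document order."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles
```

With Playwright you could call extract_titles(page.content()) once the posts have rendered. With requests it would only find posts that are present in the initial HTML response, which, going by the question, stops before the section you are after.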

Kayce Basques