New York Times news scraping using pure Python and Selenium (via rpaframework)

Question

I'm trying to scrape New York Times search results using pure Python and Selenium (via rpaframework), but I can't get it to work. I need to get the title, date, and description of each result.

When I print the title, I get this error:

selenium.common.exceptions.InvalidArgumentException: Message: unknown variant //h4[@class='css-2fgx4k'], expected one of css selector, link text, partial link text, tag name, xpath at line 1 column 37

Here is my code so far:

from RPA.Browser.Selenium import Selenium

# Search term
search_term = "climate change"

# Open the NY Times search page and search for the term
browser = Selenium()
browser.open_available_browser("https://www.nytimes.com/search?query=" + search_term)

# Find all the search result articles
articles = browser.find_elements("//ol[@data-testid='search-results']/li")


# Extract title, date, and description for each article and add to the list
for article in articles:
    # Extract the title
    title = article.find_element("//h4[@class='css-2fgx4k']")
    print(title)


# Close the browser window
browser.close_all_browsers()

Any assistance would be appreciated.

Jakob Bagterp · Accepted Answer · 2023-08-13T09:20:38.337

In full disclosure, I'm the author of the Browserist package. Browserist is a lightweight, less verbose extension of the Selenium web driver that makes browser automation easier. Simply install the package with pip install browserist and try this:

from browserist import Browser
from selenium.webdriver.common.by import By

search_term = "climate"

with Browser() as browser:
    browser.open.url("https://www.nytimes.com/search?query=" + search_term)
    search_result_elements = browser.get.elements("//ol[@data-testid='search-results']/li")
    for element in search_result_elements:
        try:
            title = element.find_element(By.TAG_NAME, "h4").text
            print(title)
        except Exception:
            # Skip search result items that have no h4 title.
            pass

Notes:

The simpler search term climate will yield more results that are still relevant, e.g. climate crisis, but that's up to you to change.
It's easier and more robust to target the title by the h4 header tag instead of the CSS class value, which might change over time.
As not all search result elements are uniform, I protect against breaking errors with the try and except clause.
Browserist uses Chrome by default, and you can select other browsers, for instance Firefox, with a few changes:

from browserist import Browser, BrowserType, BrowserSettings

...

with Browser(BrowserSettings(type=BrowserType.FIREFOX)) as browser:
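Since the notes above suggest changing the search term, and the question uses a term with a space in it (climate change), it may help to build the search URL with proper encoding rather than plain string concatenation. This is a minimal sketch using only the standard library; the helper name build_search_url is made up for illustration, and the query parameter name is taken from the URLs used in this thread.

```python
from urllib.parse import urlencode

# Base URL as used in the question and the answer above.
SEARCH_URL = "https://www.nytimes.com/search"


def build_search_url(search_term: str) -> str:
    """Return the search URL with the term percent-encoded as the query parameter."""
    return SEARCH_URL + "?" + urlencode({"query": search_term})


print(build_search_url("climate change"))
# prints: https://www.nytimes.com/search?query=climate+change
```

The returned string can be passed to browser.open.url(...) or open_available_browser(...) in place of the concatenated URL.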

I hope you find this useful. Let me know if you have any questions.

Thank you for the suggestion. The package works well; however, for this problem I'm supposed to use Selenium under rpaframework. — pedros, May 09 '23 at 09:57
My pleasure, and you're welcome. In that case, I'm not sure you're using the `find_element` and `find_elements` methods correctly. You need to add the `By` selector as argument as well. For instance, something like this: `browser.find_elements(By.XPATH, "//ol[@data-testid='search-results']/li")`. Learn more here: https://selenium-python.readthedocs.io/locating-elements.html — Jakob Bagterp, May 09 '23 at 11:37
Thanks man, using `title = element.find_element(By.TAG_NAME, "h4").text` works fine. I will recommend your package. — pedros, May 09 '23 at 13:51
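For the rpaframework route discussed in the comments above, the error message in the question is consistent with `article.find_element` being the plain Selenium `WebElement` method, which takes a strategy and a value. Passing only the XPath makes Selenium read the XPath as the strategy name, hence "unknown variant". A minimal sketch under that assumption follows. The helper names `first_text` and `scrape_titles` are made up for illustration, the result-list XPath is copied from the question, and the sketch has not been run against the live site.

```python
def first_text(parent, by, value):
    """Return the stripped text of the first matching child, or None if there is none."""
    # WebElement.find_elements returns an empty list when nothing matches,
    # so no try/except is needed for result items without a title.
    matches = parent.find_elements(by, value)
    return matches[0].text.strip() if matches else None


def scrape_titles(search_term):
    """Open the search page via rpaframework and return the non-empty h4 titles."""
    # Imported here so that first_text can be used without a browser installed.
    from urllib.parse import quote_plus

    from RPA.Browser.Selenium import Selenium
    from selenium.webdriver.common.by import By

    browser = Selenium()
    try:
        browser.open_available_browser(
            "https://www.nytimes.com/search?query=" + quote_plus(search_term)
        )
        # The library-level call takes a single locator string, as in the question.
        articles = browser.find_elements("//ol[@data-testid='search-results']/li")
        # Each article is a plain WebElement, so the strategy goes first.
        titles = [first_text(article, By.TAG_NAME, "h4") for article in articles]
        return [title for title in titles if title]
    finally:
        browser.close_all_browsers()
```

Calling `scrape_titles("climate change")` should then return a list of title strings. The date and description could be read the same way once suitable selectors for them are known.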

Jakob Bagterp · Answer 2 · 2023-05-09T12:36:32.473

I'm not an expert in the RPA framework, but have you considered simplifying your code to something like this? You probably only need to target the h4 headline tags of the search results:

# After you get the search results with this command:
# browser.open_available_browser("https://www.nytimes.com/search?query=" + search_term)

# RPA.Browser.Selenium expects a SeleniumLibrary-style locator string
# (for example "tag:h4") rather than a By constant.
title_elements = browser.find_elements("tag:h4")

for title_element in title_elements:
    print(title_element.text)

Disclaimer: I'm not sure the code above works, as I haven't tested it.

I have tested the Browserist version, though, and it only needs a few lines of code:

from browserist import Browser

search_term = "climate"

with Browser() as browser:
    browser.open.url("https://www.nytimes.com/search?query=" + search_term)
    title_elements = browser.get.elements_by_tag("h4")
    for title_element in title_elements:
        print(title_element.text)

Here are the results I get in the terminal:
