3

I'm trying to access text from elements that have different xpaths but very predictable href schemes across multiple pages in a web database. Here are some examples:

<a href="/mathscinet/search/mscdoc.html?code=65J22,(35R30,47A52,65J20,65R30,90C30)">
65J22 (35R30 47A52 65J20 65R30 90C30) </a>

In this example I would want to extract "65J22 (35R30 47A52 65J20 65R30 90C30)"

<a href="/mathscinet/search/mscdoc.html?code=05C80,(05C15)">
05C80 (05C15) </a>

In this example I would want to extract "05C80 (05C15)". My web scraper cannot locate these elements by a fixed XPath, because the XPaths of my desired elements change between pages, so I am looking for a more roundabout approach.

My main idea is to use the fact that every href contains "/mathscinet/search/mscdoc.html?code=". Selenium can't directly search for hrefs, but I was thinking of doing something similar to this C# implementation:

Driver.Instance.FindElement(By.XPath("//a[contains(@href, 'long')]"))

To port this over to Python, the only analogous approach I could think of is the in operator, but I am not sure how the syntax works when everything is nested inside a find_element_by_xpath call. How would I bring these ideas together to obtain my desired text?

driver.find_element_by_xpath("//a['/mathscinet/search/mscdoc.html?code=' in @href]").text
Aaron Cao

2 Answers

6

If I understand correctly, you want to locate all elements that share the same partial href. You can use this:

elements = driver.find_elements_by_xpath("//a[contains(@href, '/mathscinet/search/mscdoc.html')]")
for element in elements:
    print(element.text)

Or, if you want only the first matching element:

driver.find_element_by_xpath("//a[contains(@href, '/mathscinet/search/mscdoc.html')]").text

The first snippet returns a list of all matching elements; the second returns the text of the first match only.
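On Selenium 4, the find_element_by_* helpers were deprecated and later removed in the 4.x line, so the same lookup goes through find_elements with a locator strategy. Below is a minimal sketch under that assumption. The names msc_link_texts and MSC_HREF_PART are hypothetical, and the strategy is passed as the plain string "xpath", which is the value of By.XPATH, so the helper can be tried against any driver-like object.

```python
# Locator strategy string; selenium.webdriver.common.by.By.XPATH has this same value.
XPATH = "xpath"

# Hypothetical constant: the href fragment shared by every link in the question.
MSC_HREF_PART = "/mathscinet/search/mscdoc.html?code="


def msc_link_texts(driver):
    """Return the visible text of every <a> whose href contains MSC_HREF_PART."""
    query = f"//a[contains(@href, '{MSC_HREF_PART}')]"
    links = driver.find_elements(XPATH, query)  # empty list when nothing matches
    return [link.text.strip() for link in links]
```

With a live session this would be called as msc_link_texts(driver) once the page has loaded.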

Andrei Suvorkov
1

Given the HTML you have shared, @AndreiSuvorkov's answer should meet your current requirement. You can make the XPath more specific by:

  • Using starts-with instead of contains
  • Including the ?code= part of the @href attribute

Your effective code block will be:

    all_elements = driver.find_elements_by_xpath("//a[starts-with(@href,'/mathscinet/search/mscdoc.html?code=')]")
    for elem in all_elements:
        print(elem.get_attribute("innerHTML"))
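
One caveat with innerHTML: it returns the markup between the tags as written, so for the links in the question the result would include the leading line break and the trailing space. Below is a minimal sketch of collapsing that whitespace, assuming the link bodies are plain text as in the examples. The name normalize_msc_text is a hypothetical helper.

```python
def normalize_msc_text(raw):
    """Collapse runs of whitespace and trim both ends of a raw link body."""
    return " ".join(raw.split())


# Raw bodies as they appear between the <a> tags in the question.
raw_bodies = [
    "\n65J22 (35R30 47A52 65J20 65R30 90C30) ",
    "\n05C80 (05C15) ",
]
cleaned = [normalize_msc_text(body) for body in raw_bodies]
```

In the loop above, the same helper could wrap elem.get_attribute("innerHTML") before printing.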
    
undetected Selenium