-1

I am trying to develop a web crawler using Python and Selenium. When I'm trying to parse a page, using the code below, a false element is returned.

from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

capabilities = webdriver.DesiredCapabilities().FIREFOX
capabilities["marionette"] = True
binary = FirefoxBinary('C:/Program Files/Mozilla Firefox/firefox.exe')
driver = webdriver.Firefox(firefox_binary=binary, capabilities=capabilities, executable_path="C:\\Users\\19548\\AppData\\Local\\Programs\\Python\\Python37\\geckodriver.exe")
driver.get("https://www.google.com/search?sxsrf=ACYBGNT9OH8ZZcClzMK-BMwxesqsKeHyTg:1575693566606&q=google+maps+secure+dental&npsic=0&rflfq=1&rlha=0&rllag=41148676,-90063976,60206&tbm=lcl&ved=2ahUKEwjHtb_626LmAhXjzVkKHTpMCLAQtgN6BAgLEAQ&tbs=lrf:!1m4!1u3!2m2!3m1!1e1!1m5!1u15!2m2!15m1!1shas_1wheelchair_1accessible_1entrance!4e2!2m1!1e3!3sIAE,lf:1,lf_ui:4&rldoc=1#rlfi=hd:;si:16368180629414227255,l,Chlnb29nbGUgbWFwcyBzZWN1cmUgZGVudGFsIgOIAQFIxLbOi6yPgIAIWiYKDXNlY3VyZSBkZW50YWwQABABGAAYASINc2VjdXJlIGRlbnRhbA;mv:[[41.6797015,-86.9763612],[39.655607599999996,-90.7386324]]")
element=driver.find_element_by_xpath("""//*[@id="akp_tsuid2"]/div/div/div/div/div/div[1]/div/div[1]/div/div[2]/div/div[2]/div/div/span[2]""")
paragraphs=driver.find_element_by_xpath("""//*[@id="akp_tsuid2"]/div/div/div/div/div/div[1]/div/div[1]/div/div[2]/div/div[2]/div/div/span[2]""")
print (paragraphs.text)
Guy
  • 46,488
  • 10
  • 44
  • 88

2 Answers2

0

To extract the text 3127 N University St, Peoria, IL 61604, United States you have to induce WebDriverWait for the visibility_of_element_located() and you can use either of the following Locator Strategies:

  • Using CSS_SELECTOR and text attribute:

    driver.get('https://www.google.com/search?sxsrf=ACYBGNT9OH8ZZcClzMK-BMwxesqsKeHyTg:1575693566606&q=google+maps+secure+dental&npsic=0&rflfq=1&rlha=0&rllag=41148676,-90063976,60206&tbm=lcl&ved=2ahUKEwjHtb_626LmAhXjzVkKHTpMCLAQtgN6BAgLEAQ&tbs=lrf:!1m4!1u3!2m2!3m1!1e1!1m5!1u15!2m2!15m1!1shas_1wheelchair_1accessible_1entrance!4e2!2m1!1e3!3sIAE,lf:1,lf_ui:4&rldoc=1#rlfi=hd:;si:16368180629414227255,l,Chlnb29nbGUgbWFwcyBzZWN1cmUgZGVudGFsIgOIAQFIxLbOi6yPgIAIWiYKDXNlY3VyZSBkZW50YWwQABABGAAYASINc2VjdXJlIGRlbnRhbA;mv:[[41.6797015,-86.9763612],[39.655607599999996,-90.7386324]]')
    print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.mod[data-attrid='kc:/location/location:address']>div>div>span:nth-child(2)"))).text)
    
  • Using XPATH and get_attribute():

    driver.get('https://www.google.com/search?sxsrf=ACYBGNT9OH8ZZcClzMK-BMwxesqsKeHyTg:1575693566606&q=google+maps+secure+dental&npsic=0&rflfq=1&rlha=0&rllag=41148676,-90063976,60206&tbm=lcl&ved=2ahUKEwjHtb_626LmAhXjzVkKHTpMCLAQtgN6BAgLEAQ&tbs=lrf:!1m4!1u3!2m2!3m1!1e1!1m5!1u15!2m2!15m1!1shas_1wheelchair_1accessible_1entrance!4e2!2m1!1e3!3sIAE,lf:1,lf_ui:4&rldoc=1#rlfi=hd:;si:16368180629414227255,l,Chlnb29nbGUgbWFwcyBzZWN1cmUgZGVudGFsIgOIAQFIxLbOi6yPgIAIWiYKDXNlY3VyZSBkZW50YWwQABABGAAYASINc2VjdXJlIGRlbnRhbA;mv:[[41.6797015,-86.9763612],[39.655607599999996,-90.7386324]]')
    print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='mod' and @data-attrid='kc:/location/location:address']/div/div//following::span[1]"))).get_attribute("innerHTML"))
    
  • Console Output:

    3127 N University St, Peoria, IL 61604, United States
    
  • Note : You have to add the following imports :

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
undetected Selenium
  • 183,867
  • 41
  • 278
  • 352
0

I usually approach locators like this by finding the label and then the desired text, e.g. "Address:" then the actual street address. This makes the locator cleaner and easier to read.

For the address here, you can use the XPath

//a[.='Address']//following::span

Explanation

The relevant HTML looks like

<div class="zloOqf PZPZlf" data-dtype="d3ifr" data-local-attribute="d3adr" data-ved="2ahUKEwiF7tvpzKTmAhVPOq0KHUoBD9wQghwoADAEegQIARAh">
    <span class="w8qArf">
        <a class="fl" href="..." data-ved="2ahUKEwiF7tvpzKTmAhVPOq0KHUoBD9wQ6BMwBHoECAEQIg">Address</a>:
    </span>
    <span class="LrzXr">3127 N University St, Peoria, IL 61604</span>
</div>

So we start by finding the A tag using

//a[.='Address']

then from there we find the first following SPAN tag

//a[.='Address']//following::span

and that's it for the locator. FYI, the less you specify in your locator (within reason), the less likely it will be to break when there's a change to the page.

Now you can pull the .text of that element to get what you want. You will probably need to add a wait, e.g.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
...

driver.get(...)
paragraph = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//a[.='Address']//following::span")))
print(paragraph.text)

Read more about python waits.

JeffC
  • 22,180
  • 5
  • 32
  • 55