
I am using Python with Selenium to scrape data because I find lxml poorly documented for parsing HTML and extracting data with xpath, and nothing I have tried with that library works.

I am having some success with Selenium like so (but not always, hence this question):

element = self.driver.find_element_by_xpath(xpath)
print(element.text)

Problem:

If I have an HTML segment like this in an HTML document:

<strong>Address:</strong>
24 some street, CA
<strong>Company:</strong>
ACME Inc.

and I use Firefox, or a Chrome plugin, to get the xpath to '24 some street, CA', I cannot obtain it (neither gives me an xpath to the data).

I can only obtain the xpath of 'Address:' but I don't need that, I need the data after the closing </strong> tag.

The xpath to the text 'Address:' might be something like:

/html/body/div[2]/div[4]/div[1]/span/strong[2]

What then is the xpath to the text after that closing </strong> tag that will give me everything up until the next starting <strong> tag?


Update:

I'm sure the following is the correct xpath to the text after the <strong></strong> tags, but Selenium does not like it.

When I pass the following xpath to Selenium, it fails:

xpath_wo_num = '/html/body/div[2]/div[4]/div[1]/span/strong[1]/following-sibling::text()[1]'
element = self.driver.find_element_by_xpath(xpath_wo_num)

It appears that the developers of Selenium put in specific code that rejects this xpath because it returns a text node rather than an element.


I get this error message:

Message: invalid selector:
The result of the xpath expression "/html/body/div[2]/div[4]/div[1]/span/strong[1]/following-sibling::text()[1]" is: [object Text].
It should be an element.
(Session info: headless chrome=80.0.3987.132)
user10664542
  • If you can add more HTML code it would help. Try to post bigger portion of html where you have this content. – Sariq Shaikh Mar 19 '20 at 19:01
  • The error shown by Selenium is very clear: you cannot select text using xpath; the result should be an element, not text. Your xpath is a full xpath to the text, but the question shows only a portion of the HTML. You can build an xpath to the parent element where the text resides and then extract the text relative to it, like this: https://stackoverflow.com/questions/54568588/how-to-get-the-text-under-the-tag. It is also better to use a relative xpath rather than a full xpath to an element. – Sariq Shaikh Mar 20 '20 at 07:01

2 Answers


Try something like this:

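import lxml.html

# Sample markup: each value is a bare text node that follows a <strong> label.
acme = """
<span>
  <strong>Address:</strong>
24 some street, CA
<strong>Company:</strong>
ACME Inc.
</span>
"""

doc = lxml.html.fromstring(acme)
# text()[1] selects the first text node after the first <strong>.
street = doc.xpath('//span/strong[1]/following-sibling::text()[1]')
print(street[0].strip())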

Output:

24 some street, CA
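
If every label/value pair is needed rather than just the address, the same idea can be sketched without writing an xpath per field. The example below is only an illustration under assumptions: it uses the standard library's `xml.etree.ElementTree` instead of lxml, so it only works because this particular snippet happens to be well-formed XML, and the helper name `labelled_values` is made up for this sketch. It relies on the `.tail` attribute, which holds the text between an element's closing tag and the next tag; lxml elements expose the same `.text` and `.tail` attributes.

```python
import xml.etree.ElementTree as ET

# Same sample markup as above; it is well-formed, so the stdlib parser accepts it.
SNIPPET = """
<span>
  <strong>Address:</strong>
24 some street, CA
<strong>Company:</strong>
ACME Inc.
</span>
"""


def labelled_values(markup):
    """Map each <strong> label to the text that trails its closing tag.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(markup.strip())
    pairs = {}
    for strong in root.iter("strong"):
        # .text is the label inside <strong>...</strong>
        label = (strong.text or "").strip().rstrip(":")
        # .tail is the text between </strong> and the next tag
        pairs[label] = (strong.tail or "").strip()
    return pairs


fields = labelled_values(SNIPPET)
print(fields["Address"])
print(fields["Company"])
```

Real-world pages are rarely well-formed XML, so with `self.driver.page_source` the lxml parser from the answer is the safer choice; the loop over `.text` and `.tail` should carry over unchanged.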

Jack Fleeting
  • This worked using `lxml.html`. The xpath provided is the same as in the other answer, but Selenium has specific code, put in by the developers of that project, to reject an xpath that returns TEXT. What I don't understand, though, is that I thought (or rather, it is documented) that the byte representation of the HTML needs to be passed to `lxml.html.fromstring(html_bytes)`, not the string representation. Yet I am passing the string representation to `fromstring(html_text)`, obtained from Selenium via `self.webdriver.page_source`, which is text, and it works. I don't understand why it works. – user10664542 Mar 20 '20 at 00:56

You need to use the following-sibling axis, something like this:

find_element_by_xpath('//strong/following-sibling::text()[1]')


Farhan Ahmed
  • OK, I will give this a go. What tool was used to generate that XPATH? I have been using Firefox (select right click copy XPATH) and a Chrome plugin and neither would provide that xpath, I will post if I get this working, Thank You! – user10664542 Mar 20 '20 at 00:31
  • I am sure this is the right xpath, but it will not work with Selenium's `find_element_by_xpath`. I posted an update at the bottom of the original problem. Thank you for your help though. – user10664542 Mar 20 '20 at 00:42