
I am using a Python script to extract information from a website with the Selenium library. Using a selector, I got a WebElement object for the target element, which looks something like this:

<myTargetElement><strong>324. </strong>Some interesting content that might contain numbers 323 or dots ...,;</myTargetElement>

I want to extract two pieces of information separately:

The ID inside the strong tag, which I get as follows:

myTargetElementObject.find_element_by_tag_name('strong').text.strip(' .')

Now I am puzzled about how to extract the other part. If I use myTargetElementObject.text, it returns the ID along with the rest of the text.

The data I am extracting is very large, and I am cautious about using regex. Is there a way to have the WebElement object return the text of the element without the text of its sub-elements?

Bishoy
  • Unless it's buffering data to disk, I'm assuming that Selenium already has parsed out your data and it's in an object in RAM. – Wayne Werner Apr 05 '16 at 18:11

1 Answer


I would get the complete text of the target element and split it at the first occurrence of a dot followed by a space:

strong, rest_of_the_content = myTargetElementObject.text.split(". ", 1)
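
To illustrate why later dots in the content are not a problem, here is a small sketch that runs on a plain string instead of a live WebElement. The helper name split_id_and_content and the sample string are made up for the illustration; the only assumption is that the text starts with the ID followed by ". ".

```python
def split_id_and_content(text, delimiter=". "):
    """Split text into (id, content) at the first occurrence of delimiter."""
    # maxsplit=1 means only the first delimiter is used; any later
    # occurrences stay inside the content part.
    element_id, content = text.split(delimiter, 1)
    return element_id, content


# Hypothetical sample standing in for myTargetElementObject.text
sample = "324. First sentence. Second sentence with 3.14 and more dots..."
print(split_id_and_content(sample))
```

If the delimiter does not occur at all, the unpacking raises a ValueError, so real data may need a check for that case first.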

In general, though, the task is not that easy (here you have a clear delimiter): Selenium cannot target text nodes directly, so XPath expressions such as following-sibling::text() will not work with find_element. A common approach is to get the child text and the parent text, then remove the child text from the parent's.
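
A minimal sketch of that subtraction approach, written against plain strings so it runs without a browser. The function name own_text is hypothetical, and the Selenium calls in the comment assume Selenium 4's find_elements(By.XPATH, ...) API; they are not executed here.

```python
def own_text(parent_text, child_texts):
    """Return parent_text with each child's text removed once, in the order given."""
    remaining = parent_text
    for child_text in child_texts:
        # Remove only the first occurrence so later repeats in the content survive.
        remaining = remaining.replace(child_text, "", 1)
    return remaining.strip()


# With Selenium 4 this could be called roughly as follows (assumption, not run here):
#   children = myTargetElementObject.find_elements(By.XPATH, "./*")
#   rest = own_text(myTargetElementObject.text, [c.text for c in children])

# Hypothetical values standing in for the parent's and the <strong> child's .text
sample_parent = "324. Some interesting content that might contain numbers 323 or dots ...,;"
sample_child = "324."
print(own_text(sample_parent, [sample_child]))
```

This relies on the first occurrence of the child text in the parent text being the child itself. That holds for the sample markup, where the strong tag comes first, but it may not hold if the same string appears earlier in the parent's own text.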


Another possible approach is to parse the element's HTML separately with BeautifulSoup, which lets you navigate to sibling nodes, including text nodes:

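from bs4 import BeautifulSoup

# Hand the element's own markup to BeautifulSoup for parsing
html = myTargetElementObject.get_attribute("outerHTML")
soup = BeautifulSoup(html, "html.parser")

label = soup.strong              # first <strong> tag, which holds the ID
text_after = label.next_sibling  # text node immediately after </strong>

print(label.get_text(), text_after)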
alecxe
  • But then you still need to manually do the yucky parsing of the surrounding HTML, no? Or does `.text` just contain `324. Some interesting content`? – DaveBensonPhillips Apr 05 '16 at 18:15
  • @HumphreyTriscuit nope, the `.text` would give you the complete text (with children texts recursively).. – alecxe Apr 05 '16 at 18:17
  • @HumphreyTriscuit yeah, you would get the `324. Some interesting content`. – alecxe Apr 05 '16 at 18:17
  • @alecxe ah, that's great – DaveBensonPhillips Apr 05 '16 at 18:18
  • Thanks alecxe. But as I've been saying, my data is huge and unpredictable; numbers and dots are common in the content of the parent element – Bishoy Apr 05 '16 at 18:21
  • @Bishoy okay, added one more option. Please try. And, if you would have further problems applying the suggestions to your real use case, consider adding an HTML sample of the real data so that we can help you with that. Thanks. – alecxe Apr 05 '16 at 18:23
  • @Bishoy and don't worry about further dots - see how I'm using split - splitting by the *first occurrence* of a dot followed by a space. – alecxe Apr 05 '16 at 18:24
  • Ah OK, I see your point about split. I still like the bs4 solution better, thanks – Bishoy Apr 05 '16 at 18:28