
I am trying to extract the content from divs on a web page using Selenium. The web page is dynamically generated and every second or so there is a new div inserted into the HTML on the web page.

So far I have the following code:

from selenium import webdriver

chrome_path = r"C:\scrape\chromedriver.exe"

driver = webdriver.Chrome(chrome_path)

driver.get("https://website.com/")

messages = []
for message in driver.find_elements_by_class_name('div_i_am_targeting'):
    messages.append(message.text)

for x in messages:
    print(x)

This works fine, but it only prints the divs that are on the page at the moment it runs. I want to continuously extract the text from div_i_am_targeting, because new divs appear on the page every second or so.

The closest related question I could find was "Handling dynamic div's in selenium", but it isn't a match for my question and it has no answers.

How can I update the code above so that it continuously prints the contents of my chosen divs (in this example, div_i_am_targeting), including divs that are added to the page after the program has started?

Gary
  • I guess you need to put this in an infinite loop, but does each div have any unique identification? We need to exclude the divs that have already been processed. – Samarth Nov 24 '18 at 12:54
  • @Gary, can you share the webpage you're trying to scrape? I cannot test here without a specific link in order to ensure my solution works. – Luan Naufal Nov 24 '18 at 12:54
  • One solution would be to add a loop with a sleep in the end, so you could ensure you're taking all generated divs: `if message.text not in messages:` `messages.append(message.text)` `sleep(1)` – Luan Naufal Nov 24 '18 at 12:57
  • Thanks both. I cannot share the webpage, but the content I want to extract is within *the_div_i_am_targeting*. There is no unique identifier on these divs; the structure of the content is `<div class="the_div_i_am_targeting">some text</div>`. This pattern is repeated indefinitely on the page, so many identical divs are generated. The code above works fine, but I need a way to keep the program running and continuously capture the new divs as they are created. Thanks for the suggestion about checking `message.text not in messages` in a loop. – Gary Nov 24 '18 at 13:17
  • @Gary I understand your _usecase_ is to extract text from the newly added `<div>`s, but what is the **exit criteria** for your **Test**? – undetected Selenium Nov 24 '18 at 13:42
  • @DebanjanB I just want this to continuously run as the page is continually updated 24/7 ; but if an exit condition is needed, perhaps it could be, if there have been no new divs within 5 minutes. – Gary Nov 24 '18 at 17:51
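
The polling idea from the comments (loop, sleep, skip texts already seen), combined with the exit condition of five minutes without new divs, might look roughly like the sketch below. It is only an illustration: `poll_new_texts`, `fetch_texts`, and `on_new` are hypothetical names, and the Selenium call is passed in as a callable rather than hard-coded. Because the divs have no unique identifier, deduplication is by text, so two messages with identical text would be reported only once.

```python
import time


def poll_new_texts(fetch_texts, on_new, idle_timeout=300.0, interval=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Report each unseen text until idle_timeout seconds pass with nothing new.

    fetch_texts is a hypothetical callable returning the current div texts,
    for example:
        lambda: [e.text for e in
                 driver.find_elements(By.CLASS_NAME, 'div_i_am_targeting')]
    Returns the number of distinct texts seen.
    """
    seen = set()
    last_new = clock()
    while clock() - last_new < idle_timeout:
        for text in fetch_texts():
            if text not in seen:  # dedupe by text: the divs have no unique id
                seen.add(text)
                on_new(text)
                last_new = clock()  # reset the idle timer on every new message
        sleep(interval)
    return len(seen)
```

With a real driver this could be called as `poll_new_texts(fetch, print)`, where `fetch` is the lambda shown in the docstring. Note that `seen` grows without bound, which may matter for a page that updates 24/7.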

1 Answer


You can use the code below to continuously print the content of the required divs:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait

chrome_path = r"C:\scrape\chromedriver.exe"

driver = webdriver.Chrome(chrome_path)
driver.get("https://website.com/")
# Get the divs currently on the page
messages = driver.find_elements(By.CLASS_NAME, 'div_i_am_targeting')
# Print all current messages
for message in messages:
    print(message.text)

while True:
    try:
        # Wait up to a minute for the list of divs to change
        wait(driver, 60).until(lambda driver: driver.find_elements(By.CLASS_NAME, 'div_i_am_targeting') != messages)
        # Print only the messages that were not in the previous list
        for message in [m.text for m in driver.find_elements(By.CLASS_NAME, 'div_i_am_targeting') if m not in messages]:
            print(message)
        # Update the list of known messages
        messages = driver.find_elements(By.CLASS_NAME, 'div_i_am_targeting')
    except TimeoutException:
        # Break the loop if no new messages appeared within a minute
        print('No new messages')
        break
Andersson
  • Andersson, thanks for this great solution. It seems to be partly working for me. I have noticed that after a variable number of additional divs are added (about 10, but not always 10), it will sometimes skip a new div and then continue, and it always fails after about 20 new divs have been added. I've checked the HTML and can't see anything different about the structure of the divs it breaks at. Can you think of any reason why this might be? Thanks – Gary Nov 24 '18 at 17:24
  • To help debug, I added print(count) after the "# Print new message" comment. It consistently stops at a total of 48 to 49 divs, even though new divs are added within a few seconds of the last one printed. It also skips some divs, yet the count shows they are present, because it jumps between printed messages, for example: 35... printed output, 36... printed output, 39... printed output. – Gary Nov 24 '18 at 17:49
  • @Gary, are the old divs still on the page, or are they removed after some number of new divs are added? Also, is it possible that several new messages arrive at the same time, or is the time between messages almost constant? – Andersson Nov 24 '18 at 17:55
  • Great point. Yes, I checked, and after a certain number of divs the older ones are replaced, so that the first div is removed every time a new div is added (the visual display is a list-style box that only shows the last x messages, and each div contains a single message). And yes, messages arrive constantly: some come every second, and some arrive less than a second apart. – Gary Nov 24 '18 at 18:13
  • @Gary, try the updated answer and let me know if there are any new issues – Andersson Nov 24 '18 at 18:23
  • Amazing how you pinpointed the problem, and with the new information had the solution fixed so quickly. Thanks again, this works great. – Gary Nov 24 '18 at 18:33
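
For the sliding-window behavior described in the comments above (the oldest div is dropped each time a new one is appended), one possible way to work out which texts are new between two snapshots is to look for the longest overlap between the end of the previous snapshot and the start of the current one. This is a sketch under that assumption, not the approach used in the answer; `new_tail` is a hypothetical helper that works on plain lists of texts.

```python
def new_tail(previous, current):
    """Return the items of current that were appended after previous was taken.

    Assumes items are only appended at the end and only dropped from the
    front (a sliding window), as described in the comments.
    """
    # Try the longest overlap first: a suffix of previous that
    # matches a prefix of current.
    for overlap in range(min(len(previous), len(current)), 0, -1):
        if previous[-overlap:] == current[:overlap]:
            return list(current[overlap:])
    # No overlap at all: treat everything as new (some items may have been missed).
    return list(current)
```

For example, `new_tail(["a", "b", "c"], ["b", "c", "d"])` gives `["d"]`. When several messages share the same text, the longest overlap is only a best guess and may undercount the new items.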