How to use Selenium to scrape multiple URLs' contents? Python

Question

driver = webdriver.Chrome(r'XXXX\chromedriver.exe')
FB_bloomberg_URL="https://www.bloomberg.com/quote/FB:US"
driver.get(FB_bloomberg_URL)

eList = driver.find_elements_by_class_name('link__f5415c25')
hrefList = []
for e in eList:
    hrefList.append(e.get_attribute('href'))

for href in hrefList:
    print(href)

I use the code above to extract the href links with Selenium in Python. I want to extract the "Board Memberships" section from each person's profile. I know how to extract it for one profile at a time, but I don't know how to write a loop that does it for every link.

Here is my code:

driver2 = webdriver.Chrome(r'XXXX\chromedriver.exe')
driver2.get("https://www.bloomberg.com/profiles/people/15103277-mark-elliot-zuckerberg")

boardmembership_table=driver2.find_elements_by_xpath('//*[@id="root"]/div/section/div[5]')[0]
boardmembership_table.text

Any thoughts are appreciated!

score 0 · Answer 1 · answered Jul 08 '19 at 04:02

The following approach should work: reuse the same driver and load each profile URL inside the loop.

driver = webdriver.Chrome(r'XXXX\chromedriver.exe')
FB_bloomberg_URL="https://www.bloomberg.com/quote/FB:US"
driver.get(FB_bloomberg_URL)

eList = driver.find_elements_by_class_name('link__f5415c25')
hrefList = []
for e in eList:
    hrefList.append(e.get_attribute('href'))

for href in hrefList:
    print(href)
    # visit each board member's profile with the same driver  # <== changed below
    driver.get(href)
    # optionally add a WebDriverWait here so the script waits until
    # the element below is present before reading it
    boardmembership_table = driver.find_elements_by_xpath('//*[@id="root"]/div/section/div[5]')[0]
    # print the text; a bare expression only shows output in an interactive session
    print(boardmembership_table.text)

wp78de · Accepted Answer · 2019-07-08T04:18:51.730

You basically just attach the second piece to the first one under the for-loop:

import re
from selenium import webdriver
driver = webdriver.Firefox()

FB_bloomberg_URL="https://www.bloomberg.com/quote/FB:US"
driver.get(FB_bloomberg_URL)

eList = driver.find_elements_by_class_name('link__f5415c25')
hrefList = []
for e in eList:
    hrefList.append(e.get_attribute('href'))

for href in hrefList:
    # print(href)
    driver.get(href)
    boardmembership_table = driver.find_elements_by_xpath('//*[@id="root"]/div/section/div[5]')[0]
    print(boardmembership_table.text)

Bonus: here is how to extract each person's name from the URL with a regex (this requires import re) and store the board membership text in a dictionary keyed by that name.

result_dict = {}
regex = r"/people/\d+-(.*)$"
for href in hrefList:
    driver.get(href)
    boardmembership_table = driver.find_elements_by_xpath('//*[@id="root"]/div/section/div[5]')[0]
    # the captured group is the name slug, e.g. "mark-elliot-zuckerberg"
    match = re.search(regex, href)
    if match:
        result_dict[match.group(1)] = boardmembership_table.text
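
The name extraction can be tried on its own, without a browser. This is a minimal sketch, assuming the profile URLs all follow the `/people/<id>-<name-slug>` pattern seen in the question. The helper names `slug_from_profile_url` and `display_name` are made up for this example.

```python
import re

# Same pattern as above: digits, a hyphen, then the name slug up to the end of the URL.
PROFILE_SLUG_RE = re.compile(r"/people/\d+-(.*)$")


def slug_from_profile_url(href):
    """Return the name slug from a profile URL, or None if the URL does not match."""
    match = PROFILE_SLUG_RE.search(href)
    return match.group(1) if match else None


def display_name(slug):
    """Turn a slug such as 'mark-elliot-zuckerberg' into 'Mark Elliot Zuckerberg'."""
    return slug.replace("-", " ").title()


url = "https://www.bloomberg.com/profiles/people/15103277-mark-elliot-zuckerberg"
slug = slug_from_profile_url(url)
print(slug)                # mark-elliot-zuckerberg
print(display_name(slug))  # Mark Elliot Zuckerberg
```

Returning `None` for a non-matching URL lets the caller skip links that are not people profiles instead of failing on them.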

This should give you a head start.

edited Jul 08 '19 at 04:18

answered Jul 08 '19 at 04:02

wp78de


Thank you so much. This answer solved my problem so well! – Jul 08 '19 at 13:40
I have a question. After I got the outputs, I found that each person's text ends with "\nVIEW MORE". How can I extract all of the information instead of stopping at "\nVIEW MORE"? – Jul 08 '19 at 19:44
@Jancos From what I can see, you have to find the View More link in the corresponding section, simulate a click on it, and then extract the table. That can be a bit tricky at times. Check this: https://stackoverflow.com/questions/47251939/wait-until-button-is-clicked-in-selenium-webdriver-to-click-on-next-button – wp78de Jul 08 '19 at 20:01
Thanks for the link. I will play with it and see what can be found. – Jul 08 '19 at 20:13
