Web scraping using selenium and bs4

Question

I'm trying to build a dataframe by scraping this page:

https://www.schoolholidayseurope.eu/choose-a-country

First, I tell Selenium to click on the country page of my choice. Then I use XPath and tag elements to build the header and body, but I don't get the format I want: my elements are either NaN or duplicated.

Here is my script:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import pandas as pd

def get_browser(url_selector):
    """Get the browser (a "driver")."""
    #option = webdriver.ChromeOptions()
    #option.add_argument('--incognito')
    path_to_chromedriver = r"C:/Users/xxxxx/Downloads/chromedriver_win32/chromedriver.exe"
    browser = webdriver.Chrome(executable_path=path_to_chromedriver)
    browser.get(url_selector)

    # Try with Italy
    browser.find_element_by_xpath(italie_buton_xpath).click()

    # Wait up to 45 seconds for the page to load; the website logo is used as the "loaded" flag
    timeout = 45
    try:
        WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="s5_logo_wrap"]/img')))
    except TimeoutException:
        # Close the browser if loading takes longer than the timeout
        print("Timed out waiting for page to load")
        browser.quit()
    return browser

browser = get_browser(url_selector)
headers = browser.find_element_by_xpath('//*[@id="s5_component_wrap_inner"]/main/div[2]/div[2]/div[3]/table/thead').find_elements_by_tag_name('tr')
headings = [i.text.strip() for i in headers]
bs_obj = BeautifulSoup(browser.page_source, 'html.parser')
rows = bs_obj.find_all('table')[0].find('tbody').find_all('tr')[1:]
table = []

for row in rows:
    line = next(td.get_text() for td in row.find_all("td"))
    print(line)
    table.append(line)
browser.quit()
    
pd.DataFrame(line, columns = headings)

It returns a one-column dataframe like this:

    School Holiday Region Start date End date Week
0   Easter holidays 2018
1   REMARK: Small differences by region are possi...
2   Summer holiday 2018
3   REMARK: First region through to last region.
4   Christmas holiday 2018

There are three issues here: I don't want the REMARK rows; "School Holiday", "Start date" and "End date" are each treated as separate words; and the whole dataframe is not split into columns.

If I split my headings and lines, their shapes mismatch: because of the REMARK rows I get 9 elements in my list instead of 3, and because of the separated words I get 8 elements instead of 5 in the headings.

Can you explain the use of `pd.DataFrame(line, columns = headings)` in your post? Or should it be `pd.DataFrame(table, columns = headings)`? — Nihal, Sep 07 '18 at 07:58
I did not see that mistake. However, the dataframe is still not split and has only a single column. — ALEXANDRE W., Sep 07 '18 at 08:02
There's no error anymore now that I've edited it, just an undesired structure. — ALEXANDRE W., Sep 07 '18 at 08:05
The most efficient solution is to define the headings yourself, statically. Throughout your scraping, the headings will be the same for all the links. — Nihal, Sep 07 '18 at 08:09
I don't want to hard-code the headings because I would like to reuse this function for other countries; this attempt is just a one-off to clear up every issue. Moreover, I still have a problem with the "REMARK" rows: I would like to know if it's possible to add an exception for that class in my code to skip those rows. — ALEXANDRE W., Sep 07 '18 at 08:14
OK, now I understand your problem. Some countries have the remarks as a column and some have them as rows. — Nihal, Sep 07 '18 at 08:15
If I inspect the remarks, they have the "warning" class inside tbody. Is it possible to skip them? — ALEXANDRE W., Sep 07 '18 at 08:20
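
Skipping rows by class can be sketched with BeautifulSoup alone. The example below is only an illustration: `SAMPLE_HTML` is made-up markup that mimics the layout described in the question, `parse_table` is a hypothetical helper, and it assumes the "warning" class sits on the `<tr>` element itself (if it is on a cell instead, the check would need adjusting). It also reads one text per `<th>` and per `<td>`, whereas the `next(...)` call in the question's loop returns only the first cell of each row.

```python
from bs4 import BeautifulSoup

# Made-up HTML that only mimics the layout described in the question;
# the real page's markup and values may differ.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>School Holiday</th><th>Region</th><th>Start date</th><th>End date</th><th>Week</th></tr>
  </thead>
  <tbody>
    <tr><td>Holiday A</td><td>Region X</td><td>start-1</td><td>end-1</td><td>week-1</td></tr>
    <tr class="warning"><td colspan="5">REMARK: placeholder remark text</td></tr>
    <tr><td>Holiday B</td><td>Region Y</td><td>start-2</td><td>end-2</td><td>week-2</td></tr>
  </tbody>
</table>
"""

def parse_table(html):
    """Return (headings, rows), skipping <tr> elements that carry the "warning" class."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    # One heading per <th>, so "School Holiday" stays a single column name
    headings = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        # bs4 exposes the class attribute as a list, or None when it is absent
        if "warning" in (tr.get("class") or []):
            continue
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return headings, rows

headings, rows = parse_table(SAMPLE_HTML)
```

With `headings` and `rows` in this shape, `pd.DataFrame(rows, columns=headings)` should give one column per heading, since every kept row has the same number of cells as the header.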

score 1 · Accepted Answer · answered Sep 07 '18 at 13:35

You can find all the links on the main page, and then iterate over each url with selenium:

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import re, contextlib, pandas
d = webdriver.Chrome('/Users/jamespetullo/Downloads/chromedriver')
d.get('https://www.schoolholidayseurope.eu/choose-a-country')
# Collect [country name, relative URL] for every matching menu item; the first match is discarded
_, *countries = [(lambda x:[x.text, x['href']])(i.find('a')) for i in soup(d.page_source, 'html.parser').find_all('li', {'class':re.compile(r'item\d+$')})]
@contextlib.contextmanager
def get_table(source:str):
    # For each row of the "zebra" table, produce a pair: [header cell texts, data cell texts]
    yield [[[i.text for i in c.find_all('th')], [i.text for i in c.find_all('td')]] for c in soup(source, 'html.parser').find('table', {'class':'zebra'}).find_all('tr')]
results = {}
for country, url in countries:
    d.get(f'https://www.schoolholidayseurope.eu{url}')
    with get_table(d.page_source) as source:
        results[country] = source

def clean_results(_data):
    # The first row holds the headers; pair them with the cells of each remaining row
    [headers, _], *data = _data
    return [dict(zip(headers, i)) for _, i in data]

final_countries = {a:clean_results(b) for a, b in results.items()}
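
If, as the question suggests, the REMARK rows have fewer cells than the header, `zip` in `clean_results` would turn them into partial dictionaries rather than drop them. A minimal sketch of a variant that filters them out is shown below. The function name `clean_results_skipping_remarks` and the `sample` values are hypothetical placeholders, not data from the site; `sample` only imitates the `[header cells, data cells]` pairs that `get_table` produces.

```python
def clean_results_skipping_remarks(parsed_rows):
    """Like clean_results, but drop rows that do not line up with the headers."""
    (headers, _), *data = parsed_rows
    records = []
    for _, cells in data:
        cells = [c.strip() for c in cells]
        # Rows with a different number of cells than the header are skipped
        if len(cells) != len(headers):
            continue
        # Extra guard (an assumption): remark rows start with the word REMARK
        if cells and cells[0].startswith("REMARK"):
            continue
        records.append(dict(zip(headers, cells)))
    return records

# Placeholder rows in the shape produced by get_table; values are made up
sample = [
    [["School Holiday", "Region", "Start date", "End date", "Week"], []],
    [[], ["Holiday A", "Region X", "start-1", "end-1", "week-1"]],
    [[], ["REMARK: placeholder remark text"]],
    [[], ["Holiday B", "Region Y", "start-2", "end-2", "week-2"]],
]
records = clean_results_skipping_remarks(sample)
```

A list of dictionaries like `records` can be passed directly to `pandas.DataFrame(records)`, which uses the dictionary keys as column names.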

Works very well, thanks. I would like some details: what does contextlib do? — ALEXANDRE W., Sep 12 '18 at 10:51
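
On the `contextlib` question: `contextlib.contextmanager` is a standard-library decorator that turns a generator function into a context manager usable in a `with` statement. Code before the `yield` runs on entry, the yielded value is bound by `as`, and code after the `yield` runs on exit. The sketch below is a generic illustration with hypothetical names (`managed`, `events`), not part of the answer's code.

```python
import contextlib

events = []  # records the order in which things happen

@contextlib.contextmanager
def managed(name):
    events.append(f"enter {name}")       # runs when the with-block starts
    try:
        yield name.upper()               # this value is bound by "as"
    finally:
        events.append(f"exit {name}")    # runs when the with-block ends

with managed("table") as value:
    events.append(f"inside with {value}")
```

In the answer, `get_table` has no setup or cleanup around its `yield`, so a plain function that returns the list would likely behave the same way there.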
