
I want to extract all option chain data from the Yahoo Finance webpage; take the put option chain data for simplicity. First, load all the packages used in the program:

import time 
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

The function that writes a company's put option chain data into a CSV file:

def write_option_chain(code):
    browser = webdriver.Chrome()
    browser.maximize_window()
    url = "https://finance.yahoo.com/quote/{}/options?p={}".format(code,code)
    browser.get(url)
    WebDriverWait(browser,10).until(EC.visibility_of_element_located((By.XPATH, ".//select/option")))
    time.sleep(25)
    date_elem = browser.find_elements_by_xpath(".//select/option")
    time_span = len(date_elem)
    print('{} option chains exists in {}'.format(time_span,code)) 
    df_all = pd.DataFrame()
    for item in range(1,time_span):
        element_date = browser.find_element_by_xpath('.//select/option[{}]'.format(item))
        print("parsing {}'s  put option chain on {} now".format(code,element_date.text))
        element_date.click()
        WebDriverWait(browser,10).until(EC.visibility_of_all_elements_located((By.XPATH, ".//table[@class='puts W(100%) Pos(r) list-options']//td")))
        time.sleep(11)
        put_table = browser.find_element_by_xpath((".//table[@class='puts W(100%) Pos(r) list-options']"))
        put_table_string = put_table.get_attribute('outerHTML')
        df_put = pd.read_html(put_table_string)[0]
        df_all = df_all.append(df_put)
    browser.close()
    browser.quit()
    df_all.to_csv('/tmp/{}.csv'.format(code))
    print('{} option chain written into csv file'.format(code))

To test write_option_chain with a list of tickers:

nas_list = ['aapl','adbe','adi','adp','adsk']
for item in nas_list:
    try:
        write_option_chain(code=item)
    except:
        print("check what happens to {} ".format(item))
        continue
    time.sleep(5)

The output info shows:

# many lines omitted for simplicity
18 option chains exists in aapl
parsing aapl's  put option chain on August 27, 2021 now
check what happens to aapl 
check what happens to adbe 
12 option chains exists in adi
parsing adi's  put option chain on December 17, 2021 now
adi option chain written into csv file
11 option chains exists in adp
parsing adp's  put option chain on August 27, 2021 now
adp option chain written into csv file
check what happens to adsk 

A summary of the above output:

1. Only adp's and adi's put option chain data were written into the desired directory.
2. Only part of aapl's and adp's option chain data was extracted.
3. adsk's options webpage could not be opened.
4. The whole run takes almost 20 minutes to execute.

How can I make the data extraction from the webpage with Selenium more robust and efficient?

showkey
  • What do you mean by more robust? – cruisepandey Aug 23 '21 at 05:04
  • It can only get some companies' option data, not all. Please copy and run it on your computer, and compare what you get with what I get. – showkey Aug 23 '21 at 05:05
  • @showkey is `selenium` mandatory, or can we use another library which will make it faster? – imxitiz Aug 27 '21 at 07:10
  • @showkey you haven't responded? And your last edit, which refers to another person, is doing it wrong. You aren't supposed to pass the whole `nas_list`; you have to pass the codes one by one, like `x=write_option_chain("aapl")`, or, as in your code, use a `for loop` :) – imxitiz Aug 28 '21 at 05:47
  • You got nothing because the function `write_option_chain()` doesn't return anything. Instead it saves the CSV files at `/tmp/{code}.csv`. – R. Marolahy Aug 28 '21 at 06:01
  • @R.Marolahy I don't think so... the `write_option_chain()` function is returning `df_all`! – imxitiz Aug 28 '21 at 06:07
  • It doesn't! You didn't even add `return` to your function. I made an edit to my answer so that you can have your return value now. But in this case you have to pass the items in your `nas_list` one by one to your function. Check it and let me know. – R. Marolahy Aug 28 '21 at 06:18
  • @R.Marolahy I don't think you have seen the image posted by the OP; I am not the OP. In that image the OP is returning `df_all` (you were referred there), but I had already explained what the OP is doing wrong in that image in a [comment](https://stackoverflow.com/questions/68887536/how-to-make-the-data-extraction-from-webpage-with-selenium-more-robust-and-effic?noredirect=1#comment121876538_68887536). Anyway, you have edited your answer, great. – imxitiz Aug 28 '21 at 07:00
  • @Xitiz thanks for the comment. I didn't see that image in his update. – R. Marolahy Aug 28 '21 at 13:34

3 Answers


If one may use something other than Selenium, then the best throughput can be achieved by using asyncio with the aiohttp package from the PyPI repository, because of the number of concurrent URL GET requests that need to be made (which makes it a better option than multithreading). For even greater performance (not done here), the code could be separated into fetching URLs (pure I/O) and dataframe processing (CPU-intensive), with a multiprocessing pool used for the latter; a sketch of that split follows the asyncio version below.

import asyncio
import aiohttp
from bs4 import BeautifulSoup
import pandas as pd
import time

async def process_code(session, code):
    async with session.get(f'https://finance.yahoo.com/quote/{code}/options?p={code}') as resp:
        status = resp.status
        if status != 200:
            raise Exception('status returned =', status)
        code_page = await resp.text()
    soup = BeautifulSoup(code_page, 'lxml')
    dates = [elem['value'] for elem in soup.find('select').find_all('option')]
    df_all = pd.DataFrame()
    df_tables = await asyncio.gather(*(process_date(session, code, date) for date in dates))
    for df_table in df_tables:
        if df_table is not None:
            df_all = df_all.append(df_table)
    df_all.to_csv('/tmp/{}.csv'.format(code))

async def process_date(session, code, date):
    async with session.get(f'https://finance.yahoo.com/quote/{code}/options?date={date}&p={code}') as resp:
        status = resp.status
        if status != 200:
            raise Exception('status returned =', status)
        code_page = await resp.text()
    soup = BeautifulSoup(code_page, 'lxml')
    table = soup.find('table', class_='puts W(100%) Pos(r) list-options')
    try:
        return pd.read_html(str(table))[0]
    except ValueError:
        return None

async def main():
    nas_list = ['aapl','adbe','adi','adp','adsk']
    # Connection: keep-alive required to prevent ClientPayloadError on some websites:
    t = time.time()
    async with aiohttp.ClientSession(headers = {'Connection': 'keep-alive', 'user-agent': 'my-application'}) as session:
        await asyncio.gather(*(process_code(session, code) for code in nas_list))
    print('Elapsed time:', time.time() - t)

# Test if we are running under iPython or Jupyter Notebook:
try:
    __IPYTHON__
except NameError:
    asyncio.get_event_loop().run_until_complete(main())
else:
    asyncio.get_running_loop().create_task(main())
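
The I/O versus CPU split mentioned above is not part of the measured code; the following is only a minimal sketch of one way to do it, assuming the same Yahoo Finance URLs. aiohttp keeps the downloads on the event loop, while the CPU-bound BeautifulSoup/pd.read_html parsing is pushed to a ProcessPoolExecutor via run_in_executor. The names parse_puts_table and fetch_and_parse are illustrative, not taken from the answer.

import asyncio
from concurrent.futures import ProcessPoolExecutor
import aiohttp
from bs4 import BeautifulSoup
import pandas as pd

def parse_puts_table(html):
    # CPU-bound work: runs in a worker process, off the event loop
    soup = BeautifulSoup(html, 'lxml')
    table = soup.find('table', class_='puts W(100%) Pos(r) list-options')
    try:
        return pd.read_html(str(table))[0]
    except ValueError:
        return None

async def fetch_and_parse(session, executor, url):
    # Pure I/O: download the page on the event loop
    async with session.get(url) as resp:
        if resp.status != 200:
            raise Exception('status returned =', resp.status)
        html = await resp.text()
    # Hand the HTML to the process pool for parsing
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, parse_puts_table, html)

async def main(urls):
    with ProcessPoolExecutor() as executor:
        async with aiohttp.ClientSession(headers={'Connection': 'keep-alive', 'user-agent': 'my-application'}) as session:
            return await asyncio.gather(*(fetch_and_parse(session, executor, u) for u in urls))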

Here is the multithreading version:

from multiprocessing.pool import ThreadPool
from functools import partial
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def process_code(session, pool, code):
    code_page = session.get(f'https://finance.yahoo.com/quote/{code}/options?p={code}')
    soup = BeautifulSoup(code_page.content, 'lxml')
    dates = [elem['value'] for elem in soup.find('select').find_all('option')]
    df_all = pd.DataFrame()
    for df_table in pool.imap(partial(process_date, session, code), dates):
        if df_table is not None:
            df_all = df_all.append(df_table)
    df_all.to_csv('/tmp/{}.csv'.format(code))

def process_date(session, code, date):
    code_page = session.get(f'https://finance.yahoo.com/quote/{code}/options?date={date}&p={code}')
    soup = BeautifulSoup(code_page.content, 'lxml')
    table = soup.find('table', class_='puts W(100%) Pos(r) list-options')
    try:
        return pd.read_html(str(table))[0]
    except ValueError:
        return None

t = time.time()
nas_list = ['aapl','adbe','adi','adp','adsk']
with requests.Session() as session:
    headers = {'User-Agent': 'my-application'}
    session.headers = headers
    pool = ThreadPool(100)
    pool.map(partial(process_code, session, pool), nas_list)
print('Elapsed time:', time.time() - t)
Booboo
  • I have made a correction to the `asyncio` code (it was only running correctly under Jupyter Notebook -- sorry about that). On my desktop it runs in approximately 7 seconds. The multithreading version takes approximately 11 seconds. – Booboo Aug 28 '21 at 10:48

I am not sure whether I may use requests and BeautifulSoup after you clearly asked

How can I make the data extraction from the webpage with Selenium more robust and efficient?

but here is the requests and BeautifulSoup code, which works perfectly for me.

import requests # pip install requests
from bs4 import BeautifulSoup # pip install beautifulsoup4
import pandas as pd

headers={"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0"}

def scrape(c):
    page=requests.get(f"https://finance.yahoo.com/quote/{c}/options?p={c}",headers=headers)
    soup=BeautifulSoup(page.content,"lxml")

    timestamp=list(map(lambda x: x["value"],soup.find("select").find_all("option")))
    # Extracting timestamp from <select>'s <option>

    df=pd.DataFrame()

    for t in timestamp: # Looping through the list of timestamp
        page2=requests.get(f"https://finance.yahoo.com/quote/{c}/options?date={t}&p={c}",headers=headers)
        soup2=BeautifulSoup(page2.content,"lxml")

        table=soup2.find("table",class_="puts W(100%) Pos(r) list-options")
        try:
            tabledf=pd.read_html(str(table))[0]
            df=df.append(tabledf)
        except ValueError:
            pass

    df.to_csv(f"/temp/{c}.csv",index=False)

nas_list = ['aapl','adbe','adi','adp','adsk']
for nas in nas_list:
    scrape(nas)

BeautifulSoup will be much faster than Selenium on websites that don't require JavaScript, so I am using BeautifulSoup here. You can also combine Selenium with BeautifulSoup by parsing browser.page_source, but I don't think Selenium is required here; a small sketch of that combination follows below.

For a little more detail, see Selenium versus BeautifulSoup for web scraping.
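
For completeness, here is a minimal sketch (my own addition, not from the answer) of the Selenium-plus-BeautifulSoup combination mentioned above, where browser.page_source hands the rendered DOM to BeautifulSoup:

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
browser.get("https://finance.yahoo.com/quote/aapl/options?p=aapl")
# page_source returns the DOM after any JavaScript has run
soup = BeautifulSoup(browser.page_source, "lxml")
puts_table = soup.find("table", class_="puts W(100%) Pos(r) list-options")
browser.quit()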

imxitiz

In this use case, using Selenium is OK; you just need some optimizations. The following are some examples that I found:

  • Use headless mode: Selenium tests can take a while to complete because of the page elements the browser needs to load. Headless testing removes this load time and can cut your execution times significantly; in our tests with headless testing, we've seen a 30% reduction in test execution times (source).
  • Avoid multiple time.sleep() and WebDriverWait().until() calls (especially inside the for loop); use a single .implicitly_wait() instead.

Code example:

from selenium.webdriver.chrome.options import Options

def write_option_chain(code):
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--start-maximized")
    browser = webdriver.Chrome(options=chrome_options)
    browser.implicitly_wait(10)  # applies to every find_element* call below
    url = "https://finance.yahoo.com/quote/{}/options?p={}".format(code, code)
    browser.get(url)
    date_elem = browser.find_elements_by_xpath(".//select/option")
    time_span = len(date_elem)
    print('{} option chains exists in {}'.format(time_span, code))
    df_all = pd.DataFrame()
    for item in range(1, time_span):
        element_date = browser.find_element_by_xpath('.//select/option[{}]'.format(item))
        print("parsing {}'s  put option chain on {} now".format(code, element_date.text))
        element_date.click()
        put_table = browser.find_element_by_xpath(".//table[@class='puts W(100%) Pos(r) list-options']")
        put_table_string = put_table.get_attribute('outerHTML')
        df_put = pd.read_html(put_table_string)[0]
        df_all = df_all.append(df_put)
    browser.close()
    browser.quit()
    df_all.to_csv('/tmp/{}.csv'.format(code))
    print('{} option chain written into csv file'.format(code))

then:

>>> nas_list = ['aapl', 'adbe', 'adi', 'adp', 'adsk']
>>> for item in nas_list:
...     write_option_chain(code=item)  # this saves your df at /tmp/{code}.csv

With these simple optimizations, the code took about 180 seconds to complete, with everything extracted.

R. Marolahy