How to retrieve a table from a dynamic website with Python and Selenium

Question

I want to retrieve all the information from a table on a dynamic website and I have the following code for it:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import sys
reload(sys)
import re
import csv
from time import sleep
sys.setdefaultencoding('utf-8') #added since it would give error for certain values when using str(i)

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
prefs = {'profile.managed_default_content_settings.images':2}
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(chrome_options=chrome_options) 

maxcr = 1379
listofrows = []


url = "http://biggestbook.com/ui/catalog.html#/itemDetail?itemId=HERY4832YER01&uom=CT"
print(url) 
driver.get(url)
wait = WebDriverWait(driver,10)
# Trying to get the table 
tableloadwait = (wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".panel-body"))))
table = driver.find_elements_by_css_selector(".panel-body")
print(table)
RowsOfTable = table.get_attribute("tr")

However, I keep getting an error and it doesn't work so far. How do I retrieve the information in the table? Thanks a lot!

error: RowsOfTable = table.get_attribute("tr") AttributeError: 'list' object has no attribute 'get_attribute'

You will get the `AttributeError: 'list' object has no attribute 'get_attribute'` error, as `tr` is not an attribute. What data are you trying to get from the table? – supputuri, Mar 16 '19 at 05:33
`find_elements_` (with `s` in `elements`) always returns a list of elements, so you have to use a `for` loop to visit every element and call `get_attribute` on each element separately. – furas, Mar 16 '19 at 05:37
error: RowsOfTable = table.get_attribute("tr") AttributeError: 'list' object has no attribute 'get_attribute' – user11054467, Mar 16 '19 at 16:22
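
The point made in these comments can be sketched as follows: a `find_elements` call returns a list, so the rows have to be collected inside a loop over its items. This is a minimal sketch rather than the asker's final code. The helper name `collect_row_texts` is hypothetical, and it only assumes that each item offers Selenium's `find_elements` and `get_attribute` methods. The locator is passed as the plain string "css selector", which is the value of `By.CSS_SELECTOR`, so the snippet itself does not need to import Selenium.

```python
def collect_row_texts(containers):
    """Return the text of every <tr> found under each container element.

    `containers` is the list returned by a find_elements call, for example
    the `table` variable in the question (hypothetical usage).
    """
    texts = []
    for container in containers:  # the list has no get_attribute; its items do
        # "css selector" is the string value of selenium's By.CSS_SELECTOR
        for row in container.find_elements("css selector", "tr"):
            texts.append(row.get_attribute("innerText"))
    return texts
```

With the question's variables this would be called as `collect_row_texts(table)` once the wait has completed.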

supputuri · Accepted Answer · 2019-03-23T23:14:29.563

1

Here is the code to get the product details

tableloadwait = (wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".panel-body"))))
driver.find_element_by_xpath("//span[contains(.,'Product Details')]").click()
rows = driver.find_elements_by_xpath("//span[contains(.,'Product Details')]/ancestor::div[@class='accordion-top-border']//tr[(@ng-repeat='attr in attributes' or @ng-repeat='field in fields') and @class='visible-xs']")

for row in rows:
    print(row.get_attribute('innerText'))
driver.quit()

You will need to trim or split the values to suit your requirements.

If you would like to get the data based on the row text, use the following:

upcData = driver.find_element_by_xpath("//strong[.='UPC']/parent::td").get_attribute('innerText').replace('UPC','').replace('\n','').replace('    ','')
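
As an alternative to chaining `replace` calls, the trimming step could be sketched like this. It is only a sketch under an assumption that is not confirmed by the page itself: the first non-empty line of a row's `innerText` is the label (for example `UPC`) and the remaining lines are the value. The function names and the sample strings are made up for illustration.

```python
def split_label_value(inner_text):
    """Split one row's innerText into (label, value).

    Assumption: the first non-empty line is the label, the rest is the value.
    """
    lines = [line.strip() for line in inner_text.splitlines()]
    lines = [line for line in lines if line]
    if not lines:
        return "", ""
    return lines[0], " ".join(lines[1:])


def rows_to_dict(row_texts):
    """Build {label: value} from a list of innerText strings, skipping blank rows."""
    details = {}
    for text in row_texts:
        label, value = split_label_value(text)
        if label:
            details[label] = value
    return details


# Placeholder strings shaped like the rows handled above (values are made up).
sample = ["UPC\n    000000000000", "Global Product Type\n    example-type", "   \n"]
details = rows_to_dict(sample)
print(details.get("UPC"))
print(details.get("Global Product Type", "not found"))
```

In the answer's loop, `row_texts` would be the list of `innerText` strings read from `rows`, and a single field such as 'Global Product Type' can then be looked up by name.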

edited Mar 23 '19 at 23:14

answered Mar 16 '19 at 05:37

supputuri


In this, I actually want to get the product details and not 6 tables. I thought (".panel-body") would only be applicable to that table? – user11054467 Mar 16 '19 at 16:36
is there a way here to get only the 'global product type' from this table? – user11054467 Mar 22 '19 at 04:07
with your first solution, I only get the table upto a 'special features'. If I want to get any one of the following attributes then what would I need to do? for example, if rather than wanting 'Global Product Type' I want the UPC or UNSPSC? Because when I put any of the following (Carton Weight Carton Pack Quantity UPC UNSPSC) in the above code (code from the comment) it gives a blank. If I run the first code you gave then these dont show up but other attributes show up. I am a bit confused now – user11054467 Mar 22 '19 at 05:51
@suppurturi Yes I had done that as well. However, with replacing with part to 'UPC' or UNSPSC both give blanks. In fact even if I run the first code you gave for the entire table, these particular ones along with some other attributes do not show up. Any reason for them not showing up? – user11054467 Mar 22 '19 at 13:06
sent you a message on chat! – user11054467 Mar 23 '19 at 16:33

QHarr · Answer 2 · 2019-03-16T05:45:33.940

1

Expand the accordion with the appropriate + button first then select the table. Add waits for items to be present. Change the expandSigns index to 2 if you want the other table.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

url = 'http://biggestbook.com/ui/catalog.html#/itemDetail?itemId=HERY4832YER01&uom=CT'
driver = webdriver.Chrome()
driver.get(url)
expandSigns = WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".glyphicon-plus")))
expandSigns[1].click()
WebDriverWait(driver,10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "td")))

table = driver.find_element_by_css_selector('table')
html = table.get_attribute('outerHTML')

# read_html returns a list of DataFrames, one per table found in the HTML
df = pd.read_html(html)
print(df)
driver.quit()

edited Mar 16 '19 at 05:45

answered Mar 16 '19 at 05:39

QHarr


I want to write the outputs to the files, but it gives everything in one cell but I want it to be in different rows. Now I am confused as to what's going on. `df = pd.read_html(html) print(html) listofrows.append(df) print(listofrows) for rows in listofrows: with open('listofData.csv', 'w') as listofData: for rows in listofrows: rowlistwriter = csv.writer(listofData) rowlistwriter.writerow(rows)` – user11054467 Mar 16 '19 at 19:59
Also, I want to not have chrome open up (using your method) but for some reason it still opens up and saturates everything `chrome_options = webdriver.ChromeOptions() chrome_options.add_argument('--headless') driver = webdriver.Chrome(chrome_options=chrome_options) ` – user11054467 Mar 16 '19 at 20:04
Why not use df.to_csv? As for headless not sure why that doesn't work. At a quick scan, what you have written looks right. I would have to test. – QHarr Mar 16 '19 at 21:08
As you suggested I did the following: df[0].to_csv("output.csv") and it worked. But how do I convert it to columns? It gives rows atm – user11054467 Mar 20 '19 at 03:41
Also, if I keep feeding multiple pages, will it keep writing to new rows? – user11054467 Mar 20 '19 at 03:51
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html . If multiple pages merge the dataframes first whilst looping and write the final dataframe out once to csv. – QHarr Mar 20 '19 at 07:01
when I do df.to_csv it gives me an error saying that list do not have attribute to_csv – user11054467 Mar 21 '19 at 03:30
What works is that I put in df[0] instead of df. However, that way, I only get the table from the first page but the rest I don't get – user11054467 Mar 21 '19 at 03:38
read_html returns a list so you always have to index into that list to create indiv dataframe. – QHarr Mar 21 '19 at 07:39
If you want more tables, loop the list and write the tables to csv. – QHarr Mar 21 '19 at 07:40
Okay, i am actually doing that. But when i do something like ` listofrows = [] ........... listofrows.append(df) ......` when i write to the csv file it comes out as 'blocks' or everything in one element. How is it possible to break it down? I have given the example here https://stackoverflow.com/questions/55273655/how-to-print-tables-fetched-as-a-list-to-csv-files-from-dynamic-website Thats my entire code pretty much – user11054467 Mar 21 '19 at 17:52
I think it will work better if I tell you this: in the table, I am not interested in anything except the 'Global Product Type'. Now is there a way to get just that? I only want that – user11054467 Mar 22 '19 at 03:57
I didn't see this message I'm afraid. I can have another look if you want. – QHarr Mar 23 '19 at 23:17
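
The "everything in one cell" problem from the comments is probably caused by passing whole tables to `writerow`: `csv.writer.writerow` expects a flat list of cell values and writes each item's string form as one cell. A minimal sketch of writing one line per attribute with the standard `csv` module follows, assuming Python 3. The function name `write_details_csv` and the shape of `pages` (item id mapped to a dict of attribute names and values) are hypothetical choices for this illustration, not something taken from the thread.

```python
import csv


def write_details_csv(path, pages):
    """Write one CSV line per (item, attribute, value).

    `pages` maps an item id to a {attribute: value} dict; this shape is an
    assumption made for the sketch.
    """
    # Open the file once, outside the loops, so earlier rows are not overwritten.
    # newline="" is what the csv module documentation recommends for writers.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["item", "attribute", "value"])
        for item_id, details in pages.items():
            for name, value in details.items():
                # Each writerow call receives a flat list of cell values.
                writer.writerow([item_id, name, value])
```

When several pages are scraped, their details can be collected into `pages` inside the loop and written once at the end, which mirrors the advice above to merge first and write the final result a single time.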

Sers · Answer 3 · 2019-03-16T11:06:29.350

If you need to scrape rather than test, you can use requests to get the data. The code below is an example of how to get data from the page.

import requests
import re

# Fetch the page HTML, which contains the token and the list keys
response = requests.get("http://biggestbook.com/ui/catalog.html#/itemDetail?itemId=HERY4832YER01&uom=CT")

# Get token using regular expression
productRecommToken = re.search("'productRecommToken','(.+)'", response.text)[1]

# Get list of keys using regular expression
listKey = re.search("'listKey',\\['(.*?)'\\]", response.text)[1].split("','")

# Create header with token
headers = {
    'Accept': 'application/json, text/plain, */*',
    'Referer': 'http://biggestbook.com/ui/catalog.html',
    'Origin': 'http://biggestbook.com',
    'DNT': '1',
    'token': productRecommToken,
    'BiggestBook-Handle-Errors-Generically': 'true',
}

# Create parameters with list keys and search values
params = (
    ('listKey', listKey),
    ('uom', 'CT'),
    ('vc', 'n'),
    ('win', 'HERY4832YER01'),
)

# Return json with all details about product
response = requests.get('https://api.essendant.com/digital/digitalservices/search/v1/items',
                       headers=headers,
                       params=params)
data = response.json()

# Get the items from the JSON; there may be more than one
items = data["items"]

# Iterate and get details you need. Check "data" to see all possible details you can get
for i in items:
    print(i["manufacturer"])
    print(i["description"])
    print(i["actualPrice"])

    # Get attributes
    attributes = i["attributes"]

    # Example of how to get one specific attribute.
    thickness = list(filter(lambda d: d['name'] == 'Thickness', attributes))[0]["value"]

    # Print all attributes as name = value
    for a in attributes:
        print(f"{a['name']} = {a['value']}")
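
The `list(filter(...))[0]` lookup above raises `IndexError` when the requested attribute is missing. A small sketch of a more forgiving lookup is shown below. The helper name `attribute_value` is hypothetical, and the sample list is made up; it only imitates the list of `name`/`value` dicts that the code above reads from `i["attributes"]`.

```python
def attribute_value(attributes, name, default=None):
    """Return the 'value' of the first attribute whose 'name' matches, else default."""
    return next((a["value"] for a in attributes if a["name"] == name), default)


# Made-up sample in the same shape as i["attributes"] above.
sample_attributes = [
    {"name": "Thickness", "value": "example-thickness"},
    {"name": "Global Product Type", "value": "example-type"},
]
print(attribute_value(sample_attributes, "Global Product Type"))
print(attribute_value(sample_attributes, "UPC", default="not listed"))
```

Inside the loop this could be called as `attribute_value(attributes, 'Global Product Type')`, which returns `None` instead of failing when the API response does not contain that attribute.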
