I want to retrieve all the information from a table on a dynamic website and I have the following code for it:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import sys
reload(sys)
import re
import csv
from time import sleep
sys.setdefaultencoding('utf-8') #added since it would give error for certain values when using str(i)

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
prefs = {'profile.managed_default_content_settings.images':2}
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(chrome_options=chrome_options) 

maxcr = 1379
listofrows = []


url = "http://biggestbook.com/ui/catalog.html#/itemDetail?itemId=HERY4832YER01&uom=CT"
print(url) 
driver.get(url)
wait = WebDriverWait(driver,10)
# Trying to get the table 
tableloadwait = (wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".panel-body"))))
table = driver.find_elements_by_css_selector(".panel-body")
print(table)
RowsOfTable = table.get_attribute("tr")

However, I keep getting an error and it doesn't work so far. How do I retrieve the information in the table? Thanks a lot!

error: RowsOfTable = table.get_attribute("tr") AttributeError: 'list' object has no attribute 'get_attribute'

  • what is the error and where does it occur? – QHarr Mar 16 '19 at 05:28
  • always show full error (Traceback) in question. – furas Mar 16 '19 at 05:31
  • you will get `AttributeError: 'list' object has no attribute 'get_attribute'` error as `tr` is not an attribute. what data you are trying to get from table? – supputuri Mar 16 '19 at 05:33
  • `find_elements_` (with `s` in `elements`) always gives a list with many elements - so you have to use a `for` loop to get every element and use `get_attribute` with every element separately. – furas Mar 16 '19 at 05:37
  • error: RowsOfTable = table.get_attribute("tr") AttributeError: 'list' object has no attribute 'get_attribute' – user11054467 Mar 16 '19 at 16:22
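
The point made in the comments above can be sketched without a browser: a `find_elements` call returns a plain Python list, so `get_attribute` has to be called on each item rather than on the list. `FakeElement` below is a hypothetical stand-in for a Selenium WebElement, included only so the snippet runs on its own; `inner_texts` is likewise an illustrative name.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, text):
        self._text = text

    def get_attribute(self, name):
        # A real WebElement would look the name up in the browser.
        return self._text if name == "innerText" else None


def inner_texts(elements):
    """Call get_attribute on every element of a list, one at a time."""
    return [element.get_attribute("innerText") for element in elements]


# With a real driver, the list would come from a find_elements call instead.
rows = [FakeElement("first row"), FakeElement("second row")]
print(inner_texts(rows))
```

Calling `rows.get_attribute(...)` directly would reproduce the `AttributeError` from the question, because the list itself has no such method.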

3 Answers

Here is the code to get the product details:

tableloadwait = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".panel-body")))
# Expand the "Product Details" accordion so that its rows are rendered
driver.find_element(By.XPATH, "//span[contains(.,'Product Details')]").click()
rows = driver.find_elements(By.XPATH, "//span[contains(.,'Product Details')]/ancestor::div[@class='accordion-top-border']//tr[(@ng-repeat='attr in attributes' or @ng-repeat='field in fields') and @class='visible-xs']")

# find_elements returns a list, so read innerText from each row in turn
for row in rows:
    print(row.get_attribute('innerText'))
driver.quit()

You may need to trim or split the values, depending on your requirements.

If you would like to get the data based on the row text, use the following:

upcData = driver.find_element(By.XPATH, "//strong[.='UPC']/parent::td").get_attribute('innerText').replace('UPC','').replace('\n','').replace('    ','')
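
As a rough sketch of the trimming step, assuming each row's `innerText` is a label followed by a line break and a padded value (the pattern that the `replace` calls above strip out), the text can be split into a label/value pair. `split_row` and the sample strings are hypothetical and not taken from the page.

```python
def split_row(inner_text):
    """Split label/value row text into a (label, value) tuple."""
    # Strip every line and drop the ones that are empty.
    parts = [line.strip() for line in inner_text.splitlines() if line.strip()]
    if not parts:
        return ("", "")
    # Treat the first non-empty line as the label and the rest as the value.
    return (parts[0], " ".join(parts[1:]))


# Hypothetical innerText strings shaped like the rows printed above.
sample_rows = ["UPC\n    000000000000", "Global Product Type\n    Sample type\n"]
details = dict(split_row(text) for text in sample_rows)
print(details["UPC"])
```

Collecting the pairs in a dictionary makes it possible to look up a single row, such as the UPC, by its label.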
supputuri
  • In this, I actually want to get the product details and not 6 tables. I thought (".panel-body") would only be applicable to that table? – user11054467 Mar 16 '19 at 16:36
  • is there a way here to get only the 'global product type' from this table? – user11054467 Mar 22 '19 at 04:07
  • with your first solution, I only get the table up to 'special features'. If I want to get any one of the following attributes, what would I need to do? For example, what if I want the UPC or UNSPSC rather than 'Global Product Type'? When I put any of the following (Carton Weight, Carton Pack Quantity, UPC, UNSPSC) in the above code (code from the comment), it gives a blank. If I run the first code you gave, these don't show up but other attributes do. I am a bit confused now – user11054467 Mar 22 '19 at 05:51
  • @suppurturi Yes I had done that as well. However, with replacing with part to 'UPC' or UNSPSC both give blanks. In fact even if I run the first code you gave for the entire table, these particular ones along with some other attributes do not show up. Any reason for them not showing up? – user11054467 Mar 22 '19 at 13:06
  • sent you a message on chat! – user11054467 Mar 23 '19 at 16:33

Expand the accordion with the appropriate + button first, then select the table. Add waits so that the items are present before you use them. Change the expandSigns index to 2 if you want the other table.

from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'http://biggestbook.com/ui/catalog.html#/itemDetail?itemId=HERY4832YER01&uom=CT'
driver = webdriver.Chrome()
driver.get(url)

# Wait for the accordion "+" buttons, then expand the second one
expandSigns = WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".glyphicon-plus")))
expandSigns[1].click()
# Wait until at least one table cell is present
WebDriverWait(driver,10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "td")))

table = driver.find_element(By.CSS_SELECTOR, 'table')
html = table.get_attribute('outerHTML')

# read_html returns a list of DataFrames (one per table); use df[0] for a single table
df  = pd.read_html(StringIO(html))
print(df)
driver.quit()
QHarr
  • I want to write the outputs to the files, but it gives everything in one cell but I want it to be in different rows. Now I am confused as to what's going on. `df = pd.read_html(html) print(html) listofrows.append(df) print(listofrows) for rows in listofrows: with open('listofData.csv', 'w') as listofData: for rows in listofrows: rowlistwriter = csv.writer(listofData) rowlistwriter.writerow(rows)` – user11054467 Mar 16 '19 at 19:59
  • Also, I want to not have chrome open up (using your method) but for some reason it still opens up and saturates everything `chrome_options = webdriver.ChromeOptions() chrome_options.add_argument('--headless') driver = webdriver.Chrome(chrome_options=chrome_options) ` – user11054467 Mar 16 '19 at 20:04
  • Why not use df.to_csv? As for headless not sure why that doesn't work. At a quick scan, what you have written looks right. I would have to test. – QHarr Mar 16 '19 at 21:08
  • As you suggested I did the following: df[0].to_csv("output.csv") and it worked. But how do I convert it to columns? It gives rows atm – user11054467 Mar 20 '19 at 03:41
  • Also, if I keep feeding multiple pages, will it keep writing to new rows? – user11054467 Mar 20 '19 at 03:51
  • https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html . If multiple pages merge the dataframes first whilst looping and write the final dataframe out once to csv. – QHarr Mar 20 '19 at 07:01
  • when I do df.to_csv it gives me an error saying that list do not have attribute to_csv – user11054467 Mar 21 '19 at 03:30
  • What works is that I put in df[0] instead of df. However, that way, I only get the table from the first page but the rest I don't get – user11054467 Mar 21 '19 at 03:38
  • read_html returns a list so you always have to index into that list to create indiv dataframe. – QHarr Mar 21 '19 at 07:39
  • If you want more tables, loop the list and write the tables to csv. – QHarr Mar 21 '19 at 07:40
  • Okay, i am actually doing that. But when i do something like ` listofrows = [] ........... listofrows.append(df) ......` when i write to the csv file it comes out as 'blocks' or everything in one element. How is it possible to break it down? I have given the example here https://stackoverflow.com/questions/55273655/how-to-print-tables-fetched-as-a-list-to-csv-files-from-dynamic-website Thats my entire code pretty much – user11054467 Mar 21 '19 at 17:52
  • I think it will work better if I tell you this: in the table, I am not interested in anything except the 'Global Product Type'. Now is there a way to get just that? I only want that – user11054467 Mar 22 '19 at 03:57
  • I didn't see this message I'm afraid. I can have another look if you want. – QHarr Mar 23 '19 at 23:17
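
The advice in the comments above (index into the list that `read_html` returns, merge the per-page tables while looping, then write the result once) might look like the sketch below. The two small DataFrames stand in for `pd.read_html(...)[0]` results from two product pages; their contents, the `ITEM-A`/`ITEM-B` labels, and the in-memory buffer are assumptions made for illustration.

```python
import io

import pandas as pd

# Stand-ins for pd.read_html(...)[0] on two pages: one name/value pair per row.
page_tables = {
    "ITEM-A": pd.DataFrame([["Global Product Type", "Type A"], ["Brand", "Brand A"]]),
    "ITEM-B": pd.DataFrame([["Global Product Type", "Type B"], ["Brand", "Brand B"]]),
}

records = []
for item_id, table in page_tables.items():
    # Use column 0 as the index, then transpose so the names become columns.
    row = table.set_index(0).T
    row.index = [item_id]
    records.append(row)

# Merge once after the loop and write a single CSV (here into a buffer).
combined = pd.concat(records)
buffer = io.StringIO()
combined.to_csv(buffer, index_label="item_id")
print(buffer.getvalue())

# Keep only the one attribute of interest.
print(combined["Global Product Type"].tolist())
```

In real use the buffer would be replaced by a file path such as the `output.csv` mentioned in the comments, so that each page ends up as one row of the same file.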

If you need to scrape rather than test, you can use requests to get the data. The code below is an example of how to get data from the page.

import requests
import re

# Fetch the main page (HTML) to extract the token and the list keys
response = requests.get("http://biggestbook.com/ui/catalog.html#/itemDetail?itemId=HERY4832YER01&uom=CT")

# Get token using regular expression
productRecommToken = re.search("'productRecommToken','(.+)'", response.text)[1]

# Get list of keys using regular expression
listKey = re.search("'listKey',\\['(.*?)'\\]", response.text)[1].split("','")

# Create header with token
headers = {
    'Accept': 'application/json, text/plain, */*',
    'Referer': 'http://biggestbook.com/ui/catalog.html',
    'Origin': 'http://biggestbook.com',
    'DNT': '1',
    'token': productRecommToken,
    'BiggestBook-Handle-Errors-Generically': 'true',
}

# Create parameters with list keys and search values
params = (
    ('listKey', listKey),
    ('uom', 'CT'),
    ('vc', 'n'),
    ('win', 'HERY4832YER01'),
)

# Return json with all details about product
response = requests.get('https://api.essendant.com/digital/digitalservices/search/v1/items',
                       headers=headers,
                       params=params)
data = response.json()

# Get items from json, probably could be more than one
items = data["items"]

# Iterate and get details you need. Check "data" to see all possible details you can get
for i in items:
    print(i["manufacturer"])
    print(i["description"])
    print(i["actualPrice"])

    # Get attributes
    attributes = i["attributes"]

    # Example of how to get one specific attribute.
    thickness = list(filter(lambda d: d['name'] == 'Thickness', attributes))[0]["value"]

    # Print all attributes as name = value
    for a in attributes:
        print(f"{a['name']} = {a['value']}")
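
Because the `[0]` lookup above raises `IndexError` when an attribute is missing, a more forgiving lookup can be sketched with `next` and a default value. The `get_attribute_value` helper and the sample `attributes` list are hypothetical; only the `name` and `value` keys come from the code above.

```python
def get_attribute_value(attributes, name, default=None):
    """Return the value of the first attribute dict whose 'name' matches."""
    return next((a["value"] for a in attributes if a["name"] == name), default)


# Hypothetical data shaped like i["attributes"] in the loop above.
attributes = [
    {"name": "Thickness", "value": "Sample thickness"},
    {"name": "Global Product Type", "value": "Sample type"},
]

print(get_attribute_value(attributes, "Global Product Type"))
print(get_attribute_value(attributes, "UPC", default=""))
```

Returning a default instead of raising keeps a loop over many items running when one product lacks the requested attribute.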
Sers