I'm using Pandas read_html() to snag the HTML table from my Bookshelf pages on the website Goodreads.com.
Overall Question: Is there a way to get the links AND the text into the dataframe using Pandas? If not, what's the next best method? I'm saving to Excel, but the same approach should apply to CSV too.
Final Goal: I need the resulting columns to be title, title_url, author, author_url, shelves, shelves_urls, date_started, date_finished, date_added.
Version 1: Simple. Uses pd.read_html. Not sure where the data processing should be done -- before, during, or after read_html? Can't figure out how to get URLs with this method (but see the note after Results 1 below). :(
import os
import pandas as pd

# DOWNLOAD_PATH and CWD are pathlib.Path constants defined elsewhere.
files = [entry.path for entry in os.scandir(DOWNLOAD_PATH) if entry.is_file()]

df_shelf = pd.DataFrame()
for file in files:
    with open(file, 'r', encoding='utf-8') as f:
        page_content = f.read()
    tables = pd.read_html(page_content)  # read_html returns a list of DataFrames
    df_shelf = pd.concat([df_shelf, tables[0]])

test_filename = CWD / "shelf_list.xlsx"
df_shelf.to_excel(test_filename)
Results 1: (csv file, no URLs)
Unnamed: 0 title author my rating shelves date started date finished date added
0 Dead Speak (Cold Case Psychic #1) Pine, Pandora * 1 of 5 stars2 of 5 stars3 of 5 stars[ 4 of 5 stars ]5 of 5 stars currently-reading, 1-book-of-the-month 2022/02/01 2022/02/28 2022/02/02
1 Gifts of the Fairy Queen Crook, Amy * 1 of 5 stars2 of 5 stars[ 3 of 5 stars ]4 of 5 stars5 of 5 stars read, 0-gay, genre-fairytale-f..., genre-fantasy, profession-greent..., profession-mage-w..., theme-fantasy-of-... 2022/01/24
2 A Vanishing Glow (The Mystech Arcanum, #1-2) Radcliff, Alexis * 1 of 5 stars2 of 5 stars[ 3 of 5 stars ]4 of 5 stars5 of 5 stars read, 0-gay, genre-action-adve..., genre-political-i..., genre-sci-fi-fant..., genre-steam-punk-..., profession-captain, profession-military, profession-writer..., species-engineere... 2022/01/19
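Note: pandas 1.5+ may answer the overall question directly: read_html grew an extract_links argument that returns each body cell's href alongside its text. A rough sketch, untested against these Goodreads pages (the 'title' and 'author' column names are guesses at what read_html produces):

import pandas as pd

# pandas >= 1.5 only: extract_links="body" turns every body cell into a
# (text, href) tuple, with href set to None where the cell has no link.
tables = pd.read_html(page_content, extract_links="body")
df = tables[0]

# Split each tuple column into a text column and a *_url column.
for col in ('title', 'author'):
    df[col + '_url'] = df[col].str[1]  # hrefs are site-relative; prepend GR_BASE_URL if needed
    df[col] = df[col].str[0]

If extract_links isn't available, a manual pass like Version 2 below is the fallback.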
UPDATED (3/2/22): Version 2: This uses BeautifulSoup to do some data processing, building each column as a list where each element is [content text, [links]], with 'NaN' to occupy empty space. I've overdone it with the lists. Any pointers on where I can simplify?
Hopefully I can proceed from here, but...
a. Is BeautifulSoup the only way to process it?
b. Is there a better approach?
from bs4 import BeautifulSoup
import pandas as pd

df_shelf = pd.DataFrame([], columns=col_names)  # col_names defined elsewhere
shelf_pages = []
for file in files:
    with open(file, 'r', encoding='utf-8') as f:
        page_content = f.read()
    soup = BeautifulSoup(page_content, 'lxml')
    parsed_table = soup.find_all('table')[0].find_all('tr')
    page_rows = []
    for row in parsed_table:  # collect all rows on a page (list)
        row_data = []
        for td in row.find_all('td'):  # collect columns in each row (list)
            cell_text = ''.join(td.stripped_strings).strip()
            link = td.find('a')
            if not cell_text and not link:
                continue
            if cell_text and not link:
                column_data = [cell_text, 'NaN']
            else:
                a_tags = td.find_all('a')
                urls = [GR_BASE_URL + a['href'] for a in a_tags]
                if not cell_text or 'view activity »edit' in cell_text:
                    column_data = ['NaN', urls]
                else:
                    column_data = [cell_text, urls]
            row_data.append(column_data)
        if row_data:  # header rows contain <th>, not <td>, so they end up empty
            page_rows.append(row_data)
    shelf_pages.extend(page_rows)  # accumulate rows across all pages
Results: The final table has lists of lists of lists of lists. The only field that should have lists is shelves.
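One way to flatten it: instead of building [text, [links]] pairs per cell, parse each <tr> straight into a flat dict keyed by the target columns, so that only shelves carries lists. A sketch -- the td.field.* class names are my guess at the Goodreads markup (inspect a saved page to confirm), and the date fields would follow the same pattern:

import pandas as pd
from bs4 import BeautifulSoup

def parse_row(row):
    # One <tr> -> one flat dict; only the shelves fields hold lists.
    def cell(name):
        # Assumed markup: <td class="field title">, <td class="field author">, ...
        return row.select_one('td.field.' + name)

    def text_and_url(name):
        td = cell(name)
        a = td.find('a') if td else None
        if not a:
            return None, None
        return a.get_text(strip=True), GR_BASE_URL + a['href']

    title, title_url = text_and_url('title')
    author, author_url = text_and_url('author')
    shelf_links = cell('shelves').find_all('a') if cell('shelves') else []
    return {
        'title': title, 'title_url': title_url,
        'author': author, 'author_url': author_url,
        'shelves': [a.get_text(strip=True) for a in shelf_links],
        'shelves_urls': [GR_BASE_URL + a['href'] for a in shelf_links],
        # date_started / date_finished / date_added: same pattern as above
    }

soup = BeautifulSoup(page_content, 'lxml')
rows = [parse_row(tr) for tr in soup.select('table tr') if tr.find('td')]
df_shelf = pd.DataFrame(rows)

A flat dict per row means pd.DataFrame(rows) yields the goal columns directly, and nothing ends up nested deeper than one list.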