Python/Gspread : code to remove whitespace and format data

Question

I am building a little tool to scrape a TTRPG website for data and write the data into a google sheet. This is my code thus far :

import requests
from bs4 import BeautifulSoup
import gspread

gc = gspread.service_account(filename='credentials.json')

sh = gc.open('D&D_Tables').sheet1

url = 'https://www.d20srd.org/srd/monsters/achaierai.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# line below uses Beautiful Soup to locate table entries within HTML, returns all results as text
monster_stats_table = soup.find('table', class_='statBlock').text
# line below converts to dictionary, other program returns an error
new_mst =[monster_stats_table]
sh.append_row(new_mst) # currently appends all information to one cell, needs to be broken up

The information shows up in one cell, stretched out over dozens of rows, with a lot of extra whitespace. I have tried several methods to remove the whitespace and format the data correctly, but nothing seems to be working out. Showing Problem I am trying to have the table look something like this instead: Correct Table Thank you for any help or suggestions you can provide. :)

I have attempted to use .strip method, as well as both importing json library and (separately) importing ast library to use funtions suggested to remove the whitespace. Neither could return output as a result of the formatting of the raw data. I am thinking I need to find a way to write the data to a json object and then find a way to import that into the sheet, but I am not certain that is the best way, or how to do that.

score 1 · Answer 1 · answered Apr 18 '23 at 21:50

I don't use GSpread but here's something that should make your work with structured data easier, for example, you can easily import this to a GSpread sheet.

import pandas as pd
import requests

url = "https://www.d20srd.org/srd/monsters/achaierai.htm"
df = pd.read_html(requests.get(url).content.decode("utf-8"), flavor="bs4")
pd.concat(df).to_csv("achaierai.csv", index=False, header=False)

This outputs a .csv file that looks like this:

score 0 · Answer 2 · answered Apr 18 '23 at 22:25

If you call .text on the monster_stats_table, it's going to turn it into one big chunk of text, as you see in your "Showing Problem" example. What you probably want to do is get the headers and data cells of the table into arrays that you can slice up as needed.

Since it looks like you are trying to transpose the information into a header row and data row, the following might be the way to go:

monster_stats_table = soup.find('table', class_='statBlock')
table_header_elements = monster_stats_table.find_all('th')
table_data_elements = monster_stats_table.find_all('td')

Then you can convert these to arrays of strings to write to your google sheet. I'm assuming that .append_row takes an array of strings.

headers = [header_element.text for header_element in table_header_elements]
data = [data_element.text for data_element in table_data_elements]

sh.append_row(headers)
sh.append_row(data)

Hopefully that gets you unstuck. Feel free to make the variable names useful for you.

Python/Gspread : code to remove whitespace and format data

2 Answers2