
I am trying to get specific column data out of specific sheets in one big Google Sheet. For example, I have a list of sheets

Sheets = ['Sheet 1', 'Sheet 2', 'Sheet 3']

and from these sheets, I want to retrieve specific columns like

Column_headers = ['A', 'B']

What I am doing right now is getting the data like this:

import gspread
from gspread_dataframe import set_with_dataframe
import pandas as pd
pd.set_option("display.max_columns", None)
pd.set_option('display.max_rows', None)

sa = gspread.service_account(filename='file.json')
book = sa.open("book")
Sheets = ['Sheet 1', 'Sheet 2', 'Sheet 3']
Column_headers = ['A', 'B']

for i in Sheets:
    sheet_2022 = book.worksheet(i)
    records = sheet_2022.get_all_records()  # this is the line that raises
    data_2022 = zip(*(e for e in zip(*records) if e[0] in Column_headers))
    getdata_2022 = pd.DataFrame(data_2022, columns=Column_headers)
    print(getdata_2022)

I am getting the following error

GSpreadException: the given 'expected_headers' are not uniques

That is because the headers are not unique (obviously), which is exactly why I am trying to retrieve only specific columns. I also can't work out the part where I loop through Sheets to get the data from just those sheets. The end result should be two columns, 'A' and 'B', containing the data from all three sheets.


1 Answer


IIUC try doing the following:

import gspread
import gspread_dataframe as gd
import pandas as pd

# connect to the service account
sa = gspread.service_account(filename='file.json')

# open the Google Sheet
book = sa.open('book')

# The list of sheet names and columns that you want
Sheets = ['Sheet 1', 'Sheet 2', 'Sheet 3']
Column_headers = ['A', 'B']

# list comprehension using gd.get_as_dataframe 
# to create a list of DataFrames for each sheet
# then you concat all frames together into one DataFrame
df = pd.concat([gd.get_as_dataframe(book.worksheet(sheet))[Column_headers]
                for sheet in Sheets])

You may need to filter out NaNs if you have empty rows in your Google Sheet. Something like:

df = df[~df.isna().all(axis=1)]
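Equivalently, pandas has a built-in for this: dropna with how='all' drops only the rows where every value is NaN.

df = df.dropna(how='all')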

Update

Replace df = pd.concat([gd.get_as_dataframe(book.worksheet(sheet))[Column_headers] for sheet in Sheets]) with the following:

import numpy as np

dfs = []
for sheet in Sheets:
    df = gd.get_as_dataframe(book.worksheet(sheet))
    try:
        df = df[Column_headers]
    except KeyError:
        df = df[['A']]
        df['B'] = np.nan
    # the commented out code below will add the sheet name to each frame
    # df['Sheet_Name'] = sheet
    # you can also remove all nan rows here if you need
    # df = df[~df.isna().all(axis=1)]
    dfs.append(df)
    
final_df = pd.concat(dfs)
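
If you don't need the original per-sheet row indices, passing ignore_index=True to the final concat gives the combined frame a clean sequential index:

final_df = pd.concat(dfs, ignore_index=True)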
  • This is AMAZING! Thank you for that. There are some sheets where column B is not available; is there a way we could add null values in that column? Right now it throws an error if it doesn't find column B in Sheet 3. (I know it's a totally different question.) – hyeri Mar 14 '23 at 16:46
  • Sure, in that case we will need to do a traditional for-loop with a try, except. I will update the answer. – It_is_Chris Mar 14 '23 at 16:47
  • @hyeri see if the update works for your needs. – It_is_Chris Mar 14 '23 at 16:54
  • Again, thanks a lot for this. Just one issue I am facing: if I add column B here, df = df[['A','B']], since the column is present in some of the sheets, it throws an 'index not found' error; but if I take it out, it prints null values even for the sheets where that column is populated. – hyeri Mar 14 '23 at 17:18
  • I have tweaked the code a bit using reindexing: df = df.reindex(Column_headers, fill_value=np.nan, axis=1). Seems to be working for now. Many thanks for the help again. :) – hyeri Mar 14 '23 at 18:14
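
For reference, here is a sketch of the loop with the reindex tweak from that last comment folded in, assuming the same sa, book, Sheets, and Column_headers as above. reindex keeps the requested columns in order and fills any missing one with NaN instead of raising a KeyError, so the try/except is no longer needed:

import numpy as np
import pandas as pd
import gspread_dataframe as gd

dfs = []
for sheet in Sheets:
    df = gd.get_as_dataframe(book.worksheet(sheet))
    # keep only the requested columns, in order; a column missing from
    # this sheet (e.g. 'B' on Sheet 3) is created and filled with NaN
    df = df.reindex(Column_headers, fill_value=np.nan, axis=1)
    dfs.append(df)

final_df = pd.concat(dfs, ignore_index=True)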