
I am working on a project that requires me to obtain three datasets from Google Sheets using the requests library (I do the processing afterwards). The problem is that when I send a GET request to the URL and call ".text", ".json()", or ".content" on the response, I get the structured markup of the entire spreadsheet, but what I want are the row and column values. Any ideas?

Here are the spreadsheets:

https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit#gid=1691373423

https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit

https://docs.google.com/spreadsheets/d/1udwn61l_FZsFsEuU8CMVkvU2SpwPW3Krt1OML3cYMYk/edit

  • Try looking at [How to access google sheet's data using python requests module](https://stackoverflow.com/questions/52365907/how-to-access-google-sheets-data-using-python-requests-module). – Alias Cartellano Oct 13 '22 at 15:59
  • Thanks for the answer! In the code you posted, it says I only need my "access token" — is that the same as the key I generated in the service account of my Google Cloud project? (It would be "a9abae51cf0e12bf84a21208d5ea2f157a8a9d2b".) If not, where can I get it? Sorry, I'm new to this; if you help me I'd really appreciate it! – Cacho Volpa Oct 13 '22 at 19:12
  • Hi Cacho, this is "Stack Overflow", not "Stack Overflow en Español", so please keep your comments in English. Also, you may want to share the spreadsheets in read-only mode so that people who click on those links aren't able to modify the data. – Jabro Jan 20 '23 at 11:20

1 Answer


The best way to get the data from a Google spreadsheet in Python is to use gspread, a Python API for Google Sheets.

However, there are alternatives if you aren't the owner of the spreadsheet (or you simply want to do it another way as an exercise). For instance, you can do it with the requests and bs4 modules, as shown in this answer.

Applied to your specific case, the code would look like this ("Datos Argentina - Salas de Cine" spreadsheet):

import typing
import requests
from bs4 import BeautifulSoup

def scrapeDataFromSpreadsheet() -> typing.List[typing.List[str]]:
    # Fetch the rendered spreadsheet page and parse its first HTML table
    html = requests.get('https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit#gid=1691373423').text
    soup = BeautifulSoup(html, 'lxml')
    salas_cine = soup.find_all('table')[0]
    # Each <tr> becomes a list of its <td> cell texts
    rows = [[td.text for td in row.find_all('td')] for row in salas_cine.find_all('tr')]
    return rows

Important note: with the link provided (and the code above) you will only be able to get the first 100 rows of data!

This can be fixed in more than one way. What I've tried is modifying the url of the spreadsheet to display the data as a simple html table (reference).

Old url: https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit#gid=1691373423

New url (remove edit#gid=1691373423 and append gviz/tq?tqx=out:html&tq&gid=1): https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/gviz/tq?tqx=out:html&tq&gid=1
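The URL rewrite above can be sketched as a small helper (the function name is mine, not part of the original answer): it drops everything from /edit onwards and appends the gviz HTML-export endpoint.

```python
import re

def to_gviz_html_url(edit_url: str, gid: str = "0") -> str:
    """Rewrite a Google Sheets 'edit' URL into the gviz HTML-export URL.

    Illustrative helper: strips the '/edit...' suffix (including any
    '#gid=...' fragment) and appends the gviz endpoint described above.
    """
    base = re.sub(r"/edit.*$", "", edit_url)
    return f"{base}/gviz/tq?tqx=out:html&tq&gid={gid}"
```

Note that the gid in the fragment of the edit URL and the gid query parameter of the gviz URL identify the worksheet tab, so pass the one you actually want.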

Now you are able to obtain all the rows that the spreadsheet contains:

def scrapeDataFromSpreadsheet() -> typing.List[typing.List[str]]:
    # Same parsing as before, but against the gviz HTML export,
    # which is not limited to the first 100 rows
    html = requests.get('https://docs.google.com/spreadsheets/u/0/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/gviz/tq?tqx=out:html&tq&gid=1').text
    soup = BeautifulSoup(html, 'lxml')
    salas_cine = soup.find_all('table')[0]
    rows = [[td.text for td in row.find_all('td')] for row in salas_cine.find_all('tr')]
    return rows
Jabro