I'm trying to read a large (~850 MB) .csv file from a URL.

The thing is that the .csv file is within a .zip file that also contains a .pdf file, so when I try to read it in pandas:

df = pd.read_csv('link', encoding='latin1', sep=';')

It fails with this error:

ValueError: Multiple files found in ZIP file. Only one file per ZIP: ['LEIAME.pdf', 'perfil_eleitorado_2018.csv']

I'm working with a collaborative notebook, so the best solution would be just to open the .zip file directly from the link or to upload the .csv file somewhere that won't ask for permissions, log-ins, or anything like that to open it directly in the notebook.

Note: This is just one of the large .csv databases I'm working with; there are others of similar size, or even slightly bigger.

1 Answer

The pd.read_csv() function accepts a .zip file path or URL as its first argument, but only when the archive contains a single file. The posted zip file contains multiple files.

You can iterate over the entries in the zip file and read the CSV entry through a file-like object.

import pandas as pd
import zipfile

with zipfile.ZipFile("perfil_eleitorado_2018.zip", "r") as f:
    for name in f.namelist():
        # Skip the PDF and read only the CSV entry.
        if name.endswith('.csv'):
            with f.open(name) as zd:
                df = pd.read_csv(zd, encoding='latin1', sep=';')
            print(df)
            break
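
The loop above takes the first entry whose name ends in lowercase .csv. If an archive might contain several CSV files, or use an upper-case extension, a stricter selection could look like the sketch below. The helper name find_csv_member is hypothetical (it is not part of pandas or zipfile), and the "exactly one CSV" rule is an assumption about what you want.

```python
import zipfile


def find_csv_member(zf: zipfile.ZipFile) -> str:
    """Return the name of the only CSV entry in an open archive.

    Hypothetical helper: the match is case-insensitive, directories are
    skipped, and anything other than exactly one CSV raises ValueError.
    """
    names = [
        info.filename
        for info in zf.infolist()
        if not info.is_dir() and info.filename.lower().endswith(".csv")
    ]
    if len(names) != 1:
        raise ValueError(f"expected exactly one CSV entry, found {names}")
    return names[0]
```

The returned name can be passed to f.open(name) and then to pd.read_csv, exactly as in the loop above.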

If you want to read from the URL directly without first saving the file to disk, you can use the requests library.

import pandas as pd
import zipfile
from io import BytesIO
import requests

url = 'https://cdn.tse.jus.br/estatistica/sead/odsele/perfil_eleitorado/perfil_eleitorado_2018.zip'
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
# Wrap the downloaded bytes so zipfile can treat them as a file.
buf1 = BytesIO(r.content)
with zipfile.ZipFile(buf1, "r") as f:
    for name in f.namelist():
        if name.endswith('.csv'):
            with f.open(name) as zd:
                df = pd.read_csv(zd, encoding='latin1', sep=';')
            print(df)
            break

Output:

    DT_GERACAO HH_GERACAO ANO_ELEICAO ... QT_ELEITORES_DEFICIENCIA QT_ELEITORES_INC_NM_SOCIAL
0   12/04/2021   13:55:01        2018 ...                        1                          0
1   12/04/2021   13:55:01        2018 ...                        2                          0
2   12/04/2021   13:55:01        2018 ...                        4                          0
3   12/04/2021   13:55:01        2018 ...                        2                          0
4   12/04/2021   13:55:01        2018 ...                       25                          0
..         ...        ...         ... ...                      ...                        ...
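
One caveat with the URL version: BytesIO(r.content) keeps the whole archive in memory before pandas starts parsing. If that is a problem in your notebook, one option is to stream the download to a temporary file first. The sketch below swaps requests for the standard library's urllib.request; the helper name download_to_tempfile is hypothetical, and it assumes the notebook has a writable temp directory.

```python
import shutil
import tempfile
import urllib.request


def download_to_tempfile(url: str, suffix: str = ".zip") -> str:
    """Stream `url` to a temporary file on disk and return the file's path.

    The response is copied in fixed-size chunks, so the archive is never
    held in memory in full. The caller is responsible for deleting the file.
    """
    with urllib.request.urlopen(url) as response:
        # delete=False keeps the file around after the handle is closed.
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            shutil.copyfileobj(response, tmp)
            path = tmp.name
    return path
```

The returned path can be passed to zipfile.ZipFile in place of the local file name used in the first snippet of this answer. Note that this only avoids holding the zip in memory; the resulting DataFrame still has to fit.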
CodeMonkey