
I have the following code to read a gzipped CSV file from bytes. It works with pandas.read_csv; however, it fails with dask (dd.read_csv).

The URL in d['urls'][0] is a link to a file on Amazon S3 provided by a third-party service.

import io
import requests
import pandas as pd
import dask.dataframe as dd

# Download the gzipped CSV into an in-memory buffer
output = io.BytesIO()
output.name = "chunk_1.csv.gz"
with requests.get(d['urls'][0], stream=True) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None):
        if chunk:
            output.write(chunk)
output.seek(0)

dd.read_csv(output, compression='gzip', blocksize=None)  # Doesn't work

pd.read_csv(output, compression='gzip')  # WORKS

Traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-39441d60668b> in <module>
     13 output.seek(0)
     14 
---> 15 dd.read_csv(output, compression='gzip', blocksize=None) #Doesn't work
     16 
     17 pd.read

~/opt/anaconda3/lib/python3.8/site-packages/dask/dataframe/io/csv.py in read(urlpath, blocksize, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    698         **kwargs,
    699     ):
--> 700         return read_pandas(
    701             reader,
    702             urlpath,

~/opt/anaconda3/lib/python3.8/site-packages/dask/dataframe/io/csv.py in read_pandas(reader, urlpath, blocksize, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    533         sample = blocksize
    534     b_lineterminator = lineterminator.encode()
--> 535     b_out = read_bytes(
    536         urlpath,
    537         delimiter=b_lineterminator,

~/opt/anaconda3/lib/python3.8/site-packages/dask/bytes/core.py in read_bytes(urlpath, delimiter, not_zero, blocksize, sample, compression, include_path, **kwargs)
     93     """
     94     if not isinstance(urlpath, (str, list, tuple, os.PathLike)):
---> 95         raise TypeError("Path should be a string, os.PathLike, list or tuple")
     96 
     97     fs, fs_token, paths = get_fs_token_paths(urlpath, mode="rb", storage_options=kwargs)

TypeError: Path should be a string, os.PathLike, list or tuple

The URL I'm trying to get the file from looks like https://user-ad-revenue.s3.amazonaws.com/data/XXXX/uar/tables/mediation/XXXX%3Dv3/publisher_id%XXXXX/application_id%XXXXX/day%3D2020-12-27/report.csv.gz?AWSAccessKeyId=XXXXX&Expires=1609150335&Signature=XXXXX

Reading from HTTP with dask via dd.read_csv(d['urls'][0], compression='gzip', blocksize=None) raises BadGzipFile: Not a gzipped file (b'<?'); however, it works with pd.read_csv.

1 Answer

According to dask's documentation on read_csv, the first parameter must be a string or a list:

urlpath : string or list
Absolute or relative filepath(s).

Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.

This is reflected in the traceback:

TypeError: Path should be a string, os.PathLike, list or tuple

Note that this is different from pandas' read_csv:

filepath_or_buffer : str, path object or file-like object
Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv.

If you want to pass in a path object, pandas accepts any os.PathLike.

By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.
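Since dask only accepts a path, one possible workaround (a minimal sketch, reusing the output buffer from the question; the local filename is arbitrary) is to persist the downloaded bytes to disk and pass that path string to dd.read_csv:

import dask.dataframe as dd

# Write the in-memory buffer to a local gzipped file so that dask
# receives a plain path string instead of a file-like object.
with open("chunk_1.csv.gz", "wb") as f:
    f.write(output.getbuffer())

ddf = dd.read_csv("chunk_1.csv.gz", compression='gzip', blocksize=None)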

However, you can read the file directly with dask; the following remote data stores are supported:

  • Local or Network File System: file:// - the local file system, default in the absence of any protocol.

  • Hadoop File System: hdfs:// - Hadoop Distributed File System, for resilient, replicated files within a cluster. This uses PyArrow as the backend.

  • Amazon S3: s3:// - Amazon S3 remote binary store, often used with Amazon EC2, using the library s3fs.

  • Google Cloud Storage: gcs:// or gs:// - Google Cloud Storage, typically used with Google Compute resource using gcsfs.

  • Microsoft Azure Storage: adl://, abfs:// or az:// - Microsoft Azure Storage using adlfs.

  • HTTP(s): http:// or https:// for reading data directly from HTTP web servers.

See more in the remote data section of dask's docs.
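For illustration, here is a minimal sketch of reading straight from S3, with s3fs credentials passed via storage_options; the bucket path and keys below are placeholders, and this assumes you hold credentials for the bucket (which is not the case for a pre-signed third-party URL):

import dask.dataframe as dd

# Placeholder bucket/key and credentials; storage_options is forwarded
# to s3fs by dask when an s3:// path is used.
ddf = dd.read_csv(
    "s3://some-bucket/path/to/report.csv.gz",
    compression='gzip',
    blocksize=None,
    storage_options={"key": "YOUR_ACCESS_KEY", "secret": "YOUR_SECRET_KEY"},
)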

  • I cannot read directly from the filesystem or S3 in this case. The two available options are a buffer or HTTP. If I read from the URL directly with dd.read_csv(d['urls'][0], compression='gzip', blocksize=None) I get the error BadGzipFile: Not a gzipped file (b''). However, reading directly from the URL works perfectly with pandas – Porada Kev Dec 28 '20 at 10:51
  • @PoradaKev I see, would it be ok to read with pandas and transform it into a dask dataframe using something like [from_pandas](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.from_pandas)? – Dani Mesejo Dec 28 '20 at 10:54
  • 1
    Not completely, it takes more time to read file this way. The csv has more than 20 million rows, previously I used only pandas for this case but now I'm trying to rewrite the code with dask – Porada Kev Dec 28 '20 at 10:56
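For completeness, a minimal sketch of the from_pandas route suggested in the comments, reusing the output buffer from the question; npartitions=10 is an arbitrary choice:

import pandas as pd
import dask.dataframe as dd

# Read the gzipped buffer with pandas (which accepts file-like objects),
# then convert the result into a dask dataframe.
pdf = pd.read_csv(output, compression='gzip')
ddf = dd.from_pandas(pdf, npartitions=10)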