Reading Google Cloud Storage files directly in Python using datatable's fread

Question

I'm using JupyterLab with a Python 3 ipykernel. In pandas this is very simple:

df = pd.read_csv("gs://bucket/folder/file.csv")

However, in datatable I can't find a solution:

DT = dt.fread("gs://bucket/folder/file.csv")
ValueError: File /home/local_dir/gs:/bucket/folder/file.csv does not exist

If I define it as a URL, I get:

URLError: <urlopen error unknown url type: gs>

If I try the https URL rather than the gs:// one, I get a different error:

DT = dt.fread("https://storage.cloud.google.com/bucket/folder/file.csv")
IOError: Too few fields on line 1: expected 3 but found only 1 (with sep='|'). Set fill=True to ignore this error.  <<<!doctype html><html lang="en-US" dir="ltr"><head><base href="https://accounts.google.com/v3/signin/"><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1"><style data-href="https://www.gstatic.com/_/mss/boq-identity/_/ss/k=boq-identity.AccountsSignInUi.pk4dcKROE.L.X.O/am=ZwAABSAAAAEBAAAAAAAAAAAAkBAB/d=1/ed=1/rs=AOaE86RIJUsEN1jn3aWSXMvWJdvUbA/m=identifierview,_b,_tp,_r" nonce="grz6yrh5Bum1lTsurOlg">c-wiz{contain:style}c-wiz>c-data{d...>>

If I set fill=True as the error suggests, it just crashes my kernel.

Whilst I can use dt.Frame(pd.read_csv()), the point of using fread is that it is almost 10x faster with local files.

I can also use:

command = f"gsutil -m cp -r gs://{gcs_path} {local_file}"
utils.subprocess(command.split(" "))

(or a storage.Client(), blob.download_to_filename() solution), but then I have to download the file then read it locally.
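
For reference, a minimal sketch of the storage.Client() route that keeps the bytes in memory rather than writing a local file. It assumes the google-cloud-storage package is installed and default credentials are configured. The helper names split_gcs_uri and fread_gcs_in_memory are hypothetical, and the whole object is still downloaded before fread parses it, so this only removes the temporary file, not the download step.

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/file.csv' into (bucket, object_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket, _, object_name = uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"missing bucket or object name: {uri!r}")
    return bucket, object_name


def fread_gcs_in_memory(uri, **fread_kwargs):
    """Hypothetical helper: download a GCS object and parse it with fread."""
    # Imported lazily so split_gcs_uri works without these packages installed.
    import datatable as dt
    from google.cloud import storage

    bucket_name, object_name = split_gcs_uri(uri)
    client = storage.Client()  # assumes application default credentials
    blob = client.bucket(bucket_name).blob(object_name)
    data = blob.download_as_bytes()  # the whole object is held in memory
    # fread's `text` argument is documented as accepting str or bytes.
    return dt.fread(text=data, **fread_kwargs)
```

Usage would look like DT = fread_gcs_in_memory("gs://bucket/folder/file.csv").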

Is there any way to get fread to simply download the file like read_csv does?

Is it possible to read it using urllib? Under the hood, fread uses urllib for URLs. Have a look at this [issue](https://github.com/h2oai/datatable/issues/3302) for S3; maybe raise an issue on datatable's GitHub page — sammywemmy, Oct 20 '22 at 09:59
Also, if you can read it via the command line, you can pass that command-line statement to fread via the `cmd` argument — sammywemmy, Oct 20 '22 at 09:59
pandas makes it convenient; under the hood I think gcsfs is used - I might be wrong — sammywemmy, Oct 20 '22 at 10:00
@sammywemmy Thanks, I'll have a look. Is there an isin function yet too? Saw your name on that question today haha — Olivia, Oct 20 '22 at 10:07
Not yet. I should work on that sometime, if nobody else picks it up. — sammywemmy, Oct 20 '22 at 10:45
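
The `cmd` suggestion above can be sketched roughly as follows. This assumes gsutil is installed, authenticated, and on the PATH of the Jupyter kernel's shell. The function names gsutil_cat_command and fread_via_gsutil are hypothetical. `gsutil cat` writes the object to stdout, and fread's `cmd` argument runs a shell command and parses its output, so no explicit local copy is needed in user code.

```python
import shlex


def gsutil_cat_command(uri):
    """Build a shell command that streams a GCS object to stdout."""
    # shlex.quote guards against spaces or shell metacharacters in the path.
    return "gsutil cat " + shlex.quote(uri)


def fread_via_gsutil(uri, **fread_kwargs):
    """Hypothetical helper: let fread read the output of `gsutil cat`."""
    # Imported lazily so the command builder works without datatable installed.
    import datatable as dt

    return dt.fread(cmd=gsutil_cat_command(uri), **fread_kwargs)
```

Usage would look like DT = fread_via_gsutil("gs://bucket/folder/file.csv"). Whether this beats dt.Frame(pd.read_csv(...)) on speed has not been measured here.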
