I'm using JupyterLab and a Python 3 ipykernel. In pandas this is very simple:
df = pd.read_csv("gs://bucket/folder/file.csv")
However, in datatable I can't find a solution:
DT = dt.fread("gs://bucket/folder/file.csv")
ValueError: File /home/local_dir/gs:/bucket/folder/file.csv does not exist
If I define it explicitly as a URL (presumably via fread's url argument, shown below) I get:
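# e.g. forcing fread to treat the gs:// path as a URL
DT = dt.fread(url="gs://bucket/folder/file.csv")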
URLError: <urlopen error unknown url type: gs>
If I try the HTTPS URL rather than the gs:// path, I get a different error:
DT = dt.fread("https://storage.cloud.google.com/bucket/folder/file.csv")
IOError: Too few fields on line 1: expected 3 but found only 1 (with sep='|'). Set fill=True to ignore this error. <<<!doctype html><html lang="en-US" dir="ltr"><head><base href="https://accounts.google.com/v3/signin/">...>>
(the "file" fread ends up parsing is the HTML of a Google sign-in page, not the CSV)
If I try fill=True as the error suggests, it just crashes my kernel.
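For completeness, this is the call that brings the kernel down (same hypothetical bucket and path as above):

# adding fill=True as the error message suggests crashes the kernel
DT = dt.fread("https://storage.cloud.google.com/bucket/folder/file.csv", fill=True)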
While I can use dt.Frame(pd.read_csv(...)), the whole point of using fread is that it is almost 10x faster with local files.
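For reference, that workaround in full (note that pandas only understands gs:// paths when gcsfs is installed):

import datatable as dt
import pandas as pd

# pandas (via gcsfs) does the download and the parsing; datatable just wraps the result
DT = dt.Frame(pd.read_csv("gs://bucket/folder/file.csv"))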
I can also use:
command = f"gsutil -m cp -r gs://{gcs_path} {local_file}"
utils.subprocess(command.split(" "))
(or a storage.Client() / blob.download_to_filename() solution), but then I have to download the file first and only then read it locally.
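The client-library version of the same two-step workaround looks roughly like this (bucket name, object path, and local path are placeholders):

from google.cloud import storage
import datatable as dt

client = storage.Client()
blob = client.bucket("bucket").blob("folder/file.csv")
blob.download_to_filename("/tmp/file.csv")  # step 1: download to disk
DT = dt.fread("/tmp/file.csv")              # step 2: fread the local copy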
Is there any way to get fread to simply download the file itself, the way read_csv does?
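The closest thing I can find in fread's own signature is its cmd argument, which parses the stdout of a shell command; a sketch (assuming gsutil is installed and authenticated), though this still shells out rather than downloading natively:

import datatable as dt

# stream the object straight into fread via gsutil, avoiding a temp file
DT = dt.fread(cmd="gsutil cat gs://bucket/folder/file.csv")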