3

I am trying to read a CSV file saved in Google Cloud Storage (GCS) into a pandas DataFrame for analysis.

I have followed these steps without success:

mybucket = storage.Bucket('bucket-name')
data_csv = mybucket.object('data.csv')
df = pd.read_csv(data_csv)

This doesn't work because data_csv is not a path, which is what pd.read_csv expects. I also tried:

%%gcs read --object $data_csv --variable data
#result: %gcs: error: unrecognized arguments: Cloud Storage Object gs://path/to/file.csv

How can I read my file into a DataFrame for analysis?

Thanks

Marcin
irkinosor

3 Answers

3

%%gcs read stores the object's contents in the target variable as a bytes object. To read it with pandas, wrap it in BytesIO from the io module (Python 3):

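from io import BytesIO
import pandas as pd

mybucket = storage.Bucket('bucket-name')
data_csv = mybucket.object('data.csv')

%%gcs read --object $data_csv --variable data

# `data` holds the bytes read by %%gcs; wrap it so pandas can treat it as a file
df = pd.read_csv(BytesIO(data), sep=';')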

If your CSV file is comma-separated, there is no need to specify sep=',' because that is the default. Read more about the io library here: Core tools for working with streams
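As a minimal sketch of the BytesIO step on its own, the snippet below uses a hard-coded bytes literal as a hypothetical stand-in for whatever `%%gcs read --variable data` stores; the column names and values are made up for illustration, and nothing here touches GCS.

```python
from io import BytesIO

import pandas as pd

# Hypothetical stand-in for the bytes that `%%gcs read --variable data` stores.
data = b"name;score\nalice;1\nbob;2\n"

# BytesIO wraps the bytes in a file-like object that pd.read_csv can consume.
# sep=';' is needed here only because the sample is semicolon-separated.
df = pd.read_csv(BytesIO(data), sep=";")

print(df.shape)
print(list(df.columns))
```

For a comma-separated payload the `sep` argument could simply be dropped.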

irkinosor
  • In DataLab, it seems there can be only one %% command per cell – Marcin Aug 16 '18 at 09:43
  • I am trying to get a filename as an input and reading from the bucket. When I do this, the data lab read only first file present in the bucket. Actually, I need to read multiple files. – Madhi Dec 13 '18 at 10:07
1

You just need to use the object's uri property to get the actual path:

uri = data_csv.uri
%%gcs read --object $uri --variable data

The first part of your code doesn't work because pd.read_csv expects a file path or a file-like object, but data_csv is a storage object that refers to a file in a GCS bucket in the cloud, not a path pandas can open.
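To illustrate why passing the object itself fails, here is a small sketch that assumes a made-up `FakeStorageObject` class in place of the real Datalab object (its attributes are hypothetical, not the Datalab API). It shows pandas rejecting an arbitrary object and accepting the same bytes once they are wrapped in a file-like BytesIO.

```python
from io import BytesIO

import pandas as pd


class FakeStorageObject:
    """Hypothetical stand-in for a bucket object; not the Datalab API."""

    def __init__(self, uri, payload):
        self.uri = uri          # e.g. "gs://bucket-name/data.csv"
        self.payload = payload  # raw bytes of the remote file


obj = FakeStorageObject("gs://bucket-name/data.csv", b"a,b\n1,2\n3,4\n")

# Passing the wrapper object itself fails: it is neither a path nor file-like.
try:
    pd.read_csv(obj)
    rejected = False
except Exception:
    rejected = True

# Passing the bytes wrapped in BytesIO works, because BytesIO is file-like.
df = pd.read_csv(BytesIO(obj.payload))

print(rejected)
print(df.shape)
```

The exact exception type raised for the rejected object varies between pandas versions, which is why the sketch catches a generic Exception.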

yelsayed
  • Can you provide the full code to read data with pandas? I am still getting an error when I run df = pd.read_csv(data): "OSError: Expected file path name or file-like object, got type ". Thanks – irkinosor Sep 03 '17 at 08:10
0

This is what works for me:

df = pd.read_csv(BytesIO(data), encoding='unicode_escape')
Rami Alloush