Is it possible in Google Datalab to read pickle/joblib models from Google Storage using %%storage clause?

This question relates to "Is text the only content type for %%storage magic function in datalab".

Evgeny Minkevich

1 Answer

Run the following code in an otherwise empty cell:

%%storage read --object <path-to-gcs-bucket>/my_pickle_file.pkl --variable test_pickle_var

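Then run the following code in a separate cell:

import pickle
from io import BytesIO

# test_pickle_var holds the object's contents, so wrap it in a file-like object for pickle.load
pickle.load(BytesIO(test_pickle_var))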
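
To see why the BytesIO wrapper is needed without touching GCS, here is a minimal local sketch. It assumes, as the usage above implies, that the variable filled by the storage magic holds the raw bytes of the object. The name `fake_gcs_bytes` is a hypothetical stand-in for `test_pickle_var` and is not part of Datalab.

```python
import pickle
from io import BytesIO

# Hypothetical stand-in for the bytes that the storage magic stores in test_pickle_var.
original = {"a": [1, 4], "b": [2, 5], "c": [3, 6]}
fake_gcs_bytes = pickle.dumps(original)

# pickle.load expects a binary file-like object, so wrap the bytes in BytesIO.
loaded = pickle.load(BytesIO(fake_gcs_bytes))

# pickle.loads takes the bytes directly and should give the same object.
loaded_direct = pickle.loads(fake_gcs_bytes)

print(loaded == original, loaded_direct == original)
```

Either form works on in-memory bytes; BytesIO is mainly useful when an API insists on a file-like object.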

I used the code below to upload a pandas DataFrame to Google Cloud Storage as a pickled file and read it back:

from datalab.context import Context
import datalab.storage as storage
import pandas as pd
from io import BytesIO
import pickle

df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

# Create a local pickle file
df.to_pickle('my_pickle_file.pkl')

# Create a bucket in GCS
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket = storage.Bucket(sample_bucket_name)
if not sample_bucket.exists():
    sample_bucket.create()

# Write pickle to GCS
sample_item = sample_bucket.item('my_pickle_file.pkl')
with open('my_pickle_file.pkl', 'rb') as f:
    sample_item.write_to(bytearray(f.read()), 'application/octet-stream')

# Read Method 1 - Read pickle from GCS using %storage read (note single % for line magic)
path_to_pickle_in_gcs = sample_bucket_path + '/my_pickle_file.pkl'
%storage read --object $path_to_pickle_in_gcs --variable remote_pickle_1
df_method1 = pickle.load(BytesIO(remote_pickle_1))
print(df_method1)

# Read Alternate Method 2 - Read pickle from GCS using storage.Bucket.item().read_from()
remote_pickle_2 = sample_bucket.item('my_pickle_file.pkl').read_from()
df_method2 = pickle.load(BytesIO(remote_pickle_2))
print(df_method2)
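
If you would rather not create a local pickle file before uploading, the bytes can probably be built in memory instead. The following is only a sketch: `pickle_to_bytes` is a hypothetical helper, and the hand-off to `write_to` appears only in a comment because it needs a real bucket and has not been run here.

```python
import pickle
from io import BytesIO


def pickle_to_bytes(obj):
    """Serialize obj to pickle bytes in memory (hypothetical helper, not a Datalab API)."""
    buffer = BytesIO()
    pickle.dump(obj, buffer, protocol=pickle.HIGHEST_PROTOCOL)
    return buffer.getvalue()


# Plain Python data standing in for the DataFrame used above.
records = [{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}]
payload = pickle_to_bytes(records)

# Assumption: in Datalab this payload could replace bytearray(f.read()), e.g.
#   sample_item.write_to(bytearray(payload), 'application/octet-stream')

# Round-trip check, mirroring how the pickle is read back from GCS.
restored = pickle.load(BytesIO(payload))
print(restored == records)
```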

Note: There is a known issue where the %storage line magic does not work if it is the first line in a cell. As a workaround, put a comment or some Python code on the first line.

Anthonios Partheniou
  • Thank you. I have tried using %%storage with pickle load. Somehow it did not work for me. Did it work for you? The alternative is good too, a valid workaround. – Evgeny Minkevich Sep 25 '16 at 23:24
  • I am not sure the issue is with the pickle itself. When I read from the bucket through the Python API, everything works, though there I am using BytesIO. When I try the storage clause, nothing happens. – Evgeny Minkevich Sep 26 '16 at 10:01
  • Could you try the sample code provided (StringIO) to confirm that it works on your end? Please share a code snippet that doesn't perform as expected to help with troubleshooting. – Anthonios Partheniou Sep 26 '16 at 10:38
  • `%%storage read --object scg-dataset-tf/set1_clean.pickle --variable test_pickle_var` followed by `pickle.load(StringIO(test_pickle_var))` – Evgeny Minkevich Sep 26 '16 at 11:15
  • This is pretty much the code I am running. No errors, no "Running". – Evgeny Minkevich Sep 26 '16 at 11:16
  • If no error appears in the UI, you can also check the console for errors. I was able to reproduce an [issue](https://github.com/googledatalab/pydatalab/issues/71) where `%storage` does not work if it is the first line in a cell. Try using `%storage` in its own cell or put something else (a comment or a new line) as the first line. Please mark this answer as accepted if it has solved your issue. – Anthonios Partheniou Sep 26 '16 at 11:48
  • Ha. That was it. Now I get the errors back. Thank you. (BTW, how do I specify bucket name?) – Evgeny Minkevich Sep 26 '16 at 11:53
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/124205/discussion-between-anthonios-partheniou-and-evgeny-minkevich). – Anthonios Partheniou Sep 26 '16 at 11:56
  • All good. Problem solved. Should not be the first entry. Here is the working code – Evgeny Minkevich Sep 26 '16 at 11:59
  • `import io`, then `%storage read --object gs://scg-dataset-tf/dataset_clean.pickle --variable test_pickle_var`, then `pickle.load(io.BytesIO(test_pickle_var))` – Evgeny Minkevich Sep 26 '16 at 12:00