
I'm trying to get the code below to import multiple CSV files (ALLOWANCE1.csv and ALLOWANCE2.csv) from a Google Cloud Storage bucket into Datalab in Python 2.x:

import numpy as np
import pandas as pd
from google.datalab import Context
import google.datalab.bigquery as bq
import google.datalab.storage as storage
from io import BytesIO

myBucket = storage.Bucket('Bucket Name')
object_list = myBucket.objects(prefix='ALLOWANCE')

df_list = []
for obj in object_list:
  %gcs read --object $obj.uri --variable data  
  df_list.append(pd.read_csv(BytesIO(data)))

concatenated_df = pd.concat(df_list, ignore_index=True)
concatenated_df.head()

I'm getting the following error right at the beginning of the for loop:

RequestExceptionTraceback (most recent call last)
<ipython-input-5-3188aab389b8> in <module>()
----> 1 for obj in object_list:
      2   get_ipython().magic(u'gcs read --object $obj.uri --variable data')
      3   df_list.append(pd.read_csv(BytesIO(data)))

/usr/local/envs/py2env/lib/python2.7/site-packages/google/datalab/utils/_iterator.pyc in __iter__(self)
     34     """Provides iterator functionality."""
     35     while self._first_page or (self._page_token is not None):
---> 36       items, next_page_token = self._retriever(self._page_token, self._count)
     37
     38       self._page_token = next_page_token

/usr/local/envs/py2env/lib/python2.7/site-packages/google/datalab/storage/_object.pyc in _retrieve_objects(self, page_token, _)
    319                                          page_token=page_token)
    320     except Exception as e:
--> 321       raise e
    322
    323     objects = list_info.get('items', [])

RequestException: HTTP request failed: Not Found

I have spent some time trying to resolve this issue, but no luck! Any help would be greatly appreciated!


1 Answer


I don't think you can mix notebook shell commands (magics) with Python variables. Perhaps try the subprocess Python library instead and invoke the command-line tools from Python.

import numpy as np
import pandas as pd
from google.datalab import Context
import google.datalab.bigquery as bq
import google.datalab.storage as storage
from io import BytesIO

#new line
from subprocess import call  

from google.colab import auth  #new lines
auth.authenticate_user()


myBucket = storage.Bucket('Bucket Name')
object_list = myBucket.objects(prefix='ALLOWANCE')

df_list = []
for obj in object_list:

    call(['gsutil', 'cp', obj.uri, '/tmp/']) #first copy file
    filename = obj.uri.split('/')[-1] #get file name
    df_list.append(pd.read_csv('/tmp/' + filename))

concatenated_df = pd.concat(df_list, ignore_index=True)
concatenated_df.head()

Note that I did not run this exact code, but I have run "call" with my own files successfully. Another suggestion is to first run all the file-copy calls in one loop before reading them, as in the sketch below. That way, if you iterate a lot on your data, you're not re-downloading the files each time.
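A minimal sketch of that two-loop variant, reusing the same placeholder bucket name, prefix, and /tmp/ staging directory as above (untested, so treat it as a starting point):

from subprocess import call
import pandas as pd
import google.datalab.storage as storage

myBucket = storage.Bucket('Bucket Name')  # placeholder bucket name, as above
object_list = list(myBucket.objects(prefix='ALLOWANCE'))

# First loop: copy every matching object down to /tmp/ once.
for obj in object_list:
    call(['gsutil', 'cp', obj.uri, '/tmp/'])

# Second loop: read the local copies; re-running this loop no longer hits GCS.
df_list = []
for obj in object_list:
    filename = obj.uri.split('/')[-1]
    df_list.append(pd.read_csv('/tmp/' + filename))

concatenated_df = pd.concat(df_list, ignore_index=True)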

user1269942
  • Thank you for your support, but the new code does not work either. I received almost the same error message: – sguarny Jan 07 '19 at 12:31
  • Hi @sguarny, I added 2 lines of authentication code. I presumed you had that, but perhaps you didn't. When you run the auth code, it will take you through a Google auth flow and ultimately give you a key that you copy-paste into the notebook, then continue. – user1269942 Jan 07 '19 at 17:59