
I used to access data from a CSV file in my local directory using Jupyter Notebook; now I want to access a CSV file stored in Google Cloud Storage via Datalab. This is the relevant part of the function as I used to run it:

import csv

def function1(file_name):
    new_file = open("file_name.csv", "w")  # output file for the computed values
    new_file.write("variable" + '\n')      # header row
    with open(file_name, "r") as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        for row in csv_reader:
            values_in_column1 = int(row[0])
            variable = values_in_column1 * 0.6 / 5

How can I change this function so that it works with CSV files stored in Google Cloud Storage and in Datalab?

Datalab lets me load the data of a CSV file into a single variable, but I don't want to load all the data into one variable. I want to load the values from each column into a different variable.

%%gcs read --object gs://bucket-name/file_name.csv --variable variable_name

Does anyone recommend using dictionaries or lists? Or is there an easier way to do this?
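As a rough illustration of the dictionary/list idea, the bytes loaded by the %%gcs read cell above could be parsed with the csv module. This is only a sketch; `variable_name`, `column1` and `variable` follow the names used in the question and are placeholders:

    import csv
    from io import StringIO

    # `variable_name` holds the raw bytes loaded by the %%gcs read cell above
    rows = list(csv.reader(StringIO(variable_name.decode('utf-8')), delimiter=','))

    # One list per column; here column 0 feeds the same calculation as in the question
    column1 = [int(r[0]) for r in rows]
    variable = [v * 0.6 / 5 for v in column1]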

I have also tried using storage from google.cloud, but I can't import it, even though I have updated google-cloud-storage via my terminal.

ImportErrorTraceback (most recent call last)
<ipython-input-6-943e66fe7e46> in <module>()
----> 1 from google.cloud import storage
      2 
      3 storage_client = storage.Client()
      4 bucket = storage_client.get_bucket(bucket_name)
      5 blob = bucket.blob(source_blob_name)

ImportError: cannot import name storage
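For reference, once the google-cloud-storage package is available in the kernel (for example after running `!pip install --upgrade google-cloud-storage` and restarting the kernel), the snippet from the traceback would typically continue roughly as below. `bucket_name` and `source_blob_name` are the placeholders already used in the traceback:

    from google.cloud import storage
    from io import BytesIO
    import pandas as pd

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)   # e.g. 'bucket-name'
    blob = bucket.blob(source_blob_name)               # e.g. 'file_name.csv'

    # Download the object into memory and hand it to pandas
    df = pd.read_csv(BytesIO(blob.download_as_string()))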
Kiki

2 Answers

  1. I created a notebook instance (link).

  2. I copied a csv file to Google Cloud Storage.

        gsutil cp file.csv gs://my-bucket/
    
  3. Then using pandas:

        import pandas as pd

        df = pd.read_csv('gs://my-bucket/file.csv')

        df
        # cdatetime   address             district  beat  grid  crimedescr                     ucr_ncic_code  latitude   longitude
        # 0 1/1/06 0:00 3108 OCCIDENTAL DR  3        3C    1115  10851(A)VC TAKE VEH W/O OWNER  2404           38.550420  -121.391416
        # 1 1/1/06 0:00 2082 EXPEDITION WAY 5        5A    1512  459 PC BURGLARY RESIDENCE      2204           38.473501  -121.490186

        # You can now access the columns of the dataframe
        df['district']
        # 0      3
        # 1      5
        # 2      2
        # 3      6
        # 4      2

        df['variable'] = df['district'] * 0.6 / 5

        df['variable']
        # 0      0.36
        # 1      0.60
        # 2      0.24
        # 3      0.72
        # 4      0.24
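If the goal is also to produce an output CSV, as in the original function, the computed column can be written back the same way. This is only a sketch: the output path is a placeholder, and reading or writing gs:// paths from pandas relies on the gcsfs package being installed:

    # Write just the computed column back to Cloud Storage (placeholder path)
    df[['variable']].to_csv('gs://my-bucket/output.csv', index=False)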
    
marian.vladoi

I started off by creating a Datalab instance and establishing a connection to localhost through port 8081. I would recommend looking into this link to better understand Datalab's functionality and data-processing capabilities: https://cloud.google.com/datalab/docs/quickstart

I’ve tried this script in Datalab and it worked just fine for me. I managed to read my sample data from one of the objects in my bucket into a dataframe:

import google.datalab.storage as storage
import pandas as pd
from io import BytesIO

# Point at the bucket and the CSV object inside it
mybucket = storage.Bucket('my-test-bucket-1-2-3-4')
data_csv = mybucket.object('test1.csv')

# Read the object's contents into the `data` variable (as bytes)
uri = data_csv.uri
%gcs read --object $uri --variable data

# Parse the bytes into a dataframe
df = pd.read_csv(BytesIO(data))
df.head()
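From that dataframe, each column can then be pulled into its own variable, which is what the question asks about. The column names below are only hypothetical examples:

# Hypothetical column names, adjust to the header of test1.csv
values_in_column1 = df['column1'].tolist()
values_in_column2 = df['column2'].tolist()

variable = [v * 0.6 / 5 for v in values_in_column1]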

See also: How to read data from Google storage cloud to Google cloud datalab

I see that you are also attempting to perform row operations on your data. I would suggest you use pandas.DataFrame.apply to perform such operations. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
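For example, here is a minimal sketch of DataFrame.apply for the calculation in the question, assuming the first CSV column holds the integer values as in the original function:

# Apply the question's formula row by row; the first column is assumed to hold the integers
df['variable'] = df.apply(lambda row: int(row.iloc[0]) * 0.6 / 5, axis=1)

# The same result can also be computed as a vectorized column operation:
# df['variable'] = df[df.columns[0]].astype(int) * 0.6 / 5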

Jan L