
I have ~40GB of data split across several JSON files stored in Google Cloud Storage. I want to read all of this data into a dataframe in Datalab to perform some analysis.

So far, I have been reading the data as shown in the Datalab tutorials, but it takes 1-2 hours. Any suggestions for reading it more efficiently?

My code looks like this:

import datalab.storage as storage
import pandas as pd
from io import StringIO

def export_data(log_name, path):
    # Download one JSON object from the bucket as a string.
    log_path = path + log_name + '.json'
    data = storage.Item('my-bucket', log_path).read_from()
    return data

file_names = ['file_01', .., 'file_60']
path = 'my-bucket/path'

# Download each file, parse it, and concatenate everything into one dataframe.
dataset = [export_data(file_name, path) for file_name in file_names]
data_frames = [pd.read_json(StringIO(data)) for data in dataset]
df = pd.concat(data_frames, axis=0)
Ric cez

2 Answers


It might be more efficient to download the files locally first by running "gsutil -m cp". Datalab maps your host volume to "/content", so anything saved under "/content" will be persisted. Then load the local files into your data frames.

storage.Item's read_from() calls the storage API to download a single object at a time, whereas "gsutil -m" downloads several objects in parallel.

At the very least, downloading the files first splits the work into separate download and load stages, so you'll have a better idea of which part is slow.
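For illustration, here is a minimal sketch of that two-stage approach (the gs://my-bucket/path prefix comes from the question; /content/logs is just a hypothetical local directory):

# Stage 1: download everything in parallel into the persisted /content volume
# (run these as shell commands in a Datalab cell).
!mkdir -p /content/logs
!gsutil -m cp gs://my-bucket/path/*.json /content/logs/

# Stage 2: load the local copies and combine them into one dataframe.
import glob
import pandas as pd

local_files = glob.glob('/content/logs/*.json')
data_frames = [pd.read_json(f) for f in local_files]
df = pd.concat(data_frames, axis=0, ignore_index=True)

Timing the two stages separately should make it clear whether the bottleneck is the download or the JSON parsing.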

Bradley Jiang

I would consider loading the data into BigQuery and then querying against that.
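As a rough sketch of that approach (assuming the files are newline-delimited JSON and that a BigQuery dataset, here called my_dataset, already exists; the table and project names are placeholders):

# Load all the JSON files from the bucket into a BigQuery table
# (shell command in a Datalab cell; --autodetect infers the schema).
!bq load --source_format=NEWLINE_DELIMITED_JSON --autodetect \
    my_dataset.logs gs://my-bucket/path/*.json

# Pull only the rows/columns you actually need into pandas
# (pd.read_gbq requires the pandas-gbq package).
import pandas as pd

df = pd.read_gbq('SELECT * FROM my_dataset.logs',
                 project_id='my-project', dialect='standard')

This keeps the 40GB in BigQuery, so you can filter or aggregate server-side instead of loading everything into memory.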

CasualT