
I have ~40GB of data split across several JSON files stored in Google Cloud Storage. I want to read all of this data into a dataframe in Datalab to perform some analysis.

So far, I have been reading the data as shown in the Datalab tutorials, but it takes 1-2 hours. Any suggestions for reading it more efficiently?

My code looks like this:

import datalab.storage as storage
import pandas as pd
from io import StringIO

def export_data(log_name, path):
    # Download one JSON object from the bucket as a string.
    log_path = path + log_name + '.json'
    data = storage.Item('my-bucket', log_path).read_from()
    return data

file_names = ['file_01', .., 'file_60']
path = 'my-bucket/path'

# Download each file, parse it, and concatenate everything into one dataframe.
dataset = [export_data(file_name, path) for file_name in file_names]
data_frames = [pd.read_json(StringIO(data)) for data in dataset]
df = pd.concat(data_frames, axis=0)
Ric cez

2 Answers


It might be more efficient to download the files locally first by running "gsutil -m cp". Datalab maps your host volume to "/content", so anything saved under "/content" will be persisted. Then load the local files into your data frames.

storage.Item's read_from() calls the storage API to download a single object at a time, whereas "gsutil -m" downloads several objects in parallel.

At the very least, downloading the files first splits the work into separate download and load stages, so you'll have a better idea of which part is slow.
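For illustration, here is a minimal sketch of that two-stage approach (the gs://my-bucket/path prefix comes from the question; /content/logs is just a hypothetical local directory):

# Stage 1: download everything in parallel into the persisted /content volume
# (run these as shell commands in a Datalab cell).
!mkdir -p /content/logs
!gsutil -m cp gs://my-bucket/path/*.json /content/logs/

# Stage 2: load the local copies and combine them into one dataframe.
import glob
import pandas as pd

local_files = glob.glob('/content/logs/*.json')
data_frames = [pd.read_json(f) for f in local_files]
df = pd.concat(data_frames, axis=0, ignore_index=True)

Timing the two stages separately should make it clear whether the bottleneck is the download or the JSON parsing.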

Bradley Jiang

I would consider loading the data into BigQuery and then querying against that.
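As a rough sketch of that approach (assuming the files are newline-delimited JSON and that a BigQuery dataset, here called my_dataset, already exists; the table and project names are placeholders):

# Load all the JSON files from the bucket into a BigQuery table
# (shell command in a Datalab cell; --autodetect infers the schema).
!bq load --source_format=NEWLINE_DELIMITED_JSON --autodetect \
    my_dataset.logs gs://my-bucket/path/*.json

# Pull only the rows/columns you actually need into pandas
# (pd.read_gbq requires the pandas-gbq package).
import pandas as pd

df = pd.read_gbq('SELECT * FROM my_dataset.logs',
                 project_id='my-project', dialect='standard')

This keeps the 40GB in BigQuery, so you can filter or aggregate server-side instead of loading everything into memory.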

CasualT