I have ~40 GB of data split across several JSON files stored in Google Cloud Storage, and I want to read all of it into a single pandas DataFrame in Datalab to perform some analysis.
So far I have been reading the data the way the Datalab tutorials show, but it takes 1-2 hours. Any suggestions on how to read it more efficiently?
My code currently looks like this:
import pandas as pd
from io import StringIO
import datalab.storage as storage  # Datalab's storage API

def export_data(log_name, path):
    # Download one JSON object from the bucket and return its contents as a string.
    log_path = path + log_name + '.json'
    data = storage.Item('my-bucket', log_path).read_from()
    return data

file_names = ['file_01', ..., 'file_60']  # 60 files in total
path = 'my-bucket/path'

# Download every file, parse each one into a DataFrame, then concatenate them all.
dataset = [export_data(file_name, path) for file_name in file_names]
data_frames = [pd.read_json(StringIO(data)) for data in dataset]
df = pd.concat(data_frames, axis=0)
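
The reads above run strictly one after another, so I suspect the downloads rather than the parsing are the bottleneck. To illustrate the kind of change I am wondering about, here is a rough sketch that fans the downloads out over a thread pool; it assumes the same legacy datalab.storage API and file layout as above, and I don't know whether this is the right direction, which is part of what I'm asking:

from concurrent.futures import ThreadPoolExecutor
from io import StringIO

import pandas as pd
import datalab.storage as storage  # same legacy Datalab storage API as above

def read_one(log_name, path='my-bucket/path'):
    # Download one JSON object and parse it straight into a DataFrame.
    log_path = path + log_name + '.json'
    raw = storage.Item('my-bucket', log_path).read_from()
    return pd.read_json(StringIO(raw))

file_names = ['file_01', ..., 'file_60']  # same (elided) list as above

# Download and parse several objects concurrently; the final concat is unchanged.
with ThreadPoolExecutor(max_workers=8) as pool:
    data_frames = list(pool.map(read_one, file_names))

df = pd.concat(data_frames, axis=0)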