
I have a 1.2 TB dataset: a directory of several nested folders hosted at a public URL (Li et al. 2021, a synthetic building operation dataset). [Screenshot of the directory structure.]

No matter how I request the data with urllib, the Google Colab runtime crashes. I also changed the runtime type, and it didn't help. Are there any methods I can use to read this directory without purchasing Colab Pro?

I have used two methods to get the data.

import urllib.request
url = 'https://oedi-data-lake.s3-us-west-2.amazonaws.com/building_synthetic_dataset/A_Synthetic_Building_Operation_Dataset.h5'
with urllib.request.urlopen(url) as response:
    html = response.read()

And

import urllib.request
import xarray as xr
import io

url = 'https://oedi-data-lake.s3-us-west-2.amazonaws.com/building_synthetic_dataset/A_Synthetic_Building_Operation_Dataset.h5'

req = urllib.request.Request(url)

with urllib.request.urlopen(req) as resp:
    ds = xr.open_dataset(io.BytesIO(resp.read()))
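
Both attempts call resp.read(), which tries to buffer the entire object in RAM before anything is parsed, so the Colab runtime runs out of memory and crashes. As a sanity check (assuming the S3 endpoint reports a Content-Length header, which it normally does), the object's size can be inspected with a HEAD request before trying to fetch it:

import urllib.request

url = 'https://oedi-data-lake.s3-us-west-2.amazonaws.com/building_synthetic_dataset/A_Synthetic_Building_Operation_Dataset.h5'

# HEAD request: fetch only the response headers, not the body
req = urllib.request.Request(url, method='HEAD')
with urllib.request.urlopen(req) as resp:
    size_bytes = int(resp.headers['Content-Length'])

print(f'object size: {size_bytes / 1e9:.1f} GB')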

1 Answer


I suggest using the requests module to stream the download to a file on disk instead of reading the whole response into memory:

import requests
import io

url = 'https://oedi-data-lake.s3-us-west-2.amazonaws.com/building_synthetic_dataset/A_Synthetic_Building_Operation_Dataset.h5'

with requests.Session() as session:
    # stream=True keeps the body out of memory; it is read chunk by chunk
    r = session.get(url, stream=True)
    r.raise_for_status()
    with open('dataset.h5', 'wb') as h5_file:
        # write the file to disk one buffer-sized chunk at a time
        for chunk in r.iter_content(chunk_size=io.DEFAULT_BUFFER_SIZE):
            h5_file.write(chunk)
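
Once the file is on disk, it can be opened lazily instead of being loaded into memory. Below is a minimal sketch with h5py; the file's internal group and dataset names are unknown to me, so the commented slicing lines are only placeholders showing that you can read a small subset at a time:

import h5py

# open the HDF5 file lazily: data is only read when a dataset is sliced
with h5py.File('dataset.h5', 'r') as f:
    f.visit(print)  # print the names of all groups/datasets in the file
    # example of reading just a small slice of one dataset into memory:
    # first = list(f.keys())[0]
    # sample = f[first][:100]

Note that the full 1.2 TB directory will not fit on a standard Colab disk either, so you will likely still need to download and process one file or subset at a time.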